Building a Highly Available Swarm Cluster on OpenStack

Deploying a robust, fault-tolerant cluster isn't just about spinning up instances; it's about engineering resilience at every layer. In this log, I'm documenting the architecture and configuration of a production-grade Docker Swarm cluster running on OpenStack.

The Architecture

Our foundation requires at least three manager nodes to maintain quorum. Losing a single manager should not degrade the cluster state. Worker nodes scale horizontally across different availability zones to prevent a single point of failure.

The manager tier handles cluster state, scheduling, and service reconciliation. Workers are provisioned in groups per zone, so a complete AZ outage only removes a fraction of total capacity.

Network Configuration

Proper overlay networking is critical. We utilize Traefik as the ingress controller, routing traffic efficiently to the appropriate services. The configuration is declared in a docker-compose.yml deployed as a stack.

version: "3.8"

services:
  traefik:
    image: traefik:v2.9
    command:
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.swarmMode=true"
      - "--entrypoints.web.address=:80"
    ports:
      - "80:80"
      - "8080:8080"
    networks:
      - proxy
    deploy:
      placement:
        constraints: [node.role == manager]

networks:
  proxy:
    external: true

The overlay network ensures containers on separate physical hosts can communicate as if they are on the same LAN, while Traefik's Swarm mode provider dynamically discovers services as they are deployed.

Automated Scaling

Using swarmctl alongside custom Prometheus metrics allows us to dynamically provision worker nodes when CPU load exceeds our baseline threshold.

$ swarmctl cluster update \
    --autoscale-cpu-threshold=75 \
    --autoscale-max-nodes=10 \
    --cluster-name=prod-swarm-alpha

When the system detects a sustained spike, the OpenStack API is triggered to boot a new instance, execute a cloud-init script containing the swarm join token, and register the node to the cluster within minutes. This ensures our services remain responsive even under unpredictable loads. We also have alerting tied into our Slack channels for real-time visibility.

What's Next

The next iteration involves integrating Consul for service discovery and moving secrets management to Vault. Both bring operational maturity the cluster currently lacks — expect a follow-up post once the migration settles.