When Packets Disappear: Debugging an MTU Mismatch in a Hybrid OpenStack Docker Swarm
A real-world troubleshooting story about silent packet drops, floating IPs, and why the obvious fix is not always the right one.
Background: Why We Have a Hybrid Setup
Our infrastructure runs on a hybrid model — some nodes live inside an OpenStack private cloud, and some live on bare metal servers outside of it. This was not an accident or an oversight. It was a deliberate architectural decision born out of caution.
OpenStack is powerful, but like any cloud platform, it can have outages, networking hiccups, or capacity issues — especially when it is still being evaluated in production for the first time. We wanted a safety net. If the OpenStack environment had a bad day, our core orchestration layer would still be standing.
So our Docker Swarm cluster looks like this:
Outside OpenStack (bare metal):
- 2 Swarm Manager nodes
- Keepalived (for VIP failover)
- HAProxy (load balancer)
- Traefik (reverse proxy and service router)
- 1 Data node (persistent storage workloads)
- 1 Monitoring node (observability stack)
Inside OpenStack:
- 1 Production node (email API and other prod services)
- 1 Test node (test workloads)
- ASG (Auto Scaling Group) worker nodes (auto-scaled up and down based on CPU and memory)
Traffic flows like this for any external request:
Internet → VIP → HAProxy → Traefik → Service container
Services talk to each other using their Traefik URLs, for example https://email-service/api/send. Traefik handles the routing based on labels attached to each Docker service.
This setup has been working well. Until one day, emails stopped sending.
The Problem: Emails Timing Out Silently
A service running on the test node was making POST requests to the email service. The email service is hosted on the production node inside OpenStack. The requests were timing out on the calling side, and the email service was not receiving anything at all — no logs, no errors, nothing. It was as if the requests were vanishing into thin air.
Meanwhile, the monitoring node — which also sends emails (a daily digest with AI-generated summaries and HTML content) — was working perfectly fine.
That asymmetry was the first interesting clue.
First Hypothesis: Docker Overlay MTU Problem
The initial instinct was a classic Docker Swarm networking issue. When Docker creates overlay networks (the virtual networks that let containers on different physical nodes talk to each other), it assumes the underlying network can carry standard Ethernet frames of 1500 bytes.
But OpenStack's virtual network adds its own wrapper around every packet. Technologies like VXLAN or Geneve are used to tunnel traffic between virtual machines, and that tunnelling eats into the available space.
Think of it like putting a letter inside an envelope, and then putting that envelope inside a bigger envelope to mail it. The outer envelope takes up space, so the inner letter has to be smaller. In numbers:
- Standard Ethernet MTU: 1500 bytes
- VXLAN overhead: ~50 bytes
- Effective MTU on OpenStack: ~1450 bytes
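If Docker thinks it can send 1500-byte packets but the network can only carry 1450, oversized packets that cannot be fragmented get dropped. In principle the dropping hop should answer with an ICMP "fragmentation needed" message, but nothing of the sort reached our containers. No error. No ICMP. Just gone.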
This is called an MTU mismatch, and it is a well-known pain point in containerised environments running on top of virtualised networks.
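The ~50-byte figure is the sum of the headers a VXLAN tunnel wraps around each packet. A minimal sketch of that arithmetic, assuming VXLAN over IPv4 with no VLAN tag on the inner frame (the function and variable names are illustrative):

```shell
#!/bin/sh
# Per-packet overhead of VXLAN over IPv4 (assumes no inner VLAN tag).
OUTER_IPV4=20   # outer IPv4 header
OUTER_UDP=8     # outer UDP header
VXLAN_HDR=8     # VXLAN header
INNER_ETH=14    # inner Ethernet header carried inside the tunnel

vxlan_overhead() {
    echo $(( OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH ))
}

# Effective MTU seen by a VM for a given underlay MTU ($1).
effective_mtu() {
    echo $(( $1 - $(vxlan_overhead) ))
}

echo "VXLAN overhead: $(vxlan_overhead) bytes"
echo "Effective MTU on a 1500-byte underlay: $(effective_mtu 1500)"
```

Geneve carries variable-length options, so its overhead can be larger than this fixed VXLAN figure.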
The standard fix for this is:
- Tell Docker to use a smaller MTU in /etc/docker/daemon.json
- Recreate the overlay networks with the correct MTU
- Recreate the ingress network
{
"mtu": 1400
}
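As far as the Docker documentation describes it, the mtu key in daemon.json applies to the default bridge network, while an overlay network takes its MTU from the com.docker.network.driver.mtu driver option when it is created. The sketch below only prints the commands that recreating one overlay network would involve; the network name my-overlay is a placeholder, not one of our networks.

```shell
#!/bin/sh
# Sketch: print (do not run) the commands that would recreate an overlay
# network with a smaller MTU. NETWORK and MTU are placeholder values.
NETWORK="${NETWORK:-my-overlay}"
MTU="${MTU:-1400}"

overlay_recreate_cmds() {
    # $1 = network name, $2 = MTU
    echo "docker network rm $1"
    echo "docker network create --driver overlay --opt com.docker.network.driver.mtu=$2 $1"
}

overlay_recreate_cmds "$NETWORK" "$MTU"
```

Every service attached to the network has to be detached or removed before the network can be deleted. The ingress network is recreated in a similar way, with the additional --ingress flag on docker network create.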
But this approach had a serious problem for our environment.
Why the Standard Fix Was Not Viable
We have a lot of services deployed across many nodes. Recreating the Docker ingress network requires all nodes to temporarily lose port routing. Recreating overlay networks means services get restarted. With dozens of services and several nodes, this would mean significant downtime.
We needed to think more carefully before touching anything.
Digging Deeper: A Ping Test Reveals the Truth
Before making any changes, we ran a diagnostic test on the test node — the one where emails were failing:
docker run --rm alpine ping -c 5 -s 1200 8.8.8.8
Result: 5/5 packets received. Fine.
docker run --rm alpine ping -c 5 -s 1472 8.8.8.8
Result: 0/5 packets received. 100% loss.
This is significant. The -s flag sets the ICMP payload size. Adding 8 bytes for the ICMP header and 20 bytes for the IP header, a payload of 1472 bytes makes a total packet size of exactly 1500 bytes, the standard Ethernet MTU.
So anything at or near a full-size Ethernet frame was being completely dropped when leaving the OpenStack node. This confirmed there was an MTU problem, but the question was: where exactly was it happening, and why was it only affecting the test node and not the monitoring node?
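The two ping sizes above bracket the problem but do not pin down the exact limit. One way to narrow it is a binary search over payload sizes. This is a sketch, assuming iputils ping on the host (its -M do option sets the Don't Fragment bit; the busybox ping in the alpine image may not support it). The function names are illustrative.

```shell
#!/bin/sh
# ICMP payload that produces a packet of exactly MTU bytes:
# 20-byte IPv4 header + 8-byte ICMP header = 28 bytes of overhead.
payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Probe used by the search below. Assumes iputils ping:
# -M do sets Don't Fragment, -W 1 waits one second for a reply.
probe_ping() {
    # $1 = payload size, $2 = target host
    ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1
}

# Binary search for the largest payload the probe accepts.
# $1 = probe function, $2 = target, $3 = size known to work, $4 = size known to fail
largest_payload() {
    probe=$1 target=$2 lo=$3 hi=$4
    while [ $(( hi - lo )) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$probe" "$mid" "$target"; then lo=$mid; else hi=$mid; fi
    done
    echo "$lo"
}

echo "Payload for a 1500-byte packet: $(payload_for_mtu 1500)"
echo "Payload for a 1450-byte packet: $(payload_for_mtu 1450)"
```

On the affected node, largest_payload probe_ping 8.8.8.8 1200 1472 would report the largest payload that still gets a reply; adding 28 gives the path MTU as seen from that host.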
The Key Asymmetry: Inside vs Outside OpenStack
Let us look at what was different between the working and failing cases:
| Source | Destination | Result |
|---|---|---|
| Monitoring node (outside OpenStack) | Email service (inside OpenStack) | ✅ Works |
| Old test node (outside OpenStack) | Email service (inside OpenStack) | ✅ Works |
| New test node (inside OpenStack) | Email service (inside OpenStack) | ❌ Fails |
The deciding variable is not the payload size. The monitoring node sends large HTML emails and they go through fine. The deciding variable is where the request originates. Everything originating from inside OpenStack to the email service was failing.
This pointed away from a Docker overlay problem and toward a network path problem specific to OpenStack.
The Real Traffic Path: A Surprising Discovery
Here is where the architecture revealed something unexpected. The OpenStack nodes join the Swarm like this:
docker swarm join --token "$TOKEN" "$MANAGER" \
--advertise-addr "$FLOATING_IP" \
--listen-addr 0.0.0.0:2377
The --advertise-addr is set to the node's floating IP — its external, publicly routable IP address. This was necessary because the Swarm managers live outside OpenStack, and the only way for an OpenStack node to reach them is via the external network.
But this has a side effect. Every other node in the Swarm — including other OpenStack nodes on the same internal subnet — now thinks the only way to reach that node is via its floating IP. So when the test node talks to the email service, even though they are on the same internal OpenStack network, the traffic takes this path:
Test node (inside OpenStack)
→ exits via floating IP through OpenStack router (NAT)
→ hits external network
→ HAProxy
→ Traefik
→ re-enters OpenStack via prod node floating IP (NAT)
→ Email service container
Two nodes sitting on the same internal subnet are taking a round trip through the external network to talk to each other. NAT itself only rewrites addresses, but each time a packet crosses the OpenStack network boundary it has to fit through the 1450-byte tenant network again.
The monitoring node works because it is already outside OpenStack. Its traffic only crosses the boundary once — going in. No double NAT, less encapsulation pressure on each packet.
Confirming With Interface Inspection
Running ip link show on the test node made the mismatch immediately visible:
ens3: MTU 1450 ← OpenStack network interface
docker0: MTU 1500 ← Docker bridge, unaware
docker_gwbridge: MTU 1500 ← Docker gateway bridge, also unaware
veth*: MTU 1500 ← All container interfaces, also unaware
The host network interface ens3 is correctly at 1450 — OpenStack set it that way to account for VXLAN overhead. But every Docker interface on the same node is at 1500. Docker was never told about the OpenStack constraint.
So when a container builds a packet, it thinks it has 1500 bytes to work with. That packet travels through the veth interface, through the Docker gateway bridge, and then hits ens3 — which can only carry 1450 bytes. The oversized packet hits the wall and is silently dropped.
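The same comparison can be scripted so it does not depend on reading ip link output by eye. A minimal sketch that reads MTUs from sysfs and lists every interface larger than the uplink; the function name is illustrative, and ens3 is simply the uplink name on our nodes:

```shell
#!/bin/sh
# List interfaces whose MTU is larger than the uplink's MTU.
# $1 = uplink interface name (ens3 in this article)
# $2 = sysfs network directory (defaults to /sys/class/net)
oversized_interfaces() {
    uplink=$1
    netdir=${2:-/sys/class/net}
    uplink_mtu=$(cat "$netdir/$uplink/mtu") || return 1
    for dev in "$netdir"/*; do
        name=${dev##*/}
        # Skip the uplink itself and loopback (whose MTU is always huge).
        if [ "$name" = "$uplink" ] || [ "$name" = "lo" ]; then continue; fi
        # Skip entries that are not network interfaces.
        [ -r "$dev/mtu" ] || continue
        mtu=$(cat "$dev/mtu")
        if [ "$mtu" -gt "$uplink_mtu" ]; then
            echo "$name $mtu > $uplink $uplink_mtu"
        fi
    done
}
```

Running oversized_interfaces ens3 on the test node should print one line per Docker interface still at 1500.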
The Iptables Fix That Worked
Before understanding all of this fully, an iptables rule was applied on the test node:
sudo iptables -t mangle -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
And the emails started going through immediately.
What Does This Actually Do?
To understand this fix, you need to know a little about how TCP connections work.
When two machines want to talk over TCP (the protocol used for HTTP, HTTPS, and most internet traffic), they start with a handshake. During this handshake, both sides announce the largest chunk of data they are willing to receive at once. This is called the Maximum Segment Size (MSS).
Think of it like two people agreeing on how many items to pass at once down a conveyor belt. If you agree on small batches, nothing gets dropped even if the belt has a narrow section somewhere in the middle.
The iptables rule intercepts the handshake packets of every forwarded TCP connection (the ones with the SYN flag set) and rewrites the MSS value they announce to something smaller. Each side then sizes its segments to the smaller announced value, and the entire connection uses smaller chunks from the start. The oversized packet problem never occurs because the data is broken into pieces that fit.
The --set-mss 1360 hardcodes the MSS to 1360 bytes. It works, but a smarter version uses --clamp-mss-to-pmtu instead:
iptables -t mangle -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
This tells the kernel to calculate the correct MSS automatically based on the actual outgoing interface MTU (1450 on ens3), rather than using a hardcoded value. If the network MTU ever changes, the rule adapts automatically.
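The relationship between MTU and MSS is simple arithmetic: for IPv4, the iptables documentation describes the clamp as path MTU minus 40 bytes (20 for the IP header, 20 for the TCP header). A small sketch with illustrative function names:

```shell
#!/bin/sh
# TCP MSS that fits in a given MTU over IPv4:
# MTU minus 20-byte IPv4 header minus 20-byte TCP header (no options).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Inverse: the smallest MTU that carries a full segment of a given MSS.
mtu_for_mss() {
    echo $(( $1 + 40 ))
}

echo "MSS clamped for ens3 (MTU 1450): $(mss_for_mtu 1450)"
echo "MTU implied by --set-mss 1360:   $(mtu_for_mss 1360)"
```

So the hardcoded 1360 corresponds to a 1400-byte MTU, which is more conservative than ens3 requires, while the clamp on a 1450-byte path would settle on 1410.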
Why This Fix and Not the Docker MTU Fix?
This is the important question. We could have:
- Changed daemon.json to set Docker MTU to 1400
- Recreated all overlay networks
- Recreated the ingress network
But that approach would have caused significant downtime across all services for a problem that only affects outbound TCP from OpenStack nodes. It is the right fix if your Docker overlay traffic between nodes is dropping. It is overkill — and risky — when the actual problem is a specific outbound path.
The iptables TCPMSS approach:
- Touches nothing else in the stack
- Requires no service restarts
- Requires no network recreation
- Only affects TCP SYN packets forwarded for containers on that node
- Is invisible to services and containers
We confirmed this by checking the iptables rule counters after applying it:
sudo iptables -t mangle -L FORWARD -n -v --line-numbers
Chain FORWARD (policy ACCEPT 5894K packets, 2970M bytes)
num pkts bytes target prot opt in out source destination
1 6 360 TCPMSS 6 -- * * 0.0.0.0/0 0.0.0.0/0 tcp flags:0x06/0x02 TCPMSS clamp to PMTU
Only 6 packets — just the email test traffic. Browser traffic serving the client app was not going through the rule at all. The fix was surgical.
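For repeated checks, the counter can be pulled out of the listing with awk. This is a sketch that assumes the column layout shown above (line number, packets, bytes, target); the function name is illustrative.

```shell
#!/bin/sh
# Sum the packet counters of all TCPMSS rules in an
# `iptables -L -n -v --line-numbers` listing read from stdin.
# Assumed columns: num pkts bytes target ...
tcpmss_packets() {
    awk '$4 == "TCPMSS" { total += $2 } END { print total + 0 }'
}
```

Typical use would be sudo iptables -t mangle -L FORWARD -n -v -x --line-numbers | tcpmss_packets. The -x option prints exact counters rather than abbreviated ones such as 5894K, which the sum would otherwise misread.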
Making It Persistent: The Manual Node
The iptables rule applied manually disappears on reboot. For the manually provisioned test node, the fix is:
# Remove the old hardcoded rule
sudo iptables -t mangle -D FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
# Add the smarter adaptive rule
sudo iptables -t mangle -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# Install persistence
sudo apt-get install -y iptables-persistent
sudo netfilter-persistent save
iptables-persistent saves the current rules to disk and restores them automatically on every boot.
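One caveat with -I is that running it twice inserts the rule twice. A hedged sketch of an idempotent variant, using the -C (check) option of iptables; the function name and the IPTABLES override are illustrative additions, not part of our original commands:

```shell
#!/bin/sh
# Add the TCPMSS clamp only if it is not already present.
# IPTABLES can be overridden (for example with a stub) for testing.
IPTABLES="${IPTABLES:-iptables}"

ensure_mss_clamp() {
    # -C checks whether an identical rule exists; it exits non-zero if not.
    if "$IPTABLES" -t mangle -C FORWARD -p tcp --tcp-flags SYN,RST SYN \
            -j TCPMSS --clamp-mss-to-pmtu 2>/dev/null; then
        echo "TCPMSS clamp already present"
    else
        "$IPTABLES" -t mangle -I FORWARD -p tcp --tcp-flags SYN,RST SYN \
            -j TCPMSS --clamp-mss-to-pmtu && echo "TCPMSS clamp added"
    fi
}
```

Calling ensure_mss_clamp as root before netfilter-persistent save keeps the saved ruleset free of duplicates, however many times the setup is rerun.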
Making It Automatic: The ASG Nodes
The bigger concern was the Auto Scaling Group. Our OpenStack ASG spins up new worker nodes automatically when load increases. Each new node is an OpenStack VM and would have the same MTU mismatch out of the box. If a service happened to land on a new ASG node and made outbound HTTP calls, it would silently fail — and we might not notice until something like an email timeout surfaced it.
The fix belongs in the user_data cloud-init script that runs on every new node at boot. In our Heat template, right after the Swarm join:
echo "Joining swarm at ${MANAGER} advertising ${FLOATING_IP}..."
docker swarm join --token "$TOKEN" "$MANAGER" \
--advertise-addr "$FLOATING_IP" \
--listen-addr 0.0.0.0:2377
echo "Swarm join complete."
# ── Fix MTU mismatch between Docker (1500) and OpenStack interface (1450) ──
echo "--- Applying MTU fix ---"
echo "Host interface MTU before fix:"
ip link show ens3 | grep mtu
echo "Applying TCPMSS clamp rule..."
iptables -t mangle -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
echo "TCPMSS rule applied."
echo "Verifying rule:"
iptables -t mangle -L FORWARD -n -v --line-numbers
echo "Installing iptables-persistent..."
DEBIAN_FRONTEND=noninteractive apt-get install -y iptables-persistent
echo "iptables-persistent installed."
echo "Saving rules..."
netfilter-persistent save
echo "Rules saved."
echo "--- MTU fix complete ---"
The DEBIAN_FRONTEND=noninteractive environment variable is important. Without it, apt-get install iptables-persistent will pause and wait for interactive input asking whether to save current IPv4 and IPv6 rules, which cannot be answered in an automated script. Setting it makes the installer accept the default answers instead of prompting.
Every new ASG node now gets the fix automatically at boot, and the log at /var/log/swarm-setup.log will contain a full trace of the MTU fix running, so you can verify it without SSHing into the node.
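A final check worth scripting is whether the rule actually made it into the saved ruleset, since that is what survives a reboot. The sketch below assumes the Debian/Ubuntu default location /etc/iptables/rules.v4 used by iptables-persistent; the function name is illustrative.

```shell
#!/bin/sh
# Succeed if a saved ruleset contains a TCPMSS clamp rule on a FORWARD chain.
# $1 = rules file (defaults to /etc/iptables/rules.v4, the assumed
#      iptables-persistent location on Debian/Ubuntu)
has_saved_clamp() {
    grep -Eq -- '^-A FORWARD .*-j TCPMSS --clamp-mss-to-pmtu' \
        "${1:-/etc/iptables/rules.v4}"
}
```

In the provisioning script this could be used as has_saved_clamp || echo "WARNING: TCPMSS clamp not persisted", so a missing rule shows up in the setup log.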
What We Did Not Need to Do
It is worth being explicit about this. The following changes that are commonly suggested for MTU problems in Docker Swarm were not needed for our specific situation:
- ❌ Changing daemon.json MTU
- ❌ Recreating overlay networks
- ❌ Recreating the ingress network
- ❌ Changing host interface MTU with ip link set
- ❌ Draining any nodes
- ❌ Any service restarts
The reason is that our problem was not in the Docker overlay between nodes. It was in outbound TCP from containers on OpenStack nodes going through a double-NAT path. The TCPMSS clamp fixed it at exactly the right layer.
Lessons Learned
1. Trace the actual traffic path before deciding where to fix. MTU problems in hybrid environments are rarely a single-layer issue. Our traffic was going: container → Docker gateway bridge → OpenStack interface → external network → HAProxy → Traefik → back into OpenStack. Understanding that path was what led us to the right fix.
2. Asymmetry in failures is a signal, not noise. The fact that the monitoring node worked but the test node did not was the most important clue. Same destination, same service, different result. That asymmetry pointed directly at the source node's network path being different — which led us to the floating IP and double-NAT discovery.
3. The standard fix is not always the right fix. The Docker daemon MTU approach is correct for overlay network MTU mismatches. But applying it blindly would have caused unnecessary downtime and not addressed the root cause.
4. Bake infrastructure fixes into provisioning, not just running nodes. Fixing the running node is only half the job. If your ASG spins up ten new nodes tomorrow and they all have the same problem, you will be chasing the same fire. The fix belongs in the provisioning script.
5. Silent drops are the hardest bugs. No error. No ICMP response. No log entry on the receiving side. Just a timeout on the sender. These are the bugs that can send you chasing application code, DNS, TLS, or service configuration for hours before you think to check MTU.
Summary
| Question | Answer |
|---|---|
| What we thought the problem was | Docker overlay MTU mismatch |
| What the problem actually was | Docker (MTU 1500) vs OpenStack interface (MTU 1450) mismatch on outbound TCP from OpenStack nodes |
| Why it only affected OpenStack nodes | Outside nodes have real 1500 MTU interfaces with no mismatch |
| Why monitoring worked but test node failed | Monitoring node is outside OpenStack, only crosses the boundary once |
| The fix | TCPMSS iptables clamp on each OpenStack node |
| Where the fix lives | Manually on existing nodes, baked into Heat user_data for ASG nodes |
| What we avoided | Any Docker network changes, downtime, service restarts |
The infrastructure is hybrid by design and will stay that way until OpenStack proves itself reliable enough to trust fully. In the meantime, understanding exactly how packets move through a mixed environment — and where they can silently disappear — is what keeps things running.