← Return to Selected Works

Legacy Monitor

AF-SOUTH-1SHA-256: J0K1L2ARCHIVED

Legacy Monitor was the first attempt at building an observability layer for the cluster. It was a Python script that ran as a cron job, queried the OpenStack API, and wrote metrics to a flat CSV file.

Why It Was Retired

The architecture was fundamentally incompatible with scaling. Every metric query was synchronous and blocking. Adding a new hypervisor node increased the cron runtime linearly. At 40 nodes it took longer to collect metrics than the cron interval — meaning metrics were always stale.

The final straw was a 6-hour outage that the monitor completely failed to surface because it had silently timed out on an unresponsive node while marking it healthy.

What It Taught Us

The failure modes of Legacy Monitor directly shaped Production Engine's design: async metric collection from the start, dead-man's-switch alerts for collector health, and immutable audit logs for every state change.

Status

Archived. The codebase is preserved for historical reference but the binary is not running. Replaced by Production Engine in Q2 2025.