Operations & Observability

Operational excellence hinges on reliable deployment workflows, comprehensive telemetry, and repeatable recovery procedures. This chapter documents the tooling, metrics, logging, and runbooks that support Cirrus CDN in production.

Deployment Workflows

Local Development

just up – Builds and launches the Docker Compose stack.
just down / just down-no-volumes – Tears down containers with or without volume removal.
just pytest – Executes backend tests (uv run pytest -q).
just quicktest – Runs a filtered suite excluding long-running ACME/DNS tests.
just fmt – Formats Python codebase (autoflake, isort, black).

Production Deployment

just deploy – Invokes Ansible playbooks (ansible/), using INVENTORY and PLAYBOOK environment variables loaded via dotenv.
Playbooks should orchestrate infrastructure provisioning, secret injection, and configuration templating consistent with this white paper.

Runtime Supervision

Process Supervision – Use a service manager or orchestrator (e.g., systemd or Kubernetes) to supervise the API, worker, beat, OpenResty, and NSD processes. Docker Compose is suited to local development only.
Health Checks – Monitor:
- API /healthz (verifies Redis availability).
- OpenResty http://<node>:9145/healthz (used by Celery health checks).
- Celery worker liveness (e.g., ping tasks).
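The HTTP checks above can be polled by a small external probe. The following is a minimal sketch using only the Python standard library; the API address (127.0.0.1:8000) and the target names are assumptions for illustration, while port 9145 and the /healthz paths come from this chapter.

```python
import http.client
import urllib.request

# Hypothetical targets: the API host/port is an assumption, not a documented value.
TARGETS = {
    "api": "http://127.0.0.1:8000/healthz",
    "openresty": "http://127.0.0.1:9145/healthz",
}


def check_http(url: str, timeout: float = 2.0) -> bool:
    """Return True when the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections, and timeouts are all OSError.
        return False


def check_all(targets: dict, probe=check_http) -> dict:
    """Probe every target and map its name to a healthy/unhealthy flag."""
    return {name: probe(url) for name, url in targets.items()}
```

Calling check_all(TARGETS) returns a dictionary such as {"api": True, "openresty": False}, which a supervisor or cron job could turn into an alert. The probe parameter exists so the function can be exercised without network access.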

Metrics & Dashboards

Prometheus (prometheus/prometheus.dev.yml) scrapes OpenResty metrics every 5 seconds at 127.0.0.1:9145.
Key Metrics:
- nginx_http_requests_total{host,status} – Request rates by host/status.
- nginx_http_request_duration_seconds – Latency histogram.
- nginx_cache_status_total{status} – Cache behavior (HIT/MISS/STALE/BYPASS).
- nginx_upstream_errors_total, nginx_upstream_timeouts_total – Backend health.
- nginx_upstream_response_seconds – Upstream RTT.
- nginx_ssl_handshake_errors_total{phase} – TLS handshake issues.
Grafana – Ship Grafana dashboards (provisioning under grafana/) to visualize cache efficiency, origin health, and request trends. Enable auth per organizational policy.
Celery Metrics – Currently absent; recommended enhancements include emitting task durations via StatsD/Prometheus exporters (see the Appendices).

Sample Dashboard (Reference Layout)

Panels map directly to Lua/NGINX metrics initialized in openresty/conf/init_worker.lua and updated by log_metrics.lua. Add recording rules for p95/p99 latency to stabilize panels and alerts.
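To make clear what a p95/p99 recording rule precomputes, the sketch below estimates a quantile from cumulative histogram buckets by linear interpolation within the matching bucket, which is the approach Prometheus's histogram_quantile takes for classic histograms. The bucket bounds and counts are invented sample data, not values exported by Cirrus CDN.

```python
import math

# Invented sample: (upper bound in seconds, cumulative request count).
SAMPLE_BUCKETS = [(0.05, 50), (0.1, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because the estimate can never be finer than the bucket boundaries, bucket layout determines how precise the p95/p99 panels can be.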

Logging

API & Workers – Log to stdout/stderr. Capture acme_* events, errors from Redis, and zone rebuilds.
OpenResty Access & Error Logs – Emitted to stdout/stderr only and forwarded by the Loki Docker logging driver; apply retention and privacy controls at the aggregator.
Celery Logs – cirrus-worker and cirrus-beat containers log to stdout; integrate with log aggregation (ELK, Loki) in production.

Log Ingestion (Integration Examples)

Query Loki for OpenResty streams and, if needed, export to downstream systems via LogQL or the Loki API.
Enrich with node labels and environment tags in Loki (via external labels) for BI queries.
For privacy: redact query strings and sanitize headers at the shipper.
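As an illustration of the privacy step, the sketch below redacts query strings and masks sensitive headers before a record leaves the node. It is written in Python for clarity; a real shipper would apply the same logic in its own pipeline language. The header list and the REDACTED placeholder are assumptions, not Cirrus CDN settings.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of headers treated as sensitive; adjust to local policy.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}


def redact_url(url: str) -> str:
    """Replace any query string with a placeholder and drop the fragment."""
    parts = urlsplit(url)
    query = "REDACTED" if parts.query else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def sanitize_headers(headers: dict) -> dict:
    """Mask the values of sensitive headers, matching names case-insensitively."""
    return {
        name: ("REDACTED" if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
```

Keeping the placeholder in place of the query string, rather than removing it, preserves the fact that a query was present, which can still matter when debugging cache keys.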

Alerting Recommendations

Signal	Alert Condition	Response
Cache hit ratio drop	Sudden drop below threshold (e.g., less than 40%)	Inspect upstream availability, caching rules.
Upstream errors	Spike in `nginx_upstream_errors_total`	Check origin health, network latency.
ACME failure	`acme_fail` log or `cdn:acme:{domain}` status `failed`	Investigate DNS alignment, acme-dns reachability.
Node deactivation	Health task reports `down` state	Validate node; optionally disable or remove via `/api/v1/nodes`.
Prometheus scrape failure	Missing metrics from a node	Confirm OpenResty health endpoint availability.
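The cache hit ratio condition in the table can be evaluated from two successive samples of nginx_cache_status_total. The sketch below is one possible formulation, assuming the counters have been collected into dictionaries keyed by status; the 40% threshold mirrors the example in the table, and counter resets are not handled.

```python
from typing import Optional


def cache_hit_ratio(counts: dict) -> Optional[float]:
    """Return HIT / total over all statuses, or None when there was no traffic."""
    total = sum(counts.values())
    if total == 0:
        return None
    return counts.get("HIT", 0) / total


def should_alert(before: dict, after: dict, threshold: float = 0.40) -> bool:
    """Alert when the hit ratio over the sampling window is below the threshold."""
    # Counters are cumulative, so the window's traffic is the per-status difference.
    window = {status: after[status] - before.get(status, 0) for status in after}
    ratio = cache_hit_ratio(window)
    return ratio is not None and ratio < threshold
```

Whether BYPASS requests belong in the denominator is a policy choice; this sketch counts every status, so heavy bypass traffic lowers the reported ratio.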

Alert–Response Workflow

Backup & Recovery

Redis Persistence – AOF (--appendonly yes) persists writes across restarts; how much recent data can be lost in a crash depends on the appendfsync policy. Implement backups (RDB snapshots, managed service backups) and test restore procedures regularly.
Certificates – Since certificates reside in Redis, backups capture them automatically. Plan for secure storage and rotation.
DNS Zone State – Recomputed from Redis; no additional backup required if Redis is intact.
Configuration – Infrastructure-as-code (Ansible, Dockerfiles) should be version-controlled; document manual adjustments.

Troubleshooting Playbooks

Domain Returns 404 – Check that cdn:dom:{domain} exists; inspect OpenResty logs for "router: no conf for host". If the config exists, ensure the origins array is populated.
TLS Handshake Failure – Ensure cdn:cert:{domain} contains valid PEM entries; review ssl_loader logs for parse/set failures.
ACME Issuance Stuck – Inspect cdn:acme:lock:{domain} and cdn:acme:task:{domain}. If either key lingers beyond its TTL, clear the keys and requeue issuance. Also verify the _acme-challenge CNAME.
Node Missing from DNS – Confirm active flag is "1" in cdn:node:{id}; review health check logs for failures.
Metrics Missing – Validate NGX_METRICS_ALLOW includes Prometheus source; ensure port 9145 reachable; inspect OpenResty error log for Lua errors.
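For the ACME playbook, the state of the lock and task keys can be read from their TTLs before anything is cleared. This is a minimal sketch, assuming a client object that exposes ttl(key) with Redis semantics (as redis-py's redis.Redis does): -2 means the key is missing and -1 means it has no expiry. The function names are hypothetical; the key patterns come from the playbook above.

```python
def classify_ttl(ttl: int) -> str:
    """Translate a Redis TTL reply into a readable state."""
    if ttl == -2:
        return "missing"
    if ttl == -1:
        return "no-expiry"  # will never expire on its own; candidate for manual clearing
    return f"expires-in-{ttl}s"


def inspect_acme_keys(client, domain: str) -> dict:
    """Report the TTL state of the ACME lock and task keys for one domain."""
    keys = [f"cdn:acme:lock:{domain}", f"cdn:acme:task:{domain}"]
    return {key: classify_ttl(client.ttl(key)) for key in keys}
```

The sketch only reports state. Deleting keys and requeueing issuance are left as explicit operator actions, since clearing a lock that is still held by a running task could start a second issuance in parallel.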

Scale Parameters & SLOs (Reference)

These values are environment-dependent; validate with load testing in your target footprint.

Concurrent connections per node: Approximately worker_processes × worker_connections (defaults: auto × 1024). Increase NGX_WORKER_CONNECTIONS to raise the limit.
QPS per node (cached traffic): Primarily CPU-bound; with cache HITs, thousands to tens of thousands of QPS are typical on modern 4–8 vCPU hosts.
Cache hit ratio (steady state): Target ≥ 80% for cacheable workloads; investigate rule scope if below 50%.
Latency reduction: Cache HITs should eliminate origin RTT; p95 latency should approach network RTT + Nginx processing (< 20–40 ms on LAN).
TLS handshakes: Keep error rates in nginx_ssl_handshake_errors_total near zero; spikes indicate cert/key issues or client incompatibilities.
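As a worked example of the connection estimate above: NGINX counts upstream connections against worker_connections as well, so a request that misses the cache can hold two connections at once. The sketch below computes both bounds; the function name and the halving rule for fully proxied traffic are simplifying assumptions for capacity planning, not measured limits.

```python
def max_client_connections(
    worker_processes: int,
    worker_connections: int = 1024,
    proxied: bool = True,
) -> int:
    """Estimate the concurrent client connections one node can hold.

    With proxied=True every client connection is assumed to need one upstream
    connection too (worst case, all cache MISSes), which halves the budget.
    """
    total = worker_processes * worker_connections
    return total // 2 if proxied else total
```

On a 4-worker node with the default of 1024, this gives 4096 connections when every request is served from cache and 2048 when every request goes to the origin; real traffic falls between the two.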

Testing Strategy

Automated tests under control-plane/tests/ cover API behavior, ACME flows, and DNS health scenarios. Run just fresh-test before major releases to ensure a clean environment.
Integration tests rely on the full Docker stack (just up + pytest), exercising acme-dns and OpenResty interactions.

Robust observability and operational hygiene keep Cirrus CDN resilient. The white paper concludes with the Appendices for quick lookup.
