Operations & Observability

Operational excellence hinges on reliable deployment workflows, comprehensive telemetry, and repeatable recovery procedures. This chapter documents the tooling, metrics, logging, and runbooks that support Cirrus CDN in production.

Deployment Workflows

Local Development

  • just up – Builds and launches the Docker Compose stack.
  • just down / just down-no-volumes – Tears down containers with or without volume removal.
  • just pytest – Executes backend tests (uv run pytest -q).
  • just quicktest – Runs a filtered suite excluding long-running ACME/DNS tests.
  • just fmt – Formats the Python codebase (autoflake, isort, black).

Production Deployment

  • just deploy – Invokes Ansible playbooks (ansible/), using INVENTORY and PLAYBOOK environment variables loaded via dotenv.
  • Playbooks should orchestrate infrastructure provisioning, secret injection, and configuration templating consistent with this white paper.

Runtime Supervision

  • Process Supervision – Use an init system or orchestrator (e.g., systemd or Kubernetes) to supervise the API, worker, beat, OpenResty, and NSD processes. Docker Compose is suited to local development only.
  • Health Checks – Monitor:
    • API /healthz (verifies Redis availability).
    • OpenResty http://<node>:9145/healthz (used by Celery health checks).
    • Celery worker liveness (e.g., ping tasks).
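
The HTTP checks above can be sketched as a small poller. This is a minimal illustration, not part of Cirrus CDN: the function names, the timeout, and any API host or port you pass in are assumptions; only the /healthz paths and port 9145 come from this chapter.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True when the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def check_all(targets: dict[str, str], probe_fn=probe) -> dict[str, bool]:
    """Probe every named target and report its health as a boolean."""
    return {name: probe_fn(url) for name, url in targets.items()}
```

A caller would pass something like {"openresty": "http://<node>:9145/healthz"} with the node address filled in, and feed the resulting booleans into whatever supervisor or alerting hook is in use. Celery worker liveness is not covered here because it needs a broker round trip rather than an HTTP probe.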

Metrics & Dashboards

  • Prometheus (prometheus/prometheus.dev.yml) scrapes OpenResty metrics every 5 seconds at 127.0.0.1:9145.
  • Key Metrics:
    • nginx_http_requests_total{host,status} – Request rates by host/status.
    • nginx_http_request_duration_seconds – Latency histogram.
    • nginx_cache_status_total{status} – Cache behavior (HIT/MISS/STALE/BYPASS).
    • nginx_upstream_errors_total, nginx_upstream_timeouts_total – Backend health.
    • nginx_upstream_response_seconds – Upstream RTT.
    • nginx_ssl_handshake_errors_total{phase} – TLS handshake issues.
  • Grafana – Ship Grafana dashboards (provisioning under grafana/) to visualize cache efficiency, origin health, and request trends. Enable auth per organizational policy.
  • Celery Metrics – Currently absent; recommended enhancements include emitting task durations via StatsD/Prometheus exporters (see the Appendices).
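
As an illustration of how nginx_cache_status_total can be turned into a cache-efficiency number outside Grafana, the sketch below parses Prometheus text exposition and computes the share of HITs. It assumes the metric carries a status label as listed above; the function names are hypothetical, and counting every status (including BYPASS) in the denominator is a choice you may want to change.

```python
import re

# Matches lines such as: nginx_cache_status_total{status="HIT"} 1234
SAMPLE_RE = re.compile(r'^nginx_cache_status_total\{([^}]*)\}\s+([0-9.eE+-]+)')
STATUS_RE = re.compile(r'status="([^"]*)"')


def cache_counts(exposition: str) -> dict[str, float]:
    """Sum nginx_cache_status_total samples per status label value."""
    counts: dict[str, float] = {}
    for line in exposition.splitlines():
        sample = SAMPLE_RE.match(line.strip())
        if not sample:
            continue  # comments, other metrics, blank lines
        status = STATUS_RE.search(sample.group(1))
        if not status:
            continue
        key = status.group(1)
        counts[key] = counts.get(key, 0.0) + float(sample.group(2))
    return counts


def hit_ratio(counts: dict[str, float]):
    """Return HIT / all statuses, or None when no requests were counted."""
    total = sum(counts.values())
    return counts.get("HIT", 0.0) / total if total else None
```

Because the counters are cumulative, this yields a lifetime ratio for the scraped process; a dashboard panel would normally compute the same ratio over a time window.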

Sample Dashboard (Reference Layout)

Panels map directly to Lua/NGINX metrics initialized in openresty/conf/init_worker.lua and updated by log_metrics.lua. Add recording rules for p95/p99 latency to stabilize panels and alerts.
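
To make the p95/p99 recommendation concrete, the following sketch estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the general approach Prometheus documents for histogram quantiles. It is an approximation for illustration only; the function name and the bucket boundaries in the usage note are assumptions, not the buckets configured in init_worker.lua.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the open-ended bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with buckets [(0.05, 50), (0.1, 90), (0.5, 99), (inf, 100)] the p95 estimate lands between 0.1 s and 0.5 s. The result is only as precise as the bucket layout, which is one reason precomputed recording rules give steadier panels than ad hoc queries.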

Logging

  • API & Workers – Log to stdout/stderr. Capture acme_* events, errors from Redis, and zone rebuilds.
  • OpenResty Access & Error Logs – Emitted to stdout/stderr only and forwarded by the Loki Docker logging driver; apply retention and privacy controls at the aggregator.
  • Celery Logs – cirrus-worker and cirrus-beat containers log to stdout; integrate with log aggregation (ELK, Loki) in production.

Log Ingestion (Integration Examples)

  • Query Loki for OpenResty streams and, if needed, export to downstream systems via LogQL or the Loki API.
  • Enrich with node labels and environment tags in Loki (via external labels) for BI queries.
  • For privacy: redact query strings and sanitize headers at the shipper.
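
The privacy step can be sketched as two small helpers that a shipper-side filter might call before forwarding a record. This is an assumption-laden illustration: the helper names, the REDACTED placeholder, and the list of sensitive headers are hypothetical and should follow your own policy.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical policy: header names (lowercase) whose values must never be shipped.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}


def redact_uri(uri: str) -> str:
    """Replace the query string with a placeholder, keeping the path intact."""
    parts = urlsplit(uri)
    if not parts.query:
        return uri
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "REDACTED", parts.fragment)
    )


def sanitize_headers(headers: dict[str, str]) -> dict[str, str]:
    """Mask the values of sensitive headers; header names are kept for debugging."""
    return {
        name: ("REDACTED" if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
```

Keeping the path while dropping the query preserves per-URL cache analysis without storing tokens or personal data that often travel in query parameters.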

Alerting Recommendations

| Signal | Alert Condition | Response |
|---|---|---|
| Cache hit ratio drop | Sudden drop below threshold (e.g., less than 40%) | Inspect upstream availability, caching rules. |
| Upstream errors | Spike in nginx_upstream_errors_total | Check origin health, network latency. |
| ACME failure | acme_fail log or cdn:acme:{domain} status failed | Investigate DNS alignment, acme-dns reachability. |
| Node deactivation | Health task reports down state | Validate node; optionally disable or remove via /api/v1/nodes. |
| Prometheus scrape failure | Missing metrics from a node | Confirm OpenResty health endpoint availability. |
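
The cache hit ratio condition can be evaluated over a window by differencing two scrapes of the per-status counters. The sketch below is illustrative: the function names are hypothetical, the 40% default mirrors the example threshold above, and a real deployment would more likely express this as a Prometheus alerting rule.

```python
def windowed_hit_ratio(prev: dict[str, float], curr: dict[str, float]):
    """Hit ratio for the interval between two counter snapshots, or None if idle."""
    deltas: dict[str, float] = {}
    for status, value in curr.items():
        before = prev.get(status, 0.0)
        # A lower value than before means the counter was reset (e.g. a restart),
        # so the current value is the whole increase for this interval.
        deltas[status] = value - before if value >= before else value
    total = sum(deltas.values())
    return deltas.get("HIT", 0.0) / total if total > 0 else None


def should_alert(ratio, threshold: float = 0.40) -> bool:
    """Fire only when traffic was observed and the ratio is below the threshold."""
    return ratio is not None and ratio < threshold
```

Returning None for an idle interval avoids paging on a node that simply received no traffic; that case is better covered by the scrape-failure and node-health signals.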

Alert–Response Workflow

Backup & Recovery

  • Redis Persistence – AOF (--appendonly yes) persists writes across restarts but is not a substitute for backups. Take regular backups (RDB snapshots or managed-service backups) and test restore procedures.
  • Certificates – Since certificates reside in Redis, backups capture them automatically. Plan for secure storage and rotation.
  • DNS Zone State – Recomputed from Redis; no additional backup required if Redis is intact.
  • Configuration – Infrastructure-as-code (Ansible, Dockerfiles) should be version-controlled; document manual adjustments.
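
A snapshot-based backup step might look like the sketch below. It assumes a client object exposing bgsave() and lastsave(), as the redis-py Redis class does; the function name, timeout, and polling interval are hypothetical. Copying the resulting RDB file to secure off-host storage is deliberately left out.

```python
import time


def snapshot(client, timeout: float = 30.0, poll: float = 0.5,
             sleep=time.sleep, clock=time.monotonic) -> bool:
    """Trigger a background save and wait until the last-save marker changes.

    Returns True when a new snapshot was observed, False on timeout.
    """
    before = client.lastsave()   # marker of the previous successful save
    client.bgsave()              # ask Redis to write an RDB snapshot in the background
    deadline = clock() + timeout
    while clock() < deadline:
        if client.lastsave() != before:
            return True
        sleep(poll)
    return False
```

A False result should be treated as a failed backup rather than retried silently, since it may indicate disk pressure or a save already in progress. Restores should be rehearsed against a scratch Redis instance before the procedure is relied on.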

Troubleshooting Playbooks

  • Domain Returns 404 – Check cdn:dom:{domain} exists; inspect OpenResty logs for router: no conf for host. If config exists, ensure origins array is populated.
  • TLS Handshake Failure – Ensure cdn:cert:{domain} contains valid PEM entries; review ssl_loader logs for parse/set failures.
  • ACME Issuance Stuck – Inspect cdn:acme:lock:{domain} and cdn:acme:task:{domain}; if lingering beyond TTL, clear keys and requeue; verify _acme-challenge CNAME.
  • Node Missing from DNS – Confirm active flag is "1" in cdn:node:{id}; review health check logs for failures.
  • Metrics Missing – Validate NGX_METRICS_ALLOW includes Prometheus source; ensure port 9145 reachable; inspect OpenResty error log for Lua errors.
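
For the ACME playbook, the key inspection can be sketched with the Redis TTL command, which returns -2 for a missing key and -1 for a key without an expiry. The key names come from this chapter; the helper names and the client object (anything exposing ttl(key), such as a redis-py client) are assumptions.

```python
def classify_ttl(ttl: int) -> str:
    """Map a Redis TTL reply to a short state label."""
    if ttl == -2:
        return "absent"      # key does not exist
    if ttl == -1:
        return "no-expiry"   # key exists but will never expire on its own
    return "expiring"        # key exists and has ttl seconds left


def acme_key_report(client, domain: str) -> dict[str, str]:
    """Report the state of the ACME lock and task keys for one domain."""
    keys = [f"cdn:acme:lock:{domain}", f"cdn:acme:task:{domain}"]
    return {key: classify_ttl(client.ttl(key)) for key in keys}
```

A key reported as "no-expiry" is the lingering case that calls for manual cleanup and a requeue; an "expiring" key is usually better left to time out so that an in-flight issuance is not interrupted.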

Scale Parameters & SLOs (Reference)

These values are environment-dependent; validate with load testing in your target footprint.

  • Concurrent connections per node: ~ worker_processes * worker_connections (default auto * 1024). Increase NGX_WORKER_CONNECTIONS to raise limits.
  • QPS per node (cached traffic): Primarily CPU-bound; with cache HITs, thousands to tens of thousands QPS are typical on modern 4–8 vCPU hosts.
  • Cache hit ratio (steady state): Target ≥ 80% for cacheable workloads; investigate rule scope if below 50%.
  • Latency reduction: Cache HITs should eliminate origin RTT; p95 latency should approach network RTT plus Nginx processing time (roughly 20–40 ms or less on a LAN).
  • TLS handshakes: Keep error rates in nginx_ssl_handshake_errors_total near zero; spikes indicate cert/key issues or client incompatibilities.
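
The connection ceiling above is simple arithmetic, sketched here for planning purposes. The function names are hypothetical. The halved figure reflects that nginx counts connections to proxied servers against worker_connections as well, so a request that goes to the origin occupies two slots.

```python
def connection_ceiling(worker_processes: int, worker_connections: int = 1024) -> int:
    """Upper bound on simultaneous connections per node."""
    return worker_processes * worker_connections


def client_capacity_all_miss(worker_processes: int, worker_connections: int = 1024) -> int:
    """Conservative client capacity when every request also holds an upstream connection."""
    return connection_ceiling(worker_processes, worker_connections) // 2
```

On a 4-worker node with the default of 1024 connections per worker, the ceiling is 4096 connections, dropping to about 2048 concurrent clients if every request misses the cache. Real capacity also depends on file descriptor limits and memory, so treat these numbers as a starting point for load testing.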

Testing Strategy

  • Automated tests under control-plane/tests/ cover API behavior, ACME flows, and DNS health scenarios. Run just fresh-test before major releases to ensure a clean environment.
  • Integration tests rely on the full Docker stack (just up + pytest), exercising acme-dns and OpenResty interactions.

Robust observability and operational hygiene keep Cirrus CDN resilient. The white paper concludes with the Appendices for quick lookup.