
Automation & Certificate Lifecycle

Celery powers Cirrus CDN’s asynchronous workflows: ACME certificate issuance, renewal scans, and node health checks. This chapter dissects task scheduling, locking strategies, and external integrations driven by control-plane/src/cirrus/celery_app.py.

Celery Configuration

celery_app.py defines a Celery instance with Redis as both broker and result backend:

  • Broker URL defaults to redis://[password@]host:port/0.
  • Result backend defaults to database 1.
  • Serializers are JSON-only; timezone defaults to UTC.
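
The database split above can be sketched as a small URL builder. This is an illustration only: the environment variable names REDIS_HOST, REDIS_PORT, and REDIS_PASSWORD and the helper names are assumptions, not taken from celery_app.py. The sketch writes the password in the `:password@` form that Redis client libraries conventionally parse as a password with an empty username.

```python
import os
from typing import Mapping, Optional, Tuple
from urllib.parse import quote


def build_redis_url(host: str, port: int, db: int, password: Optional[str] = None) -> str:
    """Return a redis:// URL for the given database index."""
    # Percent-encode the password so characters such as '@' or '/' stay unambiguous.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"redis://{auth}{host}:{port}/{db}"


def celery_urls(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Derive (broker_url, result_backend_url) from environment-style settings."""
    # Hypothetical variable names; the real module may read different ones.
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    password = env.get("REDIS_PASSWORD") or None
    broker = build_redis_url(host, port, 0, password)   # broker on database 0
    backend = build_redis_url(host, port, 1, password)  # results on database 1
    return broker, backend
```

Keeping broker messages and task results in separate databases means either one can be inspected or flushed without touching the other.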

The Celery beat schedule includes:

  • acme_renewal_scan – Cron-scheduled (default hourly at minute 0). Runs cirrus.acme.scan_and_renew.
  • cname_node_health – Periodic task scheduled via celery.schedules.schedule, interval derived from node health settings (NODE_HEALTH_INTERVAL_SECS).

Queues are configurable via environment variables (ACME_RENEW_QUEUE, CNAME_HEALTH_QUEUE).

ACME Certificate Issuance

Task Flow

acme_issue_task (Celery task name cirrus.acme.issue_certificate) orchestrates issuance:

  1. Generates a per-task token (Celery request.id or random hex).
  2. Calls _acme_issue_task_async(domain, token) inside asyncio.run.
  3. Acquires a Redis lock (cdn:acme:lock:{domain}) with TTL (ACME_LOCK_TTL, default 900 seconds). Skips if lock exists.
  4. Persists task ID in cdn:acme:task:{domain} for operator visibility.
  5. Marks ACME status as "running" in cdn:acme:{domain}.
  6. Ensures ACME registration exists by calling ensure_acme_registered (acme_common.py), which interacts with acme-dns via httpx.AsyncClient.
  7. Optionally enforces _acme-challenge CNAME readiness (ENFORCE_ACME_CNAME_CHECK, WAIT_FOR_CNAME, CNAME_WAIT_SECS).
  8. Loads or generates an ACME account (cdn:acmeacct:global) and a domain private key (cdn:acmecertkey:{domain}).
  9. Issues the certificate using sewer via issue_certificate_with_sewer.
  10. Stores the resulting fullchain PEM and private key in cdn:cert:{domain}, updates ACME status to "issued", and caches issuance timestamp.
  11. Unlocks by deleting cdn:acme:task:{domain} and the lock key (if owned).

Caution: Locks (cdn:acme:lock:{domain}) prevent concurrent issuance. Set the TTL (ACME_LOCK_TTL) to cover the worst-case issuance runtime; if the lock expires while a task is still running, a second task can start for the same domain.
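
The acquire-and-release pattern in steps 3 and 11 can be sketched as follows. This is a minimal illustration, not the project's code: FakeRedis is an in-memory stand-in that mimics only the set/get/delete calls used here, and the helper names acquire_domain_lock and release_domain_lock are hypothetical. The key name and the 900-second default come from the text above.

```python
import asyncio
import secrets
import time
from typing import Dict, Optional, Tuple

ACME_LOCK_TTL = 900  # seconds, the default named in the text


class FakeRedis:
    """In-memory stand-in for the few redis.asyncio.Redis calls used below."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def _live(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # expired entries behave as if absent
            return None
        return value

    async def set(self, key, value, nx=False, ex=None):
        if nx and self._live(key) is not None:
            return None  # NX refused the write because the key exists
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expires_at)
        return True

    async def get(self, key):
        return self._live(key)

    async def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


async def acquire_domain_lock(r, domain: str, token: str, ttl: int = ACME_LOCK_TTL) -> bool:
    # SET with NX and EX creates the lock only if absent and bounds its lifetime.
    return bool(await r.set(f"cdn:acme:lock:{domain}", token, nx=True, ex=ttl))


async def release_domain_lock(r, domain: str, token: str) -> bool:
    key = f"cdn:acme:lock:{domain}"
    # Delete only when we still own the lock. Against a real server this
    # get-then-delete pair is not atomic; a Lua script is the usual remedy.
    if await r.get(key) == token:
        await r.delete(key)
        return True
    return False


async def demo():
    r = FakeRedis()
    first, second = secrets.token_hex(8), secrets.token_hex(8)
    got_first = await acquire_domain_lock(r, "example.com", first)
    got_second = await acquire_domain_lock(r, "example.com", second)  # skipped
    wrong_owner = await release_domain_lock(r, "example.com", second)
    right_owner = await release_domain_lock(r, "example.com", first)
    return got_first, got_second, wrong_owner, right_owner
```

Storing the per-task token as the lock value is what makes the "if owned" check possible: a task whose lock has expired and been taken over by another task will not delete the newer lock.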

Certificate Renewal Scans

acme_scan_and_renew_task (task name cirrus.acme.scan_and_renew) executes:

  1. Acquires a global scan lock (cdn:acme:renew:scan_lock) to prevent overlapping scans.
  2. Iterates over domains from cdn:domains, filtering for those with use_acme_dns01.
  3. Skips domains currently locked or queued for issuance.
  4. Checks whether the certificate is expiring soon using is_cert_expiring_soon (threshold ACME_RENEW_BEFORE_DAYS, default 30 days).
  5. Enqueues issuance tasks (up to ACME_RENEW_MAX_PER_SCAN, default 10), marking Redis status as "queued".
  6. Records skipped domains (locked or not due) and collects per-domain errors.
  7. Releases the scan lock, ensuring cleanup even on exceptions.
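
The selection logic in steps 3 to 6 can be sketched as a pure function. The names expires_within and plan_renewals are hypothetical stand-ins (the real is_cert_expiring_soon signature is not shown in this chapter), and treating a missing certificate as due is an assumption made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Mapping, Optional, Set

ACME_RENEW_BEFORE_DAYS = 30   # default threshold from the text
ACME_RENEW_MAX_PER_SCAN = 10  # default cap from the text


def expires_within(not_after: Optional[datetime], now: datetime, days: int) -> bool:
    """Stand-in for is_cert_expiring_soon: True when expiry falls inside the window."""
    if not_after is None:
        return True  # assumption: no stored certificate counts as due
    return not_after - now <= timedelta(days=days)


def plan_renewals(
    domains: Iterable[str],
    expiry: Mapping[str, datetime],
    busy: Set[str],
    now: datetime,
    max_per_scan: int = ACME_RENEW_MAX_PER_SCAN,
    days: int = ACME_RENEW_BEFORE_DAYS,
) -> Dict[str, List[str]]:
    """Split domains into those to queue now and those skipped this scan."""
    queued: List[str] = []
    skipped: List[str] = []
    for domain in domains:
        if domain in busy:
            skipped.append(domain)  # already locked or queued for issuance
        elif not expires_within(expiry.get(domain), now, days):
            skipped.append(domain)  # not due yet
        elif len(queued) < max_per_scan:
            queued.append(domain)
        # Due domains beyond the cap are left for a later scan.
    return {"queued": queued, "skipped": skipped}
```

With the hourly default schedule, the per-scan cap spreads a large backlog of renewals across successive scans instead of enqueuing everything at once.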

Node Health Checks

cname_health_check_task (task name cirrus.cname.health_check) runs on the interval configured in NodeHealthSettings:

  • Calls _cname_health_check_task_async, which executes perform_health_checks from cname/health.py.
  • For each node, attempts HTTP GET to http://<ip>:<port>/healthz (IPv6 addresses bracketed).
  • Increments failure counters and deactivates nodes when fails_to_down threshold is met; reactivates upon succs_to_up.
  • Publishes cdn:cname:dirty when node activation state flips, triggering DNS updates.
  • Returns an array of results with node IDs, statuses (healthy, failed, down, recovered, no-address), and optional error messages.
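
The URL construction and threshold logic above can be sketched as follows. This is an illustration under assumptions: the NodeState fields, the default thresholds (3 and 2), counting consecutive probes, and the exact mapping of outcomes to status strings are guesses for the example, not the behavior of cname/health.py.

```python
import ipaddress
from dataclasses import dataclass


def health_url(ip: str, port: int) -> str:
    """Build the probe URL, bracketing IPv6 literals as URLs require."""
    addr = ipaddress.ip_address(ip)
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"http://{host}:{port}/healthz"


@dataclass
class NodeState:
    active: bool = True
    fails: int = 0  # consecutive failed probes
    succs: int = 0  # consecutive successful probes while inactive


def apply_probe(state: NodeState, ok: bool, fails_to_down: int = 3, succs_to_up: int = 2) -> str:
    """Update counters for one probe result and return a status string."""
    if ok:
        state.fails = 0
        if state.active:
            return "healthy"
        state.succs += 1
        if state.succs >= succs_to_up:
            state.active = True  # activation flips: DNS should be refreshed
            state.succs = 0
            return "recovered"
        return "down"  # still inactive until enough successes accumulate
    state.succs = 0
    state.fails += 1
    if state.active and state.fails >= fails_to_down:
        state.active = False  # activation flips: DNS should be refreshed
        return "down"
    return "failed" if state.active else "down"
```

Requiring several probes in a row before flipping state keeps a single dropped request from removing a node from DNS, and keeps a briefly recovered node from being re-added too early.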

Redis Utilities

Helper functions use redis.asyncio.Redis created via _create_async_redis():

  • perform_health_checks operates on the Redis client supplied by the caller; helpers such as _cname_health_check_task_async close the connection once the check completes.
  • ACME tasks wrap Redis interactions in try/finally to ensure connections close even on error.
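
The try/finally pattern described above can be sketched as a small wrapper. FakeAsyncRedis and with_redis are hypothetical names used for illustration; the stand-in exposes aclose(), the close method of recent redis-py async clients (older releases use close() instead).

```python
import asyncio
from typing import Any, Awaitable, Callable


class FakeAsyncRedis:
    """Stand-in client that only records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


async def with_redis(
    factory: Callable[[], Any],
    work: Callable[[Any], Awaitable[Any]],
) -> Any:
    """Create a client, run the work, and always close the client afterwards."""
    r = factory()
    try:
        return await work(r)
    finally:
        # Runs on success and on error, so connections are not leaked.
        await r.aclose()


async def demo() -> bool:
    clients = []

    def factory() -> FakeAsyncRedis:
        client = FakeAsyncRedis()
        clients.append(client)
        return client

    async def failing(_r: FakeAsyncRedis) -> None:
        raise RuntimeError("simulated task failure")

    try:
        await with_redis(factory, failing)
    except RuntimeError:
        pass  # the error still propagates to the caller
    return clients[0].closed
```

Because each Celery task invocation runs its coroutine under a fresh asyncio.run call, creating and closing the client per invocation avoids reusing a connection bound to an event loop that no longer exists.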

Locking & Concurrency Controls

  • Domain Locks: cdn:acme:lock:{domain} prevents simultaneous issuance tasks per domain.
  • Task Keys: cdn:acme:task:{domain} aids operator visibility and prevents duplicates.
  • Scan Lock: cdn:acme:renew:scan_lock ensures a single renewal sweep across workers.
  • Pub/Sub Events: publish_zone_dirty (in cname/service.py) is invoked whenever domain/node changes require a DNS refresh, ensuring eventual consistency across components.

Error Handling & Retries

  • Tasks rely on Celery's default retry behavior, which performs no automatic retries. Failures are logged and surfaced via Redis status keys, allowing operators to investigate before re-triggering tasks.
  • Certificate issuance catches all exceptions, updates status to "failed", and ensures locks are released to avoid indefinite blocking.
  • Renewal scans log upstream exceptions and include error messages in the result payload for dashboards or alerting.

Observability Hooks

  • Logging: The API logs queueing via acme_queued, while Celery tasks emit acme_start, acme_done, and acme_fail. Renewal scans log acme_auto_renew_queued and acme_auto_renew_error.
  • Redis Keys: Operators can inspect cdn:acme:{domain} to monitor status transitions (init, registered, queued, running, issued, failed).
  • Metrics: While Celery does not emit Prometheus metrics out of the box, logs and Redis data offer visibility. See Operations & Observability for potential enhancements.

Automation keeps certificates valid and node inventories accurate without manual intervention. See DNS & Traffic Engineering for how the DNS layer consumes this automation data to steer clients toward healthy edge nodes.