Automation & Certificate Lifecycle
Celery powers Cirrus CDN’s asynchronous workflows: ACME certificate issuance, renewal scans, and node health checks. This chapter dissects task scheduling, locking strategies, and external integrations driven by control-plane/src/cirrus/celery_app.py.
Celery Configuration
celery_app.py defines a Celery instance with Redis as both broker and result backend:
- Broker URL defaults to redis://[password@]host:port/0.
- Result backend defaults to database 1.
- Serializers are JSON-only; timezone defaults to UTC.
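The defaults above can be sketched as a small settings builder. This is an illustration under stated assumptions, not the contents of celery_app.py: the helper names and the REDIS_HOST, REDIS_PORT, and REDIS_PASSWORD variable names are hypothetical, and the password is written in the `:password@` form that redis-py URL parsing expects. The dictionary keys are standard lowercase Celery setting names.

```python
from __future__ import annotations

from urllib.parse import quote


def build_redis_url(host: str, port: int, db: int, password: str | None = None) -> str:
    """Build a redis:// URL; the auth part is only added when a password is set."""
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"redis://{auth}{host}:{port}/{db}"


def celery_settings(env: dict[str, str]) -> dict[str, object]:
    """Return Celery settings matching the documented defaults."""
    # Hypothetical variable names; the real ones are defined in celery_app.py.
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    password = env.get("REDIS_PASSWORD") or None
    return {
        "broker_url": build_redis_url(host, port, 0, password),      # database 0
        "result_backend": build_redis_url(host, port, 1, password),  # database 1
        "task_serializer": "json",
        "result_serializer": "json",
        "accept_content": ["json"],  # JSON-only: other payload types are rejected
        "timezone": "UTC",
    }
```

A mapping like this could be applied with `app.conf.update(...)` on a Celery instance. Keeping broker and results in separate databases means result keys can be flushed without touching queued messages.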
The Celery beat schedule includes:
- acme_renewal_scan – Cron-scheduled (default hourly at minute 0). Runs cirrus.acme.scan_and_renew.
- cname_node_health – Periodic task scheduled via celery.schedules.schedule, interval derived from node health settings (NODE_HEALTH_INTERVAL_SECS).
Queues are configurable via environment variables (ACME_RENEW_QUEUE, CNAME_HEALTH_QUEUE).
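As a rough sketch, the two beat entries and their queue routing might be assembled as follows. To keep the snippet runnable without Celery installed, the cron fields are held as plain keyword arguments and the interval as a number of seconds; in a Celery app they would be wrapped in `celery.schedules.crontab` and `celery.schedules.schedule` as noted in the comments. The function name, the fallback queue names, the 30-second fallback interval, and reading NODE_HEALTH_INTERVAL_SECS from the environment are all assumptions, not Cirrus defaults.

```python
def build_beat_schedule(env):
    """Return a beat_schedule-shaped mapping for the two periodic tasks."""
    interval = float(env.get("NODE_HEALTH_INTERVAL_SECS", "30"))  # placeholder default
    return {
        "acme_renewal_scan": {
            "task": "cirrus.acme.scan_and_renew",
            # In a Celery app: crontab(minute="0", hour="*"), i.e. hourly at minute 0.
            "schedule": {"minute": "0", "hour": "*"},
            "options": {"queue": env.get("ACME_RENEW_QUEUE", "acme")},  # placeholder name
        },
        "cname_node_health": {
            "task": "cirrus.cname.health_check",
            # In a Celery app: schedule(run_every=interval).
            "schedule": interval,
            "options": {"queue": env.get("CNAME_HEALTH_QUEUE", "health")},  # placeholder name
        },
    }
```

Routing each periodic task to its own queue lets operators run dedicated workers, so a slow certificate issuance does not delay health checks.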
ACME Certificate Issuance
Task Flow
acme_issue_task (Celery task name cirrus.acme.issue_certificate) orchestrates issuance:
- Generates a per-task token (Celery request.id or random hex).
- Calls _acme_issue_task_async(domain, token) inside asyncio.run.
- Acquires a Redis lock (cdn:acme:lock:{domain}) with TTL (ACME_LOCK_TTL, default 900 seconds). Skips if lock exists.
- Persists task ID in cdn:acme:task:{domain} for operator visibility.
- Marks ACME status as "running" in cdn:acme:{domain}.
- Ensures ACME registration exists by calling ensure_acme_registered (acme_common.py), which interacts with acme-dns via httpx.AsyncClient.
- Optionally enforces _acme-challenge CNAME readiness (ENFORCE_ACME_CNAME_CHECK, WAIT_FOR_CNAME, CNAME_WAIT_SECS).
- Loads or generates an ACME account (cdn:acmeacct:global) and a domain private key (cdn:acmecertkey:{domain}).
- Issues the certificate using sewer via issue_certificate_with_sewer.
- Stores the resulting fullchain PEM and private key in cdn:cert:{domain}, updates ACME status to "issued", and caches issuance timestamp.
- Unlocks by deleting cdn:acme:task:{domain} and the lock key (if owned).
[Diagrams not reproduced here: Sequence; Status State Machine]
Locks (cdn:acme:lock:{domain}) prevent concurrent issuance. Set the TTL (ACME_LOCK_TTL) to at least the worst-case issuance runtime; if the lock expires while a task is still running, a second task can start for the same domain.
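The token-owned lock and the status updates from the task flow can be illustrated with a small, self-contained sketch. It uses an in-memory stand-in for Redis so it runs anywhere; with a real client, acquisition corresponds to `SET key token NX EX ttl`. Every function and class name here is hypothetical, the status is stored as a plain string for simplicity, and a production release would normally compare and delete atomically (for example in a Lua script), which the source does not describe.

```python
import secrets
import time

ACME_LOCK_TTL = 900  # seconds, the documented default


class FakeRedis:
    """In-memory stand-in that mimics SET (with NX/EX), GET, and DEL."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        item = self._data.get(key)
        if item and item[1] <= time.monotonic():
            del self._data[key]  # expired: behave as if the key is gone
            return None
        return item

    def set(self, key, value, nx=False, ex=None):
        if nx and self._live(key):
            return None
        expires = time.monotonic() + ex if ex is not None else float("inf")
        self._data[key] = (value, expires)
        return True

    def get(self, key):
        item = self._live(key)
        return item[0] if item else None

    def delete(self, key):
        return 1 if self._data.pop(key, None) else 0


def task_token(request_id=None):
    """Use the Celery request id when available, otherwise random hex."""
    return request_id or secrets.token_hex(16)


def lock_key(domain):
    return f"cdn:acme:lock:{domain}"


def acquire_lock(r, domain, token, ttl=ACME_LOCK_TTL):
    """Return True if this token now owns the lock, False if it is already held."""
    return bool(r.set(lock_key(domain), token, nx=True, ex=ttl))


def release_lock(r, domain, token):
    """Delete the lock only when it is still owned by this token."""
    if r.get(lock_key(domain)) == token:
        r.delete(lock_key(domain))
        return True
    return False


def issue_with_lock(r, domain, issue, request_id=None):
    """Skip when locked; otherwise record status and always release the lock."""
    token = task_token(request_id)
    if not acquire_lock(r, domain, token):
        return "skipped"
    status_key = f"cdn:acme:{domain}"
    try:
        r.set(status_key, "running")
        issue(domain)  # placeholder for the actual ACME issuance
        r.set(status_key, "issued")
        return "issued"
    except Exception:
        r.set(status_key, "failed")
        return "failed"
    finally:
        release_lock(r, domain, token)
```

Checking ownership before deleting matters when the TTL is too short: a task that outlives its lock would otherwise delete a lock that now belongs to a different task.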
Certificate Renewal Scans
acme_scan_and_renew_task (task name cirrus.acme.scan_and_renew) executes:
- Acquires a global scan lock (cdn:acme:renew:scan_lock) to prevent overlapping scans.
- Iterates over domains from cdn:domains, filtering for those with use_acme_dns01.
- Skips domains currently locked or queued for issuance.
- Evaluates if the certificate is expiring soon using is_cert_expiring_soon (threshold ACME_RENEW_BEFORE_DAYS, default 30 days).
- Enqueues issuance tasks (up to ACME_RENEW_MAX_PER_SCAN, default 10), marking Redis status as "queued".
- Records skipped domains (locked or not due) and collects per-domain errors.
- Releases the scan lock, ensuring cleanup even on exceptions.
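The selection logic in the steps above can be sketched as a pure function over already-loaded domain data. This is an illustration only: `expires_within` is a stand-in for is_cert_expiring_soon (whose real signature is not shown in this chapter), the input shape and `plan_renewals` name are hypothetical, and treating a missing certificate as due is an assumption.

```python
from datetime import datetime, timedelta, timezone

ACME_RENEW_BEFORE_DAYS = 30   # documented default threshold
ACME_RENEW_MAX_PER_SCAN = 10  # documented default cap


def expires_within(not_after, now, days=ACME_RENEW_BEFORE_DAYS):
    """True when the certificate is missing or expires within `days` of `now`."""
    if not_after is None:
        return True  # assumption: no certificate yet means issuance is due
    return not_after - now <= timedelta(days=days)


def plan_renewals(domains, busy, now, max_per_scan=ACME_RENEW_MAX_PER_SCAN):
    """Decide which domains to enqueue in one scan.

    domains: {name: {"use_acme_dns01": bool, "not_after": datetime or None}}
    busy:    set of names currently locked or queued for issuance
    Returns (to_queue, skipped) where skipped holds (name, reason) pairs.
    """
    to_queue, skipped = [], []
    for name, conf in domains.items():
        if not conf.get("use_acme_dns01"):
            continue  # domain does not use ACME DNS-01
        if name in busy:
            skipped.append((name, "locked"))
        elif not expires_within(conf.get("not_after"), now):
            skipped.append((name, "not_due"))
        elif len(to_queue) < max_per_scan:
            to_queue.append(name)
        # Due domains beyond the cap are left for the next scan.
    return to_queue, skipped
```

The per-scan cap spreads a large batch of renewals over several scans, which helps stay under certificate authority rate limits.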
Node Health Checks
cname_health_check_task (task name cirrus.cname.health_check) runs on the interval configured in NodeHealthSettings:
- Calls _cname_health_check_task_async, which executes perform_health_checks from cname/health.py.
- For each node, attempts HTTP GET to http://<ip>:<port>/healthz (IPv6 addresses bracketed).
- Increments failure counters and deactivates nodes when fails_to_down threshold is met; reactivates upon succs_to_up.
- Publishes cdn:cname:dirty when node activation state flips, triggering DNS updates.
- Returns an array of results with node IDs, statuses (healthy, failed, down, recovered, no-address), and optional error messages.
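The URL construction and threshold logic above can be sketched as follows. This is a simplified model, not the code in cname/health.py: `NodeState`, `healthz_url`, and `apply_probe` are hypothetical names, the threshold values 3 and 2 are placeholders, and the status reported for an inactive node that has not yet reached succs_to_up is an assumption. The no-address case is not covered.

```python
import ipaddress
from dataclasses import dataclass


def healthz_url(ip, port):
    """Build the probe URL, bracketing IPv6 literals as URLs require."""
    host = f"[{ip}]" if ipaddress.ip_address(ip).version == 6 else ip
    return f"http://{host}:{port}/healthz"


@dataclass
class NodeState:
    active: bool = True
    fails: int = 0  # consecutive failed probes
    succs: int = 0  # consecutive successful probes while inactive


def apply_probe(state, ok, fails_to_down=3, succs_to_up=2):
    """Update counters for one probe result.

    Returns (status, flipped). `flipped` is True when the activation state
    changed, which is when a cdn:cname:dirty event would be published.
    """
    if ok:
        state.fails = 0
        if state.active:
            return "healthy", False
        state.succs += 1
        if state.succs >= succs_to_up:
            state.active, state.succs = True, 0
            return "recovered", True
        return "down", False  # assumption: still inactive until the threshold
    state.succs = 0
    state.fails += 1
    if state.active and state.fails >= fails_to_down:
        state.active = False
        return "down", True
    return ("failed" if state.active else "down"), False
```

Requiring several consecutive results before flipping a node avoids DNS churn from a single dropped probe.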
Redis Utilities
Helper functions use redis.asyncio.Redis created via _create_async_redis():
- perform_health_checks operates on the Redis client supplied by the caller; helpers such as _cname_health_check_task_async close the connection once the check completes.
- ACME tasks wrap Redis interactions in try/finally to ensure connections close even on error.
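The connection-handling pattern can be shown with a stand-in client. `FakeAsyncRedis` and `with_redis` are hypothetical names used only for this sketch; the stand-in exposes `aclose()`, which recent redis-py releases provide on `redis.asyncio.Redis` (older releases use `close()`).

```python
import asyncio


class FakeAsyncRedis:
    """Minimal async stand-in that records whether it was closed."""

    def __init__(self):
        self.closed = False
        self.store = {}

    async def set(self, key, value):
        self.store[key] = value

    async def aclose(self):
        self.closed = True


async def with_redis(factory, work):
    """Create a client, run `work(client)`, and always close the client."""
    r = factory()
    try:
        return await work(r)
    finally:
        await r.aclose()  # runs on success and on error


async def mark_running(r):
    await r.set("cdn:acme:example.com", "running")
    return "ok"


if __name__ == "__main__":
    client = FakeAsyncRedis()
    print(asyncio.run(with_redis(lambda: client, mark_running)), client.closed)
```

Because each Celery task invocation runs its coroutine through asyncio.run, a client that is not closed in the same call can outlive its event loop, so closing in `finally` keeps connections from leaking across tasks.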
Locking & Concurrency Controls
- Domain Locks – cdn:acme:lock:{domain} prevents simultaneous issuance tasks per domain.
- Task Keys – cdn:acme:task:{domain} aids operator visibility and prevents duplicates.
- Scan Lock – cdn:acme:renew:scan_lock ensures single renewal sweep across workers.
- Pub/Sub Events – publish_zone_dirty (in cname/service.py) is invoked whenever domain/node changes require DNS refresh, ensuring eventual consistency across components.
Error Handling & Retries
- Celery runs with its default retry policy (no automatic retries). Failures are logged and surfaced via Redis status keys, allowing operators to investigate before re-triggering tasks.
- Certificate issuance catches all exceptions, updates status to "failed", and ensures locks are released to avoid indefinite blocking.
- Renewal scans log upstream exceptions and include error messages in the result payload for dashboards or alerting.
Observability Hooks
- Logging: The API logs queueing via acme_queued, while Celery tasks emit acme_start, acme_done, and acme_fail. Renewal scans log acme_auto_renew_queued and acme_auto_renew_error.
- Redis Keys: Operators can inspect cdn:acme:{domain} to monitor status transitions (init, registered, queued, running, issued, failed).
- Metrics: While Celery does not emit Prometheus metrics out of the box, logs and Redis data offer visibility. See Operations & Observability for potential enhancements.
Automation keeps certificates valid and node inventories accurate without manual intervention. See DNS & Traffic Engineering for how the DNS layer consumes this automation data to steer clients toward healthy edge nodes.