Leadup
AirFund services rely on an internal Vault instance to store sensitive data (credentials, API keys, etc.). Services connect to Vault using TLS certificates stored in Kubernetes secrets. These certificates are managed by an in-house Kubernetes operator. The root CA is rotated once a year by an engineer who manually runs an automated script. This script normally triggers the operator to regenerate all dependent certificates and secrets seamlessly.
Fault
The root CA certificate expired, so services could no longer authenticate with Vault, causing an outage in the Digital Subscription platform. When the automated rotation script was then executed, it completed without error, but the Vault operator failed to regenerate service certificates, so the outage persisted until the operator was repaired manually.
Detection
- 15:06 – Alert received: Kafka broker service down.
- 15:15 – On-call engineer acknowledged the alert and began investigating.
- 15:17 – Investigation revealed services were unable to connect to Vault due to an expired certificate.
- 15:18 – Engineer launched the automated root CA rotation script.
- 15:19 – Script completed successfully, but certificates were not regenerated.
- 15:20 – Incident declared; investigation focused on the Vault operator.
Root causes
- The automated root CA rotation script reported success without verifying that service certificates had been regenerated downstream.
- The Vault operator failed to recreate certificates due to missing roles and invalid CA metadata.
Mitigation and resolution
- 15:25 – Engineer identified missing roles and recreated them manually.
- 15:31 – Vault operator raised errors about invalid CA metadata.
- 15:52 – Engineer manually re-imported the root CA.
- 15:54 – Vault operator successfully regenerated certificates.
- 15:55 – Impacted services restarted and became available.
Lessons learnt
- Rotation Validation Gap – The rotation script should validate not only its own execution but also that certificates and secrets are successfully regenerated.
- Operator Robustness – The Vault operator needs better error handling to recover gracefully from missing roles or invalid metadata.
- CRD Maintenance – The operator relies on outdated CRDs. Over successive Kubernetes upgrades, stricter schema validation and ArgoCD synchronization pruned required fields, ultimately breaking the operator’s behavior during certificate rotation.
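The rotation validation gap described above could be closed with a check along the following lines. This is a minimal sketch, not the actual rotation script: `CertRecord`, `validate_rotation`, the secret names, and the fingerprint strings are all hypothetical, and it assumes the issuer fingerprint of each service certificate has already been read from its Kubernetes secret by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertRecord:
    """Hypothetical summary of one service certificate found in a secret."""
    secret_name: str
    issuer_fingerprint: str  # fingerprint of the CA that signed the certificate


def validate_rotation(expected_secrets, records, new_ca_fingerprint):
    """Return {secret_name: reason} for every secret that was not regenerated.

    A secret is reported if it is missing entirely or if its certificate
    is still signed by a CA other than the newly rotated one.
    """
    found = {rec.secret_name: rec for rec in records}
    problems = {}
    for name in expected_secrets:
        rec = found.get(name)
        if rec is None:
            problems[name] = "secret missing"
        elif rec.issuer_fingerprint != new_ca_fingerprint:
            problems[name] = "still signed by previous CA"
    return problems


# Illustrative data only; the secret names and fingerprints are made up.
records = [
    CertRecord("kafka-broker-tls", "ca-new"),
    CertRecord("billing-api-tls", "ca-old"),
]
problems = validate_rotation(
    ["kafka-broker-tls", "billing-api-tls", "notifier-tls"], records, "ca-new"
)
for name, reason in problems.items():
    print(f"{name}: {reason}")
```

An empty result would mean every expected secret carries a certificate issued by the new CA. A non-empty result is the signal the rotation script lacked during this incident, and could be used to fail the rotation run instead of reporting success.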
Improvement areas
- Add post-rotation validation steps.
- Strengthen monitoring around Vault operator health and certificate lifecycle.
- Update CRDs to align with current Kubernetes schema rules.
- Introduce dry-run or staging tests before production certificate rotations.
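For the certificate lifecycle monitoring item, one possible shape of an expiry check is sketched below. It is an illustration under assumptions: the function name `expiring_soon`, the 30-day warning window, and the sample certificate names and dates are all hypothetical, and the expiry timestamps are assumed to have been extracted from the certificates beforehand.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(not_after_by_name, now, warn_within=timedelta(days=30)):
    """Return certificate names that expire within `warn_within` of `now`.

    Already-expired certificates are included. Names are ordered by
    expiry time, soonest first.
    """
    deadline = now + warn_within
    flagged = [
        (not_after, name)
        for name, not_after in not_after_by_name.items()
        if not_after <= deadline
    ]
    return [name for _, name in sorted(flagged)]


# Illustrative data only; names and dates are made up.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
certs = {
    "root-ca": datetime(2024, 6, 10, tzinfo=timezone.utc),
    "kafka-broker-tls": datetime(2024, 12, 1, tzinfo=timezone.utc),
    "billing-api-tls": datetime(2024, 5, 20, tzinfo=timezone.utc),
}
print(expiring_soon(certs, now))
```

Run on a schedule and wired to alerting, a check like this would flag an approaching root CA expiry ahead of time, so that rotation happens before services lose access to Vault.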