Production Readiness
Use this checklist before deploying Docker Secret Operator in a production environment.
Pre-Deployment Checklist
Infrastructure Requirements
| Requirement | Minimum | Notes |
|---|---|---|
| OS | Ubuntu 18.04+ / RHEL 8+ | systemd required for Agent Mode |
| Docker | v20.10+ | docker info must succeed |
| Docker Compose | v2.0+ | Required for dso up |
| RAM | 2 GB available | Agent uses ~50–100 MB at rest |
| Disk | 500 MB free | State files, logs |
| Network | HTTPS to provider | No inbound port required (polling mode) |
# Verify Docker version
docker --version # should be 20.10+
# Verify Compose version
docker compose version # should be 2.x
# Check available memory
free -h
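The manual checks above can also be scripted. The following is a minimal sketch, not part of DSO itself: the helper names `version_ge` and `check_min` are hypothetical, and the comparison assumes GNU `sort -V` is available on the host.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on GNU sort -V (version sort), which places the lower version first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME FOUND MINIMUM: print a one-line verdict; return non-zero on failure.
check_min() {
  if version_ge "$2" "$3"; then
    echo "OK   $1 $2 (>= $3)"
  else
    echo "FAIL $1 $2 (< $3)"
    return 1
  fi
}

# Example usage on a host with Docker installed:
#   check_min docker  "$(docker version --format '{{.Server.Version}}')" 20.10
#   check_min compose "$(docker compose version --short | sed 's/^v//')" 2.0
```

Comparing with a version sort rather than a string comparison matters here: a plain string comparison would rank 20.9 above 20.10.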
Security Configuration
- ✅ Use IAM roles or managed identities — not static access keys
- ✅ Apply least-privilege IAM/RBAC policies (only GetSecretValue on specific paths)
- ✅ Configure firewall rules (deny by default, allow only HTTPS to provider)
- ✅ Enable TLS certificate validation (never set verify_tls: false)
- ✅ Enable audit logging in secret provider (CloudTrail, Activity Log, Vault audit)
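As an illustration of the least-privilege item, here is a sketch of a read-only policy for the case where the provider is AWS Secrets Manager. The helper `print_dso_policy`, the region, the account ID, and the `prod/myapp` prefix are all placeholders chosen for this example; adapt the resource ARN to your own secret paths.

```shell
# print_dso_policy REGION ACCOUNT_ID PREFIX
# Emit an IAM policy that allows reading secrets under one path prefix and nothing else.
# All three arguments are placeholders supplied by the caller.
print_dso_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:$1:$2:secret:$3/*"
    }
  ]
}
EOF
}

# Example usage (values are illustrative only):
#   print_dso_policy us-east-1 123456789012 prod/myapp > dso-read-policy.json
```

Scoping the resource to a prefix rather than `*` means a compromised agent host can read only the secrets that host actually needs.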
Operational Setup
- ✅ Configure rotation schedule in provider (event-driven recommended)
- ✅ Set up monitoring and alerting for rotation failures
- ✅ Configure log aggregation (syslog, CloudWatch, Datadog, etc.)
- ✅ Document recovery procedures for your operator team
- ✅ Test automatic rollback scenario manually before going live
Testing Requirements
# 1. Verify agent starts cleanly
sudo systemctl start dso-agent
sudo systemctl status dso-agent
# 2. Verify health endpoint
curl http://localhost:8081/health
# → {"status":"ok","provider":"aws"}
# 3. Check logs are clean
sudo journalctl -u dso-agent -n 50
# 4. Test rotation manually
sudo dso rotation trigger
# 5. Verify rollback works — stop container during health check
# and verify the old container is restored automatically
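When these checks run from a script, the health endpoint may not answer immediately after the agent starts. A small retry helper avoids false failures. This is a sketch under that assumption; `wait_until` is a hypothetical helper name, and the retry count and delay in the usage line are arbitrary.

```shell
# wait_until TRIES DELAY CMD...
# Rerun CMD until it succeeds or TRIES attempts have been used; return 1 on exhaustion.
wait_until() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example usage: poll the health endpoint up to 10 times, 3 seconds apart.
#   wait_until 10 3 curl -sf http://localhost:8081/health
```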
Deployment Best Practices
Rolling Deployment
For multi-host deployments, update agents one host at a time:
# On each host, in sequence:
sudo systemctl stop dso-agent
# ... update binary ...
sudo systemctl start dso-agent
sleep 30 # give the agent time to start
curl -sf http://localhost:8081/health # verify health before moving to the next host
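The per-host sequence can be wrapped in a loop that stops at the first unhealthy host, so a bad update never spreads. The sketch below is one possible arrangement: `rolling_update` and `update_host` are hypothetical names, the host names are placeholders, and `update_host` assumes SSH access with passwordless sudo. The binary update step is left as a placeholder, as in the snippet above.

```shell
# rolling_update STEP HOST...
# Run STEP once per host, in order, and halt at the first failure.
rolling_update() {
  local step="$1" host
  shift
  for host in "$@"; do
    echo "updating $host"
    if ! "$step" "$host"; then
      echo "halting rollout: $host failed" >&2
      return 1
    fi
  done
}

# update_host HOST: hypothetical per-host step run over SSH.
# The ':' line is a no-op placeholder for the actual binary update.
update_host() {
  ssh "$1" 'sudo systemctl stop dso-agent &&
    : "update binary here" &&
    sudo systemctl start dso-agent &&
    sleep 30 &&
    curl -sf http://localhost:8081/health >/dev/null'
}

# Example usage:
#   rolling_update update_host host-a host-b host-c
```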
Monitoring & Alerting
Set up alerts for these events:
| Alert | Threshold | Severity |
|---|---|---|
| Rotation failure | 1 failure | Warning |
| Agent not responding | 2 min | Critical |
| Health check timeout | >30s | Warning |
| Provider unavailable | 1 failure | Critical |
# Quick health check for monitoring scripts
curl -sf http://localhost:8081/health | jq .status
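Many monitoring systems expect a plugin-style exit code rather than a printed string. The sketch below maps the reported status to the common 0/1/2 convention (OK, WARNING, CRITICAL). The helper name `health_severity` and the severity mapping are assumptions made for this example, not DSO behavior.

```shell
# health_severity STATUS
# Map the agent's reported status to a Nagios-style exit code:
#   0 = OK, 1 = WARNING (agent answered, status is not "ok"), 2 = CRITICAL (no answer).
health_severity() {
  case "$1" in
    ok) echo "OK"; return 0 ;;
    "") echo "CRITICAL: agent not responding"; return 2 ;;
    *)  echo "WARNING: status=$1"; return 1 ;;
  esac
}

# Example usage: an empty status means curl failed or the field was missing.
#   status=$(curl -sf --max-time 5 http://localhost:8081/health | jq -r '.status // empty')
#   health_severity "$status"
```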
Backup & Recovery
# Back up DSO state files
sudo cp /var/lib/dso/state.json /backup/dso-state-$(date +%Y%m%d).json
sudo cp /var/lib/dso/checkpoint.json /backup/dso-checkpoint-$(date +%Y%m%d).json
- Keep provider credentials in a separate secure vault
- Document manual recovery steps for complete provider failure
- Test recovery procedure quarterly
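For scheduled backups, the two copy commands can be combined with a retention rule. This is a sketch only: `backup_dso_state` is a hypothetical helper, the 30-day default retention is an assumption, and `find -delete` requires GNU or BSD find.

```shell
# backup_dso_state SRC DEST [KEEP_DAYS]
# Copy both state files with a date stamp, then prune stamped copies
# older than KEEP_DAYS (default 30, an assumed retention period).
backup_dso_state() {
  local src="$1" dest="$2" keep="${3:-30}" stamp f
  stamp=$(date +%Y%m%d)
  mkdir -p "$dest" || return 1
  for f in state checkpoint; do
    cp "$src/$f.json" "$dest/dso-$f-$stamp.json" || return 1
  done
  find "$dest" -name 'dso-*.json' -mtime +"$keep" -delete
}

# Example usage (run as root, for instance from a cron job):
#   backup_dso_state /var/lib/dso /backup
```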
Compliance Checklist
PCI-DSS
- ✅ Secrets never on disk as plaintext (Requirement 3.4)
- ✅ Audit logging enabled and reviewed (Requirement 10.2)
- ✅ TLS for all network communication (Requirement 4.1)
- ✅ Credential rotation every 90 days or less (Requirement 8.6)
HIPAA
- ✅ Encryption at rest and in transit (§164.312)
- ✅ Access logs for audit trail (§164.312(b))
- ✅ Automatic rollback on failure, supporting data integrity (§164.312(c)(1))
SOC 2
- ✅ Audit logging with timestamps
- ✅ Access control and authentication
- ✅ Availability monitoring
- ✅ Automatic recovery from failures
Performance Reference
Rotation Timing
| Stage | Typical Duration |
|---|---|
| Lock acquisition | 100–500ms |
| Secret fetch from provider | 200ms–1s |
| New container creation | 1–3s |
| Health check | 1–5s (app-dependent) |
| Atomic swap | ~100ms |
| Old container cleanup | 1–2s |
| Total | 3–12 seconds |
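The total follows from summing the per-stage bounds. The short check below reproduces that arithmetic in milliseconds, treating the atomic swap as a flat 100 ms; `rotation_total` is a hypothetical helper name used only for this illustration.

```shell
# rotation_total: sum the per-stage minimum and maximum durations (in ms)
# from the table and print the range in seconds.
rotation_total() {
  awk 'BEGIN {
    min = 100 + 200 + 1000 + 1000 + 100 + 1000   # lock, fetch, create, health, swap, cleanup
    max = 500 + 1000 + 3000 + 5000 + 100 + 2000
    printf "%.1f-%.1f s\n", min / 1000, max / 1000
  }'
}

rotation_total   # prints 3.4-11.6 s
```

Rounded to whole seconds, this gives the 3–12 second range. The health check is the largest and most variable term, so slow-starting applications will sit near the upper bound.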
Resource Usage
| Resource | At Rest | During Rotation |
|---|---|---|
| Memory | 50–100 MB | 150–300 MB |
| CPU | Minimal | Temporary spike |
| Disk I/O | Minimal | State file writes |
| Network | 0 (polling idle) | 1–5 API calls |
Scaling Considerations
Single Host (Recommended)
A single agent manages all containers on one host, with automatic lock management and crash recovery. This is the most common configuration and the recommended one.
Multi-Host
- Each host runs an independent agent
- Agents coordinate via lock file and state timestamps
- Stale locks auto-break after 5 minutes
- No distributed consensus required
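When diagnosing a stuck rotation, it can help to check by hand whether a lock has passed the 5-minute limit. The sketch below illustrates that rule only; it is not the agent's implementation. It assumes staleness can be judged from the lock file's modification time, uses GNU `stat`, and takes the lock path as an argument because the actual location is not specified here.

```shell
# lock_is_stale FILE [MAX_AGE]
# Succeed when FILE exists and was last modified more than MAX_AGE seconds ago
# (default 300, matching the documented 5-minute limit).
lock_is_stale() {
  local file="$1" max_age="${2:-300}" now mtime
  [ -e "$file" ] || return 1
  now=$(date +%s)
  # GNU stat; on BSD/macOS the equivalent would be: stat -f %m
  mtime=$(stat -c %Y "$file") || return 1
  [ $((now - mtime)) -gt "$max_age" ]
}

# Example usage (the path is a placeholder):
#   if lock_is_stale /path/to/dso.lock; then echo "lock is older than 5 minutes"; fi
```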
Next Steps
- Complete this checklist before going to production
- Schedule runbook training with your operator team
- Set up monitoring and alerting dashboards
- Plan and test disaster recovery procedures
- Read Observability for metrics setup