DSO Operational Guide (Day-2 Operations)
Overview
This guide covers operating DSO after initial setup, including monitoring, maintenance, troubleshooting, and scaling considerations.
Daily Operations
1. Health Checks
Morning Health Check:
# Comprehensive environment validation
docker dso doctor --level full
# Current status snapshot
docker dso status
# Review recent logs
docker dso system logs -n 30
Quick Status Anytime:
# Single-line health check
docker dso doctor
# Real-time monitoring (dev)
docker dso status --watch
2. Monitoring (v3.5+)
Setup Real-Time Monitoring:
# In one terminal: Watch status
docker dso status --watch
# In another terminal: Follow logs
docker dso system logs -f
# In third terminal: Monitor containers
docker ps --filter name=app --format "table {{.Names}} {{.Status}}"
Key Metrics to Watch:
- Cache hit rate (should be >90% for normal operation)
- Rotation success rate (should be 100% or very close)
- Queue depth (should be <100)
- Container health status (all HEALTHY)
- Provider latency (should be <5 seconds for most operations)
- Lock contention (lock acquisition waits >1s indicate scaling issues)
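The thresholds above can be turned into a simple pass/fail check. The following is a minimal sketch, not part of DSO: check_metrics is a hypothetical helper, and it takes the readings as plain whole-number arguments because the exact layout of the status output is not specified here. The 99% cutoff for rotation success is an assumption standing in for "very close to 100%".

```shell
# check_metrics CACHE_HIT_PCT ROTATION_SUCCESS_PCT QUEUE_DEPTH PROVIDER_LATENCY_S
# Hypothetical helper: all four arguments are whole numbers supplied by the
# caller. Prints one line per breached threshold and returns non-zero if
# any threshold is breached.
check_metrics() {
  local cache_hit=$1 rotation_ok=$2 queue_depth=$3 latency_s=$4
  local breached=0

  if [ "$cache_hit" -le 90 ]; then
    echo "WARN: cache hit rate ${cache_hit}% (expected >90%)"
    breached=1
  fi
  # "Very close to 100%" is read here as at least 99%; that cutoff is an assumption.
  if [ "$rotation_ok" -lt 99 ]; then
    echo "WARN: rotation success rate ${rotation_ok}% (expected ~100%)"
    breached=1
  fi
  if [ "$queue_depth" -ge 100 ]; then
    echo "WARN: queue depth ${queue_depth} (expected <100)"
    breached=1
  fi
  if [ "$latency_s" -ge 5 ]; then
    echo "WARN: provider latency ${latency_s}s (expected <5s)"
    breached=1
  fi
  return "$breached"
}
```

For example, check_metrics 95 100 12 2 prints nothing and returns 0, while check_metrics 80 100 150 2 prints two warnings and returns 1.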
v3.5 Observability Features:
- Per-rotation trace IDs for end-to-end correlation
- Provider latency monitoring (tracks min/max/avg response times)
- Lock contention detection (alerts on slow acquisitions)
- Health check diagnostics (captures exit codes and output)
- Circuit breaker status (tracks provider failure isolation)
View details in status output:
docker dso status --json | jq '.observability'
3. Configuration Management
Regular Config Reviews:
# View current configuration
docker dso config show
# Validate syntax
docker dso config validate
# Edit if needed
docker dso config edit
# Changes take effect on next rotation or manual restart
sudo docker dso system restart
Common Config Changes:
# Increase cache for high-throughput applications
agent:
  cache:
    max_size: 1Gi # from 500Mi
# Speed up rotation timeout for faster failover
rotation:
  timeout: 15s # from 30s
# Change polling for less frequent checks
watch:
  polling_interval: 15m # from 5m
v3.5 Automatic Recovery
DSO v3.5 automatically recovers from agent crashes:
What Happens on Agent Restart:
- Detects incomplete rotations older than 5 minutes
- Automatically cleans up orphaned containers (using naming patterns: _dso_backup_, _dso_new_)
- Validates original container state
- Resumes normal operations
Operator Actions:
- None required for most scenarios
- Check logs for automatic recovery confirmation: docker dso system logs | grep "Automatic recovery"
- For critical errors, review recovery procedures documentation
View Recovery Status:
docker dso status --json | jq '.recovery'
See Recovery Procedures for manual recovery steps.
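To confirm that automatic cleanup left nothing behind, leftover containers can be listed by the naming patterns mentioned above. This is a rough sketch: filter_orphans is a hypothetical helper, and it assumes the patterns appear as substrings of the container name.

```shell
# filter_orphans: read container names on stdin (one per line) and print
# those containing _dso_backup_ or _dso_new_. Matching on a substring is an
# assumption about how the agent names these containers.
filter_orphans() {
  # grep exits non-zero when nothing matches; treat "no orphans" as success.
  grep -E '_dso_(backup|new)_' || true
}

# Example use against the local Docker daemon (standard docker flags):
#   docker ps -a --format '{{.Names}}' | filter_orphans
```

An empty result after an agent restart suggests the cleanup completed. Any names printed are candidates for manual review, not automatic deletion.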
Troubleshooting
Issue: Rotation Failures
Symptoms: docker dso status shows failed rotations
Diagnosis:
# Check recent rotation errors
docker dso system logs -p err --since 1h
# Check container health
docker ps --filter health=unhealthy
# Validate provider connectivity
docker dso doctor --level full
Solutions:
# 1. Check provider health
docker dso doctor | grep -A 2 "provider"
# 2. If provider is down, wait and retry
docker dso system restart
# 3. If still failing, check logs
docker dso system logs -f -p err
# 4. Manual recovery
# - Restore from backup
# - Or manually update secret in provider
# - Then restart agent
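The "wait and retry" step can be scripted with a small backoff loop. This is a sketch under assumptions: retry is a hypothetical helper, and wrapping docker dso doctor with it only works if that command exits non-zero while the provider is unhealthy, which is not confirmed here.

```shell
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay between attempts. Returns 0 on success, 1 after the final failure.
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example (assumes doctor signals provider failure via its exit status):
#   retry 5 10 docker dso doctor --level full && sudo docker dso system restart
```

Doubling the delay avoids hammering a provider that is already struggling. Cap the number of attempts so a prolonged outage surfaces as a failure instead of a silent loop.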
Issue: High Cache Miss Rate
Symptoms: Cache hit rate <80%, frequent provider lookups
Diagnosis:
# Check cache status
docker dso status | grep -i cache
# Monitor cache behavior
docker dso system logs | grep cache_miss
Solutions:
# 1. Increase cache TTL
sudo nano /etc/dso/dso.yaml
# Change: cache.ttl: 1h → 4h
# 2. Increase cache size
# Change: cache.max_size: 500Mi → 1Gi
# 3. Reduce rotation frequency if possible
sudo docker dso system restart
Issue: Agent Service Won't Start
Symptoms: systemctl status dso-agent shows failed
Diagnosis:
# Check service status
systemctl status dso-agent -l
# Check recent logs
journalctl -u dso-agent -n 50
# Validate configuration
docker dso config validate
Solutions:
# 1. Fix configuration errors
sudo docker dso config validate
# If errors, fix and validate again
# 2. Check permissions
ls -la /etc/dso /var/lib/dso /run/dso
# 3. Check disk space
df -h /var/lib/dso
# 4. Restart service
sudo systemctl restart dso-agent
# 5. Monitor startup
journalctl -u dso-agent -f
Issue: Container Rotation Slow
Symptoms: Rotations take >60 seconds
Diagnosis:
# Check rotation timing
docker dso system logs | grep "rotation complete"
# Identify slow step
docker dso system logs | grep -E "health|verify|swap"
# Check health check config
docker inspect <container> --format '{{json .State.Health}}'
Solutions:
# 1. Reduce health check timeout
sudo nano /etc/dso/dso.yaml
# Change: rotation.timeout: 30s → 15s
# 2. Shorten container health check interval and timeout (in docker-compose.yml)
healthcheck:
  interval: 10s # from 30s
  timeout: 5s # from 10s
# 3. Optimize application startup
# Reduce app initialization time
Issue: Provider Connection Issues
Symptoms: Doctor shows provider as "unavailable" or "unhealthy"
Diagnosis:
# Check provider status
docker dso doctor --level full | grep -A 3 provider
# Test provider connectivity directly
docker dso doctor
# Check network connectivity
ping <provider-host>
curl <provider-endpoint>
Solutions:
# 1. Verify provider configuration
docker dso config show | grep -A 5 providers
# 2. Test provider access
# - Vault: vault status
# - AWS: aws secretsmanager list-secrets
# - Azure: az keyvault secret list
# 3. Check network rules
# - DNS resolution working?
# - Firewall allowing access?
# - Credentials valid?
# 4. Restart agent if fixed
sudo docker dso system restart
Maintenance
Regular Tasks
Daily:
- Review status: docker dso doctor
- Check recent logs: docker dso system logs -n 20
Weekly:
- Review rotation history: docker dso system logs --since 7d | grep rotation
- Check cache effectiveness: docker dso status | grep cache
- Validate configuration: docker dso config validate
Monthly:
- Review operational metrics
- Test failover/recovery procedures
- Update documentation
- Review security logs
Backup & Recovery
Backup State:
# Backup configuration
sudo cp /etc/dso/dso.yaml /backup/dso-config-$(date +%Y%m%d).yaml
# Backup state directory
sudo tar -czf /backup/dso-state-$(date +%Y%m%d).tar.gz /var/lib/dso/state/
# Backup vault (if using local provider)
tar -czf /backup/dso-vault-$(date +%Y%m%d).tar.gz ~/.dso/vault.enc
Restore State:
# 1. Stop agent
sudo docker dso system disable
# 2. Restore state files
sudo tar -xzf /backup/dso-state-20240512.tar.gz -C /
# 3. Restart agent
sudo docker dso system enable
# 4. Verify
docker dso doctor
docker dso status
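The restore example hard-codes a backup date. When restoring the most recent archive, the file name can be resolved from the date-stamped naming used in the backup step. This is a minimal sketch: latest_backup is a hypothetical helper that assumes files are named PREFIX-YYYYMMDD.tar.gz, as produced by the backup commands above.

```shell
# latest_backup DIR PREFIX
# Prints the path of the newest PREFIX-YYYYMMDD.tar.gz file in DIR.
# Returns non-zero if no matching file exists.
latest_backup() {
  local dir=$1 prefix=$2 f latest=""
  for f in "$dir"/"$prefix"-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].tar.gz; do
    [ -e "$f" ] || continue   # the pattern stays literal when nothing matches
    latest=$f                 # glob results are sorted, so the last match has the latest date
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example:
#   archive=$(latest_backup /backup dso-state) && sudo tar -xzf "$archive" -C /
```

Selecting by the date in the file name, not by modification time, keeps the choice stable if archives are copied or touched later.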
Upgrade Procedure
For Agent Binary:
# 1. Download new version
curl -Lo /tmp/dso-new https://releases.dso.dev/dso-v1.1.0-linux-amd64
# 2. Verify checksum
sha256sum /tmp/dso-new
# Compare with official checksum
# 3. Backup current binary
sudo cp /usr/local/bin/dso /usr/local/bin/dso-v1.0.0
# 4. Install new binary
sudo install -m 755 /tmp/dso-new /usr/local/bin/dso
# 5. Restart agent
sudo docker dso system restart
# 6. Verify upgrade
docker dso version
docker dso doctor
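The checksum comparison in step 2 is easy to get wrong by eye. A small wrapper around sha256sum can make the install conditional on a match. This is a sketch: verify_sha256 is a hypothetical helper, and the expected value must come from the official release notes.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 if the SHA-256 of FILE equals EXPECTED_HEX (case-insensitive),
# 1 on mismatch, 2 if FILE cannot be read.
verify_sha256() {
  local file=$1 expected actual
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 2
  fi
  # sha256sum prints "<hex>  <filename>"; keep only the hex digest.
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: checksum matches"
  else
    echo "MISMATCH: expected $expected, got $actual" >&2
    return 1
  fi
}

# Example (the expected value is a placeholder to be filled in by the operator):
#   verify_sha256 /tmp/dso-new "<official checksum>" \
#     && sudo install -m 755 /tmp/dso-new /usr/local/bin/dso
```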
For Configuration:
# 1. Check compatibility
docker dso config validate
# 2. Make changes
sudo nano /etc/dso/dso.yaml
# 3. Validate new config
docker dso config validate
# 4. Apply changes
sudo docker dso system restart
# 5. Monitor for issues
docker dso system logs -f
Performance Tuning
For High-Volume Environments
Increase Cache:
agent:
  cache:
    ttl: 8h
    max_size: 2Gi
Reduce Polling:
watch:
  polling_interval: 30m
  debounce_window: 10s
Optimize Rotation:
rotation:
  timeout: 45s
  parallelism: 3 # if supported
Monitor Performance Impact
# Check resource usage
docker stats --no-stream dso-agent # if running as container
# Check system load
top -bn1 | grep dso
# Monitor disk I/O
iostat -x 1 5
# Monitor network
netstat -s | grep -E "TCP|UDP"
Scaling Considerations
Single Host Limits
- Containers managed: Up to ~1000 per host (Docker limit)
- Secret rotations per second: ~10 (depends on provider)
- Cache memory: Configurable, typically 500MB-2GB
- Concurrent rotations: Limited by health check timeouts
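These figures allow a rough capacity estimate. As a back-of-the-envelope sketch, assuming one secret rotation per container and the ~10 rotations per second quoted above, rotating every secret on a full host takes on the order of 1000 / 10 = 100 seconds. The helper below is hypothetical and only does this arithmetic.

```shell
# estimate_rotation_seconds SECRET_COUNT ROTATIONS_PER_SECOND
# Prints the minimum whole number of seconds needed to rotate SECRET_COUNT
# secrets at the given sustained rate. Ignores health check waits.
estimate_rotation_seconds() {
  local count=$1 rate=$2
  if [ "$rate" -le 0 ]; then
    echo "rate must be positive" >&2
    return 1
  fi
  # Integer ceiling division: (count + rate - 1) / rate
  echo $(( (count + rate - 1) / rate ))
}
```

Real rotations also wait on container health checks, so treat the result as a lower bound.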
Multi-Host Deployment
DSO is designed for single-host deployment. For multiple hosts:
- Deploy independent agent per host
- Share provider (Vault, AWS, etc.)
- Use provider-level deduplication
- Consider rate limiting on backend
Bottleneck Analysis
CPU-bound: Secret resolution from complex providers
- Solution: Increase cache TTL
Memory-bound: Large secret cache
- Solution: Reduce cache size, increase TTL
I/O-bound: Frequent state writes
- Solution: Reduce rotation frequency, batch updates
Network-bound: Provider communication
- Solution: Use webhooks instead of polling, increase TTL
Alerting & Notifications
Key Alerts to Set Up
# Rotation failures
docker dso system logs | grep -i "rotation failed" | wc -l
# Provider unavailability
docker dso doctor | grep -iE "unhealthy|unavailable"
# Cache exhaustion
docker dso status | awk '/Cache/ {print $NF}' | grep -E "99|100"
# Service crash detection
systemctl is-active dso-agent
Example alerting script:
#!/bin/bash
ROTATION_ERRORS=$(docker dso system logs --since 1h | grep -c "rotation failed")
if [ "$ROTATION_ERRORS" -gt 5 ]; then
# Alert: high rotation failure rate
echo "ALERT: $ROTATION_ERRORS rotations failed in last hour"
fi
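Monitoring systems that run check scripts often expect graded exit codes instead of a printed message. The following sketch adapts the same count to that style. rotation_alert_level is a hypothetical helper, the 0/1/2 exit code convention is borrowed from common monitoring plugins, and the thresholds are left to the caller.

```shell
# rotation_alert_level WARN_THRESHOLD CRIT_THRESHOLD
# Reads log lines on stdin and counts those containing "rotation failed"
# (case-insensitive). Returns 0 (OK), 1 (WARNING), or 2 (CRITICAL).
rotation_alert_level() {
  local warn=$1 crit=$2 count
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
  count=$(grep -ci 'rotation failed' || true)
  if [ "$count" -ge "$crit" ]; then
    echo "CRITICAL: ${count} rotation failures"
    return 2
  elif [ "$count" -ge "$warn" ]; then
    echo "WARNING: ${count} rotation failures"
    return 1
  fi
  echo "OK: ${count} rotation failures"
  return 0
}

# Example:
#   docker dso system logs --since 1h | rotation_alert_level 1 5
```

Reading from stdin keeps the function independent of where the logs come from, so the same check can be pointed at archived log files.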
Best Practices
Always validate before applying
- docker dso config validate before restart
- docker dso doctor after changes
Monitor continuously
- Daily: docker dso doctor
- Weekly: docker dso system logs
- Monthly: full audit
Test recovery procedures
- Practice state restoration monthly
- Test failover scenarios
- Document recovery runbooks
Keep audit logs
- Maintain 30+ days of logs
- Archive rotation history
- Track configuration changes
Document changes
- Record all config modifications
- Document scaling decisions
- Update runbooks as needed
For runtime details, see runtime.md.
For system architecture, see architecture.md.
For CLI reference, see cli.md.