Enterprise Systems
Distributed Storage Reliability Tooling
Tooling and runbook automation for a multi-tenant distributed storage system serving regulated industries.
About this project
On-call engineers repeatedly handled the same storage-tier symptoms through manual remediation, which was slow, error-prone, and burning out the team.
Solution
Built a runbook automation layer with safe-by-default operations, dry-run mode, and audit logging; codified the top 30 incident classes.
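As a rough illustration of how safe-by-default operations, dry-run mode, and audit logging can fit together, here is a minimal Python sketch. It is not the project's actual code: `RunbookAction`, `Runner`, the `restart-storage-node` action, the operator name, and the audit record fields are all hypothetical names chosen for the example.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunbookAction:
    """One remediation step (hypothetical shape, for illustration only)."""
    name: str
    execute: Callable[[], str]  # performs the change and returns a short summary


@dataclass
class Runner:
    operator: str
    audit_log: List[str] = field(default_factory=list)

    def run(self, action: RunbookAction, target: str, dry_run: bool = True) -> str:
        # Safe by default: nothing changes unless the caller passes dry_run=False.
        if dry_run:
            outcome = "dry-run: no changes made"
        else:
            outcome = action.execute()
        # Every invocation is recorded, including dry runs.
        self.audit_log.append(json.dumps({
            "ts": time.time(),
            "operator": self.operator,
            "action": action.name,
            "target": target,
            "dry_run": dry_run,
            "outcome": outcome,
        }))
        return outcome


# Usage: the first call only rehearses the step, the second one applies it.
calls = []

def restart_node() -> str:
    calls.append("restart")  # stand-in for a real remediation
    return "restarted"

action = RunbookAction("restart-storage-node", restart_node)
runner = Runner(operator="alice")
preview = runner.run(action, target="node-17")
applied = runner.run(action, target="node-17", dry_run=False)
```

Making the destructive path an explicit opt-in means a mistyped or half-remembered command during an incident falls back to a rehearsal rather than a change, and the audit log still shows that someone looked.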
Technology
- Go
- Python
- Prometheus
- Ansible
- PagerDuty API
Impact
On-call pages dropped 58%; first-action time on remaining pages fell to under 90 seconds.