I embed with teams as an SRE to keep their systems running. Incident response, SLOs, and safe change are the day-to-day.
Concrete write-ups of the incident response, SLO, and safe-change work I do on client systems.
A personal OSS effort, kept separate from client work. I build the tools that would speed up day-to-day operations and open-source them to make SRE better for everyone.
Write-ups on how these open-source tools work and why I built them.
Incident response, SLOs, safe change, observability. If something around the reliability of a long-running system is weighing on you, tell me what's going on. I start by getting a clear read on where things actually stand.