Role Overview
We're hiring a Senior Site Reliability Engineer to own and scale the infrastructure behind our courtroom transcription platform. This is not a routine ops role: you'll work on high-availability Kubernetes clusters, manage complex deployments with ArgoCD, and ensure reliability for a system processing sensitive, real-time data. You'll collaborate with a small team of elite builders and be the go-to expert for keeping our platform robust, secure, and fast.
Key Responsibilities
- Deploy, manage, and optimize Kubernetes clusters in production environments.
- Operate and maintain ArgoCD for GitOps-based deployments.
- Troubleshoot and fix performance, reliability, and scaling issues across our clusters.
- Build and maintain observability (metrics, logging, alerting) to catch and resolve issues proactively.
- Collaborate with backend and product teams to ensure smooth, reliable releases.
- Define and enforce infrastructure best practices, focusing on security, scalability, and resilience.
Qualifications
- 10+ years of experience in production infrastructure, reliability, or DevOps roles.
- Proven experience deploying and managing Kubernetes clusters at scale.
- Experience maintaining CI/CD pipelines with GitHub Actions.
- Hands-on expertise with ArgoCD (setup, tuning, troubleshooting).
- Solid foundation in Linux systems, networking, and container internals.
- Experience with monitoring/alerting stacks (Prometheus, Grafana, Loki, etc.).
- Comfortable diving into complex problems and quickly stabilizing systems.
Bonus:
- Experience with GCP.
- Contributions to open-source infrastructure or reliability tooling.