MeteorOps

Comprehensive SRE Audit & Infrastructure Optimization

Conduct a thorough Site Reliability Engineering audit to identify gaps, reduce incidents, and improve system reliability across our production infrastructure.

Goals:

→ Identify and document critical reliability gaps in the current infrastructure
→ Reduce mean time to recovery (MTTR) by implementing better observability
→ Establish SLOs and error budgets for the top 5 critical services
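
For applicants who want a concrete picture of the SLO goal, here is a minimal sketch of how an error budget could be computed for one service. The `ErrorBudget` class, its field names, and the 99.9% target are illustrative assumptions; they are not part of the project brief or of any existing tooling.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for a single service."""

    slo_target: float      # assumed target, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests observed during the SLO window
    failed_requests: int   # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0 or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures


# Illustrative numbers only: one million requests, 250 of them failed.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

In practice the request counts would likely come from the monitoring stack (for example Prometheus), and the target and window would be agreed per service during the audit.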

Kubernetes, Terraform, Prometheus, Grafana, ELK Stack, AWS, GitOps, Ansible

The client is a Series B-funded SaaS company in San Francisco, CA, currently at the scaling stage. The work is remote, and this is an optimization project.

This is why we're looking for help:

Our platform has been experiencing increased incidents and we need an expert to identify gaps in our observability, incident response, and infrastructure resilience.
