Staff Site Reliability Engineer


Posted by Max Blaze


Pittsburgh, PA

FTE only


You will…

  • Collaborate with internal teams to identify sources of instability in distributed systems and drive operational excellence
  • Own core infrastructure (i.e., manage, diagnose, and debug large-scale distributed systems in production)
  • Provide system design consulting, develop software platforms/frameworks, and conduct launch reviews and root cause analysis
  • Maintain and document sustainable postmortem/incident response practices
  • Understand and resolve potential threats to performance or security
  • Monitor and measure latency, availability, and overall system health once systems are live
  • Advocate for and implement changes that improve reliability, scalability, and velocity
  • Monitor and stress test systems to collect metrics for tuning and capacity planning
  • Reduce the burden of toil with iterative development of tooling and automation
  • Collaborate with engineering teams to release new features and become an authority on our services
  • Participate in on-call rotation

You have…

  • Bachelor’s Degree in Computer Science
  • 5+ years of experience in site reliability engineering/DevOps on a product with millions of users
  • Experience analyzing and troubleshooting large-scale distributed systems
  • Proven knowledge of C, C++, Java, Kotlin, Python, or Go
  • Fluency in networking protocols such as TCP/IP, HTTP, SSL, and DNS
  • An understanding of containerization toolsets and container orchestration technologies (Docker, Mesos, Kubernetes, Nomad, etc.)
  • Effective communication skills and an understanding of best practices around tools and methodologies for infrastructure, automation, capacity planning, etc.
  • Ability to be on-call for critical incident responses
