Staff Site Reliability Engineer

Posted about 5 years ago by Max Blaze

Company Details

Duolingo

Pittsburgh, PA

FTE only

You will…

Collaborate with internal teams to identify sources of instability in distributed systems and drive operational excellence
Own core infrastructure (i.e., manage, diagnose, and debug large-scale distributed systems in production)
Provide system design consulting, develop software platforms/frameworks, and conduct launch reviews and root cause analysis
Maintain and document sustainable postmortem/incident response practices
Understand and resolve potential threats to performance or security
Monitor and measure latency, availability, and overall system health once systems are live
Advocate for and implement changes that improve reliability, scalability, and velocity
Monitor and stress test systems to collect metrics for tuning and capacity planning
Reduce the burden of toil with iterative development of tooling and automation
Collaborate with engineering teams to release new features and become an authority on our services
Participate in on-call rotation

You have…

Bachelor’s Degree in Computer Science
5+ years of experience in site reliability engineering/DevOps for a product with millions of users
Experience analyzing and troubleshooting large-scale distributed systems
Proven knowledge of C, C++, Java, Kotlin, Python or Go
Fluency in networking protocols such as TCP/IP, HTTP, SSL, and DNS
An understanding of containerization toolsets and container orchestration technologies (Docker, Mesos, Kubernetes, Nomad, etc.)
Effective communication skills and understanding of best practices around tools/methodologies for Infrastructure, Automation, Capacity Planning, etc.
Ability to be on-call for critical incident responses
