Full Time Job

Principal SRE

Epic Games

Bellevue, WA 03-19-2024
 
  • Paid
  • Full Time
Job Description
LIVEOPS

What We Do

The Epic LiveOps team provides the best possible experience for our players. We dive deep into the data to understand player needs, minimize disruption, and manage Epic's incident response process.

Epic's Live Operations Engineering team is focused on building services, processes, and tooling to improve the reliability of our platforms, games, and services with a key focus on:

Technical Leadership in Incidents

Taking a leadership role in major incidents, driving cross-functional prioritization and implementation of incident mitigations, and facilitating internal status communications to an executive audience.

Post-Incident Review

There is always an interesting form of something not working as we expect. We focus on how we learn from these production surprises and improve our systems and processes to be more reliable over time. We work with a diverse set of development teams to help them better understand incidents, risks present in their designs, and how to increase reliability.

Production, Event, and Launch Readiness

We run large-scale production events and we work with many teams on readiness and operational excellence. We own key elements of service and product launches and game events.

Development Focused on Reliability

While we assist with incidents and readiness, we also build and maintain processes, tools, systems, and services that can improve Epic's reliability and our ability to respond to the unexpected.

What You'll Do

Epic Games is hiring a Principal Site Reliability Engineer for the Live Operations Engineering team focusing on building services and tooling to improve reliability for our platforms, games, and services. This role will focus on helping development teams with operational excellence and service ownership as well as driving the resolution of incidents impacting our players and customers.

As a Site Reliability Engineer, you will tackle problems that impact the reliability of our products as a whole. Part of this role is analyzing gaps or risk areas for our key services, determining the best course of action, and engineering a solution. You will participate in post-incident reviews, readiness programs, and engineering efforts. This role generally favors breadth over depth, but it does require depth in designing and running reliable systems and infrastructure at scale.

At Epic we embrace a Service Owner (you build it, you run it) mentality. In this role, we are stewards of operational excellence and we are service owners. We develop processes, tools, systems, and services.

In this role, you will:
• Build tooling to make service ownership easier
• Facilitate follow-ups and implement learnings from incidents
• Work across the organization to help distribute learnings or help in understanding the entire ecosystem
• Deep dive into systems to understand risk and communicate this outward to teams or leadership
• Assist in fixing things that are broken - our scope is the entire company
• Connect the dots between groups for experience or knowledge sharing around areas of operational risk
• Maintain a strong focus on long-term systemic change
• Provide operational recommendations to teams while also getting our "hands dirty"

What we're looking for
• Someone who can operate and continually improve major incident management processes and tooling, from initial identification through resolution, and who can influence systems design principles across the company
• A leader who will drive the resolution of complex technical issues that may cross team and departmental boundaries
• A skilled operator who can implement mature system telemetry and monitoring standards deeply embedded in critical services
• Someone with experience developing systems and services that help us with operational excellence and automated workflows
• Someone who can guide the development of best practices across our organization and tools
• A guiding force with service owners in risk reduction and major launch preparation
• A provider of deep insight into reliable systems design and cloud infrastructure configuration

Note to Recruitment Agencies: Epic does not accept any unsolicited resumes or approaches from any unauthorized third party (including recruitment or placement agencies) (i.e., a third party with whom we do not have a negotiated and validly executed agreement). We will not pay any fees to any unauthorized third party. Further details on these matters can be found here.

Jobcode: Reference SBJ-gp3vk9-3-133-79-70-42 in your application.

Company Profile
Epic Games

Founded in 1991, Epic Games is a leading interactive entertainment company and provider of 3D engine technology. Epic operates Fortnite, one of the world’s largest games with over 350 million accounts and 2.5 billion friend connections. Epic also develops Unreal Engine, which powers the world’s leading games and is also adopted across industries such as film and television, architecture, automotive, manufacturing, and simulation.