Job Description
ONLINE INFRASTRUCTURE
What We Do
We enable Epic's online services teams to build, deploy, and manage services that are used by more than half a billion players around the world. Our mission is to provide world-class tools and platforms that improve the experience of our developers and make it easier, faster, and safer to build, operate, and scale their applications. We operate at massive scale as one of the largest cloud computing users in the world.
What You'll Do
Our Observability team is looking for a Senior SRE to help us build and operate the infrastructure our teams rely on to keep our platforms, games, and online services running. Our Observability team works across all of Epic to implement industry best practices and develop new monitoring capabilities. As an SRE on Observability, you will tackle problems that impact how we understand and operate our products at scale. This team is responsible for company-wide metrics, logging, exception handling, and dashboarding solutions. In this role, you will build and operate the systems that process and transport the large volumes of telemetry data generated by services at Epic.
In this role, you will
• Service Ownership - At Epic we embrace a Service Owner (You build it, you run it) mentality. In this role, you will work together with other members of the Observability team to operate the infrastructure our developers depend on to operate their own services.
• Develop and Ship - You will modernize key portions of our observability infrastructure, building new data processing pipelines for telemetry data and writing software to automate processes and generate new insights.
• Collaborate - You will work with teams across Epic as an observability subject matter expert to provide guidance on observability best practices.
What we're looking for
• Experience executing meaningful change in a fast-paced, interrupt-driven environment.
• A self-starter who approaches challenges creatively and methodically, seeing them through to final resolution.
• Ability to adapt and be effective in new situations within a highly dynamic environment.
• Experience working with large-scale systems in AWS, mostly deployed via Kubernetes.
• Comfort in a very Terraform-heavy environment, both reviewing PRs and contributing your own.
• Familiarity with application/service monitoring strategies and technologies, such as OpenTelemetry, Prometheus, Grafana, Fluentd, New Relic, Datadog, Sentry, and Sumo Logic.
Note to Recruitment Agencies: Epic does not accept any unsolicited resumes or approaches from any unauthorized third party (including recruitment or placement agencies) (i.e., a third party with whom we do not have a negotiated and validly executed agreement). We will not pay any fees to any unauthorized third party. Further details on these matters can be found here.