Job Description

Overview

Join our company as we transform and innovate. Our Digital Platforms & Services organization delivers reliable, scalable, and resilient digital solutions that support critical scientific and business outcomes across our global enterprise.

We are seeking a Staff Reliability Engineer with strong technical expertise in Site Reliability Engineering (SRE), Observability, and Resilience. In this role, you will partner with engineering teams to implement and operationalize reliability practices, ensuring systems are designed, built, and operated with reliability in mind.

You will contribute to the adoption of enterprise reliability standards, support the implementation of Service Level Objectives, and help improve system performance and availability. This role is hands-on, execution-focused, and plays a key part in advancing reliability maturity across the organization.

Responsibilities

Reliability Engineering Execution

Partner with application and platform teams to embed reliability into system design, development, and operations
Support implementation and operationalization of Service Level Objectives and reliability indicators
Contribute to improving observability coverage across logs, metrics, traces, and events
Apply reliability patterns such as fault isolation, failover, and recovery mechanisms in collaboration with engineering teams

Operational Excellence

Participate in and support improvements to the incident lifecycle, including detection, response, root cause analysis, and follow-up actions
Assist in identifying reliability risks and performance bottlenecks and contribute to remediation efforts
Support continuous improvement initiatives focused on reducing incident volume and improving system stability

Standards & Adoption

Apply established enterprise standards for observability, resilience engineering, and Service Level Objectives
Support adoption of reliability practices across teams through hands-on guidance and collaboration
Contribute feedback to help evolve reliability frameworks and tooling

Automation & Tooling

Develop and enhance automation for incident response, monitoring, and operational workflows
Leverage existing platforms (e.g., observability tools, incident management systems) to improve efficiency and visibility
Use AI-enabled capabilities, under defined governance, to support diagnostics and operational workflows where appropriate

Collaboration

Work closely with product, platform, and ITSM teams to align on reliability improvements
Participate in cross-team initiatives focused on improving system resilience and operational maturity
Contribute to knowledge sharing within the reliability engineering community

Qualifications

Required

Experience in one or more of the following: system integration, software development, system administration, or operations engineering
Familiarity with software development life cycle (SDLC) and production support models
Understanding of monitoring, observability, and performance optimization concepts
Experience supporting applications in cloud and/or on-premises environments
Working knowledge of CI/CD pipelines and deployment practices
Basic understanding of incident management and root cause analysis processes
Knowledge of system reliability principles, including availability and performance engineering
Strong problem-solving skills with a focus on continuous improvement
Ability to collaborate effectively across engineering and operations teams

Preferred

Experience with observability platforms and reliability tooling ecosystems
Exposure to Service Level Objectives and reliability metrics frameworks
Familiarity with automation and scripting (e.g., Python, Bash, or similar)
Understanding of resilience patterns and distributed systems concepts
Awareness of AI-assisted operational tools and workflows

Required Skills:

Bash (Programming Language), Data Engineering, Data Visualization, Design Applications, Incident Management, Incident Response, Monitoring Control, Performance Optimizations, Production Support, Python (Programming Language), Reliability Engineering, Software Configurations, Software Development, Software Development Life Cycle (SDLC), Software Integration, Software Lifecycle Management (SLM), Solution Architecture, System Administration, System Designs, System Integration, Testing

Search Firm Representatives Please Read Carefully
Merck & Co., Inc., Rahway, NJ, USA, also known as Merck Sharp & Dohme LLC, Rahway, NJ, USA, does not accept unsolicited assistance from search firms for employment opportunities. All CVs / resumes submitted by search firms to any employee at our company without a valid written search agreement in place for this position will be deemed the sole property of our company. No fee will be paid in the event a candidate is hired by our company as a result of an agency referral where no pre-existing agreement is in place. Where agency agreements are in place, introductions are position specific. Please, no phone calls or emails.

Employee Status:

Regular

Flexible Work Arrangements:

Hybrid

Job Posting End Date:

05/30/2026

*A job posting is effective until 11:59:59 PM on the day before the listed job posting end date. Please apply no later than the day before the job posting end date.
