Description

At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.

The Position

AI Platform Engineer

Role Overview

Own the full lifecycle of a production AI/ML platform spanning on-prem GPU clusters, multi-cloud infrastructure (AWS, Alibaba Cloud), and service delivery. This role bridges datacenter hardware, platform engineering, and AI service operations in a GxP-regulated pharmaceutical environment. You will work closely with global engineering teams, solution architects, and business stakeholders across time zones.

Key Responsibilities

Infrastructure Engineering (On-Prem & Cloud)

• Own OS baseline: Red Hat Satellite management, custom Base ISO lifecycle

• Integrate with enterprise storage systems managed by the Roche Storage team

• GPU server BOM selection and hardware qualification

• Architect cloud resource strategy: Reserved Instance planning, cost optimization across AWS and Alibaba Cloud

• Post-provisioning, configuration, and management of cloud accounts (AWS and Alibaba Cloud), covering both Platform accounts and Platform-managed use case accounts

Infrastructure as Code (IaC)

• Develop and maintain Ansible scripts for automated server management (provisioning, decommissioning, configuration)

• Build and operate AMI Bakery pipelines for immutable image delivery

• Orchestrate multi-cloud server deployments (AWS, Alibaba Cloud) via IaC

• Automate Kubernetes cluster provisioning and management

• Develop and harden custom IaC scripts

MLOps Platform Engineering

• Manage full cluster lifecycle: provisioning, upgrades, scaling, disaster recovery

• Own 30+ platform components across the following domains:

◦ GPU & Device Management

◦ AI Workload Orchestration: Engineering for Kubernetes and Slurm scheduling

◦ Networking: Connectivity engineering inside Kubernetes clusters

◦ Storage: Integration with multiple storage types, including object storage and block storage

◦ Observability: Design and implement observability dashboards via Prometheus, Grafana, OpenTelemetry, etc.

◦ Security & PKI: PKI management for the entire platform. Integrate DevSecOps practices into the DevOps lifecycle

◦ Platform Engineering for data configuration, APIs, training/inference frameworks, pipelines, and toolsets

◦ Build and maintain CI/CD pipelines and GitHub/GitLab templates

◦ Support AI Use Cases on Engineering tasks.

◦ Accountable for troubleshooting platform-related issues, including leading troubleshooting across services that belong to different teams.

AI Platform Services

• Deploy and operate AI Gateway (Portkey Data Plane) with full IaC coverage

• Execute on-prem model lifecycle management

• Develop and maintain workspace auto-provisioning scripts

• Integrate AI safety guardrails

• Build and implement FinOps processes

• Support AI Use Cases on Engineering tasks.

• Accountable for troubleshooting platform-related issues, including leading troubleshooting across services that belong to different teams.

Compliance & Process

• Author and maintain system design documents

• Manage documentation and approval workflows (via Veeva QualityDocs, and Markdown in GitLab for project documents, runbooks, and user manuals)

• Manage workloads in Jira

Requirements

Must Have — Technical

• 8+ years in production Linux systems engineering, with deep RHEL expertise (Satellite, Kickstart, custom ISO builds)

• 5+ years operating Kubernetes in production at scale (500+ nodes or 5,000+ pods), including cluster lifecycle management, disaster recovery, and multi-tenant isolation

• Expert-level IaC proficiency: Ansible (custom module/plugin development), Terraform (provider development, state management at scale)

• Hands-on GPU cluster experience: NVIDIA driver lifecycle, MIG/vGPU partitioning, CUDA compatibility matrix management, GPU health monitoring

• Strong networking fundamentals: L2/L3 design, VLAN segmentation, BGP basics, IPAM at datacenter scale; experience with high-performance fabrics (RDMA/RoCE/InfiniBand) for distributed training

• Deep AWS experience (VPC architecture, EC2 placement groups, EKS, IAM policy design) with production workloads

• Helm Chart development: authoring complex charts with subcharts, hooks, and CRD lifecycle management

• CI/CD pipeline ownership: end-to-end container image build, vulnerability scanning, artifact promotion, and GitOps-based deployment

• Business-level English proficiency (written and spoken).

• Cross-functional collaboration. You will work at the intersection of infrastructure, security, compliance, and data science teams. Ability to translate between deeply technical infrastructure concerns and business/compliance requirements is essential — not just "teamwork," but the ability to drive alignment across groups with competing priorities.

• You need to lead troubleshooting in real time, communicate status clearly to stakeholders, and conduct blameless post-mortems afterward.

• Experience operating and customizing AI/ML serving platforms (Seldon Core, KServe/KFServing, Triton Inference Server) in production

• Service mesh expertise (Istio: traffic management, mTLS, authorization policies) at scale

• Full-stack observability design: Prometheus federation, Grafana dashboard-as-code, ELK/OpenSearch log pipelines, OpenTelemetry instrumentation

• Production experience with multi-cloud orchestration (AWS + Alibaba Cloud), including cross-cloud networking and unified IaC

• Familiarity with GxP/CSV compliance in pharmaceutical or life sciences — change control, validation protocols, audit trail requirements

• Experience with AI Gateway / LLM routing systems (Portkey, LiteLLM, or equivalent)

• FinOps practice: GPU cost modeling, chargeback/showback implementation, cloud resource cost optimization

• Contributions to open-source infrastructure projects (CNCF ecosystem preferred)

• Experience mentoring junior engineers or leading small infrastructure teams (2–5 people)

• Track record of building internal developer platforms or self-service tooling for ML/data science teams

Nice to Have

• Familiarity with pharmaceutical IT service management (ServiceNow ITSM, Veeva QualityDocs)

• Prior experience in a platform team serving internal ML/data science customers (100+ users)

Who we are

A healthier future drives us to innovate. Together, more than 100’000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.

Let’s build a healthier future, together.

Roche is an Equal Opportunity Employer.
