Senior Site Reliability Engineer (HPC/Cloud)

📁 Architect (Level: Manager)

Requisition #: CREQ255288

Key Responsibilities:

  • Respond to and resolve operational incidents, identify root causes for critical issues, and implement strategies to prevent recurrence and improve platform resiliency. 

  • Proactively create and manage monitoring, logging, and alerting systems to ensure high availability, performance, and visibility across all services. 

  • Take a Site Reliability Engineering approach to our services, improving deployment, monitoring, and incident response end to end.

  • Solve complex technical problems involving SCP applications, infrastructure, and end users’ use of the services.

  • Administer platform tools like Ansible, Vault, Consul, Prometheus, and Grafana to support core functions like configuration management, secrets management, monitoring, and observability.  

  • Mentor and coach junior engineers in the team, fostering a collaborative and high-performing culture. 

  • Drive automation for deployment and management processes using GitOps workflows as well as CI/CD pipelines.

 

Essential Knowledge, Skills, and Experience:

  • Experience administering, maintaining, and troubleshooting Linux environments

  • Competent in automation and Bash scripting

  • Highly customer focused; able to explain technical IT concepts in terms that non-IT experts can understand

  • Hands-on experience working in a DevOps team and using agile methodologies 

 

Plus some of the following areas of expertise:

  • Hands-on knowledge of a range of scientific and HPC applications such as simulation software, bioinformatics tools or 3D data visualization packages 

  • Experience administering and optimizing Slurm

  • Experience deploying and administering OpenStack 

  • Experience with configuration automation and infrastructure as code (e.g. Ansible, HashiCorp Terraform, AWS CloudFormation, AWS Cloud Development Kit)

  • Experience deploying infrastructure and code to public cloud, especially AWS  

  • Experience with software distribution frameworks such as EasyBuild or Spack

  • Familiarity with container runtimes such as Docker, Singularity, or enroot

  • Experience with regression testing and benchmarking frameworks for HPC applications, such as ReFrame
