Site Reliability Engineering – Senior Engineer
Knowledge, Experience and Capabilities:
- Minimum of 6-8 years of work experience in critical production environments
- Knowledge of and experience with CI/CD pipelines, including troubleshooting failed deployments
- Experience implementing system and application monitoring for cloud-based applications and SaaS components, including setting up appropriate alerts and building dashboards
- Working knowledge of SQL, including troubleshooting by writing queries, is key
- Experience operating AWS cloud infrastructure in production is required
- Understanding and demonstrated application of SRE principles, particularly toil reduction, blameless post-mortems, monitoring of distributed systems and release engineering
- Hands-on experience writing Python scripts and Ansible templates for application deployment automation or other automation is important
- Ability to diagnose and debug systems at the application level (Salesforce preferred) is beneficial
- Working experience with MuleSoft as an integration platform in production environments
Qualification:
- ITIL 4 Foundation certification is preferred
- SRE Foundation certification via PeopleCert / DevOps Institute is beneficial
- AWS Certified Solutions Architect - Associate certification or equivalent is preferred
Role & Responsibilities:
- Participate in on-call and critical operations support, and lead blameless post-mortems
- Liaise directly with customers, remotely and face-to-face, for stakeholder management
- Eliminate toil by lowering incident volume, reducing alert noise, automating manual processes and converting workarounds into system features
- Work with Development, QA and other squads to design, build and roll out reliability features in the applications being delivered
- Provide on-call support to complement the Production Support / Engineering team as required during major outages
- Continuously deploy and release changes, and maintain CI/CD pipelines
- Set up monitoring and continuously refine alerts to reduce noise for a newly built Salesforce + AWS + MuleSoft + Data Reporting system
- Validate and automate patch rollout processes, including the resolution of vulnerabilities