Lead Software Engineer
Requisition #: CREQ207787
Experience:

  • Minimum 7 years of work experience as an SRE (not traditional production support) covering integration platforms on cloud-based deployments

  • Coding / automation scripting experience in any programming language, particularly for the integration tier and middleware

  • Experience working as a DevOps Engineer or SRE on mission-critical applications and infrastructure

  • Working experience with GCP (Google Cloud), particularly GKE, is important

  • Experience with AppDynamics and Splunk for monitoring and setting up observability is key

  • CI/CD tool chains: setting up and running deployment pipelines and propagating changes across environments

Core Capabilities:

  • Maintain middleware such as Kafka (open source) and MQ, as well as application servers (Tomcat)

  • Maintain Hazelcast data storage platform clusters and Control-M job schedulers

  • GCP and private-cloud operational support / administration activities such as provisioning, capacity management, reliability management, monitoring, restoration, etc.

  • Kubernetes cluster management, monitoring and remediation. Knowledge of Docker is important

  • Automate deployments and script self-healing workflows based on telemetry

  • Define SLIs and configure SLOs, respond to threshold alerts and optimize monitoring capability

  • Work with code as well as configuration artifacts to debug and fix issues that may arise

  • Knowledge of applying SRE practices to daily operations is key

  • Must be inclined to work on proof-of-concept solutions to optimize reliability, such as those incorporating AI models for event correlation and assisted triaging

  • Ability to work in shifts in the office is mandatory; this is a 24/7 on-desk operation

Qualification:

  • Computer Science and/or Engineering degrees are preferred

  • SRE Foundation certification by DevOps Institute, or an equivalent SRE certification by a recognized body, is mandatory

  • CKA certification

  • GCP Cloud Digital Leader certification at a minimum is mandatory; Cloud Engineer level is a bonus

  • Hazelcast Platform Operations certification badge

Role & Responsibilities:

  • Work in shifts as part of a 24/7 on-desk team that manages middleware and associated applications consumed globally

  • Incident, change, event and problem management

  • Debug integrations and consumers at the code level

  • Work with CI/CD pipelines and automate new change rollouts. Change deployment and sanity testing are part of the scope

  • Set up and configure an observability product, preferably AppDynamics or Splunk, for end-to-end traceability and log analytics

  • Be the guardian of high reliability for the applications, middleware, storage platforms, scheduler (and its jobs) and underlying cloud infrastructure

  • Define and set up SLIs as well as SLOs while continuously refining thresholds

  • Set up anomaly detection and auto-remediation workflows

  • Ensure all alerts and incidents within scope are actioned before breaching SLOs

Site Reliability Engineering – Senior Engineer

 

Knowledge, Experience and Capabilities: 

 

  • Minimum of 6-8 years of work experience in critical production environments

  • Knowledge and experience with CI/CD pipelines and troubleshooting failed deployments 

  • Implementing system and application monitoring for cloud-based applications and SaaS components – setting up appropriate alerts and building dashboards 

  • Working knowledge of SQL and troubleshooting by writing queries is key 

  • AWS cloud infrastructure operations experience in production is required

  • Understand and demonstrate application of SRE principles, particularly toil reduction, blameless post-mortems, monitoring distributed systems and release engineering 

  • Hands-on experience in writing Python scripts and Ansible templates for application deployment automation or other automations is important 

  • Ability to diagnose and debug systems at the application level (Salesforce preferred) is beneficial 

  • Working experience with MuleSoft as an integration platform in production environments

 

Qualification: 

 

  • ITIL 4 Foundation certification is preferred

  • SRE Foundation certification via PeopleCert / DevOps Institute is beneficial

  • AWS Solutions Architect - Associate qualification or alternative is preferred 

Role & Responsibilities: 

  • Engage in on-call and critical operations support activities while leading blameless post-mortems 

  • Direct liaison with customers remotely and face-to-face for stakeholder management 

  • Eliminate toil by lowering incident volume, eliminating noise from alerts, automating manual processes and converting workarounds into system features 

  • Work with Development, QA and other squads to design, build and roll out reliability features into the applications being delivered

  • On-call support to complement the Production Support / Engineering team as required for major outages 

  • Continuous deployment and releases of changes and maintaining CI/CD pipelines 

  • Set up monitoring and continuously refine alerts to reduce noise for a newly built Salesforce + AWS + MuleSoft + Data Reporting system

  • Validate and automate patch rollout processes, including the resolution of vulnerabilities

 
