SRE-Production Support Engineering Manager
Please see below:
- We need a strong candidate with solid experience in stakeholder and SRE team management.
- A good understanding of production engineering/production support projects is a must, including handling teams working in a 24/7 model.
- Incident, change, and service request management is part of the daily routine, so the candidate should know how to manage the workload and rotate FTEs as and when required.
- Awareness of how to manage ad hoc activities such as vulnerability fixes and patching is required.
- Should be able to lead BAU governance activities on a daily, weekly, and monthly cadence, with the necessary reporting data.
- Knowledge of GCP cloud infrastructure management, basic knowledge of Postgres databases, and banking domain experience are a big advantage for the role.
==================================================================================================
Job Description:
- Experience in SRE (not traditional production support) covering integration platforms on cloud-based deployments is mandatory.
- Knowledge of applying SRE practices to daily operations is key.
- The ability to manage teams working in shifts from the office is mandatory; this is a 24x7 on-desk operation.
- Computer Science and/or Engineering degrees are preferred.
- Domain experience in banking will be a great advantage.
Working Experience/Awareness:
- 24x7 operations support model for mission-critical applications and infrastructure, using ServiceNow as the ITSM ticketing tool.
- GCP and private-cloud operational support/administration activities such as provisioning, capacity management, reliability management, monitoring, restoration, etc.
- Working knowledge of AppDynamics and Splunk for monitoring and setting up observability is key.
- CI/CD toolchains: setting up and running deployment pipelines and propagating changes across different environments.
- Maintaining middleware such as Kafka (open source) and MQ, as well as application servers (Tomcat).
- Maintaining Hazelcast data storage platform clusters and Control-M job schedulers.
- Kubernetes cluster management, monitoring, and remediation. Knowledge of Docker is important.
- Automating deployments and scripting self-healing workflows based on telemetry.
- Work closely with the team to define SLIs and configure SLOs, respond to threshold alerts and optimize monitoring capability.
- Work closely with the team to understand the code as well as configuration artifacts to debug and fix issues that may arise.
- Must be inclined to work on proof-of-concept solutions that optimize reliability, such as those incorporating AI models for event correlation and assisted triaging.
- Able to lead and drive the SRE team to work in parallel on service or change requests, the defect management board, and backlog management in an agile manner.
Certifications:
- SRE Foundation certification from the DevOps Institute, or an equivalent SRE certification from a recognized body, is mandatory.
- CKA certification is good to have.
- GCP Cloud Digital Leader certification is mandatory at a minimum; Cloud Engineer level is a bonus.
- Hazelcast Platform Operations certification badge is good to have.