- Requisition #: CREQ260232
- Post Date: Jun 26, 2026
About the Role
We are looking for a Site Reliability Engineer/Lead with strong experience in Datadog, observability, monitoring, alerting, incident response, and automation. This role will help improve the reliability, availability, and performance of our applications and infrastructure by building scalable monitoring solutions, reducing operational noise, improving alert quality, and supporting production readiness across teams.
The ideal candidate is hands-on, automation-focused, and comfortable working with application, infrastructure, cloud, and operations teams to drive better reliability outcomes.
Key Responsibilities
Design, build, and maintain observability solutions using Datadog across applications, infrastructure, cloud platforms, and services.
Implement and manage metrics, logs, traces, APM, RUM, synthetics, dashboards, monitors, SLOs, and service catalog capabilities.
Build effective alerting strategies that reduce noise, prevent alert storms, and improve signal quality.
Create and maintain Datadog dashboards for application health, infrastructure performance, business-critical services, and executive-level reporting.
Support incident response by improving detection, triage, escalation, and root cause analysis workflows.
Partner with application and platform teams to define meaningful SLIs, SLOs, and error budgets.
Help onboard new applications and services to Datadog in line with established observability standards.
Review existing monitors and alerts to improve thresholds, routing, ownership, and priority mapping.
Support integration of Datadog with enterprise tools and platforms such as Freshservice, Slack, Microsoft Teams, Jira, GitHub, AWS, and Fastly.
Automate observability deployment, configuration, and repeatable onboarding patterns using Pulumi and CI/CD practices.
Work with engineering teams to improve instrumentation for logs, metrics, traces, and application performance monitoring.
Use AI-assisted engineering tools such as Cursor, Claude, or similar AI coding assistants to improve productivity, accelerate troubleshooting, support automation, and assist with observability engineering tasks.
Build runbooks, operational dashboards, and troubleshooting guides for production support teams.
Participate in production readiness reviews, incident reviews, and post-incident improvement planning.