Job Description

Core Responsibilities:
- Cloud Infrastructure Management: Deploy, manage, and maintain cloud infrastructure across AWS, Azure, and/or GCP, ensuring compliance for government workloads.
- Infrastructure Automation: Automate infrastructure provisioning using Infrastructure as Code (IaC) tools like Terraform, OpenTofu, or AWS CloudFormation.
- Deployment Pipeline Streamlining: Collaborate with development teams to streamline CI/CD pipelines using tools such as GitLab and OpenTofu for efficient infrastructure and application delivery.
- Performance Optimization: Monitor system performance, participate in capacity planning, and optimize application and infrastructure performance by tuning configurations and identifying bottlenecks.
- Automation Development: Develop scripts and tools to automate routine operations, including patching, scaling, and monitoring.
- Self-Healing Systems: Design and implement self-healing systems that proactively detect and resolve faults.
- Data Integrity & Availability: Manage backup and disaster recovery strategies to ensure data integrity and availability across environments.
- Security & Compliance: Perform regular security audits and vulnerability patching, adhering to government compliance requirements (e.g., FedRAMP, NIST).

Incident Management & Observability:
- Real-time Incident Resolution: Respond to and resolve infrastructure incidents and outages in real time, minimizing disruption.
- Root Cause Analysis (RCA): Conduct RCA for production issues and implement long-term corrective actions.
- On-Call Participation: Participate in an on-call rotation, escalating and coordinating responses to high-severity issues.
- Incident Documentation: Document incidents, responses, and postmortems to capture lessons learned.
- Complex Problem Diagnosis: Diagnose complex infrastructure and application problems, including database performance issues, latency, and service connectivity challenges.
- Comprehensive Logging & Telemetry: Ensure comprehensive logging and telemetry to support incident response, performance tuning, and auditing.
- Observability Improvements: Drive observability improvements by collaborating with Engineering and Platform teams to enhance system reliability and traceability.

Application & Knowledge Management:
- Application Incident Leadership: Lead resolution efforts for application-level incidents, ensuring coordinated response across teams.
- Application Lifecycle Management: Oversee application lifecycle management, including version upgrades, security patches, and regional rollouts.
- Knowledge Base Contribution: Contribute to a shared knowledge base, documenting recurring issues and resolution steps.
- Scaling Strategies: Support scaling strategies to meet regional demand, ensuring infrastructure resilience and compliance with service-level objectives (SLOs).

Qualifications:
- Must be eligible and willing to submit for U.S. Government security clearances; active clearance is a plus.
- Experience supporting FedRAMP Authorized platforms is highly desirable.
- Minimum 5 years of experience leading and supporting enterprise-level applications in production environments.
- Proven experience in cloud infrastructure provisioning and management on Google Cloud Platform (GCP), Amazon Web Services (AWS), or Microsoft Azure.
- Proficiency in scripting languages such as Python, Bash, or PowerShell for automation and systems management.
- Strong understanding of containerization and orchestration technologies, including Docker, Kubernetes, and Helm.
- Hands-on experience with cloud object storage services such as AWS S3, Google Cloud Storage, or Azure Blob Storage.
- Working knowledge of database and persistence technologies, particularly MongoDB and PostgreSQL.
- Experience supporting and integrating microservices architectures and RESTful APIs.
- Familiarity with incident and service management systems, such as ServiceNow and Jira.
- Experience with security and compliance tooling, including SAST/DAST, such as Prisma Cloud, CrowdStrike, XSOAR, and Burp Suite.
- Basic understanding of identity and access management (IAM) and SSO technologies, particularly Okta, and application integration practices.
- Excellent troubleshooting skills, especially in complex, distributed, cloud-based environments.
- Strong written and verbal communication skills, with the ability to clearly document procedures, incidents, and solutions.
- Effective at producing support documentation and conducting knowledge transfer or training sessions.
- Demonstrated ability to work independently with minimal supervision in a fast-paced, collaborative, and globally distributed team.
- A motivated, proactive mindset with a commitment to delivering high-quality, secure, and reliable systems.