System Architect - AI Reliability Engineer
- Requisition #: CREQ261916
- Post Date: Jun 26, 2026
Build an automated evaluation harness covering correctness, prompt stability, hallucination, and guardrail enforcement across all production agents.
Stand up an LLM-as-judge framework with calibration tests so the judge itself stays honest over time.
Run red team and adversarial testing: prompt injection, jailbreaks, malformed tool outputs, corrupted upstream data.
Build multi-agent chain tests. Single-agent evals miss the cascade failures that happen when one agent feeds another through MCP.
Wire all of this into CI so failures block merges, with canary checks and rollback validation for production rollouts.
Add cost regression checks to the pipeline. A prompt change that triples token usage should fail CI, not surface in a monthly bill.
Instrument observability: decision logs, token cost, latency, drift detection, HITL engagement telemetry, and a quality dashboard the team actually uses.
Validate that the platform's self-reported confidence scores correlate with real correctness. If they do not, surface it.
Build the audit trail so any agent decision can be reproduced from logs within 24 hours.
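As an illustration of the cost regression item above, here is a minimal sketch of a check a CI job might run. The function name, the `CostCheck` type, and the 1.5x threshold are hypothetical choices for this example, not part of any existing platform. It assumes per-request token counts are already collected for a baseline run and a candidate run of the same evaluation set.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CostCheck:
    baseline_mean: float   # mean tokens per request on the baseline run
    candidate_mean: float  # mean tokens per request on the candidate run
    ratio: float           # candidate_mean / baseline_mean
    passed: bool           # True when ratio is within the allowed budget


def check_token_regression(
    baseline: Sequence[float],
    candidate: Sequence[float],
    max_ratio: float = 1.5,  # assumed budget; tune per team
) -> CostCheck:
    """Compare mean token usage of a candidate run against a baseline run."""
    if not baseline or not candidate:
        raise ValueError("both runs need at least one sample")
    baseline_mean = sum(baseline) / len(baseline)
    candidate_mean = sum(candidate) / len(candidate)
    if baseline_mean <= 0:
        raise ValueError("baseline mean must be positive")
    ratio = candidate_mean / baseline_mean
    return CostCheck(baseline_mean, candidate_mean, ratio, ratio <= max_ratio)
```

A CI step could call `check_token_regression` on the two runs and exit non-zero when `passed` is false, so the merge is blocked before the change reaches production.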
Should have strong experience in technical production support as an architect, and should have designed and implemented automated evaluation frameworks to validate AI agent correctness, stability, hallucination prevention, and guardrail compliance across production environments. Provide regular status updates and reports to senior management on production health, ongoing incidents, and project progress. Should have good communication skills and strong technical skills to support and lead a team.