Abstract
Sentinel is iSimplifyMe's production monitoring layer — a fleet of Claude agents running on Anthropic Managed Agents that catch regressions, anomalies, and operational incidents across the iSM stack. It is internal infrastructure, not a customer-facing product, and runs to keep the rest of the platform honest.
Problem
Production AI infrastructure has more silent failure modes than monitorable ones. A Bedrock model that responds with semantically wrong answers passes a 200 OK health check. A retrieval pipeline that surfaces stale data clears every uptime probe.
Manual log review does not scale across a multi-site network with thirty in-production engagements. Status pages tell you what is on; they do not tell you what is wrong.
Approach
The agent topology
Each Sentinel agent is an Anthropic Managed Agents workload with a discrete surveillance scope and a defined cadence. The agent reads from a constrained set of operational signals — logs, recent error events, model-call traces — runs a Claude inference pass to classify the situation, and decides whether the finding warrants escalation.
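The read, classify, decide loop described above can be sketched roughly as follows. This is a minimal illustration, not Sentinel's code: the `Finding` type, the severity scale, the `ESCALATE_AT` threshold, and the keyword-based `keyword_classifier` (a stand-in for the Claude inference pass) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Finding:
    severity: str  # hypothetical scale: "info", "warning", "critical"
    summary: str


# Hypothetical rule: only these severities are worth a human's attention.
ESCALATE_AT = {"warning", "critical"}

Classifier = Callable[[Sequence[str]], Finding]


def run_agent_pass(signals: Sequence[str], classify: Classifier) -> Tuple[Finding, bool]:
    """Classify a constrained batch of signals and decide whether to escalate."""
    finding = classify(signals)
    return finding, finding.severity in ESCALATE_AT


def keyword_classifier(signals: Sequence[str]) -> Finding:
    """Stand-in for the model call; a real agent would run inference here."""
    errors = [s for s in signals if "ERROR" in s]
    if len(errors) >= 3:
        return Finding("critical", f"{len(errors)} error events in window")
    if errors:
        return Finding("warning", f"{len(errors)} error event(s) in window")
    return Finding("info", "no anomalies detected")
```

Passing the classifier in as a function keeps the escalation decision testable without a model call.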
Slack as the approval gate
When an agent identifies something worth escalating, it posts a Block Kit card to the appropriate channel with the diagnosis, the recommended remediation, and a small set of action buttons. A human reviewer clicks one. Only then does any remediation fire.
The design rule came directly from an April 2026 incident where an unguarded automated drip in the Retell Phone Bridge sent the same follow-up email 48 times to three leads. Sentinel's discipline since then: automated detection is fine, automated remediation requires a human in the loop.
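A minimal sketch of the approval gate, assuming a card with two buttons. The block layout follows Slack's Block Kit format (section and actions blocks, button elements), but the `action_id` values, the `block_id` scheme, and both function names are hypothetical rather than Sentinel's actual ones.

```python
from typing import Any, Dict, List


def build_approval_card(diagnosis: str, remediation: str, incident_id: str) -> List[Dict[str, Any]]:
    """Build Block Kit blocks: diagnosis, recommended fix, and action buttons."""
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": f"*Diagnosis*\n{diagnosis}"}},
        {"type": "section", "text": {"type": "mrkdwn", "text": f"*Recommended remediation*\n{remediation}"}},
        {
            "type": "actions",
            "block_id": f"sentinel_{incident_id}",  # hypothetical naming scheme
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Approve"},
                    "style": "primary",
                    "action_id": "approve_remediation",  # hypothetical id
                    "value": incident_id,
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Dismiss"},
                    "style": "danger",
                    "action_id": "dismiss_finding",  # hypothetical id
                    "value": incident_id,
                },
            ],
        },
    ]


def should_remediate(interaction: Dict[str, Any]) -> bool:
    """Return True only when a human clicked the Approve button."""
    if interaction.get("type") != "block_actions":
        return False
    return any(
        action.get("action_id") == "approve_remediation"
        for action in interaction.get("actions", [])
    )
```

The point of `should_remediate` is that the default answer is no: anything other than an explicit approval click leaves the remediation unfired.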
Workload #1: Diagnostics Agent
The Diagnostics Agent shipped April 2026 as Sentinel's first production workload and the first Anthropic Managed Agents deployment at iSM. It investigates client tenant sites that have failed three consecutive uptime checks — running curl, dig, openssl, and Cloudflare 5xx-breakdown probes — then files a markdown bug-report ticket with timeline, root cause, evidence, and recommended fix. Verified cost: $0.06 per incident on synthetic test cases (Sonnet 4.6, ~50 seconds active runtime, ~28k tokens including prompt cache reuse).
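The trigger condition (three consecutive failed uptime checks) might be expressed as below. This is a sketch under the assumption that the rule looks at the most recent unbroken run of failures; the function name and the boolean encoding of check results are hypothetical.

```python
from typing import Iterable

FAILURE_THRESHOLD = 3  # from the text: three consecutive failed uptime checks


def should_investigate(check_results: Iterable[bool], threshold: int = FAILURE_THRESHOLD) -> bool:
    """Decide whether the latest run of failures is long enough to investigate.

    check_results is ordered oldest to newest; True means the check passed.
    """
    streak = 0
    for passed in check_results:
        # Any passing check resets the streak; a failure extends it.
        streak = 0 if passed else streak + 1
    return streak >= threshold
```

Requiring a streak rather than a single failure keeps one dropped probe from starting a paid investigation.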
Workload #2: GH Triage Agent
The GH Triage Agent shipped April 2026 as Sentinel's second production workload. It polls iSimplifyMe org repository workflow runs every fifteen minutes, detects failures, and runs an inference pass classifying root cause across eight categories: test_flake, regression, infrastructure, auth, dependency, lint_or_typecheck, build_config, and unknown. Output is a structured ticket on the synthetic internal-isimplifyme tenant with a markdown body covering Failure Summary, Classification, Failed Jobs, Recent Commits, and investigator Notes.
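One way to keep the classifier's output inside the eight categories and render the ticket body is sketched below. The category names and section titles come from the text above; the function names, the `##` heading level, the exact heading text "Investigator Notes", and the fallback to `unknown` for unrecognized labels are assumptions.

```python
from typing import Dict

CATEGORIES = (
    "test_flake", "regression", "infrastructure", "auth",
    "dependency", "lint_or_typecheck", "build_config", "unknown",
)

# Section order as listed in the text; the heading text is assumed.
SECTIONS = (
    "Failure Summary", "Classification", "Failed Jobs",
    "Recent Commits", "Investigator Notes",
)


def normalize_category(raw: str) -> str:
    """Map a model-produced label onto the fixed taxonomy, defaulting to unknown."""
    label = raw.strip().lower()
    return label if label in CATEGORIES else "unknown"


def render_ticket_body(fields: Dict[str, str]) -> str:
    """Render the markdown ticket body with every section present."""
    parts = []
    for title in SECTIONS:
        content = fields.get(title, "_not provided_")
        parts.append(f"## {title}\n{content}")
    return "\n\n".join(parts)
```

Normalizing the label before filing means a free-text answer from the model can never create a ninth category in the ticket system.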
Verified cost: $0.065 per run on a synthetic test (Sonnet 4.6, ~44 seconds active runtime). Investigation is idempotent: once a failed run is investigated, a 24-hour DynamoDB (DDB) lock prevents re-investigation, so flapping CI does not produce duplicate tickets.
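The 24-hour lock semantics can be illustrated with an in-memory stand-in. This is a sketch, not the production lock: the class and method names are hypothetical, and a dictionary is not atomic across processes the way a conditional write is.

```python
import time
from typing import Dict, Optional

LOCK_TTL_SECONDS = 24 * 60 * 60  # from the text: a 24-hour lock


class InvestigationLocks:
    """In-memory stand-in for a lock table keyed by workflow run id.

    In DynamoDB the check-and-set below would be a single conditional
    PutItem, for example with a condition such as
    attribute_not_exists(pk) OR expires_at < :now
    (the attribute names here are hypothetical).
    """

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def try_acquire(self, run_id: str, now: Optional[float] = None) -> bool:
        """Return True if this caller may investigate run_id, False if locked."""
        now = time.time() if now is None else now
        expires = self._expires_at.get(run_id)
        if expires is not None and expires > now:
            return False  # already investigated within the last 24 hours
        self._expires_at[run_id] = now + LOCK_TTL_SECONDS
        return True
```

A detector would call `try_acquire` before enqueueing an investigation and skip the run when it returns False, which is what keeps flapping CI from producing duplicate tickets.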
The two workloads share infrastructure: one generic SQS-triggered runner Lambda dispatches the right agent based on a SENTINEL_AGENT_SLUG kickoff message, an atomic conditional-write lock at INCIDENT#OPEN race-protects parallel detection paths, and the same file_ticket and notify_slack tools serve both. Adding a new Sentinel workload is a registry entry plus a detector handler; everything else is shared.
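The registry-plus-dispatch pattern could look roughly like this. It is a sketch under stated assumptions: that SENTINEL_AGENT_SLUG is a field in the JSON body of the SQS message, that a `payload` field carries the incident details, and that the slugs `diagnostics` and `gh-triage` exist. The agent functions are placeholders, not the real runners.

```python
import json
from typing import Any, Callable, Dict, List

AgentHandler = Callable[[Dict[str, Any]], str]


def run_diagnostics(payload: Dict[str, Any]) -> str:
    """Placeholder for the Diagnostics Agent runner."""
    return f"diagnostics investigating {payload.get('target', 'unknown target')}"


def run_gh_triage(payload: Dict[str, Any]) -> str:
    """Placeholder for the GH Triage Agent runner."""
    return f"gh-triage investigating run {payload.get('run_id', 'unknown')}"


# Adding a workload means adding one entry here (hypothetical slugs).
AGENT_REGISTRY: Dict[str, AgentHandler] = {
    "diagnostics": run_diagnostics,
    "gh-triage": run_gh_triage,
}


def handler(event: Dict[str, Any], context: Any = None) -> List[str]:
    """Lambda-style entry point: dispatch each SQS record to its agent."""
    results = []
    for record in event.get("Records", []):
        message = json.loads(record["body"])  # SQS delivers the body as a string
        slug = message.get("SENTINEL_AGENT_SLUG")
        agent = AGENT_REGISTRY.get(slug)
        if agent is None:
            raise ValueError(f"no Sentinel agent registered for slug {slug!r}")
        results.append(agent(message.get("payload", {})))
    return results
```

Raising on an unknown slug, rather than silently dropping the message, lets the queue's retry and dead-letter handling surface a misconfigured detector.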
Status
- Sentinel layer shipped April 2026 as monitoring infrastructure built on Anthropic Managed Agents (MA), on top of the Apex Client Portal stack.
- Workload #1 (Diagnostics Agent) shipped 2026-04-26; SST migration into the Sentinel namespace shipped 2026-04-29 with race-protected parallel-run lock. Currently observing in canary mode — canary tenant flag flip scheduled 2026-05-03, with a scheduled review agent firing the same day to confirm the rollout against any incidents that fired in the first activation window.
- Workload #2 (GH Triage Agent) shipped 2026-04-29 to staging and production. Detector and runner Lambdas firing every fifteen minutes; staging synthetic test completed end-to-end (the agent picked up a deliberately-failing workflow on the cc-canary repository, classified it as build_config, recognized from the commit message that it was a synthetic test, filed an exemplary ticket, and posted a precise Slack one-line summary). Currently in a seven-day cc-canary soak through 2026-05-06; broadens to all thirty-seven iSimplifyMe org repositories after the soak verifies clean.
- Future Sentinel workloads on the roadmap: an Issue/PR triage agent for inbound GitHub issues, a pipeline-hang detector for the content pipeline, a weekly audit agent for cross-repo health rollups, a Lighthouse regression detector for client sites, a cost-anomaly detector for AWS spend spikes, and an AEO drift / citation surveillance / DNS watcher for the brand-citation infrastructure.