Investigate-only AI agents on AWS Bedrock + Lambda + SQS. iSimplifyMe-built. iSimplifyMe-operated. ~$0.08 per investigation.
What is Sentinel?
Sentinel is iSimplifyMe’s production AI ops layer running on AWS Bedrock + Lambda + SQS + DynamoDB. It detects production anomalies across iSimplifyMe’s 24-tenant AWS fleet, autonomously investigates root cause using AWS-native tools (curl, DNS lookups, GitHub API, CloudWatch queries), and files tickets with written diagnoses at roughly $0.08 per investigation — well under a $0.30 soft / $1.00 hard cost ceiling. Three workloads are live in production as of May 2026: a Diagnostics Agent firing every 5 minutes, a GitHub Triage Agent firing every 15 minutes, and a Pipeline Hang Detector firing every 15 minutes. Migrated to AWS Bedrock Converse API on 2026-05-05.
The setup
iSimplifyMe operates 24 production tenant workloads on AWS — multi-tenant web platforms, client websites refactored from cPanel, native iOS apps, regulated-vertical SaaS, and internal infrastructure. Each workload generates the same labor pattern: a CloudWatch alarm fires, a senior engineer drops what they’re doing, spends 15–45 minutes investigating, and files a ticket with the root cause. Across 24 workloads running 24/7 on a 6-FTE team, this overhead is unsustainable.
Sentinel is the layer that collapses detection + investigation + ticketing into one AWS-native pattern. Investigate-only. Cost-ceilinged. Generic detector + runner.
Architecture
Every Sentinel workload follows the same proven shape: signal → detector Lambda → SQS → Bedrock Claude session → ticket + Slack notification → admin reviews via Apex tickets queue. Nothing auto-remediates; every action is investigate-only by design.
- Detectors: per-workload AWS Lambda functions firing on AWS EventBridge crons, one cron rule per workload, all pointing at the same generic `sentinel-detector.handler` with different `SENTINEL_AGENT_SLUG` env vars.
- One generic runner: an SQS-triggered Lambda dispatching investigation logic via a typed `SentinelAgent` registry — no per-agent runner code duplication.
- Atomic locks: DynamoDB `attribute_not_exists(pk)` conditional writes prevent investigating the same incident twice during parallel detector runs.
- Shared tool library: `file_ticket`, `notify_slack`, `curl_url`, `dns_lookup`, plus per-agent tools.
Bedrock inference layer: the runner Lambda calls Bedrock via ConverseStreamCommand on the us.anthropic.claude-sonnet-4-6 US inference profile (server-streamed responses, low latency, first-class AWS SDK support). IAM scope: bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream on foundation-model and inference-profile ARNs. Production migrated from Anthropic API direct to Bedrock Converse on 2026-05-05.
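As a rough illustration of this layer, the request the runner hands to Bedrock can be sketched as a plain object in the Converse API's request shape. The `modelId` is the inference profile named above; the system prompt and tool schema here are illustrative stand-ins, not Sentinel's actual ones, and in production the object would be passed to `ConverseStreamCommand` from `@aws-sdk/client-bedrock-runtime`.

```typescript
// Sketch of the ConverseStream request the runner Lambda might build.
// System prompt and tool schema are illustrative, not Sentinel's real ones.
interface ToolSpec {
  toolSpec: { name: string; inputSchema: { json: object } };
}

function buildConverseRequest(incidentSummary: string, tools: ToolSpec[]) {
  return {
    modelId: "us.anthropic.claude-sonnet-4-6", // US inference profile from the text
    system: [
      { text: "You are an investigate-only ops agent. Diagnose root cause; never remediate." },
    ],
    messages: [{ role: "user" as const, content: [{ text: incidentSummary }] }],
    toolConfig: { tools },
    inferenceConfig: { maxTokens: 4096 },
  };
}

const req = buildConverseRequest("CloudFront 5xx spike on tenant workload", [
  { toolSpec: { name: "curl_url", inputSchema: { json: { type: "object" } } } },
]);
```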
Why this is harder than it looks
Detection without investigation
Traditional AI ops platforms excel at detecting signals — uptime drops, deploy failures, performance regressions — but stop at the alerting boundary. The expensive labor happens AFTER the alert fires: a senior engineer reads logs, correlates timestamps across services, runs diagnostic queries, identifies root cause, and files a ticket. Across a 24-workload production fleet running 24/7, this overhead is unsustainable for a 6-FTE team. Sentinel collapses detection + investigation into one Bedrock session.
Cost ceilings on LLM ops agents
Naïve LLM-driven ops agents have no cost ceiling. A single agent run can consume hundreds of thousands of tokens if tool use goes pathological. Sentinel enforces $0.30 soft / $1.00 hard cost ceilings per investigation via the Bedrock Converse session abort path. Synthetic tests verified $0.0822–$0.0941 per real investigation, with most sessions completing in 35–45 active seconds. The architecture is cost-conscious at the session level, not the platform level.
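The ceiling mechanics reduce to a per-turn check on cumulative session cost. A minimal sketch, assuming Sonnet-class per-token rates ($3/M input, $15/M output) — the actual Bedrock prices and Sentinel's exact abort wiring may differ:

```typescript
// Per-investigation cost ceiling check, run after each Converse turn.
// Token prices are assumed Sonnet-class rates, not confirmed Bedrock pricing.
const INPUT_PRICE = 3 / 1_000_000;   // $ per input token (assumed)
const OUTPUT_PRICE = 15 / 1_000_000; // $ per output token (assumed)
const SOFT_CEILING = 0.3;            // warn, but let the session finish
const HARD_CEILING = 1.0;            // abort the session immediately

type Verdict = "continue" | "warn" | "abort";

function checkCeiling(inputTokens: number, outputTokens: number): Verdict {
  const cost = inputTokens * INPUT_PRICE + outputTokens * OUTPUT_PRICE;
  if (cost >= HARD_CEILING) return "abort";
  if (cost >= SOFT_CEILING) return "warn";
  return "continue";
}
```

At these assumed rates, a typical ~$0.08 investigation (roughly 20k input / 1.5k output tokens) stays well under the soft ceiling, while a pathological tool-use loop at 300k input tokens trips the hard abort.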
Same-incident race conditions
Detectors firing on a cron (every 5 minutes for Diagnostics; every 15 for GH Triage and Pipeline Hang) will race — two parallel detectors can find the same incident before either has filed a ticket. Sentinel uses DynamoDB conditional writes with `attribute_not_exists(pk)` as the atomic lock primitive. The first detector that writes the lock owns the incident; subsequent detector ticks see the lock and skip the duplicate investigation. No distributed locking infrastructure required.
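The lock reduces to a single conditional `PutItem`. A sketch of the request input — table name, key format, and TTL field are illustrative choices, and in production the object would go to `PutItemCommand` from `@aws-sdk/client-dynamodb`:

```typescript
// Incident lock as a DynamoDB conditional PutItem input. The first detector
// whose write succeeds owns the incident; a concurrent write for the same pk
// fails with ConditionalCheckFailedException and that detector skips it.
// Table/key/TTL names are illustrative.
function buildLockPut(agentSlug: string, incidentId: string, ttlEpoch: number) {
  return {
    TableName: "SentinelLocks",
    Item: {
      pk: { S: `${agentSlug}#${incidentId}` },
      lockedAt: { N: String(Date.now()) },
      ttl: { N: String(ttlEpoch) }, // DynamoDB TTL reaps stale locks
    },
    // Atomic: the write succeeds only if no item with this pk exists yet.
    ConditionExpression: "attribute_not_exists(pk)",
  };
}

const lockPut = buildLockPut("pipeline-hang", "tenant-3/write-posts", 1_780_000_000);
```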
N workloads should not require N Lambdas
A naïve implementation deploys one detector Lambda + one runner Lambda per workload. At 3 workloads that’s 6 Lambdas; at the 8-workload roadmap state that’s 16 Lambdas plus 8 IAM policies plus 8 SQS queues. Sentinel collapses to ONE generic detector Lambda (parameterized via `SENTINEL_AGENT_SLUG` env var per cron) + ONE generic runner Lambda that dispatches by registry lookup. Adding the 4th workload requires a new entry in the typed `SentinelAgent` registry + a new EventBridge cron rule — zero new Lambdas, zero new IAM policies, zero new queues.
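The registry-dispatch pattern can be sketched in a few lines. The slugs and tool lists below mirror the live workloads; the `investigate` bodies are stubs standing in for the real Bedrock session driver:

```typescript
// Typed agent registry: one generic runner dispatches by slug, so adding a
// workload is a registry entry + a cron rule, not a new Lambda.
interface SentinelAgent {
  slug: string;
  tools: string[];
  investigate: (incident: string) => string; // stub; real version drives a Bedrock session
}

const registry: Record<string, SentinelAgent> = {
  diagnostics: {
    slug: "diagnostics",
    tools: ["curl_url", "dns_lookup", "get_cf_5xx_breakdown", "file_ticket", "notify_slack"],
    investigate: (i) => `diagnostics: ${i}`,
  },
  "gh-triage": {
    slug: "gh-triage",
    tools: ["get_workflow_run", "get_workflow_run_logs", "file_ticket", "notify_slack"],
    investigate: (i) => `gh-triage: ${i}`,
  },
};

// Generic runner dispatch: the SENTINEL_AGENT_SLUG carried in the SQS message
// selects the registry entry; unknown slugs fail loudly rather than silently.
function dispatch(slug: string, incident: string): string {
  const agent = registry[slug];
  if (!agent) throw new Error(`unknown agent slug: ${slug}`);
  return agent.investigate(incident);
}
```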
What’s running in production today
| Workload | Cron | Tools | Cost |
|---|---|---|---|
| Diagnostics Agent | rate(5 minutes) | curl_url, dns_lookup, get_cf_5xx_breakdown, file_ticket, notify_slack | $0.0822 / investigation |
| GitHub Triage Agent | rate(15 minutes) | get_workflow_run, get_workflow_run_logs, get_recent_commits, get_workflow_file, file_ticket, notify_slack | $0.065 / investigation |
| Pipeline Hang Detector | rate(15 minutes) | singleton-lock probe, owner-tenant correlation, file_ticket, notify_slack | $0.0941 / investigation |
Three workloads live as of May 2026. ~14,850 detector cron ticks per month across all three (verified via CloudWatch Lambda invocation metrics). Sub-$5/month total operating cost at the current 3-workload state. iSimplifyMe-operated. Not handed off.
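The monthly tick count falls out of simple cron arithmetic (a 31-day month assumed, which lands near the ~14,850 observed figure):

```typescript
// Monthly detector invocations across the three live workloads, 31-day month.
const MINUTES_PER_MONTH = 31 * 24 * 60; // 44,640

const ticks =
  MINUTES_PER_MONTH / 5 +  // Diagnostics: rate(5 minutes)      → 8,928
  MINUTES_PER_MONTH / 15 + // GitHub Triage: rate(15 minutes)   → 2,976
  MINUTES_PER_MONTH / 15;  // Pipeline Hang: rate(15 minutes)   → 2,976
// ticks === 14,880, consistent with the ~14,850 from CloudWatch metrics
```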
Build log
- Phase 0
Audit + cost ceiling design (April 2026)
Audit of existing manual investigation workflows across the 24-tenant AWS fleet. Cost-ceiling design: $0.30 soft / $1.00 hard per investigation, with hard session-abort path wired through the Bedrock Converse client. Initial substrate decision: Anthropic Managed Agents (Bedrock-hosted Managed Agents had not yet shipped at the time).
- Phase 1
Diagnostics Agent live in production (April 2026)
Production detector Lambda firing every 5 minutes. SST-wired infrastructure: SentinelQueue (SQS), generic runner Lambda, DynamoDB ticket records, EventBridge cron rules, IAM-scoped Anthropic API perimeter. Synthetic tests on staging passed at ~$0.063 per investigation. 7-day canary observation window confirmed zero false positives before broadening to the full monitored fleet.
- Phase 1.5
GH Triage Agent live (April 2026)
Second workload reusing the same generic detector + runner pattern. Per-workload tool library: GitHub API client (workflow run + logs + recent commits + workflow file). Scoped to a canary repository for 7 days before broadening to all 37 iSimplifyMe repos via a dynamic exclusion list in DynamoDB. Verified at ~$0.065 per investigation.
- Phase 1.6
Pipeline Hang Detector live (May 2026)
Third workload — detects orphaned content pipeline locks across the 24-tenant fleet (e.g., singleton write-posts lock held past timeout threshold). New tools: singleton-lock probe + owner-tenant correlation. Atomic lock primitive (DynamoDB conditional write) prevents same-incident race conditions during parallel detector ticks. ~$0.0941 per investigation.
- Phase 2
Migrated to AWS Bedrock Converse (2026-05-05)
Substrate migration from Anthropic API direct (Managed Agents) to AWS Bedrock via the `ConverseStreamCommand` API on the `us.anthropic.claude-sonnet-4-6` US inference profile. Production deploy succeeded on 2026-05-05 17:07 UTC. Pre-migration substrate preserved for rollback through 2026-06-04. Synthetic tests on staging verified $0.0822 Diagnostics + $0.0941 Pipeline Hang post-migration with no regression in agent behavior or session cost.
- Phase 3+
Roadmap workloads (same generic detector + runner architecture)
Issue/PR Triage Agent · Lighthouse Regression Detector · Cost Anomaly Detector · AEO Drift / Citation Surveillance · DNS Watcher. Eight total workloads planned. Each new workload requires only a new entry in the typed `SentinelAgent` registry + a new EventBridge cron rule. Zero new Lambdas. Projected operating cost at full 8-workload state: $25–40/month.
Built and operated, not delivered.
Most "AI ops" engagements end with a slide deck about future AI agents.
Sentinel is the layer iSimplifyMe runs against its own 24-tenant production AWS fleet, today. Three workloads firing on EventBridge crons every 5–15 minutes. Investigate-only. Cost-ceilinged. The same architecture is deployable into a client AWS account on the same SST-wired pattern.
Frequently asked questions
How is Sentinel different from Datadog or PagerDuty AI monitoring?
Datadog and PagerDuty detect signals — uptime drops, deploy failures, performance regressions — and route alerts to humans. They stop at the alerting boundary. Sentinel detects the signal AND investigates root cause autonomously using AWS-native tools (curl, DNS lookups, CloudWatch queries, GitHub API calls), then files a ticket with a written diagnosis. The expensive labor that traditional ops platforms pass to senior engineers is what Sentinel does inside a $0.08 Bedrock session.
Why investigate-only and not auto-remediation?
Auto-remediation requires the AI to be right every time about a destructive action. Investigate-only requires it to be right most of the time about a non-destructive diagnosis. The cost asymmetry is enormous. A Sentinel-pattern agent that misdiagnoses costs you the price of a ticket (some senior-engineer time to read and dismiss it). An auto-remediating agent that misdiagnoses can take down production. Investigate-only is the model that scales with current LLM reliability; auto-remediation isn’t.
What does the per-investigation cost include?
The full Bedrock Converse session: input tokens (system prompt + tools schema + prior conversation), output tokens, tool-use orchestration, and any retries. Verified across synthetic tests: $0.0822 for a Diagnostics run with DNS + HTTP + a CloudFront 5xx breakdown probe, $0.0941 for a Pipeline Hang investigation that walked a singleton lock + owner-tenant correlation, $0.065 for a GitHub Triage run that pulled workflow logs and recent commits before classifying the failure. Every investigation has a hard $1.00 cost ceiling that kills the session if it’s ever approached. Most run in 35–45 seconds.
Can Sentinel run inside our AWS account, not iSimplifyMe’s?
Yes — that’s the deployment model for client engagements. The Sentinel-pattern infrastructure (detector Lambdas, runner Lambda, SentinelQueue, DynamoDB locks, EventBridge crons, IAM-scoped Bedrock invocation perimeter) is deployable via SST into a client AWS account. The same generic-detector-plus-runner pattern carries forward; per-workload registries adapt to the client’s monitored surface. iSimplifyMe operates it post-deploy unless the client wants ownership transferred — same arrangement as Apex Portal multi-tenant.
How long does Phase 1 take from kickoff?
For a client deployment of the Diagnostics Agent (the most common starting workload), about 3 weeks: 1 week to scope the monitored surface plus author the per-workload tools, 1 week to provision the IAM-scoped Bedrock inference profile plus SST infrastructure plus DynamoDB tables, 1 week of synthetic testing plus a canary observation window before broadening to the full monitored fleet. Subsequent workloads (GH Triage, Pipeline Hang, custom-vertical detectors) reuse the same runner and generic detector pattern — typically 1 week of incremental work per additional workload.
Deploy a Sentinel-pattern AI ops layer
If you run production workloads on AWS and your senior engineers spend hours per week investigating CloudWatch alerts, we can talk through what the Sentinel-pattern deployment looks like for your stack.
- Discovery call: 30 min · Free · No deck — actual mechanics
- Phase 1 timeline: ~3 weeks to first workload live in production
- iSM-operated: Bedrock, IAM, alerts — all on us