What is Sentinel?

Sentinel is iSimplifyMe’s production AI ops layer, running on AWS Bedrock + Lambda + SQS + DynamoDB. It detects production anomalies across iSimplifyMe’s 24-tenant AWS fleet, autonomously investigates root cause with its tool library (curl, DNS lookups, GitHub API, CloudWatch queries), and files tickets with written diagnoses at roughly $0.08 per investigation — well under the $0.30 soft / $1.00 hard cost ceiling. Three workloads are live in production as of May 2026: a Diagnostics Agent firing every 5 minutes, a GitHub Triage Agent every 15 minutes, and a Pipeline Hang Detector every 15 minutes. The stack migrated to the AWS Bedrock Converse API on 2026-05-05.

The setup

iSimplifyMe operates 24 production tenant workloads on AWS — multi-tenant web platforms, client websites refactored from cPanel, native iOS apps, regulated-vertical SaaS, and internal infrastructure. Each workload generates the same labor pattern: a CloudWatch alarm fires, a senior engineer drops what they’re doing, spends 15–45 minutes investigating, and files a ticket with the root cause. Across 24 workloads running 24/7 on a 6-FTE team, this overhead is unsustainable.

Sentinel is the layer that collapses detection + investigation + ticketing into one AWS-native pattern. Investigate-only. Cost-ceilinged. Generic detector + runner.

Architecture

Every Sentinel workload follows the same proven shape: signal → detector Lambda → SQS → Bedrock Claude session → ticket + Slack notification → admin review via the Apex tickets queue. Nothing auto-remediates; every action is investigate-only by design.

  • Detectors: per-workload AWS Lambda functions firing on AWS EventBridge crons — one cron rule per workload, all pointing at the same generic sentinel-detector.handler with different `SENTINEL_AGENT_SLUG` env vars.
  • One generic runner: an SQS-triggered Lambda that dispatches investigation logic via a typed SentinelAgent registry — no per-agent runner code duplication.
  • Atomic locks: DynamoDB `attribute_not_exists(pk)` conditional writes prevent investigating the same incident twice during parallel detector runs.
  • Shared tool library: `file_ticket`, `notify_slack`, `curl_url`, `dns_lookup`, plus per-agent tools.
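As a sketch of the one-detector-many-crons pattern — the function and field names here are illustrative, not the production source — every EventBridge cron targets the same handler and the env var selects the workload:

```typescript
// Hypothetical sketch: the generic detector handler. Each EventBridge
// cron rule sets a different SENTINEL_AGENT_SLUG on the same Lambda.
type DetectorEvent = { time: string };

// The runner Lambda consumes this SQS message and dispatches by slug.
function buildSqsMessage(agentSlug: string, incidentId: string): { MessageBody: string } {
  return { MessageBody: JSON.stringify({ agentSlug, incidentId }) };
}

async function handler(_event: DetectorEvent): Promise<void> {
  const slug = process.env.SENTINEL_AGENT_SLUG;
  if (!slug) throw new Error("SENTINEL_AGENT_SLUG not set");
  // ...run the per-workload signal check here, then enqueue any incident:
  // await sqs.send(new SendMessageCommand({ QueueUrl, ...buildSqsMessage(slug, id) }));
}
```

The payload stays deliberately small — the slug plus an incident id — so the queue contract never changes as workloads are added.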

Bedrock inference layer: the runner Lambda calls Bedrock via ConverseStreamCommand on the us.anthropic.claude-sonnet-4-6 US inference profile (server-streamed responses, low latency, first-class AWS SDK support). IAM scope: bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream on the foundation-model and inference-profile ARNs. Production migrated from the direct Anthropic API to Bedrock Converse on 2026-05-05.
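A minimal sketch of what the runner might send — the field names follow the AWS SDK v3 ConverseStream input shape, the model ID comes from this page, and the system prompt is a placeholder:

```typescript
// Illustrative ConverseStream request builder (prompt text hypothetical).
function buildConverseRequest(userPrompt: string) {
  return {
    modelId: "us.anthropic.claude-sonnet-4-6",
    system: [{ text: "You are an investigate-only ops agent. Never remediate." }],
    messages: [{ role: "user" as const, content: [{ text: userPrompt }] }],
    inferenceConfig: { maxTokens: 4096 },
    // toolConfig would carry file_ticket, notify_slack, curl_url, dns_lookup, …
  };
}
// Usage sketch:
// await client.send(new ConverseStreamCommand(buildConverseRequest(prompt)));
```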

Why this is harder than it looks

Problem 01

Detection without investigation

Traditional AI ops platforms excel at detecting signals — uptime drops, deploy failures, performance regressions — but stop at the alerting boundary. The expensive labor happens AFTER the alert fires: a senior engineer reads logs, correlates timestamps across services, runs diagnostic queries, identifies root cause, and files a ticket. Across a 24-workload production fleet running 24/7, this overhead is unsustainable for a 6-FTE team. Sentinel collapses detection + investigation into one Bedrock session.

Problem 02

Cost ceilings on LLM ops agents

Naïve LLM-driven ops agents have no cost ceiling. A single agent run can consume hundreds of thousands of tokens if tool use goes pathological. Sentinel enforces $0.30 soft / $1.00 hard cost ceilings per investigation via the Bedrock Converse session abort path. Synthetic tests verified $0.0822–$0.0941 per real investigation, with most sessions completing in 35–45 active seconds. The architecture is cost-conscious at the session level, not the platform level.
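The ceiling logic can be sketched as a pure guard checked between tool-use turns — the thresholds are from this page; the function name and "warn vs abort" split are assumptions:

```typescript
// Illustrative cost-ceiling guard. Thresholds match the article's
// $0.30 soft / $1.00 hard ceilings; the plumbing is hypothetical.
const SOFT_CEILING_USD = 0.30;
const HARD_CEILING_USD = 1.00;

type CeilingVerdict = "continue" | "warn" | "abort";

function checkCostCeiling(runningCostUsd: number): CeilingVerdict {
  if (runningCostUsd >= HARD_CEILING_USD) return "abort"; // kill the session
  if (runningCostUsd >= SOFT_CEILING_USD) return "warn";  // log, finish current turn
  return "continue";
}
```

Checking after every turn rather than only at session end is what makes the hard ceiling an actual abort path instead of a billing report.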

Problem 03

Same-incident race conditions

Detectors firing on a cron (every 5 minutes for Diagnostics; every 15 for GH Triage and Pipeline Hang) will race — two parallel detectors can find the same incident before either has filed a ticket. Sentinel uses DynamoDB conditional writes with `attribute_not_exists(pk)` as the atomic lock primitive. The first detector that writes the lock owns the incident; subsequent detector ticks see the lock and skip the duplicate investigation. No distributed locking infrastructure required.
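The lock write described above can be sketched as plain DynamoDB PutItem parameters — the parameter shape follows DynamoDB's API; the table name and key layout are hypothetical:

```typescript
// Illustrative atomic-lock write params (table/key names assumed).
function buildLockPut(incidentId: string, nowIso: string) {
  return {
    TableName: "sentinel-locks",
    Item: {
      pk: { S: `incident#${incidentId}` },
      acquiredAt: { S: nowIso },
    },
    // The write succeeds only if no item with this pk exists yet —
    // the first detector to land it owns the incident.
    ConditionExpression: "attribute_not_exists(pk)",
  };
}
// A losing detector receives ConditionalCheckFailedException and skips
// the duplicate investigation; no lock service or lease table needed.
```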

Problem 04

N workloads should not require N Lambdas

A naïve implementation deploys one detector Lambda + one runner Lambda per workload. At 3 workloads that’s 6 Lambdas; at the 8-workload roadmap state that’s 16 Lambdas plus 8 IAM policies plus 8 SQS queues. Sentinel collapses to ONE generic detector Lambda (parameterized via `SENTINEL_AGENT_SLUG` env var per cron) + ONE generic runner Lambda that dispatches by registry lookup. Adding the 4th workload requires a new entry in the typed `SentinelAgent` registry + a new EventBridge cron rule — zero new Lambdas, zero new IAM policies, zero new queues.
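A sketch of what such a registry might look like — the interface fields and entry contents are illustrative, not the production source:

```typescript
// Hypothetical shape of the typed SentinelAgent registry the generic
// runner dispatches through.
interface SentinelAgent {
  slug: string;
  cron: string;                                  // EventBridge rate expression
  tools: string[];                               // per-agent tool allowlist
  investigate: (incidentId: string) => Promise<void>;
}

const registry: Record<string, SentinelAgent> = {
  diagnostics: {
    slug: "diagnostics",
    cron: "rate(5 minutes)",
    tools: ["curl_url", "dns_lookup", "file_ticket", "notify_slack"],
    investigate: async () => { /* run the Bedrock Converse session */ },
  },
  // Adding workload #4 is one more entry here plus one cron rule.
};

function resolveAgent(slug: string): SentinelAgent {
  const agent = registry[slug];
  if (!agent) throw new Error(`Unknown SENTINEL_AGENT_SLUG: ${slug}`);
  return agent;
}
```

Because the runner resolves agents by slug at invoke time, the Lambda count stays constant no matter how many workloads the registry holds.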

What’s running in production today

| Workload | Cron | Tools | Cost |
| --- | --- | --- | --- |
| Diagnostics Agent | rate(5 minutes) | curl_url, dns_lookup, get_cf_5xx_breakdown, file_ticket, notify_slack | $0.0822 / investigation |
| GitHub Triage Agent | rate(15 minutes) | get_workflow_run, get_workflow_run_logs, get_recent_commits, get_workflow_file, file_ticket, notify_slack | $0.065 / investigation |
| Pipeline Hang Detector | rate(15 minutes) | singleton-lock probe, owner-tenant correlation, file_ticket, notify_slack | $0.0941 / investigation |

Three workloads live as of May 2026. ~14,850 detector cron ticks per month across all three (verified via CloudWatch Lambda invocation metrics). Sub-$5/month total operating cost at the current 3-workload state. iSimplifyMe-operated. Not handed off.

Build log

  1. Phase 0

    Audit + cost ceiling design (April 2026)

    Audit of existing manual investigation workflows across the 24-tenant AWS fleet. Cost-ceiling design: $0.30 soft / $1.00 hard per investigation, with hard session-abort path wired through the Bedrock Converse client. Initial substrate decision: Anthropic Managed Agents (Bedrock-hosted Managed Agents had not yet shipped at the time).

  2. Phase 1

    Diagnostics Agent live in production (April 2026)

    Production detector Lambda firing every 5 minutes. SST-wired infrastructure: SentinelQueue (SQS), generic runner Lambda, DynamoDB ticket records, EventBridge cron rules, IAM-scoped Anthropic API perimeter. Synthetic tests on staging passed at ~$0.063 per investigation. 7-day canary observation window confirmed zero false positives before broadening to the full monitored fleet.

  3. Phase 1.5

    GH Triage Agent live (April 2026)

Second workload reusing the same generic detector + runner pattern. Per-workload tool library: GitHub API client (workflow run + logs + recent commits + workflow file). Scoped to a canary repository for 7 days before broadening to all 37 iSimplifyMe repos via a dynamic exclusion list in DynamoDB. Verified at ~$0.065 per investigation.

  4. Phase 1.6

    Pipeline Hang Detector live (May 2026)

    Third workload — detects orphaned content pipeline locks across the 24-tenant fleet (e.g., singleton write-posts lock held past timeout threshold). New tools: singleton-lock probe + owner-tenant correlation. Atomic lock primitive (DynamoDB conditional write) prevents same-incident race conditions during parallel detector ticks. ~$0.0941 per investigation.

  5. Phase 2

    Migrated to AWS Bedrock Converse (2026-05-05)

    Substrate migration from Anthropic API direct (Managed Agents) to AWS Bedrock via the `ConverseStreamCommand` API on the `us.anthropic.claude-sonnet-4-6` US inference profile. Production deploy succeeded on 2026-05-05 17:07 UTC. Pre-migration substrate preserved for rollback through 2026-06-04. Synthetic tests on staging verified $0.0822 Diagnostics + $0.0941 Pipeline Hang post-migration with no regression in agent behavior or session cost.

  6. Phase 3+

    Roadmap workloads (architecture supports — same generic detector + runner)

    Issue/PR Triage Agent · Lighthouse Regression Detector · Cost Anomaly Detector · AEO Drift / Citation Surveillance · DNS Watcher. Eight total workloads planned. Each new workload requires only a new entry in the typed `SentinelAgent` registry + a new EventBridge cron rule. Zero new Lambdas. Projected operating cost at full 8-workload state: $25–40/month.

Built and operated, not delivered.

Most "AI ops" engagements end with a slide deck about future AI agents.

Sentinel is the layer iSimplifyMe runs against its own 24-tenant production AWS fleet, today. Three workloads firing on EventBridge crons every 5–15 minutes. Investigate-only. Cost-ceilinged. The same architecture is deployable into a client AWS account on the same SST-wired pattern.

Bootstrapped. In production. AWS-native.

Frequently asked questions

How is Sentinel different from Datadog or PagerDuty AI monitoring?

Datadog and PagerDuty detect signals — uptime drops, deploy failures, performance regressions — and route alerts to humans. They stop at the alerting boundary. Sentinel detects the signal AND investigates root cause autonomously using AWS-native tools (curl, DNS lookups, CloudWatch queries, GitHub API calls), then files a ticket with a written diagnosis. The expensive labor that traditional ops platforms pass to senior engineers is what Sentinel does inside a $0.08 Bedrock session.

Why investigate-only and not auto-remediation?

Auto-remediation requires the AI to be right every time about a destructive action. Investigate-only requires it to be right most of the time about a non-destructive diagnosis. The cost asymmetry is enormous. A Sentinel-pattern agent that misdiagnoses costs you the price of a ticket (some senior-engineer time to read and dismiss it). An auto-remediating agent that misdiagnoses can take down production. Investigate-only is the model that scales with current LLM reliability; auto-remediation isn’t.

What does the per-investigation cost include?

The full Bedrock Converse session: input tokens (system prompt + tool schemas + prior conversation turns), output tokens, tool-use orchestration, and any retries. Verified across synthetic tests: $0.0822 for a Diagnostics run with DNS + HTTP + a CloudFront 5xx breakdown probe, $0.0941 for a Pipeline Hang investigation that walked a singleton lock + owner-tenant correlation, and $0.065 for a GitHub Triage run that pulled workflow logs and recent commits before classifying the failure. Every investigation has a hard $1.00 cost ceiling that kills the session if it’s ever approached. Most run in 35–45 seconds.

Can Sentinel run inside our AWS account, not iSimplifyMe’s?

Yes — that’s the deployment model for client engagements. The Sentinel-pattern infrastructure (detector Lambdas, runner Lambda, SentinelQueue, DynamoDB locks, EventBridge crons, IAM-scoped Bedrock invocation perimeter) is deployable via SST into a client AWS account. The same generic-detector-plus-runner pattern carries forward; per-workload registries adapt to the client’s monitored surface. iSimplifyMe operates it post-deploy unless the client wants ownership transferred — same arrangement as Apex Portal multi-tenant.

How long does Phase 1 take from kickoff?

For a client deployment of the Diagnostics Agent (the most common starting workload), about 3 weeks: one week to scope the monitored surface and author the per-workload tools, one week to provision the IAM-scoped Bedrock inference profile, SST infrastructure, and DynamoDB tables, and one week of synthetic testing plus a canary observation window before broadening to the full monitored fleet. Subsequent workloads (GH Triage, Pipeline Hang, custom-vertical detectors) reuse the same runner and generic detector pattern — typically 1 week of incremental work per additional workload.

Get Started

Deploy a Sentinel-pattern AI ops layer

If you run production workloads on AWS and your senior engineers spend hours per week investigating CloudWatch alerts, we can talk through what the Sentinel-pattern deployment looks like for your stack.

  • Discovery call: 30 min · Free · No deck — actual mechanics
  • Phase 1 timeline: ~3 weeks to first workload live in production
  • iSM-operated: Bedrock, IAM, alerts — all on us