Post-Deployment Ops & Managed Service
We don't ship and leave. Monthly retainer covering incident response, token-budget governance, model upgrades, and ongoing optimization of production agents.
There is a wide gap between "the pilot works" and "the system keeps working at 2 AM Sunday." Most agencies hand you a repo, a Loom walkthrough, and a final invoice — then disappear when the model deprecates or the retrieval index goes stale. We architect, deploy, and operate the systems we build.
What Managed Service Covers.
Post-deployment ops is the monthly retainer layer that keeps AI systems running after launch. Scope covers 24x7 incident response with defined SLAs, token-budget governance, model upgrades, ongoing prompt and agent tuning, observability across CloudWatch and custom dashboards, and quarterly architecture reviews. We operate either as your primary platform team or as the escalation tier behind an internal one.
Managed service lives in Pillar III (Operational Excellence) of our 3-pillar framework — Pillar I handles the intelligence core (orchestration, agents, sovereignty, internal tooling), Pillar II handles discovery and authority engineering. Once a system is in production, the cost of failure stops being theoretical. Token bills compound, models deprecate, indexes drift, and prompts that worked in staging hallucinate when edge-case users arrive.
The retainer is scoped per tier with fixed deliverables; custom scope available. We bill for a bounded commitment, not hours: your system stays alive, costs stay predictable, and the escalation owner has read your codebase.
We operate in two postures. For clients without an internal platform function, we are the primary team. For teams with one, we are the escalation tier — your team owns day-to-day, we take the calls they cannot resolve inside a defined window.
- 24x7 incident response with defined P1/P2/P3 SLAs
- Monthly token-budget report with per-agent and per-route breakdown
- Model upgrade evaluation and scheduled migrations (Claude 4.5/4.6, GPT-5, Gemini 2.x)
- Prompt and agent tuning based on production telemetry
- CloudWatch dashboards, alarms, and custom observability built on your AWS account
- Quarterly architecture review covering cost, latency, accuracy, and coverage
- A single escalation channel with a named owner who knows your stack
Incident Response.
Incident response runs on a tiered SLA: P1 issues (production down, data loss risk) get a 30-minute response during business hours and 2 hours off-hours. P2 issues (degraded performance, non-blocking errors) get next business day. P3 issues (minor defects, cosmetic) get weekly batch. Enterprise tier offers faster SLAs including 15-minute P1 response and 30-minute off-hours acknowledgment.
Every retainer ships with an on-call rotation, stack-specific runbooks, and a Slack or PagerDuty escalation channel. You page one number. The person who picks up has seen your architecture diagram, has access to your AWS account, and knows which Lambda owns which route.
The three-tier SLA below mirrors how we operate our own platforms — Apex (12+ clients, 9 modules) and Nexus (9-module AI orchestration). These are the SLAs we hold ourselves to every day.
| Severity | Definition | Response (Business Hours) | Response (Off-Hours) | Enterprise Tier |
|---|---|---|---|---|
| P1 | Production down, data loss risk, security incident | 30 minutes | 2 hours | 15 min / 30 min off-hours |
| P2 | Degraded performance, partial outage, non-blocking errors | Next business day | Next business day | 4 hours |
| P3 | Minor defects, cosmetic issues, enhancement requests | Weekly batch | Weekly batch | 2 business days |
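To make the tiers concrete, the table can be read as a simple deadline function. The sketch below is illustrative only; the real paging policy lives in the escalation tooling, and "next business day" and "weekly batch" are approximated here as fixed offsets:

```python
from datetime import datetime, timedelta

# Illustrative encoding of the SLA table above (standard tier).
# severity: (business-hours minutes, off-hours minutes)
SLA_MINUTES = {
    "P1": (30, 120),
    "P2": (24 * 60, 24 * 60),          # next business day, approximated as 24h
    "P3": (7 * 24 * 60, 7 * 24 * 60),  # weekly batch, approximated as 7 days
}

def response_deadline(severity: str, opened_at: datetime,
                      business_hours: bool) -> datetime:
    """Latest acceptable first-response time for an incident."""
    biz, off = SLA_MINUTES[severity]
    return opened_at + timedelta(minutes=biz if business_hours else off)
```

A P1 opened at 09:00 on a business day, for example, must be acknowledged by 09:30.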
Token-Budget Governance.
Token-budget governance is the monthly discipline of measuring, attributing, and tuning AI model spend. Every retainer ships a monthly report showing per-agent, per-route, and per-tenant token usage, anomaly flags, and recommendations. We tune prompts, swap models, and add caching where the math works — the difference between Opus and Haiku on a high-traffic route can be 40x.
Token costs are the new AWS bill. In 2026, a mid-size production AI system can spend $2,000 to $20,000 per month on inference alone, and the variance between tuned and untuned is often 5-10x. Most teams do not have the telemetry to see where the money is going, let alone the capacity to optimize week over week.
We instrument every AI call with agent, route, tenant, model ID, input/output tokens, cache hit/miss, and latency. That data flows into CloudWatch and a custom dashboard on your AWS account. Each month, you receive a written report covering:
- Total spend by model and by agent
- Per-tenant attribution for multi-tenant systems
- Anomalies (sudden spikes, cache-miss regressions, runaway agents)
- Specific tuning recommendations with projected savings
- Model-swap opportunities where a cheaper model would meet the accuracy bar
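As a sketch of what "instrument every AI call" looks like in practice, the wrapper below emits one structured log line per call with the fields listed above. The field names, and the assumption that the model client returns token counts and a cache flag, are illustrative rather than a real SDK contract:

```python
import json
import time
from dataclasses import dataclass, asdict

# Illustrative per-call telemetry record; the real schema is
# tailored per engagement, not a fixed contract.
@dataclass
class AICallRecord:
    agent: str
    route: str
    tenant: str
    model_id: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool
    latency_ms: float

def record_call(call_model, *, agent, route, tenant, model_id, **kwargs):
    """Invoke a model-calling function and emit one structured log line.

    Assumes `call_model` returns an object exposing `input_tokens`,
    `output_tokens`, and `cache_hit` -- an assumption for this sketch,
    not a real SDK interface.
    """
    start = time.monotonic()
    result = call_model(**kwargs)
    rec = AICallRecord(
        agent=agent, route=route, tenant=tenant, model_id=model_id,
        input_tokens=result.input_tokens,
        output_tokens=result.output_tokens,
        cache_hit=result.cache_hit,
        latency_ms=(time.monotonic() - start) * 1000,
    )
    print(json.dumps(asdict(rec)))  # structured logs flow to CloudWatch
    return result
```

Attribution reports then reduce to group-bys over these log lines.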
Model Upgrades.
Model upgrades are handled as scheduled migrations, not surprise events. When Claude 4.5/4.6, GPT-5, or Gemini 2.x ships, we run a test suite against your production prompts in staging, measure accuracy and cost deltas, and propose a migration plan. Upgrades happen on your schedule behind a feature flag, not the vendor's release calendar.
Foundation model vendors ship new versions multiple times a year. Most are not drop-in improvements. A prompt that scored 94% on Claude 3.5 Sonnet can drop to 87% on Claude 4.6 because the new model is more conservative about hedging or interprets system prompts differently.
- Shadow test. Run the new model in parallel against a sample of production traffic, score outputs against the baseline.
- Regression review. Surface accuracy, latency, cost, and output-format deltas in a written brief.
- Staged rollout. Upgrades go live behind a feature flag — 10% of traffic, then 50%, then 100% over 1-2 weeks depending on risk.
- Rollback path. The previous model stays behind a flag for at least 30 days. If a regression surfaces, we flip back inside minutes.
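The staged-rollout step is commonly implemented as deterministic hash bucketing, so a given request always lands on the same side of the flag across retries. A minimal sketch with placeholder model IDs, not our production flag system:

```python
import hashlib

def pick_model(request_id: str, rollout_pct: int,
               candidate: str = "model-next",
               baseline: str = "model-current") -> str:
    """Route a fixed percentage of traffic to the candidate model.

    Hashes the request ID into a 0-99 bucket; the same request always
    lands in the same bucket, so rollout is deterministic, and moving
    10% -> 50% -> 100% only turns the `rollout_pct` knob.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return candidate if bucket < rollout_pct else baseline
```

Rolling back is the same knob turned to zero, which is why the previous model stays flagged for at least 30 days.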
Performance Tuning.
Performance tuning is the ongoing optimization of prompts, agent definitions, retrieval configs, and routing logic based on production telemetry. Each month we review failure modes, latency outliers, and accuracy drift, then ship tuned versions through the same staged rollout as model upgrades. Tuning often delivers 20-40% latency or cost reduction without accuracy trade-offs when the baseline has not been touched since launch.
Production AI systems degrade silently. A prompt that worked on 1,000 users breaks on 100,000 as the query distribution shifts. A RAG pipeline returns stale citations because the index has not been rebuilt. A tool-calling agent drops from 95% to 82% accuracy because a new tool was added without retuning the selection prompt.
Tuning ships through the same staged rollout as model upgrades. Typical monthly work: prompt restructuring, retrieval chunk-size adjustments, tool-description rewrites, few-shot rotation, and cache-key normalization. The agent architecture we put in place at initial deployment makes these changes safe — agents are versioned, configs are code, rollouts are gated.
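Cache-key normalization, for instance, means two prompts that differ only in whitespace or variable ordering should resolve to the same cache entry. The function below is an illustrative sketch, not our production implementation:

```python
import hashlib
import json
import re

def cache_key(system_prompt: str, user_vars: dict) -> str:
    """Build a stable cache key for a prompt plus its variables.

    Collapses whitespace and sorts variable keys so cosmetic
    differences do not cause cache misses. Real normalization is
    model- and application-specific; this is only the shape of it.
    """
    prompt = re.sub(r"\s+", " ", system_prompt).strip()
    payload = json.dumps({"prompt": prompt, "vars": user_vars}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```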
Observability.
Observability covers what we monitor and what we surface to you. We monitor token usage, latency percentiles (p50/p95/p99), error rates, cache hit rates, and cost per transaction. These land in CloudWatch and custom dashboards on your AWS account, with alarms routed to the on-call rotation. You see the same data we do — we do not hide operational reality behind a monthly summary PDF.
Observability is not optional. If you cannot see what your AI system is doing, you cannot operate it. Every retainer ships with a baseline layer built on CloudWatch, structured logs, and a custom dashboard tailored to your stack.
- Token usage by agent, route, tenant, and model
- Latency percentiles (p50, p95, p99) for every AI call
- Error rates segmented by model error vs application error
- Cache hit rates for prompt caching and retrieval caching
- Cost per transaction (or per session, or per user depending on pricing model)
- Retrieval quality (recall, precision, staleness) for RAG pipelines
- Agent step counts and tool-call patterns
- Rate-limit and throttling events from upstream model providers
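One way these metrics can reach CloudWatch without extra API calls is the Embedded Metric Format: a structured log line that CloudWatch converts into custom metrics. The namespace and dimension names below are illustrative, not our fixed schema:

```python
import json
import time

def emf_token_metric(agent: str, route: str,
                     input_tokens: int, output_tokens: int) -> str:
    """Render one CloudWatch Embedded Metric Format log line.

    When written to a Lambda or agent log stream, CloudWatch extracts
    InputTokens/OutputTokens as metrics dimensioned by Agent and Route.
    "AIOps/Tokens" is an assumed namespace for this sketch.
    """
    doc = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "AIOps/Tokens",
                "Dimensions": [["Agent", "Route"]],
                "Metrics": [
                    {"Name": "InputTokens", "Unit": "Count"},
                    {"Name": "OutputTokens", "Unit": "Count"},
                ],
            }],
        },
        "Agent": agent,
        "Route": route,
        "InputTokens": input_tokens,
        "OutputTokens": output_tokens,
    }
    return json.dumps(doc)
```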
Dashboards live on your AWS account with IAM scoped to your team. You own the data, we operate it. The same pattern runs behind our data sovereignty stance — nothing routes through our infrastructure by default.
How It Differs From a Typical MSP.
A traditional managed service provider handles known workloads with playbooks. AI infrastructure is not a known workload. Models change, prompts drift, retrieval degrades, and token costs move daily. We operate the systems we architected — the on-call engineer wrote the retrieval pipeline, not a ticket-queue tech following a runbook. That is why this service only applies to systems we have either built or fully audited.
Most MSPs were built for stable problems: the stack does not change, the workload does not change, last year's playbook still works. AI infrastructure in 2026 is none of those things. Frontier models ship quarterly, prompt behavior shifts with every version, token costs move daily, and retrieval quality depends on index freshness and embedding drift.
We only take on post-deployment ops for systems we have either built end-to-end or completed a full architecture audit on. We are not a help-desk bolted onto someone else's stack. The engineer picking up your P1 call is familiar with the Apex or Nexus patterns your system is built on — those are the same patterns running across the platforms we operate every day.
The retainer includes adjacent scope a traditional MSP would not touch. Model deprecation migrations, prompt retuning when business logic shifts, observability integration for new agents and retrieval sources — all part of the ongoing service. We pair this with training and enablement, change management, and internal tooling as needed, so your team is not locked out of the system we are operating.
When something breaks in production, someone here fixes it before you page us. That is the commitment. The SLAs, the dashboards, the monthly reports — all of it is evidence that the commitment is real.