Post-Deployment Ops & Managed Service
We don't ship and leave. Monthly retainer covering incident response, token-budget governance, model upgrades, and ongoing optimization of production agents.
There is a wide gap between "the pilot works" and "the system keeps working at 2 AM Sunday." Most agencies hand you a repo, a Loom walkthrough, and a final invoice — then disappear when the model deprecates or the retrieval index goes stale. We architect, deploy, and operate the systems we build.
What Managed Service Covers.
Post-deployment ops is the monthly retainer layer that keeps AI systems running after launch. Scope covers 24x7 incident response with defined SLAs, token-budget governance, model upgrades, ongoing prompt and agent tuning, observability across CloudWatch and custom dashboards, and quarterly architecture reviews. We operate either as your primary platform team or as the escalation tier behind an internal one.
Managed service lives in Pillar III (Operational Excellence) of our 3-pillar framework — Pillar I handles the intelligence core (orchestration, agents, sovereignty, internal tooling), Pillar II handles discovery and authority engineering. Once a system is in production, the cost of failure stops being theoretical. Token bills compound, models deprecate, indexes drift, and prompts that worked in staging hallucinate when edge-case users arrive.
The retainer is scoped per tier with fixed deliverables; custom scope available. We bill for a bounded commitment, not hours: your system stays alive, costs stay predictable, and the escalation owner has read your codebase.
We operate in two postures. For clients without an internal platform function, we are the primary team. For teams with one, we are the escalation tier — your team owns day-to-day, we take the calls they cannot resolve inside a defined window.
- 24x7 incident response with defined P1/P2/P3 SLAs
- Monthly token-budget report with per-agent and per-route breakdown
- Model upgrade evaluation and scheduled migrations (Claude 4.5/4.6, GPT-5, Gemini 2.x)
- Prompt and agent tuning based on production telemetry
- CloudWatch dashboards, alarms, and custom observability built on your AWS account
- Quarterly architecture review covering cost, latency, accuracy, and coverage
- A single escalation channel with a named owner who knows your stack
Incident Response.
Incident response runs on a tiered SLA: P1 issues (production down, data loss risk) get a 30-minute response during business hours and 2 hours off-hours. P2 issues (degraded performance, non-blocking errors) get next business day. P3 issues (minor defects, cosmetic) get weekly batch. Enterprise tier offers faster SLAs including 15-minute P1 response and 30-minute off-hours acknowledgment.
Every retainer ships with an on-call rotation, stack-specific runbooks, and a Slack or PagerDuty escalation channel. You page one number. The person who picks up has seen your architecture diagram, has access to your AWS account, and knows which Lambda owns which route.
The three-tier SLA below mirrors how we operate our own platforms — Apex (12+ clients, 9 modules) and Nexus (9-module AI orchestration). These are the SLAs we hold ourselves to every day.
| Severity | Definition | Response (Business Hours) | Response (Off-Hours) | Enterprise Tier |
|---|---|---|---|---|
| P1 | Production down, data loss risk, security incident | 30 minutes | 2 hours | 15 min / 30 min off-hours |
| P2 | Degraded performance, partial outage, non-blocking errors | Next business day | Next business day | 4 hours |
| P3 | Minor defects, cosmetic issues, enhancement requests | Weekly batch | Weekly batch | 2 business days |
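To make the tiers concrete, the table can be read as a simple deadline function. The sketch below is illustrative only; the real paging policy lives in the escalation tooling, and "next business day" and "weekly batch" are approximated here as fixed offsets:

```python
from datetime import datetime, timedelta

# Illustrative encoding of the SLA table above (standard tier).
# severity: (business-hours minutes, off-hours minutes)
SLA_MINUTES = {
    "P1": (30, 120),
    "P2": (24 * 60, 24 * 60),          # next business day, approximated as 24h
    "P3": (7 * 24 * 60, 7 * 24 * 60),  # weekly batch, approximated as 7 days
}

def response_deadline(severity: str, opened_at: datetime,
                      business_hours: bool) -> datetime:
    """Latest acceptable first-response time for an incident."""
    biz, off = SLA_MINUTES[severity]
    return opened_at + timedelta(minutes=biz if business_hours else off)
```

A P1 opened at 09:00 on a business day, for example, must be acknowledged by 09:30.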
Token-Budget Governance.
Token-budget governance is the monthly discipline of measuring, attributing, and tuning AI model spend. Every retainer ships a monthly report showing per-agent, per-route, and per-tenant token usage, anomaly flags, and recommendations. We tune prompts, swap models, and add caching where the math works — the difference between Opus and Haiku on a high-traffic route can be 40x.
Token costs are the new AWS bill. In 2026, a mid-size production AI system can spend $2,000 to $20,000 per month on inference alone, and the variance between tuned and untuned is often 5-10x. Most teams do not have the telemetry to see where the money is going, let alone the capacity to optimize week over week.
We instrument every AI call with agent, route, tenant, model ID, input/output tokens, cache hit/miss, and latency. That data flows into CloudWatch and a custom dashboard on your AWS account. Each month, you receive a written report covering:
- Total spend by model and by agent
- Per-tenant attribution for multi-tenant systems
- Anomalies (sudden spikes, cache-miss regressions, runaway agents)
- Specific tuning recommendations with projected savings
- Model-swap opportunities where a cheaper model would meet the accuracy bar
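As a sketch of what "instrument every AI call" looks like in practice, the wrapper below emits one structured log line per call with the fields listed above. The field names, and the assumption that the model client returns token counts and a cache flag, are illustrative rather than a real SDK contract:

```python
import json
import time
from dataclasses import dataclass, asdict

# Illustrative per-call telemetry record; the real schema is
# tailored per engagement, not a fixed contract.
@dataclass
class AICallRecord:
    agent: str
    route: str
    tenant: str
    model_id: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool
    latency_ms: float

def record_call(call_model, *, agent, route, tenant, model_id, **kwargs):
    """Invoke a model-calling function and emit one structured log line.

    Assumes `call_model` returns an object exposing `input_tokens`,
    `output_tokens`, and `cache_hit` -- an assumption for this sketch,
    not a real SDK interface.
    """
    start = time.monotonic()
    result = call_model(**kwargs)
    rec = AICallRecord(
        agent=agent, route=route, tenant=tenant, model_id=model_id,
        input_tokens=result.input_tokens,
        output_tokens=result.output_tokens,
        cache_hit=result.cache_hit,
        latency_ms=(time.monotonic() - start) * 1000,
    )
    print(json.dumps(asdict(rec)))  # structured logs flow to CloudWatch
    return result
```

Attribution reports then reduce to group-bys over these log lines.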
Model Upgrades.
Model upgrades are handled as scheduled migrations, not surprise events. When Claude 4.5/4.6, GPT-5, or Gemini 2.x ships, we run a test suite against your production prompts in staging, measure accuracy and cost deltas, and propose a migration plan. Upgrades happen on your schedule behind a feature flag, not the vendor's release calendar.
Foundation model vendors ship new versions multiple times a year. Most are not drop-in improvements. A prompt that scored 94% on Claude 3.5 Sonnet can drop to 87% on Claude 4.6 because the new model is more conservative about hedging or interprets system prompts differently.
- Shadow test. Run the new model in parallel against a sample of production traffic, score outputs against the baseline.
- Regression review. Surface accuracy, latency, cost, and output-format deltas in a written brief.
- Staged rollout. Upgrades go live behind a feature flag — 10% of traffic, then 50%, then 100% over 1-2 weeks depending on risk.
- Rollback path. The previous model stays behind a flag for at least 30 days. If a regression surfaces, we flip back inside minutes.
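The staged-rollout step is commonly implemented as deterministic hash bucketing, so a given request always lands on the same side of the flag across retries. A minimal sketch with placeholder model IDs, not our production flag system:

```python
import hashlib

def pick_model(request_id: str, rollout_pct: int,
               candidate: str = "model-next",
               baseline: str = "model-current") -> str:
    """Route a fixed percentage of traffic to the candidate model.

    Hashes the request ID into a 0-99 bucket; the same request always
    lands in the same bucket, so rollout is deterministic, and moving
    10% -> 50% -> 100% only turns the `rollout_pct` knob.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return candidate if bucket < rollout_pct else baseline
```

Rolling back is the same knob turned to zero, which is why the previous model stays flagged for at least 30 days.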
Performance Tuning.
Performance tuning is the ongoing optimization of prompts, agent definitions, retrieval configs, and routing logic based on production telemetry. Each month we review failure modes, latency outliers, and accuracy drift, then ship tuned versions through the same staged rollout as model upgrades. Tuning often delivers 20-40% latency or cost reduction without accuracy trade-offs when the baseline has not been touched since launch.
Production AI systems degrade silently. A prompt that worked on 1,000 users breaks on 100,000 as the query distribution shifts. A RAG pipeline returns stale citations because the index has not been rebuilt. A tool-calling agent drops from 95% to 82% accuracy because a new tool was added without retuning the selection prompt.
Tuning ships through the same staged rollout as model upgrades. Typical monthly work: prompt restructuring, retrieval chunk-size adjustments, tool-description rewrites, few-shot rotation, and cache-key normalization. The agent architecture we put in place at initial deployment makes these changes safe — agents are versioned, configs are code, rollouts are gated.
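Cache-key normalization, for instance, means two prompts that differ only in whitespace or variable ordering should resolve to the same cache entry. The function below is an illustrative sketch, not our production implementation:

```python
import hashlib
import json
import re

def cache_key(system_prompt: str, user_vars: dict) -> str:
    """Build a stable cache key for a prompt plus its variables.

    Collapses whitespace and sorts variable keys so cosmetic
    differences do not cause cache misses. Real normalization is
    model- and application-specific; this is only the shape of it.
    """
    prompt = re.sub(r"\s+", " ", system_prompt).strip()
    payload = json.dumps({"prompt": prompt, "vars": user_vars}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```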
Observability.
Observability covers what we monitor and what we surface to you. We monitor token usage, latency percentiles (p50/p95/p99), error rates, cache hit rates, and cost per transaction. These land in CloudWatch and custom dashboards on your AWS account, with alarms routed to the on-call rotation. You see the same data we do — we do not hide operational reality behind a monthly summary PDF.
Observability is not optional. If you cannot see what your AI system is doing, you cannot operate it. Every retainer ships with a baseline layer built on CloudWatch, structured logs, and a custom dashboard tailored to your stack.
- Token usage by agent, route, tenant, and model
- Latency percentiles (p50, p95, p99) for every AI call
- Error rates segmented by model error vs application error
- Cache hit rates for prompt caching and retrieval caching
- Cost per transaction (or per session, or per user depending on pricing model)
- Retrieval quality (recall, precision, staleness) for RAG pipelines
- Agent step counts and tool-call patterns
- Rate-limit and throttling events from upstream model providers
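One way these metrics can reach CloudWatch without extra API calls is the Embedded Metric Format: a structured log line that CloudWatch converts into custom metrics. The namespace and dimension names below are illustrative, not our fixed schema:

```python
import json
import time

def emf_token_metric(agent: str, route: str,
                     input_tokens: int, output_tokens: int) -> str:
    """Render one CloudWatch Embedded Metric Format log line.

    When written to a Lambda or agent log stream, CloudWatch extracts
    InputTokens/OutputTokens as metrics dimensioned by Agent and Route.
    "AIOps/Tokens" is an assumed namespace for this sketch.
    """
    doc = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "AIOps/Tokens",
                "Dimensions": [["Agent", "Route"]],
                "Metrics": [
                    {"Name": "InputTokens", "Unit": "Count"},
                    {"Name": "OutputTokens", "Unit": "Count"},
                ],
            }],
        },
        "Agent": agent,
        "Route": route,
        "InputTokens": input_tokens,
        "OutputTokens": output_tokens,
    }
    return json.dumps(doc)
```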
Dashboards live on your AWS account with IAM scoped to your team. You own the data, we operate it. The same pattern runs behind our data sovereignty stance — nothing routes through our infrastructure by default.
How It Differs From a Typical MSP.
A traditional managed service provider handles known workloads with playbooks. AI infrastructure is not a known workload. Models change, prompts drift, retrieval degrades, and token costs move daily. We operate the systems we architected — the on-call engineer wrote the retrieval pipeline, not a ticket-queue tech following a runbook. That is why this service only applies to systems we have either built or fully audited.
Most MSPs were built for stable problems: the stack does not change, the workload does not change, last year's playbook still works. AI infrastructure in 2026 is none of those things. Frontier models ship quarterly, prompt behavior shifts with every version, token costs move daily, and retrieval quality depends on index freshness and embedding drift.
We only take on post-deployment ops for systems we have either built end-to-end or completed a full architecture audit on. We are not a help-desk bolted onto someone else's stack. The engineer picking up your P1 call is familiar with the Apex or Nexus patterns your system is built on — those are the same patterns running across the platforms we operate every day.
The retainer includes adjacent scope a traditional MSP would not touch. Model deprecation migrations, prompt retuning when business logic shifts, observability integration for new agents and retrieval sources — all part of the ongoing service. We pair this with training and enablement, change management, and internal tooling as needed, so your team is not locked out of the system we are operating.
When something breaks in production, someone here fixes it before you page us. That is the commitment. The SLAs, the dashboards, the monthly reports — all of it is evidence that the commitment is real.