
The 9 KPIs That Measure Whether Your AI Agents Are Actually Working
Traditional marketing KPIs (CAC, ROAS, LTV) measure outcomes. They miss whether the agent stack producing those outcomes is healthy. This post covers the nine KPIs that diagnose your agent stack across productivity, quality, and leverage, with an audit method for each and benchmarks by brand size.
Pull your current marketing dashboard. Count the KPIs. CAC, ROAS, LTV, retention rate, NPS, channel attribution, contribution margin. All of them measure outcomes. None of them measure whether the agent stack producing those outcomes is healthy. Two brands can hit the same CAC with two completely different operational states: one with a clean agent stack producing high-velocity work, one with a fragile stack chasing the same outcome through brute-force operator hours. The outcome KPIs cannot tell you which one you are.
The 3-human-plus-7-agent org from Post 13 introduced a different layer of work that the traditional dashboard does not see. This post is the measurement layer that does. Nine KPIs across three buckets, each one diagnostic, each one auditable in a Monday-morning standup. The buckets are productivity (what the stack ships), quality (whether the work meets the brand bar), and leverage (whether humans are freed up). Stack health is the joint state across all three. Outcomes follow.
Why the old KPIs miss it
Three structural reasons. First, traditional KPIs are outcome-side; agent stacks operate input-side. CAC tells you what an acquisition costs but not whether the campaigns producing it are running at half throughput because the Creative Strategy Agent is missing a brand brief. The outcome reads fine; the input layer is broken.
Second, traditional KPIs assume execution scales with headcount. The stack-health question is whether you are producing more output with fewer human hours, which is invisible to a dashboard that does not track the human-hour denominator. Operator overhead is the metric that catches the case where agent output looks healthy but the Operator is spending forty hours a week keeping it that way.
Third, traditional KPIs are weekly or monthly; agent-stack drift happens in days. A two-week lag on noticing that human review acceptance dropped from 80% to 50% means two weeks of operator time absorbing the gap, two weeks of shipped output the Brand Steward did not actually approve, two weeks of compounding rejection memory in the buyer-agent layer covered in Post 12. The instrumentation cadence has to match the failure cadence.
The 9 KPIs, organized by what they diagnose
Each KPI maps to one bucket. Productivity KPIs answer: is the stack shipping. Quality KPIs answer: is the work good. Leverage KPIs answer: are humans actually freed up. Stack health requires all three; chasing one in isolation produces the failure mode where the other two silently regress.
The 9 KPIs, at a glance
Each KPI is tagged with the bucket whose question it answers: productivity, quality, or leverage. Chase one in isolation and the other two silently regress.
01. Variants per brief per hour (Productivity): Creative Strategy throughput against a clear brand brief.
02. Iteration velocity (Productivity): Brief-to-launch hours, end-to-end across the stack.
03. Cost per output unit (Productivity): Per variant, per email, per ad, per report.
04. Human review acceptance rate (Quality): Percent of agent outputs shipped unchanged.
05. Anomaly recovery time (Quality): Hours from incident detection to resolution.
06. Post-launch rollback rate (Quality): Percent of shipped output reverted within 7 days.
07. Operator overhead (Leverage): Operator hours per week on agent supervision.
08. Strategy Director ratio (Leverage): Time spent on strategy vs. on stack housekeeping.
09. Year-over-year human-hour efficiency (Leverage): Output per human-hour delta vs. prior year.
01. Variants per brief per hour (Productivity)
The throughput signal for the Creative Strategy Agent. Given a clear brand brief, how many testable variants come back per hour of agent runtime. The brief specificity is the variable that moves this most; a brief that names the channel, the audience, and the constraint envelope will produce three to four times the throughput of a brief that asks for general direction. Audit: time the next concept brief from kickoff to first variant batch; divide variant count by hours elapsed. Benchmark: 25 to 40 variants per hour for a tightly-briefed concept; below 10 means the brief is the bottleneck, not the agent.
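The audit arithmetic is small enough to sketch. The function names and band labels below are illustrative assumptions; only the thresholds (below 10, 25 to 40) come from the benchmark above, and the 10-to-25 band is labeled here simply as "below benchmark".

```python
def variants_per_hour(variant_count: int, hours_elapsed: float) -> float:
    """Variants returned per hour between kickoff and the first variant batch."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return variant_count / hours_elapsed


def read_throughput(rate: float) -> str:
    """Map a rate onto the benchmark bands; the labels are illustrative."""
    if rate < 10:
        return "brief is the likely bottleneck"
    if rate < 25:
        return "below benchmark"
    return "at or above benchmark"


# Hypothetical audit: 60 variants came back 2 hours after kickoff.
rate = variants_per_hour(variant_count=60, hours_elapsed=2.0)
print(rate, read_throughput(rate))
```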
02. Iteration velocity (Productivity)
End-to-end hours from approved brief to shipped output. Captures the full stack rhythm, not just the agent's solo throughput. Includes human review gates, the Brand Steward's variant selection, the Campaign Execution handoff. Audit: log timestamps at brief approval, first variant batch, variant selection, launch ready, launch live. Calculate elapsed time across all five. Benchmark: 4 to 8 hours for a single-channel campaign; 1 to 2 days for a multi-channel sequence. If you are at 2 weeks, the bottleneck is human review cadence, not agent throughput.
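A minimal sketch of the five-timestamp audit, assuming each campaign's log is a plain dictionary keyed by stage. The stage names and the example timestamps are hypothetical. Breaking the total into per-stage gaps is what shows whether the delay sits with the agent or with a human review gate.

```python
from datetime import datetime

# The five audit points, in order. The key names are assumptions.
STAGES = [
    "brief_approved",
    "first_variant_batch",
    "variant_selected",
    "launch_ready",
    "launch_live",
]


def iteration_velocity_hours(timestamps: dict[str, datetime]) -> float:
    """End-to-end hours from approved brief to live launch."""
    elapsed = timestamps["launch_live"] - timestamps["brief_approved"]
    return elapsed.total_seconds() / 3600


def stage_durations(timestamps: dict[str, datetime]) -> dict[str, float]:
    """Hours spent between each consecutive pair of stages."""
    durations = {}
    for start, end in zip(STAGES, STAGES[1:]):
        delta = timestamps[end] - timestamps[start]
        durations[f"{start} -> {end}"] = delta.total_seconds() / 3600
    return durations


# Hypothetical single-channel campaign logged on one day.
example = {
    "brief_approved": datetime(2024, 1, 8, 9, 0),
    "first_variant_batch": datetime(2024, 1, 8, 10, 0),
    "variant_selected": datetime(2024, 1, 8, 12, 30),
    "launch_ready": datetime(2024, 1, 8, 14, 0),
    "launch_live": datetime(2024, 1, 8, 15, 0),
}

total_hours = iteration_velocity_hours(example)
gaps = stage_durations(example)
slowest_stage = max(gaps, key=gaps.get)
print(total_hours, slowest_stage)
```

In this made-up log the slowest gap is variant selection, a human gate, which matches the pattern the benchmark warns about.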
03. Cost per output unit (Productivity)
Total cost (agent platform + human review time + tooling) divided by output units shipped. Normalize per output type rather than averaging across heterogeneous units. Audit: segment by output type (variant, email, ad creative, report). Pull platform cost + Operator hourly rate × supervision hours + Brand Steward hourly rate × review hours. Divide by units shipped that month. Benchmark: cost per ad variant should be sub-$5 at scale; cost per lifecycle email under $1; cost per weekly report under $25. If your numbers are 3x these, the cause is usually Operator supervision overhead, which is KPI 07's territory.
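The audit formula translates directly into code. This is a sketch under stated assumptions: the parameter names are invented here, the dollar figures are illustrative, and each input is assumed to be the share of cost attributable to one output type for the month.

```python
def cost_per_unit(
    platform_cost: float,
    operator_rate: float,
    supervision_hours: float,
    steward_rate: float,
    review_hours: float,
    units_shipped: int,
) -> float:
    """Platform cost plus human time, divided by units shipped, for one output type."""
    if units_shipped <= 0:
        raise ValueError("units_shipped must be positive")
    total = (
        platform_cost
        + operator_rate * supervision_hours
        + steward_rate * review_hours
    )
    return total / units_shipped


# Hypothetical month of ad variants: $400 platform share, 10 Operator
# hours at $60, 5 Brand Steward hours at $80, 350 variants shipped.
ad_variant_cost = cost_per_unit(400.0, 60.0, 10.0, 80.0, 5.0, 350)
print(ad_variant_cost)  # compare against the sub-$5 benchmark
```

Running it once per output type, rather than once across everything shipped, keeps the result comparable to the per-type benchmarks.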
04. Human review acceptance rate (Quality)
Percent of agent-produced outputs the human (Brand Steward, Operator, or Strategy Director) ships without modification. The most reliable quality signal because it captures both objective quality and brand-fit judgment in one number. Audit: for the next 30 outputs across all agents, log shipped-as-is vs shipped-with-edits vs rejected. Benchmark: 60 to 75% acceptance is healthy for creative agents; 80%+ for non-creative agents (reporting, attribution). Below 50% means the brand brief is too vague or the agent is misconfigured.
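One way to tally the 30-output audit, assuming each logged output carries one of three labels. The label strings and function names are hypothetical; the 60% and 80% floors are the benchmark values above.

```python
from collections import Counter


def acceptance_rate(outcomes: list[str]) -> float:
    """Share of logged outputs that shipped without modification."""
    if not outcomes:
        raise ValueError("no outcomes logged")
    counts = Counter(outcomes)
    return counts["as_is"] / len(outcomes)


def is_healthy(rate: float, creative: bool) -> bool:
    """Apply the 60% floor for creative agents and 80% for non-creative ones."""
    return rate >= (0.60 if creative else 0.80)


# Hypothetical audit of 30 outputs from a creative agent.
audit_log = ["as_is"] * 20 + ["edited"] * 7 + ["rejected"] * 3
creative_rate = acceptance_rate(audit_log)
print(round(creative_rate, 3), is_healthy(creative_rate, creative=True))
```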
05. Anomaly recovery time (Quality)
Hours from incident detection (agent output drift, integration break, metric anomaly) to resolution. This is the metric that catches whether your stack has institutional memory of failure modes or whether every incident is solved from scratch. Audit: maintain a log of incidents with detect time and resolve time. Calculate median and 90th percentile across the last quarter. Benchmark: sub-4-hour median for routine incidents, sub-24-hour for novel ones. If 90th percentile is over a week, the cause is usually missing runbooks rather than agent fragility.
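A sketch of the quarterly calculation using Python's standard `statistics` module, assuming the incident log has already been reduced to recovery times in hours (resolve time minus detect time). The sample values are invented. The "inclusive" quantile method is one reasonable choice; other percentile conventions give slightly different numbers on small logs.

```python
import statistics


def recovery_stats(recovery_hours: list[float]) -> tuple[float, float]:
    """Return (median, 90th percentile) of incident recovery times in hours."""
    if len(recovery_hours) < 2:
        raise ValueError("need at least two incidents")
    median = statistics.median(recovery_hours)
    # quantiles(n=10) returns nine cut points; the last is the 90th percentile.
    p90 = statistics.quantiles(recovery_hours, n=10, method="inclusive")[-1]
    return median, p90


# Hypothetical quarter: mostly routine incidents plus two slow ones.
quarter_log = [1, 1.5, 2, 2, 3, 3.5, 4, 5, 8, 30, 200]
median_hours, p90_hours = recovery_stats(quarter_log)

# Flag the "over a week" case that usually points to missing runbooks.
runbook_gap = p90_hours > 7 * 24
print(median_hours, p90_hours, runbook_gap)
```

Reporting both numbers matters because a single 200-hour incident barely moves the median but would dominate a simple average.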
06. Post-launch rollback rate (Quality)
Percent of shipped output reverted within 7 days. Captures the failures the review gates missed. A rollback rate higher than the share of outputs edited or rejected at review (100% minus the KPI 04 acceptance rate) means review gates are passing work that later fails on the actual channel. Audit: track every revert (campaign pause, email recall, content takedown) and divide by total shipped output for the same period. Benchmark: under 3% rollback rate; under 1% if creative is consistent. Above 5% means review gates are insufficiently rigorous or the agents are shipping during low-confidence windows.
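The 7-day window is the part that is easy to get wrong by hand, so here is a minimal sketch. It assumes each shipped item is logged as a pair of ship time and revert time, with `None` when nothing was reverted; that record shape and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

ShippedItem = tuple[datetime, Optional[datetime]]  # (shipped_at, reverted_at)


def rollback_rate(items: list[ShippedItem], window_days: int = 7) -> float:
    """Share of shipped items reverted within the window after shipping."""
    if not items:
        raise ValueError("no shipped items logged")
    window = timedelta(days=window_days)
    reverted = sum(
        1
        for shipped_at, reverted_at in items
        if reverted_at is not None and reverted_at - shipped_at <= window
    )
    return reverted / len(items)


# Hypothetical period: 100 items shipped, two reverted after 3 days,
# one reverted after 10 days (outside the window, so not counted).
base = datetime(2024, 3, 1)
shipped = (
    [(base, None)] * 97
    + [(base, base + timedelta(days=3))] * 2
    + [(base, base + timedelta(days=10))]
)
rate_7d = rollback_rate(shipped)
print(rate_7d)  # compare against the 3% and 5% thresholds
```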
07. Operator overhead (Leverage)
Operator hours per week on agent supervision. The leverage signal that decides whether the agent stack is actually saving human time or just relocating it. Audit: Operator self-reports time spent across configuration, debugging, escalation handling, ad-hoc supervision. Tracked weekly. Benchmark: 15 to 25 hours per week at steady state; 30 to 40 in the first three months as the stack onboards. If you are still at 35+ hours at month six, you are subsidizing agent fragility with operator hours, which inverts the economic case for the stack.
08. Strategy Director ratio (Leverage)
Time the Strategy Director spends on strategy versus on stack housekeeping. The metric that catches the case where the Director is doing Operator work because the Operator role is under-resourced. Audit: the Director self-reports a weekly time split across strategy, Operator overflow, and external work (sales calls, partnerships, etc.). Benchmark: 70% strategy, 10% Operator overflow, 20% external at a healthy steady state. If Operator overflow is consistently above 25%, hire or expand the Operator role; the Director's marginal hour is more valuable than the cost of expanding Operator capacity.
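A small sketch of turning self-reported hours into the split and checking the 25% trigger. The category keys are hypothetical, and reading "consistently" as "every logged week" is an assumption; a team might prefer a rolling average instead.

```python
def time_split(hours: dict[str, float]) -> dict[str, float]:
    """Convert self-reported weekly hours into shares of the total."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("total hours must be positive")
    return {category: value / total for category, value in hours.items()}


def needs_operator_capacity(
    weekly_hours: list[dict[str, float]], threshold: float = 0.25
) -> bool:
    """True if the Operator-overflow share exceeds the threshold every week."""
    if not weekly_hours:
        return False
    return all(
        time_split(week)["operator_overflow"] > threshold
        for week in weekly_hours
    )


# Hypothetical Director log: three weeks with the same 40-hour split.
week = {"strategy": 20.0, "operator_overflow": 12.0, "external": 8.0}
split = time_split(week)
expand_operator_role = needs_operator_capacity([week, week, week])
print(split, expand_operator_role)
```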
09. Year-over-year human-hour efficiency (Leverage)
Output per human-hour this year divided by output per human-hour last year. The compounding metric. The argument for the agent stack is that it produces a flywheel of efficiency gains that scale faster than headcount additions; this KPI measures whether that flywheel is actually spinning. Audit: total shipped output (normalize per output type, sum to a composite) divided by total marketing human-hours, year over year. Benchmark: 1.5x to 2.5x year-one improvement; 1.2x to 1.5x year-two improvement; 1.1x+ each subsequent year. Below 1.0x means the stack is decaying faster than it is compounding.
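The composite is the only fiddly step, so a sketch may help. It assumes each output type gets a fixed weight so that heterogeneous units can be summed; the weights, counts, and hour totals below are invented for illustration, and choosing real weights is a judgment call the post leaves to the team.

```python
# Hypothetical weights expressing each output type in common units.
WEIGHTS = {"variant": 1.0, "email": 2.0, "report": 10.0}


def composite_output(counts: dict[str, int]) -> float:
    """Weighted sum of shipped output across types."""
    return sum(WEIGHTS[kind] * count for kind, count in counts.items())


def output_per_hour(counts: dict[str, int], human_hours: float) -> float:
    """Composite output divided by total marketing human-hours."""
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    return composite_output(counts) / human_hours


def yoy_efficiency(this_year: float, last_year: float) -> float:
    """Ratio of this year's output per human-hour to last year's."""
    if last_year <= 0:
        raise ValueError("last_year must be positive")
    return this_year / last_year


last = output_per_hour({"variant": 1000, "email": 200, "report": 50}, 3800.0)
this = output_per_hour({"variant": 3000, "email": 400, "report": 50}, 4300.0)
ratio = yoy_efficiency(this, last)
print(ratio)  # compare against the 1.5x to 2.5x year-one benchmark
```

Keeping the weights fixed across both years is what makes the ratio meaningful; changing them mid-comparison would move the KPI without any change in the stack.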
What good looks like at scale
Benchmarks shift by brand size. The same KPI value reads as healthy at a $5M brand and concerning at a $50M brand because the volume of work the stack is producing scales with brand revenue. Use the table below as directional, not prescriptive; calibrate against your own actual baseline before drawing conclusions.
Benchmarks by brand size
Directional, not prescriptive. Calibrate against your own baseline before drawing conclusions.
How to instrument the 9 KPIs
Instrumentation is the work most operator teams put off because the framework above is more interesting than the plumbing. Resist the urge. Five concrete steps get you to a usable dashboard inside two weeks.
The 2-week instrumentation
No new tooling required for the first three steps. Operator handles the lift.
Spreadsheet baseline for KPIs 01, 02, 04, 06
Day 1: Manual logging of timestamps and outcomes per output. Crude but reliable. Run for two weeks to set the baseline.
Operator time-tracking for KPIs 07, 08
Day 1: Operator + Strategy Director self-report weekly time splits. Use whatever calendar tool the team uses; do not over-engineer.
Incident log for KPI 05
Day 1: Single shared doc. Every anomaly gets detect time + resolve time. Median + 90th percentile calculated monthly.
Cost-per-unit calculation for KPI 03
Week 1: Pull agent platform invoice, divide by output count by type. Add operator and Brand Steward hourly cost. Refresh monthly.
YoY composite for KPI 09
Annual: Sum normalized output by type, divide by total human-hours, compare against same period prior year. Annual cadence, not monthly.
The dashboard you end up with after two weeks is not pretty. It is also not the point. The point is having the numbers at the cadence that matches the failure cadence. Pretty dashboards come later, after the baseline data tells you which KPIs deserve the visualization investment. Many brands skip directly to a polished tool and find they were instrumenting the wrong KPIs because they had not run the manual baseline first.
Cresva surfaces all nine KPIs continuously, with anomaly alerts on the ones that move first when something breaks. Productivity, quality, and leverage instrumented from day one. Spreadsheet baselines for the manual KPIs, automated tracking for the operational ones, year-over-year composites for the leverage flywheel.