
The 9 KPIs That Measure Whether Your AI Agents Are Actually Working

Traditional marketing KPIs (CAC, ROAS, LTV) measure outcomes. They miss whether the agent stack producing those outcomes is healthy. These nine KPIs diagnose your agent stack across productivity, quality, and leverage, each with an audit and benchmarks by brand size.

11 min read | Strategy

Pull your current marketing dashboard. Count the KPIs. CAC, ROAS, LTV, retention rate, NPS, channel attribution, contribution margin. All of them measure outcomes. None of them measure whether the agent stack producing those outcomes is healthy. Two brands can hit the same CAC with two completely different operational states: one with a clean agent stack producing high-velocity work, one with a fragile stack chasing the same outcome through brute-force operator hours. The outcome KPIs cannot tell you which one you are.

The 3-human-plus-7-agent org from Post 13 introduced a different layer of work that the traditional dashboard does not see. This post is the measurement layer that does. Nine KPIs across three buckets, each one diagnostic, each one auditable in a Monday-morning standup. The buckets are productivity (what the stack ships), quality (whether the work meets the brand bar), and leverage (whether humans are freed up). Stack health is the joint state across all three. Outcomes follow.

Why the old KPIs miss it

Three structural reasons. First, traditional KPIs are outcome-side; agent stacks operate input-side. CAC tells you what an acquisition costs but not whether the campaigns producing it are running at half throughput because the Creative Strategy Agent is missing a brand brief. The outcome reads fine; the input layer is broken.

Second, traditional KPIs assume execution scales with headcount. The stack-health question is whether you are producing more output with fewer human hours, which is invisible to a dashboard that does not track the human-hour denominator. Operator overhead is the metric that catches the case where agent output looks healthy but the Operator is spending forty hours a week keeping it that way.

Third, traditional KPIs are weekly or monthly; agent-stack drift happens in days. A two-week lag on noticing that human review acceptance dropped from 80% to 50% means two weeks of operator time absorbing the gap, two weeks of shipped output the Brand Steward did not actually approve, two weeks of compounding rejection memory in the buyer-agent layer covered in Post 12. The instrumentation cadence has to match the failure cadence.

The 9 KPIs, organized by what they diagnose

Each KPI maps to one bucket. Productivity KPIs answer: is the stack shipping. Quality KPIs answer: is the work good. Leverage KPIs answer: are humans actually freed up. Stack health requires all three; chasing one in isolation produces the failure mode where the other two silently regress.

The 9 KPIs, at a glance

Each bucket label shows which question the KPI answers: productivity, quality, or leverage. Chase one in isolation and the other two silently regress.

01. Variants per brief per hour (Productivity): Creative Strategy throughput against a clear brand brief.
02. Iteration velocity (Productivity): Brief-to-launch hours, end-to-end across the stack.
03. Cost per output unit (Productivity): Per variant, per email, per ad, per report.
04. Human review acceptance rate (Quality): Percent of agent outputs shipped unchanged.
05. Anomaly recovery time (Quality): Hours from incident detection to resolution.
06. Post-launch rollback rate (Quality): Percent of shipped output reverted within 7 days.
07. Operator overhead (Leverage): Operator hours per week on agent supervision.
08. Strategy Director ratio (Leverage): Time spent on strategy vs. on stack housekeeping.
09. Year-over-year human-hour efficiency (Leverage): Output per human-hour delta vs. prior year.

01. Variants per brief per hour (Productivity)

The throughput signal for the Creative Strategy Agent. Given a clear brand brief, how many testable variants come back per hour of agent runtime? The brief specificity is the variable that moves this most; a brief that names the channel, the audience, and the constraint envelope will produce three to four times the throughput of a brief that asks for general direction. Audit: time the next concept brief from kickoff to first variant batch; divide variant count by hours elapsed. Benchmark: 25 to 40 variants per hour for a tightly briefed concept; below 10 means the brief is the bottleneck, not the agent.
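
The audit above is a single division, which a few lines of Python can pin down. This is a minimal sketch, not a prescribed implementation: the function name `variants_per_hour`, its parameters, and the sample timestamps are all hypothetical.

```python
from datetime import datetime


def variants_per_hour(kickoff: datetime, first_batch: datetime, variant_count: int) -> float:
    """Variants delivered per elapsed hour between kickoff and the first batch."""
    elapsed_hours = (first_batch - kickoff).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("first_batch must be later than kickoff")
    return variant_count / elapsed_hours


# Hypothetical brief: kickoff at 09:00, first batch of 48 variants at 10:30.
rate = variants_per_hour(
    kickoff=datetime(2025, 3, 3, 9, 0),
    first_batch=datetime(2025, 3, 3, 10, 30),
    variant_count=48,
)
print(f"{rate:.1f} variants per hour")  # prints "32.0 variants per hour"
```

In this made-up case the result lands inside the 25 to 40 band quoted above.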

02. Iteration velocity (Productivity)

End-to-end hours from approved brief to shipped output. Captures the full stack rhythm, not just the agent's solo throughput. Includes human review gates, the Brand Steward's variant selection, the Campaign Execution handoff. Audit: log timestamps at brief approval, first variant batch, variant selection, launch ready, launch live. Calculate elapsed time across all five. Benchmark: 4 to 8 hours for a single-channel campaign; 1 to 2 days for a multi-channel sequence. If you are at 2 weeks, the bottleneck is human review cadence, not agent throughput.
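
Logging the five timestamps also shows which handoff eats the time, not just the total. The sketch below assumes the log is a simple mapping from stage name to timestamp; the stage keys and the sample day are illustrative, not taken from any particular tool.

```python
from datetime import datetime

# Stage names are hypothetical labels for the five timestamps in the audit.
STAGES = [
    "brief_approved",
    "first_variant_batch",
    "variant_selected",
    "launch_ready",
    "launch_live",
]


def stage_hours(log: dict[str, datetime]) -> dict[str, float]:
    """Hours between each consecutive pair of stages, plus the end-to-end total."""
    hours = {}
    for start, end in zip(STAGES, STAGES[1:]):
        hours[f"{start} -> {end}"] = (log[end] - log[start]).total_seconds() / 3600
    hours["total"] = (log[STAGES[-1]] - log[STAGES[0]]).total_seconds() / 3600
    return hours


# Hypothetical single-channel campaign, all on one day.
log = {
    "brief_approved": datetime(2025, 3, 3, 9, 0),
    "first_variant_batch": datetime(2025, 3, 3, 10, 0),
    "variant_selected": datetime(2025, 3, 3, 13, 0),
    "launch_ready": datetime(2025, 3, 3, 14, 30),
    "launch_live": datetime(2025, 3, 3, 15, 0),
}
result = stage_hours(log)
for stage, spent in result.items():
    print(f"{stage}: {spent:.1f} h")
```

Here the total is 6 hours, and half of it sits between the first variant batch and variant selection, which is the human review gate.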

03. Cost per output unit (Productivity)

Total cost (agent platform + human review time + tooling) divided by output units shipped. Normalize per output type rather than averaging across heterogeneous units. Audit: segment by output type (variant, email, ad creative, report). Pull platform cost + Operator hourly rate × supervision hours + Brand Steward hourly rate × review hours. Divide by units shipped that month. Benchmark: cost per ad variant should be sub-$5 at scale; cost per lifecycle email under $1; cost per weekly report under $25. If your numbers are 3x these, the cause is usually Operator supervision overhead, which is KPI 07's territory.
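
The cost formula in the audit can be written out directly. A minimal sketch follows, assuming you have already allocated platform cost and human hours to a single output type; the function name, hourly rates, and monthly figures are invented for illustration.

```python
def cost_per_unit(
    platform_cost: float,
    operator_rate: float,
    supervision_hours: float,
    steward_rate: float,
    review_hours: float,
    units_shipped: int,
) -> float:
    """Platform cost plus human supervision and review cost, divided by units shipped."""
    if units_shipped <= 0:
        raise ValueError("units_shipped must be positive")
    total = (
        platform_cost
        + operator_rate * supervision_hours
        + steward_rate * review_hours
    )
    return total / units_shipped


# Hypothetical month of ad variants only: costs already allocated to this output type.
per_variant = cost_per_unit(
    platform_cost=1200,
    operator_rate=60,
    supervision_hours=20,
    steward_rate=80,
    review_hours=10,
    units_shipped=800,
)
print(f"${per_variant:.2f} per ad variant")  # prints "$4.00 per ad variant"
```

Run it once per output type rather than once across everything, so the per-type numbers stay comparable to the benchmarks above.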

04. Human review acceptance rate (Quality)

Percent of agent-produced outputs the human (Brand Steward, Operator, or Strategy Director) ships without modification. The most reliable quality signal because it captures both objective quality and brand-fit judgment in one number. Audit: for the next 30 outputs across all agents, log shipped-as-is vs shipped-with-edits vs rejected. Benchmark: 60 to 75% acceptance is healthy for creative agents; 80%+ for non-creative agents (reporting, attribution). Below 50% means the brand brief is too vague or the agent is misconfigured.
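
The 30-output audit reduces to a tally. The sketch below assumes each logged output carries one of three outcome labels; the label strings and the sample counts are hypothetical.

```python
from collections import Counter

# Hypothetical labels for the three outcomes named in the audit.
OUTCOMES = ("shipped_as_is", "shipped_with_edits", "rejected")


def acceptance_rate(log: list[str]) -> float:
    """Share of logged outputs that shipped without modification."""
    if not log:
        raise ValueError("log is empty")
    unknown = set(log) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    return Counter(log)["shipped_as_is"] / len(log)


# Hypothetical audit of 30 outputs.
log = ["shipped_as_is"] * 20 + ["shipped_with_edits"] * 7 + ["rejected"] * 3
rate = acceptance_rate(log)
print(f"{rate:.0%} acceptance")  # prints "67% acceptance"
```

Only unmodified outputs count as accepted; shipped-with-edits is logged separately so the edit load stays visible.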

05. Anomaly recovery time (Quality)

Hours from incident detection (agent output drift, integration break, metric anomaly) to resolution. This is the metric that catches whether your stack has institutional memory of failure modes or whether every incident is solved from scratch. Audit: maintain a log of incidents with detect time and resolve time. Calculate median and 90th percentile across the last quarter. Benchmark: sub-4-hour median for routine incidents, sub-24-hour for novel ones. If 90th percentile is over a week, the cause is usually missing runbooks rather than agent fragility.
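
Median and 90th percentile are quick to compute from the incident log. This sketch uses the nearest-rank definition of a percentile, which is one convention among several and is an assumption here; the recovery times are invented.

```python
import math
from statistics import median


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values is empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical quarter: hours from detect time to resolve time for ten incidents.
recovery_hours = [1, 2, 2, 3, 3, 4, 5, 6, 20, 30]
med = median(recovery_hours)
p90 = nearest_rank_percentile(recovery_hours, 90)
print(f"median {med} h, p90 {p90} h")  # prints "median 3.5 h, p90 20 h"
```

The gap between the two numbers is the point: a healthy median can hide a long tail of novel incidents.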

06. Post-launch rollback rate (Quality)

Percent of shipped output reverted within 7 days. Captures the failures the review gates missed. A rollback rate higher than the share of outputs reviewers edit or reject (the inverse of the acceptance rate in KPI 04) means review gates are passing work that later fails on the actual channel. Audit: track every revert (campaign pause, email recall, content takedown) and divide by total shipped output for the same period. Benchmark: under 3% rollback rate; under 1% if creative is consistent. Above 5% means review gates are insufficiently rigorous or the agents are shipping during low-confidence windows.
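
The 7-day window is the only subtle part of this calculation. A minimal sketch, assuming each shipped output is recorded as a pair of ship time and optional revert time; the record shape and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def rollback_rate(
    shipped: list[tuple[datetime, Optional[datetime]]],
    window_days: int = 7,
) -> float:
    """Share of shipped outputs reverted within window_days of shipping."""
    if not shipped:
        raise ValueError("no shipped outputs")
    window = timedelta(days=window_days)
    reverted = sum(
        1
        for shipped_at, reverted_at in shipped
        if reverted_at is not None and reverted_at - shipped_at <= window
    )
    return reverted / len(shipped)


# Hypothetical period: 40 outputs, one reverted after 2 days, one after 10 days.
t0 = datetime(2025, 3, 3, 9, 0)
records = [(t0, None)] * 38 + [
    (t0, t0 + timedelta(days=2)),
    (t0, t0 + timedelta(days=10)),
]
rate = rollback_rate(records)
print(f"{rate:.1%} rollback rate")  # prints "2.5% rollback rate"
```

The revert on day 10 falls outside the window and does not count toward this KPI.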

07. Operator overhead (Leverage)

Operator hours per week on agent supervision. The leverage signal that decides whether the agent stack is actually saving human time or just relocating it. Audit: Operator self-reports time spent across configuration, debugging, escalation handling, ad-hoc supervision. Tracked weekly. Benchmark: 15 to 25 hours per week at steady state; 30 to 40 in the first three months as the stack onboards. If you are still at 35+ hours at month six, you are subsidizing agent fragility with operator hours, which inverts the economic case for the stack.

08. Strategy Director ratio (Leverage)

Time the Strategy Director spends on strategy versus on stack housekeeping. The metric that catches the case where the Director is doing Operator work because the Operator role is under-resourced. Audit: Director self-reports weekly time split: strategy / Operator overflow / external (sales calls, partnerships, etc). Benchmark: 70% strategy, 10% Operator overflow, 20% external at a healthy steady state. If Operator overflow is consistently above 25%, hire or expand the Operator role; the Director's marginal hour is more valuable than the cost of expanding Operator capacity.
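
Turning the weekly self-report into shares takes one small helper. The sketch below assumes the Director reports hours in three categories; the category keys and the sample week are made up.

```python
def time_split(hours: dict[str, float]) -> dict[str, float]:
    """Convert reported hours per category into shares of the total."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("no hours reported")
    return {category: spent / total for category, spent in hours.items()}


# Hypothetical self-reported week for the Strategy Director.
week = {"strategy": 28, "operator_overflow": 4, "external": 8}
shares = time_split(week)
# One week above 25% is a data point; the text's trigger is "consistently above".
overflow_flag = shares["operator_overflow"] > 0.25

for category, share in shares.items():
    print(f"{category}: {share:.0%}")
print("overflow above 25%:", overflow_flag)
```

This sample week matches the 70 / 10 / 20 split described above, so the flag stays off.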

09. Year-over-year human-hour efficiency (Leverage)

Output per human-hour this year divided by output per human-hour last year. The compounding metric. The argument for the agent stack is that it produces a flywheel of efficiency gains that scale faster than headcount additions; this KPI measures whether that flywheel is actually spinning. Audit: total shipped output (normalize per output type, sum to a composite) divided by total marketing human-hours, year over year. Benchmark: 1.5x to 2.5x year-one improvement; 1.2x to 1.5x year-two improvement; 1.1x+ each subsequent year. Below 1.0x means the stack is decaying faster than it is compounding.
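
The composite needs a weighting scheme before it can be summed, and the text does not prescribe one. The sketch below assumes a fixed weight per output type held constant across both years; the weights, unit counts, and hour totals are all hypothetical.

```python
# Hypothetical weights that put different output types on one scale.
# Keep them identical across both years or the ratio is meaningless.
WEIGHTS = {"ad_variant": 1.0, "email": 0.5, "report": 5.0}


def output_per_hour(units: dict[str, int], human_hours: float) -> float:
    """Weighted composite output divided by total marketing human-hours."""
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    composite = sum(WEIGHTS[kind] * count for kind, count in units.items())
    return composite / human_hours


last_year = output_per_hour(
    {"ad_variant": 2000, "email": 1000, "report": 50}, human_hours=5500
)
this_year = output_per_hour(
    {"ad_variant": 4000, "email": 1500, "report": 50}, human_hours=5000
)
yoy = this_year / last_year
print(f"{yoy:.2f}x year over year")  # prints "2.00x year over year"
```

A result under 1.0x from the same calculation is the decay signal described above.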

What good looks like at scale

Benchmarks shift by brand size. The same KPI value reads as healthy at a $5M brand and concerning at a $50M brand because the volume of work the stack is producing scales with brand revenue. Use the table below as directional, not prescriptive; calibrate against your own actual baseline before drawing conclusions.

Benchmarks by brand size

Directional, not prescriptive. Calibrate against your own baseline before drawing conclusions.

KPI | $5M brand | $15M brand | $50M brand
01 Variants / brief / hr | 15-25 | 25-40 | 35-60
02 Iteration velocity | 1-2 days | 4-8 hr | 2-6 hr
03 Cost per ad variant | <$8 | <$5 | <$3
04 Acceptance rate | 55-70% | 60-75% | 65-80%
05 Anomaly recovery | <8 hr | <4 hr | <2 hr
06 Rollback rate | <5% | <3% | <2%
07 Operator overhead | 10-15 hr/wk | 15-25 hr/wk | 25-35 hr/wk
08 SD ratio (strategy) | 60%+ | 70%+ | 75%+
09 YoY efficiency | 1.5-2.5x | 1.3-1.8x | 1.2-1.5x
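
To see how the same value reads differently by brand size, the range-type rows can be encoded as data. This is a sketch under assumptions: only three rows from the table are included, and the dictionary keys and the `position` helper are hypothetical names.

```python
# Ranges copied from the benchmark table; key names are illustrative.
BENCHMARKS = {
    "variants_per_hour": {"5M": (15, 25), "15M": (25, 40), "50M": (35, 60)},
    "acceptance_rate_pct": {"5M": (55, 70), "15M": (60, 75), "50M": (65, 80)},
    "operator_hours_per_week": {"5M": (10, 15), "15M": (15, 25), "50M": (25, 35)},
}


def position(kpi: str, tier: str, value: float) -> str:
    """Where a measured value sits relative to the directional range for a tier."""
    low, high = BENCHMARKS[kpi][tier]
    if value < low:
        return "below"
    if value > high:
        return "above"
    return "within"


# The same 20 Operator hours per week, read against each brand size.
readings = {
    tier: position("operator_hours_per_week", tier, 20)
    for tier in ("5M", "15M", "50M")
}
print(readings)  # prints {'5M': 'above', '15M': 'within', '50M': 'below'}
```

The helper reports position only. Whether "below" or "above" is good depends on the KPI, and your own baseline should override these directional ranges.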

How to instrument the 9 KPIs

Instrumentation is the work most operator teams put off because the framework above is more interesting than the plumbing. Resist the urge. Five concrete steps get you to a usable dashboard inside two weeks.

The 2-week instrumentation

No new tooling required for the first three steps. Operator handles the lift.

01. Spreadsheet baseline for KPIs 01, 02, 04, 06 (Day 1): Manual logging of timestamps and outcomes per output. Crude but reliable. Run for two weeks to set the baseline.
02. Operator time-tracking for KPIs 07, 08 (Day 1): Operator + Strategy Director self-report weekly time splits. Use whatever calendar tool the team uses; do not over-engineer.
03. Incident log for KPI 05 (Day 1): Single shared doc. Every anomaly gets detect time + resolve time. Median + 90th percentile calculated monthly.
04. Cost-per-unit calculation for KPI 03 (Week 1): Pull agent platform invoice, divide by output count by type. Add operator and Brand Steward hourly cost. Refresh monthly.
05. YoY composite for KPI 09 (Annual): Sum normalized output by type, divide by total human-hours, compare against same period prior year. Annual cadence, not monthly.

The dashboard you end up with after two weeks is not pretty. It is also not the point. The point is having the numbers at the cadence that matches the failure cadence. Pretty dashboards come later, after the baseline data tells you which KPIs deserve the visualization investment. Many brands skip directly to a polished tool and find they were instrumenting the wrong KPIs because they had not run the manual baseline first.

Cresva surfaces all nine KPIs continuously, with anomaly alerts on the ones that move first when something breaks. Productivity, quality, and leverage instrumented from day one. Spreadsheet baselines for the manual KPIs, automated tracking for the operational ones, year-over-year composites for the leverage flywheel.

Frequently asked questions

Which 3 of the 9 should I instrument first?
Iteration velocity (KPI 02), human review acceptance (KPI 04), and Operator overhead (KPI 07). Together they diagnose stack health in one snapshot. Velocity tells you the agents are shipping, acceptance tells you the work is good, and Operator overhead tells you the leverage is real. If those three are healthy, the other six tend to be too. If any of those three is broken, the rest of the dashboard is academic until the underlying issue resolves.
What if my agent vendor does not expose these metrics?
Build the proxies yourself with the spreadsheet baseline above. Lack of metric exposure is itself a structural signal: vendors confident in their stack expose the numbers, and vendors that withhold them often have a reason. After two months of your own data, you have ground truth against which to evaluate vendor-reported numbers. Vendors whose numbers diverge meaningfully from your manual baseline are usually optimizing for the metric the vendor surfaces rather than the metric that matters.
How long until the leverage KPIs show movement?
Operator overhead (KPI 07) starts moving within two months of the Operator hire and stabilizes around month four. Strategy Director ratio (KPI 08) takes six to nine months because shifting a Director's time profile requires the Operator to have absorbed enough to handle escalations independently. Year-over-year human-hour efficiency (KPI 09) is the slowest; meaningful annual movement requires the full org rhythm to compound across a year of cycles. Plan for the leverage KPIs on a quarterly trend basis, not a monthly one.
What is a healthy variant rejection rate, and what is too high?
The Brand Steward typically picks five to ten of thirty to fifty variants per brief, which works out to a rejection rate of roughly 70 to 85%. That is the healthy range. Higher (above 90%) usually means the brand brief is too vague and the variants are not converging on the intended direction. Lower (below 50%) usually means the Brand Steward is shipping work without enough rigor, which surfaces later as rollback rate (KPI 06) climbing. The rejection rate is a brief-quality signal as much as it is an agent-quality signal.
How do I track cost per output unit across heterogeneous outputs?
Normalize per output type; do not average across types. An ad variant is not directly comparable to an email send or a quarterly report; pretending it is masks the metric you actually care about, which is cost per unit of the specific output you are producing more of. Build a table: cost per ad variant, cost per email send, cost per report. Track each independently. The composite is useful only when comparing against a prior period's composite at the same output mix, which is the year-over-year human-hour efficiency calculation in disguise.
Does this stack measurement replace traditional marketing KPIs (CAC, ROAS, LTV)?
No, it layers under them. Traditional KPIs measure outcomes; the nine KPIs measure the stack producing those outcomes. Both matter for different decisions. CAC tells you if the channel works; the nine KPIs tell you if the stack producing your CAC-positive output is sustainable. Brands chasing CAC without the underlying nine often hit a wall where the outcome stays steady but the operator overhead silently doubles until the team burns out; brands tracking the nine without the traditional KPIs build a beautifully instrumented stack that does not produce revenue. Both layers are required.

Written by the Cresva Team
