Token Spend Is the Wrong Way to Measure AI ROI

Most engineering organizations now have data on how much they're spending on AI. Far fewer have data on what that spend is producing.

Token consumption is easy to track — your vendor surfaces it automatically, often with alarming detail. Teams report on active seats and token burn rates. Leaders build forecasts around them. CFOs start asking questions. And somewhere in that chain, token spend becomes the unofficial metric for whether AI is working.

The problem with that: token spend is a cost metric. Treating it as a productivity or value signal sends engineering organizations optimizing for the wrong things entirely.

Why do engineering teams keep measuring AI by token usage?

Token usage fills a measurement void. If an organization had a strong value measurement practice before AI arrived, it would have a natural home for AI ROI: did we ship more? Did we ship faster? Did operational burden go down? But most organizations don't have that practice — and AI adoption moved faster than measurement infrastructure could follow.

Ed Quick, a technology executive with experience leading engineering organizations at Cars.com, AWS, and SurveyMonkey, describes how the pattern plays out:

“Organizations go through rounds of budget forecasting, spend estimates climb from $5,000 a month to $50,000 a month, and the one thing everyone can agree on is what the number is.”

Ed Quick

Former engineering executive at Cars.com and SurveyMonkey

Whether that number represents value delivered is a separate question — and one most organizations haven't built the infrastructure to answer.

There's also a maturity dimension. The early pressure on engineering teams was to adopt AI — get people using it, get seats filled, show progress. The metrics that made sense for adoption (active users, acceptance rates, tokens consumed) were reasonable proxies for that goal. But adoption metrics and outcome metrics are different things, and organizations that haven't made that transition are still optimizing for the wrong target.

AI’s impact on engineering output

Uplevel data shows that AI increases throughput, holds delivery speed mostly flat, and shifts the composition of work in ways that are harder to read than they look.

PR velocity and throughput have climbed consistently across Uplevel customers over the past 18 months. Volume is up, but volume and speed are different metrics.

PR velocity increases

Cycle time, the time it actually takes to move work from start to done, has stayed roughly flat, and in some organizations has gotten marginally worse. More PRs moving doesn't mean work is completing faster. It can mean more work is in flight at once, which creates its own coordination overhead.

The work composition picture is similarly mixed. We see a modest increase in time spent on new value work (feature development, strategic initiatives) after AI adoption. But in many organizations that gain is counterbalanced by a corresponding rise in “KTLO” (keep-the-lights-on) operational work. The jury is still out on net impact.

New value work: slight increase

What this points to is that AI is an amplifier, and what it amplifies depends on what's already in place. Teams with strong CI/CD pipelines, clear backlogs, and good engineering practices tend to get genuine throughput gains. Teams with fragmented workflows, technical debt, and unclear prioritization tend to move faster through the same problems they had before.
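To make the throughput-versus-speed distinction concrete, here is a minimal sketch in Python. It is an illustration, not Uplevel's method: the `PullRequest` record, the helper names, and the sample numbers are all hypothetical, and PR open-to-merge time stands in for cycle time. The last helper applies Little's Law (average work in flight is roughly throughput multiplied by cycle time, for a reasonably stable system) to show why rising volume at flat cycle time implies more work in flight.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class PullRequest:
    """Hypothetical PR record; real data would come from your Git host."""
    opened: datetime
    merged: datetime


def throughput_per_week(prs: list[PullRequest], weeks: int) -> float:
    """Merged PRs per week over an observation window of `weeks` weeks."""
    return len(prs) / weeks


def median_cycle_time_days(prs: list[PullRequest]) -> float:
    """Median days from PR opened to PR merged (a stand-in for cycle time)."""
    return median((pr.merged - pr.opened).total_seconds() / 86400 for pr in prs)


def average_wip(throughput_weekly: float, cycle_time_days: float) -> float:
    """Little's Law estimate: work in flight = throughput x cycle time."""
    return throughput_weekly * (cycle_time_days / 7)


def make_prs(count: int, days_open: int) -> list[PullRequest]:
    """Build illustrative sample data: `count` PRs, each open `days_open` days."""
    start = datetime(2025, 1, 6)
    return [
        PullRequest(
            opened=start + timedelta(hours=i),
            merged=start + timedelta(hours=i, days=days_open),
        )
        for i in range(count)
    ]


# Invented two-week windows: more PRs after adoption, same time per PR.
before = make_prs(count=10, days_open=3)
after = make_prs(count=16, days_open=3)

for label, prs in (("before", before), ("after", after)):
    tp = throughput_per_week(prs, weeks=2)
    ct = median_cycle_time_days(prs)
    print(label, "throughput/week:", tp, "cycle time (days):", ct,
          "estimated WIP:", round(average_wip(tp, ct), 2))
```

In this invented sample, throughput rises while median cycle time does not move, so the estimated work in flight grows. That is the pattern described above: more volume, not more speed.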

Defining and measuring AI ROI

The organizations that answer this question well start with a written definition of success before touching the data. Ed Quick describes the approach he uses, borrowing from Amazon's working-backwards methodology: write the press release first. Before any AI implementation, define what "done" looks like. What is specifically different? What pain points are gone? What can the team do now that it couldn't do before?

That exercise forces a level of specificity that most AI ROI conversations skip. "We'll be more productive" is not a definition. "Epic lead time drops from six weeks to three, and we ship one major feature per sprint instead of one per quarter" is a clear and measurable outcome.
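One way to keep a definition like that honest is to write it down as data with a baseline and a goal. The sketch below is a hypothetical illustration: the `Target` class and its `progress` method are invented for this example. Only the six-weeks-to-three figure comes from the example outcome above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A written success criterion: where we are now and where 'done' is."""
    metric: str
    unit: str
    baseline: float
    goal: float

    def progress(self, current: float) -> float:
        """Fraction of the baseline-to-goal distance covered so far.

        0.0 means no change from baseline, 1.0 means the goal is met, and
        a negative value means the metric moved the wrong way. The formula
        works whether the goal is lower or higher than the baseline.
        """
        if self.goal == self.baseline:
            raise ValueError("goal must differ from baseline")
        return (self.baseline - current) / (self.baseline - self.goal)


# Figures taken from the example outcome in the text.
epic_lead_time = Target("epic lead time", "weeks", baseline=6, goal=3)

# Hypothetical reading partway through the initiative.
print(epic_lead_time.metric, "progress:", epic_lead_time.progress(4.5))
```

A vague goal such as "we'll be more productive" cannot be expressed in this shape at all, which is a quick test of whether the definition is specific enough.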

The who matters as much as the what. Ed's advice is to get engineers, product partners, and business stakeholders in the room together:

Engineers understand what's actually painful in the delivery process.
Product partners know which delays cost the most.
Stakeholders can tell you which outcomes would move business metrics.

The definition that results from that conversation is a collective one — and collective definitions hold up better when reporting time comes.

Uplevel enterprise transformation consultant MC Johansen also explains that the right metrics depend on how AI is being used, and that varies more than most leaders realize. “A team using AI for code generation has different relevant measures than a team using it for documentation or QA. A single token-based metric collapses all of that into one number, and it loses a lot of meaning.”

What governing AI spend looks like in practice

Token management has a close analogy in cloud cost governance, and the lessons from cloud apply — with some caveats.

Cloud cost optimization took years to build as a discipline. FinOps teams, cost management tooling, and enterprise discount structures all emerged gradually as organizations matured from "plug in an API key and run" to actively managing infrastructure, monitoring, and spend. AI is following the same arc. The opacity of token pricing — how models get swapped behind the scenes, what exactly gets charged and when — maps closely to the early days of cloud billing.

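One difference: the vendor incentive problem is starker with AI. Vendors that bill by the token earn more as consumption grows, so they have little built-in reason to help customers spend less.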

Cloud providers had the same incentives, but the tooling ecosystem matured to counterbalance them — third-party cost managers, native spend dashboards, enterprise agreements with discount structures. AI tooling is getting there. The organizations ahead of the curve are already making the same moves: shifting off direct API pricing into cloud provider implementations, introducing prompt caching, standardizing on a defined set of use cases before expanding.

For teams at different AI maturity levels (some deep into agentic workflows, others still evaluating tools), Ed argues that governance works best as “transparency plus manager accountability.” This means keeping spend visible at the team level and promoting a safe environment for engineers to experiment and fail cheaply. A junior engineer who runs an unbounded context window and suddenly costs the org money points to a governance failure, and the incident should be treated as a coaching opportunity.

How one engineering org reached 300% productivity improvement

Ed watched his organization go through the same FOMO-driven AI scramble most large companies experienced — scattered POCs, vendor presentations, a week-long hackathon, 30 things in flight with two worth shipping.

His response was to propose something deliberately smaller: an experiment with two teams identified as having a solid product backlog and good PM coverage. The hypothesis: if we focus AI on the most time-consuming steps in our delivery process (ticket creation, initial PR generation, code review), can we see measurable throughput improvement? Ed’s teams defined four or five metrics upfront and dedicated three sprints of runway.

Critically, the experiment was framed around a delivery question: can we ship more value faster? That framing shaped everything, from the metrics chosen and the skills built to the coaching structure.

Ed's team brought in one of the most trusted engineers in the organization to build and embed the shared AI skills library. Sprint one showed promise. Sprint two showed a substantial increase. By sprint three, other teams were asking to join. “Well, this really wasn't an experiment anymore,” Ed recalls. “It was just working.”

The cost picture followed. Because the initial teams had built structure around specific, high-value use cases, their token spend was notably lower than the rest of the company despite higher productivity. That gave them leverage to move from third-party API access to their cloud provider's implementation, capturing enterprise discounts and introducing prompt caching. Optimization became possible because the use cases were defined.


What should engineering leaders tell their CFO about AI spend?

The CFO conversation is coming. If it hasn't happened yet, it will — and showing up without a framework is the wrong move.

Ed's preparation advice: know your numbers, know your trajectory, and know what levers you can pull. That requires a level of internal discipline most engineering orgs haven't built yet, but the payoff for that work goes beyond the budget conversation.

"Your CFO is now getting bombarded by marketing and comms and everybody else who's bought an AI tool that is now unpredictable and going crazy. If you're the voice of reason in the room, they're much more likely to trust your engineering leaders in that space and let engineering take a lead-by-example role when it comes to the rest of the company and AI spend."

Ed Quick

That's a different kind of authority than most engineering leaders are used to claiming. And it comes directly from measurement discipline, not from being first to deploy the most tools.

Amy Carrillo Cotten, Uplevel's director of transformation, argues that leaders who approach AI ROI purely via cost will run a race to the bottom, optimizing spend without ever connecting engineering output to business value. The leaders who earn budget and trust are the ones who've defined what value means for their teams and can show progress against it.

How Uplevel helps engineering teams measure what matters

Token spend tells you what AI costs. Answering what AI is producing requires visibility into the underlying work — how engineers are spending time, how work is flowing through the SDLC, where AI is adding throughput and where it's adding operational burden.

Uplevel integrates data from Jira, GitHub, Slack, Calendar, and AI tools to give engineering leaders a continuous read on work allocation, cycle time, and the ratio of new value work to operational work. That's the data behind the charts in this post. Uplevel combines that continuous measurement with contextual understanding and capability building, so the measurement produces decisions and action.

If you're in the middle of an AI ROI conversation and finding that your current metrics don't support it, schedule a StackUp assessment to see where your engineering system actually stands.

Frequently Asked Questions

What's the difference between measuring AI activity and measuring AI outcomes?

Activity metrics — token usage, active seats, code acceptance rates — track whether AI tools are being used and how much they cost. Outcome metrics track what the work produces: feature delivery speed, cycle time, the ratio of new value work to operational burden, and customer impact. Activity metrics are useful for adoption tracking and cost management. As the primary frame for AI ROI, they measure the input and leave the result untracked.

How should engineering leaders define AI ROI?

Start with a specific, written definition of what success looks like before implementation begins. Amazon's working-backwards approach applies well here: write the press release you'd issue on the day the initiative succeeds, then work back to the metrics that would need to move for that headline to be true. The definition process should include engineers, product partners, and business stakeholders — not just engineering leadership.

What metrics should replace token spend as the primary AI measure?

The right metrics depend on what you're trying to achieve and how AI is being used. For delivery-focused teams, epic lead time is a reliable proxy for end-to-end speed. Work allocation — the ratio of new value work to maintenance and operational burden — shows whether AI is freeing up capacity for strategic work. Cycle time tracks whether work is completing faster, not just entering the system faster. Token spend remains relevant for cost management and belongs alongside output metrics as part of a complete AI measurement picture.
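As a rough illustration of putting token spend alongside output metrics, the sketch below computes two simple ratios. The function names, the choice of "delivered work items" as the unit, and every number are assumptions made for this example, not Uplevel's definitions or customer data.

```python
def new_value_ratio(new_value_hours: float, operational_hours: float) -> float:
    """Share of tracked time spent on new value work rather than operational work."""
    total = new_value_hours + operational_hours
    if total <= 0:
        raise ValueError("no tracked hours")
    return new_value_hours / total


def cost_per_delivered_item(token_spend_usd: float, items_delivered: int) -> float:
    """Token spend divided by delivered work items (epics, features, or PRs)."""
    if items_delivered <= 0:
        raise ValueError("no delivered items to divide by")
    return token_spend_usd / items_delivered


# Invented monthly figures for two hypothetical teams.
teams = {
    "team_a": {"spend": 4000.0, "delivered": 80, "new_value": 300.0, "ops": 200.0},
    "team_b": {"spend": 9000.0, "delivered": 90, "new_value": 250.0, "ops": 250.0},
}

for name, t in teams.items():
    print(
        name,
        "cost per item:", cost_per_delivered_item(t["spend"], t["delivered"]),
        "new value ratio:", new_value_ratio(t["new_value"], t["ops"]),
    )
```

In this made-up data, team_b spends more than twice as much as team_a yet delivers only slightly more, so its spend figure alone would overstate its impact. Reading spend next to output is what makes the number interpretable.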

How do you govern AI spend across teams at different maturity levels?

Make spend visible at the team level without making it punitive. Assign accountability to managers rather than surfacing cost data directly to executives first. Set spending caps that allow experimentation within safe bounds. Route high-usage alerts to experienced engineers who can coach rather than reprimand. Recognize that teams at different AI maturity levels will have meaningfully different cost and output profiles, and build expectations accordingly.
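The cap-and-coach approach can be sketched in a few lines. This is a hypothetical illustration only: the `TeamSpend` fields, the 80% warning threshold, and the team names are assumptions, and a real setup would pull spend from your vendor's billing data.

```python
from dataclasses import dataclass


@dataclass
class TeamSpend:
    """Hypothetical month-to-date spend record for one team."""
    team: str
    coach: str  # experienced engineer who receives alerts for this team
    month_to_date_usd: float
    monthly_cap_usd: float


def spend_status(spend: TeamSpend, warn_ratio: float = 0.8) -> str:
    """Classify spend against the cap as 'ok', 'near_cap', or 'over_cap'."""
    used = spend.month_to_date_usd / spend.monthly_cap_usd
    if used >= 1.0:
        return "over_cap"
    if used >= warn_ratio:
        return "near_cap"
    return "ok"


def coaching_queue(
    teams: list[TeamSpend], warn_ratio: float = 0.8
) -> list[tuple[str, str, str]]:
    """Return (team, coach, status) for teams that warrant a conversation.

    Alerts are routed to the team's coach, not to an executive report.
    """
    queue = []
    for t in teams:
        status = spend_status(t, warn_ratio)
        if status != "ok":
            queue.append((t.team, t.coach, status))
    return queue


# Invented sample data.
teams = [
    TeamSpend("payments", "priya", month_to_date_usd=4200.0, monthly_cap_usd=5000.0),
    TeamSpend("search", "marco", month_to_date_usd=1200.0, monthly_cap_usd=5000.0),
    TeamSpend("platform", "dana", month_to_date_usd=5600.0, monthly_cap_usd=5000.0),
]

for team, coach, status in coaching_queue(teams):
    print(f"{team}: {status}, notify {coach}")
```

The design choice worth noting is where the alert goes. The cap bounds how expensive a mistake can get, and sending the signal to a coach keeps the response instructive instead of punitive.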

How do you make the case for better AI measurement to executive leadership?

Start small and bring receipts. Instrument one or two teams with outcome metrics, run them through a sprint or quarter, and surface both the positive results and the cases where AI adoption drove cost without value. Drawing a line from engineering output all the way to customer impact (reduced churn, increased feature adoption, faster delivery) is what gets executive attention.

Is AI actually improving engineering speed and delivery?

Across Uplevel customer data, AI has consistently increased PR throughput and volume. Cycle time — the end-to-end speed from work start to completion — has shown limited improvement and in some cases marginal decline. The meaningful speed gains tend to appear in organizations with strong technical foundations, clear backlogs, and structured AI use cases rather than open-ended tool access.

What's the right way to run an AI productivity experiment on an engineering team?

Define the hypothesis and success metrics before you start. Choose two or three teams with enough backlog to run for multiple sprints without running out of work. Focus AI on specific, high-friction steps in the delivery process rather than general usage. Embed a trusted, hands-on engineering leader with the teams. Retro after each sprint on what's working and what to adjust. Frame the experiment around delivering value faster — that framing shapes which metrics you choose and how the teams approach the work.

Lauren Lang
