Developers are writing more code than ever — AI tools have seen to that. But in most orgs, the review pipeline runs at the same speed it always has, and that's one place where things are breaking down.
Teams respond to this pressure in predictable ways. Reviews get faster, but shallower. PRs pile up. Reviewers start approving things they haven't fully read, and escaped bugs follow. Leadership asks for "better code review practices," which usually means a meeting, a checklist, and no structural change.
The assumption that speed and quality trade off against each other deserves scrutiny. In practice, slow reviews and low-quality reviews often share the same root cause. When you fix it, both improve.
Most engineering leaders already know if reviews are slow, and with the surge in AI code, most of them also know why. But having metrics alone doesn't change the behavior — and in some cases, the wrong metrics actively make it worse.
If you're only looking at cycle time as a single metric, you're missing the critical context that can explain why things are taking so long. PR cycle time isn't one thing. It breaks into distinct phases: time between first commit and PR marked as ready for review, time waiting in queue for review, and time in review.
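To make the phase breakdown concrete, here is a minimal sketch of how the three phases could be computed from a single PR's timestamps. The field names (`first_commit_at`, `ready_for_review_at`, and so on) are hypothetical, not taken from any particular tool's API.

```python
from datetime import datetime

# Hypothetical timestamps for one PR; field names are illustrative,
# not from any specific platform's API.
pr = {
    "first_commit_at":     datetime(2024, 5, 1, 9, 0),
    "ready_for_review_at": datetime(2024, 5, 3, 11, 0),   # PR marked ready for review
    "first_review_at":     datetime(2024, 5, 3, 15, 30),  # first reviewer picks it up
    "merged_at":           datetime(2024, 5, 4, 10, 0),
}

# The three phases of cycle time, measured separately rather than as one number
dev_time    = pr["ready_for_review_at"] - pr["first_commit_at"]      # development
queue_time  = pr["first_review_at"] - pr["ready_for_review_at"]      # waiting for pickup
review_time = pr["merged_at"] - pr["first_review_at"]                # active review

print(dev_time, queue_time, review_time)
```

Aggregating each phase separately (per team, per complexity class) is what makes the table below possible.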
Here's the breakdown from hundreds of thousands of PRs captured by Uplevel in the past year (figures are p75 values, which keeps major outliers from skewing them):
| PR type | Total cycle time | First commit to ready for review | Time in queue | Time in review |
| --- | --- | --- | --- | --- |
| Non-complex | 20 hours | 15 minutes | 45 minutes | 2 hours |
| Complex | 7.7 days | 66 hours | 6 hours | 24 hours |
Complex PRs make up a quarter of all pull requests, but they account for 77% of total cycle time once you weight each category's PR count by its average cycle time.
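As a back-of-the-envelope check, assuming complex PRs are exactly 25% of volume and using the p75 figures from the table:

```python
# Rough check of the cycle-time share, assuming complex PRs are exactly 25%
# of volume and using the p75 figures from the table above.
complex_share     = 0.25
complex_hours     = 7.7 * 24   # 7.7 days ≈ 184.8 hours
non_complex_hours = 20

complex_total     = complex_share * complex_hours             # ≈ 46.2 weighted hours
non_complex_total = (1 - complex_share) * non_complex_hours   # 15.0 weighted hours

# Prints ≈ 0.75 with these rounded inputs, in the same range as the 77% figure
print(complex_total / (complex_total + non_complex_total))
```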
For non-complex PRs, the development phase is negligible — about 15 minutes from first commit to ready for review, with queue time and active review accounting for most of the 20-hour total. For complex PRs, the bottleneck shifts dramatically upstream: developers spend an average of 66 hours before a PR is ever marked ready for review. Queue time (6 hours) and review time (24 hours) add to that, but they're not the primary driver.
Large, wide-scope PRs don't just take longer — they produce worse outcomes. Reviewers evaluating hundreds of changed lines across multiple files make more errors. The review becomes a best-effort scan rather than a genuine quality gate.
A good target is around 20% of PRs classified as complex. Above 50% is a signal that teams need to break work into smaller units.
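A quick way to see where each team sits against those thresholds, assuming PRs are already tagged as complex or not (the tagging criteria are whatever your tooling uses; the counts here are made up):

```python
# Hypothetical per-team counts of complex vs. total PRs for one period.
teams = {
    "payments": {"complex": 12, "total": 64},
    "platform": {"complex": 31, "total": 58},
}

for team, counts in teams.items():
    share = counts["complex"] / counts["total"]
    if share > 0.5:
        status = "break work into smaller units"
    elif share > 0.2:
        status = "above the ~20% target, worth a look"
    else:
        status = "within target"
    print(f"{team}: {share:.0%} complex ({status})")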
A handful of teams or authors typically drive the majority of complex PRs — which is useful information, and also a prompt to ask why. Complex PR patterns often reflect how work arrives: whether requirements are stable when development starts, whether there's pressure to batch changes, or whether developers lack the context to decompose work before they start writing. The data shows you where to look. The answer to why requires a different kind of inquiry.
Pickup time reflects something similar. It's not just a queue management problem — it signals team norms: how reviews are prioritized against other work, whether there's a shared expectation around response time, whether reviewers feel ownership over the queue. A 4-hour pickup SLA written into a team agreement means something different than one handed down as a policy. The former tends to stick.
Generic advice — "keep PRs small," "review within 24 hours" — fails because it's delivered without measurement or context. Teams hear it, agree with it, and change nothing, because they don't know which specific PRs, teams, or patterns are responsible for the problem. A norm without shared understanding behind it is just a rule to route around.
Size limits only stick when teams understand why they matter and can see their own patterns. When teams consistently produce oversized PRs, the data surfaces the pattern, which is a good starting point. But understanding the root cause requires talking to the team: are requirements shifting mid-sprint? Is there implicit pressure to ship larger units? Do developers know how to decompose the work? The intervention depends on which of those is true.
Set a target for first review pickup, not review completion. A 4-hour pickup SLA is achievable. A 4-hour completion SLA on complex work is not. Most teams conflate these, then wonder why the SLA isn't moving the number. And as with size limits, an SLA works better when the team sets it together than when it's handed to them.
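One way to check whether a pickup SLA is actually being met, sketched with made-up data; the 4-hour target and the p90 cut are example choices, not universal recommendations:

```python
# Hypothetical pickup times (hours from "ready for review" to first reviewer
# activity) for one team over a sprint. The 4-hour SLA is an example target.
pickup_hours = [0.5, 1.2, 2.0, 3.1, 3.8, 4.5, 6.0, 9.5, 1.1, 0.7]
sla_hours = 4

within_sla = sum(1 for h in pickup_hours if h <= sla_hours)
print(f"{within_sla / len(pickup_hours):.0%} of PRs picked up within {sla_hours}h")

# p90 pickup time (nearest-rank): a slow outlier can't hide behind the average
p90 = sorted(pickup_hours)[int(0.9 * len(pickup_hours)) - 1]
print(f"p90 pickup time: {p90}h")
```

Reporting the share within SLA plus a high percentile tells you both how often the norm holds and how bad the misses are.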
AI output tends to be syntactically correct, but it comes with more baggage. It's harder to verify for intent and architectural fit, and because of how AI writes code, even a well-scoped task produces more lines, more abstraction layers, and more boilerplate than a human would write for the same feature.
Teams need norms around what "reviewable" means for AI-generated code, which might mean stricter requirements around descriptions or explicit scope constraints. And like other process changes, those norms are most durable when developers have a hand in defining them.
Comment counts surprise most teams: they're where measurement and team dynamics intersect in ways that aren't always obvious.
An abnormally high comment count often points to rework — code that arrived under-specified, or requirements that shifted mid-stream. But abnormally low comment counts are a different problem: they indicate rubber stamping, where reviews are technically happening but nothing is getting caught.
Low comment counts look fine in a report. They only read as a warning signal if you understand how that team actually conducts reviews — whether reviewers feel safe raising concerns, whether there's time pressure that discourages thoroughness, whether the culture treats review as a formality or a true quality gate.
Both patterns show up in escaped bug rates eventually. Tracking comment volume against your baseline, broken out by team, gives you an early warning. Understanding what's driving it requires going a level deeper.
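A sketch of that early-warning check, with hypothetical team baselines; the 0.5x and 2x bands are assumptions to tune against your own data, not established thresholds:

```python
# Hypothetical average comments per PR: each team's trailing baseline vs.
# the current period. The 0.5x / 2x thresholds are illustrative.
baseline = {"payments": 6.2, "platform": 4.8, "mobile": 5.5}
current  = {"payments": 1.1, "platform": 5.0, "mobile": 11.3}

for team, value in current.items():
    ratio = value / baseline[team]
    if ratio < 0.5:
        print(f"{team}: comments far below baseline (possible rubber stamping)")
    elif ratio > 2.0:
        print(f"{team}: comments far above baseline (possible rework or unclear scope)")
    else:
        print(f"{team}: within normal range")
```

The point of comparing against each team's own baseline, rather than a global number, is that review styles differ; the anomaly is the signal, not the absolute count.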
Most teams already run some automated checks — linting, static analysis, basic security scanning — before a PR reaches a human. The case for expanding that gate is getting stronger, and AI adoption is the reason.
AI coding tools generate code faster than human review pipelines were built to handle. When one Uplevel customer introduced AI code generation, the share of complex PRs jumped to 39% — more than double their previous baseline — before anyone had registered the implications of output velocity outpacing review capacity.
Automated pre-reviews can handle first-pass checks: style enforcement, test coverage gaps, known vulnerability patterns, and obvious logic errors. What's left for humans is intent verification, architectural fit, and cross-system impact — decisions that require the context and judgement that a tool doesn't have. According to Qodo's 2025 AI Code Quality report, teams using AI-assisted code review saw quality improvements in 81% of cases, up from 55% without it.
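As a rough illustration of treating the automated pass as a gate rather than a parallel step, here is a sketch that runs checks and only hands off to human review if they pass. The specific tools (`ruff` for lint, `pytest` for tests) are examples, not the article's recommendation; substitute whatever your pipeline already runs.

```python
import subprocess

# Example pre-review gate: run automated checks first, request human review
# only if they pass. Commands are illustrative; use your own toolchain.
checks = [
    ["ruff", "check", "."],   # style / static analysis
    ["pytest", "--quiet"],    # test suite
]

def pre_review_gate() -> bool:
    for cmd in checks:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"pre-review check failed: {' '.join(cmd)}")
            return False
    return True

if pre_review_gate():
    print("Automated gate passed: ready for human review "
          "(intent, architectural fit, cross-system impact)")
```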
One thing worth naming: shifting to automated pre-review changes what's expected of human reviewers, and that shift needs to be managed. Reviewers who previously scanned for style issues alongside logic errors now need to focus differently. Without clarity on what the automation is covering and what it isn't, teams often end up with redundant effort in some areas and gaps in others. The tooling change and the team expectation change need to happen together.
Full AI-driven review without human oversight is still uncommon in enterprise engineering. But treating automated pre-review as a prerequisite for human review — rather than a parallel nice-to-have — is a practical step available to most teams now.
Aggregate cycle time is a poor scorecard for this work. It mixes fast, simple PRs with slow, complex ones, and improvements on the easy end can mask stagnation on the hard end.
The metrics that matter are cycle time broken out by phase and complexity class. A few things worth tracking separately:

- Share of PRs classified as complex, by team and author
- Time from first commit to ready for review, especially on complex PRs
- Pickup time against whatever SLA the team has agreed to
- Time in active review, kept separate from queue time
- Comment volume per PR relative to each team's own baseline
One connection worth making explicit: PR complexity often starts upstream. Work that was never properly scoped arrives as an oversized PR because requirements changed mid-stream. If review times are high and comment counts are volatile, the place to look is the planning process — how work is defined and handed to developers — before concluding that the review process itself is the problem.
Faster code review doesn't require asking your team to work harder or move faster. It requires understanding where your pipeline is breaking down — and why.
The data to find the where already exists in your PRs. The why usually requires a different kind of investigation: looking at how work is scoped, how reviews are prioritized, and whether the team has the context and psychological safety to change how they work.
Both matter. Teams that treat this as a measurement problem alone tend to improve their dashboards. Teams that treat it as a systems problem tend to improve their delivery.