
Why AI Code Review Misses Real Risks
AI code review misses real risks when teams treat generated comments as evidence that the risky part of a pull request was inspected.
The tool may find useful issues. It may suggest cleaner error handling, missing tests, suspicious null handling, repeated code, or a confusing branch. Those comments can be worth fixing.
But the highest-risk part of a change is often not the line that looks odd in the diff. It is the business invariant, permission boundary, data contract, rollout assumption, or failure mode that lives around the diff.
That is why AI review can make a pull request feel reviewed while production risk remains mostly unchanged.
For the broader code-review workflow problem, read Code Review Antipatterns That Slow Teams Down. This article focuses on the AI-specific failure mode: comment volume improves faster than judgment.
Why AI Review Looks Strong At First
AI review is attractive because it improves the visible parts of review.
It can respond quickly. It can scan every pull request. It can comment without waiting for a human reviewer to find time. It can suggest code changes that are easy to apply.
GitHub's Copilot code review documentation describes this basic shape: Copilot reviews code, provides feedback, and can offer suggested changes in pull requests. See GitHub's overview of Copilot code review.
That makes AI review feel like a clear improvement over an empty review queue.
The trap is that the tool improves the review surface before it necessarily improves the review outcome.
Good code review is not measured by:
- number of comments
- speed of first feedback
- percentage of AI suggestions accepted
- whether every pull request had an automated review
- whether the diff looks cleaner after suggestions
Those are process signals. They do not prove that review reduced the uncertainty that mattered.
A better question is:
Did review inspect the risk that would hurt the system if this merged today?
That question is much harder for AI to answer because the important context is often outside the changed lines.
The Review Situation
Imagine a pull request that adds an internal endpoint for changing a member's role in an account.
The change is small. The endpoint is behind authentication. The tests pass. AI review leaves several comments within a minute.
The core function looks like this:
type ChangeRoleInput = {
  actorUserId: string
  targetUserId: string
  accountId: string
  role: 'member' | 'admin'
}
export async function changeMemberRole(input: ChangeRoleInput) {
  const target = await db.membership.findFirst({
    where: {
      userId: input.targetUserId,
    },
  })
  if (!target) {
    throw new Error('Member not found')
  }
  await db.membership.update({
    where: { id: target.id },
    data: { role: input.role },
  })
  await db.auditEvent.create({
    data: {
      actorUserId: input.actorUserId,
      targetUserId: input.targetUserId,
      accountId: input.accountId,
      action: 'membership.role_changed',
    },
  })
}
The code is not obviously broken at a glance. It has typed input, a missing-target check, an update, and an audit event.
An AI reviewer might leave comments like:
Suggestion: use a custom error class instead of a generic Error.
Suggestion: include the previous role and new role in the audit event for better traceability.
Important: add a test for the "member not found" branch.
Question: should this function return the updated membership?
Those comments are not useless. The audit suggestion is especially reasonable.
But none of them names the production risk.
What The AI Review Misses
The dangerous issues are not style problems. They are system invariants.
The function changes a role, but it does not prove:
- the target membership belongs to input.accountId
- the actor belongs to the same account
- the actor has permission to grant the requested role
- the actor is not escalating a peer across an account boundary
- the role change and audit event commit together
- the tests cover cross-account attempts and unauthorized actors
The missing account scope is especially subtle:
const target = await db.membership.findFirst({
  where: {
    userId: input.targetUserId,
  },
})
If the same user can belong to multiple accounts, this lookup can select a membership from the wrong account. The later audit event records input.accountId, but the updated row may belong somewhere else.
That is not a local code-style issue. It is a tenant-boundary issue.
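To make the failure concrete, here is a minimal sketch that replaces the database with an in-memory array. The rows, the account ids, and the helper names `findByUserOnly` and `findByAccountAndUser` are hypothetical and exist only for illustration. In a real database the row returned by an unordered lookup is not guaranteed; in this sketch, array order decides.

```typescript
// Hypothetical in-memory stand-in for the membership table.
type Role = 'member' | 'admin'
type MembershipRow = { id: string; accountId: string; userId: string; role: Role }

// The same user belongs to two accounts.
const membershipRows: MembershipRow[] = [
  { id: 'm1', accountId: 'acct-a', userId: 'user-1', role: 'member' },
  { id: 'm2', accountId: 'acct-b', userId: 'user-1', role: 'member' },
]

// Mirrors the lookup in the pull request: filters by user only.
function findByUserOnly(rows: MembershipRow[], userId: string): MembershipRow | undefined {
  return rows.find((row) => row.userId === userId)
}

// Scoped lookup: the account is part of the filter.
function findByAccountAndUser(
  rows: MembershipRow[],
  accountId: string,
  userId: string,
): MembershipRow | undefined {
  return rows.find((row) => row.accountId === accountId && row.userId === userId)
}

// The caller intends to change user-1's role in acct-b.
const unscopedMatch = findByUserOnly(membershipRows, 'user-1')
const scopedMatch = findByAccountAndUser(membershipRows, 'acct-b', 'user-1')

console.log(unscopedMatch?.accountId) // acct-a: a row from another account
console.log(scopedMatch?.accountId) // acct-b: the intended row
```

Both lookups succeed, so the missing-target check passes either way. Only the account filter decides which tenant's row is updated.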
The safer shape starts by making the invariant explicit:
export async function changeMemberRole(input: ChangeRoleInput) {
  // Both lookups are scoped to the account named in the request.
  const actor = await db.membership.findFirst({
    where: {
      accountId: input.accountId,
      userId: input.actorUserId,
    },
  })
  const target = await db.membership.findFirst({
    where: {
      accountId: input.accountId,
      userId: input.targetUserId,
    },
  })
  if (!actor || !target) {
    throw new Error('Member not found')
  }
  // The permission check runs before any state change.
  if (actor.role !== 'admin') {
    throw new Error('Not allowed to change member roles')
  }
  // The role update and the audit event commit or roll back together.
  await db.$transaction([
    db.membership.update({
      where: { id: target.id },
      data: { role: input.role },
    }),
    db.auditEvent.create({
      data: {
        actorUserId: input.actorUserId,
        targetUserId: input.targetUserId,
        accountId: input.accountId,
        action: 'membership.role_changed',
        metadata: {
          previousRole: target.role,
          nextRole: input.role,
        },
      },
    }),
  ])
}
This example is still simplified. Real systems usually need more policy: owner-only actions, self-demotion rules, protected accounts, invite states, support impersonation, audit retention, and API error contracts.
That is the point.
The real risk comes from rules the diff does not fully explain.
Why The Miss Happens
AI review does not miss these risks because it is useless. It misses them because code review depends on context that may not be in the prompt, the diff, or the repository conventions.
GitHub's responsible-use documentation for Copilot code review explains that the reviewed changes, relevant context such as pull request title and body, and custom instructions are combined into a prompt before a model generates feedback. See GitHub's responsible use guide for Copilot code review.
That input model is important. If the risky rule is not visible in the diff, not stated in the pull request, not encoded in tests, not captured in custom instructions, and not obvious from nearby code, the AI reviewer has to infer it.
Sometimes it will infer well. Sometimes it will produce plausible comments around the risk instead of naming the risk itself.
The Diff Hides The Business Rule
A human reviewer may know that account membership is tenant-scoped because they remember the data model, an incident, or a support escalation.
AI review may see only a function that updates a row.
The code path looks small. The hidden rule is large.
The Pull Request Description Is Too Thin
This description gives AI and humans very little to work with:
Add endpoint for changing member roles.
A better description exposes the review target:
Add role-change endpoint for account members.
Risk:
This touches account authorization. The dangerous cases are cross-account updates,
non-admin actors granting admin, and audit records that do not match the committed row.
Review focus:
1. Is every membership lookup scoped by accountId?
2. Does the actor permission check happen before the role update?
3. Are the role update and audit event committed together?
4. Do tests cover unauthorized actor and cross-account target attempts?
This is useful for human reviewers. It also gives AI review a better chance of looking in the right direction.
The Tests Prove The Wrong Reality
A test like this makes the change look safe:
it('changes a member role', async () => {
  await changeMemberRole({
    actorUserId: admin.id,
    targetUserId: member.id,
    accountId: account.id,
    role: 'admin',
  })
  await expectRole(member.id, 'admin')
})
It proves only the happy path.
The review should ask for tests tied to the actual risk:
| Risk | Test evidence |
|---|---|
| Cross-account target update | target user belongs to a different account and must not be updated |
| Unauthorized actor | non-admin actor cannot grant admin |
| Audit mismatch | role update and audit event commit or roll back together |
| API contract | caller receives a stable forbidden/not-found response |
| Regression protection | test fails if accountId is removed from the lookup |
This connects to the broader testing problem in Why Tests Pass but Production Still Breaks: tests can pass while modeling the wrong production condition.
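The first two rows of the table can be sketched as executable checks. This is not the article's test suite: it assumes a synchronous, in-memory stand-in for the scoped function (`changeRoleScoped`) and hypothetical helpers (`seedMemberships`, `lookupMembership`, `attemptRoleChange`), and it uses plain values instead of a test framework so it runs on its own.

```typescript
type Role = 'member' | 'admin'
type MembershipRow = { accountId: string; userId: string; role: Role }
type RoleChange = { actorUserId: string; targetUserId: string; accountId: string; role: Role }

// Hypothetical in-memory rows standing in for db.membership.
function seedMemberships(): MembershipRow[] {
  return [
    { accountId: 'acct-a', userId: 'admin-a', role: 'admin' },
    { accountId: 'acct-a', userId: 'member-a', role: 'member' },
    { accountId: 'acct-b', userId: 'member-b', role: 'member' },
  ]
}

function lookupMembership(
  rows: MembershipRow[],
  accountId: string,
  userId: string,
): MembershipRow | undefined {
  return rows.find((row) => row.accountId === accountId && row.userId === userId)
}

// Simplified, synchronous stand-in for the account-scoped role change.
function changeRoleScoped(rows: MembershipRow[], input: RoleChange): void {
  const actor = lookupMembership(rows, input.accountId, input.actorUserId)
  const target = lookupMembership(rows, input.accountId, input.targetUserId)
  if (!actor || !target) throw new Error('Member not found')
  if (actor.role !== 'admin') throw new Error('Not allowed to change member roles')
  target.role = input.role
}

// Runs one attempt on fresh rows; reports the error (if any) and the
// target's role afterwards, looked up in the account it really belongs to.
function attemptRoleChange(input: RoleChange, targetAccountId: string) {
  const rows = seedMemberships()
  let error: string | null = null
  try {
    changeRoleScoped(rows, input)
  } catch (caught) {
    error = caught instanceof Error ? caught.message : String(caught)
  }
  const target = lookupMembership(rows, targetAccountId, input.targetUserId)
  return { error, targetRole: target ? target.role : null }
}

// Risk: cross-account target update. member-b belongs to acct-b, not acct-a.
const crossAccountAttempt = attemptRoleChange(
  { actorUserId: 'admin-a', targetUserId: 'member-b', accountId: 'acct-a', role: 'admin' },
  'acct-b',
)

// Risk: unauthorized actor. member-a is not an admin.
const unauthorizedAttempt = attemptRoleChange(
  { actorUserId: 'member-a', targetUserId: 'member-a', accountId: 'acct-a', role: 'admin' },
  'acct-a',
)

console.log(crossAccountAttempt) // { error: 'Member not found', targetRole: 'member' }
console.log(unauthorizedAttempt) // { error: 'Not allowed to change member roles', targetRole: 'member' }
```

Each check asserts two things: the call was rejected, and the row that must not change did not change. Checking only for the error would still pass if the update ran before the rejection.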
Comment Ranking Is Not Risk Ranking
AI feedback often arrives as a list of comments.
Some comments are local and easy to explain. Others are system-level and uncertain. The local comments may sound more confident because they are closer to the visible code.
That can invert priority:
| Comment | Easy to generate? | Production impact |
|---|---|---|
| Use a custom error type | high | low to medium |
| Return the updated row | high | low |
| Add test for missing target | high | medium |
| Scope membership lookup by account | medium | high |
| Define actor authorization policy | lower | high |
| Make audit and update transactional | medium | high |
The review tool may display all of these as similar comments. Humans still need to label which ones block merge.
Custom Instructions Help, But They Are Not A Policy System
Custom instructions can improve AI review because they make local expectations more visible.
For this codebase, a useful instruction might be:
When reviewing membership, billing, or authorization changes:
- check every query for tenant/account scoping
- check actor permission before state changes
- check audit events for transaction boundaries
- ask for tests around unauthorized actors and cross-account attempts
That is much better than a generic instruction like:
Review this code carefully for security and best practices.
GitHub's documentation notes that custom instructions can help Copilot understand coding style and practices, and that repository context can improve review usefulness. The same overview also says Copilot feedback should be validated and supplemented with human review. See the Copilot code review overview.
So custom instructions are useful context. They are not enforcement.
If a rule matters enough to protect production, prefer encoding it in one of these places too:
- database constraints
- authorization middleware
- reusable policy functions
- integration tests
- contract tests
- static analysis
- deployment checks
- pull request templates
AI review should point at the rule. It should not be the only place the rule exists.
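As one illustration of a reusable policy function, the membership rule from the earlier example could live in a single pure function that every caller goes through. The name `canChangeRole`, the `MemberRef` shape, and the reason strings are assumptions made for this sketch, and the policy is deliberately minimal.

```typescript
type Role = 'member' | 'admin'
type MemberRef = { accountId: string; userId: string; role: Role }
type PolicyDecision = { allowed: true } | { allowed: false; reason: string }

// Hypothetical policy: only an admin of the same account may change a role.
function canChangeRole(actor: MemberRef, target: MemberRef): PolicyDecision {
  if (actor.accountId !== target.accountId) {
    return { allowed: false, reason: 'cross-account change' }
  }
  if (actor.role !== 'admin') {
    return { allowed: false, reason: 'actor is not an admin' }
  }
  return { allowed: true }
}

const adminInA: MemberRef = { accountId: 'acct-a', userId: 'u1', role: 'admin' }
const memberInA: MemberRef = { accountId: 'acct-a', userId: 'u2', role: 'member' }
const memberInB: MemberRef = { accountId: 'acct-b', userId: 'u3', role: 'member' }

console.log(canChangeRole(adminInA, memberInA)) // { allowed: true }
console.log(canChangeRole(memberInA, adminInA)) // { allowed: false, reason: 'actor is not an admin' }
console.log(canChangeRole(adminInA, memberInB)) // { allowed: false, reason: 'cross-account change' }
```

Because the function is pure, it can be unit-tested without a database, and a reviewer (human or AI) has one named place to check when the policy changes.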
Where AI Review Is Actually Useful
The right conclusion is not "do not use AI review."
The right conclusion is "do not confuse AI review with ownership."
AI review is useful for:
- finding suspicious local code
- asking for missing obvious tests
- spotting inconsistent naming
- noticing duplicated logic
- suggesting clearer error handling
- catching simple edge cases
- summarizing unfamiliar code paths
- giving early feedback before a human review
It is weaker at owning:
- business invariants
- permission boundaries
- tenant isolation
- rollout risk
- incident history
- regulatory constraints
- product behavior trade-offs
- operational failure modes
- whether a test models the right reality
That split matters because AI feedback can be both helpful and incomplete.
If a reviewer treats AI comments as a first pass, the tool improves coverage. If a reviewer treats them as approval, the tool can reduce attention exactly where the system needs human judgment.
GitHub's docs are direct about this principle: Copilot feedback must be validated carefully and supplemented by human review. GitHub also says pull requests produced by Copilot deserve the same thorough review as other contributions. See GitHub's guide to reviewing Copilot output.
A Better Workflow For AI-Assisted Review
A safer AI-review workflow is simple, but it has to be explicit.
1. Treat AI Review As Pre-Review
Let AI review run early, even on drafts if your workflow supports it.
Use it to remove obvious distractions before human review:
- missing branch tests
- unclear errors
- repeated code
- inconsistent naming
- simple null handling
- comments that reveal confusing code
That makes the human review cleaner.
It should not make the human review optional.
2. Require A Risk Statement In The Pull Request
Every non-trivial pull request should say what risk needs review.
For the membership example:
Human review needed for:
- tenant scoping on every membership query
- actor authorization before role update
- audit transaction boundary
- unauthorized and cross-account tests
The point is not ceremony. The point is to stop review from drifting toward whatever comments are easiest to make.
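A team that wants this to be more than a convention could check for the risk statement mechanically. The sketch below assumes the heading wording from the example above and a hypothetical helper named `riskItemsIn`. It only detects that a statement exists; it cannot judge whether the stated risks are the right ones.

```typescript
// Heading assumed from the example above; a team would pick its own wording.
const riskStatementHeading = 'Human review needed for:'

// Returns the bullet items listed directly under the heading,
// or an empty array when the heading is absent.
function riskItemsIn(description: string): string[] {
  const lines = description.split('\n').map((line) => line.trim())
  const start = lines.indexOf(riskStatementHeading)
  if (start === -1) return []
  const items: string[] = []
  for (const line of lines.slice(start + 1)) {
    if (!line.startsWith('- ')) break
    items.push(line.slice(2))
  }
  return items
}

const thinDescription = 'Add endpoint for changing member roles.'
const explicitDescription = [
  'Add role-change endpoint for account members.',
  'Human review needed for:',
  '- tenant scoping on every membership query',
  '- actor authorization before role update',
].join('\n')

console.log(riskItemsIn(thinDescription).length) // 0
console.log(riskItemsIn(explicitDescription).length) // 2
```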
3. Assign Human Owners To The Hidden Context
AI can inspect visible code. Humans must own hidden context.
For a risky change, assign reviewers by risk:
| Risk area | Human owner |
|---|---|
| authorization policy | engineer familiar with account permissions |
| data model | engineer familiar with membership and tenant tables |
| API behavior | engineer familiar with client contracts |
| rollout and support | engineer familiar with operational impact |
This is not needed for every small change. It is needed when the pull request touches a boundary where local correctness is not enough.
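One possible way to decide when this applies is to map changed file paths to the risk areas in the table. The path prefixes below are invented for illustration and would differ in every repository; only the risk area names come from the table.

```typescript
type RiskArea = 'authorization policy' | 'data model' | 'API behavior' | 'rollout and support'

// Hypothetical path prefixes; a real repository would define its own.
const riskAreaByPathPrefix: Array<{ prefix: string; area: RiskArea }> = [
  { prefix: 'src/auth/', area: 'authorization policy' },
  { prefix: 'prisma/', area: 'data model' },
  { prefix: 'src/api/', area: 'API behavior' },
  { prefix: 'deploy/', area: 'rollout and support' },
]

// Returns each risk area touched by at least one changed path, without duplicates.
function riskAreasForChange(changedPaths: string[]): RiskArea[] {
  const areas: RiskArea[] = []
  for (const rule of riskAreaByPathPrefix) {
    const touched = changedPaths.some((changedPath) => changedPath.startsWith(rule.prefix))
    if (touched && areas.indexOf(rule.area) === -1) areas.push(rule.area)
  }
  return areas
}

console.log(riskAreasForChange(['src/auth/roles.ts', 'src/api/members.ts', 'README.md']))
// [ 'authorization policy', 'API behavior' ]
console.log(riskAreasForChange(['README.md'])) // []
```

An empty result is only a hint that no known boundary was touched. A path-based rule cannot see a risky change made in an unexpected file.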
4. Label AI And Human Comments By Impact
Use severity labels:
Blocking:
Important:
Suggestion:
Nit:
Question:
If an AI comment is useful but not blocking, say so.
If a human reviewer finds a system risk, label it as blocking and explain why.
This prevents harmless AI comments from crowding out the one issue that decides whether the pull request is safe.
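If comments carry these labels as a text prefix, triage can be partly mechanical. The sketch below assumes a "Label: text" convention and hypothetical helpers (`parseReviewComment`, `hasBlockingComment`, `sortBySeverity`). Treating the order of the list above as a ranking is also an assumption; a team might rank questions higher.

```typescript
type SeverityLabel = 'Blocking' | 'Important' | 'Suggestion' | 'Nit' | 'Question'
type LabeledComment = { severity: SeverityLabel | null; text: string }

// Assumed ranking: the order in which the labels are listed above.
const severityOrder: SeverityLabel[] = ['Blocking', 'Important', 'Suggestion', 'Nit', 'Question']

// Splits "Label: text" into its parts; unlabeled comments get severity null.
function parseReviewComment(raw: string): LabeledComment {
  for (const label of severityOrder) {
    const prefix = `${label}:`
    if (raw.startsWith(prefix)) {
      return { severity: label, text: raw.slice(prefix.length).trim() }
    }
  }
  return { severity: null, text: raw.trim() }
}

// A pull request stays merge-blocked while any comment is labeled Blocking.
function hasBlockingComment(comments: LabeledComment[]): boolean {
  return comments.some((comment) => comment.severity === 'Blocking')
}

// Returns a sorted copy so blocking issues are read first; unlabeled comments go last.
function sortBySeverity(comments: LabeledComment[]): LabeledComment[] {
  const rank = (comment: LabeledComment) =>
    comment.severity === null ? severityOrder.length : severityOrder.indexOf(comment.severity)
  return [...comments].sort((a, b) => rank(a) - rank(b))
}

const triagedComments = sortBySeverity(
  [
    'Suggestion: use a custom error class instead of a generic Error.',
    'Important: add a test for the "member not found" branch.',
    'Blocking: scope the membership lookup by accountId.',
    'Question: should this function return the updated membership?',
  ].map(parseReviewComment),
)

console.log(triagedComments.map((comment) => comment.severity))
// [ 'Blocking', 'Important', 'Suggestion', 'Question' ]
console.log(hasBlockingComment(triagedComments)) // true
```

The label itself still comes from a person. The code only keeps a correctly labeled blocking comment from being buried under easier ones.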
5. Re-Review After Meaningful Changes
AI review can become stale after a pull request changes.
If the author rewrites the core function after the first AI pass, request another automated pass if the tool does not run automatically. Then repeat the human risk check.
The order matters: the automated pass comes first, then the human check. Automated re-review does not replace human re-review when the risk has changed.
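One way to notice a stale pass is to compare the commit each reviewer last looked at with the current head of the pull request. The sketch below assumes a hypothetical `ReviewPass` record and treats any new commit as a change. Whether that change is meaningful remains a human call.

```typescript
type ReviewerKind = 'ai' | 'human'
type ReviewPass = { reviewer: ReviewerKind; reviewedCommit: string }

// Returns the reviewer kinds whose latest pass does not cover the head commit.
// A kind with no pass at all is also reported.
function staleReviewers(passes: ReviewPass[], headCommit: string): ReviewerKind[] {
  const kinds: ReviewerKind[] = ['ai', 'human']
  return kinds.filter((kind) => {
    const own = passes.filter((pass) => pass.reviewer === kind)
    const latest = own[own.length - 1] // passes are assumed to be in chronological order
    return latest === undefined || latest.reviewedCommit !== headCommit
  })
}

const reviewPasses: ReviewPass[] = [
  { reviewer: 'ai', reviewedCommit: 'c1' },
  { reviewer: 'human', reviewedCommit: 'c1' },
  { reviewer: 'ai', reviewedCommit: 'c2' }, // automated re-review after a rewrite
]

console.log(staleReviewers(reviewPasses, 'c2')) // [ 'human' ]
```

In this example the automated pass is current but the human risk check is not, which is the situation the rule above is meant to catch.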
What To Measure Instead Of Comment Volume
If the team measures AI review by comment count, the tool will look successful almost immediately.
Better measurements are closer to outcomes:
| Measurement | Why it matters |
|---|---|
| escaped defect class | shows what review still misses |
| rollback rate after reviewed changes | reveals whether review is reducing bad merges |
| repeated review findings | shows rules that should become tests or automation |
| time from PR open to first useful human comment | separates speed from signal |
| percentage of risky PRs with explicit risk statements | checks whether review has a target |
| production incidents linked to reviewed PRs | shows whether review matched real failure modes |
The goal is not to prove AI review is bad or good.
The goal is to see whether it changes the failure modes that matter.
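Two of the measurements in the table reduce to simple ratios once pull requests are tagged. The sketch below assumes a hypothetical `ReviewedPullRequest` record with hand-assigned flags and invented sample data. How a team decides that a pull request is risky, or that a rollback counts, is outside the sketch.

```typescript
type ReviewedPullRequest = {
  risky: boolean // touched a boundary such as authorization or tenant data
  hasRiskStatement: boolean // description names the risk to review
  rolledBack: boolean // change was rolled back after merge
}

// Computes the rollback rate over all reviewed pull requests and the share
// of risky pull requests that carried an explicit risk statement.
function reviewOutcomeMetrics(prs: ReviewedPullRequest[]) {
  const ratio = (part: number, whole: number) => (whole === 0 ? 0 : part / whole)
  const risky = prs.filter((pr) => pr.risky)
  return {
    rollbackRate: ratio(prs.filter((pr) => pr.rolledBack).length, prs.length),
    riskStatementCoverage: ratio(risky.filter((pr) => pr.hasRiskStatement).length, risky.length),
  }
}

// Invented sample data for illustration only.
const sampleMetrics = reviewOutcomeMetrics([
  { risky: true, hasRiskStatement: true, rolledBack: false },
  { risky: true, hasRiskStatement: false, rolledBack: true },
  { risky: false, hasRiskStatement: false, rolledBack: false },
  { risky: false, hasRiskStatement: false, rolledBack: false },
])

console.log(sampleMetrics) // { rollbackRate: 0.25, riskStatementCoverage: 0.5 }
```

Neither number mentions comment count, which is the point: both can stay flat while comment volume rises.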
Practical Checklist
Before relying on AI review, ask:
- Does the pull request description name the risky behavior?
- Are the business invariants visible in code, tests, or instructions?
- Are AI comments separated from human blocking concerns?
- Did a human inspect authorization, data boundaries, retries, transactions, and API contracts where relevant?
- Do tests cover the dangerous path, not only the happy path?
- Did meaningful follow-up changes get reviewed again?
- Are recurring AI findings being turned into tests, linters, or reusable rules?
For decisions where reviewers disagree about ownership, trade-offs, or acceptable risk, use the decision framework in How Software Engineers Make Decisions. AI can summarize options, but humans still decide which trade-off the system should accept.
Final Takeaway
AI code review is most valuable when it reduces mechanical review effort and helps humans spend more attention on the dangerous parts of a change.
It becomes risky when teams treat generated feedback as a substitute for judgment.
The practical rule is simple:
AI can expand review coverage. Humans still own the risk.
That means clear pull request descriptions, explicit review targets, severity-labeled comments, tests tied to production failure modes, and human reviewers assigned to the context the tool cannot safely infer.
When that boundary is clear, AI review can make the process faster without making the team careless.