Why AI Code Review Misses Real Risks

AI code review misses real risks when teams treat generated comments as evidence that the risky part of a pull request was inspected.

The tool may find useful issues. It may suggest cleaner error handling, missing tests, suspicious null handling, repeated code, or a confusing branch. Those comments can be worth fixing.

But the highest-risk part of a change is often not the line that looks odd in the diff. It is the business invariant, permission boundary, data contract, rollout assumption, or failure mode that lives around the diff.

That is why AI review can make a pull request feel reviewed while production risk remains mostly unchanged.

For the broader code-review workflow problem, read "Code Review Antipatterns That Slow Teams Down". This article focuses on the AI-specific failure mode: comment volume improves faster than judgment.


Why AI Review Looks Strong At First

AI review is attractive because it improves the visible parts of review.

It can respond quickly. It can scan every pull request. It can comment without waiting for a human reviewer to find time. It can suggest code changes that are easy to apply.

GitHub's Copilot code review documentation describes this basic shape: Copilot reviews code, provides feedback, and can offer suggested changes in pull requests. See GitHub's overview of Copilot code review.

That makes AI review feel like a clear improvement over an empty review queue.

The trap is that the tool improves the review surface before it necessarily improves the review outcome.

Good code review is not measured by:

  • number of comments
  • speed of first feedback
  • percentage of AI suggestions accepted
  • whether every pull request had an automated review
  • whether the diff looks cleaner after suggestions

Those are process signals. They do not prove that review reduced the uncertainty that mattered.

A better question is:

Did review inspect the risk that would hurt the system if this merged today?

That question is much harder for AI to answer because the important context is often outside the changed lines.


The Review Situation

Imagine a pull request that adds an internal endpoint for changing a member's role in an account.

The change is small. The endpoint is behind authentication. The tests pass. AI review leaves several comments within a minute.

The core function looks like this:

membership-service.ts
type ChangeRoleInput = {
  actorUserId: string
  targetUserId: string
  accountId: string
  role: 'member' | 'admin'
}

export async function changeMemberRole(input: ChangeRoleInput) {
  const target = await db.membership.findFirst({
    where: {
      userId: input.targetUserId,
    },
  })

  if (!target) {
    throw new Error('Member not found')
  }

  await db.membership.update({
    where: { id: target.id },
    data: { role: input.role },
  })

  await db.auditEvent.create({
    data: {
      actorUserId: input.actorUserId,
      targetUserId: input.targetUserId,
      accountId: input.accountId,
      action: 'membership.role_changed',
    },
  })
}

The code is not obviously broken at a glance. It has typed input, a missing-target check, an update, and an audit event.

An AI reviewer might leave comments like:

Suggestion: use a custom error class instead of a generic Error.

Suggestion: include the previous role and new role in the audit event for better traceability.

Important: add a test for the "member not found" branch.

Question: should this function return the updated membership?

Those comments are not useless. The audit suggestion is especially reasonable.

But none of them names the production risk.


What The AI Review Misses

The dangerous issues are not style problems. They are system invariants the code never checks.

The function changes a role, but it does not prove:

  • the target membership belongs to input.accountId
  • the actor belongs to the same account
  • the actor has permission to grant the requested role
  • the actor is not escalating a peer across an account boundary
  • the role change and audit event commit together
  • the tests cover cross-account attempts and unauthorized actors

The missing account scope is especially subtle:

const target = await db.membership.findFirst({
  where: {
    userId: input.targetUserId,
  },
})

If the same user can belong to multiple accounts, this lookup can select a membership from the wrong account. The later audit event records input.accountId, but the updated row may belong somewhere else.

That is not a local code-style issue. It is a tenant-boundary issue.

The safer shape starts by making the invariant explicit:

membership-service.ts
export async function changeMemberRole(input: ChangeRoleInput) {
  // Scope both lookups by accountId so neither can match a row from another account.
  const actor = await db.membership.findFirst({
    where: {
      accountId: input.accountId,
      userId: input.actorUserId,
    },
  })

  const target = await db.membership.findFirst({
    where: {
      accountId: input.accountId,
      userId: input.targetUserId,
    },
  })

  // A missing actor and a missing target are reported the same way.
  if (!actor || !target) {
    throw new Error('Member not found')
  }

  // Authorize the actor before any state changes.
  if (actor.role !== 'admin') {
    throw new Error('Not allowed to change member roles')
  }

  // Run the role update and the audit event in one transaction
  // so they commit or roll back together.
  await db.$transaction([
    db.membership.update({
      where: { id: target.id },
      data: { role: input.role },
    }),
    db.auditEvent.create({
      data: {
        actorUserId: input.actorUserId,
        targetUserId: input.targetUserId,
        accountId: input.accountId,
        action: 'membership.role_changed',
        metadata: {
          previousRole: target.role,
          nextRole: input.role,
        },
      },
    }),
  ])
}

This example is still simplified. Real systems usually need more policy: owner-only actions, self-demotion rules, protected accounts, invite states, support impersonation, audit retention, and API error contracts.

That is the point.

The real risk comes from rules the diff does not fully explain.


Why The Miss Happens

AI review does not miss these risks because it is useless. It misses them because code review depends on context that may not be in the prompt, the diff, or the repository conventions.

GitHub's responsible-use documentation for Copilot code review explains that the reviewed changes, relevant context such as pull request title and body, and custom instructions are combined into a prompt before a model generates feedback. See GitHub's responsible use guide for Copilot code review.

That input model is important. If the risky rule is not visible in the diff, not stated in the pull request, not encoded in tests, not captured in custom instructions, and not obvious from nearby code, the AI reviewer has to infer it.

Sometimes it will infer well. Sometimes it will produce plausible comments around the risk instead of naming the risk itself.

The Diff Hides The Business Rule

A human reviewer may know that account membership is tenant-scoped because they remember the data model, an incident, or a support escalation.

AI review may see only a function that updates a row.

The code path looks small. The hidden rule is large.

The Pull Request Description Is Too Thin

This description gives AI and humans very little to work with:

Add endpoint for changing member roles.

A better description exposes the review target:

Add role-change endpoint for account members.

Risk:
This touches account authorization. The dangerous cases are cross-account updates,
non-admin actors granting admin, and audit records that do not match the committed row.

Review focus:
1. Is every membership lookup scoped by accountId?
2. Does the actor permission check happen before the role update?
3. Are the role update and audit event committed together?
4. Do tests cover unauthorized actor and cross-account target attempts?

This is useful for human reviewers. It also gives AI review a better chance of looking in the right direction.

The Tests Prove The Wrong Reality

A test like this makes the change look safe:

it('changes a member role', async () => {
  await changeMemberRole({
    actorUserId: admin.id,
    targetUserId: member.id,
    accountId: account.id,
    role: 'admin',
  })

  await expectRole(member.id, 'admin')
})

It proves only the happy path.

The review should ask for tests tied to the actual risk:

Risk | Test evidence
Cross-account target update | target user belongs to a different account and must not be updated
Unauthorized actor | non-admin actor cannot grant admin
Audit mismatch | role update and audit event commit or roll back together
API contract | caller receives a stable forbidden/not-found response
Regression protection | test fails if accountId is removed from the lookup

This connects to the broader testing problem in "Why Tests Pass but Production Still Breaks": tests can pass while modeling the wrong production condition.
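The first, second, and last rows of the table can be sketched without a test framework or a database. This is a minimal illustration under stated assumptions: the `Store` type, the synchronous `changeMemberRole`, and the `seed`, `roleOf`, and `attempt` helpers are hypothetical in-memory stand-ins, not the service's real code. The seed data deliberately lists bob's membership in the other account first, so the last check would fail if `accountId` were dropped from the lookup.

```typescript
// Hypothetical in-memory stand-ins for the database; not a real ORM API.
type Role = 'member' | 'admin'
type Membership = { id: number; accountId: string; userId: string; role: Role }
type AuditEvent = { accountId: string; actorUserId: string; targetUserId: string; nextRole: Role }
type Store = { memberships: Membership[]; auditEvents: AuditEvent[] }
type ChangeRoleInput = { actorUserId: string; targetUserId: string; accountId: string; role: Role }

// Synchronous sketch of the account-scoped, admin-only role change.
function changeMemberRole(store: Store, input: ChangeRoleInput): void {
  const findInAccount = (userId: string): Membership | undefined =>
    store.memberships.find((m) => m.accountId === input.accountId && m.userId === userId)

  const actor = findInAccount(input.actorUserId)
  const target = findInAccount(input.targetUserId)
  if (!actor || !target) throw new Error('Member not found')
  if (actor.role !== 'admin') throw new Error('Not allowed to change member roles')

  target.role = input.role
  store.auditEvents.push({
    accountId: input.accountId,
    actorUserId: input.actorUserId,
    targetUserId: input.targetUserId,
    nextRole: input.role,
  })
}

// bob belongs to both accounts, and his acct-b row comes first on purpose:
// a lookup by userId alone would pick the wrong row.
function seed(): Store {
  return {
    memberships: [
      { id: 1, accountId: 'acct-b', userId: 'bob', role: 'member' },
      { id: 2, accountId: 'acct-a', userId: 'alice', role: 'admin' },
      { id: 3, accountId: 'acct-a', userId: 'bob', role: 'member' },
      { id: 4, accountId: 'acct-b', userId: 'carol', role: 'member' },
    ],
    auditEvents: [],
  }
}

function roleOf(store: Store, accountId: string, userId: string): Role | undefined {
  return store.memberships.find((m) => m.accountId === accountId && m.userId === userId)?.role
}

// Runs one attempt against fresh data and reports whether it was rejected.
function attempt(input: ChangeRoleInput): { rejected: boolean; store: Store } {
  const store = seed()
  try {
    changeMemberRole(store, input)
    return { rejected: false, store }
  } catch {
    return { rejected: true, store }
  }
}

// alice administers acct-a; carol exists only in acct-b.
const crossAccount = attempt({ actorUserId: 'alice', targetUserId: 'carol', accountId: 'acct-a', role: 'admin' })
// bob is a plain member of acct-a and tries to promote himself.
const unauthorized = attempt({ actorUserId: 'bob', targetUserId: 'bob', accountId: 'acct-a', role: 'admin' })
// alice promotes bob inside acct-a only.
const scoped = attempt({ actorUserId: 'alice', targetUserId: 'bob', accountId: 'acct-a', role: 'admin' })

const riskChecks = {
  crossAccountTargetUntouched:
    crossAccount.rejected && roleOf(crossAccount.store, 'acct-b', 'carol') === 'member',
  nonAdminActorRejected:
    unauthorized.rejected &&
    unauthorized.store.auditEvents.length === 0 &&
    roleOf(unauthorized.store, 'acct-a', 'bob') === 'member',
  onlyRequestedAccountChanged:
    !scoped.rejected &&
    roleOf(scoped.store, 'acct-a', 'bob') === 'admin' &&
    roleOf(scoped.store, 'acct-b', 'bob') === 'member',
}
```

In a real suite each entry in `riskChecks` would be its own test case against the actual service and database, and the audit-mismatch row would need a test that forces the audit write to fail inside the transaction.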

Comment Ranking Is Not Risk Ranking

AI feedback often arrives as a list of comments.

Some comments are local and easy to explain. Others are system-level and uncertain. The local comments may sound more confident because they are closer to the visible code.

That can invert priority:

Comment | Easy to generate? | Production impact
Use a custom error type | high | low to medium
Return the updated row | high | low
Add test for missing target | high | medium
Scope membership lookup by account | medium | high
Define actor authorization policy | lower | high
Make audit and update transactional | medium | high

The review tool may display all of these as similar comments. Humans still need to label which ones block merge.


Custom Instructions Help, But They Are Not A Policy System

Custom instructions can improve AI review because they make local expectations more visible.

For this codebase, a useful instruction might be:

When reviewing membership, billing, or authorization changes:
- check every query for tenant/account scoping
- check actor permission before state changes
- check audit events for transaction boundaries
- ask for tests around unauthorized actors and cross-account attempts

That is much better than a generic instruction like:

Review this code carefully for security and best practices.

GitHub's documentation notes that custom instructions can help Copilot understand coding style and practices, and that repository context can improve review usefulness. The same overview also says Copilot feedback should be validated and supplemented with human review. See the Copilot code review overview.

So custom instructions are useful context. They are not enforcement.

If a rule matters enough to protect production, prefer encoding it in one of these places too:

  • database constraints
  • authorization middleware
  • reusable policy functions
  • integration tests
  • contract tests
  • static analysis
  • deployment checks
  • pull request templates

AI review should point at the rule. It should not be the only place the rule exists.
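One way to make the tenant-scoping rule exist outside review comments is a lookup helper whose type requires `accountId`. The sketch below assumes a hypothetical `findMembership` helper over an in-memory array; in a real codebase the same idea would wrap the actual data-access layer. With type checking enabled, a call that omits `accountId` does not compile, which the `@ts-expect-error` line demonstrates.

```typescript
type Role = 'member' | 'admin'
type Membership = { id: number; accountId: string; userId: string; role: Role }

// Both fields are required, so a lookup without accountId is a compile-time error.
type ScopedMembershipQuery = { accountId: string; userId: string }

// Hypothetical helper: the single sanctioned way to look up a membership.
function findMembership(
  rows: readonly Membership[],
  query: ScopedMembershipQuery,
): Membership | undefined {
  return rows.find((m) => m.accountId === query.accountId && m.userId === query.userId)
}

// bob belongs to two accounts with different roles.
const rows: Membership[] = [
  { id: 1, accountId: 'acct-b', userId: 'bob', role: 'admin' },
  { id: 2, accountId: 'acct-a', userId: 'bob', role: 'member' },
]

// The scoped lookup returns bob's acct-a membership, not the first row for bob.
const scopedHit = findMembership(rows, { accountId: 'acct-a', userId: 'bob' })

// @ts-expect-error accountId is missing, so the type checker rejects this call.
const unscopedAttempt = findMembership(rows, { userId: 'bob' })
```

The design choice is that the rule fails at build time for every caller, whether or not any reviewer, human or AI, happens to notice the missing filter.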


Where AI Review Is Actually Useful

The right conclusion is not "do not use AI review."

The right conclusion is "do not confuse AI review with ownership."

AI review is useful for:

  • finding suspicious local code
  • asking for missing obvious tests
  • spotting inconsistent naming
  • noticing duplicated logic
  • suggesting clearer error handling
  • catching simple edge cases
  • summarizing unfamiliar code paths
  • giving early feedback before a human review

It is weaker at owning:

  • business invariants
  • permission boundaries
  • tenant isolation
  • rollout risk
  • incident history
  • regulatory constraints
  • product behavior trade-offs
  • operational failure modes
  • whether a test models the right reality

That split matters because AI feedback can be both helpful and incomplete.

If a reviewer treats AI comments as a first pass, the tool improves coverage. If a reviewer treats them as approval, the tool can reduce attention exactly where the system needs human judgment.

GitHub's docs are direct about this principle: Copilot feedback must be validated carefully and supplemented by human review. GitHub also says pull requests produced by Copilot deserve the same thorough review as other contributions. See GitHub's guide to reviewing Copilot output.


A Better Workflow For AI-Assisted Review

A safer AI-review workflow is simple, but it has to be explicit.

1. Treat AI Review As Pre-Review

Let AI review run early, even on drafts if your workflow supports it.

Use it to remove obvious distractions before human review:

  • missing branch tests
  • unclear errors
  • repeated code
  • inconsistent naming
  • simple null handling
  • comments that reveal confusing code

That makes the human review cleaner.

It should not make the human review optional.

2. Require A Risk Statement In The Pull Request

Every non-trivial pull request should say what risk needs review.

For the membership example:

Human review needed for:
- tenant scoping on every membership query
- actor authorization before role update
- audit transaction boundary
- unauthorized and cross-account tests

The point is not ceremony. The point is to stop review from drifting toward whatever comments are easiest to make.
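A requirement like this is easier to keep when something checks it. The sketch below assumes the pull request body uses the "Human review needed for:" header from the example, followed by "- " bullets; the function names are hypothetical, and how the body text reaches the check (CI job, bot, webhook) is left out.

```typescript
// Header text taken from the example above; adjust to your own template.
const RISK_HEADER = 'Human review needed for:'

// Returns the bullet items listed directly under the risk header, if any.
function extractRiskItems(prBody: string): string[] {
  const lines = prBody.split(/\r?\n/).map((line) => line.trim())
  const start = lines.indexOf(RISK_HEADER)
  if (start === -1) return []

  const items: string[] = []
  for (const line of lines.slice(start + 1)) {
    // Stop at the first line that is not a bullet.
    if (!line.startsWith('- ')) break
    items.push(line.slice(2).trim())
  }
  return items
}

// A pull request has a risk statement when at least one item is listed.
function hasRiskStatement(prBody: string): boolean {
  return extractRiskItems(prBody).length > 0
}

const thinBody = 'Add endpoint for changing member roles.'
const explicitBody = [
  'Add role-change endpoint for account members.',
  '',
  'Human review needed for:',
  '- tenant scoping on every membership query',
  '- actor authorization before role update',
  '- audit transaction boundary',
  '- unauthorized and cross-account tests',
].join('\n')

const thinHasStatement = hasRiskStatement(thinBody)
const explicitItems = extractRiskItems(explicitBody)
```

A check like this only proves that a statement exists. Whether the statement names the right risk is still a human call.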

3. Assign Human Owners To The Hidden Context

AI can inspect visible code. Humans must own hidden context.

For a risky change, assign reviewers by risk:

Risk area | Human owner
authorization policy | engineer familiar with account permissions
data model | engineer familiar with membership and tenant tables
API behavior | engineer familiar with client contracts
rollout and support | engineer familiar with operational impact

This is not needed for every small change. It is needed when the pull request touches a boundary where local correctness is not enough.

4. Label AI And Human Comments By Impact

Use severity labels:

Blocking:
Important:
Suggestion:
Nit:
Question:

If an AI comment is useful but not blocking, say so.

If a human reviewer finds a system risk, label it as blocking and explain why.

This prevents harmless AI comments from crowding out the one issue that decides whether the pull request is safe.
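The labels become more useful when a script can read them. The following sketch assumes each comment starts with one of the labels above followed by a colon; the `ReviewComment` shape and the `triage` function are hypothetical and not tied to any review tool's API.

```typescript
type Severity = 'Blocking' | 'Important' | 'Suggestion' | 'Nit' | 'Question'
type ReviewComment = { author: 'ai' | 'human'; body: string }

const SEVERITIES: readonly Severity[] = ['Blocking', 'Important', 'Suggestion', 'Nit', 'Question']

// Reads the leading "Label:" prefix; returns undefined when a comment is unlabeled.
function parseSeverity(body: string): Severity | undefined {
  const text = body.trim()
  return SEVERITIES.find((label) => text.startsWith(label + ':'))
}

// Separates merge-blocking comments from everything else and flags unlabeled ones.
function triage(comments: ReviewComment[]) {
  const blocking = comments.filter((c) => parseSeverity(c.body) === 'Blocking')
  const unlabeled = comments.filter((c) => parseSeverity(c.body) === undefined)
  return { blocking, unlabeled, mergeBlocked: blocking.length > 0 }
}

const triaged = triage([
  { author: 'ai', body: 'Suggestion: use a custom error class instead of a generic Error.' },
  { author: 'ai', body: 'Important: add a test for the "member not found" branch.' },
  { author: 'human', body: 'Blocking: membership lookup is not scoped by accountId.' },
  { author: 'ai', body: 'Consider returning the updated membership.' },
])
```

Here three AI comments and one human comment arrive together, but only the human comment blocks the merge, and the unlabeled AI comment is surfaced so someone can assign it a severity.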

5. Re-Review After Meaningful Changes

AI review can become stale after a pull request changes.

If the author rewrites the core function after the first AI pass, request another automated pass if the tool does not run automatically. Then repeat the human risk check.

The key detail is order: the automated pass comes first, and it does not replace human re-review when the risk has changed.
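A stale review can be detected by recording which commit each review covered. This is a simplified sketch with hypothetical types: it treats every new commit as a meaningful change, which a real workflow would likely narrow to changes that touch risky files.

```typescript
type Reviewer = 'ai' | 'human'
type ReviewRecord = { reviewer: Reviewer; reviewedCommit: string }

// A review is fresh only if it covered the current head commit.
function isFresh(reviews: ReviewRecord[], reviewer: Reviewer, headCommit: string): boolean {
  return reviews.some((r) => r.reviewer === reviewer && r.reviewedCommit === headCommit)
}

// Lists which kinds of review must run again before merge.
// A fresh AI review never counts as a fresh human review.
function reReviewNeeded(headCommit: string, reviews: ReviewRecord[]): Reviewer[] {
  const kinds: Reviewer[] = ['ai', 'human']
  return kinds.filter((kind) => !isFresh(reviews, kind, headCommit))
}

// The author pushed commit c2 after the human reviewed c1; the AI pass re-ran on c2.
const pending = reReviewNeeded('c2', [
  { reviewer: 'human', reviewedCommit: 'c1' },
  { reviewer: 'ai', reviewedCommit: 'c2' },
])
```

In this example `pending` contains only `'human'`: the automated pass is current, but the human risk check still has to be repeated.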


What To Measure Instead Of Comment Volume

If the team measures AI review by comment count, the tool will look successful almost immediately.

Better measurements are closer to outcomes:

Measurement | Why it matters
escaped defect class | shows what review still misses
rollback rate after reviewed changes | reveals whether review is reducing bad merges
repeated review findings | shows rules that should become tests or automation
time from PR open to first useful human comment | separates speed from signal
percentage of risky PRs with explicit risk statements | checks whether review has a target
production incidents linked to reviewed PRs | shows whether review matched real failure modes
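Two of these measurements reduce to simple ratios once merged pull requests are tagged. The sketch below assumes a hypothetical `MergedPr` record; the field names and the sample data are made up for illustration, and deciding which pull requests count as risky remains a team judgment.

```typescript
// Hypothetical record for one merged pull request.
type MergedPr = { id: number; risky: boolean; hasRiskStatement: boolean; rolledBack: boolean }

// Avoids dividing by zero when there is nothing to measure.
function ratio(count: number, total: number): number {
  return total === 0 ? 0 : count / total
}

function reviewOutcomeMetrics(prs: MergedPr[]) {
  const risky = prs.filter((pr) => pr.risky)
  return {
    // Share of risky PRs whose description named the risk.
    riskStatementCoverage: ratio(risky.filter((pr) => pr.hasRiskStatement).length, risky.length),
    // Share of all reviewed, merged PRs that were later rolled back.
    rollbackRate: ratio(prs.filter((pr) => pr.rolledBack).length, prs.length),
  }
}

// Made-up sample data.
const metrics = reviewOutcomeMetrics([
  { id: 1, risky: true, hasRiskStatement: true, rolledBack: false },
  { id: 2, risky: true, hasRiskStatement: false, rolledBack: true },
  { id: 3, risky: false, hasRiskStatement: false, rolledBack: false },
  { id: 4, risky: false, hasRiskStatement: false, rolledBack: false },
])
```

For this sample, `riskStatementCoverage` is 0.5 and `rollbackRate` is 0.25. Neither number mentions comment count, which is the point.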

The goal is not to prove AI review is bad or good.

The goal is to see whether it changes the failure modes that matter.


Practical Checklist

Before relying on AI review, ask:

  • Does the pull request description name the risky behavior?
  • Are the business invariants visible in code, tests, or instructions?
  • Are AI comments separated from human blocking concerns?
  • Did a human inspect authorization, data boundaries, retries, transactions, and API contracts where relevant?
  • Do tests cover the dangerous path, not only the happy path?
  • Did meaningful follow-up changes get reviewed again?
  • Are recurring AI findings being turned into tests, linters, or reusable rules?

For decisions where reviewers disagree about ownership, trade-offs, or acceptable risk, use the decision framework in "How Software Engineers Make Decisions". AI can summarize options, but humans still decide which trade-off the system should accept.


Final Takeaway

AI code review is most valuable when it reduces mechanical review effort and helps humans spend more attention on the dangerous parts of a change.

It becomes risky when teams treat generated feedback as a substitute for judgment.

The practical rule is simple:

AI can expand review coverage. Humans still own the risk.

That means clear pull request descriptions, explicit review targets, severity-labeled comments, tests tied to production failure modes, and human reviewers assigned to the context the tool cannot safely infer.

When that boundary is clear, AI review can make the process faster without making the team careless.