
Why Caching Causes Inconsistent Data in Production
Caching causes inconsistent data in production when the cached value no longer represents the same truth as the database, API, or service that originally produced it. The cache still makes reads faster, but the system now has two places that can answer the same question, and those answers can temporarily disagree.
That disagreement is not always a bug. Product recommendations, public catalog pages, and expensive aggregate dashboards may tolerate bounded staleness. The problem starts when the application treats cached data as if it were always current while writes, invalidation, deployments, background jobs, or multiple service instances move the real state forward.
For the broader data-correctness cluster, see the SQL And Data Correctness hub. For reliability patterns where caches are used as fallback or load reduction, see the Backend Reliability hub.
The Cache Did Not Break The Code
Caching is often introduced after a system already works.
An endpoint reads account limits from the database. It is correct, but slow under repeated traffic:
async function getAccountLimits(accountId: string) {
  return db.accountLimits.findUnique({
    where: { accountId },
    select: {
      accountId: true,
      plan: true,
      monthlyApiLimit: true,
      apiCallsUsed: true,
      updatedAt: true,
    },
  })
}
The cache-aside version looks like a harmless optimization:
async function getAccountLimits(accountId: string) {
  const key = `account-limits:${accountId}`
  // Cache hit: return the stored copy without touching the database.
  const cached = await redis.get(key)
  if (cached) {
    return JSON.parse(cached)
  }
  // Cache miss: load from the source of truth.
  const limits = await db.accountLimits.findUnique({
    where: { accountId },
    select: {
      accountId: true,
      plan: true,
      monthlyApiLimit: true,
      apiCallsUsed: true,
      updatedAt: true,
    },
  })
  // Store the result for 300 seconds (five minutes).
  await redis.set(key, JSON.stringify(limits), { EX: 300 })
  return limits
}
The first request after a miss reads from the database and stores the value for five minutes. Later requests return from Redis. Latency drops. Database reads drop. The endpoint appears unchanged.
Microsoft's Cache-Aside pattern describes this common shape: the application checks the cache, loads from the data store on a miss, then adds the item to the cache. It also warns that cached data cannot always remain consistent with the data store and that stale data must be handled deliberately. See Cache-Aside pattern.
The code did not become obviously wrong. The system gained a second timeline.
A Stale-Read Timeline
Now add a normal product action: an account upgrades from basic to pro.
The write path updates the database and deletes the cache key:
async function upgradeAccountPlan(accountId: string) {
  // Commit the source of truth first.
  await db.accountLimits.update({
    where: { accountId },
    data: {
      plan: 'pro',
      monthlyApiLimit: 1_000_000,
      updatedAt: new Date(),
    },
  })
  // Then invalidate, so the next read reloads the new plan.
  await redis.del(`account-limits:${accountId}`)
}
That order is usually the right default for cache-aside: commit the source of truth, then invalidate the cache. Microsoft calls out the order because deleting the cache before updating the data store creates a window where another request can repopulate the cache with old data.
Even with the better order, the system can still produce surprising behavior:
| Time | Event | Result |
|---|---|---|
| 10:00:00 | Request A reads basic limits and caches them for 300 sec | Cache now contains basic |
| 10:00:20 | Account upgrade starts | Database update in progress |
| 10:00:21 | Request B reads from cache while write is not finished | User still sees basic |
| 10:00:22 | Database commits pro | Source of truth says pro |
| 10:00:23 | Cache delete fails or times out | Cache still says basic |
| 10:01:00 | API quota check reads from cache | Request may be rejected using old limits |
| 10:05:00 | TTL expires | Next cache miss finally loads pro |
Nothing in that timeline requires exotic distributed systems. A cache delete can time out. A worker can crash after committing the database update. A write path can forget invalidation. A local in-memory cache on one instance can survive while another instance refreshes.
The user experiences it as inconsistent data: the billing page says the account is upgraded, while the quota check still behaves as if it is not.
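The timeline can be replayed with a small in-memory stand-in. The sketch below is an illustration under assumptions: `TtlCache`, its `failNextDelete` switch, and the manual clock are hypothetical helpers, not part of Redis or any client library. It shows only that a failed delete leaves the old value readable until the TTL runs out.

```typescript
// Hypothetical in-memory TTL cache with an injectable clock, used only to
// replay the stale-read timeline. It is not a Redis client.
type Plan = 'basic' | 'pro'

class TtlCache {
  private entries = new Map<string, { value: Plan; expiresAt: number }>()
  private now: () => number
  failNextDelete = false // simulates a delete that times out

  constructor(now: () => number) {
    this.now = now
  }

  get(key: string): Plan | undefined {
    const entry = this.entries.get(key)
    if (!entry || entry.expiresAt <= this.now()) return undefined
    return entry.value
  }

  set(key: string, value: Plan, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs })
  }

  del(key: string): boolean {
    if (this.failNextDelete) {
      this.failNextDelete = false
      return false // the delete never reached the cache
    }
    return this.entries.delete(key)
  }
}

let clockMs = 0
const cache = new TtlCache(() => clockMs)
const database = { plan: 'basic' as Plan }
const key = 'account-limits:acct_123'

// Cache-aside read: use the cache on a hit, load and store on a miss.
function readPlan(): Plan {
  const cached = cache.get(key)
  if (cached !== undefined) return cached
  cache.set(key, database.plan, 300_000)
  return database.plan
}

const atStart = readPlan() // 10:00:00, miss, caches basic
clockMs = 22_000
database.plan = 'pro' // 10:00:22, database commits pro
cache.failNextDelete = true
const deleted = cache.del(key) // 10:00:23, delete fails
clockMs = 60_000
const duringWindow = readPlan() // 10:01:00, cache still answers basic
clockMs = 300_000
const afterExpiry = readPlan() // 10:05:00, TTL expired, reloads pro
```

A test built this way can assert on `duringWindow` to make the stale window an explicit, reviewed behavior rather than a surprise.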
The Real Contract Is Freshness
The dangerous part of caching is not the cache key. It is the unstated freshness contract behind the key.
For account-limits:${accountId}, the hidden questions are:
- Which fields does this cached value represent?
- Which writes can change those fields?
- How stale may the value be before behavior becomes wrong?
- Is the value used for display, authorization, billing, quota enforcement, or background decisions?
- Does every writer know how to invalidate or refresh it?
- Can two application instances disagree about the same account?
Those questions matter more than the TTL.
Five minutes may be acceptable for a marketing page. Five seconds may be too long for account suspension, permissions, inventory, price guarantees, quota enforcement, or fraud decisions. A shorter TTL narrows the stale window, but it does not turn an implicit correctness rule into an explicit one.
Google's Memorystore guidance frames this boundary clearly: a cache is a good candidate when the data does not need to appear in the application immediately, and if a value exists only in cache, the application must behave acceptably if that value expires and disappears. See Caching data with Memorystore.
That is a useful test for every production cache: if the cached value is missing, stale, or briefly contradictory, does the system still behave safely?
Cache Keys Drift As Responses Grow
Cache keys often start accurate and become wrong later.
The first version caches only account limits:
account-limits:acct_123
Later the endpoint response grows:
- plan name
- quota limit
- current usage
- feature entitlements
- billing status
- trial expiration
- fraud hold
- organization-level override
The key stays the same. The cached value now depends on more state than the key expresses.
type AccountLimitResponse = {
  accountId: string
  plan: 'basic' | 'pro' | 'enterprise'
  monthlyApiLimit: number
  apiCallsUsed: number
  canUseBulkExport: boolean
  billingStatus: 'active' | 'past_due' | 'suspended'
  organizationOverride: 'none' | 'temporary_limit_increase'
}
Now many writes can make the cached response stale:
| Changed state | Cached field affected | Invalidation often missed because... |
|---|---|---|
| plan upgrade | plan, monthlyApiLimit | billing service owns the write path |
| usage counter | apiCallsUsed | worker updates usage asynchronously |
| feature entitlement | canUseBulkExport | flag or entitlement system is owned elsewhere |
| invoice payment failure | billingStatus | webhook handler updates billing state |
| support override | organizationOverride | admin tool writes through a different code path |
The original cache key still looks stable, but the response has become a join across multiple business concepts. If every one of those concepts needs to invalidate the same key, the cache is no longer a small performance detail. It is a distributed contract.
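One way to make that contract visible is to declare, in one place, which cache keys each kind of write makes stale. The following is a minimal sketch, assuming a hypothetical `AccountChange` union whose members mirror the table above; none of these names come from a real library.

```typescript
// Hypothetical registry: every kind of write declares which cache keys it
// makes stale. The change names mirror the rows of the table above.
type AccountChange =
  | 'plan_upgrade'
  | 'usage_counter'
  | 'feature_entitlement'
  | 'invoice_payment_failure'
  | 'support_override'

type KeyBuilder = (accountId: string) => string

const accountLimitsKey: KeyBuilder = (accountId) => `account-limits:${accountId}`

// Record<AccountChange, ...> makes the compiler reject a new change type
// that has no invalidation rule.
const keysToInvalidate: Record<AccountChange, KeyBuilder[]> = {
  plan_upgrade: [accountLimitsKey],
  usage_counter: [accountLimitsKey],
  feature_entitlement: [accountLimitsKey],
  invoice_payment_failure: [accountLimitsKey],
  support_override: [accountLimitsKey],
}

// Returns the concrete keys a given write must invalidate or refresh.
function staleKeysFor(change: AccountChange, accountId: string): string[] {
  return keysToInvalidate[change].map((build) => build(accountId))
}
```

The registry does not perform the invalidation itself. Its value is that adding a new writer without deciding its cache impact becomes a compile error instead of a production incident.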
This is similar to the way read replicas can look like a scaling fix while quietly changing read-after-write behavior. That related consistency trade-off is covered in Why Read Replicas Didn't Reduce Database Load.
Cache-Aside Races Are Easy To Miss
Cache-aside is popular because it is simple, but the race windows are real.
The most common stale write-back race looks like this:
| Step | Request A | Request B |
|---|---|---|
| 1 | Cache miss for account-limits:acct_123 | |
| 2 | Reads old basic value from database | |
| 3 | | Upgrades account to pro in database |
| 4 | | Deletes cache key |
| 5 | Writes old basic value into cache | |
The cache now contains basic even though the database says pro.
The code for Request A looked correct. It loaded from the source of truth on a miss. The problem is that the value it loaded was already outdated by the time it wrote the cache.
There are several ways to reduce this risk:
- Write the database first, then invalidate the cache.
- Give cached values short freshness windows only when the business can tolerate stale data.
- Include a version, updatedAt, or monotonic revision in the cached payload.
- Avoid caching values used for authorization, billing enforcement, or critical state transitions.
- Use compare-and-set or version checks when refreshing a key that can race with writes.
- Move important invalidation into the same transaction boundary as the state change, or into a reliable outbox event.
That last option matters when several services or workers can change the state. If a database write and an invalidation event must stay coordinated, the reliability problem starts to resemble event publication. The durable pattern for that boundary is explained in Transactional Outbox Pattern in Microservices.
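The version-check option from the list can be sketched as follows. This is one possible technique, not something the cited guidance prescribes: the writer leaves a tombstone carrying the committed revision instead of deleting the key, and readers may only write back a value that is not older than what the cache already knows. The `revision` field, the tombstone shape, and all function names are assumptions for illustration. A `Map` stands in for the cache, so the compare and the write are trivially atomic here; against Redis they would need to run atomically, for example in a Lua script or a WATCH/MULTI transaction, and tombstones would need their own expiry.

```typescript
// Hypothetical versioned cache entry. `revision` is assumed to be a
// monotonically increasing number stored with the row in the database.
type Entry =
  | { kind: 'value'; revision: number; plan: string }
  | { kind: 'tombstone'; revision: number }

const versionedCache = new Map<string, Entry>()

// Readers call this after a miss. The write is skipped when the cache already
// knows about a newer revision.
function setIfNewer(key: string, revision: number, plan: string): boolean {
  const current = versionedCache.get(key)
  if (current && revision < current.revision) return false
  versionedCache.set(key, { kind: 'value', revision, plan })
  return true
}

// Writers call this after the database commit instead of a plain delete.
function invalidate(key: string, committedRevision: number): void {
  const current = versionedCache.get(key)
  if (current && current.revision >= committedRevision) return
  versionedCache.set(key, { kind: 'tombstone', revision: committedRevision })
}

// Tombstones count as misses, so the next reader reloads from the database.
function readCached(key: string): string | undefined {
  const current = versionedCache.get(key)
  return current && current.kind === 'value' ? current.plan : undefined
}

// Replay of the race: Request A loaded revision 1 before Request B committed.
const raceKey = 'account-limits:acct_123'
invalidate(raceKey, 2) // Request B commits pro as revision 2
const staleWriteAccepted = setIfNewer(raceKey, 1, 'basic') // Request A, late
const afterRace = readCached(raceKey) // still a miss, not basic
const freshWriteAccepted = setIfNewer(raceKey, 2, 'pro') // next reader reloads
const afterReload = readCached(raceKey)
```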
TTL Is Not A Correctness Strategy
TTL is useful. It is not enough.
The TTL tells the cache when a value should expire. It does not say whether the value is safe to use for a decision.
| Cache use case | Typical stale tolerance | Safer rule |
|---|---|---|
| public article page | minutes may be acceptable | TTL is usually enough with purge on publish |
| product recommendations | minutes or hours may be fine | make stale behavior visible in ranking metrics |
| account display name | short staleness often fine | invalidate on profile update |
| quota enforcement | usually very low | read source of truth or use versioned counters |
| permissions | near zero | avoid cache or use explicit revocation/version checks |
| prices during checkout | depends on business contract | define price guarantee and fallback behavior |
| fraud or suspension state | near zero | do not rely on ordinary TTL-only cache |
TTL reduces how long stale data can survive after a missed invalidation. It does not protect the system during the TTL window.
If the product rule is "new permissions must apply immediately," then a five-second cache can still be wrong. If the product rule is "recommendations can lag by ten minutes," then a five-minute TTL may be perfectly acceptable.
Write the freshness requirement first. Choose the cache policy second.
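A freshness requirement can be written down as data before any cache policy is chosen. The sketch below assumes a hypothetical `Workflow` union and placeholder budgets; the numbers are illustrative, not recommendations, and `cachedAtMs` is an assumed field recorded when the value is stored.

```typescript
// Hypothetical freshness budgets per workflow. The numbers are placeholders
// a team would derive from product rules.
type Workflow = 'display' | 'recommendations' | 'quota_enforcement' | 'permissions'

const maxStalenessMs: Record<Workflow, number> = {
  display: 60_000,
  recommendations: 600_000,
  quota_enforcement: 0, // 0 means: always read the source of truth
  permissions: 0,
}

type CachedEnvelope<T> = { value: T; cachedAtMs: number }

// Returns the cached value only when its age fits the workflow's budget.
// A budget of 0 never accepts a cached value.
function usableFor<T>(
  entry: CachedEnvelope<T> | undefined,
  workflow: Workflow,
  nowMs: number,
): T | undefined {
  if (!entry) return undefined
  const ageMs = nowMs - entry.cachedAtMs
  return ageMs < maxStalenessMs[workflow] ? entry.value : undefined
}
```

With this shape, the same cached entry can be acceptable for a display page and rejected for quota enforcement, which is the distinction a single TTL cannot express.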
Write-Through And Write-Behind Have Different Failure Modes
Cache-aside is not the only pattern.
Write-through updates the cache when the database is updated. Write-behind accepts a write into the cache or buffer and later persists it to the database. Redis describes the key distinction directly: write-through syncs immediately, while write-behind syncs asynchronously and can leave cache and database inconsistent for a short time. See Redis: Write Behind vs Write Through.
The choice should match the correctness requirement:
| Pattern | What it optimizes | Main correctness risk |
|---|---|---|
| Cache-aside | simple read performance | stale values after missed invalidation or race windows |
| Read-through | centralizes cache loading | still needs freshness rules and invalidation |
| Write-through | keeps cache warm after writes | write path now depends on cache availability and latency |
| Write-behind/write-back | lower write latency or buffered writes | acknowledged writes can be lost or delayed before durable persistence |
| Local in-memory cache | fastest reads inside one process | instances disagree until expiry or restart |
There is no universally safest pattern. A write-through cache may improve read freshness but make writes slower or more fragile. A write-behind cache may improve write latency but weaken durability unless the queue and retry path are designed carefully. A local cache may be excellent for static configuration and risky for account state.
The mistake is choosing a pattern only by performance.
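The durability difference between write-through and write-behind shows up in a few lines when a crash is simulated. This is a toy sketch under assumptions: the two `Map` objects stand in for a durable store and a cache, `pendingWrites` stands in for the write-behind buffer, and `crash` models losing the cache process before a flush. None of it reflects how a specific product implements these patterns.

```typescript
// Hypothetical in-memory stand-ins for a durable store and a cache.
const durableStore = new Map<string, number>()
const cacheLayer = new Map<string, number>()
const pendingWrites: Array<{ key: string; value: number }> = []

// Write-through: acknowledged only after both layers are updated.
function writeThrough(key: string, value: number): void {
  durableStore.set(key, value)
  cacheLayer.set(key, value)
}

// Write-behind: acknowledged once the write is in the cache and the buffer.
function writeBehind(key: string, value: number): void {
  cacheLayer.set(key, value)
  pendingWrites.push({ key, value })
}

// A background flush persists buffered writes in order and empties the buffer.
function flush(): void {
  for (const write of pendingWrites.splice(0)) {
    durableStore.set(write.key, write.value)
  }
}

// Simulates losing the cache and the buffer before a flush happens.
function crash(): void {
  cacheLayer.clear()
  pendingWrites.length = 0
}

writeBehind('limit:acct_0', 500)
flush() // this write reaches the store before any failure

writeThrough('limit:acct_1', 1000)
writeBehind('limit:acct_2', 2000)
crash()

const flushedBehind = durableStore.get('limit:acct_0')
const survivedThrough = durableStore.get('limit:acct_1')
const survivedBehind = durableStore.get('limit:acct_2') // acknowledged, then lost
```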
Local Caches Create Per-Instance Truth
Distributed caches are not the only source of inconsistency. Local in-memory caches can be worse because each process holds its own copy.
Imagine four application instances:
app-1: account acct_123 = basic
app-2: account acct_123 = pro
app-3: cache miss, reads database
app-4: old value until restart
Requests routed to different instances can return different answers. A deploy may clear two instances before the other two. Autoscaling may create new instances with empty caches while older instances keep stale values. A bug can appear to vanish after restart because the restart accidentally cleared the stale local copy.
Microsoft's cache-aside guidance calls out local caching as a special consistency risk because private caches on different application instances can quickly diverge. That warning is easy to underestimate until production traffic starts moving across many instances.
Local caches are safest when:
- the data is static or versioned
- stale data is harmless
- the value is not tenant-specific or security-sensitive
- reload behavior is explicit
- operators can see which version each instance holds
They are risky when the cached value controls permissions, prices, limits, or state transitions.
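The "versioned" and "operators can see which version each instance holds" conditions can be sketched like this. `LocalCache`, `heldVersion`, and `staleInstances` are hypothetical names, and the sketch assumes the current version is available from some cheap shared source, such as a counter stored next to the data; how that version is distributed is not shown.

```typescript
// Hypothetical per-instance cache that remembers which version it loaded.
type Versioned<T> = { version: number; value: T }

class LocalCache<T> {
  private held: Versioned<T> | undefined
  readonly instanceId: string

  constructor(instanceId: string) {
    this.instanceId = instanceId
  }

  load(entry: Versioned<T>): void {
    this.held = entry
  }

  // Exposed so operators can see what each instance currently holds.
  heldVersion(): number | undefined {
    return this.held?.version
  }

  // Returns the value only if it matches the version the caller expects.
  getIfCurrent(expectedVersion: number): T | undefined {
    return this.held && this.held.version === expectedVersion
      ? this.held.value
      : undefined
  }
}

// Lists instances whose local copy is missing or differs from the current version.
function staleInstances<T>(instances: LocalCache<T>[], currentVersion: number): string[] {
  return instances
    .filter((instance) => instance.heldVersion() !== currentVersion)
    .map((instance) => instance.instanceId)
}

const app1 = new LocalCache<string>('app-1')
const app2 = new LocalCache<string>('app-2')
const app3 = new LocalCache<string>('app-3')
app1.load({ version: 1, value: 'basic' })
app2.load({ version: 2, value: 'pro' })

const behind = staleInstances([app1, app2, app3], 2) // app-1 is old, app-3 is empty
```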
What To Measure
Cache metrics often stop at hit ratio. Hit ratio matters, but it does not tell you whether the cache is safe.
Add correctness-oriented signals:
| Signal | Why it matters |
|---|---|
| cache hit/miss by key family | shows which workflows actually depend on the cache |
| value age at read time | reveals stale-but-not-expired behavior |
| invalidation success/failure count | catches missed deletes and timeout patterns |
| invalidation lag | measures time between database commit and cache update |
| source-of-truth bypass count | shows how often code distrusts the cache |
| read-after-write mismatch samples | proves whether critical flows see their own writes |
| cache error fallback behavior | shows whether failures become slower reads or wrong reads |
| per-instance local cache version | detects disagreement across application instances |
For example, include cache metadata in internal logs or traces:
// Assumes the cached payload carries a version field alongside updatedAt.
logger.info('account limits read', {
  accountId,
  cacheKey: key,
  cacheHit: Boolean(cached),
  cachedVersion: limits.version,
  cachedAgeMs: Date.now() - new Date(limits.updatedAt).getTime(),
  source: cached ? 'cache' : 'database',
})
The goal is not to log every value. The goal is to make stale reads explainable. During an incident, engineers should be able to answer: did this response come from cache, how old was it, and what invalidation should have happened?
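The read-after-write mismatch signal from the table can be collected by comparing a sample of cached reads against the source of truth. The sketch below is a rough illustration: the `counters` object and both functions are hypothetical, the payloads are assumed to carry a numeric `version`, and a real service would export the counts through its own metrics library.

```typescript
// Hypothetical in-process counters for sampled comparisons.
const counters = { sampled: 0, mismatched: 0 }

type VersionedRead = { version: number }

// Compares a cached read with a fresh source-of-truth read and records
// whether the cache was behind.
function recordSample(cached: VersionedRead, source: VersionedRead): boolean {
  counters.sampled += 1
  const behind = cached.version < source.version
  if (behind) counters.mismatched += 1
  return behind
}

// Fraction of sampled reads where the cache lagged the source of truth.
function mismatchRate(): number {
  return counters.sampled === 0 ? 0 : counters.mismatched / counters.sampled
}
```

Sampling keeps the extra database reads small while still producing evidence about whether critical flows see their own writes.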
A Safer Cache Design Checklist
Before adding a cache to a production path, write the contract in plain engineering terms:
- Name the source of truth.
- Name the cached value and every field it includes.
- Define the maximum acceptable staleness for each workflow using the value.
- List every write path that can make the cached value stale.
- Decide whether writes invalidate, refresh, version, or bypass the cache.
- Decide what happens when cache read, write, or delete fails.
- Add observability for cache hit, value age, invalidation failures, and fallback path.
- Add at least one read-after-write test for critical workflows.
- Roll out behind a feature flag or per-route switch when the path is important.
- Document how to disable the cache safely during an incident.
For the account-limits example, a better contract might be:
Cached value:
  account limit display model, not authorization state
Source of truth:
  account_limits table
Allowed staleness:
  display page: 60 seconds
  quota enforcement: no ordinary cache; read versioned source of truth
Invalidation:
  plan upgrade, billing suspension, support override, and entitlement change
Failure behavior:
  cache unavailable means slower database read, not stale authorization
That contract prevents a common failure: a cache introduced for display performance quietly becomes part of enforcement logic.
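The failure-behavior line of that contract can be expressed as a read helper that treats any cache error as a miss. This is a minimal sketch with hypothetical synchronous `CacheReader` and `SourceReader` types to keep it short; real cache and database clients are asynchronous, and the same structure applies with `await`.

```typescript
// Hypothetical synchronous reader types; real clients would return promises.
type CacheReader = (key: string) => string | undefined
type SourceReader = (key: string) => string

type ReadResult = {
  value: string
  source: 'cache' | 'database'
  cacheError: boolean
}

// A cache error is treated as a miss: the read gets slower, not wrong.
function readWithFallback(
  key: string,
  fromCache: CacheReader,
  fromDatabase: SourceReader,
): ReadResult {
  let cacheError = false
  try {
    const cached = fromCache(key)
    if (cached !== undefined) {
      return { value: cached, source: 'cache', cacheError }
    }
  } catch {
    // Swallow the cache failure here, but report it to the caller.
    cacheError = true
  }
  return { value: fromDatabase(key), source: 'database', cacheError }
}
```

Returning `source` and `cacheError` alongside the value lets the caller log or count fallbacks, which ties this behavior back to the signals in the measurement section.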
If the cached read happens inside a transaction-sensitive API flow, review the request boundary too. Caching can hide database reads, but it does not remove the need for correct atomic writes. That adjacent concern is covered in Database Transaction Boundaries in Backend APIs.
When Caching Is The Wrong First Fix
Caching is tempting when the database is slow, but it can hide the reason the database is slow.
Do not add a cache first when:
- an endpoint has N+1 query behavior
- a query scans far more rows than it returns
- a missing index or wrong index shape is the real issue
- the value is used for permissions or financial decisions
- writes are frequent and invalidation rules are unclear
- the team cannot observe cache freshness or invalidation failures
Fix the underlying query shape first when possible. If an endpoint runs 101 queries for 100 rows, caching may reduce some database work but leave the access pattern fragile. That is covered in N+1 Query Problem in ORMs.
If the single query is slow because the database plan is wrong, use the query-plan workflow before adding another stateful layer. Start with How to Find and Fix Slow SQL Queries in Production.
Caching is strongest after you understand the read pattern and decide that bounded staleness is an acceptable trade-off.
The Short Version
Caching improves performance by duplicating answers.
Duplicated answers create a correctness question: when the database, service, or API changes, how quickly must the cached answer change too?
Production cache bugs usually come from hidden freshness contracts:
- cache keys that no longer describe the full response
- writes that forget to invalidate or refresh a key
- cache-aside races that write old values after newer commits
- TTLs used as a substitute for correctness rules
- local caches that disagree across instances
- write-behind paths that acknowledge work before durable persistence
- missing observability for value age and invalidation failure
The fix is not "never cache." The fix is to treat the cache as part of system behavior. Name the source of truth, define the freshness budget, choose the caching pattern by correctness risk, test read-after-write paths, and make stale reads observable before production traffic depends on them.