
Two-Layer Caching Saved My Recommendation Latency

By Abhijna SP · Published April 25, 2026 · 8 min read · Source: Level Up Coding

“Why is the second request so fast?” I was staring at the Prometheus latency histogram when I realised the orchestrator wasn’t even touching the database — a per-user TTL cache was serving stale scores while a separate global cache underneath it was silently holding the entire interaction matrix in memory.

What this system actually does

This is a hybrid recommendation engine served over FastAPI. You hit GET /recommend/{user_id}, and it returns scored content items with human-readable explanations like "Users with similar taste liked this" or "Popular among users." Under the hood, it blends three signals — global popularity, collaborative filtering via Jaccard similarity, and content-based similarity computed over shared genres — into a single weighted score per candidate item.
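The blend itself is a plain linear combination. As a minimal sketch (blend_score is my name for it; the 0.4/0.3/0.3 weights are the scorer weights described later in this post):

```python
def blend_score(collab: float, content: float, popularity: float) -> float:
    """Combine the three normalised signals into one ranking score.

    Weights mirror the scorer described in this post:
    0.4 collaborative, 0.3 content-based, 0.3 popularity.
    """
    return 0.4 * collab + 0.3 * content + 0.3 * popularity

blend_score(1.0, 0.5, 0.2)  # a strong collaborative match dominates the score
```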

The moving parts are straightforward. A data/ layer handles Tortoise ORM models and repository classes (UserRepository, ContentRepository, InteractionRepository). An engine/ layer owns candidate generation, scoring, and similarity math. And at the centre of it all sits engine/orchestrator.py — the RecommendationOrchestrator — which wires repositories to generators to scorers, owns both caches, and is the single object the API endpoint talks to.

The problem: every request was rebuilding the world

The first version of get_recommendations was conceptually clean but operationally brutal. Every call did the same sequence: fetch the user, pull every interaction from the database, build the user-item matrix in memory, compute item-to-item similarities by doing pairwise Jaccard over genre sets, generate candidates, score them, and return results.

The interaction query (Interaction.all()) was cheap. The item similarity computation was not. ContentRepository.get_item_similarities() fetches all content, then for each piece of content queries its genre associations, and runs pairwise Jaccard against every other item's genres:

# data/repositories.py — the expensive call
async def get_item_similarities(self):
    contents = await Content.all().prefetch_related("genres")
    content_genres = {}
    for content in contents:
        genre_objs = await ContentGenre.filter(
            content_id=content.id
        ).prefetch_related("genre")
        genres = set()
        for cg in genre_objs:
            genre = await Genre.get_or_none(id=cg.genre_id)
            if genre:
                genres.add(genre.name)
        content_genres[content.id] = genres

    similarities = {}
    for content_id, genres in content_genres.items():
        sim_scores = []
        for other_id, other_genres in content_genres.items():
            if content_id == other_id:
                continue
            score = SimilarityCalculator.jaccard_similarity(genres, other_genres)
            if score > 0:
                sim_scores.append((other_id, score))
        sim_scores.sort(key=lambda x: x[1], reverse=True)
        similarities[content_id] = [sid for sid, _ in sim_scores[:10]]
    return similarities

This is O(n²) over the content catalog, with each iteration doing an await into SQLite to resolve genre foreign keys. With 30 items in the seed data, it's tolerable. With 3,000 items, you'd be waiting seconds. With 30,000, you'd be paging your on-call.
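For reference, Jaccard similarity over genre sets is just intersection size over union size. A self-contained version of what SimilarityCalculator.jaccard_similarity presumably computes:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

jaccard_similarity({"sci-fi", "thriller"}, {"thriller", "drama"})  # 1/3
```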

The interaction matrix builder (_build_interaction_data) was lighter but still hit the database unnecessarily. It calls Interaction.all(), iterates every row, and groups the results into two dictionaries: which items each user interacted with, and how many interactions each item has.

Running these two operations on every single recommendation request was wasteful. The interaction matrix doesn’t change between requests unless someone submits feedback. The item similarity graph doesn’t change unless the content catalog itself changes. Yet every call was rebuilding both from scratch.
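The grouping itself is trivial once the rows are in memory. A sketch of what _build_interaction_data does, with plain (user_id, content_id) tuples standing in for ORM rows:

```python
from collections import defaultdict

def build_interaction_data(rows):
    """Group interaction rows into the two lookups the scorer needs:
    the set of items per user, and the interaction count per item."""
    user_items = defaultdict(set)
    popularity = defaultdict(int)
    for user_id, content_id in rows:
        user_items[user_id].add(content_id)
        popularity[content_id] += 1
    return dict(user_items), dict(popularity)

user_items, popularity = build_interaction_data([(1, 10), (1, 11), (2, 10)])
# user_items == {1: {10, 11}, 2: {10}}, popularity == {10: 2, 11: 1}
```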

Two layers, two different invalidation cadences

The fix wasn’t a single cache. It was two caches with different lifetimes and different scopes, stacked on top of each other.

Layer 1: The global cache. This holds the heavy, shared data — the user-item interaction map, the popularity counts, and the item similarity graph. All users share this data. It’s rebuilt at most once per GLOBAL_TTL (300 seconds):

# engine/orchestrator.py
CACHE_TTL = 300   # 5 mins
GLOBAL_TTL = 300  # 5 mins

class RecommendationOrchestrator:
    def __init__(self, user_repo, content_repo, interaction_repo, ...):
        # ...
        self.cache: Dict[int, Dict[str, Any]] = {}
        self._global_cache = {
            "user_items": None,
            "popularity": None,
            "item_similarities": None,
            "last_update": None,
        }

    async def _get_global_data(self):
        now = time.time()
        if (
            self._global_cache["last_update"] is not None
            and now - self._global_cache["last_update"] < GLOBAL_TTL
        ):
            return (
                self._global_cache["user_items"],
                self._global_cache["popularity"],
                self._global_cache["item_similarities"],
            )

        user_item, popularity = await self._build_interaction_data()
        item_similarities = await self._build_item_similarities()
        self._global_cache = {
            "user_items": user_item,
            "popularity": popularity,
            "item_similarities": item_similarities,
            "last_update": now,
        }
        return (user_item, popularity, item_similarities)

The first request after a cold start (or after TTL expiry) pays the full cost. Every subsequent request within the window reads from in-memory dicts. No database calls. No pairwise computation.

Layer 2: The per-user cache. This holds the final scored-and-ranked recommendation list for a specific user. It sits above the global cache in the call path. If user 7 hits /recommend/7 and their entry is warm, we return immediately without touching the global cache at all:

# engine/orchestrator.py — get_recommendations
async def get_recommendations(self, user_id: int, limit: int = 5):
    now = time.time()
    if user_id in self.cache:
        cached = self.cache[user_id]
        if now - cached["timestamp"] < CACHE_TTL:
            return cached["data"][:limit]

    user = await self.user_repo.get_user(user_id)
    if not user:
        return []

    user_items, popularity, item_similarities = await self._get_global_data()
    # ... candidate gen, scoring, ranking ...
    self.cache[user_id] = {"timestamp": now, "data": results}
    return results

This is the two-layer stack in action. The per-user cache is checked first. If it misses, we fall through to the global cache. If that misses, we finally hit the database. Most requests during steady-state traffic never get past layer 1.
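Stripped of the recommendation specifics, the layering is just two TTL checks in sequence. A minimal, framework-free sketch of the pattern (class and parameter names here are mine):

```python
import time

class TwoLayerCache:
    """A per-key result cache stacked on a shared global blob, each with its own TTL."""

    def __init__(self, per_key_ttl: float = 300, global_ttl: float = 300):
        self.per_key = {}        # key -> (timestamp, result)
        self.global_blob = None  # (timestamp, shared data) or None
        self.per_key_ttl = per_key_ttl
        self.global_ttl = global_ttl

    def get(self, key, build_global, build_result):
        now = time.time()
        hit = self.per_key.get(key)
        if hit and now - hit[0] < self.per_key_ttl:
            return hit[1]  # layer 1 hit: never touches the global data
        if self.global_blob is None or now - self.global_blob[0] >= self.global_ttl:
            self.global_blob = (now, build_global())  # layer 2 miss: full rebuild
        result = build_result(self.global_blob[1])
        self.per_key[key] = (now, result)
        return result
```

Only the first miss per TTL window pays for build_global; later misses for other keys reuse the shared blob and only pay for their own build_result.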

Cache invalidation: the intentional part

The trickiest aspect of any cache is knowing when to throw it away. Here, the per-user cache is invalidated explicitly on feedback:

# engine/orchestrator.py
async def record_feedback(self, user_id, content_id, ratings):
    await self.interaction_repo.record_interaction(
        user_id=user_id, content_id=content_id, rating=ratings
    )
    if user_id in self.cache:
        del self.cache[user_id]

When user 7 rates a piece of content, we delete their cached recommendations so the next request recomputes with the fresh interaction data. This is surgical — only that user’s entry is evicted. Everyone else keeps their cached results.

The global cache, on the other hand, is not invalidated on feedback. This is deliberate. The new interaction will remain invisible to the scoring engine until the global TTL expires. For a recommendation system, five minutes of staleness is a reasonable trade-off: new ratings don’t need to reshape the entire similarity graph within milliseconds.

But there’s a subtlety. Deleting user 7’s per-user cache forces a recomputation, but that recomputation uses the stale global data. The user’s new rating is recorded in the database, but doesn’t appear in user_items until _get_global_data() refreshes. In practice, the user gets freshly computed recommendations that are still scored against the old interaction matrix. It's not ideal, but it avoids the alternative — rebuilding the similarity graph on every feedback submission.

The cold-start fallback

There’s one more wrinkle. New users or users with no interaction history get a different candidate generation path entirely:

if user_id not in user_items or not user_items.get(user_id):
    candidates = generator.popularity_candidates(limit * 2)
else:
    candidates = generator.hybrid_candidates(user_id, limit * 3)

Cold users get a popularity-only slate. The popularity_candidates method sorts items by raw interaction count and returns the top N. No collaborative or content-based signals are possible because there's no history to compute them from. Once a user interacts with even a single item, subsequent requests switch to the hybrid path — collaborative + content-based + popularity — and the scorer's weights (0.4 collaborative, 0.3 content, 0.3 popularity) start producing differentiated results.
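A popularity slate is just a sort over the counts the global cache already holds. A sketch of what popularity_candidates plausibly does (the real method lives on the candidate generator and takes only a limit):

```python
def popularity_candidates(popularity: dict, limit: int) -> list:
    """Top-N item ids by raw interaction count, most popular first."""
    ranked = sorted(popularity.items(), key=lambda kv: kv[1], reverse=True)
    return [item_id for item_id, _ in ranked[:limit]]

popularity_candidates({10: 4, 11: 9, 12: 1}, 2)  # [11, 10]
```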

This transition happens organically. The user rates something, record_feedback evicts their per-user cache, and the next request sees them in user_items and routes them to hybrid_candidates. No flags, no explicit state machine. The data drives the branching.

What I’d do differently

The current per-user cache has no eviction beyond the TTL check. With enough users, self.cache grows without bound — one entry per user who ever requested recommendations during the process lifetime. In a real deployment, you'd want an LRU eviction policy or, at a minimum, a size cap.

The item similarity computation inside get_item_similarities() is genuinely painful. It does individual await calls inside nested loops to resolve genre foreign keys. A single bulk query to build the content-genre map upfront (like the seed script already does with build_content_genre_map()) would cut the database round-trips dramatically. The code for this pattern already exists in the codebase — it just wasn't reused where it matters most.
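Once the content-genre map is bulk-loaded, the whole similarity pass is pure set math. A sketch of the in-memory half, written as a standalone function over a prebuilt {content_id: genre_set} dict (names here are mine):

```python
def top_similar_items(content_genres: dict, k: int = 10) -> dict:
    """Pairwise Jaccard over a prebuilt genre map; top-k neighbours per item.
    Assumes the map was loaded in one bulk query, not one await per row."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 0.0

    similarities = {}
    for cid, genres in content_genres.items():
        scored = [
            (oid, jaccard(genres, other))
            for oid, other in content_genres.items()
            if oid != cid
        ]
        scored = [(oid, s) for oid, s in scored if s > 0]
        scored.sort(key=lambda x: x[1], reverse=True)
        similarities[cid] = [oid for oid, _ in scored[:k]]
    return similarities

top_similar_items({1: {"sci-fi"}, 2: {"sci-fi", "drama"}, 3: {"romance"}})
# items 1 and 2 point at each other; item 3 has no neighbours
```

It's still O(n²) in the catalog size, but every iteration is now a set operation instead of a SQLite round-trip.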

Both TTL values are hardcoded constants at the module level. Moving them to environment variables or a config object would let you tune cache behaviour per environment without code changes.
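For example, reading both TTLs from the environment with the current values as defaults (the variable names here are illustrative, not from the codebase):

```python
import os

# Hypothetical env-driven TTLs; names and defaults are illustrative.
CACHE_TTL = int(os.getenv("RECS_USER_CACHE_TTL", "300"))
GLOBAL_TTL = int(os.getenv("RECS_GLOBAL_CACHE_TTL", "300"))
```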

Lessons learned

  1. Two caches beat one. Separating per-user results from shared global data let me set different TTLs and invalidation strategies for each. The per-user cache handles the common case (same user, same recommendations). The global cache absorbs the expensive computation (interaction matrix, similarity graph).
  2. Stale data is usually fine in recommendations. Users won’t notice that their similarity scores are five minutes old. What they will notice is a 2-second latency spike on every request. The TTL trade-off is almost always worth it.
  3. Invalidate surgically. Deleting only the affected user’s cache on feedback keeps the invalidation blast radius small. A global cache flush on every write would defeat the purpose entirely.
  4. Cold start is a routing problem, not a scoring problem. When there’s no history, trying to run collaborative filtering is pointless. Fall back to popularity, and let the data naturally transition the user into the hybrid path once interactions exist.
  5. Watch for O(n²) hiding behind await. The pairwise similarity computation looks innocent in Python, but each iteration hides a database round-trip. Bulk-load first, then compute in memory. The seed script already knows this — the recommendation engine should too.


