How We Built Offline Evaluation for Session Recommendations
At Kiddocare, Kiddocarers open the app and see available babysitting sessions they can apply for. We rank those sessions with a scoring formula — weighing multiple signals to decide what shows up first. The goal is to surface the right session for the right Kiddocarer, so families get someone well-matched and Kiddocarers get work they're good at.
But how do we know the weights are right? Which signals matter more? How much should each one influence the ranking?
Our only way to find out was to change weights, deploy, wait weeks, and check if application numbers moved. One experiment per month. And even then — did the numbers change because of our tweak, or because of school holidays?
We needed a way to test changes against historical data. Instantly.
Before:
Tweak weights ──→ Deploy ──→ Wait 2-4 weeks ──→ Check applications
      ↑                                                  │
      └──────────── "hmm, didn't improve" ───────────────┘
After:
Tweak weights ──→ Run against history ──→ Compare metrics
      ↑                                        │
      └─────────── Re-run in seconds ──────────┘
Let me walk through how this works with a real scenario.
Meet Siti
Siti is a Kiddocarer based in Petaling Jaya (47400). On April 1st, she opens the app. There are 20 sessions available in her area.
She scrolls through, looks at a few, and applies for 3 sessions:
April 1st — what actually happened
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
20 sessions available in Siti's area
Siti applied for 3:
• Session #7 47500 2 kids aged 3 & 5 (morning)
• Session #12 47400 1 kid aged 7 (afternoon)
• Session #3 46000 3 kids aged 2-6 (evening)
This is historical fact. She saw whatever order the app showed her at the time and picked these three.
Now here's the question: if we had ranked those 20 sessions differently, would her 3 picks have appeared near the top?
Testing a Scoring Formula
We take a proposed formula, score all 20 sessions for Siti, rank them, and check where her 3 actual applications land.
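The score, rank, check loop can be sketched in a few lines of Python. This is a minimal sketch, not our production code: the signal names (`distance`, `age_fit`) and the weights are hypothetical placeholders, and the only thing taken from the description above is the shape, a weighted sum per session sorted from highest to lowest.

```python
from typing import Dict, List

# Hypothetical types: signal names and weights are placeholders for illustration.
Weights = Dict[str, float]
Signals = Dict[str, float]  # assumed: each signal already normalised to [0, 1]


def score(signals: Signals, weights: Weights) -> float:
    """Weighted sum of the signals; a missing signal counts as 0."""
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())


def rank_sessions(sessions: Dict[str, Signals], weights: Weights) -> List[str]:
    """Session IDs ordered from highest to lowest score."""
    return sorted(sessions, key=lambda sid: score(sessions[sid], weights), reverse=True)


def applied_ranks(ranking: List[str], applied: List[str]) -> Dict[str, int]:
    """1-based rank of each session the Kiddocarer actually applied for."""
    position = {sid: i + 1 for i, sid in enumerate(ranking)}
    return {sid: position[sid] for sid in applied if sid in position}
```

Changing a config then means passing a different `weights` dict and re-reading the applied ranks. Nothing else in the loop moves.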
Config A (one set of weights)
Rank all 20 sessions for Siti:
┌──────┬──────────────┬───────┐
│ Rank │ Session │ Score │
├──────┼──────────────┼───────┤
│ 1 │ Session #15 │ 0.91 │
│ 2 │ Session #7 │ 0.85 │ ◄── ✅ she applied for this
│ 3 │ Session #2 │ 0.79 │
│ 4 │ Session #12 │ 0.74 │ ◄── ✅ she applied for this
│ 5 │ Session #9 │ 0.68 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ ← top 5 cutoff
│ 6 │ Session #18 │ 0.65 │
│ 7 │ Session #1 │ 0.61 │
│ ... │
│ 11 │ Session #3 │ 0.42 │ ◄── ✅ she applied for this (buried!)
│ ... │
│ 20 │ Session #14 │ 0.08 │
└──────┴──────────────┴───────┘
Results for Siti:
2 of her 3 picks in the top 5 → Recall@5 = 67%
First hit at rank 2 → Reciprocal rank = 1/2 = 0.50
At least 1 hit in top 5 → Hit Rate@5 = ✅
Not bad. But Session #3, one of the three she actually applied for, is buried at rank 11. On an iPhone 17 Pro, only about 1.5 sessions are visible at a time in the job listing, so rank 11 sits roughly 7-8 screens down. That's a session where she could've been a great fit for the family, and with this ranking the two sides might never have connected.
Now Tweak and Re-test
Same 20 sessions. Same 3 applications. Different weights.
Config B (different weights)
Rank all 20 sessions for Siti:
┌──────┬──────────────┬───────┐
│ Rank │ Session │ Score │
├──────┼──────────────┼───────┤
│ 1 │ Session #12 │ 0.88 │ ◄── ✅ she applied for this
│ 2 │ Session #7 │ 0.83 │ ◄── ✅ she applied for this
│ 3 │ Session #15 │ 0.77 │
│ 4 │ Session #3 │ 0.74 │ ◄── ✅ she applied for this
│ 5 │ Session #2 │ 0.66 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│
│ ... │
└──────┴──────────────┴───────┘
Results for Siti:
3 of her 3 picks in the top 5 → Recall@5 = 100%
First hit at rank 1 → Reciprocal rank = 1/1 = 1.00
At least 1 hit in top 5 → Hit Rate@5 = ✅
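All three land in the top 5, and the first hit moves up to rank 1. Config B is better for Siti.
Recall@5 and reciprocal rank both improve, while Hit Rate@5 was already a pass under Config A. And it took seconds to test, not weeks.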
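The numbers for Siti can be reproduced from nothing more than the ranks of her three applications in each table. The function names below are ours for illustration; the rank lists come straight from the Config A and Config B tables.

```python
from typing import List


def recall_at_k(ranks: List[int], k: int) -> float:
    """Share of applied sessions that landed in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def reciprocal_rank(ranks: List[int]) -> float:
    """1 / rank of the first (best-ranked) applied session."""
    return 1.0 / min(ranks)


def hit_at_k(ranks: List[int], k: int) -> bool:
    """True if at least one applied session is in the top k."""
    return min(ranks) <= k


# Ranks of Siti's three applications (#7, #12, #3) under each config.
config_a_ranks = [2, 4, 11]
config_b_ranks = [4, 1, 2]

for name, ranks in [("A", config_a_ranks), ("B", config_b_ranks)]:
    print(
        f"Config {name}: Recall@5={recall_at_k(ranks, 5):.2f} "
        f"RR={reciprocal_rank(ranks):.2f} Hit@5={hit_at_k(ranks, 5)}"
    )
```

These helpers assume a non-empty list of ranks, which holds here because we only evaluate Kiddocarers who applied for something.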
But Siti is one Kiddocarer on one day. We need to do this across hundreds of Kiddocarers over weeks to trust the result.
Scaling Up: From One Kiddocarer to the Whole Dataset
We take a lookback window — say, the last 30 days. Within that window, we find every Kiddocarer who had enough applications, and for each one, we run the same exercise: score all the sessions in their area, rank them, and check where their actual applications land.
Lookback window (last 30 days)
◄──────────────────────────────────────►
Mar 1 Apr 1 (today)
────────────────────────────────────────────────────────
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
All sessions and applications within this window.
For each Kiddocarer: rank their available sessions,
check where their actual applications land.
No train/test split. We're asking a simpler question: within this period of real activity, does the scoring formula rank the sessions Kiddocarers chose higher than the ones they didn't?
It's a retrospective check. If the formula is good, applied sessions should consistently appear near the top of the ranked list. If they don't, the weights need work.
Average the metrics across all Kiddocarers, and you've got a number to compare configs against.
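A sketch of that averaging step, assuming the applied ranks for each Kiddocarer have already been computed for one config. The input shape (a dict from Kiddocarer ID to a list of ranks) is an assumption for illustration, not a description of our actual pipeline.

```python
from statistics import mean
from typing import Dict, List


def evaluate(applied_ranks_by_carer: Dict[str, List[int]], k: int = 5) -> Dict[str, float]:
    """Average per-Kiddocarer metrics into one comparable row for a config."""
    recalls, reciprocal_ranks, hits = [], [], []
    for ranks in applied_ranks_by_carer.values():
        if not ranks:  # no applications: nothing to check against
            continue
        best = min(ranks)
        recalls.append(sum(1 for r in ranks if r <= k) / len(ranks))
        reciprocal_ranks.append(1.0 / best)
        hits.append(1.0 if best <= k else 0.0)
    if not recalls:
        raise ValueError("no Kiddocarers with applications to evaluate")
    return {
        f"recall@{k}": mean(recalls),
        "mrr": mean(reciprocal_ranks),
        f"hit_rate@{k}": mean(hits),
    }
```

Running `evaluate` once per config on the same window gives rows that can be compared directly, because every config is judged on the same Kiddocarers and the same applications.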
But Wait — Not All Sessions Are Fair Game
Back to Siti. Say there are 50 sessions in the lookback window across all of Malaysia. We don't score all 50. We filter to what she could have realistically seen and applied for — sessions in her area.
50 sessions exist in the lookback window
Kiddocarer: Siti (based in 47400 Petaling Jaya)
✅ Include — sessions in her area (20 sessions):
Session #7 47500 ← applied
Session #12 47400 ← applied
Session #3 46000 ← applied
Session #15 47400
Session #2 47300
...
❌ Exclude (30 sessions):
Session #31 80000 Johor Bahru → too far
Session #22 10150 Penang → too far
...
Why bother filtering? If we include sessions in Johor for a PJ Kiddocarer, the formula trivially rejects them and looks smarter than it is. Garbage in, garbage out.
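As a rough sketch, the candidate filter is a membership test against the set of postcodes a Kiddocarer can realistically reach. How that set is built (radius, district list, travel time) is not covered here, so `area_postcodes` is a hypothetical input, and the `Session` class is a stand-in for whatever record the real data uses.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Session:
    session_id: int
    postcode: str


def candidate_sessions(sessions: List[Session], area_postcodes: Set[str]) -> List[Session]:
    """Keep only sessions the Kiddocarer could realistically have seen."""
    return [s for s in sessions if s.postcode in area_postcodes]


# Siti's example: nearby postcodes stay, Johor Bahru and Penang drop out.
siti_area = {"47400", "47500", "46000", "47300"}  # assumed reachable set
window = [
    Session(7, "47500"), Session(12, "47400"), Session(3, "46000"),
    Session(15, "47400"), Session(2, "47300"),
    Session(31, "80000"), Session(22, "10150"),
]
kept = candidate_sessions(window, siti_area)
print([s.session_id for s in kept])
```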
We also filter which Kiddocarers to evaluate:
- At least a few applications in the window (so there's enough to check against)
- Approved Kiddocarer status (verified and active on the platform)
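The two Kiddocarer filters above could look like the sketch below. The threshold of 3 and the status string `"approved"` are assumptions for illustration; the real cutoff for "a few" is a judgment call.

```python
from dataclasses import dataclass
from typing import List

MIN_APPLICATIONS = 3  # hypothetical threshold standing in for "a few"


@dataclass(frozen=True)
class Kiddocarer:
    name: str
    status: str  # assumed values such as "approved" or "pending"
    applications_in_window: int


def eligible(carers: List[Kiddocarer], min_applications: int = MIN_APPLICATIONS) -> List[Kiddocarer]:
    """Approved Kiddocarers with enough applications to evaluate against."""
    return [
        c for c in carers
        if c.status == "approved" and c.applications_in_window >= min_applications
    ]
```

Raising the threshold makes each Kiddocarer's metrics less noisy but shrinks the evaluation set, so it is worth checking how many Kiddocarers survive the filter before trusting the averages.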
The Metrics
Three numbers. That's all we need.
Recall@K — of the sessions she applied to, how many were in our top K?
Siti applied to 3 sessions. If 2 appeared in the top 5, Recall@5 = 67%. We check K=5 and K=10. This is the main metric.
Why 5? We load 20 sessions per page, but on mobile only about 1-2 sessions are visible at a time — K=5 is a few screens of scrolling, roughly what most Kiddocarers will browse through. K=10 catches the more patient scrollers.
MRR (Mean Reciprocal Rank) — how far did she scroll to find the first good one?
For a single Kiddocarer: first hit at position 1? Reciprocal rank = 1.0. Position 2? 0.5. Position 5? 0.2. Average this across all Kiddocarers — that's MRR. Directly maps to how the app feels.
Hit Rate@K — did we show her at least one good option?
Sessions are temporary. Some days there's nothing great for a Kiddocarer. Hit Rate tells us how often we deliver at least one viable option. It doesn't differentiate well between good configs (both A and B score ✅ for Siti), but it shines when segmenting: if Hit Rate drops to 30% in Johor, that's a problem Recall alone won't surface.
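For reference, the three metrics can be written out as follows. The notation is ours: U is the set of evaluated Kiddocarers, A_u is the set of sessions Kiddocarer u applied for, and rank_u(s) is the 1-based position of session s in the ranked list for u. The Recall@K we report is the per-Kiddocarer value averaged over U.

```latex
\[
\mathrm{Recall@}K(u) =
  \frac{\lvert \{\, s \in A_u : \operatorname{rank}_u(s) \le K \,\} \rvert}{\lvert A_u \rvert}
\]
\[
\mathrm{RR}(u) = \frac{1}{\min_{s \in A_u} \operatorname{rank}_u(s)},
\qquad
\mathrm{MRR} = \frac{1}{\lvert U \rvert} \sum_{u \in U} \mathrm{RR}(u)
\]
\[
\mathrm{HitRate@}K =
  \frac{1}{\lvert U \rvert} \sum_{u \in U}
  \mathbf{1}\!\left[ \min_{s \in A_u} \operatorname{rank}_u(s) \le K \right]
\]
```

Plugging in Siti under Config A (ranks 2, 4 and 11) gives Recall@5 = 2/3 and RR = 1/2, matching the worked example.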
We also track AUC as a sanity check — it tells us whether the formula scores applied sessions higher than non-applied ones overall. It won't tell you if the top 5 are good, but it catches if something is fundamentally broken.
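One way to read AUC is as the probability that a randomly chosen applied session outscores a randomly chosen non-applied one. The sketch below computes it by brute-force pair counting, which is fine for small candidate lists. A library routine such as scikit-learn's `roc_auc_score` would be the usual choice at scale. This is an illustration of the definition, not necessarily how our pipeline computes it.

```python
from typing import List


def pairwise_auc(applied_scores: List[float], other_scores: List[float]) -> float:
    """Fraction of (applied, non-applied) pairs where the applied session
    scores higher; ties count as half a win."""
    if not applied_scores or not other_scores:
        raise ValueError("need at least one applied and one non-applied score")
    wins = 0.0
    for a in applied_scores:
        for o in other_scores:
            if a > o:
                wins += 1.0
            elif a == o:
                wins += 0.5
    return wins / (len(applied_scores) * len(other_scores))
```

A value near 0.5 means the formula is no better than a coin flip at separating applied from non-applied sessions, which is the "fundamentally broken" case this check is meant to catch.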
What we skip: Precision@K (Kiddocarers want "find me something good," not "every result must be perfect") and RMSE (we don't have ratings — only binary applied/didn't-apply).
Baselines: Is Our Formula Actually Good?
Your formula's Recall@5 — good or bad? No idea. Not without comparing it to something dumb.
We test every config against three baselines:
- Random — shuffle. This is the floor.
- Distance-only — nearest first. The naive default.
- Popularity — most-applied-to sessions first. Tests if your formula is actually personalizing or just rediscovering what's popular.
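Each baseline is just another way of ordering the same candidate list, so it can be run through the same metrics as a real config. A minimal sketch, assuming distances in kilometres and application counts are available per session (both input shapes are hypothetical):

```python
import random
from typing import Dict, List


def random_baseline(session_ids: List[str], seed: int = 0) -> List[str]:
    """Shuffle with a fixed seed so runs are repeatable."""
    shuffled = list(session_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def distance_baseline(distance_km: Dict[str, float]) -> List[str]:
    """Nearest session first."""
    return sorted(distance_km, key=lambda sid: distance_km[sid])


def popularity_baseline(application_counts: Dict[str, int]) -> List[str]:
    """Most-applied-to session first."""
    return sorted(application_counts, key=lambda sid: application_counts[sid], reverse=True)
```

One thing to watch with the popularity baseline: if the counts include the evaluated Kiddocarer's own applications, the baseline gets a head start it would not have in production.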
                     Recall@5    MRR       Hit Rate@5
                     ─────────   ──────    ──────────
🔴 Random            low         low       low
🟡 Distance-only     ──────── baseline ────────
🟡 Popularity-only   ──────── baseline ────────
🔵 Config A          better      better    better
🟢 Config B          best        best      best     ◄── winner
                     ──────────────────────────────────────────►
                                "higher is better"
If your formula only matches the popularity baseline, it's not personalizing. It's just learning that popular sessions are popular.
The real signal is the gap between your formula and the best dumb baseline. That gap is what your scoring logic is worth.
Don't Trust the Average
Config B's overall numbers look great. But break it down by region:
              Overall    KL/Selangor    Penang    Johor
              ────────   ────────────   ───────   ──────
Recall@5      good       strong         okay      weak
Hit Rate@5    good       strong         okay      weak
⚠️ Looks good overall. Johor Kiddocarers and families barely benefit.
"Improve Johor recommendations" is actionable — it means families there aren't getting matched with the right Kiddocarers. "Overall Recall@5 improved slightly" tells you nothing about who's being underserved.
Break down metrics by segments that matter to your business. The segments are where the real insights live.
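Segmenting is a group-by over the per-Kiddocarer metric values. A small sketch, assuming each Kiddocarer has already been tagged with a segment label such as a region (the row shape is hypothetical):

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def metric_by_segment(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average a per-Kiddocarer metric within each segment.

    rows: (segment label, metric value for one Kiddocarer).
    """
    buckets: Dict[str, List[float]] = defaultdict(list)
    for segment, value in rows:
        buckets[segment].append(value)
    return {segment: mean(values) for segment, values in buckets.items()}
```

It helps to report the number of Kiddocarers per segment alongside each average, since a small segment can swing wildly on a handful of applications.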
Where This Breaks Down
Offline eval has blind spots. Worth being honest about them.
Position bias is the big one. Historical applications are shaped by whatever sorting was in place before. Top-of-list sessions got more eyeballs, more applications — regardless of actual fit. Our ground truth is contaminated by the old system. A formula that mimics the old ranking will score well even if the old ranking was bad. This is the hardest problem to solve offline.
Real-time availability. Offline, we test against all sessions. In production, some are already filled by the time a Kiddocarer opens the app.
Bottom line: offline eval is a filter, not a verdict. Use it to narrow down to 2-3 good configs. Then run a small live A/B test to confirm.
What We Learned
Start with baselines. Distance-only was surprisingly competitive. Geography does most of the heavy lifting — other signals need to clearly earn their place.
Look at who it fails for, not just the average. Our overall AUC looked healthy. But when we checked the bottom 10 Kiddocarers, 9 of them shared the same failure mode — they apply across many areas, diluting the area signal. The formula can't distinguish for them. That's an insight you'd never get from the overall number.
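Pulling out the worst-served Kiddocarers is a sort and a slice over per-Kiddocarer metrics. The sketch below also counts distinct postcodes applied to, as one rough way to spot the "applies across many areas" pattern. Both helper names and input shapes are ours for illustration.

```python
from typing import Dict, List, Tuple


def worst_carers(metric_by_carer: Dict[str, float], n: int = 10) -> List[Tuple[str, float]]:
    """The n Kiddocarers with the lowest metric value, worst first."""
    return sorted(metric_by_carer.items(), key=lambda item: item[1])[:n]


def distinct_areas(applied_postcodes: List[str]) -> int:
    """Number of different postcodes a Kiddocarer applied in."""
    return len(set(applied_postcodes))
```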
Be careful with negatives. Treating all non-applications as "not interested" produced garbage metrics. Filtering to plausible candidates — right geography, right time, session actually available — made the results trustworthy.
This isn't the finish line. We still need live tests. But instead of testing every idea in production, we only ship what's proven against history. One idea per month became ten ideas per afternoon. Every improvement means a Kiddocarer finds the right session faster, and a family gets someone better suited to their kids.