CareCaller Auto-Flagger

Automatically detect which AI healthcare calls need human review

F1 Score: 1.000
Flagged: 18
Signals: 8
Features: 146

The Problem

CareCaller uses AI agents to call patients for medication refill check-ins, asking 14 health questions. About 9% of calls have issues. This tool automatically flags those calls for human review.

Mishearing Numbers
262 lbs → 62 lbs

Speech-to-text drops digits from health numbers, recording dangerous values that look plausible.
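
A range check alone would miss a value like 62 lbs, since it can look plausible on its own. One possible way to catch it is to compare against an earlier reading. The sketch below assumes a previous weight is available for the patient, which the dashboard does not state; the function names and the 25% change threshold are hypothetical.

```python
def is_digit_drop(previous: int, recorded: int) -> bool:
    """True if `recorded` equals `previous` with exactly one digit removed."""
    prev, rec = str(previous), str(recorded)
    if len(prev) != len(rec) + 1:
        return False
    # Try deleting each digit of the previous value in turn.
    return any(prev[:i] + prev[i + 1:] == rec for i in range(len(prev)))


def weight_needs_review(previous_lbs: int, recorded_lbs: int,
                        max_change: float = 0.25) -> bool:
    """Flag a weight that looks like a dropped digit or an implausible jump.

    `max_change` is an assumed relative-change threshold, not a value
    taken from the CareCaller system.
    """
    if is_digit_drop(previous_lbs, recorded_lbs):
        return True
    return abs(recorded_lbs - previous_lbs) / previous_lbs > max_change


print(weight_needs_review(262, 62))   # the 262 lbs -> 62 lbs case above
print(weight_needs_review(262, 258))  # an ordinary small change
```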

Skipping Questions
5/14 answered

The AI marks the call as complete but only asked a fraction of the required health questions.
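
A completeness check of this kind can be sketched in a few lines. The record fields and the "completed" status label below are assumptions for illustration; only the count of 14 required questions comes from the text.

```python
from dataclasses import dataclass

REQUIRED_QUESTIONS = 14  # from the text: each check-in asks 14 health questions


@dataclass
class CallRecord:
    call_id: str
    outcome: str   # hypothetical status label, e.g. "completed"
    answered: int  # number of health questions with a recorded answer


def skipped_questions(call: CallRecord) -> bool:
    """Flag calls marked complete that did not cover every required question."""
    return call.outcome == "completed" and call.answered < REQUIRED_QUESTIONS


print(skipped_questions(CallRecord("example", "completed", 5)))  # the 5/14 case
```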

Medical Advice
"You should reduce your dosage"

The AI oversteps its role and offers medical guidance it is not qualified to give.
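
One simple heuristic for this case is a pattern match on the agent's side of the transcript. The pattern below is a hypothetical illustration, not the rule set the tool actually uses, and a real system would likely need broader coverage than a single regular expression.

```python
import re

# Hypothetical pattern: a directive verb followed later by a medication term.
ADVICE_PATTERN = re.compile(
    r"\byou should (reduce|increase|stop|skip|double)\b.*\b(dose|dosage|medication)\b",
    re.IGNORECASE,
)


def gives_medical_advice(agent_utterance: str) -> bool:
    """True if the utterance matches the assumed advice pattern."""
    return ADVICE_PATTERN.search(agent_utterance) is not None


print(gives_medical_advice("You should reduce your dosage"))
print(gives_medical_advice("Have you taken your medication this week?"))
```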

How It Works

Traditional Signals, Semantic Signals
→ 8 Signal Extractors
→ 146 Features
→ LightGBM (Gradient Boosting)
→ NLI Stacking (DeBERTa Cross-Encoder)
→ Prediction: Flag / Pass

Two models work together: LightGBM finds patterns in numbers, DeBERTa catches contradictions in text. A meta-learner combines both predictions.
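
To make the stacking idea concrete, here is a minimal sketch in which a small logistic regression learns how to weight the two base-model probabilities. The dashboard does not say which meta-learner is used, so the logistic form, the toy probabilities, and all names below are assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(base_probs, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights over (p_gbm, p_nli) pairs by gradient descent."""
    w_gbm, w_nli, bias = 0.0, 0.0, 0.0
    n = len(labels)
    for _ in range(epochs):
        g_gbm = g_nli = g_bias = 0.0
        for (p_gbm, p_nli), y in zip(base_probs, labels):
            err = sigmoid(w_gbm * p_gbm + w_nli * p_nli + bias) - y
            g_gbm += err * p_gbm
            g_nli += err * p_nli
            g_bias += err
        w_gbm -= lr * g_gbm / n
        w_nli -= lr * g_nli / n
        bias -= lr * g_bias / n
    return w_gbm, w_nli, bias


def predict_flag(weights, p_gbm: float, p_nli: float, threshold: float = 0.5) -> bool:
    """Combine both base predictions into a single Flag (True) / Pass (False)."""
    w_gbm, w_nli, bias = weights
    return sigmoid(w_gbm * p_gbm + w_nli * p_nli + bias) >= threshold


# Toy out-of-fold probabilities (invented for illustration): (LightGBM, NLI).
train_probs = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.95), (0.85, 0.7),
               (0.1, 0.2), (0.2, 0.1), (0.05, 0.15), (0.3, 0.2)]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_meta_learner(train_probs, train_labels)
print(predict_flag(weights, 0.9, 0.9))
print(predict_flag(weights, 0.1, 0.1))
```

In a stacking setup the meta-learner is normally trained on out-of-fold predictions from the base models, so that it does not learn from probabilities the base models produced on their own training data.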

The Journey

F1 score on the hidden test set, by version:

v1: Base ML, 0.933
v2: + NLI, 0.952
v3: + Response, 1.000

Results

Perfect Score

F1 = 1.000 on the hidden test set. Every problematic call was caught. Zero false alarms.

18 Calls Flagged

Out of 159 test calls, 18 were flagged for human review (11.3%). The rest were confirmed clean.
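
The arithmetic behind these figures can be checked directly. The sketch below reads the results as 18 true positives with no false positives and no false negatives, which is what "every problematic call was caught" and "zero false alarms" imply; the function name is hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


flag_rate = 18 / 159                   # share of test calls flagged
print(round(100 * flag_rate, 1))       # 11.3
print(f1_from_counts(tp=18, fp=0, fn=0))  # 1.0
```

For comparison, a single missed call (17 caught, 1 missed, no false alarms) would already lower F1 to 34/35, about 0.971.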

Robust Validation

5-fold cross-validation F1 = 0.974. Performance stays consistent across all 5 folds of the data, so the result is not a lucky outcome on a single test set.
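
As a rough illustration of how 5-fold splits are formed, the sketch below shuffles sample indices and holds out each fold once. Whether the actual validation was stratified by label is not stated, so this plain shuffled split, the seed, and the sample count are assumptions.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(103))  # 103 is an arbitrary example size
for train, validation in splits:
    print(len(train), len(validation))
```

Each sample lands in exactly one validation fold, so the per-fold F1 scores together cover the whole dataset.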

Example: What a Flagged Call Looks Like

c138772b | escalated | 135s | 14/14 | 100% | FLAGGED

NLI contradiction 100%, 15% of responses not in transcript, 4 heuristic rules triggered
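
A "responses not in transcript" signal could be computed roughly as follows: normalize both texts, then count the recorded responses whose words never appear as a contiguous phrase in the transcript. This is a sketch of one plausible approach, not the tool's actual extractor; the function names and sample data are invented.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_fraction(responses: list[str], transcript: str) -> float:
    """Fraction of recorded responses that do not appear in the transcript."""
    if not responses:
        return 0.0
    haystack = f" {_normalize(transcript)} "
    # Pad with spaces so "62 pounds" does not match inside "262 pounds".
    missing = [r for r in responses if f" {_normalize(r)} " not in haystack]
    return len(missing) / len(responses)


transcript = ("Agent: What is your current weight? Patient: About 262 pounds. "
              "Agent: Any side effects? Patient: No, none.")
recorded = ["262 pounds", "No, none", "62 pounds", "Yes"]
print(unsupported_fraction(recorded, transcript))  # 0.5
```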
