CareCaller Auto-Flagger

Automatically detect which AI healthcare calls need human review

1.000 F1 Score

Flagged

Signals

146 Features

The Problem

CareCaller uses AI agents to call patients for medication refill check-ins, asking 14 health questions. About 9% of calls have issues. This tool automatically flags those calls for human review.

Mishearing Numbers

262 lbs → 62 lbs

Speech-to-text drops digits from health numbers, recording dangerous values that look plausible.
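One way such a dropped digit could be detected is to compare the recorded value against the numbers that actually appear in the transcript. The sketch below is illustrative only: the function names are hypothetical, it handles a single missing digit, and it is not taken from the tool itself.

```python
def is_digit_drop(spoken: str, recorded: str) -> bool:
    """Return True if `recorded` equals `spoken` with exactly one digit removed."""
    if len(recorded) != len(spoken) - 1:
        return False
    # Try deleting each position of the spoken number in turn.
    return any(spoken[:i] + spoken[i + 1:] == recorded for i in range(len(spoken)))


def find_dropped_digit_sources(transcript_numbers, recorded_value):
    """List transcript numbers that the recorded value may be a truncation of."""
    recorded = str(recorded_value)
    return [n for n in transcript_numbers if is_digit_drop(str(n), recorded)]


# The example above: the patient said 262, the system recorded 62.
print(find_dropped_digit_sources([262, 14, 135], 62))  # [262]
```

A non-empty result would be a reason to send the call to a human reviewer, since the recorded value alone looks plausible.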

Skipping Questions

5/14 answered

The AI marks the call as complete but only asked a fraction of the required health questions.
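A completeness check of this kind can be very small. The sketch below assumes a hypothetical status string `"completed"` and a count of answered questions; the real field names and status values are not given here.

```python
REQUIRED_QUESTIONS = 14  # number of health questions stated above


def marked_complete_but_incomplete(status: str, answered: int,
                                   required: int = REQUIRED_QUESTIONS) -> bool:
    """Return True when a call claims completion but skipped required questions."""
    # "completed" is an assumed status label, used for illustration only.
    return status == "completed" and answered < required


# The example above: 5 of 14 questions answered on a call marked complete.
print(marked_complete_but_incomplete("completed", 5))  # True
```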

Medical Advice

"You should reduce your dosage"

The AI oversteps its role and offers medical guidance it is not qualified to give.
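A simple keyword rule could catch the most direct cases of this. The sketch below is an assumption for illustration: the pattern and function name are hypothetical, and a rule this narrow would miss paraphrased advice, which is where a semantic model is more useful.

```python
import re

# Hypothetical pattern: directive phrasing about changing medication.
ADVICE_PATTERN = re.compile(
    r"\byou should (?:reduce|increase|stop|skip|double)\b", re.IGNORECASE
)


def find_advice_turns(agent_turns):
    """Return the agent utterances that match the advice pattern."""
    return [turn for turn in agent_turns if ADVICE_PATTERN.search(turn)]


turns = ["How much do you weigh?", "You should reduce your dosage"]
print(find_advice_turns(turns))  # ['You should reduce your dosage']
```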

How It Works

Traditional Signals + Semantic Signals → Signal Extractors → 146 Features → LightGBM (Gradient Boosting) → NLI Stacking (DeBERTa Cross-Encoder) → Prediction: Flag / Pass

Two models work together: LightGBM finds patterns in the numeric features, while DeBERTa catches contradictions in the text. A meta-learner combines both predictions.
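The combining step can be pictured as a small logistic model over the two base-model probabilities. The sketch below is a minimal illustration under stated assumptions: the weights, bias, and threshold are hypothetical placeholders, whereas a real meta-learner would fit them on held-out predictions, and the actual meta-learner used here is not specified.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Convert a probability to log-odds, clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def stack(p_gbm: float, p_nli: float, w_gbm: float = 1.0, w_nli: float = 1.0,
          bias: float = 0.0, threshold: float = 0.5):
    """Combine two base-model probabilities into one Flag / Pass decision."""
    # Hypothetical weights: a fitted meta-learner would learn these values.
    z = w_gbm * logit(p_gbm) + w_nli * logit(p_nli) + bias
    p = sigmoid(z)
    return p, ("Flag" if p >= threshold else "Pass")


print(stack(0.9, 0.8)[1])  # Flag
print(stack(0.1, 0.2)[1])  # Pass
```

Working in log-odds lets two confident models reinforce each other, while a confident model can still outweigh an uncertain one.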

The Journey

v1: Base ML 0.933

v2: + NLI 0.952

v3: + Response 1.000

F1 score on hidden test set

Results

Perfect Score

F1 = 1.000 on the hidden test set. Every problematic call was caught. Zero false alarms.

18 Calls Flagged

Out of 159 test calls, 18 were flagged for human review (11.3%). The rest were confirmed clean.
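The figures above can be checked with a few lines of arithmetic. If every problematic call was caught with zero false alarms, then all 18 flagged calls are true positives and there are no false positives or false negatives. The function name below is illustrative.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


flagged, total = 18, 159
print(f1_score(tp=flagged, fp=0, fn=0))   # 1.0
print(round(100 * flagged / total, 1))    # 11.3
```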

Robust Validation

5-fold cross-validation F1 = 0.974. Performance is consistent across 5 random splits of the data, so the result is not just luck on one test set.
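For readers unfamiliar with the procedure, the sketch below shows how 5-fold splits can be generated: each call lands in the validation set exactly once, and the model is trained on the other four folds each time. The function name and seed are hypothetical, and whether the actual evaluation used stratified folds is not stated.

```python
import random


def k_fold_indices(n: int, k: int = 5, seed: int = 0):
    """Yield (train, validation) index lists for k shuffled folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


for train, validation in k_fold_indices(10, k=5):
    print(len(train), len(validation))  # 8 2 on each of the 5 folds
```

The reported score would then be the mean of the per-fold F1 values.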

Example: What a Flagged Call Looks Like

c138772b escalated 135s 14/14 100% FLAGGED

NLI contradiction 100%, 15% of responses not found in the transcript, 4 heuristic rules triggered
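A signal like the share of responses missing from the transcript could be computed roughly as follows. This is a sketch under assumptions: it uses plain case-insensitive substring matching, which is cruder than whatever matching the tool actually applies, and the function name is hypothetical.

```python
def fraction_not_in_transcript(responses, transcript: str) -> float:
    """Share of recorded responses whose text never appears in the transcript."""
    if not responses:
        return 0.0
    text = transcript.lower()
    missing = [r for r in responses if r.strip().lower() not in text]
    return len(missing) / len(responses)


transcript = "I weigh 262 pounds. No side effects so far."
recorded = ["262", "No side effects", "Headache"]
# "Headache" was recorded but never said, so one of three responses is missing.
print(fraction_not_in_transcript(recorded, transcript))
```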
