A voice agent design pattern for high-stakes service calls — missed-call recovery and front-desk deflection for mid-sized clinic chains in Southeast Asia.
Athena answers the phone for a clinic. When a call comes in during peak hours, it routes the caller to the right answer, books or changes appointments on the spot, refuses to act like a doctor, and recovers the calls that would otherwise have gone to voicemail and never been returned.
Three product positions hold this whole document together:
Not every capability is an Agent. FAQ runs as a Workflow. Booking runs as a tool-calling Agent. Safety runs as a Guardrail. Closure runs as Structured Summarization. Latency, cost, and reliability are balanced explicitly per module — not assigned one architecture and forced to fit. A router that only needs to classify intent has no business carrying the cost and latency profile of a booking agent, and a booking agent that writes to a medical schedule has no business being optimized for the router's speed budget.
A healthcare voice assistant should not chat like a doctor. Athena pushes every call toward a clear next step — book it, change it, hand it off — without ever crossing the line into medical responsibility. It does not diagnose, triage, or advise. That restraint is not a missing feature; it is the feature.
The first value of AI here is not replacing the front desk. It is making sure every call gets answered, routed, logged, and recovered. The front desk stays. What changes is that no call falls through the cracks during the busy hour, and every call that lands leaves a structured record behind it.
Every missed call during peak hours is lost revenue walking out the door. Athena's job is to make sure no call goes unanswered — and every answered call moves forward.
The leak is structural, not occasional. A front desk is a single-threaded resource: one person, a finite number of lines, and a peak window where calls arrive faster than they can be handled. Industry handle-time data puts a single routine call at three to five minutes of front-desk time, and healthcare average handle time runs around 6.6 minutes per call. With each physician fielding dozens of calls a day, the math compounds fast — and a meaningful share of peak-hour calls never get answered at all. The exact loss depends on each clinic's volume and per-visit value; what's certain is that the leak is real and that it compounds every business day.
I'm deliberately not quoting a single headline figure for missed-call rate, lost revenue, or deflection percentage. The published numbers disagree across sources, and the high-end figures trace back to vendor marketing rather than operational data. A number I can't defend under questioning is worse than no number at all. Instead, here is the shape of the account, so a clinic operator can plug in their own inputs: revenue leaked = inbound calls × missed-call rate × conversion rate of an answered call × value per visit.
Athena's lever is the missed-rate term. It does not change how many people call, what a visit is worth, or how often an answered call converts — it changes how many calls get answered in the first place, and it recovers the ones that slip through. Everything else in the equation belongs to the clinic. The point of writing it this way is to show that the value is mechanical and measurable on each clinic's own data, not borrowed from someone else's slide.
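As a rough illustration, the account can be written as a product of four inputs. The function name, the parameter names, and every number below are placeholders invented for this sketch, not measurements from any clinic. The multiplicative form itself is an assumption drawn from the four terms named above.

```python
def revenue_leak(calls: float, missed_rate: float,
                 conversion_rate: float, value_per_visit: float) -> float:
    """Revenue lost to unanswered calls over whatever period `calls` covers."""
    return calls * missed_rate * conversion_rate * value_per_visit


# Placeholder inputs only; a clinic operator substitutes their own data.
inputs = dict(calls=100, conversion_rate=0.5, value_per_visit=40.0)

before = revenue_leak(missed_rate=0.25, **inputs)   # hypothetical rate without recovery
after = revenue_leak(missed_rate=0.125, **inputs)   # only the missed-rate term moves
recovered = before - after
```

Holding the other three inputs fixed and changing only `missed_rate` mirrors the claim that the missed-rate term is the single lever Athena touches.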
Athena is not a pile of capabilities looking for a use. Each module exists to kill a specific pain. The system was designed backward from the five problems a mid-sized clinic actually has, and every module maps to one of them.
| Clinic pain | Athena module | How it solves it |
|---|---|---|
| Repetitive queries overwhelm the front desk | Router + FAQ (Workflow) | High-frequency questions never reach a reasoning agent; once the router classifies them, the workflow answers directly |
| Booking, rescheduling, and cancellations eat staff hours | Booking (tool-calling Agent) | Closed on the call, with a written confirmation |
| Missed calls leak revenue | Missed-call recovery | The caller is actively re-engaged instead of the lead evaporating |
| Implicit clinical risk on every call | Safety Guard (L1 + L2) | Never diagnoses or triages; escalates to a human the moment a call crosses the line |
| Calls end with nothing recorded | Closure (Summarization) | Every call leaves a structured record behind |
The router is the load-bearing decision. Sending a "what time do you open Saturday?" query through a full reasoning agent is the most common way these systems become slow and expensive at once. Athena classifies first, then spends compute only where the call actually requires it.
The production target is a phone line. This demo runs in the browser — not as a compromise, but because a phone call only lets you hear the agent. The browser lets you watch the system think: routing, safety checks, and tool calls happening live. The assignment asked for multi-agent orchestration "behind the scene" — so I put the scene on screen.
The same voice pipeline drives both. The browser captures audio through the microphone; production captures it through a telephony provider. Everything downstream of the audio entry point — speech-to-text, the orchestrator, the four modules, the synthesized reply — is identical. Only the audio entry and exit points differ between browser mic and a phone line. The browser is not a different system pretending to be the phone; it is the same system with a different front door.
That choice buys an evaluation advantage a phone call can't offer. The interface is a split view: the call itself on one side, a Decision Room on the other showing Router → Safety → Booking → Closure as each step fires. The reviewer sees the caller's experience and the orchestration underneath it (the routing decision, the safety verdict, the tool calls) at the same time, which is exactly the "behind the scene" workflow the assignment asked to see.
In a cascading voice stack, STT → LLM → TTS run in series. Every layer adds delay, and on a phone call, silence over a second reads as "it broke." So latency isn't something I tuned at the end — it's the constraint the entire design bends around.
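Two things make this hard, and I treated them separately: actual latency, meaning how long the real answer takes to arrive, and perceived latency, meaning how long the caller sits in silence before hearing anything at all.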
These are different problems. Optimizing the first is engineering. Optimizing the second is product. I built for both — and I measure them as two separate metrics, because conflating them hides the thing that actually matters to a caller.
Humans swap conversational turns in roughly 200ms (Stivers et al., PNAS). Nielsen's thresholds put 1s as the limit for "uninterrupted flow." A voice agent isn't judged against software norms — it's judged against the person you'd otherwise be talking to. So the bar is: the caller should never sit in dead air.
Filler (Layer 1 static mp3). The moment routing resolves, a pre-recorded line plays — so the caller hears a response while the real answer is still being generated. To be precise about the timing: the filler doesn't fire at the instant the caller stops speaking. It fires once speech-to-text completes, the orchestrator returns its first streamed byte, and — on any path that isn't an immediate L1 emergency — the router has resolved intent. That's the earliest honest moment the system knows enough to say something, and it says it then.
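The timing rule above can be restated as a small gate. This is a minimal sketch, not the shipped filler logic: the `TurnState` fields and the function name are hypothetical, and it assumes that on the L1 path the static emergency line is what plays, so the router condition is waived there.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    stt_complete: bool          # speech-to-text has returned a transcript
    first_byte_received: bool   # orchestrator has streamed its first byte
    l1_emergency: bool          # L1 regex matched; the router never runs
    router_resolved: bool       # router has resolved intent


def audio_may_start(s: TurnState) -> bool:
    """True at the earliest moment pre-recorded audio is allowed to play."""
    # Nothing plays before a transcript exists and the stream has opened.
    if not (s.stt_complete and s.first_byte_received):
        return False
    # On the L1 path the router condition is waived; otherwise wait for intent.
    return s.l1_emergency or s.router_resolved
```

Keeping the gate as a pure function of four flags makes it easy to enumerate its decision cases in a local check.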
Three-tier speech architecture. Static mp3, then template slot-filling, then full LLM + TTS. The cheapest sentence is the one you don't generate at runtime, so the most common responses never pay synthesis cost.
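One way to picture the three tiers is as a lookup that falls through from cheapest to most expensive. The clip names, template text, and response keys below are invented for illustration. The sketch also assumes that the template tier skips LLM generation and that a template with a missing slot falls through to the full path.

```python
from typing import Dict, Optional, Tuple

# Hypothetical tier-1 assets: pre-recorded clips keyed by response type.
STATIC_CLIPS: Dict[str, str] = {"one_moment": "one_moment.mp3"}

# Hypothetical tier-2 templates: fixed wording with slots filled at runtime.
TEMPLATES: Dict[str, str] = {
    "booking_confirmed": "You're booked with {doctor} at {time}. "
                         "Your confirmation ID is {confirmation_id}.",
}


def choose_speech(key: str, slots: Optional[Dict[str, str]] = None
                  ) -> Tuple[str, Optional[str]]:
    """Return (tier, payload), trying the cheapest tier first."""
    if key in STATIC_CLIPS:
        return ("static", STATIC_CLIPS[key])
    template = TEMPLATES.get(key)
    if template is not None and slots is not None:
        try:
            return ("template", template.format(**slots))
        except KeyError:
            pass  # a slot is missing, so fall through to the full path
    return ("llm_tts", None)  # caller generates text with the LLM, then synthesizes
```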
Model tiering. The router runs on Haiku — fast, cheap, and sufficient for classification. Booking runs on Sonnet, where accuracy matters and the filler already hides the added cost. Speed where it's safe, accuracy where it matters.
Selective Safety L2. The semantic safety check only fires when the router flags that a turn needs it. On a routine booking it's skipped entirely, which saves the L2 round-trip whenever the call carries no clinical signal. In testing, that took roughly a second off the turn.
The figures below come from a test client in Nanjing hitting a Vercel deployment in San Francisco, which means every turn crosses the Pacific twice: once for the speech upload and once for the streamed response. A caller in Singapore hitting an in-region deployment should see lower numbers.
| Metric | Measured | What it is |
|---|---|---|
| Perceived TTFA (filler onset) | ~2.4–3.6s | First sound the caller hears after they stop speaking |
| Actual content (first real sentence) | ~7.6–8.2s | The full answer via LLM + TTS |
| Server-side, clean US baseline (GitHub Actions probe, no cross-border) | Router 704ms · Booking 4.56s · LLM total 5.5s p50 | Module-level timings, US runner, no Pacific round-trip |
| Cold start | ~10s on first call, drops after warmup | Mitigated by preheat on the welcome modal |
One caveat I want to be exact about: the server-side baseline is a GitHub Actions probe on a clean US runner, not a production Vercel measurement. The production Vercel-to-model p50 was never formally run. I'm citing the probe as evidence that the reasoning layer is fast — not claiming it as a production number, because it isn't one.
The gap between the 2.4–3.6s a caller perceives and the sub-second turn-swap a human expects is real, and I can account for every part of it. It comes from three places: the cross-border test network on both legs (speech upload plus the streamed return), cold start on the first call, and TTS synthesis. The reasoning layer — the part that's actually mine to optimize — is already fast: the router resolves in 704ms on a clean US probe. The three sources of the gap are environmental and deployment problems, not architectural ones. The shape of the system is right; it's running across an ocean from a cold container, and that's a where-it-runs problem, not a how-it's-built problem.
My design target is sub-second perceived latency. I'm not there yet: measured filler onset is 2.4–3.6 seconds in the cross-border test today. But I know exactly which layers own the gap. Three of the four layers in the path (network, cold start, TTS) are environmental rather than architectural, and the fourth, the reasoning layer, is already fast. The path to closing the gap is on the V2 roadmap: deploy in-region to kill the cross-border legs, run an always-on container to kill cold start, and upgrade the TTS model. I'd rather show you a number I can defend than one I can't reproduce.
Each decision below follows the same shape: what I decided, why, and what it costs. Reviewers read trade-offs, not implementations.
The server never carries the audio stream — only transcripts and orchestration. That's the one decision that makes serverless viable here.
Audio goes straight from the browser to Deepgram over a short-lived token; the orchestrator only ever receives the resulting transcript and streams its reasoning back over Server-Sent Events. Because the server handles text rather than a live audio stream, each orchestration turn is a short promise that finishes well inside the serverless function window instead of holding a persistent connection open. That's what lets this run on Vercel functions at all.
The trade-off is explicit: this design does not support scenarios that need a continuous server-side audio stream — true server-side barge-in being the clearest example. That's a known V2 item, not an oversight.
A filler isn't a hack — it's the product deciding what "responsive" means to a caller. The engineering question is "how do I lower latency"; the product question is "what does the caller experience in the second after they stop talking." The filler is the answer to the second question, and it's a deliberate product call to spend a pre-recorded line buying the perception of responsiveness while the real answer generates. It's covered mechanically in §4; it lives here as a reminder that it's a product decision first.
Booking writes to a medical schedule. I'd rather spend 200ms than risk booking the wrong slot — and the filler already hides that cost.
The reasoning is about consequence, not benchmarks. Booking is a write operation against a medical calendar: it picks a slot, commits it atomically, and hands the caller a confirmation. An error there isn't a slow answer, it's a wrong appointment in a real schedule. That makes accuracy the priority over latency for this one module — and since the filler is already covering the perceived wait, there's no user-facing reason to trade accuracy down to save time-to-first-token. I'm not citing a head-to-head benchmark to justify this, because the relevant public numbers are for an earlier model generation and don't apply to the version Athena actually runs. The justification is the consequence of getting a medical write wrong, plus the fact that the cost is already hidden.
The six booking tools the agent calls: check_slots finds available slots in a date range, optionally filtered by specialty or doctor, balanced across doctors and capped at five; book commits a slot atomically and returns a confirmation ID, retrying on collision; reschedule moves an appointment to a new time, claiming the new slot before releasing the old one; cancel releases a slot and removes the booking record; get_booking_by_confirmation looks up a booking before any reschedule or cancel; and get_clinic_hours returns static daily hours. The claim-before-release ordering in reschedule is deliberate — it never leaves the caller with no appointment if the new slot fails.
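The claim-before-release ordering can be sketched with an in-memory stand-in for the booking store. Everything here is hypothetical scaffolding (`BookingStore`, `SlotTaken`, the slot strings). A real store would also need the claim step to be atomic, which a plain Python set does not provide across processes.

```python
class SlotTaken(Exception):
    """Raised when a slot is already in the occupied set."""


class BookingStore:
    """In-memory stand-in: an occupied-slots set plus one record per confirmation ID."""

    def __init__(self):
        self.occupied = set()
        self.bookings = {}

    def claim(self, slot):
        if slot in self.occupied:
            raise SlotTaken(slot)
        self.occupied.add(slot)

    def release(self, slot):
        self.occupied.discard(slot)


def reschedule(store, confirmation_id, new_slot):
    booking = store.bookings[confirmation_id]
    old_slot = booking["slot"]
    store.claim(new_slot)       # may raise; the old booking is still untouched
    booking["slot"] = new_slot  # reached only once the new slot is held
    store.release(old_slot)     # the old slot is freed last
    return booking
```

If `claim` raises, the function exits before any state changes, so the caller keeps the original appointment.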
Perceived latency (filler onset) and actual-content latency (first real sentence) are tracked as separate specification items, for the reason given in §4: they're different problems with different owners, and a single blended number would hide the one the caller actually feels.
Pulled out into its own chapter — see §6. It's not a feature among features; it's the line the product is built not to cross.
The most important thing Athena does is know what it must never do: it does not diagnose, it does not triage, it does not give medical advice. When a call crosses that line, it hands off to a human — fast.
The boundary is enforced in two layers, plus prompt-level constraints across every module.
L1 — regex, synchronous, sub-5ms. Six emergency patterns are matched on the raw transcript before anything else runs: chest pain, severe bleeding, shortness of breath, suicidal ideation, loss of consciousness, and pregnancy emergency. A hit short-circuits the entire pipeline — the router and L2 never run — and the system immediately plays a static emergency reply telling the caller to stay on the line or hang up and dial their local emergency number, then marks the call escalated. There is no model in this path on purpose: an emergency keyword should never wait on an LLM.
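A minimal sketch of what a synchronous regex pass over the raw transcript could look like. The six category labels come from the list above, but the patterns themselves are illustrative guesses, not Athena's actual expressions, and a production list would need far broader phrasing coverage.

```python
import re
from typing import Optional

# Illustrative patterns only; the real expressions are not given in this document.
_RAW_PATTERNS = {
    "chest_pain": r"\bchest\s+pains?\b",
    "severe_bleeding": r"\b(?:severe|heavy)\s+bleeding\b|\bbleeding\s+heavily\b",
    "shortness_of_breath": r"\bshort(?:ness)?\s+of\s+breath\b|\bcan'?t\s+breathe\b",
    "suicidal_ideation": r"\bsuicid(?:e|al)\b|\bkill\s+myself\b",
    "loss_of_consciousness": r"\b(?:passed\s+out|unconscious|lost\s+consciousness)\b",
    "pregnancy_emergency": r"\bpregnan(?:t|cy)\b.*\b(?:bleeding|severe\s+pain)\b",
}

# Compiled once at import time so the per-turn check is a handful of searches.
EMERGENCY_PATTERNS = [
    (label, re.compile(pattern, re.IGNORECASE))
    for label, pattern in _RAW_PATTERNS.items()
]


def l1_check(transcript: str) -> Optional[str]:
    """Return the first matching emergency label, or None if nothing matches."""
    for label, pattern in EMERGENCY_PATTERNS:
        if pattern.search(transcript):
            return label
    return None
```

A non-`None` result is the signal to skip the router and L2 and play the static emergency reply.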
L2 — Haiku, conditional, semantic. When the router flags that a turn carries possible clinical signal, a Haiku pass makes a three-way judgment. It escalates to emergency on acute signals — stroke symptoms, acute respiratory failure, severe bleeding, loss of consciousness, active suicidal intent, active labor, severe allergic reaction, or a caller stating they're in an emergency. It escalates to a human on softer or ambiguous distress — palpitations, dizziness, multi-day symptoms, complaints, billing, insurance, or an explicit request for a person. And it continues the booking flow when there's no clinical signal at all, including the common case of a mild symptom attached to a routine booking. When uncertain between the two escalation paths, it biases toward escalate-to-human rather than over-triggering emergency; when the router is uncertain, it flags for L2 rather than skipping it. This selective firing is also what saves latency on routine calls, since L2 is skipped entirely when not flagged.
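The two biases described above can be restated as explicit rules. In the text they are part of the router's and the model's own judgment, so the functions below are only a restatement under assumptions: the enum values, the tri-state `router_flag`, and the `torn_between_escalations` input are all hypothetical.

```python
from enum import Enum
from typing import Optional


class SafetyDecision(Enum):
    ESCALATE_EMERGENCY = "escalate_emergency"
    ESCALATE_HUMAN = "escalate_human"
    CONTINUE = "continue"


def should_run_l2(router_flag: Optional[bool]) -> bool:
    """router_flag: True = clinical signal, False = none, None = router unsure."""
    # An unsure router flags for L2 rather than skipping it.
    return router_flag is not False


def settle(decision: SafetyDecision, torn_between_escalations: bool) -> SafetyDecision:
    """Apply the tie-break between the two escalation paths."""
    if torn_between_escalations:
        # Bias toward a human handoff instead of over-triggering emergency.
        return SafetyDecision.ESCALATE_HUMAN
    return decision
```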
How "no diagnosis, no triage" is actually enforced. There is no runtime guardrail interceptor. The constraint is enforced by prompt rules in every module plus a schema-validated output contract. The router's prompt forbids producing diagnostic content, giving medical advice, or recommending treatment — it only classifies and flags. The L2 and Closure prompts ban the vocabulary of clinical assessment outright — triage, diagnosis, medical advice, symptom assessment — and Closure additionally forbids referencing symptoms in clinical language. The boundary is held by what the models are instructed never to say and a contract that validates the shape of what they return, not by a filter sitting in the request path.
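A schema-validated output contract might look like the sketch below for the router. The field names (`intent`, `needs_l2`) and the intent labels are assumptions made for illustration. The point is only that a contract with no free-text field leaves no place for diagnostic content to appear.

```python
import json

# Hypothetical intent labels; the real set is not listed in this document.
ALLOWED_INTENTS = {"faq", "book", "reschedule", "cancel", "handoff"}
REQUIRED_FIELDS = {"intent", "needs_l2"}


def validate_router_output(raw: str) -> dict:
    """Parse the router's reply and reject anything outside the contract."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("router output must be a JSON object")
    if set(data) != REQUIRED_FIELDS:
        # Extra keys (for example a free-text 'advice' field) are rejected.
        raise ValueError("unexpected or missing fields: %s" % sorted(data))
    if not isinstance(data["intent"], str) or data["intent"] not in ALLOWED_INTENTS:
        raise ValueError("unknown intent")
    if not isinstance(data["needs_l2"], bool):
        raise ValueError("needs_l2 must be a boolean")
    return data
```

This validates shape, not meaning, which matches the description above: the prompt rules constrain what is said, and the contract constrains what can be returned.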
What I left out is as deliberate as what I built. V1 owns one thing well: real-time answer-and-close during business hours. A take-home has a fixed budget, and the PM's job is to choose, not to attempt everything. Each thing below was considered and consciously deferred, with a reason:
Every limitation here is a choice I can defend — and a line item on the V2 roadmap.
Three vendors, one orchestrator, no real-time model. Every API key lives on the server; the browser never sees them.
A turn flows like this. The browser captures audio through MediaRecorder (webm/opus, 48kHz) and posts it to a Deepgram proxy, which runs Nova-3 speech-to-text with keyterm boosting on the clinic's doctor names so they transcribe reliably. The resulting transcript is posted to the turn endpoint, which opens an SSE stream and runs the orchestrator. Inside the orchestrator, the modules fire in sequence: L1 safety regex first, then intent classification on the router, then — conditionally — L2 safety, then — conditionally — the booking agent with its six tools. The orchestrator emits the response text, the browser requests synthesis from the TTS endpoint, ElevenLabs Flash v2.5 returns an audio blob, and it plays. Closure runs separately on its own endpoint after the call ends, so it never sits in the per-turn latency path.
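The per-turn sequence inside the orchestrator can be sketched as a function that takes the four steps as callables and returns the events it would emit. The event names, intent labels, and reply keys are hypothetical, and streaming is replaced by a plain list to keep the sketch self-contained.

```python
from typing import Callable, List, Optional, Tuple

BOOKING_INTENTS = {"book", "reschedule", "cancel"}  # hypothetical labels


def run_turn(transcript: str,
             l1_check: Callable[[str], Optional[str]],
             classify: Callable[[str], dict],
             l2_check: Callable[[str], str],
             booking_agent: Callable[[str], str]) -> List[Tuple[str, str]]:
    events: List[Tuple[str, str]] = []

    # 1. L1 regex first; a hit short-circuits everything after it.
    hit = l1_check(transcript)
    if hit is not None:
        return [("l1", hit), ("reply", "emergency_static")]

    # 2. Intent classification on the router.
    route = classify(transcript)
    events.append(("router", route["intent"]))

    # 3. L2 runs only when the router flagged the turn.
    if route["needs_l2"]:
        decision = l2_check(transcript)
        events.append(("l2", decision))
        if decision == "escalate_emergency":
            return events + [("reply", "emergency_static")]
        if decision == "escalate_human":
            return events + [("reply", "handoff")]

    # 4. The booking agent runs only for booking-type intents.
    if route["intent"] in BOOKING_INTENTS:
        events.append(("booking", booking_agent(transcript)))
    else:
        events.append(("reply", "workflow"))
    return events
```

Passing the steps in as callables keeps the ordering logic testable without any model or network call.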
The four modules and the models behind them:
| Module | Model |
|---|---|
| Router | claude-haiku-4-5 |
| Safety L2 | claude-haiku-4-5 |
| Booking | claude-sonnet-4-6, six tools |
| Closure | claude-sonnet-4-6 |

State and persistence: each call's session lives in an in-process map pinned for the call's lifetime; bookings persist to Upstash Redis, with an occupied-slots set and one JSON record per confirmation ID. The TTS voice is a fixed configuration (stability 0.5, similarity boost 0.75, style 0.0, speaker boost on). Voice-activity detection thresholds (the RMS gate, the one-second silence-to-end-of-turn window, and the six-second silence-to-end-of-call window) are set in configuration, not inferred at runtime.
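To make the voice-activity thresholds concrete, here is a minimal silence tracker driven by fixed constants. The one-second and six-second windows come from the text. The RMS gate value, the frame format (floats in the range -1 to 1), and the class itself are assumptions for illustration.

```python
import math

RMS_GATE = 0.02       # hypothetical value; the real gate is not stated
END_OF_TURN_S = 1.0   # silence that ends the caller's turn
END_OF_CALL_S = 6.0   # silence that ends the call


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


class SilenceTracker:
    def __init__(self):
        self.silent_for = 0.0  # seconds of continuous sub-gate audio

    def feed(self, frame, frame_seconds):
        """Classify one frame as speech, silence, end_of_turn, or end_of_call."""
        if rms(frame) >= RMS_GATE:
            self.silent_for = 0.0
            return "speech"
        self.silent_for += frame_seconds
        if self.silent_for >= END_OF_CALL_S:
            return "end_of_call"
        if self.silent_for >= END_OF_TURN_S:
            return "end_of_turn"
        return "silence"
```

The tracker keeps reporting `end_of_turn` on every frame past the first threshold, so a caller of this sketch would act on the first occurrence only.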
The single architectural point worth restating: no model in this stack runs in real time over a persistent audio connection. The orchestrator is text-in, text-out per turn, which is exactly what makes the serverless deployment in §5.1 hold together.
A demo you can't re-run isn't a demo — it's a screenshot. Athena ships with a golden-path harness.
Eight golden demos cover the full intent surface: a straight booking; a reschedule by confirmation ID; a cancel; an opening-hours inquiry handled without a tool call; a missed-call recovery; an out-of-scope insurance dispute that routes to a human handoff; a soft-safety case (days of palpitations attached to a booking request) that fires L2 and escalates to a human; and an emergency case (chest pain) where L1 matches, the router never runs, and the pipeline short-circuits. A stroke-symptom case exercising the L2 emergency path exists alongside these, though its verify script hasn't been run.
The verify harness posts each utterance to the turn endpoint, reads the SSE stream, and asserts on pipeline event shape: the resolved intent, whether L2 was required, the safety decision, whether L1 triggered, whether booking completed, whether the intent reached a terminal state, and whether the filler intent matched. It checks the shape of the pipeline's decisions, not the semantic content of the replies — it confirms the system routed and decided correctly, which is the part that has to be reproducible. Two companion harnesses round it out: a filler-logic check covering nine decision cases locally, and a Playwright voice-loop check.
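A shape-only check of this kind can be sketched as an SSE parser plus a field comparison. The event names and payload fields below are invented for illustration, the stream is a hard-coded string rather than a live response from the turn endpoint, and the parser assumes LF line endings.

```python
import json


def parse_sse(stream_text):
    """Split an SSE body into (event_name, payload) pairs."""
    events = []
    for block in stream_text.strip().split("\n\n"):
        name, data_lines = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
    return events


def shape_mismatches(events, expected):
    """Return {field: (observed, expected)} for every field that differs."""
    observed = {}
    for _name, payload in events:
        observed.update(payload)
    return {key: (observed.get(key), want)
            for key, want in expected.items()
            if observed.get(key) != want}


# Hypothetical stream for an emergency utterance.
SAMPLE = (
    'event: safety\n'
    'data: {"l1_triggered": true, "safety_decision": "escalate_emergency"}\n\n'
    'event: done\n'
    'data: {"booking_completed": false}\n\n'
)
```

An empty result from `shape_mismatches` means the pipeline's decisions matched the expected shape. The reply text is never inspected.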