System & Evaluation
Cascading architecture
Three vendors, one orchestrator, no real-time model: Deepgram for ASR, Anthropic Claude (Haiku 4.5 + Sonnet 4.6) for reasoning, ElevenLabs Flash v2.5 for TTS. All API keys live on the server; the browser never sees them.
Browser MediaRecorder (audio/webm; opus)
│ POST audio blob
▼
/api/deepgram-proxy ← server-side, master key stays on server
│ POST → Deepgram REST /v1/listen
│ { transcript, confidence }
▼
Browser → /api/turn (SSE)
│
│ Orchestrator
│ 1. Safety L1 regex (sync, <5ms)
│ 2. Router (Haiku 4.5) → intent + requires_safety_l2
│ 3. Conditional Safety L2 (Haiku 4.5) iff flagged
│ 4. Intent-terminal short-circuits (out_of_scope / end_call)
│ 5. Booking Agent (Sonnet 4.6 + tool calls)
│
│ SSE events: turn_started → router_done → safety_done
│ → (intent_terminal | booking_done) → response_text
│ → turn_complete
▼
Browser → /api/tts → ElevenLabs Flash v2.5 → audio playback
│
▼ Loop next turn until End Call
On End Call:
Browser → /api/close → Closure Agent → { outcome, front_desk_task }
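The orchestrator's five steps and its SSE event order can be sketched as plain control flow. This is a minimal sketch, not the deployed code. The event names, the `requires_safety_l2` flag, the `out_of_scope` / `end_call` intents, and the three safety tiers come from this document. Everything else is an assumption made for illustration: the `booking` intent name, the `Stages` interface, the reply strings, treating an L1 regex hit as a reason to run L2, and treating an escalation as a terminal turn. The stages are modelled as synchronous functions so the ordering is easy to read; the real stages are network calls and would be awaited.

```typescript
type Intent = "booking" | "out_of_scope" | "end_call"; // "booking" is an assumed name
type SafetyTier = "escalate_emergency" | "escalate_human" | "continue_booking_flow";
type TurnEvent =
  | "turn_started" | "router_done" | "safety_done"
  | "intent_terminal" | "booking_done" | "response_text" | "turn_complete";

interface RouterResult { intent: Intent; requires_safety_l2: boolean; }

// Hypothetical stage interface: each member stands in for one box in the diagram.
interface Stages {
  safetyL1(utterance: string): boolean;    // regex screen, true on a hit
  route(utterance: string): RouterResult;  // stand-in for the Haiku router
  safetyL2(utterance: string): SafetyTier; // stand-in for the Haiku L2 check
  book(utterance: string): string;         // stand-in for the Sonnet booking agent
}

function runTurn(
  utterance: string,
  stages: Stages,
  emit: (event: TurnEvent) => void,
): string {
  emit("turn_started");

  const l1Hit = stages.safetyL1(utterance); // step 1
  const routed = stages.route(utterance);   // step 2
  emit("router_done");

  // Step 3: L2 runs only when flagged. Counting an L1 hit as a flag is an assumption.
  const tier: SafetyTier = routed.requires_safety_l2 || l1Hit
    ? stages.safetyL2(utterance)
    : "continue_booking_flow";
  emit("safety_done");

  let reply: string;
  if (tier !== "continue_booking_flow") {
    // Assumption: an escalation ends the turn the same way a terminal intent does.
    reply = tier === "escalate_emergency"
      ? "Connecting you to emergency help."
      : "Transferring you to a staff member.";
    emit("intent_terminal");
  } else if (routed.intent === "out_of_scope" || routed.intent === "end_call") {
    // Step 4: terminal intents skip the booking agent entirely.
    reply = routed.intent === "end_call" ? "Goodbye." : "I can only help with bookings.";
    emit("intent_terminal");
  } else {
    reply = stages.book(utterance); // step 5
    emit("booking_done");
  }

  emit("response_text");
  emit("turn_complete");
  return reply;
}
```

The point of the short-circuit is cost: a turn that resolves at step 4 never pays for the booking agent, which is the slowest stage in the server-side baseline.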
Latency
Latency is tracked as two separate metrics because they measure different things. Perceived TTFA is the moment the caller hears the first sound after they stop speaking: the filler onset. Actual content is the moment the real answer (full LLM + TTS) starts playing. The first is what the caller feels; the second is what the system actually delivers. Conflating them into a single end-to-end number hides the one that decides the caller experience.
| Metric | Measured | What it is |
|---|---|---|
| Perceived TTFA (filler onset) | ~2.4–3.6s | First sound the caller hears after they stop speaking |
| Actual content (first real sentence) | ~7.6–8.2s | The full answer via LLM + TTS — what the original end-to-end TTFA number measured |
| Server-side baseline (GitHub Actions probe, US runner, no cross-border) | Router 704ms · Booking 4.56s · LLM total 5.5s p50 | Module-level timings on a clean US runner; no Pacific round-trip |
| Cold start | ~10s on first call, drops after warmup | Mitigated by preheat on the welcome modal |
Measurement conditions for the two end-to-end rows: a test client in Nanjing hitting a Vercel deployment in San Francisco, so every turn crosses the Pacific twice, once for the speech upload and once for the streamed return. A real user served in-region in Singapore would see lower numbers.
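To make the two end-to-end metrics concrete, here is a minimal sketch of how both can be derived from three timestamps taken on the same clock (for example `performance.now()` in the browser). The interface and field names are hypothetical, and the numbers in the example are illustrative values, not measurements.

```typescript
// Timestamps for one turn, all read from the same monotonic clock.
interface TurnTimestamps {
  micStopMs: number;     // caller stops speaking
  fillerOnsetMs: number; // first filler audio becomes audible
  replyOnsetMs: number;  // first real sentence becomes audible
}

interface TurnLatency {
  perceivedTtfaMs: number; // mic-stop to filler onset
  actualContentMs: number; // mic-stop to first real sentence
}

function turnLatency(t: TurnTimestamps): TurnLatency {
  // The filler is by definition the first sound, so it cannot follow the reply.
  if (t.fillerOnsetMs < t.micStopMs || t.replyOnsetMs < t.fillerOnsetMs) {
    throw new Error("timestamps out of order");
  }
  return {
    perceivedTtfaMs: t.fillerOnsetMs - t.micStopMs,
    actualContentMs: t.replyOnsetMs - t.micStopMs,
  };
}

// Illustrative values only.
const exampleLatency = turnLatency({
  micStopMs: 1000,
  fillerOnsetMs: 3400,
  replyOnsetMs: 8800,
});
console.log(exampleLatency); // { perceivedTtfaMs: 2400, actualContentMs: 7800 }
```

Both numbers share the same start point (mic-stop), which is what lets them sit side by side in one table without being mistaken for each other.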
Where the gap comes from
The gap between the 2.4–3.6s perceived and the sub-second turn-swap a human expects comes from three places: the cross-border test network on both legs, cold start on the first call, and TTS synthesis. The reasoning layer — the part that's actually mine to optimize — is already fast: the router resolves in 704ms on a clean US probe. The three sources of the gap are environmental and deployment problems, not architectural ones.
V2 path to close it
Deploy in-region to kill the cross-border legs. Run an always-on container to kill cold start. Upgrade the TTS model. The shape of the system is right; closing the gap is a where-it-runs problem, not a how-it's-built problem.
State & persistence
Each call's session lives in an in-process map pinned for the call's lifetime. Bookings persist to Upstash Redis (Vercel Marketplace): a set of occupied slots and one JSON record per confirmation ID. Per-turn KV overhead is well under the Booking Sonnet round-trip, so KV is not a factor in perceived latency.
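The booking layout described above (one set of occupied slots plus one JSON record per confirmation ID) might look like the following sketch. The key names, the `BookingRecord` fields, and the `Kv` interface are assumptions made for illustration; the method names mirror the Redis commands SADD, SET, and GET. An in-memory stand-in replaces the real store so the example runs on its own, and it is synchronous for brevity, whereas a hosted Redis client would be awaited.

```typescript
// Minimal key-value surface, named after the Redis commands SADD, SET and GET.
interface Kv {
  sadd(key: string, member: string): number; // 1 if newly added, 0 if already present
  set(key: string, value: string): void;
  get(key: string): string | null;
}

// In-memory stand-in for the hosted store (illustration only).
class MemoryKv implements Kv {
  private sets = new Map<string, Set<string>>();
  private strings = new Map<string, string>();

  sadd(key: string, member: string): number {
    const members = this.sets.get(key) ?? new Set<string>();
    this.sets.set(key, members);
    if (members.has(member)) return 0;
    members.add(member);
    return 1;
  }
  set(key: string, value: string): void {
    this.strings.set(key, value);
  }
  get(key: string): string | null {
    return this.strings.get(key) ?? null;
  }
}

// Hypothetical record shape and key names.
interface BookingRecord {
  confirmationId: string;
  slot: string; // e.g. an ISO-like timestamp
  callerName: string;
}
const OCCUPIED_SLOTS_KEY = "slots:occupied";

// Claims the slot first; writes the record only if the slot was free.
function persistBooking(kv: Kv, record: BookingRecord): boolean {
  if (kv.sadd(OCCUPIED_SLOTS_KEY, record.slot) === 0) return false;
  kv.set(`booking:${record.confirmationId}`, JSON.stringify(record));
  return true;
}
```

Claiming the slot with a set-add before writing the record means a double booking shows up as a `0` return from one command rather than as two conflicting records.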
Evaluation method
Latency is measured at two layers: per-stage server-side wall-clock (Router, Safety L2, Booking, TTS synthesis) and end-to-end browser timings (mic-stop → first filler audible, mic-stop → real reply audible). Both are recorded with sample size and measurement conditions so perceived vs actual numbers stay distinguishable.
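One way to keep sample size and conditions attached to every reported number is to summarize each stage into a single record. The sketch below is an assumption about how that could be done, not the project's actual tooling; the `LatencySummary` shape and the function names are hypothetical, and the sample values are illustrative.

```typescript
interface LatencySummary {
  stage: string;      // e.g. "Router" or "mic-stop to first filler audible"
  conditions: string; // where and how the samples were taken
  n: number;          // sample size
  p50Ms: number;      // median of the samples
}

// Median: middle value for odd counts, mean of the two middle values for even counts.
function p50(samplesMs: number[]): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function summarize(stage: string, conditions: string, samplesMs: number[]): LatencySummary {
  return { stage, conditions, n: samplesMs.length, p50Ms: p50(samplesMs) };
}

// Illustrative samples only.
const routerSummary = summarize("Router", "US runner, no cross-border", [690, 704, 731]);
console.log(routerSummary);
```

Because `n` and `conditions` travel with the median, a server-side figure cannot be quoted later as if it were an end-to-end one.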
Safety classification quality is tracked across canonical scenarios — L1 regex pattern coverage, and L2 escalation tier accuracy across escalate_emergency, escalate_human, and continue_booking_flow. Updates to the safety prompts are validated by an automated decision-behavior check that asserts against fixed utterances.
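A decision-behavior check of this kind can be sketched as a table of fixed utterances with expected tiers, run against whatever classifier is under test. The three tier names come from this document; the utterances, the `checkDecisions` helper, and the keyword stub that stands in for the model call are all invented for illustration. A real check would call the model and would be asynchronous.

```typescript
type Tier = "escalate_emergency" | "escalate_human" | "continue_booking_flow";

interface DecisionCase { utterance: string; expected: Tier; }
interface Mismatch extends DecisionCase { actual: Tier; }

// Runs every fixed case through the classifier and returns the ones that disagree.
function checkDecisions(
  classify: (utterance: string) => Tier,
  cases: DecisionCase[],
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const c of cases) {
    const actual = classify(c.utterance);
    if (actual !== c.expected) mismatches.push({ ...c, actual });
  }
  return mismatches;
}

// Hypothetical fixed utterances, one per tier.
const FIXED_CASES: DecisionCase[] = [
  { utterance: "This is an emergency, I need help right now", expected: "escalate_emergency" },
  { utterance: "Can I speak to a real person?", expected: "escalate_human" },
  { utterance: "I'd like to book Tuesday at 3pm", expected: "continue_booking_flow" },
];

// Keyword stub standing in for the L2 model call (illustration only).
function keywordStub(utterance: string): Tier {
  const text = utterance.toLowerCase();
  if (/\bemergency\b/.test(text)) return "escalate_emergency";
  if (/\b(real person|human)\b/.test(text)) return "escalate_human";
  return "continue_booking_flow";
}

const mismatches = checkDecisions(keywordStub, FIXED_CASES);
if (mismatches.length > 0) {
  throw new Error(`decision check failed: ${JSON.stringify(mismatches)}`);
}
```

Returning the mismatches, rather than failing on the first one, shows in a single run every tier a prompt change has shifted.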
Architectural decisions are documented with the dead-ends recorded alongside the chosen path, so the trade-offs survive a fresh reader.