Architecture · Latency · Evaluation

System & Evaluation

Cascading architecture

Three vendors, one orchestrator, no real-time model: Deepgram for ASR, Anthropic Claude (Haiku 4.5 + Sonnet 4.6) for reasoning, ElevenLabs Flash v2.5 for TTS. All API keys live on the server; the browser never sees them.
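A minimal sketch of how the proxy step might keep the key server-side. The helper names (buildListenRequest, pickTranscript) and the DEEPGRAM_API_KEY variable mentioned below are hypothetical; the endpoint, the Token authorization scheme, and the response path follow Deepgram's pre-recorded REST API.

```typescript
interface Transcript {
  transcript: string;
  confidence: number;
}

interface ListenRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: Uint8Array };
}

// Build the upstream request on the server. The key comes from server-side
// configuration and never appears in anything returned to the browser.
function buildListenRequest(audio: Uint8Array, apiKey: string): ListenRequest {
  return {
    url: "https://api.deepgram.com/v1/listen",
    init: {
      method: "POST",
      headers: {
        Authorization: `Token ${apiKey}`,
        "Content-Type": "audio/webm",
      },
      body: audio,
    },
  };
}

// Reduce Deepgram's response to the two fields the browser needs.
// Missing fields fall back to an empty transcript with zero confidence.
function pickTranscript(payload: unknown): Transcript {
  const alt = (payload as any)?.results?.channels?.[0]?.alternatives?.[0];
  return {
    transcript: typeof alt?.transcript === "string" ? alt.transcript : "",
    confidence: typeof alt?.confidence === "number" ? alt.confidence : 0,
  };
}
```

Inside the route handler this would be roughly: build the request with the key from process.env.DEEPGRAM_API_KEY, pass url and init to fetch, and return pickTranscript(await res.json()) to the client, so only the reduced object leaves the server.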

Browser MediaRecorder (audio/webm; opus)
  │ POST audio blob
  ▼
/api/deepgram-proxy   ← server-side, master key stays on server
  │ POST → Deepgram REST /v1/listen
  │ { transcript, confidence }
  ▼
Browser → /api/turn  (SSE)
  │
  │ Orchestrator
  │   1. Safety L1 regex (sync, <5ms)
  │   2. Router (Haiku 4.5) → intent + requires_safety_l2
  │   3. Conditional Safety L2 (Haiku) iff flagged
  │   4. Intent-terminal short-circuits (out_of_scope / end_call)
  │   5. Booking Agent (Sonnet 4.6 + tool calls)
  │
  │ SSE events: turn_started → router_done → safety_done
  │             → (intent_terminal | booking_done) → response_text
  │             → turn_complete
  ▼
Browser → /api/tts → ElevenLabs Flash v2.5 → audio playback
  │
  ▼ Loop next turn until End Call

On End Call:
  Browser → /api/close → Closure Agent → { outcome, front_desk_task }
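
The orchestrator steps in the diagram can be sketched as an async generator that yields one object per SSE event. This is an illustration under assumptions, not the project's code: the TurnDeps interface, the "booking" intent name, the placeholder L1 pattern, the reply strings, and the rule that an L1 hit maps straight to escalate_emergency are all hypothetical. The event names, the out_of_scope and end_call intents, and the tier names come from this document.

```typescript
type Intent = "booking" | "out_of_scope" | "end_call";
type SafetyTier = "escalate_emergency" | "escalate_human" | "continue_booking_flow";

interface RouterResult {
  intent: Intent;
  requires_safety_l2: boolean;
}

interface TurnEvent {
  event: string;
  data?: unknown;
}

// Hypothetical dependency surface: each function stands in for one model call.
interface TurnDeps {
  route(text: string): Promise<RouterResult>; // Router (Haiku 4.5)
  safetyL2(text: string): Promise<SafetyTier>; // Safety L2 (Haiku)
  book(text: string): Promise<string>; // Booking Agent (Sonnet 4.6)
}

// Placeholder pattern; the real L1 list is not shown in this document.
const L1_PATTERNS: RegExp[] = [/\bchest pain\b/i];

async function* runTurn(text: string, deps: TurnDeps): AsyncGenerator<TurnEvent> {
  yield { event: "turn_started" };

  // 1. Safety L1: synchronous regex screen.
  const l1Hit = L1_PATTERNS.some((pattern) => pattern.test(text));

  // 2. Router: intent plus the requires_safety_l2 flag.
  const routed = await deps.route(text);
  yield { event: "router_done", data: routed };

  // 3. Safety L2 runs only when flagged. Assumption: an L1 hit maps
  //    straight to escalate_emergency without a model call.
  let tier: SafetyTier = "continue_booking_flow";
  if (l1Hit) tier = "escalate_emergency";
  else if (routed.requires_safety_l2) tier = await deps.safetyL2(text);
  yield { event: "safety_done", data: { tier } };

  let reply: string;
  if (tier !== "continue_booking_flow") {
    // Assumed escalation wording, not taken from the source.
    reply = "Connecting you to a person who can help right away.";
  } else if (routed.intent !== "booking") {
    // 4. Intent-terminal short-circuit: the Booking Agent is skipped.
    yield { event: "intent_terminal", data: { intent: routed.intent } };
    reply = routed.intent === "end_call" ? "Goodbye." : "I can only help with bookings.";
  } else {
    // 5. Booking Agent.
    reply = await deps.book(text);
    yield { event: "booking_done" };
  }

  yield { event: "response_text", data: { text: reply } };
  yield { event: "turn_complete" };
}
```

An SSE route would iterate this generator with for await and write each yielded object as one event/data pair. Keeping the model calls behind an interface also lets the event order be tested with stubs.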

Latency

Tracked as two separate metrics, because they measure different things. Perceived TTFA is the moment the caller hears the first sound after they stop speaking — the filler onset. Actual content is the moment the real answer (full LLM + TTS) starts playing. The first is what the caller feels; the second is what the system actually delivers. Conflating them into a single end-to-end number hides the one that decides the caller experience.

Metric	Measured	What it is
Perceived TTFA (filler onset)	~2.4–3.6s	First sound the caller hears after they stop speaking
Actual content (first real sentence)	~7.6–8.2s	The full answer via LLM + TTS — what the original end-to-end TTFA number measured
Server-side baseline (GitHub Actions probe, US runner, no cross-border)	Router 704ms · Booking 4.56s · LLM total 5.5s p50	Module-level timings on a clean US runner; no Pacific round-trip
Cold start	~10s on first call, drops after warmup	Mitigated by preheat on the welcome modal

Measurement conditions for the two end-to-end rows: a test client in Nanjing hitting a Vercel deployment in San Francisco, so the speech upload and the streamed return each cross the Pacific and every turn pays that crossing twice. A real user in Singapore on an in-region deployment would see lower numbers.
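
The two end-to-end rows can be derived from three browser-side timestamps. The sketch below is one way to keep them separate; the TurnTimer class and the mark names are hypothetical, and the clock is injectable so the arithmetic can be checked without a browser.

```typescript
type Mark = "mic_stop" | "filler_audible" | "reply_audible";

class TurnTimer {
  private readonly marks = new Map<Mark, number>();
  private readonly now: () => number;

  // In a browser the default clock is performance.now().
  constructor(now: () => number = () => performance.now()) {
    this.now = now;
  }

  mark(name: Mark): void {
    // Keep the first occurrence so a repeated event cannot move the mark.
    if (!this.marks.has(name)) this.marks.set(name, this.now());
  }

  private delta(from: Mark, to: Mark): number | undefined {
    const start = this.marks.get(from);
    const end = this.marks.get(to);
    return start === undefined || end === undefined ? undefined : end - start;
  }

  // Perceived TTFA: mic-stop to the first filler sound.
  perceivedTtfaMs(): number | undefined {
    return this.delta("mic_stop", "filler_audible");
  }

  // Actual content: mic-stop to the first sentence of the real reply.
  actualContentMs(): number | undefined {
    return this.delta("mic_stop", "reply_audible");
  }
}
```

Calling mark("mic_stop") when the recorder stops, mark("filler_audible") when filler playback starts, and mark("reply_audible") when the reply audio starts yields both metrics for the turn without ever merging them into one number.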

Where the gap comes from

The gap between the 2.4–3.6s perceived TTFA and the sub-second turn-swap a human expects comes from three places: the cross-border test network on both legs, cold start on the first call, and TTS synthesis. The reasoning layer, the part that's actually mine to optimize, is already fast: the router resolves in 704ms on a clean US probe. All three sources of the gap are environmental and deployment problems, not architectural ones.

V2 path to close it

Deploy in-region to kill the cross-border legs. Run an always-on container to kill cold start. Upgrade the TTS model. The shape of the system is right; closing the gap is a where-it-runs problem, not a how-it's-built problem.

State & persistence

Each call's session lives in an in-process map pinned for the call's lifetime. Bookings persist to Upstash Redis (Vercel Marketplace): a set of occupied slots and one JSON record per confirmation ID. Per-turn KV overhead is well under the Booking agent's Sonnet round-trip (4.56s on the server-side probe), so KV is not a factor in perceived latency.
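
A sketch of that split between in-process session state and persisted bookings. The key names, the Booking and Session shapes, and the BookingStore interface are assumptions made for illustration; the interface only lists the two commands used here (SADD and SET), which a Redis client such as Upstash's would be expected to provide.

```typescript
// Minimal store surface this sketch relies on.
interface BookingStore {
  sadd(key: string, member: string): Promise<number>;
  set(key: string, value: string): Promise<unknown>;
}

interface Booking {
  confirmationId: string;
  slot: string;
  callerName: string;
}

interface Session {
  callId: string;
  turns: string[];
}

// In-process session map: lives only as long as this server instance.
const sessions = new Map<string, Session>();

function getSession(callId: string): Session {
  let session = sessions.get(callId);
  if (!session) {
    session = { callId, turns: [] };
    sessions.set(callId, session);
  }
  return session;
}

// Hypothetical key layout.
const OCCUPIED_SLOTS_KEY = "slots:occupied";
const bookingKey = (confirmationId: string): string => `booking:${confirmationId}`;

// Claim the slot first. SADD returns 0 when the member already exists, so
// the claim itself doubles as the double-booking check.
async function persistBooking(store: BookingStore, booking: Booking): Promise<boolean> {
  const added = await store.sadd(OCCUPIED_SLOTS_KEY, booking.slot);
  if (added === 0) return false;
  await store.set(bookingKey(booking.confirmationId), JSON.stringify(booking));
  return true;
}
```

Two KV commands per confirmed booking is the whole persistence cost in this sketch, which is consistent with KV staying out of the latency picture.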

Evaluation method

Latency is measured at two layers: per-stage server-side wall-clock (Router, Safety L2, Booking, TTS synthesis) and end-to-end browser timings (mic-stop → first filler audible, mic-stop → real reply audible). Both are recorded with sample size and measurement conditions so perceived vs actual numbers stay distinguishable.
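
One way to keep sample size and conditions attached to every reported number is to summarize each stage into a single record. The summarize function and the LatencySummary shape below are hypothetical; the median is computed by sorting and averaging the two middle values when the sample count is even.

```typescript
interface LatencySummary {
  stage: string;
  n: number;
  p50Ms: number;
  minMs: number;
  maxMs: number;
  conditions: string;
}

function summarize(stage: string, samplesMs: number[], conditions: string): LatencySummary {
  if (samplesMs.length === 0) throw new Error(`no samples for ${stage}`);

  // Sort a copy so the caller's array is left untouched.
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const p50Ms =
    sorted.length % 2 === 1 ? sorted[mid]! : (sorted[mid - 1]! + sorted[mid]!) / 2;

  return {
    stage,
    n: sorted.length,
    p50Ms,
    minMs: sorted[0]!,
    maxMs: sorted[sorted.length - 1]!,
    conditions,
  };
}
```

A row produced this way cannot be quoted without its n and its conditions string, which is what keeps a clean-runner server timing from being read as an end-to-end browser timing.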

Safety classification quality is tracked across canonical scenarios on two axes: L1 regex pattern coverage, and L2 escalation tier accuracy across escalate_emergency, escalate_human, and continue_booking_flow. Updates to the safety prompts are validated by an automated decision-behavior check that asserts against fixed utterances.
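
A decision-behavior check of this kind could be as small as the sketch below. The fixture utterances and the checkDecisions helper are hypothetical, and the classifier is passed in as a plain function; in practice it would be backed by recorded L2 model outputs for each utterance. Only the three tier names come from this document.

```typescript
type SafetyTier = "escalate_emergency" | "escalate_human" | "continue_booking_flow";

interface SafetyCase {
  utterance: string;
  expected: SafetyTier;
}

interface Mismatch {
  utterance: string;
  expected: SafetyTier;
  actual: SafetyTier;
}

interface CheckResult {
  total: number;
  correct: number;
  accuracy: number;
  mismatches: Mismatch[];
}

// Hypothetical fixed utterances, one per tier.
const FIXED_CASES: SafetyCase[] = [
  { utterance: "I have crushing chest pain right now", expected: "escalate_emergency" },
  { utterance: "I want to talk to a real person", expected: "escalate_human" },
  { utterance: "Can I book for Tuesday morning?", expected: "continue_booking_flow" },
];

// Run every fixed utterance through the classifier and collect disagreements.
function checkDecisions(
  cases: SafetyCase[],
  classify: (utterance: string) => SafetyTier,
): CheckResult {
  const mismatches: Mismatch[] = [];
  for (const { utterance, expected } of cases) {
    const actual = classify(utterance);
    if (actual !== expected) mismatches.push({ utterance, expected, actual });
  }
  const correct = cases.length - mismatches.length;
  const accuracy = cases.length === 0 ? 1 : correct / cases.length;
  return { total: cases.length, correct, accuracy, mismatches };
}
```

A prompt update would pass only if mismatches comes back empty; returning the mismatched utterances, not just a score, makes a regression easy to trace to a specific tier.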

Architectural decisions are documented with the dead-ends recorded alongside the chosen path, so the trade-offs survive a fresh reader.