How we moved from exploration to dependable systems
Introduction
In May 2025, we launched Zo, our AI phone assistant. Zo engages callers in natural conversation so patients can schedule care 24/7 without waiting on hold. For healthcare providers, it replaces rigid phone trees and basic call‑routing with a system that scales to unlimited inbound scheduling while honoring existing EHR integrations and practice rules. Routine requests are resolved autonomously; complex cases escalate to staff with context.
The launch marked a milestone in Zocdoc’s mission to power appointment scheduling across every channel. By building a reliable phone experience, we made scheduling more accessible for patients and more efficient for providers—without changing the underlying rules that keep care safe, reliable and compliant.
How We Got Here
Zo began as a hackathon experiment. During our yearly hackathon, our co‑founder Nick Ganju posed a deceptively simple question: What if patients could call any time and book an appointment without waiting on hold?
The first prototype was stitched together with early LLMs and speech tools; it proved the experience was possible and also exposed a hard truth: you can make model behavior more consistent, but you can’t guarantee it. In healthcare, “almost always right” isn’t good enough. Nick and a small team kept iterating until a durable pattern emerged: keep the logic in code and use the model for translation only. In other words, Zo would parse speech, classify intent, and extract entities, while deterministic services would enforce eligibility, policies, and scheduling against sources of truth.
That framing turned a prototype into a product. By early 2025, AI advances, including more consistent function‑calling, gave us the stability we needed to operate at scale. Today Zo engages patients naturally, schedules appointments around the clock, and integrates with existing EHR and practice rules—handling routine calls and escalating the rest. We won’t oversell it here; the work speaks for itself, and third-party benchmarking of AI phone agents has begun. Zo’s success metrics:
- Average time to book an appointment: ~4 minutes
- Average time to reschedule an appointment: ~2.5 minutes
- Scheduling calls resolved without human interaction: up to 70%
Our Operating Principles
Our prototype worked, but operating at scale surfaced everything that didn’t. Real callers bring accents, background noise, interruptions, and mid‑sentence pivots. Policies evolve, data drifts, and prompts that seem stable in a sandbox fray under more diverse conditions. We earned our principles the hard way: by debugging every edge case until a few simple rules held across them all.
What follows is how we now build new LLM-related features and functionality. As our confidence in working with LLMs and shipping AI features to production grew, we found more and more places where an LLM could have outsized impact, especially when we carried what we learned from Zo into other Zocdoc products. From review summaries to analyzing product feedback, we have applied these principles again and again.
1) AI as translation, not oracle
LLMs excel at turning unstructured input (text, speech) into structured output (enums, IDs, JSON). Once the data is in this structured format, we can operate in our familiar world of code. The opposite approach, giant prompts that own an entire logic tree, overwhelms attention windows, buries policy in prose, and leads to inconsistent results.
Regardless of what the LLM is being tasked with, it always boils down to the same template: classify, extract, and normalize. Translation is where LLMs shine; orchestration is where code must lead.

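To make the template concrete, here is a rough sketch of what the translation step might produce for a single caller utterance. The schema, field names, and IDs below are illustrative, not Zo’s actual contract; the point is that the model’s output is narrow, structured, and easy for ordinary code to act on.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import json

class Intent(Enum):
    BOOK = "book"
    RESCHEDULE = "reschedule"
    CANCEL = "cancel"

@dataclass
class SchedulingRequest:
    intent: Intent
    provider_id: Optional[str]     # normalized to an internal ID, not free text
    visit_reason: Optional[str]
    preferred_date: Optional[str]  # ISO 8601, normalized from phrases like "next Tuesday"

def parse_llm_output(raw: str) -> SchedulingRequest:
    """Validate the model's JSON against the schema before any code acts on it."""
    data = json.loads(raw)
    return SchedulingRequest(
        intent=Intent(data["intent"]),
        provider_id=data.get("provider_id"),
        visit_reason=data.get("visit_reason"),
        preferred_date=data.get("preferred_date"),
    )

# What the model might emit for:
# "Hi, I'd like to see Dr. Smith next Tuesday for a sore throat."
raw_output = (
    '{"intent": "book", "provider_id": "prov_123", '
    '"visit_reason": "sore throat", "preferred_date": "2025-06-03"}'
)
request = parse_llm_output(raw_output)
# From here on, deterministic services own the work (e.g., searching availability).
```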
The sketch above shows what this might look like. From the structured output, it is easy to perform a concrete action such as searching for availability. If this principle dictates where LLMs are best used, the next tells us the opposite: where they should be avoided.
2) Deterministic first
When a requirement assumes determinism, the solution must be deterministic. Models can propose; code decides. Letting an LLM own eligibility, scheduling, or policy invites drift and hard‑to‑reproduce bugs. Taking Zo as an example, after the model helps a caller narrow to a time slot, we deterministically verify availability against the source of truth before confirming. Model output is treated as another input to a deterministic process, never as the final decision maker. Deterministic checks convert probabilistic understanding into reliable outcomes, keeping every decision inside the deterministic envelope.

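A minimal sketch of that verification step, assuming a hypothetical availability client that fronts the scheduling source of truth:

```python
from datetime import datetime
from typing import List, Protocol

class AvailabilityClient(Protocol):
    """Hypothetical interface over the scheduling source of truth."""
    def get_open_slots(self, provider_id: str) -> List[datetime]: ...

def confirm_slot(proposed_start: datetime, provider_id: str,
                 availability: AvailabilityClient) -> bool:
    """The model may have suggested this slot, but only this check can confirm it."""
    return proposed_start in availability.get_open_slots(provider_id)

# Booking proceeds only when confirm_slot(...) returns True; otherwise we go
# back to the caller with fresh options rather than trusting the model's guess.
```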
This framework is preferable because it removes any need to evaluate whether the model can reliably pick a time slot that is actually available for the provider. With the right tests in place, we can be 100% certain that our system will not pick an unavailable slot.
3) Reliability over novelty
We ship the boring thing that works every time before the dazzling thing that sometimes does. Over‑ambitious prompts and “one‑shot magic” produce intermittent failures that erode trust and prolong incidents. When we wanted to generate summaries of patient reviews on provider profile pages, we broke the task into subtasks and divided the work among multiple agents. A single prompt and model could produce a summary, but it was inconsistent about surfacing the highlights of a provider’s reviews.
To give you an idea of our subtasks, one agent’s job was simply to emit a structured digest (topics, counts, confidence, etc.) that aggregates the attributes we want a summary to mention. Smaller model responsibilities plus deterministic post‑processing yield not just a more consistent result, but also a more supportable, easier‑to‑diagnose pipeline. Reliability compounds; novelty can come later once the spine is solid.
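As a rough illustration of such a digest (field names and thresholds here are hypothetical, not our production schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TopicDigest:
    topic: str          # e.g. "bedside manner"
    mention_count: int  # how many reviews mention the topic
    sentiment: str      # "positive", "negative", or "mixed"
    confidence: float   # the model's confidence in the aggregation, 0..1

def select_highlights(digests: List[TopicDigest],
                      min_mentions: int = 3,
                      min_confidence: float = 0.8) -> List[TopicDigest]:
    """Deterministic post-processing decides what the summary may mention."""
    kept = [d for d in digests
            if d.mention_count >= min_mentions and d.confidence >= min_confidence]
    return sorted(kept, key=lambda d: d.mention_count, reverse=True)
```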
4) Expose AI intentionally
We prefer real user and client value over buzzword tactics or products, and we surface AI only when it clearly clarifies or accelerates the task. “AI because we can” creates UI confusion, raises error costs, and complicates support. One example is our LLM categorization of product feedback into thematic buckets: what users feel is faster triage, not “AI”, because we can spot a large problem area in our product more quickly. By contrast, we do surface review summaries, because they help patients scan large amounts of text quickly and make the best choice for their care.
Our feedback categorizer has not just let us respond to complaints faster; it has also given us indisputable evidence of which problem areas, once addressed, will most benefit our providers and their patients. Individually, each classification is a tag; in aggregate, the tags form a weekly metric we use to prioritize and measure impact. Tracked across weeks and months, that metric is more than a point‑in‑time reference: it lets us measure the impact of our solutions and fixes.
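As a sketch of that rollup, assuming each feedback item has already been tagged with a single category (the grouping and category names here are illustrative):

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple

def weekly_counts(tagged_feedback: List[Tuple[date, str]]) -> Dict[Tuple[int, int], Counter]:
    """Roll (date, category) pairs up into per-ISO-week category counts."""
    by_week: Dict[Tuple[int, int], Counter] = {}
    for day, category in tagged_feedback:
        iso = day.isocalendar()
        by_week.setdefault((iso[0], iso[1]), Counter())[category] += 1
    return by_week

# Example: two scheduling complaints and one billing complaint in the same week.
counts = weekly_counts([
    (date(2025, 6, 2), "scheduling"),
    (date(2025, 6, 4), "scheduling"),
    (date(2025, 6, 5), "billing"),
])
```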
This philosophy has produced wins well beyond the feedback categorizer. Even within Zo, a product that features AI prominently, we have valuable back‑end uses for AI, such as call analyzers that identify issues like early call misroutes caused by noise or ambiguous phrasing. The best AI is often the foundation of the experience: reliability first, with AI spotlighted only where it is needed.
5) Guardrails are features
If we can’t measure it offline and catch regressions in CI, we haven’t really built it. Quiet regressions from model, version, or prompt changes should never first appear in production. Our CI gate requires 100% of eval tests to pass; assertions may use an LLM‑judge for semantic agreement with per‑test thresholds, but the suite must be green. For larger changes, we replay production calls and review deltas across hundreds of samples; the engineering team owns go/no‑go. During runs, models emit structured telemetry—counts, flags, confidence—so post‑hoc checks are deterministic. We debug with data, not vibes.
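As a minimal sketch of the shape of such a gate (the cases, harness, and names here are illustrative, not our actual CI setup):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    transcript: str
    expected_intent: str

CASES: List[EvalCase] = [
    EvalCase("I need to cancel Friday's appointment", "cancel"),
    EvalCase("Can I come in earlier next week instead?", "reschedule"),
]

def gate(classify: Callable[[str], str]) -> bool:
    """The CI gate: every eval case must pass, or the change does not ship."""
    return all(classify(case.transcript) == case.expected_intent for case in CASES)
```

Semantic assertions, such as checking a generated summary, swap the exact‑match comparison for an LLM‑judge score with an explicit per‑test threshold, but the go/no‑go decision stays binary.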
Enablement at Scale
Proving Zo worked was step one. Step two was making the way we built Zo the way Zocdoc builds. We paused product work and ran two company‑wide, multi‑day trainings—weeks in the making—led by the engineers who shipped Zo. The goal wasn’t to re‑explain the principles you just read; it was to turn them into muscle memory.
Each day combined concise, prepared lessons with hands-on workshops and labs. Engineers practiced decomposing a feature into LLM translation tasks and deterministic services, defining schema‑first I/O, wiring function calls, and standing up evaluation gates and telemetry before any launch. We shipped starter kits (schemas, function‑calling templates, CI/eval harnesses, and more) so people left with working code, not slides. Instructors stayed on the floor for pairing, prompt surgery, and code reviews, not only to support the engineers’ learning but also to foster creativity and novel ideas for using the tools they had just learned.
Training alone doesn’t bridge the gap to production, so we created an embed program. Experienced AI engineers joined each team’s first LLM project as temporary guides. They co‑designed contracts and state machines, helped set up offline evals and dashboards, modeled rollout discipline and incident playbooks, and paired on real PRs. Embeds stayed through the team’s first production cut and handed off with checklists, docs, and ownership once the same reliability bars we use on Zo were met.
The result is repeatable enablement: deliberate training, production‑grade scaffolding, and embedded expertise that seeds new teams. That’s how the lessons from Zo moved from a single product to an organizational habit—so dependability scales with adoption.
Processes and Standards: Where We’re Heading
We’re now turning the Zo playbook into a marked path teams can run on. When a team starts an LLM feature today, they should step onto a standard path: begin with a contract, not a prompt. The team defines a versioned schema—enums, normalized fields, and confidence—then plugs that contract into our provider‑agnostic LLM client. The client codifies our opinionated defaults behind one interface, so teams have a sensible starting point.
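As a sketch of what starting with a contract might look like, with a hypothetical schema and client call rather than our internal library:

```python
from dataclasses import dataclass
from enum import Enum
from typing import ClassVar

class FeedbackCategory(Enum):
    SCHEDULING = "scheduling"
    BILLING = "billing"
    PROVIDER_INFO = "provider_info"
    OTHER = "other"

@dataclass
class FeedbackTag:
    SCHEMA_VERSION: ClassVar[str] = "v1"  # versioned so downstream consumers can evolve safely
    category: FeedbackCategory            # an enum, never free text
    normalized_quote: str                 # cleaned excerpt supporting the tag
    confidence: float                     # 0..1, consumed by deterministic thresholds downstream

# With the contract defined first, the prompt's only job is to fill it, e.g.:
# tag = llm_client.complete(schema=FeedbackTag, text=raw_feedback)  # hypothetical client call
```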
We’re tuning our operating cadence to match. Our recent LLM‑themed hackathon moved ~20 projects into follow‑up; demos go to the whole engineering org, and leadership routes the best candidates to PMs for sizing and roadmap fit. Quarterly and annual planning now ask every team for an AI strategy—including the choice to not use AI when determinism alone wins. An AI Guild meets monthly to share patterns and anti‑patterns, with live Q&A so fixes and wins travel faster than lore.
Taken together, these standards turn experimentation into an engine: common clients and contracts make work portable, and a steady cadence turns good ideas into dependable features. That’s the path we’re laying down so every new AI use case ships with the same reliability as Zo.
Closing
Zo gave us more than a new channel; it gave us a disciplined way to build with GenAI. We’ll keep doing what earned our confidence: keep the logic in code and use the model for translation, ship only what we can evaluate offline and safely replay, and expose AI intentionally where it clearly helps. That’s how we’ll keep scaling dependable systems at Zocdoc.