This is part of an ongoing series about technical innovations in how we built Zo, Zocdoc’s in-house AI project and currently the highest-rated autonomous voice agent for healthcare providers.
When building voice agents, every millisecond of response time matters. Patients calling into a doctor’s office to schedule appointments expect the same immediacy they’d get from a human receptionist. But current voice agent systems are often sluggish due to one fundamental problem: they wait until they’re sure the caller is done speaking before running an inference.
Consider the following user statement: “I want to come in for an appointment tomorrow.” A typical voice agent will wait ~2 seconds to make sure that the user is done speaking. Once the agent is sure the user is done speaking, it runs an LLM inference to generate a response. If the LLM inference takes another 1 second, and the text-to-speech (TTS) renderer takes another 500ms, we’re looking at a response time of about 3.5 seconds (and this doesn’t even count ASR latency, network latency, cell network latency, etc., but you get the picture).
Some voice agent implementations try to solve this by exclusively using smaller LLMs so that inference time is lower. This just trades one problem for another: the voice agent might be snappier, but the smaller model often returns incorrect answers due to its limited reasoning capabilities. We need to be able to use adequately capable models while keeping responsiveness at an acceptable level.
What if, instead of waiting 2 seconds to see if the user is done speaking, we just run the inference after 100ms of silence? Then we’d be ready with a response 1.6 seconds after the user finished, instead of 3.5 seconds. The main drawback here is that we can’t be sure the user is done speaking after only 100ms of silence, so if the user actually continues with more words, we’d have to cancel the (now irrelevant) inference we started, and start a new inference.
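To make the arithmetic explicit (using the illustrative numbers from the example above, not measured latencies):

    # Illustrative latency budget from the example above, in seconds
    llm_inference = 1.0
    tts_render = 0.5

    traditional = 2.0 + llm_inference + tts_render  # wait 2s to confirm silence: 3.5s total
    greedy = 0.1 + llm_inference + tts_render       # start after 100ms of silence: 1.6s total

    print(traditional - greedy)  # ~1.9s faster, if the user really was done speaking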
This speculate-early approach is what we’ve been pursuing since we started building our voice agent in early 2023. The folks at Pipecat later dubbed it “greedy inference,” which is a great name for it, so let’s call it that.
Enter Greedy Inference
Let’s assume the user says: “I’d like to come in tomorrow … (500ms pause) afternoon … (400ms pause) before 3pm”. There are two pauses here, and the words after each pause alter the meaning of the sentence and the correct response.
Instead of waiting for definitive silence, we start inference processing at the first hint of a pause—as little as 100ms of silence. Here’s how a typical patient interaction flows:
Patient starts: "I want to come in tomorrow"
100ms pause → Thread 1 starts: process("I want to come in tomorrow")
Patient continues: "I want to come in tomorrow afternoon"
100ms pause → Thread 1 canceled, Thread 2 starts: process("I want to come in tomorrow afternoon")
Patient continues: "I want to come in tomorrow afternoon before 3pm"
1500ms pause → Thread 2 canceled, Thread 3 (FINAL): process("I want to come in tomorrow afternoon before 3pm")
The result? We begin processing the patient’s request up to 1900ms earlier than traditional approaches, while still capturing their complete intent.
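Here’s a minimal sketch of that cancel-and-restart loop, assuming the ASR layer calls us after each short pause with the partial utterance so far (GreedySpeculator, on_pause, on_final_silence, and run_llm are illustrative names, not our actual API; the thread-based manager we actually use appears later in this post):

    import asyncio

    async def run_llm(utterance: str) -> str:
        # Stand-in for the real LLM call; awaiting I/O makes it cancellable
        await asyncio.sleep(1.0)
        return f"response to: {utterance}"

    class GreedySpeculator:
        def __init__(self):
            self.task = None

        def on_pause(self, partial_utterance: str):
            # Fired after ~100ms of silence: throw away the stale inference and start over
            if self.task and not self.task.done():
                self.task.cancel()
            self.task = asyncio.create_task(run_llm(partial_utterance))

        async def on_final_silence(self) -> str:
            # Fired after the definitive silence threshold: the latest speculation wins
            return await self.task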
The Side Effects Problem (And How Event Sourcing Solves It)
The obvious challenge with greedy inference is handling partial, incorrect processing. If our first thread processes “I want to come in tomorrow” and the LLM books a morning appointment, what happens when the patient actually wanted “tomorrow afternoon before 3pm”?
Traditional stateful architectures make this approach nearly impossible—once you’ve mutated the session state based on partial information, rolling back becomes a nightmare of complex state management and potential data corruption.
But our event-sourced SessionMod architecture (explained in detail in this post) makes greedy inference elegantly manageable. Here’s why:
- Immutable session reconstruction: Each thread can safely call GetSession(session_mods) without affecting others (sketched below)
- Zero side effects: Threads generate SessionMods without applying them, preventing corruption
- Clean rollback: Canceled threads simply disappear; no cleanup needed
- Perfect isolation: Each thread operates on identical session state but cannot interfere with others
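For readers who haven’t seen that post, here’s a minimal sketch of the idea, assuming a SessionMod is just an immutable (key, value) record (the real implementation carries more structure):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SessionMod:
        key: str
        value: str

    def GetSession(session_mods: list[SessionMod]) -> dict:
        # Replay the immutable mod history into a fresh dict; every caller gets
        # its own copy, so no thread can corrupt another thread's view
        session = {}
        for mod in session_mods:
            session[mod.key] = mod.value
        return session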
This is a perfect example of how architectural decisions compound. Event sourcing was originally chosen for debugging and state management benefits, but it unexpectedly enabled a performance optimization that would be nearly impossible with traditional mutable state approaches.
Zero Side Effects During Processing
Each inference thread operates in complete isolation:
def greedy_inference_thread(utterance, current_session_mods, thread_id):
    # Reconstruct clean session state from the read-only mod history
    session = GetSession(current_session_mods)

    # Process the (potentially partial) utterance
    agent_result = scheduling_agent(utterance, session)

    # Return SessionMods without applying them anywhere
    return {
        'thread_id': thread_id,
        'session_mods': agent_result.session_mods,
        'response': agent_result.response,
        'status': 'partial'  # Only the final thread gets 'final'
    }
The key insight: no thread modifies the actual session state. Each thread works off its own local copy of the session, built from read-only SessionMods. Each thread generates SessionMods representing what would happen if this utterance were final, but those SessionMods only exist in the thread’s return value.
Only Final Results Persist
When a thread is marked as final (after the 2000ms definitive silence threshold), only then do its SessionMods get appended to the session history:
from datetime import datetime

def handle_final_utterance(winning_thread_result):
    # Only the final, complete utterance gets persisted
    transcript_entry = {
        "turn": current_turn,
        "timestamp": datetime.now(),
        "speaker": "patient",
        "utterance": winning_thread_result.utterance,
        "final": True
    }
    ai_response_entry = {
        "turn": current_turn + 1,
        "timestamp": datetime.now(),
        "speaker": "ai",
        "utterance": winning_thread_result.response,
        "agent_used": winning_thread_result.agent,
        "session_mods_created": winning_thread_result.session_mods,  # Only these persist!
        "final": True
    }

    # Append to transcript and session history
    transcript.extend([transcript_entry, ai_response_entry])
    session_mods_history.extend(winning_thread_result.session_mods)
This creates a beautiful isolation: dozens of inference threads can run, each processing partial utterances and generating potential SessionMods, but only the final thread’s results ever affect the persistent session state.
A Real-World Example
Let’s trace through a complex patient interaction to see how this works:
Turn 1 - Patient starts speaking: "I need to reschedule"
Thread 1A (100ms pause):
- Input: "I need to reschedule"
- Processing: Identifies intent, looks up existing appointments
- SessionMods: [SessionMod("Intent", "Reschedule"), SessionMod("ExistingAppt", "APT-123")]
- Status: PARTIAL (canceled when patient continues)
Turn 1 - Patient continues: "I need to reschedule my appointment for next week"
Thread 1B (100ms pause):
- Input: "I need to reschedule my appointment for next week"
- Processing: More specific intent, checks next week availability
- SessionMods: [SessionMod("Intent", "Reschedule"), SessionMod("ExistingAppt", "APT-123"), SessionMod("NewTimeframe", "next_week")]
- Status: PARTIAL (canceled when patient continues)
Turn 1 - Patient finishes: "I need to reschedule my appointment for next week Tuesday morning"
Thread 1C (1500ms pause - FINAL):
- Input: "I need to reschedule my appointment for next week Tuesday morning"
- Processing: Complete intent with specific timing
- SessionMods: [SessionMod("Intent", "Reschedule"), SessionMod("ExistingAppt", "APT-123"), SessionMod("NewTimeframe", "next_tuesday_am")]
- Status: FINAL ✓
Result: Only Thread 1C's SessionMods get persisted to the session history and transcript.
The Economics of Aggressive Inference
When we implemented greedy inference in June 2024, running 3-5 inference calls per patient utterance made each phone call quite expensive. A typical 10-minute call might trigger 50+ LLM inferences instead of the traditional 15-20. At the price point we had set, the business model didn’t work out at all. However, we made an educated bet that inference costs would plummet over the following months—and we were right. As of summer 2025, between falling token costs and our own optimizations, we’ve achieved a 10x reduction in inference costs, and now the math works.
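A rough back-of-the-envelope using only the relative numbers above (unit costs are normalized placeholders, not real prices):

    # Normalized cost per inference (2024 = 1.0); real prices vary by model and prompt size
    unit_2024 = 1.0
    unit_2025 = unit_2024 / 10      # ~10x cheaper after price drops and our optimizations

    traditional_calls = 17.5        # ~15-20 inferences per 10-minute call
    greedy_calls = 50               # ~50+ inferences with greedy inference

    print(traditional_calls * unit_2024)  # 2024 baseline: ~17.5 units per call
    print(greedy_calls * unit_2024)       # greedy at 2024 prices: ~50 units, roughly 3x the baseline
    print(greedy_calls * unit_2025)       # greedy at 2025 prices: ~5 units, cheaper than the old baseline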
Another key insight is that response time is often more valuable than computational cost in customer-facing applications. Patients who experience snappy, natural conversations are more likely to successfully complete appointments, while those frustrated by sluggish AI interactions may hang up and call competitors.
Implementation Insights
Thread Management
The trickiest part of greedy inference is managing multiple concurrent threads safely:
import threading

class GreedyInferenceManager:
    def __init__(self):
        self.active_threads = {}
        self.thread_counter = 0

    def start_inference(self, utterance, session_mods):
        # Cancel all previous speculative threads before starting a new one
        self.cancel_active_threads()

        # Start new thread; inference_worker (not shown) wraps the
        # greedy_inference_thread logic from earlier
        thread_id = self.thread_counter
        self.thread_counter += 1
        thread = threading.Thread(
            target=self.inference_worker,
            args=(utterance, session_mods, thread_id)
        )
        self.active_threads[thread_id] = {
            'thread': thread,
            'status': 'running',
            'utterance': utterance
        }
        thread.start()
        return thread_id

    def cancel_active_threads(self):
        # Python threads can't be force-killed, so we flag them as canceled;
        # workers check this flag and discard their results
        for info in self.active_threads.values():
            if info['status'] == 'running':
                info['status'] = 'canceled'

    def finalize_thread(self, thread_id):
        # Mark this thread as final and cancel others
        if thread_id in self.active_threads:
            self.active_threads[thread_id]['status'] = 'final'
            # Cancel any other running threads
            for tid, info in self.active_threads.items():
                if tid != thread_id and info['status'] == 'running':
                    info['status'] = 'canceled'
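Wiring this up to the silence thresholds from earlier might look roughly like this (the on_silence callback and its arguments are illustrative, not our actual interface):

    manager = GreedyInferenceManager()
    latest_thread_id = None

    def on_silence(duration_ms, partial_utterance, session_mods):
        global latest_thread_id
        if duration_ms >= 2000 and latest_thread_id is not None:
            # Definitive silence: the most recent speculative thread wins
            manager.finalize_thread(latest_thread_id)
        elif duration_ms >= 100:
            # First hint of a pause: speculate on the partial utterance so far
            latest_thread_id = manager.start_inference(partial_utterance, session_mods)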
The Bottom Line
Greedy inference transforms voice AI from a sluggish, turn-based interaction into something approaching natural conversation flow. If you’re building voice AI systems where response time matters—especially in customer service, healthcare, or other human-facing applications—consider whether greedy inference might unlock similar performance gains in your architecture. The key is ensuring your state management approach can handle speculative processing without side effects.
As LLM inference costs continue to drop, techniques like this become increasingly attractive. What seemed economically questionable in 2024 is now a clear competitive advantage. Sometimes the best architectural decisions are the ones that bet on the future becoming more efficient rather than optimizing for today’s constraints.