
Event-Sourced Sessions: A Game-Changer for Voice Agents

Recently, we were thrilled to see Zo, our AI Phone Assistant, listed at the top of industry benchmarks published by a third party. In a blog series we're calling "Voice Agent Design Patterns", we'd like to share some of the lessons we've learned over the past few years developing our voice agent from scratch.

Building robust AI voice agents that can handle complex interactions presents unique architectural challenges. For example, in a medical setting, how do you maintain conversation state across multiple specialized agents handling scheduling, insurance verification, and patient information? How do you handle errors gracefully when dealing with sensitive medical appointments? How do you replay conversations at specific points in order to debug inference or algorithm problems?

After wrestling with these problems in our AI voice agent system for calls to and from doctor’s offices, we discovered that borrowing a concept from distributed systems—event sourcing—provided an elegant solution that transformed how we think about patient conversation state management.

The Traditional Approach and Its Limitations

Most conversational AI systems for medical offices store session state as a single, mutable object that gets updated as the patient call progresses. While simple to implement, this approach creates several pain points:

  • Debugging nightmares: When something goes wrong with a caller interaction, you only see the final state, not how you got there.
  • No rollback capability: If an agent makes a mistake during the call, you can’t easily undo it.
  • Concurrency issues: Multiple agents trying to update the same patient session object lead to race conditions. This isn't normally a problem, but it becomes one when you're attempting "greedy inference" (explained below).
  • Limited observability: Understanding the patient interaction flow requires extensive logging.

Enter Event-Sourced Sessions

Instead of storing session state directly, we shifted to an event-sourced architecture where the session is constructed from a sequence of SessionMod objects. Each time an agent needs to update the patient interaction state, it returns a SessionMod rather than mutating the existing state.

The SessionMod Pattern

The core of our system revolves around two simple concepts:

SessionMod: A lightweight object that represents a single modification to the session state

Session: The current state object, reconstructed by applying all SessionMods in sequence
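The snippets throughout this post assume these two classes exist. As a rough sketch (the field names and dict-backed state are illustrative, not our production schema), they can be as small as:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class SessionMod:
    """One immutable modification to session state."""
    key: str
    value: Any

@dataclass
class Session:
    """Current state, reconstructed by applying SessionMods in order."""
    fields: dict = field(default_factory=dict)

    def apply(self, mod: SessionMod) -> None:
        # A later mod for the same key overwrites an earlier one
        self.fields[mod.key] = mod.value
```

Making SessionMod frozen is the key design choice: immutability is what makes the replay and concurrency properties described below safe.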

Here’s the elegant simplicity of it:

# Agent returns modifications instead of mutating state
def greeting_agent(call_data):
    # Agent logic here...
    return SessionMod("PatientIntent", "ScheduleAppointment")

def info_collection_agent(call_data):
    # Agent logic here...
    return [
        SessionMod("PatientName", "Jane Doe"),
        SessionMod("PatientDOB", "1985-03-15"),
        SessionMod("InsuranceProvider", "Blue Cross")
    ]

def scheduling_agent(call_data):
    # Agent logic here...
    return [
        SessionMod("AppointmentDate", "2024-03-20"),
        SessionMod("AppointmentTime", "2:30 PM"),
        SessionMod("ConfirmationNumber", "APT-7854")
    ]

The GetSession() Magic

At the beginning of each agent call, we reconstruct the current session state with a simple GetSession() method:

def GetSession(session_mods):
    session = Session()  # Start with empty session
    
    for mod in session_mods:
        session.apply(mod)  # Apply each modification in order
    
    return session

# In each agent:
def appointment_confirmation_agent(call_data, session_mods):
    current_session = GetSession(session_mods)
    
    # Now we have the complete current state
    patient_name = current_session.patient_name
    appointment_time = current_session.appointment_time
    
    # Agent can make decisions based on current state
    # and return new SessionMods
    return SessionMod("ConfirmationSent", True)

This creates a beautiful flow:

  1. Agent starts: Calls GetSession() to get current state from all previous SessionMods
  2. Agent processes: Makes decisions based on complete session context
  3. Agent finishes: Returns new SessionMod(s) representing what changed
  4. Next agent starts: Calls GetSession() which now includes the previous agent’s modifications

Complete Call Traceability with Structured Transcripts

Here’s where the system gets really powerful: each phone call generates a structured JSON transcript that captures not just what was said, but the complete execution context of every agent interaction.

Each transcript entry looks like this:

{
  "turn": 14,
  "timestamp": "2024-03-20T14:32:15Z",
  "speaker": "ai",
  "utterance": "I've found an available appointment with Dr. Smith on March 22nd at 2:30 PM. Would that work for you?",
  "agent_used": "scheduling_agent",
  "session_mods_created": [
    {
      "key": "AppointmentOption1", 
      "value": "2024-03-22 14:30"
    },
    {
      "key": "ProviderConfirmed", 
      "value": "Dr. Smith"
    }
  ]
}

This means every single agent execution is completely traceable:

  • What the agent said (utterance)
  • Which agent ran (agent_used)
  • What it changed (session_mods_created)
  • What state it started with (reconstructed by replaying all previous SessionMods)
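Recording such an entry after every agent turn can be as simple as the following sketch (the `record_turn` helper is hypothetical; the field names mirror the JSON above):

```python
from collections import namedtuple
from datetime import datetime, timezone

SessionMod = namedtuple("SessionMod", ["key", "value"])

def record_turn(transcript, turn, speaker, utterance, agent_used, session_mods):
    """Append one structured transcript entry capturing the turn's full context."""
    transcript.append({
        "turn": turn,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "speaker": speaker,
        "utterance": utterance,
        "agent_used": agent_used,
        "session_mods_created": [
            {"key": m.key, "value": m.value} for m in session_mods
        ],
    })

transcript = []
record_turn(
    transcript, 14, "ai",
    "I've found an available appointment with Dr. Smith on March 22nd at 2:30 PM.",
    "scheduling_agent",
    [SessionMod("ProviderConfirmed", "Dr. Smith")],
)
```

Because the SessionMods are serialized inline with the utterance, the transcript alone is enough to rebuild the session state at any turn.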

Perfect Replay with Replay(turn_number)

With this structure, debugging becomes trivial. Want to understand why the agent offered the wrong appointment at turn 14? Just run:

def Replay(call_id, turn_number):
    transcript = load_transcript(call_id)
    turn_data = transcript[turn_number]
    
    # Reconstruct the session state by replaying all previous SessionMods
    previous_session_mods = []
    for i in range(turn_number):
        if transcript[i].get("session_mods_created"):
            previous_session_mods.extend(transcript[i]["session_mods_created"])
    
    session_state = GetSession(previous_session_mods)
    
    # Run the exact same agent with the exact same input
    # (the recorded turn data stands in for the original call_data)
    agent_function = get_agent(turn_data["agent_used"])
    result = agent_function(turn_data, previous_session_mods)
    
    print(f"Agent: {turn_data['agent_used']}")
    print(f"Reconstructed input state: {session_state}")
    print(f"Original output: {turn_data['utterance']}")
    print(f"Original SessionMods: {turn_data['session_mods_created']}")
    print(f"Replay output: {result}")
    return result

This gives you perfect reproducibility for any moment in any call. No more guessing what went wrong. You can step through the exact execution context that led to any issue by reconstructing the session state from the SessionMod history.

A Real Call Flow Example

Let’s trace through how this works in practice, showing both the SessionMod flow and the transcript capture:

Call starts: session_mods = []

Turn 1 - Patient: "Hi, I need to reschedule my appointment"
Turn 2 - AI (greeting_agent):
  - current_session = GetSession([]) → empty session
  - Processes greeting and intent detection
  - Returns: SessionMod("PatientIntent", "RescheduleAppointment")
  - Transcript records: agent="greeting_agent", session_mods_created=[SessionMod("PatientIntent", "RescheduleAppointment")]

Turn 3 - Patient: "My name is Jane Doe, DOB March 15th 1985"  
Turn 4 - AI (patient_lookup_agent):
  - current_session = GetSession([SessionMod("PatientIntent", "RescheduleAppointment")])
  - Finds patient in system
  - Returns: SessionMod("PatientID", "12345")
  - Transcript records: agent="patient_lookup_agent", session_mods_created=[SessionMod("PatientID", "12345")]

Turn 5 - Patient: "I need to see Dr. Smith instead"
Turn 6 - AI (scheduling_agent):
  - current_session = GetSession([
      SessionMod("PatientIntent", "RescheduleAppointment"),
      SessionMod("PatientID", "12345")
    ])
  - Checks Dr. Smith's availability
  - Returns: SessionMod("NewProviderRequested", "Dr. Smith")
  - Transcript records: agent="scheduling_agent", session_mods_created=[SessionMod("NewProviderRequested", "Dr. Smith")]

Now if turn 6 has an issue, you can Replay(call_id, 6) and the system will reconstruct the exact session state by replaying SessionMods from turns 2 and 4, then re-run the scheduling_agent with that precise context.

The Superpowers This Unlocked

1. Perfect Replayability

Every patient interaction becomes perfectly reproducible, down to the individual turn level. Need to debug why the scheduling agent offered the wrong appointment slots at turn 14? Just run Replay(call_id, 14):

# Replay the exact execution context from any turn
result = Replay("call_abc123", 14)

# This reconstructs:
# - The exact session state the agent started with
# - The exact agent that was used  
# - The exact input and output
# - The exact SessionMods that were created

The structured transcript gives you:

  • Turn-level debugging: Jump directly to any problematic moment in the call and step into the code with a debugger.
  • Agent-specific testing: Test individual agents against real conversation contexts.
  • Regression testing: Ensure code changes don’t break existing call patterns.

2. True Multithreading (Greedy Inference)

With immutable SessionMods, multiple agents can safely call GetSession(session_mods) without locks or synchronization. Only appending new SessionMods needs coordination, making the system naturally thread-safe and performant.

Why is this useful? It enables "greedy inference". Normally, a voice agent's infrastructure applies heuristics to decide when a caller has finished speaking. This involves waiting for some period of silence before processing the user's utterance, which adds lag: you have to wait 1-2 seconds before inference even starts.

With greedy inference, you keep spawning inferences while the user is speaking (after every word, if you like), each time assuming that what the user has said so far is their complete utterance. If the user continues speaking, you cancel or ignore the prior inference and start a new one. Once you conclude that the user is done speaking (by waiting out the appropriate silence period), you use the result of the last inference you ran. This eliminates the lag the default approach incurs by waiting for the silence period before starting inference.
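A minimal asyncio sketch of the mechanic (the inference call and the timings are stand-ins, not our real pipeline):

```python
import asyncio

async def fake_inference(partial_utterance: str) -> str:
    """Stand-in for an LLM call; real inference takes far longer."""
    await asyncio.sleep(0.05)
    return f"response to: {partial_utterance!r}"

async def greedy_listen(partial_utterances, word_gap=0.01):
    """Speculatively re-run inference on every partial transcript,
    cancelling the previous attempt whenever the user keeps talking."""
    task = None
    for partial in partial_utterances:
        if task and not task.done():
            task.cancel()  # user kept speaking; discard the speculation
        task = asyncio.create_task(fake_inference(partial))
        await asyncio.sleep(word_gap)  # simulate the next word arriving
    # Silence detected: the last speculative inference is the one we keep
    return await task

result = asyncio.run(greedy_listen(
    ["I need", "I need to reschedule", "I need to reschedule my appointment"]
))
```

If the final speculative inference finished before the silence period elapsed, the response is ready the instant end-of-speech is confirmed.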

More on this in the next blog post.

3. Rewind-ability

Made a mistake? Just remove the last few SessionMods and call GetSession() with the truncated list. Sometimes you've completely finished processing an utterance, but the user says something more that was meant to be part of it. You can simply "rewind" the effects of the previous inference and run a new one:

# Oops, agent made an error - just remove the problematic SessionMods
original_mods = [
    SessionMod("PatientIntent", "ScheduleAppointment"),
    SessionMod("PatientName", "Jane Doe"),
    SessionMod("ProviderRequested", "Dr. Smith"),
    SessionMod("AppointmentBooked", "2024-03-20 2:00 PM")  # Wrong time!
]

# Remove the problematic modification
corrected_mods = original_mods[:-1]
clean_session = GetSession(corrected_mods)

# Now try again with the correct agent

4. Effortless Context Summaries in the Agent’s System Prompt

Perhaps the most elegant benefit: generating context for the currently running agent is just a matter of calling GetSession(session_mods) and including relevant fields in the system prompt. The session object already contains exactly the information that agent needs, distilled from the entire patient interaction history.

def create_agent_context(session_mods):
    session = GetSession(session_mods)
    
    return f"""
    System: You are a medical appointment scheduling agent. Here's the patient context:
    - Patient intent: {session.patient_intent}
    - Patient info: {session.patient_name}, DOB: {session.patient_dob}
    - Insurance: {session.insurance_provider}
    - Provider requested: {session.provider_requested}
    - Available slots checked: {session.slots_checked}
    - Current status: {session.booking_status}
    """

No complex context management; just reconstruct the session and use what you need.

5. Minimal Context Windows for Maximum Consistency

Since the session summary provides comprehensive context, each agent only needs to see transcript entries from its own recent turns rather than the entire conversation history.

def get_agent_input(agent_name, transcript, session_mods, recent_turns=3):
    # Get comprehensive state summary
    context = create_agent_context(session_mods)
    
    # Get only recent transcript entries for this agent
    recent_transcript = [
        entry for entry in transcript[-recent_turns:]
        if entry.get("agent_used") == agent_name
    ]
    
    return {
        "system_prompt": context,  # Complete state summary
        "recent_history": recent_transcript  # Minimal recent context
    }

This approach delivers multiple benefits:

  • Reduced token usage: Agents see comprehensive state but minimal history.
  • Consistent reasoning: No unpredictable effects from long conversation histories; agent decisions are based on clean state rather than conversational nuance.
  • Faster inference: Smaller context windows mean faster LLM responses.

Implementation Insights

A few key learnings from building this system:

SessionMod Design Matters: We found it crucial to make SessionMods semantic and domain-specific rather than generic. SessionMod("PatientUrgencyLevel", "High") is more useful than generic key-value updates.

SessionMod Schema Evolution: SessionMods need to be versioned. As your voice agent capabilities evolve, you’ll want to add new SessionMod types while maintaining backward compatibility with existing calls.
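One way to sketch this (the version field and the v1-to-v2 migration rule below are purely hypothetical, for illustration):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class SessionMod:
    key: str
    value: Any
    version: int = 1  # schema version this mod was written under

def upgrade(mod: SessionMod) -> SessionMod:
    """Migrate older SessionMods before applying them, so GetSession()
    can still replay calls recorded under earlier schemas."""
    if mod.version == 1 and mod.key == "ProviderRequested":
        # Hypothetical rule: v1 stored bare surnames, v2 stores full titles
        return SessionMod(mod.key, f"Dr. {mod.value}", version=2)
    return mod
```

Running every mod through `upgrade()` inside GetSession() keeps old transcripts replayable without rewriting the stored events themselves.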

Ordering and Consistency: Since the transcript is naturally ordered chronologically, it's easy to ensure SessionMods are applied in the correct order in GetSession(), even across distributed components. Just make sure GetSession() applies them in that chronological order.

The Bottom Line

Event sourcing with SessionMods and structured transcripts creates a robust, observable, and flexible system. The ability to replay any turn of any patient interaction, safely handle concurrency, gracefully recover from scheduling errors, and effortlessly maintain patient context has made our agents more reliable and our development process dramatically more pleasant.

The structured transcript approach means every single agent decision is traceable and reproducible. When something goes wrong, you don’t just know what happened – you can replay the exact context that caused it and test your fix against that precise scenario.

If you’re building conversational AI systems, especially ones involving multiple specialized agents for scheduling, insurance verification, and patient communication, consider whether event sourcing with complete execution traceability might solve some of your thorniest state management and debugging challenges. The upfront complexity is not that high, and it pays dividends in system reliability, compliance capabilities, and developer productivity.

The best architectures often emerge when we stop trying to solve problems in isolation and start borrowing proven patterns from other domains. Event sourcing worked for banking systems and distributed architectures—it turns out it works pretty well for voice agent conversations too.

We Are Hiring for our Zo AI Phone Assistant Team!

Does this kind of software development appeal to you?  At Zocdoc, we are at the bleeding edge of voice agent development. Come help us build the next version of the world’s best healthcare voice agent. Check out the job listings here!