This is the first post in an open, experimental series about how we’re trying to tame schemas and metadata at Zocdoc. We’re sharing as we go, exploring what works, what doesn’t, and what we change our minds about along the way.
If you’ve ever been on call for a broken pipeline at 2 a.m., you already know this story. A producer adds a field, a change data capture (CDC) job changes how a value is serialized, and a downstream model quietly drifts. The catalog still shows yesterday’s truth, the dashboard throws today’s error, and the Slack thread becomes tomorrow’s archaeology.
Moments like that remind us why consistency matters. With so many tools defining schemas in their own way, it’s easy for good systems to fall out of sync.
Our motivation is clear. Data organizations need a centralized, system-agnostic source of truth for schemas and key governance metadata. Today's ecosystem is fragmented: tools like Unity Catalog, Snowflake schemas, schema registries, and others each manage or replicate schemas in their own way, creating duplication and inconsistency. This fragmentation makes it difficult to trace the evolution of a dataset or enforce consistent governance policies. The result is a lot of partially overlapping sources of truth, each valuable but none definitive.

The gap above isn’t caused by any one system doing its job poorly; it’s that every system has its own view of truth. The schema a data engineer sees in Snowflake may differ subtly from what Dagster expects or what a registry believes is valid. Over time, these differences grow into silos. What we need is something to sit above these tools and stitch their perspectives together, allowing them to reinforce one another rather than diverge.
The goal isn’t to replace those tools or diminish their native schema practices. They each play an important role. Instead, we want to enhance them with higher-level continuity and reuse, creating a single layer of shared truth that ties their individual perspectives together. This would allow every team and system to draw from the same source of truth without losing the benefits of their specialized environments.
We imagine this centralized layer as a place that can reliably answer questions like:
- What the schema is (and how it has evolved)
- Who owns it (and how to reach them)
- What it means (and how risky a change might be)
- Whether a change is compatible before it lands and breaks things
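To make that last question concrete, here's a minimal sketch of a compatibility check in Python. The schemas are modeled as field-name-to-type mappings, and the function name and rules are our illustration of the idea, not Codex's actual API:

```python
def check_compatibility(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Flag schema changes that could break existing consumers.

    Removing a field or changing its type is breaking for readers of the
    old schema; adding a new field generally is not.
    """
    problems = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new[name]})")
    return problems


old = {"patient_id": "string", "visit_date": "date"}
new = {"patient_id": "string", "visit_date": "timestamp", "clinic_id": "string"}
print(check_compatibility(old, new))
```

Running a check like this before a change lands is the difference between a review comment and a 2 a.m. page.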
We’re confident in this need because we’ve seen its absence cause repeated pain across teams and systems. Still, we recognize that there’s nuance in how best to solve it. We’re approaching this effort with conviction, but also with humility and a willingness to adjust as we learn what truly works in practice.
Schemas and metadata, explained simply
Let’s start with some basics. When we talk about schema, we simply mean the structure of the data: its fields, types, and rules. A schema defines what a dataset or event looks like and how it can change over time.
Metadata is the extra information that gives that schema context. It’s who owns the data, what domain it belongs to, what each field means, whether it contains sensitive information, and how long we need to keep it. It also includes simple expectations about how that data should evolve, like whether new fields can be added safely or if certain changes would break downstream systems.
We’re keeping this definition focused on the essentials, the parts that most often cause broken pipelines or confusion when they’re missing or out of sync.
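Here's a toy sketch of those two ideas side by side in Python. The field names and structure are illustrative, a minimal model of "schema plus the metadata that gives it context," not Codex's actual data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    name: str
    type: str                    # e.g. "string", "int", "timestamp"
    required: bool = True
    description: str = ""        # what the field means
    is_sensitive: bool = False   # governance metadata travels with the field


@dataclass
class Schema:
    name: str
    version: int
    owner: str                            # who is accountable, and how to reach them
    domain: str                           # business domain the dataset belongs to
    retention_days: Optional[int] = None  # how long we need to keep the data
    fields: List[Field] = field(default_factory=list)


appointments = Schema(
    name="appointments",
    version=3,
    owner="scheduling-team@example.com",  # hypothetical owner
    domain="scheduling",
    retention_days=365,
    fields=[
        Field("appointment_id", "string", description="unique booking id"),
        Field("patient_email", "string", is_sensitive=True),
    ],
)
```

Notice that ownership, sensitivity, and retention live right next to the structure they describe, rather than in a separate system that can drift out of date.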
Why a single source of truth matters
A unified answer helps when ingestion is noisy. Data attributes morph, events evolve, and small changes quietly ripple through a fleet of consumers. A central view lets you catch drift quickly and talk about it before it causes chaos.
Ownership is another challenge. When a table or event misbehaves, someone should be accountable, without spelunking through git history, wikis, or dashboards. Schema history often lives in multiple places: a registry for events, a warehouse for tables, and code for models. Stitching those together by hand is slow and error-prone.
Governance metadata adds another layer. Tags, glossary terms, and policies are most useful when they travel with the schema through its lifecycle, not when they’re kept in a separate system that’s always slightly out of date.
What exists today (and why we’re not trying to replace it)
There’s a lot to like in the current ecosystem. Catalogs and governance platforms help people discover datasets, see lineage, and manage ownership. Schema registries validate event payloads and enforce compatibility for individual topics. Warehouse and lakehouse controls enforce schemas and permissions inside their platforms. Orchestration and transformation tools describe models and contracts within their own graphs. Observability tools watch for drift after the fact.
Each tool plays an important role, but each is limited to its domain. What we haven’t found, at least not in a way that fits our needs, is a single, system-agnostic place where schema, ownership, and meaning come together and can be used to gate changes across CDC, events, and analytical tables.
If you’ve found that tool, or you’re building it, we’d love to hear from you.
Our experiment: Codex (no, not that Codex¹)
Codex is our attempt at Zocdoc to build a structured repository, a coordination layer that defines the shared rules of engagement for schema and metadata across systems. Tools like Dagster, Snowflake, and Databricks continue to operate independently, but Codex ensures they do so in harmony. It’s more conductor than observer, setting the tempo for how data should move and evolve.

Sitting above the broader data stack, Codex connects the dots between the many systems that define, process, and serve data. The goal isn’t to duplicate those tools but to bring their perspectives together, add the governance context we care about, and make that combination actionable.
We’re starting with ingestion (CDC and events) because that’s where schema tends to move fastest and break the loudest.
Our initial goals are simple. We want to unify the observed schema (from CDC, events, or tables) with the declared intent (what we say is allowed), track changes over time, set basic compatibility expectations, and help humans talk to each other when a change looks risky.
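The "unify observed with declared" goal can be sketched as a simple diff. Here, the observed schema is the set of fields actually seen in live payloads, and the declared intent is the set a contract allows; the function name and output shape are our illustration, not Codex's interface:

```python
def diff_observed_vs_declared(observed: set, declared: set) -> dict:
    """Compare fields seen in live payloads against fields the contract allows."""
    return {
        "undeclared": observed - declared,  # appeared in data, but nobody promised it
        "unobserved": declared - observed,  # promised in the contract, never seen
    }


result = diff_observed_vs_declared(
    observed={"patient_id", "visit_date", "debug_flag"},
    declared={"patient_id", "visit_date", "clinic_id"},
)
print(result)
```

An `undeclared` field is a conversation starter with the producer; an `unobserved` one may mean the contract is stale. Either way, the diff tells humans where to look.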
That’s the hypothesis. We’ll test it, learn from it, and share where it falls short.
Problems we’re working to prevent
The “latest schema” should be documented, accessible, and compliant. Data without clear ownership or meaning can quickly become dangerous, not just inconvenient. A small, unnoticed producer change can ripple through CDC jobs, corrupt downstream tables, and trigger major incidents. Documentation that drifts away from reality doesn’t just create confusion, it can erode trust and slow response times when issues occur.
Codex’s role is to surface these risks as they emerge, using consistent schema validation, compatibility checks, and metadata comparison across systems, so teams can spot deviations early and respond before they escalate into major incidents.
Guiding principles (so we don’t build the wrong thing)
We’re choosing sustained iteration over static implementation. Investigate, measure, adjust. Fewer hot takes, more small bets. Codex should integrate with other tools, not replace them. It should make life easier for the engineer on call, with clear diffs, clear ownership, and clear next steps. Schemas should act as contracts, not just documentation. And we’ll focus on a small, durable set of ideas instead of chasing feature sprawl.
Initial scope: ingestion (CDC and events)
We’re starting with the noisiest corner. Attributes in a DynamoDB table can appear, disappear, or change shape as code evolves. We want those changes to be visible fast, linked to ownership and meaning, and evaluated for safety under existing expectations. Event registries already enforce compatibility per topic, but we want to connect that enforcement to downstream consumers and governance context so everyone reads from the same page.
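Because DynamoDB attributes are optional per item, "observing" a schema means sampling items and recording what actually shows up. A rough sketch of that idea, with made-up sample data:

```python
def observe_schema(items):
    """Infer an observed schema from a sample of DynamoDB-style items.

    Attributes are optional per item, so we record every Python type seen
    for each attribute name; more than one type for the same name is a
    drift signal worth surfacing.
    """
    seen = {}
    for item in items:
        for attr, value in item.items():
            seen.setdefault(attr, set()).add(type(value).__name__)
    return seen


sample = [
    {"patient_id": "p1", "copay": 20},
    {"patient_id": "p2", "copay": "20.00"},  # a producer changed serialization
]
print(observe_schema(sample))
```

In this sample, `copay` maps to two types, exactly the kind of quiet serialization change from the 2 a.m. story that we want surfaced before it corrupts a downstream table.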
If we can reduce surprises, shorten the time to understand issues, and prompt better conversations before changes land, that’s success. If not, we’ll adjust or stop; that’s part of the process.
What we’re not promising (yet)
We’re not building a perfect global lineage graph or automatic classification system. We’re not creating one registry to rule them all. And we’re not enforcing rules across every system from day one. We’re starting with visibility and shared context, then adding constraints only where they prove helpful.
Open questions we’ll explore in this series
How strict should compatibility rules be? Where should truth live when declared contracts and observed payloads disagree? How do we keep glossary terms close to the teams that know the data best without creating metadata sprawl? What’s the lightest review process that still prevents real incidents? And how do we scale without centralizing everything?
We expect our answers to evolve. That’s a feature, not a flaw.
How we’ll keep ourselves honest
We don’t have results yet, but we’ll measure and share as we go. We’ll track how long it takes to detect and respond to schema inconsistencies, how many schema changes are caught before deployment, how quickly contracts are updated when changes are intentional, and how many surprise incidents are caused by schema drift.
If those numbers don’t improve, we’ll say so and share what we learned.
An invitation
If you’re a data engineer who’s built something similar, or decided not to, we’d love to hear from you. What did you try? What backfired? Where did you draw the line between helpful guardrails and process for process’s sake?
Drop us a note, or tell us what you’d like us to test as we build.
In the next post, we’ll share a broad taxonomy of schema changes, additions, removals, renames, type changes, and explore why they matter differently for events, CDC, and analytical tables. It’ll be less about implementation details and more about developing a shared vocabulary to talk about risk.
Thanks for reading and being part of our journey. If this resonates (or doesn’t), tell us why. The only way this experiment works is if we stay curious, skeptical, and transparent about what we find.
1. Zocdoc Codex is not affiliated with OpenAI Codex. ↩︎