Over the last two years, the Zocdoc Engineering team has been transitioning our Monolithic application into a microservice oriented architecture (MSA). As our engineering team has grown, we’ve faced the challenge of dealing with our Monolithic application from both a development and an operational perspective. The ever-increasing complexity of the codebase was hindering the velocity of our engineering team, increasing the time it took us to develop and launch features for the business and our patients. We’ve learned a few lessons since we began our journey from Monolith to MSA in 2014 and are applying these learnings to the current generation of our architecture. In this post, we’ll explore our quest to break up the Monolith, some of the mistakes we made along the way and how we have changed our approach as a result.
Like most startups, Zocdoc started out as a simple MVC web application. All the webservers ran the same version of code and could process any request for our public endpoints. We’ll refer to this single codebase + application as our Monolith.
A simplified representation of our architecture:
Over time we added things like database replication and a cron system, but the architecture didn’t fundamentally change. This approach served us well for a long time, enabling our team to stay focused on building a great product for both doctors and patients.
Our engineering team and codebase have grown substantially in the nine years since those humble beginnings, and signs of stress began to show a few years ago. Compared to when we were a smaller team, we were finding it increasingly difficult to ship features into production. Frequent complaints from our engineers included compilation time, continuous integration (CI) runtime and application startup time during local development. The complexity of the codebase was also rising from years of evolutionary feature development, making it difficult to make sense of any particular section of code, and leading to more bugs and decreased velocity. Initially we tried to make developing inside the Monolith better, but these efforts were high cost and not sustainable. We believed that moving to a service oriented architecture would help address most, if not all, of these problems.
First Attempt: Services in our Datacenter
Tired of small incremental improvements for the Monolith, we embarked on a larger effort to transition to a microservice oriented architecture. The initial high-level goal was to reach a point where our engineers could write their new features in a service rather than in the Monolith. Once we achieved that, we would slowly pull our existing code out of the Monolith into smaller services.
Some of our goals during this phase were:
- Minimize changes to our infrastructure and development experience
- Enable new feature development outside the Monolith as quickly as possible
Our updated architecture diagram:
The only additions here are a series of services which are fronted by an internal load balancer.
We were able to keep the same language (C#) and operating system (Windows) which allowed us to reuse some of our existing libraries and tooling. Developers were able to get new services running easily and, most importantly, they were able to build some great products outside the Monolith. However, after the initial euphoria wore off, we started noticing some new problems this architecture introduced.
Coupling through the database
Most new products and features we work on are built on top of the existing core Zocdoc datasets, e.g. doctors, patients and availability. Access to these datasets is a hard requirement for most new services in order to provide value to the business. We allowed all services access to the Monolithic database; this allowed teams to get started quickly, but inadvertently introduced a coupling point between all our services, the Monolith and a particular version of our database schema. Teams would later have difficulty making schema changes, migrating data to new tables and correctly ordering the deployment of schema, service and Monolith changes. These are still problems in a service oriented architecture, but they can be more easily dealt with if there are no dependencies on a specific data representation within a data store sitting outside an individual service.
The service framework we were using made it incredibly easy to define the contracts between our services. Each contract was a pair of request and response data transfer objects (DTOs) that were automatically serialized and deserialized by the sending and receiving services. This ease of use, and our lack of experience, led us to not place enough emphasis on creating stable, well designed contracts. Fields were frequently added to and removed from these DTOs, often leaking implementation details to their consumers and increasing coupling. This simple request/response model also made API versioning difficult, so we forewent that entirely and always ran the latest version of each service in production. Normally we would consider this a good thing, achieving continuous delivery, but the frequent rate of change on the contracts tended to manifest itself as either build failures in CI or issues in production.
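To make the contract problem concrete, here is a minimal sketch in Python (chosen for brevity; the services described here were written in C#). The `DoctorResponse` DTO, its fields and the `parse_doctor` helper are hypothetical and not taken from our codebase. It shows a consumer that tolerates fields being added to a response but fails loudly when a required field is removed, which is the kind of breaking change that unversioned contracts let through.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DoctorResponse:
    """Hypothetical response DTO listing only what consumers rely on."""
    doctor_id: int
    name: str
    specialty: str = "unknown"  # optional field with a safe default


def parse_doctor(payload: str) -> DoctorResponse:
    """Deserialize JSON, ignoring fields this consumer does not know about."""
    raw = json.loads(payload)
    known = {f.name for f in fields(DoctorResponse)}
    try:
        return DoctorResponse(**{k: v for k, v in raw.items() if k in known})
    except TypeError as exc:
        # A required field is missing: the producer made a breaking change.
        raise ValueError(f"contract violation: {exc}") from exc
```

An added field such as a leaked `internal_row_version` is ignored by this consumer, while a payload without `name` raises `ValueError` instead of silently producing a half-populated object.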
A distributed mess
The coupling through our database and frequently changing contracts made it difficult for our services to present a unified model of our business entities. We were lacking what Eric Evans calls a bounded context: a well defined, consistent model with an explicit boundary interface, which plays a key role in Domain Driven Design. Our services’ implementation tended to be in multiple places, full of leaky abstractions, with half the logic in the Monolith and half in the service. Our engineers needed to run every service locally to work on their features, even if they were in unrelated areas of the codebase. This not only increased the hardware requirements for our development and CI environments, but also made debugging incredibly difficult. We had a mess before and now we had a distributed mess. Trying to debug a failure in this environment was a frustrating task because no single engineer had sufficient knowledge of the entire system, further decreasing velocity.
We didn’t change our approach to production monitoring, deployment, hardware provisioning or application development. Each of these areas had a team focused on it, placing approximately five teams between experimenting with a new technology and having it running in production. This created a time delay between the initial desire to use and the release of new technology, most of which was spent waiting on other teams. This was inefficient, and prevented new technologies from being used, even if they would have greatly benefited our architecture. Most of this communication overhead existed because we were running our own datacenter with a small team – no easy feat. Additionally, because we had so many teams involved, their incentives were not always aligned, which increased the division between the teams and slowed progress.
Second Attempt: Services in AWS
We revised our service strategy at the beginning of this year and looked at solutions for the problems encountered with our initial approach.
Our priorities for our second attempt:
- Decouple components from the Monolith
- Allow teams to move independently
- Simplify development and testing environments
- Avoid undifferentiated heavy lifting
We tried to come up with a plan for building multiple web artifacts from the Monolith, essentially splitting the deployed code up by product area. After some analysis, this turned out to be an incredibly hard problem. The long evolutionary development in the Monolith had led to poor design practices, leaving us with intertwined, complicated code and a large amount of tech debt that would need to be paid down before we could see the benefits. So instead of trying to separate our existing codebase, we chose to extract the important datasets from the Monolith in a form that would be the most useful to new products being built outside it. These core domain models could then be consumed by any other service that required that information.
Our current architecture looks like this:
Decouple components from the Monolith
Instead of migrating all datasets, we’ve focused our efforts on several essential datasets, prioritizing data that unblocks the greatest number of teams. To make this happen, we’ve consolidated all the writes for a dataset into a single code path, allowing us to capture every change and stream it into AWS. Once the data is in AWS, there is a single service responsible for serving requests for a particular dataset. You can think of this as a variation of the strangler pattern where both new and existing features can now switch their read requests to these new services. These services are only serving read requests initially, freeing us from having to deal with two-way data synchronization for now. Each service has an isolated data store that only the responsible service has access to, helping eliminate coupling through direct data store access. This forces other services to communicate using their API, a topic we’ll describe below. There are some other non-trivial problems introduced by this architecture, like two-way data synchronization, that we’ll discuss in a later post.
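The flow described above can be sketched roughly as follows. This is an in-memory illustration in Python, not the actual pipeline: `MonolithWriter`, `ReadService` and `ChangeEvent` are hypothetical names, the per-entity version counter is an assumption, and a plain callback stands in for whatever streams changes into AWS.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    entity_id: int
    version: int  # assumed monotonically increasing per entity
    data: dict


class MonolithWriter:
    """The single code path through which every write to a dataset flows."""

    def __init__(self, publish: Callable[[ChangeEvent], None]) -> None:
        self._rows: Dict[int, dict] = {}
        self._versions: Dict[int, int] = {}
        self._publish = publish

    def save(self, entity_id: int, data: dict) -> None:
        version = self._versions.get(entity_id, 0) + 1
        self._rows[entity_id] = dict(data)
        self._versions[entity_id] = version
        # Every write is captured and published as a change event.
        self._publish(ChangeEvent(entity_id, version, dict(data)))


class ReadService:
    """Owns a private store; callers can only read through get()."""

    def __init__(self) -> None:
        self._store: Dict[int, ChangeEvent] = {}

    def apply(self, event: ChangeEvent) -> None:
        current = self._store.get(event.entity_id)
        # Skip stale or duplicate events so replays are harmless.
        if current is None or event.version > current.version:
            self._store[event.entity_id] = event

    def get(self, entity_id: int) -> Optional[dict]:
        event = self._store.get(entity_id)
        return dict(event.data) if event else None
```

Wiring `MonolithWriter(publish=service.apply)` to a `ReadService` instance means every `save` call becomes visible through `service.get`, while the read side never touches the Monolith’s tables directly.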
Allow teams to move independently
Stable, thoughtfully designed contracts are one way we’re combating tight coupling this time around. We’re performing the design upfront, and are using Swagger as the API specification. Senior engineers from across all our teams collaborated on a shared set of API design guidelines so that all our APIs conform to the same principles, improving consistency across the organization and capturing our best practices. This is a critical step to ensure we’re not leaking implementation details and limiting dependencies between services.
Choosing Swagger to define our service contracts has several benefits beyond just describing the shape of the API. It also allows us to generate code from the contracts in multiple languages and to generate a developer documentation website; neither of which our old solution had. API versioning is built into our routing structure, making it easier to support multiple API versions at once. Versioning can be difficult but we’re hoping the combination of upfront design, isolated data stores per service, and providing our core datasets through an API will help avoid some of the issues we encountered in the last iteration.
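As a rough illustration of versioning in the routing structure, the Python sketch below assumes the version appears as a path prefix such as `/v1/...`; the exact scheme is not specified here, and the handlers, resource name and response shapes are hypothetical. It shows two versions of the same resource being served side by side so that existing consumers keep working while new ones adopt the newer contract.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[int], dict]


def get_doctor_v1(doctor_id: int) -> dict:
    # Original contract: name is a single string.
    return {"id": doctor_id, "name": "Dr. Example"}


def get_doctor_v2(doctor_id: int) -> dict:
    # Newer contract: name is split into parts, a breaking change for v1 clients.
    return {"id": doctor_id, "name": {"title": "Dr.", "last": "Example"}}


# Routes are keyed by (version, resource) so several versions can coexist.
ROUTES: Dict[Tuple[str, str], Handler] = {
    ("v1", "doctors"): get_doctor_v1,
    ("v2", "doctors"): get_doctor_v2,
}


def dispatch(path: str) -> dict:
    """Route a path like '/v1/doctors/42' to the handler for that version."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise LookupError(f"unrecognized path: {path}")
    version, resource, raw_id = parts
    handler = ROUTES.get((version, resource))
    if handler is None:
        raise LookupError(f"no handler for {version}/{resource}")
    return handler(int(raw_id))
```

Retiring a version then amounts to removing its entry from `ROUTES` once no consumer calls it.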
Simplify development and testing environments
Previously a major pain point for our teams was our shared CI environment, a centralized bottleneck for all code changes. This system needed a dedicated team to maintain the environment and fix the frequent environmental issues that would occur. We couldn’t expect this single team to maintain every other team’s CI environment, especially as our new services introduced many new data stores and other infrastructure components. Synchronizing state across so many disparate services is a non-trivial problem that we wanted to avoid because of the high maintenance cost; instead we are focusing our testing efforts against mock services and relying on good production monitoring. Each team is now responsible for their own CI environment and has a direct pathway into production, removing the central bottleneck that was causing much frustration.
The mock services we introduced are containerized HTTP servers that serve static or predictable test data. They behave like the real service but don’t have an external stateful data store to maintain. Using Docker to run our mocks locally simplifies our environments immensely because each engineer can run their own copy of the mock service locally, independent of every other engineer at the company. We no longer need to worry about restoring data backups or managing another team’s data store on each development machine. Putting all of this together, we’re able to develop, test and deploy our services independently of each other. Obviously we can’t escape the realities of our distributed system in production, and we are evaluating tools like Zipkin to help us diagnose problems there.
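A mock service of this kind can be very small. The sketch below uses only the Python standard library and is an illustration under assumptions, not the actual mock implementation: the `FIXTURES` table, the `/v1/doctors/42` route and the helper names are hypothetical, and packaging it into a Docker image is left out.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Static, predictable test data keyed by request path (hypothetical fixture).
FIXTURES = {"/v1/doctors/42": {"id": 42, "name": "Dr. Example"}}


class MockHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = FIXTURES.get(self.path)
        status = 200 if body is not None else 404
        payload = json.dumps(
            body if body is not None else {"error": "not found"}
        ).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args) -> None:
        pass  # keep test output quiet


def start_mock(port: int = 0) -> HTTPServer:
    """Start the mock on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MockHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_json(url: str) -> dict:
    """Tiny client helper standing in for the service under test."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the mock holds no state beyond its fixtures, every engineer and every CI job can start a fresh copy and get identical responses.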
Avoid undifferentiated heavy lifting
We’ve removed the majority of our undifferentiated heavy lifting by leveraging AWS – in particular the managed services AWS provides. With Amazon performing the bulk of the work to provision and maintain complex pieces of software (like RDS), we’ve freed up our engineers to focus on the differentiated problems that matter to the business. Fast and frictionless access to compute and storage options also gave our teams the opportunity to experiment with different approaches that were previously inaccessible. To address the alignment problem, we’re embracing the DevOps philosophy and placing the production provisioning and support on our development teams, allowing them to own their services end to end. Our existing Ops Engineers now have the time and capacity to focus on building tooling to make our systems more reliable, instead of being handed code from our engineers to run.
We have also embraced Infrastructure as Code. Previously all our servers were configured and updated by hand or through limited automation. We didn’t take full advantage of a configuration management solution like Puppet or Chef, partly because we were using Windows Server. All our new services are built as stateless Docker containers, allowing us to deploy and scale them easily using Amazon’s ECS. Each service’s external resources are managed through CloudFormation, promoting environment parity between Development, CI and Prod. The EC2 instances our containers run on are launched from immutable AMIs created by Packer and Ansible, allowing us to spin them up and down dynamically as part of an Auto Scaling group. If one of our machines or containers is acting up, we can just terminate it and automatically get a replacement, which is really good for our sleep schedule.
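To show how a single template can promote environment parity, the Python sketch below builds a minimal CloudFormation template as a dictionary and could print it as JSON. It is an assumption-laden illustration rather than one of our real templates: the parameter names, the `APP_ENV` variable, the container name and the image are hypothetical, and a real stack would also need a cluster, IAM roles and load balancing.

```python
import json


def build_template(image: str, memory_mib: int = 256) -> dict:
    """Return one template that is deployed unchanged to dev, CI and prod."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Only parameter values differ between environments.
            "Environment": {
                "Type": "String",
                "AllowedValues": ["dev", "ci", "prod"],
            },
            "ClusterName": {"Type": "String"},
            "DesiredCount": {"Type": "Number", "Default": 1},
        },
        "Resources": {
            "TaskDefinition": {
                "Type": "AWS::ECS::TaskDefinition",
                "Properties": {
                    "ContainerDefinitions": [
                        {
                            "Name": "service",
                            "Image": image,
                            "Memory": memory_mib,
                            "Essential": True,
                            "Environment": [
                                {
                                    "Name": "APP_ENV",
                                    "Value": {"Ref": "Environment"},
                                }
                            ],
                        }
                    ]
                },
            },
            "Service": {
                "Type": "AWS::ECS::Service",
                "Properties": {
                    "Cluster": {"Ref": "ClusterName"},
                    "TaskDefinition": {"Ref": "TaskDefinition"},
                    "DesiredCount": {"Ref": "DesiredCount"},
                },
            },
        },
    }


def render(image: str) -> str:
    """Serialize the template so it can be checked into source control."""
    return json.dumps(build_template(image), indent=2, sort_keys=True)
```

Keeping the resource definitions identical and varying only the parameters is what makes a Development stack a faithful small-scale copy of Prod.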
As we transition our Monolithic application toward a microservice oriented architecture, we’re learning from our mistakes and adjusting our approach. The four key lessons we covered in this post:
- Keep each service’s data store private to that service and accessible only via its API to prevent coupling at the data store around an implicit representation of your data
- Use well defined contracts to avoid coupling between services and the Monolith
- Mock out dependent services at the API layer and use a containerized mock server to simplify development and CI
- Leverage open source, managed services and cloud infrastructure to avoid undifferentiated heavy lifting
We’ve made some great progress on our architecture over the last couple of years, but we’re not done yet. Teams have been building out their services and we’re continuously improving and refining our approach. A good early indicator of our progress is that some of our newest team members have never even developed in the Monolith! They’re spending their time working with React and rebuilding our front end, or bootstrapping our Scala transition for some critical backend components. Stay tuned for future posts on those topics. This work is a huge accomplishment for us and would not have been possible without the many talented engineers who helped work on these projects.
About the author
Chris Smith is a Principal Engineer in the Developer Tools team at Zocdoc. He’s passionate about developer productivity, cloud infrastructure and Continuous Integration.