Zocdoc is moving at full speed towards a microservice architecture. This journey was easily motivated by our previous battles with production outages, difficulties scaling horizontally, and bloated application code.
Our “Monolith,” the original Zocdoc codebase, built on .NET over the company’s first ten years, still powers a significant percentage of our application – we cannot simply turn it off. We have encouraged our engineering teams to build new products (Aqueduct, Patient Powered Search) in the cloud (outside of our Monolith) and to move existing application code out of our Monolith. While the process of shrinking our Monolith is underway, we still have to maintain a Continuous Integration (CI) environment that empowers our engineers with the benefits of CI.
We know our Monolith will not scale, and as we continue to move towards a world where our application is completely based on a Microservice Architecture (MSA), we have faced challenges in maintaining our CI system. Our Monolith CI system was showing its age: it was slow to provide feedback, and we lacked confidence in its output due to environmental and test flakiness. To improve the system we focused on three main areas: increasing parallel execution capacity, automating CI for feature branches, and improving our end-to-end testing infrastructure. During this process we lifted our CI system out of our datacenter and into AWS, which presented us with a new challenge: staying under budget.
Merging with Confidence
Transitioning from a Monolithic architecture to an MSA is akin to swapping out the engine of a car. We aren’t stopping the car – we are continuing forward at full speed and improving the features of the car as we make this transition. Technically, the engine swap is removing the single process constraint and preparing our application for horizontal scaling. So one of the questions we had to answer was: How can we be confident that contributions to the Zocdoc codebase will function as expected during the transition? To be able to merge with confidence we focused on three areas:
- Increase Parallel Execution Capacity – Our CI system needed to scale to handle peak workload.
- Automate CI for Feature Branches – All feature branches needed to be tested in isolation.
- Improve End-to-End Browser Testing – Test flakiness needed to be reduced.
Increase Parallel Execution Capacity
For engineers to have confidence that their contributions will work as expected, we need to integrate often and reduce feedback time to a minimum. Previously, we had four sets of static CI servers in our datacenter. This meant that most of the time the servers were underutilized, and during peak working hours a queue of contributions waiting to be integrated would form (about 5 hours), resulting in delayed feedback. To reduce feedback time we needed to scale out our CI system, and to do that we focused on two specific areas: reducing the size of our database and melting snowflakes.
Reduce Database Size
To be able to scale out our CI system, we first needed to scale down our database server. We needed the ability to run hundreds or thousands of database servers, and due to the size of our Monolithic database that was not possible without incurring significant cost. We began by peeling back the onion: every business process, every decision that led to such a large database in CI. From a technical perspective we were using this large data set for many forms of testing (integration tests, end-to-end browser tests, query performance tests, load tests), data analysis, QA bug evaluation, and powering the local development environments for our engineers. This one-size-fits-all dataset was anything but one-size-fits-all.
Enter Schema Zen, an open source tool that allows us to script and create a SQL Server database. Essentially Schema Zen provides the functionality to serialize our database (tables, objects, data, etc) to files, and deserialize these files to create a new and equivalent database server. We built a thin layer on top of Schema Zen that we have dubbed “CleanDB,” which manages the serialized files and handles the most common cases an engineer or CI process requires. By switching to CleanDB we have seen several improvements:
- Small Footprint – The dataset size decreased from 1TB to 300MB. Not only does this impact physical storage space, but it also has decreased memory requirements, database initialization time, website application startup time, and improved the horizontal scalability of the application. We have reduced the database server resource requirements from 12 cores, 64GB memory, and 1TB disk space to 4 cores, 8GB memory, and 20GB disk space while seeing an overall performance increase.
- Dev + CI Environment Parity – By using CleanDB for both CI servers and local development environments we have removed several unique processes that existed only in CI. The environment parity lowered the barrier to entry for our engineers to understand a failed CI run, and decreased the environmental flakiness due to data discrepancies that would inevitably arise between the various environments.
- Increased Server Availability – When using larger databases we had to reset them between every run, which took 20 minutes. In the cloud that means we would be wasting ⅓ of a compute hour. With CleanDB we spend 2 minutes at the start of a run deserializing from files and rebuilding the database, increasing available time to execute work by 18 minutes.
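The availability figures in the last bullet can be checked with a little arithmetic. The sketch below uses only the two durations quoted above (a 20-minute reset and a 2-minute CleanDB rebuild); the function and variable names are illustrative, not part of CleanDB.

```python
from fractions import Fraction

MINUTES_PER_HOUR = 60


def setup_share_of_hour(setup_minutes: int) -> Fraction:
    """Fraction of one compute hour spent preparing the database."""
    return Fraction(setup_minutes, MINUTES_PER_HOUR)


old_reset_minutes = 20  # resetting the large database between runs
cleandb_minutes = 2     # deserializing and rebuilding with CleanDB

old_share = setup_share_of_hour(old_reset_minutes)
new_share = setup_share_of_hour(cleandb_minutes)
minutes_recovered = old_reset_minutes - cleandb_minutes

print(f"old setup share of an hour: {old_share}")          # 1/3
print(f"CleanDB setup share of an hour: {new_share}")      # 1/30
print(f"minutes recovered per run: {minutes_recovered}")   # 18
```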
Melt Snowflakes
Previously, we were plagued with one-off machines, or snowflake servers. When we needed to expand our CI fleet, an engineer would work with IT to acquire compute resources in our datacenter, and follow a (hopefully up-to-date) playbook of step-by-step instructions for how to set up a new server for CI. Obviously, this would not scale. We ended up with many servers that were unique, which led to environmental flakiness and resulted in an unproductive use of engineering time. To ensure our servers remain as identical as possible we now rely on:
- Infrastructure as Code (IaC) – We aligned the company around a Configuration Management tool, Ansible. By using Ansible to describe the configuration of our servers we ensure that each server is created using the same process. Now when we need to update a server, rather than remoting into it and performing point-and-click changes, an engineer is able to update the code that generates all the servers of that type.
- Automated Machine Image Creation – Executing our Ansible scripts to set up a server on a fresh OS install takes about 90 minutes. To reduce the startup time we use Packer, “an open source tool for creating identical machine images for multiple platforms from a single source configuration.” When an engineer updates our IaC, she kicks off Packer to build a new machine image, paying the 90-minute cost upfront before the image is needed in production. The resulting image can be used as the blueprint to quickly spin up fresh, identical servers.
- Short-Lived Servers – Snowflakes are distinct because they follow different paths from the atmosphere to the surface. Similarly, servers are more likely to become unique the longer they live. By reducing drift from the original image we also reduce environmental flakiness. To accomplish this we created a tool (“Groundskeeper Willie”) that queries our cloud and terminates servers that have exceeded their lifetime.
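The lifetime check behind a tool like Groundskeeper Willie could look roughly like the sketch below. This is not the actual tool: the `Server` type, the 24-hour limit, and the server names are assumptions made for illustration, and the cloud query and terminate calls are left out so the selection logic stands on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Server:
    """Hypothetical record for one running server."""
    server_id: str
    launched_at: datetime  # timezone-aware launch time


def expired_servers(servers, max_lifetime, now):
    """Return the servers that have been running longer than max_lifetime."""
    return [s for s in servers if now - s.launched_at > max_lifetime]


# Example data; a real tool would query the cloud provider instead.
now = datetime(2017, 1, 2, 12, 0, tzinfo=timezone.utc)
fleet = [
    Server("ci-agent-old", now - timedelta(hours=30)),
    Server("ci-agent-new", now - timedelta(hours=3)),
]

# Assumed lifetime limit of 24 hours.
to_terminate = expired_servers(fleet, max_lifetime=timedelta(hours=24), now=now)
print([s.server_id for s in to_terminate])
```

Passing `now` in explicitly keeps the function deterministic, which makes the policy easy to test before it is allowed to terminate anything.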
Automate CI for Feature Branches
As a consequence of our old architecture the capacity of our CI system was limited. Our engineers had to consider when to run a full integration test on their feature branch and deal with the mental overhead of directly interacting with our proprietary CI management tool (“OnDemand”). Every engineer had to learn about OnDemand, how and when to use it, and make sure their contributions made it through this step before merging into master.
Harness the Power of Defaults
For engineers to receive feedback on their feature branches, they would manually queue their feature branch for CI via OnDemand. This special mechanism for our engineers to request feedback led to a few problems:
- Mental Overhead – Even though manually queuing contributions for feedback became a well-ingrained behavior, we were adding extra work for our engineers to do, and ultimately this process had to happen several times because of environmental and test flakiness. The context switching slowed us down. Engineers are makers, and reducing feedback time, manual work, and repetitive tasks pays dividends.
- Build It Yourself ⇒ Maintenance Costs Forever – Owning a proprietary tool meant that we had to maintain it. One of our team’s responsibilities was to make sure the tool was up, and if it went down, to fix it.
Instead of requiring engineers to manually opt in a branch for CI, we decided to automatically opt all branches in for CI. This was made possible by the earlier work to increase parallel execution capacity. We connected our CI system more tightly with the third-party integrations GitHub has exposed, so that, as engineers push their changes, we capture the event and automatically run the tests against these changes. The real wins here were removing as much of the mental overhead as possible and reducing additional features we had to maintain to a minimum.
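As a rough illustration of capturing a push and turning it into a CI run, the sketch below converts a GitHub push webhook payload into a build request. The `ref`, `after`, `deleted`, and `repository.full_name` fields are part of GitHub's push event; the shape of the returned build request and the example values are assumptions, since the post does not describe the actual integration.

```python
from typing import Optional

BRANCH_PREFIX = "refs/heads/"


def build_request_from_push(payload: dict) -> Optional[dict]:
    """Turn a GitHub push webhook payload into a CI build request.

    Returns None for pushes that should not trigger a build
    (tag pushes and branch deletions).
    """
    ref = payload.get("ref", "")
    if not ref.startswith(BRANCH_PREFIX):
        return None  # e.g. refs/tags/...
    if payload.get("deleted"):
        return None  # the branch was deleted, so there is nothing to test
    return {
        "repository": payload["repository"]["full_name"],
        "branch": ref[len(BRANCH_PREFIX):],
        "commit": payload["after"],
    }


# Example payload with made-up values.
push_payload = {
    "ref": "refs/heads/feature/faster-search",
    "after": "0123abc",
    "deleted": False,
    "repository": {"full_name": "example-org/example-repo"},
}
build = build_request_from_push(push_payload)
print(build)
```

Because every branch push produces a build request by default, engineers no longer need to remember to queue anything.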
Reduce Barriers to Entry
As CI executes tests on a branch, the output is captured and stored in the form of logs. TeamCity, the build management software we use, parses those logs and determines if the build passed or failed. When a build finishes, rather than requiring engineers to come to OnDemand to view results, we deliver two things to the engineer: a status (passed or failed), and the ability to dive deeper into the logs so they can analyze as necessary.
Our engineers are used to Pull Requests, so we tied TeamCity into GitHub via the Status API. Now engineers can see CI results of their branch in a tool they were already using every day.
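To make the two deliverables concrete, here is a minimal sketch of a commit status request for GitHub's Status API. The endpoint and the `state`, `target_url`, `description`, and `context` fields come from that API; the repository, commit, build URL, context label, and token are placeholders, and the request is only constructed, not sent. How TeamCity itself is wired to GitHub is not shown in the post.

```python
import json
import urllib.request


def commit_status_request(repo, sha, passed, build_url, token):
    """Build (but do not send) a GitHub commit status request for a finished build."""
    body = {
        "state": "success" if passed else "failure",
        "target_url": build_url,   # link back to the full build log
        "description": "CI build passed" if passed else "CI build failed",
        "context": "ci/monolith",  # hypothetical label shown on the pull request
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/statuses/{sha}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
request = commit_status_request(
    repo="example-org/example-repo",
    sha="0123abc",
    passed=False,
    build_url="https://ci.example.com/build/42",
    token="<token>",
)
print(request.get_method(), request.full_url)
print(json.loads(request.data)["state"])
```

The `state` gives the pass or fail signal on the pull request, and `target_url` is the link an engineer follows to dig into the logs.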
Improve End-to-End Browser Testing
CI delivers reliability. As software engineers we need to know if the work we are doing will function as expected in production, and while unit and integration tests have their place, a full end-to-end test is necessary to ensure all of our components are working well together. Did that dependency update break something else upstream?
Embrace Open Source
Selenium has been the backbone of our end-to-end tests for several years, as its WebDriver technology allows our engineers to create tests that simulate user interaction. We created proprietary software to manage parallel execution of Selenium tests in our datacenter, using a botnet and a server we named “Central Command” (CC).
Zocdoc’s business is not in WebDriver technology, so after this system was created the engineers moved on to other projects. SeleniumHQ continued to work on their technology, adding parallelism, extending the feature set, and creating a new product named Selenium Grid, which was essentially a robust version of our proprietary software. We did not initially update our system to use Selenium Grid due to the team changes. When we did update to the latest version of Selenium and incorporated Selenium Grid, we saw an improvement in test run time and a reduction in flaky tests. Being aware of dependency version updates, and having a regular process to update them, is key to embracing open source technology.
When our implementation of browser testing for CI was released, it worked well for our engineers. Over time our engineering organization has grown, leading to increased activity in our Monolith, and the fixed size of our botnet became a limitation. There were several problems with our home-grown approach:
- Static Servers – There was no auto scaling of bots or CC. We used a static amount of space on our SAN (9TB), and during peak hours work would queue to the point that we had to implement priority queuing for time sensitive work (testing production patches).
- Single Point of Failure – CC managed all end-to-end tests for the entire company, and executing all end-to-end tests is a requirement for all pushes to production.
- Drift – These servers were not updated, and the team that built the architecture had moved on to other projects. The infrastructure had drifted significantly from our documentation which meant that making any change was dangerous, and due to this danger we didn’t update the bots or CC often enough to be good at it.
To remedy these problems we built an auto scaling cluster of servers to run containers, relied on IaC, and regularly terminated servers that exceeded their lifetime.
- Dynamic Cluster Size – Our cluster runs on AWS’s Elastic Container Service (ECS), and we utilize CloudWatch alarms based on CPU reservation to determine if the cluster needs to scale in or out. When no tests are being run the cluster is shrunk down to zero containers, and during peak hours we can grow the required capacity, with no more queueing due to lack of resources.
- Fault Tolerance – While melting snowflakes we learned to use IaC, and we reused that approach while incorporating Selenium Grid’s Hub and Node architecture. SeleniumHQ provides Docker containers for the Hub and Nodes, and by using these containers, Docker Compose, and a little bit of routing magic, we now spin up a set of Nodes and a Hub for each run of end-to-end tests.
- Short-Lived Containers – We spin up containers on an as-needed basis, and terminate them when their work is completed. The short lifetime reduces the drift between the IaC and containers that are currently being run on the cluster.
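The scaling rule in the first bullet can be sketched as follows. CPU reservation here is the CPU reserved by running tasks as a percentage of the CPU registered by the cluster's instances, which is how the ECS `CPUReservation` metric is defined. The 75% and 25% thresholds and the example numbers are assumptions; the post does not state the alarm values in use.

```python
def cpu_reservation_percent(reserved_cpu_units, registered_cpu_units):
    """CPU reserved by running tasks as a percentage of the cluster's registered CPU."""
    if registered_cpu_units == 0:
        return 0.0  # an empty cluster has nothing reserved
    return 100.0 * reserved_cpu_units / registered_cpu_units


def scaling_decision(reservation_percent, scale_out_above=75.0, scale_in_below=25.0):
    """Decide what the cluster should do; both thresholds are hypothetical."""
    if reservation_percent > scale_out_above:
        return "scale_out"
    if reservation_percent < scale_in_below:
        return "scale_in"
    return "hold"


# Example: a cluster with 4096 registered CPU units at three load levels.
for reserved in (3584, 2048, 512):
    percent = cpu_reservation_percent(reserved, 4096)
    print(reserved, percent, scaling_decision(percent))
```

In practice the comparison would live in CloudWatch alarms attached to scaling policies rather than in application code; the function only spells out the rule those alarms encode.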
When CI was being run out of our datacenter, we had a limited amount of compute resources – it was pre-purchased by our IT department, and the costs were hidden from the engineers. From our perspective we just magically had compute resources, and we didn’t ask how much they cost. As we lifted our CI environment to AWS our monthly costs started mounting (see CI Monthly Costs). We were paying more than expected, and even though the cost scaled with the number of runs, a CI infrastructure cost equivalent to several engineers was not palatable.
CI Monthly Costs: There was a direct correlation between feature set rollout and increasing costs. In March we hadn’t rolled out the most expensive testing (end-to-end tests). Over the next few months we rolled out changes to save money, and started running end-to-end tests using Selenium Grid.
AWS provides a Billing and Cost Management tool to track how much money each account is spending at the service level. This tool is helpful; however, considering the number of teams using this AWS account, it was impossible to slice the data to create actionable work with the goal of reducing costs. Elasticsearch is one of the top three costs, but which part of CI is using it? Which teams are using that resource in our account? We needed a cloud management tool, and picking one that fit our needs mattered. We settled on CloudHealth for its features and its price.
With CloudHealth we were able to see how much we were spending and slice the data using AWS tags. We aligned our technology organization around two tags for all AWS resources: Owner and Project. With these two tags we can see how much money each engineering team (owner) was spending and slice that data down to the project level (see Cost Breakdown).
Cost Breakdown: Via CloudHealth we break our costs down by Owner and Project. Now we can easily see how much each owner (x-axis) is spending (y-axis) on a given project (stacked bar).
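The Owner and Project slicing can be illustrated with a small aggregation. This is a sketch of the idea only, not of CloudHealth: the line-item shape, team names, project names, and dollar amounts are all made up for the example.

```python
from collections import defaultdict


def costs_by_owner_and_project(line_items):
    """Sum cost per Owner tag, then per Project tag, for tagged billing line items."""
    totals = defaultdict(lambda: defaultdict(float))
    for item in line_items:
        tags = item.get("tags", {})
        owner = tags.get("Owner", "untagged")
        project = tags.get("Project", "untagged")
        totals[owner][project] += item["cost"]
    return {owner: dict(projects) for owner, projects in totals.items()}


# Made-up line items; real data would come from the billing export.
line_items = [
    {"cost": 120.0, "tags": {"Owner": "team-a", "Project": "ci"}},
    {"cost": 30.0, "tags": {"Owner": "team-a", "Project": "ci"}},
    {"cost": 45.0, "tags": {"Owner": "team-a", "Project": "search"}},
    {"cost": 10.0, "tags": {}},
]
breakdown = costs_by_owner_and_project(line_items)
print(breakdown)
```

Untagged resources collect in their own bucket, which is what makes missing tags visible as a cost in their own right.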
With the detailed data we were able to focus our efforts on the worst offenders, ensuring that we worked on the most meaningful optimizations first. As the owners of the CI infrastructure our team dug deep, reducing the CI costs by approximately 90%. We used several approaches to reduce costs:
- Spot Instances – We launched our CI system with On-Demand instances, and while we had great availability, the system cost more because of it. We are now using Spot Instances across several availability zones to overcome bid spikes that can occur in any given availability zone.
- Aggressive Auto Scaling – Our ECS cluster was one of our larger cost items. We were using auto scaling from the beginning, but not scaling aggressively. When the cluster is scaled in horizontally, servers are terminated, and we didn’t have any control over which servers were terminated. A terminated server could be hosting several active containers, and if an active Selenium container was terminated, that run of CI would fail due to environmental flakiness. To overcome this limitation we made the Selenium infrastructure able to recover from a container that was terminated. With the fault tolerance increased we made the auto scaling more aggressive (see Selenium Containers).
Selenium Containers: Before we made our Selenium Grid infrastructure more fault tolerant, the ECS cluster was easily approaching 1000 containers during the day (left). After we added fault tolerance and made the scale-in policy more aggressive, the ECS cluster caps at around 150 concurrent containers (right).
- Auto Termination of Resources – We expanded Groundskeeper Willie to terminate all CloudFormation stacks that were missing Owner or Project tags.
- Linux > Windows – Our Monolith currently requires a Windows server, and removing this requirement is neither easy nor quick. Where possible we moved work over to Linux machines, minimizing the amount of work that requires a Windows instance.
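The tag rule from the Auto Termination bullet above can be expressed as a simple filter. The sketch below is an assumption about how such a check might look, not the actual Groundskeeper Willie code: stacks are represented as a plain mapping from name to tags, the stack names are invented, and the call that would delete a stack is omitted.

```python
REQUIRED_TAGS = ("Owner", "Project")


def missing_tags(stack_tags):
    """Return the required tags that are absent or empty on a stack."""
    return [tag for tag in REQUIRED_TAGS if not stack_tags.get(tag)]


def stacks_to_terminate(stacks):
    """Names of stacks missing at least one required tag, in sorted order."""
    return sorted(name for name, tags in stacks.items() if missing_tags(tags))


# Invented example stacks.
stacks = {
    "ci-selenium": {"Owner": "team-a", "Project": "ci"},
    "scratch-experiment": {"Owner": "team-b"},
    "mystery-stack": {},
}
print(stacks_to_terminate(stacks))
```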
At Zocdoc we follow agile methodologies: pull risk forward, build only what we need, and tear down projects that no longer work. Infrastructure applications in the cloud have important additional steps: optimize for usability, and keep costs low. As we sunset our datacenter, shrink our Monolith, and expand our reliance on microservices we are embracing significant risk, and we mitigate that risk with a robust Continuous Integration Environment that scales with our growing needs, and doesn’t break the bank.
About the author
Scott Roepnack is a Senior Software Engineer on the Pre-Appointment (a.k.a. Piña Caliente) team at Zocdoc. Prior to joining Zocdoc, Scott worked in Media Technology at Touchstream Technologies, and completed his Master’s in Computer Science at Florida Atlantic University. He’s passionate about engineering, the opportunity to work on new challenges, board games, traveling, and cooking.