
Understanding insurance coverage and using it to find a relevant doctor is an important but complex task for patients. In the U.S., we have hundreds of insurance carriers and many of these have hundreds of plans available. Patients have a lot to sort through in order to identify the right information they need to find a doctor and book an appointment. To help patients have an easier time deciphering their insurance, we decided to build the Zocdoc Insurance Checker to extract key pieces of information from a photograph of a patient’s insurance card. The information we were interested in was the following:

  • Name of the Insurance Carrier and Plan
  • Member ID of the user

Given this information, we can check whether a certain visit by the patient is in-network or not, and only surface in-network doctors to the user. We can also provide the user an estimated copay amount based on an eligibility check. The system that we built uses various advances in computer vision and deep learning, and is a composition of deep neural nets that simultaneously perform classification, image cropping and OCR. The rest of this post details the architecture of this system and highlights the tricks and lessons we needed to reach an accuracy level on par with that of our patients.

Figure 1: A stylized animation showing what the product is designed to do.

Build vs. Buy

We knew image recognition and OCR were well-studied problems, so our first thought was to explore external tools and services. However, none of the services we found were designed to handle the level of variance in lighting, blur, orientation, angle, size and background that we saw in the images from our users. Here is an example of typical card image variance:

Figure 2: Examples of card images that get passed through to us. Note, these are all images of a Zocdoc engineer’s expired card, taken only for illustrative purposes.

Model Architecture

There have been significant advances in the state of the art in computer vision using deep learning approaches over the last few years, and this motivated us to build our system as a pipeline of deep neural networks. Our system is composed of three individual networks: a base classification network, an alignment network, and an OCR network. The alignment and OCR networks are connected using a differentiable attention layer so that they can be jointly optimized. Below is the architecture of the entire system; the rest of this article covers the details of each component separately.

Figure 3: Our network for classification, field extraction and OCR

Base CNN Model

The base model for the neural network takes in an image scaled to a 200×300 pixel resolution and outputs the following information:

  • Carrier and Plan IDs
  • Coordinates of the bounding box for the member ID, and its orientation.

The base model is based on VGG16: we started with the pre-trained convolutional layers and added our own set of dense layers that translate the convolutional features into the above outputs. Initializing the convolutional filters with ImageNet weights (transfer learning) allows us to start our training process with feature detectors trained on the large ImageNet corpus, instead of relying exclusively on our training data, and improves the generalization of our models.

Training multiple outputs together also improves generalization by sharing features between multiple tasks (multi-task learning). Our loss function is a standard cross-entropy loss on the categorical variables for Carrier and Plan ID. However, because our training data had some mislabelled examples, we capped the loss at a fixed level, which allows the system to remain confident in its predictions even when they disagree with the user's label on a few samples.
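A minimal sketch of what a capped cross-entropy could look like for a single example. The cap value of 4.0 and the function name are assumptions made for illustration; the post does not state the actual cap or how it was implemented.

```python
import math

def capped_cross_entropy(probs, label, cap=4.0):
    """Cross-entropy for one example, clipped at `cap`.

    `probs` is the predicted class distribution, `label` the index of the
    user-provided class. The cap value is a hypothetical choice.
    """
    eps = 1e-12  # guards against log(0)
    loss = -math.log(max(probs[label], eps))
    return min(loss, cap)

# The model is confident in class 0, but the (possibly wrong) label says class 1.
confident_disagree = capped_cross_entropy([0.999, 0.0005, 0.0005], label=1)

# The model agrees with the label, so the loss is small and the cap is inactive.
agree = capped_cross_entropy([0.9, 0.05, 0.05], label=0)
```

With a hard cap like this one, an example whose loss already exceeds the cap contributes no further gradient, so a handful of mislabelled cards cannot keep pulling the model away from a confident prediction.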

To generate the MemberID bounding box labels, we used Tesseract to detect all the words and corresponding bounding boxes on the card image. If any of the detected words (or the combination of any two nearby detected words) matched the user input exactly, we used the corresponding bounding box (or the merged bounding box for the two pieces) as our training label. This process gave us MemberID bounding box labels for around 10% of the images in our dataset. It also produced some errors, where the user had entered text that does appear on the card but is not the MemberID. To avoid penalizing the model too heavily in these erroneous cases, we used a loss that is quadratic for small errors but transitions into a sequence of linear penalties for larger errors.
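The label-generation step can be sketched as follows. The sketch assumes the OCR engine's word boxes are already available as `(text, box)` pairs; the `max_gap` threshold, the "same line" test, and joining the two words without a separator are all hypothetical choices, since the post does not define how "close by" was measured.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def merge(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def find_member_id_box(words: List[Tuple[str, Box]],
                       member_id: str,
                       max_gap: int = 30) -> Optional[Box]:
    """Return a bounding box label for member_id, or None if nothing matches."""
    # Case 1: a single detected word matches the user input exactly.
    for text, box in words:
        if text == member_id:
            return box
    # Case 2: two nearby words, read left to right, concatenate to the input.
    for i, (t1, b1) in enumerate(words):
        for j, (t2, b2) in enumerate(words):
            if i == j:
                continue
            horizontal_gap = b2[0] - b1[2]
            same_line = abs(b1[1] - b2[1]) <= max_gap
            if t1 + t2 == member_id and same_line and 0 <= horizontal_gap <= max_gap:
                return merge(b1, b2)
    return None  # no label for this image

detected = [
    ("Member", (10, 10, 80, 30)),
    ("ABC", (100, 50, 140, 70)),
    ("12345", (150, 50, 210, 70)),
]
label_box = find_member_id_box(detected, "ABC12345")
no_label = find_member_id_box(detected, "XYZ999")
```

Requiring an exact match keeps the labels clean at the cost of coverage, which is consistent with only a fraction of images ending up with a bounding box label.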

Figure 4: MemberID bounding box loss function
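One way to write a loss of this shape is sketched below: quadratic near zero, then linear segments whose slopes decrease as the error grows. The breakpoints (`knots`) and `slopes` are hypothetical values picked so the function stays continuous; the actual values behind Figure 4 are not given in the post.

```python
def bbox_loss(error, knots=(1.0, 3.0), slopes=(1.0, 0.25)):
    """Quadratic penalty up to knots[0], then piecewise-linear segments.

    slopes[i] applies between knots[i] and knots[i + 1]; the last slope
    applies to all larger errors. All constants are illustrative.
    """
    e = abs(error)
    if e <= knots[0]:
        return 0.5 * e * e
    loss = 0.5 * knots[0] ** 2  # value of the quadratic at the first knot
    for i, slope in enumerate(slopes):
        start = knots[i]
        end = knots[i + 1] if i + 1 < len(knots) else float("inf")
        loss += slope * (min(e, end) - start)
        if e <= end:
            break
    return loss

small = bbox_loss(0.5)   # quadratic region
medium = bbox_loss(2.0)  # first linear segment
large = bbox_loss(-5.0)  # second, flatter linear segment
```

Because the slope shrinks for large errors, a label that points at the wrong text on the card pulls on the prediction far less than it would under a purely quadratic loss.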

Multi-Hypothesis Prediction: Another type of labelling error occurred on cards where the same MemberID appeared at multiple places. In this case our training data could have the bounding box at any of these locations. To be robust to this type of training data, the model predicted multiple hypotheses (in practice just two suffice) for the MemberID location, and the loss was taken to be the minimum over the different predictions. The multiple-hypothesis model was initialized from the trained single-hypothesis model, and the additional hypotheses were initialized by randomly perturbing the single-hypothesis prediction. As we train, for cards that have multiple occurrences of the MemberID, the different hypotheses attach to the different locations where the MemberID shows up. For the majority of cards, where the MemberID appears only once, both hypotheses move very close to the same location. For simplicity, after training we just choose the hypothesis that had the highest overlap with the bounding boxes in the training set as the single prediction for the next stage.
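The minimum-over-hypotheses loss can be illustrated with a small sketch. A plain squared error stands in for the per-box loss here, and the box encoding as four normalized coordinates is an assumption made for the example.

```python
def squared_error(pred, target):
    """Stand-in per-box loss: sum of squared coordinate differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def multi_hypothesis_loss(hypotheses, target, box_loss=squared_error):
    """Return (loss, index) of the hypothesis closest to the labelled box."""
    losses = [box_loss(h, target) for h in hypotheses]
    best = min(range(len(losses)), key=losses.__getitem__)
    return losses[best], best

# Two predicted locations for the MemberID on the same card.
hypotheses = [(0.1, 0.2, 0.5, 0.3), (0.1, 0.7, 0.5, 0.8)]

# The label may point at either occurrence; only the nearest hypothesis is penalized.
loss_top, picked_top = multi_hypothesis_loss(hypotheses, (0.1, 0.2, 0.5, 0.3))
loss_bottom, picked_bottom = multi_hypothesis_loss(hypotheses, (0.1, 0.7, 0.5, 0.8))
```

Since only the winning hypothesis receives a penalty for a given label, each hypothesis is free to settle on a different occurrence of the MemberID.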

Alignment and Attention Model

Figure 5: Zoomed in view of MemberID extraction network, composed of a VGG CNN and an attention layer.

The next stage in the pipeline for member ID extraction starts with a high resolution image cropped using the loose bounding box estimate from the Base CNN Model, and aligns the bounding box coordinates with the exact location of the MemberID. The alignment model uses convolutional layers with an architecture similar to the VGG model. Finally, we use a differentiable attention layer based on "DRAW: A Recurrent Neural Network For Image Generation" to extract a tighter image of the MemberID with refined bounding box coordinates. This attention layer maps each pixel in the 512×64 output to a differentiable function of the alignment bounding box coordinates and of the high resolution image around the approximate bounding box (the input to the alignment model). The function incorporates a trainable padding amount as well as a trainable Gaussian blur, which attenuates the high frequency noise in the image of the MemberID and makes the mapping smooth enough to allow effective training of the alignment. During training, we initialize the blur to a higher level than needed and let it recover to a lower level while the alignment of the bounding box improves.
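A rough, pure-Python sketch of a DRAW-style read operation on a tiny grayscale image follows. Each output pixel is a Gaussian-weighted average of input pixels, so the result varies smoothly with the box center, box size and blur (`sigma`). The function names and the way the stride is derived from the box size are assumptions; the trainable padding is omitted, and a real layer would operate on tensors inside the training framework.

```python
import math

def filterbank(center, stride, sigma, n_out, n_in):
    """n_out Gaussian filters over n_in input positions, each normalized to sum to 1."""
    rows = []
    for i in range(n_out):
        # Filter centers are spread evenly around `center`.
        mu = center + (i - (n_out - 1) / 2.0) * stride
        weights = [math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2)) for a in range(n_in)]
        total = max(sum(weights), 1e-8)
        rows.append([w / total for w in weights])
    return rows

def attend(image, cx, cy, box_w, box_h, sigma, out_w, out_h):
    """Extract an out_h x out_w patch centered at (cx, cy): patch = Fy * image * Fx^T."""
    in_h, in_w = len(image), len(image[0])
    fx = filterbank(cx, box_w / out_w, sigma, out_w, in_w)
    fy = filterbank(cy, box_h / out_h, sigma, out_h, in_h)
    # Apply the vertical filters, then the horizontal ones.
    tmp = [[sum(fy_row[y] * image[y][x] for y in range(in_h)) for x in range(in_w)]
           for fy_row in fy]
    return [[sum(row[x] * fx_row[x] for x in range(in_w)) for fx_row in fx]
            for row in tmp]

# An 8 x 16 image with a single bright vertical stripe at column 5.
image = [[1.0 if x == 5 else 0.0 for x in range(16)] for _ in range(8)]
on_target = attend(image, cx=5.0, cy=4.0, box_w=4.0, box_h=4.0, sigma=1.0, out_w=4, out_h=2)
off_target = attend(image, cx=12.0, cy=4.0, box_w=4.0, box_h=4.0, sigma=1.0, out_w=4, out_h=2)
```

A larger `sigma` spreads each output pixel over more input pixels, which is why starting with extra blur gives the alignment a smoother signal to follow before the image is sharpened again.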

OCR Model

Given a very localized image of a text sequence, the OCR component was relatively straightforward. The options we considered were to use the open source OCR engine Tesseract, or to build our own. The distribution of MemberID texts, fonts and backgrounds on insurance cards is much more constrained than what a generic system like Tesseract has been tuned to handle, so a custom solution can provide superior performance. A little research also showed that many other companies were building their own versions for their own problems, and that the deep learning community has provided good baseline examples upon which to build.

Similar to the above references, our solution consists of two convolutional and max pooling layers, a Dense layer across the depth and height of the images, a sequence of bidirectional GRU layers along the width, a Dense layer across the depth of the GRU layers, and a softmax activation for each location along the width. The OCR model outputs a 128×40 tensor, where 40 is the size of the vocabulary (including BLANKs), representing the probability of each symbol appearing at each of 128 locations along the length of the image.

One core challenge in OCR training is aligning output with the images (i.e., an image may be 512 pixels wide, but contain 8 characters, and the issue is assigning each character to a sequence of pixels). We solve this problem using the CTC (Connectionist Temporal Classification) loss, and we train both the OCR Model and attached Alignment Model together. We can do this joint training since we used a differentiable attention layer in the Alignment Model – the joint training significantly improves performance over individual training for the two pieces.
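To make the role of the BLANK symbol concrete, here is a sketch of greedy decoding of a CTC-style output: take the most likely symbol at each location, collapse consecutive repeats, and drop blanks. The post only describes the CTC training loss, so the decoding strategy, the tiny alphabet and the blank index used here are assumptions for illustration.

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Decode a (locations x vocabulary) probability table into a string."""
    # Most likely symbol index at each location along the width.
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    decoded = []
    previous = None
    for index in best:
        # Emit a symbol only when it changes and is not the blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)

ALPHABET = ["<blank>", "A", "B", "1"]  # hypothetical 4-symbol vocabulary

def one_hot(index, size=len(ALPHABET)):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Eight locations: A A <blank> A B B <blank> 1
frames = [one_hot(i) for i in (1, 1, 0, 1, 2, 2, 0, 3)]
text = ctc_greedy_decode(frames, ALPHABET)
```

The blank between the second and third frames is what lets the sequence contain a genuine double "A" instead of collapsing it into one character.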

Data Collection

A key challenge in any image recognition and OCR training task is to get labeled images. In our case, patients specify their insurance carrier and plan before they book an appointment (and/or upload their card). In the card uploading flow, we also prompt them to manually enter their MemberID.


Figure 6: A snapshot of the Zocdoc Insurance Picker

One interesting observation we made after looking through a lot of our data, and a strong justification for this product, is that patients are often wrong about what plan they have or what their MemberID is (and not just because of typos). Beyond just being convenient, we're finding that our process can also be more accurate than relying purely on patient effort. Going forward, we plan to have internal Zocdoc operations teams label images with a Mechanical Turk-like tool in a HIPAA compliant manner, which should improve the training data quality and, we hope, the system accuracy even more.

Model Training

As briefly mentioned above, the user photos contain cards in any of four orientations – 90 degrees from each other. We also use data augmentation by transforming the images with small rotations, 180 degree flips, small translations as well as small changes to the color channels – to get additional training data and improve generalization.
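The four-orientation part of this can be sketched on a small row-major image. Only the 90-degree rotations are shown; the small rotations, translations and color changes would need an image library and are left out. The function names and the pairing of each image with an orientation index are hypothetical.

```python
def rotate90(image):
    """Rotate a row-major image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def four_orientations(image):
    """Return [(k, image rotated clockwise by k * 90 degrees)] for k = 0..3."""
    examples = []
    current = [list(row) for row in image]
    for k in range(4):
        examples.append((k, current))
        current = rotate90(current)
    return examples

card = [[1, 2],
        [3, 4]]
examples = four_orientations(card)
```

Each pair gives one training example together with the orientation the model would be asked to predict for it; index 2 corresponds to the 180 degree flip.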

Each model was trained for multiple days on 8-GPU servers (AWS p2.8xlarge instances), using the Keras library in Python with the TensorFlow backend.

End-To-End Performance On Member ID Extraction

Below we describe the effect of various components of the MemberID extraction pipeline on the end-to-end accuracy and compare the system's accuracy to that of our users. Our goal was to achieve similar or improved accuracy relative to our users, which was around 82% on our test dataset (that is, patients correctly entered their MemberID only 82% of the time).

We started by using just the Base CNN model, predicting a single MemberID bounding box hypothesis, with Tesseract as the OCR engine to read the text within the bounding box. This gave us a 42% end-to-end accuracy, which was a good start but not satisfactory given what state-of-the-art image recognition systems can achieve. We noticed that slight alignment errors that were very easy for a human to see were causing Tesseract to fail, so we added a second stage to align the approximate prediction of the first stage more accurately with the MemberID text. This boosted the accuracy to 69%. Finally, we added the attention layer and the in-house OCR model and jointly trained it with the alignment model. We also added multiple-hypothesis prediction to improve accuracy on insurance card types with multiple occurrences of the MemberID. Together these changes put us at an accuracy level of 86%, comparable to the user accuracy on our test set. At this stage, we decided to start testing the model in production while we continue to improve the performance over time.


Method                                                                    | Accuracy
User Labels                                                               | 82%
Base Model (single hypothesis) → Tesseract                                | 42%
Base Model → Alignment Model → Tesseract                                  | 69%
Base Model (multi hypothesis) → Alignment Model + Attention + OCR Model   | 86%



The speed at which we can develop deep learning solutions to new problems has accelerated tremendously over the years. We were able to develop a proof of concept (and justify the effort) very quickly, thanks to the availability of cloud-based GPU servers, pre-trained models, and open sourced architectures and code. We quickly learned, though, that getting to a quality that is appropriate for a production-level personal health application required more ingenuity and trial and error than simply stringing together open sourced components. This wasn't surprising to us, given the known complexity of both insurance cards and images taken without forcing formatting constraints. We're very happy with where we are now, and we expect that with more data the system will exceed the accuracy of our users by a wider margin.

Our model is now in production on the Zocdoc app and mobile web and is helping our patients search for care while improving the accuracy of the resulting recommendations. This is the first ever deep learning based product for Zocdoc, and represents our philosophy of bringing innovation to the patient healthcare experience.

About the author

Akash Kushal is an engineer at Zocdoc in the Piña Caliente team and enjoys building machine learning systems and flying fixed wing aircraft.

One response to “Making Sense Of Insurance Cards Using Deep Learning”

  1. Johan Sanmartin says:

    This is excellent. Finally an easier way to demystify health insurance and the millions of regulations that we always seem to ignore or overlook until the moment we need them most.
