
Myndshft Blog


August 22, 2019

How Medical Concept Embeddings Can Help Reduce Billing Errors 

by Tyrone DeSilva | Healthcare Technology

One of the problems that Myndshft seeks to solve in healthcare administration is automating and improving medical billing. Currently, medical billing is composed of tedious manual processes, which introduce errors and latency that drive up costs and reduce provider and patient satisfaction. Machine learning and process automation are part of the toolbox Myndshft uses to tackle these challenges. For machine learning and process automation systems to be effective, they first need to learn to comprehend claims data. Claims data commonly contains two ontologies: the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) and Current Procedural Terminology (CPT). The solution we propose for handling these ontologies is medical concept embeddings.

Embeddings have conventionally been used to obtain distributed representations of words. They map from a vocabulary space to a real vector space. However, embeddings have also been used in other areas, e.g., graph embeddings, entity embeddings, and sequence embeddings. Of particular interest to Myndshft is applying embeddings to represent medical concepts. The goal of this post is to provide a brief overview of the motivation for creating medical concept embeddings, explain how they learn to represent medical ontologies, and explore potential uses for embeddings.


Medical ontologies contain large numbers of medical concepts. For example, ICD-10-CM is a diagnostic ontology that contains over 70,000 concepts. The Unified Medical Language System (UMLS) is a metathesaurus of many different ontologies that contains over 3 million concepts. Concepts are represented as codes within an ontology. Typically, machine learning models need codes to be one-hot encoded before they can make sense of them. Unfortunately, one-hot encoding discards a lot of the information in the original coding system.

For example, in ICD-10-CM, V91.07 and V91.0 are clearly similar, since they share the same prefix (both are injuries related to watercraft incidents). When we one-hot encode these codes, unless we add a separate feature for the prefix, the model treats these two related concepts as unrelated. Another problem with one-hot encoding is that in most cases it results in high-dimensional, sparse features.

Using embeddings solves both problems. In a well-trained embedding model, V91.07 and V91.0 should be close to each other in the embedding space. Additionally, we can control the dimensionality of our distributed representation: lower-dimensional embeddings can be nearly as good as very high-dimensional ones, and both generally have much lower dimension than the one-hot encoding.
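To make the contrast concrete, here's a tiny illustrative sketch. The three-code vocabulary and the two-dimensional embedding values are invented for illustration; a real model would learn embeddings like the 100-dimensional ones described below.

```python
import numpy as np

# Hypothetical tiny vocabulary of ICD-10-CM codes (illustration only).
vocab = ["V91.0", "V91.07", "I10"]
index = {code: i for i, code in enumerate(vocab)}

def one_hot(code):
    v = np.zeros(len(vocab))
    v[index[code]] = 1.0
    return v

# One-hot vectors of distinct codes are always orthogonal: related and
# unrelated codes look equally dissimilar to the model.
print(one_hot("V91.0") @ one_hot("V91.07"))  # 0.0
print(one_hot("V91.0") @ one_hot("I10"))     # 0.0

# A trained embedding (values invented here) can place related codes
# close together and unrelated codes far apart.
emb = {
    "V91.0":  np.array([0.90, 0.10]),
    "V91.07": np.array([0.85, 0.15]),
    "I10":    np.array([-0.20, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["V91.0"], emb["V91.07"]))  # near 1: similar codes
print(cosine(emb["V91.0"], emb["I10"]))     # low: unrelated codes
```

Note that the one-hot dot products carry no similarity signal at all, while the embedded cosine similarities do, and in far fewer dimensions than the vocabulary size.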


At Myndshft, we trained our embedding model using MIMIC-III [1] EHR data, which contains information pertaining to individual ICU admissions, including multiple diagnoses coded in ICD-9 and procedures coded as CPT codes. The embedding model maps ICD-9 codes and CPT codes into the same 100-dimensional embedding space.

Figure: the Continuous Bag of Words (CBOW) architecture from the word2vec paper.

We used an approach similar to the Continuous Bag of Words (CBOW) model from the original word2vec paper [2]. Instead of using word proximity as the context, we define the context as all CPT and ICD-9 codes recorded during a hospital admission. For each code in a hospital admission, we train a multilayer perceptron to predict that particular code using the other codes in the context. Define w(t) as the code we are predicting, and let {w(i) : i ≠ t} be the remaining CPT and ICD-9 codes. Here's how the model is defined in TensorFlow:

import tensorflow as tf
from tensorflow.keras import layers

EMBEDDING_DIM = 100  # the 100-dimensional embedding space described above

# LOOKUP_DICT is used to label encode our ICD and CPT codes
def create_model():
    model = tf.keras.Sequential([
        layers.Input(name="input_1", shape=(MAX_LEN,)),
        # +1 reserves index 0 for padding shorter admissions to MAX_LEN
        layers.Embedding(len(LOOKUP_DICT) + 1, EMBEDDING_DIM, mask_zero=True),
        layers.GlobalAveragePooling1D(),  # average the context-code embeddings
        layers.Dense(len(LOOKUP_DICT), activation="softmax"),
    ])
    return model
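As an illustrative sketch of how CBOW-style training pairs might be built from a single admission's codes: each code in turn becomes the prediction target and the remaining codes become the context. The `make_pairs` helper and the example codes are our own invention, not the production pipeline, which would also label-encode the codes and pad contexts to MAX_LEN.

```python
# Build (context, target) training pairs from one hospital admission's
# ICD-9 and CPT codes, CBOW-style: predict each code from the others.
def make_pairs(admission_codes):
    pairs = []
    for t, target in enumerate(admission_codes):
        context = admission_codes[:t] + admission_codes[t + 1:]
        pairs.append((context, target))
    return pairs

# Toy admission with two diagnoses and one procedure (codes illustrative).
admission = ["ICD9:428.0", "ICD9:584.9", "CPT:99291"]
for context, target in make_pairs(admission):
    print(context, "->", target)
```

Each admission of n codes yields n training examples, so even a modest EHR dataset produces a large number of (context, target) pairs.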

Embeddings also allow us to visualize the relationships between codes in a two- or three-dimensional space for easy interpretation of the model. Clicking on a point in the embedding projector shows its neighboring points in the embedding space.

Other examples of using distributed representations for healthcare data include Choi et al. [3], who use claims data to train their embeddings. See also Choi et al. [4], which includes demographic information and trains on intra-visit co-occurrence as well as inter-visit sequence information to produce Med2vec embeddings; their repository has code to train the model on MIMIC-III data. The Graph-based Attention Model (GRAM), another model by Choi et al. [5], supplements representation training on EHR data with hierarchical information about the medical concepts (e.g., V91 is a parent of V91.07) and an attention mechanism, which allows them to train more effective embeddings with less data. Finally, there has been promising research by Nickel and Kiela [6] into hyperbolic embeddings, which are useful for capturing hierarchical representations, a key characteristic of many medical ontologies. They train their embeddings using the UMLS metathesaurus, and their embeddings, along with more information about hyperbolic embeddings, are available online.

Practical Application

We can envision a couple of immediate applications of medical concept embeddings, but embeddings can be applied anywhere that meaningful representations of medical ontologies are needed.

The first application is in Health Information Retrieval (HIR). One HIR application of embeddings is finding related medical concepts, even across ontologies. The interactive embedding example above illustrates how embeddings and a nearest-neighbors algorithm can be used to explore ICD-9 and CPT codes. A recommender system for diagnosis and procedure codes could improve provider workflows for billing and documentation, speeding up the process and reducing errors. Because we can embed multiple ontologies into the same space, we can also aid translation between different ontologies. Although a metathesaurus like UMLS can do the same thing for standard coding schemes like ICD-10-CM and SNOMED CT, building a crosswalk from an internal ontology could take thousands of expert hours. With the right data, an embedding can aid translation between any number of ontologies.
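A toy sketch of that nearest-neighbors lookup follows. The codes and the three-dimensional embedding matrix are invented for illustration; in practice the rows would come from the trained 100-dimensional model, with ICD and CPT codes sharing one space.

```python
import numpy as np

# Toy shared vocabulary of ICD-9 and CPT codes with invented embeddings.
codes = ["ICD9:428.0", "ICD9:584.9", "CPT:99291", "CPT:93010"]
E = np.array([
    [0.90, 0.10, 0.00],
    [0.20, 0.80, 0.10],
    [0.85, 0.20, 0.05],
    [0.10, 0.10, 0.90],
])

def nearest(code, k=2):
    """Return the k codes closest to `code` by cosine similarity."""
    q = E[codes.index(code)]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    # Sort by descending similarity, excluding the query code itself.
    order = [i for i in np.argsort(-sims) if codes[i] != code]
    return [codes[i] for i in order[:k]]

print(nearest("ICD9:428.0"))
```

This is exactly the computation an embedding projector performs when you click a point: score every other point against the query, then sort. Note that the neighbors can come from either ontology, which is what enables cross-ontology lookup.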

The second application relates to billing errors and Fraud, Waste, and Abuse (FWA). In medical billing, an estimated $210 billion is lost to overbilling [7], with as many as 80% of hospital bills containing a billing error [8]. FWA is estimated to account for up to 10% of medical spending [9]. Many of these billing errors are caused by claims coding errors, due either to human error or to fraud. An example of human error is a data entry mistake, such as a typo in a CPT code that causes the wrong procedure to be billed. In the case of fraud, more expensive codes are used intentionally, a practice known as upcoding. In the most extreme cases, procedures may be billed that were never performed.

Anomaly detection models can be used to detect billing errors and FWA by flagging claims that don’t fit the usual pattern of ICD and CPT codes. ICD and CPT embeddings can be used as model features. We’ve already gone over some of the benefits of using embeddings as modeling features in the motivation section, such as dimensionality reduction, encoding concept similarity, and continuous values. 

To illustrate how trained embeddings can encode meaningful information for detecting unusual coding patterns, let's go over a simple heuristic solution. Embedding spaces have a notion of distance between points (this is how the visualization above knows which points are nearest to the one you selected: it calculates the distance to all the other points and sorts them by distance).

We can define the heuristic as follows: for each CPT code on the claim, calculate the minimum distance over all ICD codes on the claim. If that minimum distance exceeds a threshold value, flag the claim. A large minimum distance indicates that the CPT code is not sufficiently related to any ICD code on the claim, so something questionable may be going on. The sensitivity of the heuristic can be tuned by adjusting the threshold. Without needing to train any new models, we already have a simple solution using just the information encoded in the embeddings.
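The heuristic above can be sketched in a few lines. The embedding table, the specific codes, and the threshold value here are all invented for illustration; a real system would look up vectors from the trained model and tune the threshold against labeled claims.

```python
import numpy as np

# Hypothetical embedding lookup (values invented for illustration).
EMB = {
    "CPT:99291":  np.array([0.80, 0.20]),
    "CPT:27447":  np.array([-0.70, 0.60]),
    "ICD9:428.0": np.array([0.75, 0.25]),
    "ICD9:584.9": np.array([0.70, 0.30]),
}

def distance(a, b):
    """Euclidean distance between two codes in the embedding space."""
    return float(np.linalg.norm(EMB[a] - EMB[b]))

def flag_claim(cpt_codes, icd_codes, threshold=1.0):
    """Return the CPT codes whose nearest ICD code is beyond the threshold."""
    flagged = []
    for cpt in cpt_codes:
        min_dist = min(distance(cpt, icd) for icd in icd_codes)
        if min_dist > threshold:
            flagged.append(cpt)
    return flagged

# The toy CPT:27447 vector sits far from both ICD vectors, so it is
# flagged; CPT:99291 sits near them, so it passes.
print(flag_claim(["CPT:99291", "CPT:27447"], ["ICD9:428.0", "ICD9:584.9"]))
```

Raising the threshold makes the check more permissive; lowering it flags more claims for review, which is the sensitivity trade-off described above.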



[1] Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. MIMIC-III, a freely accessible critical care database, 2016.

[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space, 2013.

[3] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning Low-Dimensional Representations of Medical Concepts, 2016.

[4] Edward Choi, Mohammad Bahadori, and Jimeng Sun. Multi-layer Representation Learning for Medical Concepts, 2016.

[5] Edward Choi, Mohammad Bahadori, Le Song, Walter Stewart, and Jimeng Sun. GRAM: Graph-based Attention Model for Healthcare Representation Learning, 2016.

[6] Maximilian Nickel and Douwe Kiela. Poincaré Embeddings for Learning Hierarchical Representations, 2017.

[7] The Path to Continuously Learning Health Care in America, 2012.

[8] Kelly Gooch. Medical billing errors growing, says Medical Billing Advocates of America, 2016.

[9] Coalition Against Insurance Fraud. By the numbers: fraud statistics.