In the era of big data, organizations must become proficient at converting data into insights to stay competitive. However, an estimated 90% of existing data is unstructured, which means it is not suitable for traditional analysis. This includes text, images, and video files. Transforming this raw, unstructured data into a structured format is therefore essential to enable further analysis and gain business value. In this post, we outline how our natural language understanding (NLU) model identifies custom named entities in a volume of text, builds relationships between them, and stores them in a knowledge graph.

To demonstrate the value of our approach, we applied it to a publicly available database which contains information on millions of prescription drugs. We wanted to process unstructured drug package inserts (PIs), extract the symptoms (also called indications) mentioned within each PI, and get the corresponding dosage information for each patient type using natural language processing (NLP) and NLU.

The unstructured text for each drug enters our matching process, which performs custom named entity recognition (NER) and builds relationships among the resulting entities (NLU) to create an organized knowledge graph of the drug’s dosage information.

Technical Approach

The unstructured drug insert files must be cleaned before they can be processed by the NLP pipeline. This cleaning consists of removing extra whitespace, normalizing case, and removing special characters. We collate the symptom descriptions to build a working taxonomy. Then, we use an object-oriented approach to create custom tags that seed the development of a knowledge graph. Next, the cleaned text is processed by spaCy, an open-source NLP library. Finally, the data goes through custom domain-specific components. For this healthcare application, we developed three custom named entity recognition components: the Indication Tagger, the Patient Type Tagger, and the Dosage Tagger.
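The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name `clean_insert_text` and the choice of which punctuation to keep (characters that commonly appear in dosage expressions such as "mg/m2") are assumptions.

```python
import re


def clean_insert_text(raw: str) -> str:
    """Normalize raw package-insert text before NLP processing."""
    text = raw.lower()  # case-normalize
    # Assumption: keep letters, digits, whitespace, and punctuation that can
    # carry dosage meaning; replace every other character with a space.
    text = re.sub(r"[^a-z0-9\s.,/%()-]", " ", text)
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()


cleaned = clean_insert_text("  The   Recommended\tDOSAGE*  is 50 mg/m2.\n")
print(cleaned)  # the recommended dosage is 50 mg/m2.
```

Lowercasing everything is the simplest form of case normalization; a real pipeline might defer it if downstream components rely on capitalization.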

Visualization of custom NER on text from drug package insert

Custom NLP Components for Healthcare

The Indication Tagger is a greedy matching algorithm that recognizes indications from the accumulated taxonomy. The algorithm is “greedy” because it always prefers the longest available match, extracting as much information as possible. For example, consider the sentence “The recommended dosage of Keytruda for adult patients with adenocarcinoma of the breast is 50 mg/m2.” It is preferable to extract “adenocarcinoma of the breast” instead of “adenocarcinoma” by itself. By performing greedy matching, the Indication Tagger preserves the rich information that is available in the unstructured data.
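To make the greedy behavior concrete, here is a minimal longest-match sketch in plain Python. It is not the Indication Tagger itself: the function name `tag_indications`, the whitespace tokenization, and the two-entry taxonomy are illustrative assumptions. In a spaCy pipeline the same idea would operate on tokens of a processed document.

```python
def tag_indications(tokens, taxonomy):
    """Return (start, end, phrase) spans, preferring the longest taxonomy match."""
    phrases = {tuple(p.lower().split()) for p in taxonomy}
    max_len = max((len(p) for p in phrases), default=0)
    lowered = [t.lower() for t in tokens]
    spans, i = [], 0
    while i < len(lowered):
        match_end = None
        # Try the longest candidate first so longer phrases win over shorter ones.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            if tuple(lowered[i:i + n]) in phrases:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end, " ".join(tokens[i:match_end])))
            i = match_end  # skip past the match so spans never overlap
    return spans


sentence = ("The recommended dosage of Keytruda for adult patients "
            "with adenocarcinoma of the breast is 50 mg/m2")
taxonomy = ["adenocarcinoma", "adenocarcinoma of the breast"]  # illustrative
spans = tag_indications(sentence.split(), taxonomy)
print(spans)  # [(9, 13, 'adenocarcinoma of the breast')]
```

Because candidates are tried from longest to shortest, the shorter entry "adenocarcinoma" is only returned when the longer phrase does not appear at that position.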

The Patient Type Tagger is an algorithm for labeling the target patient for the dosage. It uses NLP techniques such as part-of-speech (POS) tagging and noun chunking. The algorithm identifies specific noun chunks that indicate details about a patient type, specifically whether the patient is adult or pediatric, or whether the dosage is independent of age. These details are then converted to our custom Patient Type entities in the NLP pipeline.

PoS Tagging example for a drug package insert
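The classification step of the Patient Type Tagger might look like the sketch below. It assumes the noun chunks have already been extracted (for example from spaCy's `doc.noun_chunks`) and are passed in as plain strings. The keyword sets, the label names, and the rule that missing or mixed age cues mean "independent of age" are all illustrative assumptions.

```python
# Illustrative cue words; a real taxonomy would be larger.
ADULT_TERMS = {"adult", "adults"}
PEDIATRIC_TERMS = {"pediatric", "paediatric", "child", "children",
                   "infants", "adolescents"}


def classify_patient_type(noun_chunks):
    """Map noun-chunk strings to 'ADULT', 'PEDIATRIC', or 'ALL_AGES'."""
    found = set()
    for chunk in noun_chunks:
        words = set(chunk.lower().split())
        if words & ADULT_TERMS:
            found.add("ADULT")
        if words & PEDIATRIC_TERMS:
            found.add("PEDIATRIC")
    # Assumption: exactly one age cue decides the label; no cue, or cues for
    # both groups, is treated as a dosage that is independent of age.
    if len(found) == 1:
        return found.pop()
    return "ALL_AGES"


print(classify_patient_type(["the recommended dosage", "adult patients"]))  # ADULT
```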

The Dosage Tagger is an algorithm that leverages the recognition of verbs to infer recommended dosage information. Sentences containing dosage information are parsed by matching a pattern to capture a dosage amount, dosage unit, or dosage metric.
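The pattern-matching part of this step can be illustrated with a regular expression that captures a dosage amount (or range), a unit, and an optional metric. This sketch covers only the pattern, not the verb recognition that the Dosage Tagger relies on; the pattern, the list of units and metrics, and the name `extract_dosage` are assumptions made for illustration.

```python
import re

DOSAGE_PATTERN = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"                       # amount (lower bound of a range)
    r"(?:\s*(?:to|-)\s*(?P<high>\d+(?:\.\d+)?))?"   # optional upper bound
    r"\s*(?P<unit>mg|mcg|g|ml)"                     # dosage unit (illustrative list)
    r"(?:/(?P<metric>m2|kg|day))?\b",               # optional metric, e.g. per m2
    re.IGNORECASE,
)


def extract_dosage(sentence):
    """Return the captured dosage fields as a dict, or None if nothing matches."""
    match = DOSAGE_PATTERN.search(sentence)
    if match is None:
        return None
    return {name: value for name, value in match.groupdict().items()
            if value is not None}


print(extract_dosage("The recommended dosage is 50 to 100 mg/m2."))
# {'low': '50', 'high': '100', 'unit': 'mg', 'metric': 'm2'}
```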

These NLP objects are sent to our NLU model. This model relates the indications, patient types, and dosage information in a hierarchical structure. Our model successfully integrates three taxonomies (indication, patient type, and dosage) to create meaningful relationships. This leads to a powerful tool for extracting structured data from free text.
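One simple way to picture the hierarchical structure is a nested mapping from indication to patient type to dosage. The sketch below is an assumption about the shape of the data, not the NLU model itself; the function name `build_hierarchy`, the tuple layout, and the `"ADULT"` label are hypothetical. The sample record uses the Etopophos values discussed in this post.

```python
from collections import defaultdict


def build_hierarchy(records):
    """Nest (indication, patient_type, dosage) records as
    indication -> patient type -> list of dosages."""
    tree = defaultdict(lambda: defaultdict(list))
    for indication, patient_type, dosage in records:
        tree[indication][patient_type].append(dosage)
    # Convert to plain dicts so the result is easy to serialize or compare.
    return {ind: dict(by_patient) for ind, by_patient in tree.items()}


records = [("refractory testicular cancer", "ADULT", "50 to 100 mg/m2")]
hierarchy = build_hierarchy(records)
print(hierarchy)
# {'refractory testicular cancer': {'ADULT': ['50 to 100 mg/m2']}}
```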

Our Findings

The gist below shows the output that our model produces for the drug Etopophos.

Our model understood that the dosage for an adult with refractory testicular cancer is 50 to 100 mg/m2. It got this information by reading through the entire drug package insert (a document containing over a thousand words), finding the relevant sections, and interpreting them. It then stored the information in a knowledge graph to preserve relationships and enable flexible querying.
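To show why a graph makes querying flexible, here is a minimal in-memory triple store holding the Etopophos facts above. It is a sketch under stated assumptions: the class `TripleStore`, the predicate names (`has_dosage`, `for_indication`, `for_patient_type`, `amount`), and the node identifier are hypothetical, and a production system would more likely use a dedicated graph database.

```python
class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching every field that is not None."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


graph = TripleStore()
dose = "Etopophos/dosage-1"  # hypothetical node identifier
graph.add("Etopophos", "has_dosage", dose)
graph.add(dose, "for_indication", "refractory testicular cancer")
graph.add(dose, "for_patient_type", "adult")
graph.add(dose, "amount", "50 to 100 mg/m2")

# Which dosage nodes apply to adults, and what amounts do they carry?
adult_doses = [s for s, _, _ in graph.query(predicate="for_patient_type", obj="adult")]
amounts = [o for d in adult_doses for _, _, o in graph.query(subject=d, predicate="amount")]
print(amounts)  # ['50 to 100 mg/m2']
```

The same store can be queried from any direction (by drug, by indication, or by patient type) without restructuring the data, which is the flexibility a fixed table layout lacks.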

Extension to Other Domains

This approach can be used to uncover hidden relationships in any unstructured text. For example, you could apply it in the financial world to build a knowledge graph of assets and their attributes from sales sheets. In retail, you could create a graph of SKUs and their attributes from product descriptions instead of defining and manually entering everything up front. This approach helps bring order to chaos and unlocks previously unusable data.
