Technical Approach
The unstructured drug package insert files must be cleaned before the NLP pipeline can process them. Cleaning consists of removing extra whitespace, normalizing case, and removing special characters. We then collate the symptom descriptions to build a working taxonomy, and use an object-oriented approach to create custom tags that seed the development of a knowledge graph. Next, the cleaned text is processed by spaCy, an open-source NLP library. Finally, the data passes through custom domain-specific components. For this healthcare application, we have developed three custom named entity recognition components: the Indication Tagger, the Patient Type Tagger, and the Dosage Tagger.
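The cleaning step described above could look roughly like the following. This is a minimal sketch, not our production code: the function name `clean_text` and the exact set of characters that are kept are assumptions chosen so that dosage strings such as "50 mg/m2" survive cleaning.

```python
import re


def clean_text(raw: str) -> str:
    """Case-normalize, drop special characters, and collapse extra whitespace."""
    text = raw.lower()
    # Replace anything outside an assumed allow-list with a space.
    # Letters, digits, whitespace, and . / % - are kept so dosages stay intact.
    text = re.sub(r"[^a-z0-9\s./%-]", " ", text)
    # Collapse runs of spaces, tabs, and newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("  The Recommended\tDOSAGE®  is 50 mg/m2.\n"))
```

In practice the allow-list would be tuned to the documents at hand, since an overly aggressive filter can remove characters that carry meaning in dosage expressions.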
Custom NLP Components for Healthcare
The Indication Tagger is a greedy matching algorithm that recognizes indications from the accumulated taxonomy. The algorithm is “greedy” because, when several taxonomy terms match at the same position, it extracts the longest one. For example, consider the sentence “The recommended dosage of Keytruda for adult patients with adenocarcinoma of the breast is 50 mg/m2.” It is preferable to extract “adenocarcinoma of the breast” instead of “adenocarcinoma” by itself. By performing greedy matching, the Indication Tagger preserves the rich information that is available in the unstructured data.
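To make the greedy behaviour concrete, here is a small longest-match-first sketch in plain Python. It is an illustration under stated assumptions rather than the Indication Tagger itself: the function name `tag_indications` and the two-term taxonomy are hypothetical, and the input is assumed to be whitespace-tokenizable text whose taxonomy terms are already lowercased.

```python
def tag_indications(text: str, taxonomy: set[str]) -> list[str]:
    """Return taxonomy terms found in text, preferring the longest match."""
    tokens = text.lower().split()
    # Longest taxonomy term, measured in words (0 if the taxonomy is empty).
    max_len = max((len(term.split()) for term in taxonomy), default=0)
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first so longer terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in taxonomy:
                found.append(candidate)
                match_len = n
                break
        # Skip past a match, or advance one token if nothing matched.
        i += match_len or 1
    return found


# Hypothetical two-term taxonomy (terms assumed to be lowercased).
taxonomy = {"adenocarcinoma", "adenocarcinoma of the breast"}
sentence = (
    "The recommended dosage of Keytruda for adult patients with "
    "adenocarcinoma of the breast is 50 mg/m2."
)
print(tag_indications(sentence, taxonomy))  # ['adenocarcinoma of the breast']
```

Because the longer span is tried first, the shorter term "adenocarcinoma" is only returned when the longer phrase is absent.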
The Patient Type Tagger is an algorithm for labeling the target patient for a dosage. It uses NLP techniques such as part-of-speech (POS) tagging and noun chunking. The algorithm identifies noun chunks that indicate details about the patient type, specifically whether the patient is adult or pediatric, or whether the dosage is independent of age. These details are then converted to our custom Patient Type entities in the NLP pipeline.
POS tagging example for a drug package insert
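The final step of the Patient Type Tagger, mapping noun chunks to a patient type, can be sketched as follows. The sketch assumes the noun chunks have already been produced upstream (for example by a noun chunker such as the one spaCy provides) and are passed in as plain strings. The cue words, the label names, and the function `classify_patient_type` are hypothetical.

```python
# Hypothetical cue words for each patient type label.
PATIENT_TYPE_CUES = {
    "adult": {"adult", "adults"},
    "pediatric": {"pediatric", "child", "children", "infant", "infants"},
}


def classify_patient_type(noun_chunks: list[str]) -> str:
    """Return the first patient type signalled by a noun chunk.

    Falls back to "any_age" when no chunk mentions a patient type,
    which stands in for a dosage that is independent of age.
    """
    for chunk in noun_chunks:
        words = set(chunk.lower().split())
        for label, cues in PATIENT_TYPE_CUES.items():
            if words & cues:
                return label
    return "any_age"


print(classify_patient_type(["the recommended dosage", "adult patients"]))
```

A real component would attach the resulting label to the matching span as a custom entity rather than return a string for the whole sentence.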
The Dosage Tagger is an algorithm that uses recognized verbs to infer recommended dosage information. Sentences containing dosage information are parsed by matching a pattern that captures a dosage amount, dosage unit, or dosage metric.
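The pattern-matching part of this step can be approximated with a regular expression. This sketch covers only the pattern step, not the verb recognition, and it is a simplified stand-in rather than the Dosage Tagger's actual pattern: the group names, the short list of units and metrics, and the function `extract_dosages` are all assumptions.

```python
import re

# Assumed, deliberately short lists of units and metrics.
DOSAGE_PATTERN = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"                      # amount, or lower bound of a range
    r"(?:\s*(?:to|-)\s*(?P<high>\d+(?:\.\d+)?))?"  # optional upper bound
    r"\s*(?P<unit>mg|mcg|g|ml)\b"                  # dosage unit
    r"(?:/(?P<metric>m2|kg|day))?",                # optional metric, e.g. per m2
    re.IGNORECASE,
)


def extract_dosages(sentence: str) -> list[dict[str, str]]:
    """Return one dict per dosage found, omitting groups that did not match."""
    return [
        {name: value for name, value in match.groupdict().items() if value is not None}
        for match in DOSAGE_PATTERN.finditer(sentence)
    ]


print(extract_dosages("The recommended dosage is 50 to 100 mg/m2."))
```

Keeping the amount, unit, and metric in separate named groups makes it straightforward to turn each match into a structured dosage entity later.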
The entities produced by these components are sent to our natural language understanding (NLU) model. This model relates the indications, patient types, and dosage information in a hierarchical structure. By integrating the three taxonomies (indication, patient type, and dosage), the model links each dosage to the indication and patient type it applies to, which turns free text into structured data.
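One way to picture the hierarchical structure is as a nested mapping. The nesting order used here (indication, then patient type, then dosages) and the function name `build_hierarchy` are assumptions made for illustration; the sketch only shows how tagged entities might be grouped, not how the NLU model decides which entities belong together.

```python
from collections import defaultdict


def build_hierarchy(records: list[tuple[str, str, str]]) -> dict[str, dict[str, list[str]]]:
    """Group (indication, patient_type, dosage) records into a nested mapping."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for indication, patient_type, dosage in records:
        hierarchy[indication][patient_type].append(dosage)
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {ind: dict(by_patient) for ind, by_patient in hierarchy.items()}


records = [("refractory testicular cancer", "adult", "50 to 100 mg/m2")]
print(build_hierarchy(records))
```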
Our Findings
The gist below shows the output that our model produces for the drug Etopophos.
Our model determined that the dosage for an adult with refractory testicular cancer is 50 to 100 mg/m2. It obtained this information by reading through the entire drug package insert (a document containing over a thousand words), finding the relevant sections, and interpreting them. It then stored the information in a knowledge graph to preserve relationships and enable flexible querying.
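To illustrate why a graph helps with querying, the finding above can be stored as subject-relation-object triples and looked up from any direction. This is a toy in-memory sketch, not our storage layer: the `KnowledgeGraph` class, the relation names, and the intermediate `regimen_1` node are hypothetical.

```python
class KnowledgeGraph:
    """A tiny triple store: each fact is a (subject, relation, object) tuple."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None, obj=None):
        """Return sorted triples matching every field that is not None."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)
        )


kg = KnowledgeGraph()
# Hypothetical node and relation names; the values come from the example above.
kg.add("Etopophos", "has_regimen", "regimen_1")
kg.add("regimen_1", "indication", "refractory testicular cancer")
kg.add("regimen_1", "patient_type", "adult")
kg.add("regimen_1", "dosage", "50 to 100 mg/m2")

print(kg.query(subject="regimen_1", relation="dosage"))
```

The same store answers the reverse question, such as which regimens apply to adults, by querying on the relation and object instead of the subject.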
Extension to Other Domains
This approach can be used to surface implicit relationships in any unstructured text. For example, you could apply it in the financial world to build a knowledge graph of assets and their attributes from sales sheets. In retail, you could create a graph of SKUs and their attributes from product descriptions instead of having everything defined and manually entered up front. This approach helps bring order to chaos and unlocks previously unusable data.