Insights From Medical Documents

Data related to the healthcare industry is estimated to grow at a compound annual growth rate (CAGR) of 36% through 2025. With this significant growth, the prospective benefits of big data in healthcare are unquestionable. This colossal collection of data, elucidatory of an individual’s health conditions and quality of life, is gathered from various sources. Examples of patient-specific data include Electronic Health Records (EHRs), patient portals, biometric data, payer and public records, clinical data from Computerized Physician Order Entry (CPOE), and physicians’ handwritten notes and prescriptions. The sources through which this information is collected range from traditional ones, such as medical devices, wearable devices like activity trackers or smartwatches, and Electronic Patient Records (EPRs), to non-traditional ones, such as Twitter feeds, blogs, Facebook status updates, and other social media posts. Non-patient-specific data is also being collected from sources like news feeds and articles in medical journals, to name a few.

To realize the full potential of big data and artificial intelligence in healthcare means combining the traditional forms of data with the new ones. AI can help detect indications of symptoms, diseases, and medical conditions at their nascent stages. Sharing these insights facilitates a physician’s engagement with patients and helps them cater to each patient’s needs in a more personalized manner. Healthcare insurers also benefit from these inferences. When filing claims with the Centers for Medicare and Medicaid Services (CMS), accurate medical records can positively impact risk adjustment: the insured’s actual medical conditions and post-examination experience are considered over a conservative estimate, leading to revenue generation. AI takes the hassle and guesswork out of the equation, and real-time data analytics helps both insurers and healthcare organizations refine their products and offerings based on large data sets.


In the modern digital world, technologies across industries are delivering better insights and increasing overall efficiency, an initiative driven primarily by AI, ML, and cloud computing. Even so, many healthcare companies still operate without a single unified medical records format. While a significant amount of money is spent on initiatives like monitoring and medication tracking, far less is spent on access to data or on usable medical records. This data continues to exist and grow within insurance providers’ databases, across hospital warehouses, and in doctors’ local databases, yet its ability to drive powerful, conclusive results has not been harnessed. These records contain important information, ranging from insights into every individual who has ever visited a doctor to extensive data on the broader population. If leveraged appropriately, they hold valuable information on people, the ability to foretell hidden warning signs of future diseases, and even the untold answers to drug efficacy questions, including the indicators and the contraindicators. While quality data, especially unstructured data, can yield valuable insights, it runs the risk of inconsistency and thus inaccuracy. One of the most infamous examples of this inaccuracy is the ‘translation’ of poor handwriting on prescriptions.

Solution Approach

A Deep Learning based solution, provided as SaaS (software as a service), ensured ease of access and streamlined the analysis of these documents (in PDF) to increase monthly throughput. The platform was used to create the end-to-end solution, which augmented the manual workflow with automation and rendered only the relevant information instead of requiring a read-through of the whole document.

User flow

When a user requested an analysis of a PDF by uploading a document through the UI, the web server received the request, collected the document, and triggered a ReportInsightSolution™ API call. The API then decomposed and analyzed the PDF using a computer vision pipeline.
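The request flow above can be sketched as a few plain functions. This is an illustrative outline only: the function names (`handle_upload`, `decompose_pdf`, `analyze_pages`) and the form-feed page split are assumptions, not the actual ReportInsightSolution™ API.

```python
# Hypothetical sketch of the upload-to-analysis flow: web server collects
# the document, the API decomposes it, and analysis runs over the pieces.

def decompose_pdf(pdf_bytes: bytes) -> list:
    """Stand-in for the computer-vision decomposition step: split the
    document into per-page payloads (a real pipeline would rasterize
    pages and run layout analysis)."""
    return [{"page": i, "content": chunk}
            for i, chunk in enumerate(pdf_bytes.split(b"\f"))]

def analyze_pages(pages: list) -> dict:
    """Stand-in for the NLP analysis over the decomposed pages."""
    return {"pages_analyzed": len(pages), "terms": []}

def handle_upload(pdf_bytes: bytes) -> dict:
    """Web-server handler: collect the document, trigger the
    decomposition call, and return the combined result to the UI."""
    pages = decompose_pdf(pdf_bytes)
    return analyze_pages(pages)

result = handle_upload(b"page one\fpage two")
print(result["pages_analyzed"])
```

The point of the separation is that decomposition and analysis can evolve (or scale) independently behind a single upload entry point.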

Activities such as 90-degree page rotation, image-level layout analysis, connected component analysis, column/ block finding, and deep-learning-based text extraction from the charts were part of the computer vision pipeline. The raw text obtained was pre-processed and cleaned for further analysis. Following this, text normalization such as performing spell checks, expanding abbreviations was performed. Chart-level keyword matching was used to find the edit distance-based similarity, token-based distance similarity, Rabin-Karp algorithm, and context for matched keywords was derived. QuickUMLS (Unified Medical Language System) package was used for identifying medical terms and their context. NLP model was trained on medical terms like diseases, drugs, personal information (BMI, Smoking, etc.).
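Two of the matching techniques named above can be illustrated in a few lines. This is a minimal sketch, not the production pipeline: the similarity threshold is an assumption, and `difflib`'s ratio stands in here for a normalized edit distance.

```python
# Fuzzy keyword matching (edit-distance style) plus Rabin-Karp exact
# substring search, as used to locate keywords and derive their context.
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; tolerant of OCR/typo noise in extracted text."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rabin_karp(text: str, pattern: str, base: int = 256,
               mod: int = 10**9 + 7) -> int:
    """Return the first index of `pattern` in `text` via rolling hash, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    h = pow(base, m - 1, mod)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Verify on hash match to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            t_hash = ((t_hash - ord(text[i]) * h) * base
                      + ord(text[i + m])) % mod
    return -1

fuzzy_hit = edit_similarity("diabetes", "diabetis") > 0.8   # True
position = rabin_karp("patient denies chest pain", "chest")  # 15
print(fuzzy_hit, position)
```

Fuzzy matching catches misspelled keywords in noisy extracted text, while the rolling-hash search cheaply pins down the exact position of a keyword so its surrounding context can be extracted.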

The project was initiated with the platform automatically creating the required environments. Development images, configured from predefined templates, were installed on-premises as development VMs within the infrastructure. This enabled authentication using LDAP and seamless project setup using Bitbucket, Jenkins, and Docker, ensuring builds and deployments without software compatibility issues.

Data Collection and Normalization

The platform’s data libraries aided in connecting to the different data sources and collecting data from cloud-based storage systems. The platform also helped in the data correction process, including cleansing, standardization, and the removal of stale and extraneous data. A significant part of the data transformation journey while creating the models involved collecting raw, unformatted, unparsed, continuous data.
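The cleansing and standardization steps can be sketched as below. The record shape, field names, and staleness cutoff are assumptions made for illustration, not the client's actual schema.

```python
# Minimal sketch of record cleansing: standardize fields, drop stale or
# extraneous (empty-ID) rows, and de-duplicate on a normalized key.
from datetime import date

def normalize_record(rec: dict) -> dict:
    """Coerce casing and whitespace to one convention."""
    return {
        "patient_id": str(rec.get("patient_id", "")).strip(),
        "diagnosis": str(rec.get("diagnosis", "")).strip().lower(),
        "updated": rec.get("updated", date.min),
    }

def clean(records: list, stale_before: date) -> list:
    seen, out = set(), []
    for rec in map(normalize_record, records):
        key = (rec["patient_id"], rec["diagnosis"])
        if not rec["patient_id"] or rec["updated"] < stale_before or key in seen:
            continue  # extraneous, stale, or duplicate
        seen.add(key)
        out.append(rec)
    return out

raw = [
    {"patient_id": " 17 ", "diagnosis": "Type 2 Diabetes ", "updated": date(2021, 5, 1)},
    {"patient_id": "17", "diagnosis": "type 2 diabetes", "updated": date(2021, 5, 1)},  # duplicate
    {"patient_id": "9", "diagnosis": "asthma", "updated": date(2015, 1, 1)},  # stale
]
cleaned = clean(raw, stale_before=date(2020, 1, 1))
print(len(cleaned))  # 1
```

Normalizing before de-duplication matters: the first two rows differ only in casing and whitespace, and would otherwise both survive.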

Data Exploration and Versioning

The platform’s libraries were used for exploratory data analysis, performing univariate, bivariate, and bag-of-words analyses on both the structured and unstructured datasets. Different datasets and their versions could be controlled and stored using the platform’s data versioning capabilities, allowing for easy storage and retrieval of datasets and files.
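A bag-of-words pass over unstructured notes can be sketched with the standard library alone. The tokenizer and stop-word list here are simplifying assumptions, not the platform's actual EDA tooling.

```python
# Bag-of-words term counts over free-text clinical notes.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "with", "no"}

def bag_of_words(docs: list) -> Counter:
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

notes = [
    "Patient presents with chest pain and shortness of breath.",
    "No chest pain reported; patient denies shortness of breath.",
]
counts = bag_of_words(notes)
print(counts.most_common(3))
```

Even this crude count surfaces the dominant clinical vocabulary in a corpus, which is a useful first look before training any model on the text.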

Model Training and Inference

NLP models were trained using deep-learning-based technology (BiLSTM + CRF) to recognize medical terms. The platform’s framework allowed quick reproduction and retraining of the model development process, enabling data scientists to review the model and its potential limitations more closely. This ultimately helped identify diseases, procedures, drugs, and personal information, and provide the results in multiple formats (JSON, CSV).
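A BiLSTM + CRF tagger emits one label per token; turning those labels into entities and serializable output is a small decoding step that can be sketched independently of the model. The BIO label scheme and field names below are common conventions assumed for illustration.

```python
# Decode per-token BIO labels (as a BiLSTM+CRF tagger would emit) into
# entity spans, then serialize them to the JSON and CSV output formats.
import csv, io, json

def decode_bio(tokens: list, labels: list) -> list:
    """Collapse BIO-tagged tokens into {'text', 'type'} entity spans."""
    entities, current, ctype = [], [], None
    def flush():
        if current:
            entities.append({"text": " ".join(current), "type": ctype})
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):        # beginning of a new entity
            flush()
            current, ctype = [tok], lab[2:]
        elif lab.startswith("I-") and current:  # continuation
            current.append(tok)
        else:                           # outside any entity
            flush()
            current, ctype = [], None
    flush()
    return entities

tokens = ["Patient", "takes", "metformin", "for", "type", "2", "diabetes"]
labels = ["O", "O", "B-DRUG", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE"]
ents = decode_bio(tokens, labels)

as_json = json.dumps(ents)                       # JSON output
buf = io.StringIO()                              # CSV output
writer = csv.DictWriter(buf, fieldnames=["text", "type"])
writer.writeheader()
writer.writerows(ents)
print(as_json)
```

The same decoded entity list feeds both output formats, which is how a single inference run can serve consumers who want JSON and those who want CSV.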

For inference, two pipelines (workflows) were created on the platform. Output from the two pipelines was combined, and duplicate entries, in both the medical terms and their contexts, were removed. The final output was generated in CSV/JSON format, listing the detected medical terms and their respective contexts.
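The merge-and-deduplicate step can be sketched as follows; the record shape (`term`/`context` fields) and case-insensitive key are assumptions for illustration.

```python
# Combine two inference pipelines' outputs, dropping entries that are
# duplicates on both the medical term and its context.
import json

def merge_outputs(pipeline_a: list, pipeline_b: list) -> list:
    seen, merged = set(), []
    for rec in pipeline_a + pipeline_b:
        key = (rec["term"].lower(), rec["context"].lower())
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged

a = [{"term": "hypertension", "context": "history of hypertension"}]
b = [{"term": "Hypertension", "context": "History of hypertension"},
     {"term": "aspirin", "context": "daily aspirin 81mg"}]
merged = merge_outputs(a, b)
print(json.dumps(merged))
```

Keying on both term and context (rather than term alone) keeps distinct mentions of the same term in different contexts, while still collapsing true duplicates across the two pipelines.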


The solution enabled the client to increase the efficiency of clinical notes analysis by 42%. This implementation helped our client scale up their operations and expand without increasing the number of employees, improving monthly throughput through a streamlined man-machine workflow.

How the platform can help healthcare organizations transform their journey to cognitive AI solutions

The platform is an AI/ML Application Lifecycle Management Platform. It enables complete lifecycle management of AI/ML solutions, addressing the AI transformation journey of enterprises on any cloud platform of choice. It offers functionality essential for building AI/ML solutions, primarily enabling data scientists to rapidly build predictive and prescriptive models. The platform provides a user-friendly interface to develop, deploy, and manage AI/ML solutions at scale. In addition, it supports the incorporation of these solutions into business processes, surrounding infrastructure, products, and applications.

Key benefits of the platform include:

  • Empowers data scientists to transform AI/ML research into solutions 
  • Improves the productivity of data scientists by enabling them to focus on the business problem, algorithm development, and rapid model experimentation 
  • Addresses the shortage of skilled data science resources with automated workflows, toolkits and frameworks 
  • Manages AI transformation journey costs without any wastage of R&D efforts 
  • Provides an enterprise-ready and secure environment for complete lifecycle management of AI/ML applications
  • Enables at-scale deployment of enterprise AI/ML applications in on-premises, cloud (AWS, GCP, Azure), or hybrid environments

Additional details on the platform are available on request, and we can schedule a demo for anyone interested in learning more.

Have Any Questions?

Need more information about the platform?