NLP Research

From SHARP Project Wiki

Natural Language Processing (NLP) In Research

The clinical and research medical community creates, manages and uses a wide variety of semi-structured and unstructured textual documents. Performing research, improving standards of care and evaluating treatment outcomes easily, and ideally in an automated fashion, requires access to the content of these documents. The knowledge contained in unstructured textual documents (e.g., pathology reports, clinical notes) is critical to achieving all of these goals. For instance, clinical research usually requires the identification of cohorts that follow precisely defined patient- and disease-related inclusion and exclusion parameters. Biomedical NLP systems extract structured information from textual reports, facilitating searching, comparison and summarization.

Video (YouTube, l1WJhfAz_CY): Natural Language Processing. Guergana Savova, PhD; SHARPn Co-Investigator; Assistant Professor at Harvard's Children's Hospital Boston. Dr. Savova discusses the steps of natural language processing, or information extraction, from clinical narrative.


Releases - Download, install, configure, and use the software produced.

Natural Language Processing Releases
cTAKES is currently in the process of becoming an Apache project. No official releases are available yet, but this is where the community now hangs out.

Presentations - Presentations made or found during the course of this grant that are relevant to this project.

Documents - Documents created by or used by this project.

References - Additional resources relevant to this project.

Introduction to Natural Language Processing (NLP) in the clinical domain

…because a lot of clinical data is captured in free-text notes.
Extracting structured information from free text facilitates…
…to enable research, improve standards of care and evaluate outcomes easily.

NLP systems can extract structured information from these notes so that the information they contain can be searched (for example, for a diagnosis), compared (perhaps to find common co-morbidities with a certain diagnosis), and summarized.
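The three uses named above (searching, comparing and summarizing) can be sketched in a few lines of Python. This is a minimal illustration only: the record layout, field names and sample values are assumptions made for the example and do not reflect the actual cTAKES output format.

```python
from collections import Counter

# Hypothetical output of an NLP pipeline: one record per patient, listing the
# diagnoses mentioned in that patient's notes. Illustrative values only.
extracted = [
    {"patient_id": "p1", "diagnoses": {"diabetes mellitus", "hypertension"}},
    {"patient_id": "p2", "diagnoses": {"diabetes mellitus", "obesity", "hypertension"}},
    {"patient_id": "p3", "diagnoses": {"asthma"}},
]

def search(records, diagnosis):
    """Search: return the patient ids whose notes mention the diagnosis."""
    return [r["patient_id"] for r in records if diagnosis in r["diagnoses"]]

def comorbidities(records, diagnosis):
    """Compare: count the other diagnoses that co-occur with the diagnosis."""
    counts = Counter()
    for r in records:
        if diagnosis in r["diagnoses"]:
            counts.update(r["diagnoses"] - {diagnosis})
    return counts

def summarize(records):
    """Summarize: count how many patients have each diagnosis."""
    return Counter(d for r in records for d in r["diagnoses"])

diabetic_patients = search(extracted, "diabetes mellitus")
common_with_diabetes = comorbidities(extracted, "diabetes mellitus")
overall = summarize(extracted)
```

Once the information is in a structured form like this, each question becomes a short query rather than a manual read through the notes.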

Therefore, NLP is a critical component in SHARP Area Four. It facilitates the use of clinical narratives in the same way as structured data for high-throughput phenotyping, decision support at the point of care, and evaluation of health care delivery outcomes.


The SHARPn NLP team is currently working on improving the functionality, interoperability, and usability of a clinical NLP system, the Clinical Text Analysis and Knowledge Extraction System (cTAKES).

  • Functionality - continue translating NLP research outcomes into improvements to cTAKES
  • Interoperability - work with Clinical Element Models and high throughput phenotyping programs for interoperable systems
  • Usability - improve the usability of cTAKES through adopting standards and investigating NLP use cases

Additionally, SHARPn NLP plans to be a delivery platform for open source clinical NLP systems through the Open Health Natural Language Processing (OHNLP) consortium and welcomes contributions from clinical NLP researchers.

Project Team

Thanks go to the Natural Language Processing team.

Clinical Natural Language Processing (cNLP)

Overarching goal: High-throughput phenotype extraction from clinical free text based on standards and the principle of interoperability

Focus:

  • Information extraction (IE): transformation of unstructured text into structured representations
  • Merging clinical data extracted from free text with structured data
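As a rough illustration of both focus points, the sketch below turns a free-text note into a structured list of concept codes and merges it with an existing structured record. It assumes a simple dictionary lookup, which is far simpler than what a system such as cTAKES does; the vocabulary, the concept codes and the field names are hypothetical placeholders, and the lookup does not handle negation or context.

```python
import re

# Hypothetical mini-vocabulary mapping surface phrases to concept codes.
# The codes are placeholders, not entries from a real terminology.
VOCAB = {
    "type 2 diabetes": "C-DM2",
    "hypertension": "C-HTN",
}

def extract_concepts(note):
    """Return a sorted list of concept codes whose phrase appears in the note."""
    text = note.lower()
    found = set()
    for phrase, code in VOCAB.items():
        # Whole-word match so that a phrase is not found inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found.add(code)
    return sorted(found)

def merge(structured, note):
    """Combine a structured record with concepts extracted from free text."""
    merged = dict(structured)  # copy, so the input record is not modified
    merged["nlp_concepts"] = extract_concepts(note)
    return merged

record = {"patient_id": "p1", "age": 57}
note = "Patient has a history of hypertension and Type 2 diabetes."
result = merge(record, note)
```

Here `result` holds the original structured fields plus an `nlp_concepts` list, so downstream code can query both sources of data in the same way.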

cNLP Specific Aim 1: Clinical concept and event discovery from the clinical narrative

  • (1) defining a set of clinical events and a set of attributes to be discovered
  • (2) identifying standards to serve as templates for attribute/value pairs
  • (3) creating a "gold standard" through the development of annotation schema, guidelines, and annotation flow, and evaluating the quality of the gold standard
  • (4) identifying relevant controlled vocabularies and ontologies for broad clinical event coverage
  • (5) providing methodological support for a broad array of clinical event discovery and template population
  • (6) extending Mayo Clinic's clinical Text Analysis and Knowledge Extraction System (cTAKES) information model, and implementing best-practice solutions for clinical event discovery.

cNLP Specific Aim 2: Relation discovery among the clinical events discovered in Aim 1

  • (1) defining a set of relevant relations
  • (2) identifying standards-based information models for templated normalization
  • (3) creating a gold standard through the development of an annotation schema, guidelines, and annotation flow, and evaluating the quality of the gold standard
  • (4) developing and evaluating methods for relation discovery and template population
  • (5) implementing high-throughput scalable phenotype extraction solutions as annotators in cTAKES and UIMA-AS, either within an institution’s local network or as a cloud-based deployment integrated with the institution’s virtual private network.
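To make the terms used in the two aims more concrete, the following sketch shows one possible way to represent discovered events with attribute/value pairs, typed relations between them, and a populated template. The class names, attribute names and the "treats" relation are assumptions chosen for illustration; they are not the cTAKES information model or any of the standards the project refers to.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalEvent:
    """A discovered event plus attribute/value pairs (hypothetical schema)."""
    event_id: str
    event_type: str
    text: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Relation:
    """A typed, directed link between two events (hypothetical schema)."""
    relation_type: str
    source_id: str
    target_id: str

def populate_template(events, relations, event_id):
    """Fill a flat template for one event, listing the events related to it."""
    by_id = {e.event_id: e for e in events}
    event = by_id[event_id]
    template = {"type": event.event_type, "text": event.text, **event.attributes}
    for rel in relations:
        if rel.source_id == event_id:
            # Group related event texts under the relation type.
            template.setdefault(rel.relation_type, []).append(by_id[rel.target_id].text)
    return template

events = [
    ClinicalEvent("e1", "medication", "metformin", {"negated": False}),
    ClinicalEvent("e2", "disorder", "diabetes", {"negated": False}),
]
relations = [Relation("treats", "e1", "e2")]
template = populate_template(events, relations, "e1")
```

In this toy example the template for the medication event carries its own attributes and the disorder it is related to; in practice the events and relations would be produced by trained annotators rather than written by hand.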