Data Normalization Framework 2.0

From SHARP Project Wiki
Revision as of 18:14, 28 May 2013 by TBleeker (talk | contribs)


Normalizing data within a healthcare environment means taking all or part of an electronic healthcare document and transforming it into structured data built on standard underpinnings for terms, measurements, and so on. This documentation refers to the software that does this as the Data Normalization framework. You may also see it referred to as a pipeline, though it is not a pipeline in the usual sense of transporting something unchanged; it is more like a manufacturing plant that consumes raw materials and produces something else.

If you have not done so already, please study our thoughts and methodology on data normalization. On the surface, data normalization involves several parts of an application:

  • The incoming data
  • A transformation of the data
  • The terminology services
  • A place to store the results

This can be accomplished in a number of ways. Specialized software is added to process incoming data and pass it to the appropriate processing pipeline. Clinical Element Models (CEMs) are used as the models for storing the pipeline results. From there, CEMs can be turned into any number of forms. In 2.0 we place them into a CouchDB database; previous releases used a MySQL database.
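The four parts listed above can be sketched as a chain of stages. This is a minimal illustration only: the function names, the dictionary-based CEM stand-in, and the in-memory "database" are assumptions for clarity, not the SHARP codebase's actual structure.

```python
# Illustrative sketch of the pipeline stages. All names here are
# hypothetical; a real deployment uses Mirth Connect, UIMA-AS, and
# CouchDB rather than these toy functions.

def receive(raw_document):
    """Accept an incoming document (in practice, via Mirth Connect)."""
    return raw_document.strip()

def transform(document):
    """Syntactic step: map raw content into a CEM-like dict.

    A real pipeline drives this from a model-mapping configuration file.
    """
    return {"observation": document, "code": "local:1234"}

def normalize_terms(cem, terminology):
    """Semantic step: replace local codes with standardized codes."""
    cem["code"] = terminology.get(cem["code"], cem["code"])
    return cem

def store(cem, database):
    """Persist the resulting CEM (release 2.0 targets CouchDB)."""
    database.append(cem)
    return cem

db = []
terminology = {"local:1234": "LOINC:8480-6"}  # hypothetical lookup table
result = store(normalize_terms(transform(receive("  120 mmHg  ")), terminology), db)
```

The point of the sketch is the ordering: syntactic transformation happens before semantic normalization, and storage is the final step.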

Out of the box, the Data Normalization pipeline accepts a limited number of healthcare document types; they are:

  1. HL7 messages
  2. CCD (Continuity of Care Document)
  3. CDA (HL7 Clinical Document Architecture)
  4. XML table (SHARP defined)
  5. (being evaluated) cCDA (HL7 Consolidated Clinical Document Architecture)

CCDs and CDAs are already XML documents, but HL7 messages are not. Since the pipeline transforms XML, HL7 messages are first converted into XML; this is one of the functions of Mirth Connect. In essence, the Data Normalization pipeline's syntactic processing step processes XML and nothing else. Finally, a generic XML form is available: if nothing else fits, you can transform your data into this form before sending it through the Data Normalization pipeline.
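To make the HL7-to-XML conversion concrete, here is a simplified sketch of turning a pipe-delimited HL7 v2 message into an XML tree. In the real pipeline Mirth Connect performs this conversion; the element names below (`HL7Message`, `MSH.1`, and so on) are illustrative and ignore HL7 quirks such as the MSH field-separator convention, so they should not be taken as the schema the pipeline actually expects.

```python
# Simplified HL7 v2 -> XML conversion. Element naming here is an
# assumption for illustration; Mirth Connect handles this step (and the
# full HL7 grammar) in a real deployment.
import xml.etree.ElementTree as ET

def hl7_to_xml(message):
    """Turn a pipe-delimited HL7 v2 message into a simple XML string."""
    root = ET.Element("HL7Message")
    for line in message.strip().splitlines():
        fields = line.split("|")
        segment = ET.SubElement(root, fields[0])   # e.g. MSH, PID, OBX
        for i, value in enumerate(fields[1:], start=1):
            field = ET.SubElement(segment, "%s.%d" % (fields[0], i))
            field.text = value
    return ET.tostring(root, encoding="unicode")

sample = "MSH|^~\\&|LAB|HOSP\nPID|1|12345"
xml_text = hl7_to_xml(sample)
```

Once in this form, the message can flow through the same syntactic processing step as the natively XML document types.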

For a more granular view, see the diagram below describing the steps of the process.

Picture of the steps in the pipeline

  1. Data is passed from an organization to the data normalization framework.
    • External to the cloud environment this can be through NwHIN Connect (optional).
    • Internally this is simply via the file system.
  2. Mirth Connect transforms incoming data if necessary.
    • HL7 text messages to XML appropriate to the configuration of the pipeline.
    • CCD and CDA documents are already in XML format.
  3. Transform the incoming XML documents to objects represented by CEMs.
    • Syntactic (extract values from XML and place into CEM fields based on configuration file)
    • Semantic (transform local codes to standardized codes based on configuration file)
  4. Mirth Connect pulls CEMs that have been processed.
  5. Mirth Connect loads the CEM XML results into a database for further consumption and research.
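Steps 4 and 5 above can be sketched as follows: a processed CEM XML result is flattened into a JSON document ready for CouchDB. The CEM fragment, its element names, and the resulting document structure are all assumptions made for illustration, not the pipeline's actual output schema.

```python
# Illustrative only: this CEM fragment and the flattened document
# layout are hypothetical, not the SHARP pipeline's real schema.
import json
import xml.etree.ElementTree as ET

CEM_XML = """
<cem type="SystolicBloodPressureMeas">
  <value unit="mmHg">120</value>
  <code system="LOINC">8480-6</code>
</cem>
"""

def cem_to_couch_doc(xml_text):
    """Flatten a CEM XML result into a JSON-ready dict for CouchDB."""
    root = ET.fromstring(xml_text)
    return {
        "type": root.get("type"),
        "value": root.find("value").text,
        "unit": root.find("value").get("unit"),
        "code": root.find("code").text,
        "codeSystem": root.find("code").get("system"),
    }

doc = cem_to_couch_doc(CEM_XML)
payload = json.dumps(doc)
# In a live system, Mirth Connect (or any HTTP client) would POST this
# payload to the CouchDB document API; the database name and URL would
# depend on your installation.
```

Because CouchDB stores JSON documents and exposes them over HTTP, the stored CEMs remain easy to query for downstream research use.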

A default configuration is provided; however, incoming data is not the same from one institution to another, so the expectation is that you will change this configuration to match your incoming data. If you make no changes to the configuration, it is assumed that you are simply learning how the pipeline works by running a test.


Installing the Data Normalization framework will involve these basic steps:

  • Download the code from the sharpn/datan SourceForge repository.
  • Download, install, and configure Mirth Connect.
  • Set up the Mirth Connect channels for feeding the pipeline.
  • Download, install, and configure UIMA-AS.
  • Configure the plug-in and connector.
  • Download, install, and configure CouchDB.
  • Set up the Mirth Connect channels for pulling the CEM results into a database.


Once you have a working system, whether pre-built or installed from scratch, there are a couple more things to consider:

Among the resources for this project you will find the pipeline configuration files. Some configuration files determine how the model of the incoming data changes (model mapping); others handle terminology standardization (semantic mapping). Configuration files are plain text files that you can edit. To change them to meet your needs, you will need someone who understands the format of your incoming data and the values that can appear in your models.
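As a sketch of how a semantic-mapping configuration might be consumed, the snippet below reads local-code-to-standard-code pairs from a plain-text file and applies one. The tab-delimited layout and the codes shown are hypothetical; the real configuration format is defined by the pipeline's own configuration-file documentation.

```python
# Sketch under assumptions: the tab-delimited format and the codes in
# MAPPING_FILE are hypothetical, not the pipeline's actual file format.
import io

# Stand-in for a plain-text semantic mapping file on disk.
MAPPING_FILE = io.StringIO(
    "GLU\tLOINC:2345-7\n"   # local lab code -> standardized code
    "NA\tLOINC:2951-2\n"
)

def load_semantic_map(handle):
    """Read local-code -> standard-code pairs, one per line."""
    mapping = {}
    for line in handle:
        local, standard = line.strip().split("\t")
        mapping[local] = standard
    return mapping

semantic_map = load_semantic_map(MAPPING_FILE)
# Unmapped codes fall through unchanged, so gaps in the mapping are
# visible in the output rather than silently dropped.
standardized = semantic_map.get("GLU", "GLU")
```

Whatever the actual file format, the editing task is the same: someone who knows your local codes pairs each one with its standardized equivalent.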

Use the documentation for configuration files to customize the Data Normalization pipeline.