Data Normalization Pipeline 1.0

From SHARP Project Wiki

Normalizing data within a healthcare environment means taking all or part of an electronic healthcare record and transforming it into structured data that uses standard terminologies, units of measurement, and similar conventions. This documentation refers to that process as the Data Normalization pipeline. It is not a pipeline in the usual sense of something that transports material from one place to another. It is closer to a manufacturing plant that consumes raw materials and produces something else.

If you have not already done so, please study our thoughts and methodology on data normalization. On the surface, data normalization involves several parts of an application:

  • The incoming data
  • A transformation pipeline for the data
  • The terminology services
  • A place to store the results

This can be accomplished in a number of ways. The architecture team has defined a generalized architecture, found here. Specialized software is added to process incoming data and pass it to the appropriate processing pipeline. Clinical Element Models (CEMs) are the models used to hold the pipeline results, and the resulting CEM instances are also stored in a database.

Out of the box, the Data Normalization pipeline accepts a limited number of healthcare document types as input:

  1. HL7 messages
  2. CCD (Continuity of Care Document)
  3. CDA (HL7 Clinical Document Architecture)
  4. XML table (SHARP defined)

CCDs and CDAs are already XML documents, but HL7 messages are not. Because the pipeline transforms XML, HL7 messages are first converted into XML. This is one of the functions of Mirth Connect. In essence, the Data Normalization pipeline's syntactic processing step processes XML and nothing else. Finally, a generic XML form is available. If your data fits none of the other types, you can transform it into this form before sending it through the data normalization pipeline.
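To make the HL7-to-XML step concrete, here is a minimal Python sketch of turning a pipe-delimited HL7 v2 message into XML. It is only an illustration of the idea, not how Mirth Connect does it: the function name `hl7_to_xml`, the element naming scheme (`PID.3` and so on), and the sample message are all assumptions, and components (`^`) and repetitions (`~`) are deliberately left unsplit.

```python
import xml.etree.ElementTree as ET


def hl7_to_xml(message: str) -> ET.Element:
    """Convert a pipe-delimited HL7 v2 message into a simple XML tree.

    Hypothetical layout: one element per segment, one child per non-empty field.
    """
    root = ET.Element("HL7Message")
    # HL7 v2 separates segments with carriage returns; accept newlines as well.
    normalized = message.replace("\r\n", "\r").replace("\n", "\r")
    for line in normalized.split("\r"):
        if not line.strip():
            continue
        fields = line.split("|")
        name = fields[0]  # segment identifier, e.g. MSH, PID, OBX
        segment = ET.SubElement(root, name)
        # In MSH the field separator itself counts as MSH.1,
        # so the first value after the split is MSH.2.
        start = 2 if name == "MSH" else 1
        for index, value in enumerate(fields[1:], start=start):
            if value:
                ET.SubElement(segment, f"{name}.{index}").text = value
    return root


# Made-up sample message, for illustration only.
sample = (
    "MSH|^~\\&|LAB|HOSP|||20120101||ORU^R01|1|P|2.3\r"
    "PID|1||12345||DOE^JOHN\r"
    "OBX|1|NM|GLU^Glucose||105|mg/dL"
)
root = hl7_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

A real deployment would rely on the Mirth Connect channel configuration for this conversion. The sketch only shows why the rest of the pipeline can assume XML input.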

For a more granular diagram see the image below describing the steps of the process.

Picture of the steps in the pipeline

  1. Data is passed from an organization to the data normalization pipeline.
    • External to the cloud environment this is through NwHIN Connect.
    • Internally this is simply via the file system.
  2. Mirth Connect sorts HL7 messages into various types.
    • Mirth Connect transforms incoming data from HL7 messages, CCD and CDA documents to XML appropriate to the configuration of the pipeline.
  3. Transform the incoming XML documents to objects represented by CEMs.
    • Syntactic (extract values from XML and place into CEM fields based on configuration file)
    • Semantic (transform local codes to standardized codes based on configuration file)
  4. The CEM XML output is sent to Mirth Connect.
  5. Mirth Connect stores the CEM XML result in a database for further consumption and research.
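The two halves of step 3 can be sketched in a few lines of Python. This is an assumption-laden illustration, not the pipeline's actual implementation: the mapping dictionaries stand in for the configuration files, and the field names, XML paths, and sample document are all hypothetical. The syntactic step copies values out of the XML into named fields. The semantic step then swaps local codes for standardized ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical model mapping: target field name -> path in the incoming XML.
MODEL_MAPPING = {
    "patient_id": "patient/id",
    "lab_code": "observation/code",
    "value": "observation/value",
    "unit": "observation/unit",
}

# Hypothetical semantic mapping: per field, local code -> standardized code.
SEMANTIC_MAPPING = {
    "lab_code": {"GLU": "2345-7"},  # illustrative LOINC code for serum glucose
    "unit": {"mg/dl": "mg/dL"},
}


def syntactic_step(xml_text, mapping):
    """Extract values from the XML and place them into named fields."""
    root = ET.fromstring(xml_text)
    return {field: root.findtext(path) for field, path in mapping.items()}


def semantic_step(record, mapping):
    """Replace local codes with standardized codes where a mapping exists."""
    normalized = dict(record)
    for field, codes in mapping.items():
        local = record.get(field)
        if local is not None:
            # Unmapped codes pass through unchanged in this sketch.
            normalized[field] = codes.get(local, local)
    return normalized


# Made-up incoming document, for illustration only.
incoming = (
    "<lab><patient><id>12345</id></patient>"
    "<observation><code>GLU</code><value>105</value><unit>mg/dl</unit></observation></lab>"
)
extracted = syntactic_step(incoming, MODEL_MAPPING)
normalized = semantic_step(extracted, SEMANTIC_MAPPING)
print(normalized)
```

Letting unmapped codes pass through is a choice made for this sketch. A production setup might instead flag them for review so that gaps in the terminology mapping are noticed.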

A default configuration is provided. However, incoming data differs from one institution to another, so you are expected to change this configuration to match your incoming data. If you make no changes to the configuration, it is assumed that you are running a test to learn how the pipeline works.

Getting Started

As you get started, it is worth noting that there are two implementations of the Data Normalization pipeline. One is based on templates and the other on XMLTree. The 1.0 offering is currently the XMLTree implementation. The choice between the two does not typically change what a user does. Where there is a difference for the user, this documentation includes a note.

There are a couple of ways to take advantage of the data normalization software provided by SHARP:


  1. No download or install. Simply start an instance of an image on the cloud. Configure it, and run your data.
    • The cloud image has the connectivity software and the pipeline installed. You can use the pipeline alone and ignore the connectivity software or you can use them together.
  2. Obtain hardware, then download, build, and install from the code repository. Configure it and run your data.
    • The code downloaded from SourceForge is the pipeline only. You can set up the pipeline and feed it XML files for processing by configuring an input location.
    • If desired, you can also add ways to feed the pipeline. Installing the connectivity software requires an existing pipeline.
    • In addition to feeding the pipeline, Mirth Connect can be used to save the resulting CEM XMLs to a database.
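When feeding the pipeline XML files through an input location, it can help to check beforehand that the files are well-formed. The following Python sketch shows one way to do that. The helper names and the idea of a pre-flight check are assumptions for illustration and are not part of the SHARP software.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def collect_input_files(input_dir):
    """Return the .xml files found in the input location, sorted by name."""
    return sorted(Path(input_dir).glob("*.xml"))


def split_well_formed(paths):
    """Separate files that parse as XML from files that do not."""
    ready, rejected = [], []
    for path in paths:
        try:
            ET.parse(path)
        except ET.ParseError:
            rejected.append(path)
        else:
            ready.append(path)
    return ready, rejected
```

A typical use would be to call `split_well_formed(collect_input_files(some_dir))`, where `some_dir` is whatever directory you configured as the input location, and to fix or set aside the rejected files before starting a run.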


Once you have a working system, whether pre-built or installed, there are a couple more things to consider:

Among the resources for this project you will find the pipeline configuration files. Some configuration files determine how the model of the incoming data will change (model mapping), and others handle terminology standardization (semantic mapping). Configuration files are plain text files that you can edit. To change them to meet your needs, you will need someone who understands the format of your incoming data and the values that can appear in your models.
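One practical way to use that knowledge of your data is to check whether every local code you expect to see has an entry in the semantic mapping. The Python sketch below illustrates the idea. The `local | standard` line layout is purely hypothetical and is not the format of the SHARP configuration files, which is described in the configuration file documentation. The function names are likewise assumptions.

```python
def load_mapping(text):
    """Parse a hypothetical 'local | standard' mapping, one entry per line."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"line {number}: expected 'local | standard', got {line!r}")
        mapping[parts[0]] = parts[1]
    return mapping


def unmapped_codes(observed, mapping):
    """Return the observed local codes that have no mapping entry, sorted."""
    return sorted(set(observed) - set(mapping))


# Made-up mapping text and observed codes, for illustration only.
example_text = """
# local code | standardized code
GLU | 2345-7
NA  | 2951-2
"""
example_mapping = load_mapping(example_text)
missing = unmapped_codes(["GLU", "K", "NA", "K"], example_mapping)
print(missing)
```

Running a check of this kind against a sample of real incoming data gives a concrete list of codes that still need a mapping entry before the configuration is used in earnest.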

Use the documentation for configuration files to customize the Data Normalization pipeline.