Data Normalization Pipeline 1.0 Download

From SHARP Project Wiki


The Data Normalization Pipeline can be invoked on its own, that is, without Mirth Connect (the connectivity software). However, this presumes that you have the appropriate input data. Sample data is distributed for testing. While the connectivity software is technically optional, you will find it indispensable for feeding the pipeline and for transforming the resulting output.

This pipeline install gets you step 3 in the overview: everything within the Data Normalization box, from the Initializer to Generate CEM.


Prerequisites

  1. Some software developer skills
  2. Minimal familiarity with Subversion code repositories
  3. A way to checkout files from Subversion (Eclipse IDE recommended)
  4. Oracle's Sun Java 1.6 or greater runtime. Mirth Connect, which is used later, also requires this Java.

Install Data Normalization pipeline

Eclipse IDE install:

  • Start Eclipse
  • (Optional) Make a new workspace
  • File -> New -> Project -> Checkout projects from SVN
  • Create a new repository location: svn://
  • Select SharpnNormalization_XMLTree
    • NOTE: Template based - The template based implementation is not yet available.
  • If not the default, change the project JRE to 1.6
  • <SHARP_DATA_NORM_HOME> is where SharpnNormalization_XMLTree is now located.

Command line install: the checkout steps must be extrapolated from the Eclipse IDE instructions.

  • Install an SVN client.
  • Checkout the Data Normalization code from the repository location above.
  • <SHARP_DATA_NORM_HOME> is where SharpnNormalization_XMLTree is now located.
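
The command line checkout can also be scripted. The following is a minimal Python sketch, assuming the `svn` command-line client is installed and on the PATH. The repository URL is not given in full on this page, so `repo_url` is a placeholder that you must supply. The helper names `build_checkout_command` and `checkout` are hypothetical, and the sketch assumes the project sits directly under the repository root, which may not match the real layout.

```python
import subprocess
from pathlib import Path

PROJECT = "SharpnNormalization_XMLTree"  # project name from the Eclipse steps


def build_checkout_command(repo_url, dest_dir):
    """Return the argument list for `svn checkout <url>/<project> <dest>`.

    Assumption: the project lives directly under the repository root.
    """
    source = repo_url.rstrip("/") + "/" + PROJECT
    target = Path(dest_dir) / PROJECT
    return ["svn", "checkout", source, str(target)]


def checkout(repo_url, dest_dir):
    """Run the checkout and return the resulting <SHARP_DATA_NORM_HOME>."""
    cmd = build_checkout_command(repo_url, dest_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1])
```

Calling `checkout(repo_url, "workspace")` with the real repository location would leave `<SHARP_DATA_NORM_HOME>` at `workspace/SharpnNormalization_XMLTree`.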

Test sample data with default pipeline configuration

The UIMA Collection Processing Engine (CPE) is used to run the pipeline. The CPE is configured by a file called a descriptor. This descriptor defines things like the location of input and output data, the mechanism by which to process the collection of input, and what documents are being passed in and processed. There are further settings; it is left to the reader to study the descriptor in the GUI or in its XML format.
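
One way to study the descriptor outside the GUI is to list the name/value settings it contains. The sketch below uses only the Python standard library and looks for `nameValuePair` elements, ignoring XML namespaces. The `SAMPLE` descriptor is a hypothetical, heavily trimmed illustration, not the contents of MayoLabsHL7CPE.xml, and the parameter name `InputDirectory` in it is an assumption.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}name' down to 'name'."""
    return tag.rsplit("}", 1)[-1]


def read_parameters(descriptor_xml):
    """Collect name/value settings from a descriptor given as an XML string."""
    root = ET.fromstring(descriptor_xml)
    params = {}
    for element in root.iter():
        if local_name(element.tag) != "nameValuePair":
            continue
        name, value = None, None
        for child in element:
            if local_name(child.tag) == "name":
                name = (child.text or "").strip()
            elif local_name(child.tag) == "value":
                # The value is wrapped in a typed element such as <string>.
                value = "".join(child.itertext()).strip()
        if name is not None:
            params[name] = value
    return params


# Hypothetical, heavily trimmed descriptor used only to exercise the function.
SAMPLE = """
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <collectionReader>
    <collectionIterator>
      <configurationParameterSettings>
        <nameValuePair>
          <name>InputDirectory</name>
          <value><string>data/input</string></value>
        </nameValuePair>
      </configurationParameterSettings>
    </collectionIterator>
  </collectionReader>
</cpeDescription>
"""

if __name__ == "__main__":
    for key, val in read_parameters(SAMPLE).items():
        print(key, "=", val)
```

To inspect a real descriptor, read the file's text and pass it to `read_parameters`. If the same parameter name appears in more than one component, this simple sketch keeps only the last value.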

  1. Start the CPE:
    • Command line:|bat
    • In Eclipse, Run -> Run Configurations. You will see a Java Application configured: Launch UIMA_CPE_GUI--sharpnnormalizationXMLTree
  2. Load the descriptor
    • File -> Open CPE Descriptor
    • SharpnNormalization_XMLTree -> desc -> collectionprocessingengine -> MayoLabsHL7CPE.xml
      • The descriptor file names and directory names indicate what kind of data is being processed. This is a good practice to follow, because the type may not be easy to discern just by looking at the input file names. In this example we can tell these should be HL7 messages.
  3. Input data
    • Notice the incoming data location (Input Directory in the Collection Reader section)
    • Use any editor to view the files in the input directory.
      • As you might expect, these HL7 messages have already been converted to XML files, since XML is the only format the pipeline can process. If you have this kind of input, you will need to perform this conversion using the connectivity software.
  4. Press the Play button (looks like a green arrow near the bottom of the interface).
  5. Output data
    • Look at the data that is now in the output directory (Output Directory in the Analysis Engines section); each file is a CEM in XML format. The results are not displayed in the GUI, which is simply a means to run the pipeline. Use any editor to view the files by navigating through the system directories.
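
Because the pipeline reads XML and writes each CEM as XML, a quick well-formedness check over the input or output directory can save time before or after a run. The following is a minimal sketch using the Python standard library. It assumes the files carry an `.xml` extension, which this page does not confirm, and `check_xml_files` and `demo` are hypothetical helper names.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def check_xml_files(directory):
    """Map each *.xml file name in `directory` to True if it parses."""
    results = {}
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            ET.parse(str(path))
            results[path.name] = True
        except ET.ParseError:
            results[path.name] = False
    return results


def demo():
    """Exercise check_xml_files on a throwaway directory with two files."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "good.xml").write_text("<message><segment/></message>")
        Path(tmp, "bad.xml").write_text("<message><segment></message>")
        return check_xml_files(tmp)


if __name__ == "__main__":
    for name, ok in demo().items():
        print(name, "well-formed" if ok else "NOT well-formed")
```

Pointing `check_xml_files` at the Input Directory or Output Directory shown in the GUI reports which files parse. It checks well-formedness only, not whether a file is a valid CEM.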

The output results for labs have a special naming convention. You will notice that lab results output file names do not have the HL7 lab categories: coded, narrative, ordinal, quantitative, quantitativeInterval.

The pipeline has an Analysis Engine defined with a set of special parameters. You can see these parameters in the middle of the Collection Processing Engine Configurator GUI launched previously. When we talk about configuring the mapping files, the MODEL_MAPPING and SEMANTIC_MAPPING files are the ones you will be changing. When you create your own pipeline, use the sample configuration files as the basis for your configuration. Once you have created your own configuration files, point to them from here and save the result to your own CPE descriptor. You will create this configuration after setting up the connectivity software.

  • MODEL_MAPPING - Path and name of a configuration file mapping out how to transform from one model to another.
  • SEMANTIC_MAPPING - Path and name of a configuration file mapping out how to transform terms used in the data.
  • Element property - A static property file. Use unchanged. A mechanism to move attributes from the source to the target.
  • LOINC2CEM MAPPING - A static property file. Use unchanged for lab processing. This file is used to tell the pipeline which of the six CEM lab types to use based on the LOINC codes that end up being involved.
  • DOCTYPE - The type of document or data being sent into the pipeline. The document types are HL7, CCD, CDA, and TABLE.
  • DEID - Reserved for future use.
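
Before saving your own CPE descriptor, it can help to sanity-check the values you intend to enter for these parameters. The sketch below is a hypothetical helper, not part of the pipeline. It assumes you have collected the settings into a Python dict keyed by the parameter names above, and that DOCTYPE values are written exactly as listed (upper case).

```python
from pathlib import Path

DOC_TYPES = {"HL7", "CCD", "CDA", "TABLE"}  # document types listed above
MAPPING_KEYS = ("MODEL_MAPPING", "SEMANTIC_MAPPING")  # files you customize


def validate_settings(settings):
    """Return a list of problems found in a dict of parameter settings."""
    problems = []
    doctype = settings.get("DOCTYPE")
    if doctype not in DOC_TYPES:
        problems.append(f"DOCTYPE {doctype!r} is not one of {sorted(DOC_TYPES)}")
    for key in MAPPING_KEYS:
        value = settings.get(key)
        if not value:
            problems.append(f"{key} is not set")
        elif not Path(value).is_file():
            problems.append(f"{key} points to a missing file: {value}")
    return problems
```

An empty list means the DOCTYPE is one of the listed types and both mapping files exist on disk. The check does not inspect the contents of the mapping files.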

Next Steps

Now that you have a pipeline, you need to set up the connectivity software.