Minutes from Meeting Day 2

From SHARP Project Wiki

Day One Summary

Project Logistics

  • The wiki has been updated live throughout the meeting. It has a public and a private view. Once you login, each project has its own space. Your information can be kept private.
  • There are both public and private distribution lists.
  • Twitter will be ongoing throughout the project.
  • Regular Meetings
    • PAC meets quarterly
    • Project leads meet monthly
    • Project teams are currently meeting weekly
    • Cross-cutting workgroups will meet on an ad hoc basis as needs arise
    • Cross-SHARP communications: Feel welcome to contact members of the other SHARP projects if you feel they have answers to your questions. There will be a SHARP meeting at AMIA in the fall.
  • Program Evaluation by NORC

This is not an evaluation of the science, so much as an evaluation of the process and how well the project has been run. They will be looking at how progress is being documented.

  • Project Management
    • Project managers are assigned to each project.
    • The project management team will discuss:
      • Scope for each project, in the 6 month to 1 year timeframe.
      • Milestones and schedule to connect projects and ensure they don't impede each other.
      • Risk
      • Resource allocation
  • Risk
    • Expectations from within the project (broad progress) are different from those outside the project (targeted products for the marketplace)
      • Fun widgets that do glamorous things are not a long-term mark of success for SHARP Area 4. We will need to be able to articulate a story for the broad functionality that we'd like to produce.
      • There needs to be a larger framework for how all of the pieces or widgets work together to achieve an overall functionality of gleaning meaningful information from clinical records.
      • Philosophically more middleware. Each project will achieve successes within their domains, but we need to produce tools that will support secondary use by those outside of the SHARP Area 4 group. The body of work may not be high-profile to the broad public, but will facilitate high-profile research asking health-related questions of clinical information.
      • Can we choose a few clinical areas to focus on to ask some of these research questions?
      • How do we integrate the different projects of Area 4? If we don't define specific domains to work in and see specific links between them, we run the risk of not having an overarching body of work at the end of the project.
      • There is a risk of overfitting the solutions if they are targeted for a specific research question.
      • Reach out to the other SHARP projects to connect with them as potential "secondary users" of the data.
      • Four years is a long time. It's a longer horizon than many projects. This allows us to articulate a grand vision that is equal to the four year time frame. Technologies will evolve over four years.
      • Can we define phenotyping use cases to apply the methodology to the NLP processes?
      • Beacon is a natural laboratory; the other SHARP projects, such as Cognitive, may also be obvious end users. SHARP is not going to do CER; it will enable others to do CER. We can claim success by facilitating this type of research that others will do. We need to make the best tools, and as many of them as we can, to enable others to do research with the tools.
      • Define a project arc for the four years, so that there are milestones along the way that allow you to assess your progress during the course. How do we define success as Area 4 and how will others define success?
      • Need to provide tools for researchers, realistically, in two years - and identify partners to take the tools and use them in the last year or two of the project.
      • ONC is looking at intermediate timepoints, in six-month periods, and asking who the end users are of what you are producing. As a non-renewable cooperative agreement, there are clear deliverables for the end of the project.
  • A SHARP Area 4 Sandbox

Should there be a workgroup across the projects to define requirements for a test environment for Area 4? Could it be developed internally or would it need to be outsourced?

Project 4 Scaling Capacity, Marshall Schor, IBM

This project is here to ensure that, should you adopt UIMA as a framework, your use is optimal and successful. There are new UIMA capabilities that address large scaleout of processes.

UIMA facilitates the building of modular components for unstructured analytics. This allows sharing and re-use of the modules by others.

UIMA-AS allows the scaleout of analytic pipelines and networking issues of time and geography.

  • High-throughput Analytics
    • Running multiple cores, java with large memory footprints, etc. to improve existing analytics that need to run faster.
    • Can use UIMA-AS to run analytics locally - identify bottlenecks, etc.
    • Looking to improve UIMA-AS as an output of its use in SHARP Area 4.
  • Activities
    • This group will engage to assist in early choices in using UIMA as well as fine-tuning of processes after they've been running successfully.
    • Use cases articulate how an end-user would actually interact with the product you are developing. Combining use cases with good architectural thinking.


http://uima.apache.org Please view Marshall as the UIMA lab TA who will help as you try to use UIMA. Resources are also under Educational Demo/Docs on the wiki.

A tutorial for UIMA would be very helpful for the group. This is of interest for several projects in Area 4. What baseline info are you expecting of people before you come out to do a tutorial? What is most helpful?

  • Go through Quick Start
  • Read the tutorial and user guide. Go through the example.
  • This could be done via WebEx to maximize attendance. Additional follow-up for specific projects? Could use telepresence suites at Deloitte offices?
  • Indicate to Lacey your requirements for this type of assistance, to ensure one project does not utilize all of the resources for this assistance.
  • Send Lacey the particulars of what you need training-wise for UIMA.

IBM can consult on the Area 4 Sandbox.

  • Is this a set of machines or a place in cyberspace?

Advice is available on how best to build the Sandbox, and there is some capital in the budget to resource the Sandbox.

  • What about a virtual machine?

Open source licenses - public versus eventually commercializable. Apache cannot include LGPL easily. Be aware of the licensing being set up: can it be changed, commercialized, etc.? This decision has implications for what you can do downstream. Stick with non-viral licenses. There may be constraints imposed by Apache UIMA. Need some more info before making a decision. Organize a task group to look at licensing issues; Open Health Tools may be a resource for this task group.

  • What is the final software deliverable?

A library of tools, widgets and annotators that would fit in a UIMA pipeline and facilitate secondary use of EHR data. This may not require a UIMA-AS scaleout. High-throughput, low latency?

  • How does UIMA fit with the structured data?

Structured data, if in non-standard codes, etc., could be handled via a UIMA pipeline to 'normalize' the non-standard codes/structures into a standardized format for use. Then the normalization pipeline could be joined with phenotyping pipelines to result in a classification of a patient for clinical research. Some of the complex phenotyping algorithms may not fit well within a UIMA framework. UIMA may not make sense as a common framework for data flows for all aspects of Area 4.
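The normalize-then-phenotype flow described above can be sketched in Python. This is a hypothetical illustration, not actual SHARP or UIMA code: the local code names, the LOINC mappings, and the HbA1c rule are all invented to show how a normalization step could feed a phenotyping step.

```python
# Hypothetical sketch: a normalization step followed by a phenotyping step,
# composed the way pipeline stages would be chained. The code map and the
# phenotyping rule below are invented for illustration only.

LOCAL_TO_STANDARD = {
    "GLU-LOC": "LOINC:2345-7",   # hypothetical local glucose code -> LOINC
    "HBA1C-X": "LOINC:4548-4",   # hypothetical local HbA1c code -> LOINC
}

def normalize(record):
    """Map non-standard codes in a record to a standard vocabulary."""
    return {LOCAL_TO_STANDARD.get(code, code): value
            for code, value in record.items()}

def phenotype_diabetes(record):
    """Toy phenotyping rule over normalized codes: HbA1c >= 6.5%."""
    return record.get("LOINC:4548-4", 0.0) >= 6.5

def classify(raw_record):
    """Join the normalization pipeline with the phenotyping pipeline."""
    return phenotype_diabetes(normalize(raw_record))

patient = {"HBA1C-X": 7.1, "GLU-LOC": 130}
print(classify(patient))  # True for this toy record
```

The point of the sketch is the composition: the phenotyping stage only ever sees standardized codes, so the same rule works across sites with different local vocabularies.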

    • We need to execute towards a concrete use case. UIMA pipeline and otherwise. Architectural work needs to be grounded in use cases and requirements.
    • Monthly architecture-specific telecon with representatives from each project and any other interested parties.
    • This may introduce some dependencies and these need to be planned for.

  • Is UIMA document centered? Does it have a representation of 'patient', etc?

Yes, you can define 'types', which could include patient. These types may be best defined with an end-user in mind. May need a process by which we can put suggestions for these types in a repository that can evolve as we proceed.

  • What is the strategy for describing a set of use cases for the four year projects?

Are they evolved from individual projects and then assembled from there? We also need to look at the other SHARPs and the Beacons for use cases. Still need a process. Is this something that's kept on the wiki? An overall strategy will help define what best fits in UIMA and what does not.

A library is a good deliverable, but having some pre-packaged pipelines will be helpful for users.

Cloud computing as a way to bring technologies to local sites. In 2011, SHARP will address the security of cloud-deployed systems.

Project 5 Data Quality, Kent Bailey, Mayo

What are the biostatistics needs of SHARP Area 4?

Missing data:

  • New imputation techniques that are now available for statisticians
  • Multiple imputation uses the probabilities associated with the imputed data points, which reflects the uncertainty of that imputation
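A minimal sketch of the multiple-imputation idea, using only the Python standard library. This simplifies Rubin's rules to the between-imputation variance (the full procedure also pools within-imputation variance); the data values are invented.

```python
import random
import statistics

# Sketch of multiple imputation: fill each missing value m times by drawing
# from the observed values, then pool the estimates. The spread of the
# estimates across imputations reflects the uncertainty of the imputation.
# (Full Rubin's rules also add within-imputation variance; omitted here.)

def multiply_impute_mean(values, m=20, seed=0):
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(m):
        completed = [v if v is not None else rng.choice(observed)
                     for v in values]
        estimates.append(statistics.mean(completed))
    pooled = statistics.mean(estimates)
    between = statistics.variance(estimates)  # between-imputation variance
    return pooled, between

data = [5.1, None, 4.8, 5.5, None, 5.0]  # invented lab values with gaps
estimate, between_var = multiply_impute_mean(data)
```

A single imputation would hide the fact that the filled-in points are guesses; repeating the fill and looking at the spread of results is what carries that uncertainty into the final estimate.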

Uncertain diagnosis:

  • Diagnoses are decisions clinicians make to proceed with patient care. However, these decisions are not necessarily pertinent or appropriate for research needs.
  • Wellness and disease are not binary states but a continuum; diagnoses may be better approached from a probabilistic perspective when using clinical data for research.
  • For epidemiological studies (PAD example, see slides), you don't add up number of cases and controls, you add probabilities.
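The "add probabilities, not cases" point can be shown in a few lines. The per-subject disease probabilities below are invented for illustration; the PAD numbers are in the slides, not here.

```python
# Instead of counting dichotomized cases, sum each subject's probability of
# disease to get the expected case count. Probabilities below are invented.

def expected_cases(probs):
    return sum(probs)

def expected_cases_variance(probs):
    # Treating each subject as an independent Bernoulli outcome,
    # the variance of the case count is sum of p * (1 - p).
    return sum(p * (1 - p) for p in probs)

cohort = [0.9, 0.1, 0.7, 0.3, 0.05]
print(expected_cases(cohort))           # 2.05 expected cases
print(expected_cases_variance(cohort))  # uncertainty around that count
```

Thresholding those probabilities at 0.5 would have counted exactly 2 cases and discarded the information in the borderline subjects; the probabilistic sum keeps it.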

Unequal precision of a continuous phenotype:

  • There is uneven uncertainty of data elements, both over time and from different sources.
  • Situations where you need to account for uncertainty in the data. (This is the overarching problem addressed by data quality)


  • Data is not a number, but a posterior distribution (MCMC).
  • Don't try to change the data quality, but take it into account. Therefore you can propagate the error to take it into account throughout your analytical steps.
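A toy example of propagating measurement uncertainty through an analysis, using plain Monte Carlo resampling as a stand-in for full MCMC. The means and standard deviations are invented.

```python
import random
import statistics

# Treat each measurement as a distribution rather than a point value, and
# propagate that uncertainty by resampling: draw a plausible dataset, run
# the analysis, repeat, then summarize the spread of results.

def propagate(measurements, analysis, draws=2000, seed=1):
    """measurements: list of (mean, sd); analysis: function of a value list."""
    rng = random.Random(seed)
    results = []
    for _ in range(draws):
        sample = [rng.gauss(mu, sd) for mu, sd in measurements]
        results.append(analysis(sample))
    return statistics.mean(results), statistics.stdev(results)

# e.g. the mean of three uncertain lab values, carrying their error forward
labs = [(5.0, 0.2), (4.6, 0.5), (5.4, 0.1)]  # invented (mean, sd) pairs
center, spread = propagate(labs, statistics.mean)
```

Note how the noisier second measurement widens `spread` even though the point estimate is unchanged: the error is taken into account rather than "fixed".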


Example of the blood counts, how do you handle the fluctuation in an individual's data points as they change over time? This change may or may not be relevant to the analysis.

  • For phenotyping, the bulk of the exclusion is determining which phenotypic information is relevant to the questions being asked.
  • What are the metrics of good data? This is what allows us to make a decision about what information is relevant.

Caution - the summary of a patient's state is a much higher dimensional thing than the number associated with a specific data element. Does it address disease severity, disease progression? What do you model? The assumption is that you are modeling the probability, but you may be modeling a combination of factors.

  • Would there be value in modeling with data from healthy people?

Using the probabilities associated with a diagnosis, is there a threshold at which you can declare disease?

  • You need to analyze what the appropriate threshold is.
  • May be able to get some power out of using the probability versus the threshold.

The idea of "perfect" data is not a useful goal. We should focus on the utility of the data and the probabilities may be a good approach to defining that utility for the other projects. For research, you need the data as predictors for the questions you are asking, not as a decision making process to determine how to proceed with clinical care. This is the leveraging of opportunistic data.

  • What are the data consistency metrics we want to look at for phenotyping?
  • Lack of consistency is not lack of quality; you need to measure the variability and take it into account in your analysis.
  • Which downstream analytics require higher-consistency data? Which can tolerate more noise? This may allow you to determine how much effort needs to be put into standardization of data on the front end.
  • How do you identify garbage data, such as lab values for the deceased, null values, etc.?
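The garbage-data checks mentioned above (nulls, implausible values, lab results for the deceased) can be sketched as simple row filters. The field names and records are invented for illustration.

```python
from datetime import date

# Sketch of simple "garbage" checks on lab rows: null values, implausible
# numbers, and results dated after a recorded death. Field names invented.

def is_garbage(row):
    if row.get("value") is None:
        return True                    # null value
    if row["value"] < 0:
        return True                    # implausible, e.g. a negative count
    death = row.get("death_date")
    if death is not None and row["date"] > death:
        return True                    # lab value for the deceased
    return False

rows = [
    {"value": 7.2, "date": date(2010, 3, 1), "death_date": None},
    {"value": None, "date": date(2010, 3, 2), "death_date": None},
    {"value": 6.8, "date": date(2010, 5, 1), "death_date": date(2010, 4, 1)},
]
clean = [r for r in rows if not is_garbage(r)]  # only the first row survives
```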

Need a task group to formulate the trajectory of the Data Quality group to determine what is needed by the other projects.

Project 6 Evaluation Framework, Stan Huff, Utah

  1. Clinical Element Models
  2. Evaluation Framework

1. Clinical Element Models

Clinical element models can be useful for SHARP Area 4 as targets, guides, and definitions. Also for data QA, when the data provided are inappropriate for the type of data element (e.g., a negative number for BP).
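The data-QA use of a model can be sketched as a range check against the element's definition. The ranges below are illustrative placeholders, not values taken from actual clinical element models.

```python
# Hedged sketch of a model-driven QA check: reject data instances that are
# inappropriate for their element type, e.g. a negative blood pressure.
# The ranges here are illustrative, not from actual clinical element models.

PLAUSIBLE_RANGES = {
    "systolic_bp": (0, 300),    # mmHg; a negative value is invalid
    "diastolic_bp": (0, 200),
}

def validate(element, value):
    """Return True if value is plausible for the named element."""
    low, high = PLAUSIBLE_RANGES[element]
    return low <= value <= high

print(validate("systolic_bp", 120))  # True
print(validate("systolic_bp", -5))   # False
```

In a real CEM the constraint would come from the model definition (and its value sets) rather than a hand-written table, but the QA step is the same: test each instance against its element's declared constraints.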

The existing models cover only ~20% of those needed, but because they address the most commonly used data, they cover ~80% of actual usage. The group is happy to develop new models.

The models are associated with value sets and standard terminologies. The models function in the context of the LexGrid Terminology Server.

The models assume that you have access to terminology services.

2. Evaluation Framework

See slide for a model of widget/network services use between multiple facilities. Assumptions:

  • Widgets to provide normalized data instances
  • Normalized data instances shared through NHIN
  • Sharing across research institutions, or between institutions and HIEs
  • Data verification through comparison to data in EDW at either Mayo or Intermountain
  • CEMs


CEMs can be compiled into openEHR architectures. Are these parallel? Yes, almost identical. CEMs are available publicly.

NHIN - organizations are starting to share data. Mostly documents, CDA. Can start putting more structured data on NHIN. Could we put tools into NHIN? Query terminologies?

  • Target new v3 ITS, rather than RIM. Want to support the HL7 standards. Hoping they turn into templates on CDA.
  • Trying to support more than one technology if that is what will help people.

UIMA flow where widgets sit in detailed model (see slide). Could be the glue between many parts of the diagram.

What about missing data in the CEM? Thinking in terms of the outputs of NLP.

  • Mandatory vs. optional elements in a CEM.
  • Which fields are necessary to make any sense of the information.
  • Can be specified in the application.
  • Could add the probability associated with how likely this is the correct representation of natural language.
  • NLP can bring context, so may be able to assign probability.

CEM is the instance level. Can it be moved to the patient level?

  • The CEMs are the starting point for that. The next level of aggregation/abstraction are rules that can be applied to the CEM.
  • Episode of care model? This needs further definition, a single visit or all the care received for a particular condition? Can both be modeled.

Final Thoughts

Can we use the wiki as the primary way to communicate?

  • Do we need a regular newsletter? Perhaps updates from Chris.
  • Any user can watch any page. Can you watch the whole site? You can watch the recent changes page.

High level of enthusiasm from the group for these projects and the real chance to make an impact on the ability to use EHR data for clinical research.