Area 4: SHARPn Notes 01032012


Pan-SHARP Area 4 Planning

January 3, 2012

Attendees: Lacey Hart, Calvin Beebe, Guergana Savova, Tom Oniki, Christopher Chute

  • Review the vision for Pan-SHARP Area 4
  • Task list (high level buckets, ensure that we aren’t missing anything)
  • F2F at end of January

Discussion:

  • If Pan-SHARP is going to be credible and successful, it has to be demonstrable and deployable in more than one circumstance
  • Concern with the initial Texas view was that they were basically expecting a canned CEM, somewhat oblivious to where it came from; it seemed purposeless to have a medication reconciliation that would only work on data in a box, where that box was very strictly defined and very non-generalizable
  • If we are going to be realistic about med recon, it’s one thing to have the reconciliation algorithms; those are valuable, and Texas will work on them and produce them as a SMART access app
  • Important that we demonstrate data normalization capability from some reasonable formats
    • NLP, CDA, and possibly HL7 data stream messages of medication orders or something like that
  • This seems more credible: take end-to-end input from a medical organization or institution, do the normalization, get it into a format that Josh Mandel is happy with and that the Texas people can chew on, and then spit out a reconciliation list.
  • Without the normalization components on the import side, it was window dressing for the most part
  • The only minor variation from earlier expectations (everybody agreed that information should come from an NLP source) was insisting that we also have structured formats as valid input
  • The target is presumably the on-medication CEM structure that Tom is working on or has worked on
  • A clarification we also got was whether we were delivering a tool or data
  • Mayo data? Need to start the IRB and internal permission process, at least for HL7 data; whether we can get our hands on any CDA data is being explored. Clinical note data we can get using NLP algorithms.
  • Talked with the Rochester Epidemiology Project (REP) about taking data from the community (Olmsted Medical Center and Mayo Clinic) and then reconciling across enterprises as well as within an enterprise; thought it would be a useful thing. Do reasonable anonymization by shifting dates an arbitrary period and fuzzing detail dates plus or minus a few days, as well as changing demographics (rounding age down to a decade and then adding or subtracting a random offset within roughly a five-year window), fairly realistically de-identifying these data so that they wouldn’t match anyone’s real time frame nor their demographics in any significant way.
  • Think we can generate fairly realistic clinical data and demonstrate that a reconciliation process is executable both within an enterprise and across enterprises.
  • Furthermore, do it in a way that organizations would theoretically be in a position to use the end-to-end pipeline
  • How do we go about defining what it is we need to do to accomplish the vision?
  • Concern: how do we break it down into manageable tasks? Several obstacles:
    • Normalization capability for the CDA and HL7 formats. Back in June when we did normalization over the HL7 format it didn’t go very well; it turned out to be a much bigger task than we expected, and we really don’t know where we sit right now with the normalization of HL7 streams. Don’t know how much work has been done with the CDA streams. Although it seems straightforward, from the June experience with HL7 we think it is a significant chunk of work. Caution because of the expectations, the timeline, and the new end date for the SHARP program.
    • Data normalization: on the NLP side we have the MITRE tool, and the models are trained for de-identification of the data. Cannot guarantee that it will be 100% de-identified; probably in the low 90s (percent). The MITRE tool works very well. However, if we are to de-identify additional streams of data, that is something that we have not even tasked; would expect it to be a very significant chunk of work. The reason for raising this is that, given all the other pressures, we want to make sure that we commit to something that is doable.
  • Two observations
    • The body of work should not affect the NLP efforts in any measurable way. It is a feature that we are pressing to do this given the shortened time frame within which we have to work
    • The other pertinent observation is the de-identification bit: the expectation would be that we would not try to de-identify the notes in any way, shape, or form. We would run the NLP on real Mayo notes here (or get a surrogate to run them) and would generate CEMs
  • De-identifying CEM fields: the only potentially identifiable elements within those CEM fields are age and real dates, which can form a fingerprint of date/time patterns
  • Those are the two attributes we were planning to literally fudge after we do the join and linkage using the REP linkage infrastructure
  • Essentially use identified data up to and including the point where we generate CEMs for a given patient: a file of a pile of CEMs that all pertain to the same patient. Then, once we have the file, serialize the patient identifiers so that they are arbitrary sequential numbers, and fudge the dates so that the fingerprints aren’t there: shift them and fuzz them, truncate ages to decades, and then add an arbitrary offset to give a clinical sense of the age of the patient, though it would be a synthetic age (sketched below)
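
A minimal Java sketch of the de-identification step just described (sequential surrogate identifiers, per-patient date shifting with small per-date fuzz, and age truncation to a decade plus an arbitrary offset). The class and method names are hypothetical; only the transformations themselves come from these notes, and the exact shift and fuzz ranges shown are placeholder assumptions.

      import java.time.LocalDate;
      import java.util.HashMap;
      import java.util.Map;
      import java.util.Random;
      import java.util.concurrent.atomic.AtomicLong;

      // Hypothetical post-normalization de-identifier for CEM fields.
      public class CemDeidentifier {

          private final Random random = new Random();
          private final AtomicLong nextSurrogateId = new AtomicLong(1);
          // Real patient ID -> arbitrary sequential surrogate number.
          private final Map<String, Long> surrogateIds = new HashMap<>();
          // Per-patient base date shift so all of a patient's dates move together.
          private final Map<String, Integer> baseShiftDays = new HashMap<>();

          // Replace the real identifier with an arbitrary sequential surrogate.
          public long surrogateFor(String realPatientId) {
              return surrogateIds.computeIfAbsent(realPatientId,
                      id -> nextSurrogateId.getAndIncrement());
          }

          // Shift a date by the patient's base offset, then fuzz it a few days either way.
          public LocalDate deidentifyDate(String realPatientId, LocalDate realDate) {
              int base = baseShiftDays.computeIfAbsent(realPatientId,
                      id -> random.nextInt(731) - 365);   // arbitrary shift within about a year
              int fuzz = random.nextInt(7) - 3;            // plus or minus a few days
              return realDate.plusDays(base + fuzz);
          }

          // Truncate age to its decade, then add a random offset within about five years.
          public int deidentifyAge(int realAge) {
              int decade = (realAge / 10) * 10;
              return decade + random.nextInt(11) - 5;      // synthetic but clinically plausible age
          }
      }
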
  • We are sharing normalized data, which would be CEMs; the de-identification would happen within the CEM
  • One of the goals of Texas: whatever we do in this demonstration, we have to be able to literally publish the test set and the training set, which means that the data needs to be fairly solidly de-identified to make it useful to people who want to run it, see how it works, and model their data similarly. We have to be able to publish the data set as well
  • We aren’t going to try to de-identify the input data, just de-identify it post-normalization
  • De-identification will be the last step
  • Need a tool that accesses the relevant fields within the CEMs and de-identifies them
  • 100-200 patients, manageable data set
  • The smallness of the data set will be an advantage: we can inspect it, and if we have any concerns over the algorithms we are applying for de-identification we can do a manual inspection of 100 or 200 rows in the data set to verify that we have the effect we want
  • Theory is that if it works well on 100-200 patients for demonstration purposes then it should work on larger numbers
  • Need the full data requirements; need to understand the exact data elements that are expected in the data set, which would help with the IRB submission (operate on the on-medication CEM)
  • Have initial ONC sign-off; need to get through the IRBs and the steps with Olmsted Medical to get a data set
  • Data normalization and documentation are the big buckets, but need to walk through all of the steps and our timing of that.
  • The expectation by Friday is that we have somewhat of a shell of a plan on how long this is going to take, because we are the main dependency for everyone else
  • Steps
    • IRB approval
    • Data access from Olmsted Medical Center; the question is whether they can produce it
  • Would be cool to have HL7 data, CCD data, and NLP text data, both cross-enterprise and within-enterprise
  • Within the on-medication CEM, it is important for the demonstration to have some sort of provenance data: this data came from NLP, this data came from CCD, this data came from HL7. In the current on-medication CEM do we have that level of provenance data? Tom will check. Sounds like something that should be in the wrapper
  • CEMs should have provenance within them, within the wrapper, global across a bunch of CEMs (illustrated below)
  • Tom will check on the wrapper
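
Purely hypothetical illustration of the point above: the real wrapper structure is exactly what Tom is checking on. This Java/DOM fragment only shows the idea of stamping a single source (NLP, CCD, or HL7) on a wrapper element that is global across a batch of CEM instances; the element and attribute names are made up.

      import javax.xml.parsers.DocumentBuilderFactory;
      import org.w3c.dom.Document;
      import org.w3c.dom.Element;

      public class ProvenanceWrapperExample {
          // Wrap a batch of CEM instances with a provenance marker (hypothetical names).
          public static Element wrap(Document doc, String sourceFormat) {
              Element wrapper = doc.createElement("cemWrapper");  // hypothetical element name
              wrapper.setAttribute("source", sourceFormat);        // e.g. "NLP", "CCD", "HL7"
              return wrapper;
          }

          public static void main(String[] args) throws Exception {
              Document doc = DocumentBuilderFactory.newInstance()
                      .newDocumentBuilder().newDocument();
              doc.appendChild(wrap(doc, "NLP"));
              // Individual medication CEM elements would be appended under the wrapper here.
          }
      }
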
  • Observations
    • Mapping is one area where we weren’t involved as much as we should have been for the tracer shot. Verify that Utah is involved with the people who are building the software that maps those sources to CEMs, to help them with the mapping
    • It occurred to us that Guergana hasn’t been involved in the discussions where we assigned the core noted drug and core lab. We should circle back and help her understand the target models. Were there substantial changes since the June model? Subsetting more than anything; noted drug should be pretty close. Send the current copy of the on-drug CEM. The rationale: a lot was source-specific
    • Pan-SHARP: one of the items was a transform from CEMs to the SMART API, the RDF transform tool. Utah is on the hook for that. Josh and Joey have been talking; it is going to take input from both sides. Not clear yet what the CEM-to-SMART transform means: generate an XML blob and have Joey/Josh work on transforming that XML into some kind of RDF that they like. How you decide to expose CEM instances becomes the source of the transform, and what Josh tells us becomes the target of the transform. What would we be working from? The CDL, or rather the XML schema; XML data would be the source. What the XML data looks like is defined by the CEM (the source), and the XSD governs how the data looks (rough sketch below)
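
Rough sketch (Java, Apache Jena) of the CEM-to-SMART direction discussed above: the XML instance governed by the CEM's XSD is the source, and RDF that the SMART side accepts is the target. The "sp:" namespace and the property names below are assumptions for illustration only; the real target vocabulary is whatever Josh specifies.

      import org.apache.jena.rdf.model.Model;
      import org.apache.jena.rdf.model.ModelFactory;
      import org.apache.jena.rdf.model.Resource;

      public class CemToSmartSketch {
          private static final String SP = "http://smartplatforms.org/terms#"; // assumed namespace

          // Emit one medication statement as RDF from values pulled out of a CEM XML instance.
          public static Model toRdf(String patientUri, String drugCode, String drugLabel) {
              Model model = ModelFactory.createDefaultModel();
              model.setNsPrefix("sp", SP);
              Resource med = model.createResource();   // anonymous node for one medication
              med.addProperty(model.createProperty(SP, "belongsTo"),
                      model.createResource(patientUri));
              med.addProperty(model.createProperty(SP, "drugName"), drugLabel);
              med.addProperty(model.createProperty(SP, "drugCode"), drugCode); // e.g. an RxNorm code
              return model;
          }
      }
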
  • Where are we getting the Mayo data? Start with REP; define patients who are Olmsted County residents, who have had medical appointments at both OMC and Mayo Clinic within a two-year period, and who have signed the MN research authorization. Who does the search in REP to compare? Need someone who has experience working with the Rochester project data source (criteria illustrated below)
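
For concreteness, a small Java sketch of the selection criteria just listed (Olmsted County residents, visits at both OMC and Mayo Clinic within a two-year window, signed MN research authorization). The record fields and the in-memory filtering are hypothetical; the real selection would run against the REP linkage infrastructure.

      import java.time.LocalDate;
      import java.time.temporal.ChronoUnit;
      import java.util.List;
      import java.util.stream.Collectors;

      public class CohortFilterSketch {

          // Hypothetical per-patient summary pulled from REP.
          public record PatientRecord(String patientId,
                                      boolean olmstedCountyResident,
                                      boolean signedMnResearchAuthorization,
                                      LocalDate mayoVisit,
                                      LocalDate omcVisit) { }

          public static List<PatientRecord> selectCohort(List<PatientRecord> candidates) {
              return candidates.stream()
                      .filter(PatientRecord::olmstedCountyResident)
                      .filter(PatientRecord::signedMnResearchAuthorization)
                      // seen at both institutions, with the visits within two years of each other
                      .filter(p -> p.mayoVisit() != null && p.omcVisit() != null)
                      .filter(p -> Math.abs(ChronoUnit.DAYS.between(p.mayoVisit(), p.omcVisit())) <= 730)
                      .collect(Collectors.toList());
          }
      }
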
  • Provenance stuff has to be manifest within the XML schema. Goal is to load it into the database.
  • Really don’t have any experience in consuming CCDs. The OHT toolset for extraction from CCDs would be helpful. Try the library; need to figure out how to use it
  • Run them first through NLP (Mayo data): is Mayo running the NLP, and is it cTAKES? Whatever NLP software is used will be run at Mayo on Mayo data, whether or not it is cTAKES. Does the public medication extraction that was released cover most of the bases needed to generate the information that would be in the on-medication CEM? Yes it does, and that code was used for the June demo. Mayo staff will work with Hongfang
  • Will it be run through the Cloud and the instances? Doesn’t matter
  • Run it through normalization, through the UIMA pipeline (sketched below). No further normalization on the NLP output; generate a raw CEM. There is a downstream process where we would have to take information for the same patient, put it into a patient space, and then do the final step of annotation after we have done patient reconciliation. Patient reconciliation would include output from the NLP
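
Minimal Java sketch of pushing one clinical note through a UIMA analysis engine, as described above. The descriptor path and the note text are placeholders; in practice this would be the medication-extraction aggregate (cTAKES or otherwise), and a downstream step would turn the resulting annotations into raw on-medication CEM instances.

      import org.apache.uima.UIMAFramework;
      import org.apache.uima.analysis_engine.AnalysisEngine;
      import org.apache.uima.jcas.JCas;
      import org.apache.uima.resource.ResourceSpecifier;
      import org.apache.uima.util.XMLInputSource;

      public class NlpPipelineSketch {
          public static void main(String[] args) throws Exception {
              // Hypothetical descriptor for the medication extraction aggregate.
              XMLInputSource in = new XMLInputSource("desc/MedicationExtractionAggregate.xml");
              ResourceSpecifier spec = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
              AnalysisEngine engine = UIMAFramework.produceAnalysisEngine(spec);

              JCas jcas = engine.newJCas();
              jcas.setDocumentText("Patient restarted lisinopril 10 mg PO daily.");  // example note text
              engine.process(jcas);

              // Medication annotations would be read off the CAS here and mapped to raw CEMs.
              engine.destroy();
          }
      }
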
  • Need to put a wrapper on the CEM output: Mayo data, NLP-sourced
  • HL7 data: where is it sourced? Synthetic source
  • HL7 mapping table lookup, within the pipeline and within the process: normalization is needed for the HL7 data, but not for CCD data if it is well formed. Phase 1 of meaningful use essentially specifies nationally used priority codes.
  • The principal drug code will be the major issue, along with dose, frequency, and route
  • Data: run it through and persist it. Time frame for 100-200 patients: about a second per document, if there are no hiccups
  • Mapping table and running HL7: setup will take a while. Enhancements to the algorithm are needed; we haven’t done dose, route, and frequency in the UIMA pipeline for HL7 (Harold/Sridhar). A mapping-table sketch follows below
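
A sketch (Java) of the mapping-table lookup discussed for the HL7 stream: local or message-level drug codes are looked up and normalized to a standard code (for example RxNorm) before the on-medication CEM is populated. The table contents and class name are placeholders; building and maintaining the real table is the setup work noted above.

      import java.util.HashMap;
      import java.util.Map;
      import java.util.Optional;

      public class DrugCodeMapper {
          // Local/source code -> normalized code, loaded from the mapping table.
          private final Map<String, String> localToRxNorm = new HashMap<>();

          public void addMapping(String localCode, String rxNormCui) {
              localToRxNorm.put(localCode, rxNormCui);
          }

          // Returns the normalized code, or empty if the source code is unmapped.
          public Optional<String> normalize(String localCode) {
              return Optional.ofNullable(localToRxNorm.get(localCode));
          }
      }
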
  • Pan-SHARP meeting at the end of January: Joey, Tom, Calvin, and Lacey will be going