Annual Gathering/6.11.12Mtg Notes

From SHARP Project Wiki


SHARPn Summit 2012

Monday, June 11, 2012 Presentations & Meeting Notes

SHARPn PIs on Secondary Use of EHR Data

  • 800AM-830AM, Lecture Hall, Rm 414
Christopher G. Chute, M.D., Dr.P.H. Principal Investigator Mayo Clinic
Stanley Huff, M.D., Co-Principal Investigator, University of Utah, Intermountain Healthcare

Clinical Data Normalization - Practical Modeling Issues

  • 830AM-1000AM, Lecture Hall, Rm 414
Stanley Huff, M.D., Co-Principal Investigator, University of Utah, Intermountain Healthcare
  • Presentation
  • Mtg Notes: Stanley M. Huff, M.D.; SHARPn Co-Principal Investigator; Professor (Clinical) - Biomedical Informatics at University of Utah - College of Medicine and Chief Medical Informatics Officer, Intermountain Healthcare. Dr. Huff discussed how providing patient care at the lowest cost with advanced decision support requires structured and coded data.
  • Detailed clinical models are the basis for retaining computable meaning when data is exchanged between heterogeneous computer systems. Detailed clinical models are also the basis for shared computable meaning when clinical data is referenced in decision support logic.
    • The need for the clinical models is dictated by what we want to accomplish as providers of health care
    • The best clinical care requires the use of computerized clinical decision support and automated data analysis
    • Clinical decision support and automated data analysis can only function against standard structured coded data
    • The detailed clinical models provide the standard structure and terminology needed for clinical decision support and automated data analysis
  • Data normalization & clinical models are at the heart of secondary use of clinical data. If the data is not comparable between sources, it can’t be aggregated into large datasets and used, for example, to reliably answer research questions or survey populations from multiple health organizations. Without models, there are too many ways to say the same thing.
  • In order to represent detailed clinical data models, we have designed The Clinical Element Model (CEM). When we state “The Clinical Element Model” we are referring to the global modeling effort as a whole, or in other words, our approach to representing detailed clinical data models and the instances of data which conform to these models.
  • Discussed pros/cons of pre-coordinated vs. post-coordinated strategies; negation and uncertainty.
  • How are the models used in EMR?
    • Among other uses, this serves as a solution for the "curly braces problem" - the model coupled with the terminology allows you to represent this within a system.
  • How would the models be used globally?
    • If we could have our way, we would have people collect the data in a consistent model, but that is not practical in today's reality. We do need to collect the essential elements.
  • Discussed modeling as an international collaboration. Collaboration is happening; bringing everyone together and keeping them together is work. There are many existing modeling approaches and efforts. The mission of the Clinical Information Modeling Initiative (CIMI) is to improve the interoperability of healthcare information systems through shared implementable clinical information models.
  • Question: will there be a repository for other people's work? Yes. There are a number of other representations, and we are looking for people to submit them; the original representation would be maintained, and a corresponding CEM would also be maintained.
  • Models can be used across different tasks: tasks may have different requirements of granularity. Idea is to make the model as comprehensive as possible, then constrain the model. Otherwise, if you do it the other way (multiple specific models) then you risk having conflicting representations.
  • Goals:
    • Shared repository
    • Single formalism
    • Based on a common set of base data types
    • Formal bindings to standardized clinical models
    • Free and open for downloading
  • Formalism of the Model
    • The Logical Structure is Preeminent
    • This is like math - you must have a formalized structure - until you have that, you can't express things precisely
  • Mods and Quals of the Value Choice
    • Clinical elements can contain clinical elements and can have modifiers and qualifiers
    • The name of the model is at the top of the hierarchy
  • Qualifiers
    • Within a unique measurement, you can add specifics related to the measurement (e.g., for BP, body location, patient position, etc.)
  • Is there a boundary on the qualifiers?
    • That's one of the issues of modeling
    • In the degrees of freedom, we're trying to focus on what people want to use and what is really needed.
    • There are some qualifiers that would more naturally go into calculations from other data
  • Are all of the qualifiers and other coded elements in a standard terminology?
    • We want to use standard terminology wherever possible.
    • We find that about 80% of the terminology can be found in existing vocabularies; for the rest, we are committed to submitting these to LOINC, SNOMED, etc.
    • Modifiers are structurally the same as qualifiers, but are intended for different use.
      • May need to store the baby's blood type in the mother's record
      • In SNOMED - they refer to this as a change of meaning context
      • Never want to count a "Family Hx of cancer" as "this person has cancer"
      • Modifiers provide a context for the information; qualifiers tell you more detail about the information - You can still have valid information without a qualifier, just less detail; if you take out a modifier, you can change the meaning of the element entirely ("history of" versus "patient has")
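The qualifier/modifier distinction above can be sketched in code. This is an illustrative sketch only (plain Python dicts, not the actual CEM syntax): qualifiers refine a finding, while a modifier such as "subject" can change whether the finding is about the patient at all.

```python
# Illustrative sketch (not the actual CEM representation): a clinical element
# as a nested dict, where qualifiers add detail and modifiers change meaning.

def asserts_about_patient(element):
    """Return True if the element asserts a condition about the patient."""
    # A "subject" modifier other than the patient changes the meaning entirely:
    # "family history of cancer" must never count as "patient has cancer".
    return element.get("modifiers", {}).get("subject", "patient") == "patient"

bp = {
    "name": "SystolicBloodPressure",
    "value": {"magnitude": 120, "units": "mmHg"},
    # Qualifiers refine the measurement; dropping them loses detail only.
    "qualifiers": {"body_location": "left arm", "patient_position": "sitting"},
}

fam_hx = {
    "name": "Cancer",
    # Modifier: the finding is about a relative, not the patient.
    "modifiers": {"subject": "mother"},
}

assert asserts_about_patient(bp) is True
assert asserts_about_patient(fam_hx) is False
```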
  • CDL
    • A coded representation of the model; this is the third computational language Intermountain has used for doing this.
  • Modeling styles and strategies
  • How do we avoid having localized versions of the common models?
    • We want to proliferate as much as we need to
  • In SHARPn, we've taken the models we've used at IHC and have tried to apply them to other settings
    • One concept is findings that include the attribute type (hair color) with the finding (brown)
    • Evaluation style - looking at a particular finding related to an attribute
    • Assertion style - look at a particular attribute of a person
    • Creates a different decomposition of the information
    • Typically use the evaluation style when the finding is a number (it would be strange to assert "BP80" as a single finding, because you'd need many, many such pre-coordinated codes)
    • Both evaluation and assertion styles are accurate and unambiguous
    • Assertion styles allow each assertion to become a present/absent column for statistical analysis
    • Assertion styles are best for reasons, complications, final dx, etc.
    • Conclusion: you need both.
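As a toy illustration of the two styles (hypothetical structures, not actual CEM instances), the same fact "the patient has brown hair" can be decomposed either way:

```python
# Hypothetical sketch of the two modeling styles for the same fact,
# "hair color is brown" (not actual CEM instances).

# Evaluation style: the attribute (hair color) is observed; the finding is the value.
evaluation = {"observable": "hair color", "value": "brown"}

# Assertion style: a single pre-coordinated statement about the patient.
assertion = {"assertion": "brown hair color", "status": "present"}

# Assertion style maps naturally to present/absent columns for statistics:
columns = {assertion["assertion"]: assertion["status"] == "present"}
assert columns == {"brown hair color": True}
```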
  • Deprecated representation
    • Single code style - sticking a code in a record and not making any additional assertions about it.
      • You need to make an assertion that this code means something that's true about the patient (the code for brown hair color in a record means that this is the patient's hair color)
    • This is generally a bad idea, so we deprecate this representation.
  • Q: how do you recognize that two models are representing the same things?
    • We develop "Isosemantic models" where we show the collection of models that end up representing the same thing based on terminology options.
  • When you talk about family history, this is a generalization of the fetal blood type example
    • Family history is a specific case of a subject finding
    • If you model this well, you can look at this as relationships from the general - happened in people - down to more specific - happened in family history >>> specific relative with a family history, etc.
  • Risk with postcoordination is that you miss some of the information if you don't go deep enough
  • Found it difficult to do transplant information
  • Want to have a public forum for discussing what is essential for interoperability.

Cloud Resource Lab

  • 830AM-1000AM, Innovation Lab, Rm 415
SHARPn Cloud Computing Resource Lab Troy Bleeker, CBAP®


NLP Research Presentations

  • 830AM-1000AM, Classroom, Rm 417
Part I: NLP Fundamentals: Methods and Shared Lexical Resources Guergana Savova, Ph.D.
Part II: Dependency Parsing and Dependency-based Semantic Role Labeling (SRL) Steven Bethard, Ph.D.
Part III: Discovering Severity and Body Site Modifiers: a Relation Extraction Task Dmitriy Dligach, Ph.D.
Part IV: Applying Dependency Parses and SRL: Subject and Generic Attribute Discovery Stephen Wu, Ph.D.
Part V: Discovering Negation and Uncertainty Modifiers Cheryl Clark, Ph.D.
  • Mtg Notes:

NLP Fundamentals: Methods and Shared Lexical Resources (Guergana Savova)

  • Guergana gave a high-level view of the history and goals of the NLP team.
    • NLP team discussed the six general CEM templates with the data norm team to choose normalization targets for the NLP group
    • The team then met with the phenotyping group to understand what data they would consume, so the NLP team could understand how to normalize for the end user
  • She also presented a high-level overview of NLP areas of research and their impact/interest in current scientific endeavors
  • In SHARPn, the NLP team is taking the best of all of these areas to create tools that directly impact clinical care.
  • Presentation line-up was introduced

Dependency Parsing and Dependency-based Semantic Role Labeling (Steven Bethard)

  • Steven led a discussion regarding a transition based SRL algorithm that leverages dependency parsing algorithms and stacking to better ascertain grammatical dependencies.

Discovering Severity and Body Site Modifiers: A Relation Extraction Task (Dmitriy Dligach)

  • Dmitriy led a discussion regarding cTAKES relation extraction research.
  • His research used CEM templates to discover attributes/modifiers for body site and severity
  • Leveraged two types of UMLS relations (LocationOf and DegreeOf)
  • His research heavily leveraged ClearTK to assist with feature extraction, training, and as an evaluation framework.
  • Results
    • Best features are entity and word features
    • Best parameters are linear kernel and the downsampling rate
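The LocationOf/DegreeOf setup can be sketched as pairwise relation classification over candidate entity pairs. The names below are illustrative, not the cTAKES API; in the actual work, ClearTK-trained classifiers (not a lookup table) label the pairs:

```python
# Hedged sketch of casting modifier discovery as pairwise relation
# classification; entity types and the "allowed" table are illustrative.
from itertools import product

entities = [
    {"text": "severe", "type": "severity"},
    {"text": "pain", "type": "sign_symptom"},
    {"text": "left knee", "type": "anatomical_site"},
]

def candidate_pairs(entities):
    """Generate (arg1, relation, arg2) for type pairs that can hold a
    UMLS-style relation; a trained classifier would score each candidate."""
    allowed = {("anatomical_site", "sign_symptom"): "LocationOf",
               ("severity", "sign_symptom"): "DegreeOf"}
    for a, b in product(entities, repeat=2):
        label = allowed.get((a["type"], b["type"]))
        if label:
            yield (a["text"], label, b["text"])

pairs = list(candidate_pairs(entities))
assert ("left knee", "LocationOf", "pain") in pairs
assert ("severe", "DegreeOf", "pain") in pairs
```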

Applying Dependency Parses and SRL: Subject and Generic Attribute Discovery (Stephen Wu)

  • Stephen presented his work on SHARPn tasks 4 & 6. His research looks for attributes of named entities.
  • The methodologies for doing this are dependency parsers and semantic role labeling.
  • Two domains of study:
    • Generic attribute discovery
    • Subject attribute discovery
  • Stephen discussed case examples of how important it is to define an attribute correctly.
  • Different types of rules: noun phrase structure, path to root, path between pairs, semantic arguments.
  • These modules are in the assertion module within the cTAKES 2.5 release

Discovering Negation and Uncertainty Modifiers (Cheryl Clark)

  • Cheryl introduced her task, which is to extract negation (events that did not occur or do not exist) and uncertainty (statements that contain a measure of doubt) modifiers
  • Previously, she used an assertion analysis tool and then incorporated it into a UIMA framework to fit within a cTAKES analysis pipeline.
  • The assertion categories were edited to meet i2b2 and SHARPn categories.
  • We are aiming to refactor our assertion module to move from a single, multi-way classifier to multiple classifiers, some of which are binary.
  • Work completed:
    • Simple mapping from i2b2 assertion classes to SHARPn attributes
    • Direct assignment of SHARPn attribute values which will use multiple classifiers on SHARPn data
  • Next steps:
    • Assertion discovery for relations
    • Model retraining for individual attributes
    • Evaluate i2b2 gold annotations vs. accuracy of SHARPn gold annotations
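The i2b2-to-SHARPn mapping step can be sketched as a simple lookup; the attribute names and values below are illustrative, not the actual SHARPn attribute definitions:

```python
# Sketch of the refactor described above: replacing one multi-way assertion
# label with several per-attribute values, some of which become binary
# classifiers (the mapping itself is illustrative, not the SHARPn spec).

I2B2_TO_SHARPN = {
    "present":  {"polarity": "asserted", "uncertainty": "certain"},
    "absent":   {"polarity": "negated",  "uncertainty": "certain"},
    "possible": {"polarity": "asserted", "uncertainty": "uncertain"},
}

def assign_attributes(i2b2_class):
    """Map a single i2b2 assertion class onto separate SHARPn attributes."""
    return I2B2_TO_SHARPN[i2b2_class]

assert assign_attributes("absent")["polarity"] == "negated"
assert assign_attributes("possible")["uncertainty"] == "uncertain"
```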

Standards, Data Integration & Semantic Interoperability

  • 1030AM-1200PM, Lecture Hall, Rm 414
Part I: Introduction to SHARPn Normalization Tom Oniki, Ph.D.; Hongfang Liu, Ph.D.
Part II: Semantic Normalization and Interoperability lessons learned Tom Oniki, Ph.D.; Kyle Marchant; Calvin Beebe; Hongfang Liu, Ph.D.
Part III: SHARPn Infrastructure and Normalization Pipeline Demonstration Vinod Kaggal
  • Meeting Notes:
  • Data Normalization Goals
    • To conduct the science for realizing semantic interoperability and integration of diverse data sources
    • To develop tools and resources enabling the generation of normalized EMR data for secondary uses
  • Normalization Targets
    • Clinical Element Models
      • Intermountain Healthcare/GE Healthcare’s detailed clinical models
    • Terminology/value sets associated with the models using standards where possible
  • Panel walked through the SHARPn normalization process
    • Prepared mapping / Two kinds of mappings needed:
      • Model Mappings
      • Terminology Mappings
    • UIMA Pipeline to transform raw EMR data to normalized EMR data based on mappings
  • Discussed how the team adopted a hybrid agile process: combining both top-down and bottom-up approaches
    • Identified gaps through normalizing real EMR data
    • Modified components to close the gaps
    • Evaluated the CEM results
    • Iteratively improved the pipeline processing
  • Various sources were utilized to obtain Medication information.
  • Legacy applications still leverage HL7 2.x messages to convey order information between systems.
  • CDA narrative and structured documents are coming on-line which convey snapshots of patient’s current state and medications.
  • Lessons Learned:
    • Open tools would be a great contribution to the interoperability. Examples:
      • mapping terminology, e.g., local codes to LOINC/HL7/SNOMED
      • mapping models, e.g., HL7 messages/CDA documents to CEMs, CEMs to ADL, etc.
      • generating sample instances
      • communicating information
        • browsers
        • generating documentation
  • Key Take Aways:
  • There is significant progress in creating different models for different use cases, including the CORE Noted Drug Model, CORE Standard Lab, CORE Patient, etc. These models help to define and constrain attributes for use cases such as secondary use, clinical trials, lab work, etc. Applications allow one to leverage attributes from and across different models to focus on specific topics/characteristics
  • There are also ongoing efforts to develop/improve/ and achieve common terminologies, mapping models, and natural language processing applications to support data collection, data extraction, and manipulation for specific research/problem targets
  • CMS meaningful use was mentioned as a new source of data/information that groups may pull data from
  • Lessons Learned
    • Open tools are needed to improve terminology, documentation, mapping models, sample instances, communicating information, etc. We need tools that allow others to implement results
    • One model doesn’t fit all – we have to find a way to share around disparate use cases. We must also understand each organization's decision-making processes, data structures, and related assumptions (most issues surround pre- and post-coordination)
    • Design of pipelines must be flexible – as CEM models are constantly changing
    • UIMA is a good architecture – due to its configuration and flexibility
    • Relational structures for demographics worked well – including a rich data set, collection of new patient information, and mapping capabilities. Challenges still exist with date formatting and storage.
  • High Level Notes:
  • Tom Oniki (CORE Model/Terminology) – focusing the clinical element model and terminology/value sets
    • Lots of data in predecessor models – creating clinical element models relating to EMRs and use cases which includes the “Core Model”
    • Core Model defines attributes for use cases – can constrain the attributes most important for use cases relating to secondary use, clinical trials, or labs
      • Different models are required for different use cases including:
        • CORE Noted Drug Model
        • CORE Standard Lab
        • CORE Patient
      • The secondary use model has been combined with the SHARP “reference class” – this compiler creates a structure that lists the most important attributes based on the CEM definitions and plugs them into the Lab XSD so that all attributes are expressed
    • Terminology/Value Sets
      • Terminology defines the valid values used in the models
        • Attributes can have qualifiers such as gender, etc. This CEM model is connected to a value set with standard terminology which will show that the appropriate terminology includes M and F (male and female), etc.
    • Q&A
      • Who would use the request site?
        • Right now it’s partners that we are working with on SHARP or GE. Someone that’s in development of a model/application or expanding an existing model. It’s not for the end user or others in the community.
  • Hongfang Liu (Data Normalization) -preparing and transforming raw data to normalize EMR data based on mapping
    • There are two kinds of mapping – including model and terminology
    • Pipeline includes what in the HL7 will be combined with CEMs
    • Model mapping worksheet (excel document) – includes HL7 message, descriptions, CEM attributes, and terminology mapping
    • Terminology Mapping – includes CEM fields, local code, target code, and target code systems to match targets
    • Pipeline – to implement in Unstructured Information Management Architecture (UIMA)
      • Data sources
      • Model mappings
      • Terminology mapping
      • Inference Mapping – (i.e.) ingredients from clinical drugs
    • Q&A
      • Can they start with a general model and get a 60% match?
        • Without seeing the data we don’t know what the local codes will look like. So we have two options: 1) the local institution has adopted a terminology set and 2) it is data driven and we are looking at the trends, to determine how data was generated, etc and try to come up with the mapping. If you have adopted standards, we can do a default configuration. We can sometimes face issues with codes and require updates to normalize the data.
      • Models are multi-valued? What if there is a difference between source and target?
        • We allow an array of structures. We can have two codes in the value set.
  • Calvin Beebe (Documenting Standards) – Identify gaps in normalizing real EMR data; modifying components to close gaps; evaluating CEM results; and iteratively improving
    • Using three sources: HL7pharmacy, CDA R1 clinical documents and CCD R1 Continuity of Care Documents
      • HL7 2.x encoding rules – provide 80% of the structure most need. The rules explain field usage, including occurrences in the field, etc. Fields are combined into logical topic segments that represent the data in a structured manner
        • Specifications will tell you what you will find and in what location
        • Segments include patient information, patient visit information, pharmacy information, etc. – provides a template and definition of each field
      • Patient Context in CDA Documents – includes a structured header that allows data to be easily extracted for meta-data, patient information, etc.
        • Natural Language Processing is used to pull additional information to fulfill medication models
        • CDA R1 and R2 support and require narrative content, support section codes, require clinical documents, and NLP
        • CCD Document – Medication Entry
          • Includes vocabularies and defines required sources
          • Documents include landmarks to call out important information/data
      • Summary – meaningful use includes new sources that groups are looking at to pull data from
  • Lessons Learned
    • Tom (Lessons Learned Modeling)
      • Open tools would be a great contribution to interoperability – tools that allow functionality (terminology, mapping models, sample instances, communication information). Tools that provide what we implemented and tools to help others do the same thing
      • Documentation is essential (this is hard) – need a tool to assist this
      • One model doesn’t fit all – find a way to share around disparate use cases
        • Most model issues boil down to pre and post-coordination and what to store in the model or terminology base
          • Through SHARPn we discovered that we conduct data storage in different ways which includes various assumptions. People are choosing different things (including differences in LOINC codes, lab test, display names, drug classes, etc)
          • Maybe we are in the experimental phase and some principles will appear. Right now we need to understand everyone’s decision making and learn from them
    • Hongfang Liu – (Lessons Learned Process)
      • Design of pipeline needs to be flexible to accommodate changes
        • CEM models keep changing, etc.
      • UIMA is a nice architecture
        • Configurable, model driven, seamless integration with NLP pipeline
        • Pan SHARP included a medical model into the pipeline
      • Diverse input formats
        • Structured – semantics are different
        • Unstructured – gap between free text and semantics of standards. Free text increases granularity levels
        • Different requirements for different use cases including medical rec, phenotyping, etc.
        • Too many standards to choose from when implementing HL7 standards
          • Need to understand local codes. Standards can also mean different things to different people.
        • Versioning of standards are crucial
          • Different granularities in the CEM – for example, dosage strengths, which require specific matching
        • Inference
    • Kyle Marchant (Lesson Learned Database)
      • Relational structure for demographics data worked well including a relational form
        • This gave us a nice look into the data in an easy way without having a great knowledge of the CEM structure
      • Sample data was useful – the more robust and rich the better for development and useful for clinical mapping
        • Made assumptions that codes were applied – therefore we took the data “as is”
      • Clinical CEM Channels were helpful
        • Provided three tables to be used to store data including Index Data, Source Data, and Patient Data – this was helpful for codes and to identify matching opportunities
      • Challenges
        • Date formatting – to allow for queries, select particular fields, etc.
        • Full level storage vs. XML Tradeoff – may need to look at additional relational fields
      • Leveraged the Mirth tool useful to move between highly structured formats, HL7 formats, etc
        • Choose to store patient data with new ID’s to match up demographic data at different time points – allows matching as data comes into the pipeline including names, addresses, etc.
    • Q&A
      • Someone mentioned "noted drug" – what does that mean?
        • We started with the drug order model. But sometimes it’s not an order but a note that states what the patient is on.
    • Vinod Kaggal (Sharp Normalization Implementation)
      • Architecture
        • Use model-driven implementation - we built upon configuration files and transformed data elements into UIMA representation that led to mapping, semantic normalization, etc
      • Syntactic Mapping Types (Constants, one to one, one to many, inference)
      • Semantic Mapping – including a number of rules and maps
      • Pipeline Execution
    • Q&A
      • Question regarding resources?
        • The National Library of Medicine publishes related data as well.
      • Does UIMA add additional noise?
        • We take sources and determine a format that UIMA can use. Then we generate something that can capture the data to be included in UIMA. There is no data lost.
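The two mapping steps discussed throughout this session (model mapping, then terminology mapping) can be sketched as follows. The HL7-like field names and the codes are made up for illustration; the real pipeline runs these steps inside UIMA against the mapping worksheets:

```python
# Minimal sketch of SHARPn-style normalization: a model mapping renames raw
# HL7-like fields into CEM attributes, then a terminology mapping replaces a
# local code with a standard code (all field names and codes are illustrative).

MODEL_MAP = {"OBX.3": "observation_code", "OBX.5": "value", "OBX.6": "units"}
TERMINOLOGY_MAP = {"GLU": ("1234-5", "LOINC")}  # local code -> (target, system)

def normalize(raw):
    # Step 1: model mapping (structure).
    cem = {MODEL_MAP[k]: v for k, v in raw.items() if k in MODEL_MAP}
    # Step 2: terminology mapping (semantics).
    local = cem.get("observation_code")
    if local in TERMINOLOGY_MAP:
        code, system = TERMINOLOGY_MAP[local]
        cem["observation_code"] = {"code": code, "system": system, "local": local}
    return cem

raw = {"OBX.3": "GLU", "OBX.5": "95", "OBX.6": "mg/dL"}
cem = normalize(raw)
assert cem["observation_code"]["system"] == "LOINC"
assert cem["units"] == "mg/dL"
```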

NLP Systems

  • 1030AM-1200PM, Classroom, Rm 417
Part I (30min): Comparative Study of Two NLP Framework Architectures Yixian Bian; Gunes Koru; Hongfang Liu, Ph.D.
Part II (20min): MCORES: A system for noun phrase coreference resolution for clinical records Andreea Bodnari; Peter Szolovits; Ozlem Uzuner, Ph.D.
Part III (20min): Multi-Scrubber: An Ensemble System for De-Identification of Protected Health Information. Anna Rumshisky Ph.D.; Ken Buford; Ira Goldstein; Ozlem Uzuner, Ph.D.
  • Meeting Notes:

Comparative Study of Two NLP Framework Architectures (Yixian Bian, Gunes Koru, Hongfang Liu)

  • Yixian provided an introduction to her study comparing UIMA and GATE.
  • Compared the two frameworks from three perspectives:
    • Software design quality, software maintenance, user’s manual
  • The study concluded that UIMA is better than GATE

MCORES: a system for noun phrase co-reference resolution for clinical records (Andreea Bodnari, Peter Szolovits, Ozlem Uzuner)

  • Andreea introduced her research with MCORES as a fundamental step in textual processing using various perspectives (greedy, etc.)
    • Used a medical corpus provided by i2b2 and looked at these concept types: persons, problems, treatments, tests
    • We created a co-reference resolution module with features including: phrase-level lexical, sentence-level lexical, syntactic, semantic as well as token distance, mention distance, all-mention distance, sentence distance, section match and distance
    • Classified using a C4.5 decision tree algorithm and classified pairs based on feature vectors.
    • Evaluated using the feature set, perspectives evaluation, and performance evaluation against an in house baseline and third party system as well as an evaluation metric including un-weighted averages of Recall, Precision, and various F-measures.
  • Conclusions:
    • Greedy perspective performs as well or better than single-perspective systems and multi-perspective system performs as well or better than single-perspective systems.
    • MCORES outperforms third party systems and an in-house baseline, improving co-reference resolution on clinical records.
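The mention-pair formulation can be sketched as below; a trivial hand-written rule stands in for the C4.5 decision tree and the full feature set used by MCORES:

```python
# Illustrative sketch of the mention-pair setup: generate candidate pairs with
# a couple of features and classify them (a simple rule stands in here for the
# C4.5 decision tree trained over the full MCORES feature vectors).

mentions = [
    {"id": 0, "text": "the patient", "sentence": 0, "type": "person"},
    {"id": 1, "text": "a chest x-ray", "sentence": 1, "type": "test"},
    {"id": 2, "text": "he", "sentence": 2, "type": "person"},
]

def features(m1, m2):
    return {"type_match": m1["type"] == m2["type"],
            "sentence_distance": m2["sentence"] - m1["sentence"]}

def corefer(f):
    # Stand-in rule: same semantic type within two sentences.
    return f["type_match"] and f["sentence_distance"] <= 2

links = [(a["id"], b["id"]) for i, a in enumerate(mentions)
         for b in mentions[i + 1:] if corefer(features(a, b))]
assert links == [(0, 2)]  # "the patient" ... "he"
```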

Multi-Scrubber: An Ensemble System for De-identification of Protected Health Information (Anna Rumshisky, Ira Goldstein, Ken Buford, Ozlem Uzuner)

  • Anna presented a de-identifier system created at MIT and SUNY.
  • The problem: clinical records vary between institutions, de-identification systems that perform well on one format do poorly on another, and sharing annotated data between institutions is difficult due to HIPAA.
  • Multi-Scrubber is a meta-learner built on top of base learners (NER, jCarafe, LBJNET, Simple Dictionary Tagger) to hopefully increase integration/interoperability.
  • The meta-learner is trained on the output of the base classifiers over the annotated data
  • Rationale is that it should perform as well as the best of the base models
  • Conclusions:
    • Meta-model performs at least as well as its best-performing base model on each PHI category, especially on smaller data sizes.
    • Models generated on external data are reasonably helpful for new data
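The stacking idea can be sketched minimally; here a majority vote stands in for the trained meta-learner, and the tags are illustrative PHI categories:

```python
# Sketch of the ensemble idea: base de-identifiers each tag a token, and a
# meta-model combines their outputs (a simple majority vote stands in here
# for the meta-learner trained on base-classifier output).
from collections import Counter

def meta_predict(base_tags):
    """base_tags: PHI tags from the base systems for one token."""
    tag, _ = Counter(base_tags).most_common(1)[0]
    return tag

# Three base systems disagree on whether a token is a patient name:
assert meta_predict(["NAME", "NAME", "O"]) == "NAME"
assert meta_predict(["O", "O", "DATE"]) == "O"
```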

High-throughput Phenotyping

  • 100PM-230PM, Lecture Hall, Rm 414
Part I (30min): SHARPn HTP Introduction & Applications Jyotishman Pathak, Ph.D.
Part II (30min): Using EHR for Clinical Research Vitaly Herasevich, M.D., Ph.D.
Part III (30min): Association Rule Mining and Type 2 Diabetes Risk Prediction; Gyorgy Simon, Ph.D.
  • Meeting Notes:

High Level Notes:

  • Jyotishman Pathak (SHARPn HTP)
    • Meaningful use includes increasing adoption of EHRs
    • We are looking for scalable methods/resources to implement across multiple settings
    • Developed a suite of software programs that enable identification of subjects – disease, symptoms, etc.
    • Goal is not to do traditional work – but to create algorithmic and electronic means that allow specific phenotypes for co-variation, population research, clinical workflow, genotype-phenotype association results, and expansion across different settings
  • Lessons Learned
    • Algorithm design and transportability
    • Standardized data access and representation
  • Quality Data Model
    • Used Clinical Element Models to provide a structure that can be used to encode standard terminologies
    • Leveraged the NQF Data Model – that allows you to use meaningful use data (Phase 1 quality measures) vs. data definitions
      • Includes a measure authoring tool that allows view/comparisons across different data sets
      • Generating rules to implement on top of the data models
      • Investigated the JBoss management system which has been effective in healthcare and financial industries
      • Developed National Library – that allows data query
      • Leveraged the SHARP cloud – secure VPN to execute and to avoid hacking
      • Research related activities – experimenting with machine learning and associated rule mining - to determine identification of phenotype definition criteria and work flow presentation including algorithms for decision making
      • Working with the local Transformation Medical Group – this involves going back into the clinical data and running queries. This is now being done proactively and online including a decision support based system to actively monitor patients after blood transfusions to determine adverse events and to support active surveillance.
      • Research also includes efforts to conduct clustered research
  • Vitaly Herasevich (Data marts)
    • There are many databases/systems collecting patient information – including many data points
    • Australian Incident Monitoring System (AIMS) was mentioned as being successful and was expanded to additional hospitals
    • Anesthesia Data Mart – 5 min delay, rich data set
      • Rule 1: Leveraged demographic data; and included data feeds that were useful for specific reasons
      • Rule 2: Allowing queries from multiple locations
      • Rule 3: Raw data – including original data sources leveraging clinical and meaningful data.
    • Approach technically
      • Tables are divided by years
      • Involves institution support
      • EAV structure
      • Continuously “Testing-production”
      • Test – production DBs
    • Data integrity
      • Areas of implementation
      • APACHE replacement project – APACHE IV – calculates scores based on the data model
      • Free text search – though there is a need to map readmission
      • Clinical reports
    • METRIC Reports – developed reports (monthly, ad hoc, and customized) – for leadership review, etc.
      • Dashboards
        • Generate useful information for clinicians
        • Administrative and Clinical Support
        • Sniffers
    • Future
      • Point of care novel user interfaces, alerts, and decision supports
      • Reporting
      • Research
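The EAV (entity-attribute-value) structure mentioned under the technical approach can be sketched as a single narrow table; the attribute names below are hypothetical:

```python
# Illustrative sketch of an EAV layout: one narrow table holds heterogeneous
# clinical facts without needing a new column per variable (attribute names
# and values are made up).

rows = [
    ("pt-1", "heart_rate", "72"),
    ("pt-1", "patient_position", "supine"),
    ("pt-2", "heart_rate", "88"),
]

def attributes_for(entity, rows):
    """Pivot the EAV rows for one entity back into an attribute dict."""
    return {attr: value for ent, attr, value in rows if ent == entity}

assert attributes_for("pt-1", rows) == {"heart_rate": "72",
                                        "patient_position": "supine"}
```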
  • Gyorgy Simon (Diabetes)
    • Focus was on what distinguishes those who progressed vs. those who didn't
    • Data set included Patient data set, co-morbidities, age, follow-up, and diabetes outcomes
    • Using predictive and computational modeling
      • Looking to identify risk factors and interactions
      • Leveraged regression analysis
      • Association Rule Mining
    • Challenges – missing data, clinical question, computation efficiency
    • Diabetes Disease Network Reconstruction
      • To determine diseases that relate to diabetes and the risk relationships carry
    • Results
      • DM Progression Risk Prediction
      • Comparison to machine learning methods
      • Can the model determine whether a patient will have DM within 4.5 years?
    • Q&A
      • Can the Bayesian network be applied?
        • It is possible to use a Bayesian network. The key is finding interactions, and it is not clear that such a network makes interactions easier to extract. There is a difference between interactions and association rules
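The EAV structure mentioned in the anesthesia data mart notes can be sketched with an in-memory database. The table and column names below are illustrative, not the actual data mart schema.

```python
import sqlite3

# In-memory DB for illustration; the real data mart partitions tables by year.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        patient_id TEXT,   -- entity
        attribute  TEXT,   -- e.g. 'heart_rate', 'asa_class'
        value      TEXT,   -- stored as text; cast on read
        observed   TEXT    -- ISO timestamp
    )""")
rows = [
    ("p1", "heart_rate", "72", "2012-06-11T08:00"),
    ("p1", "asa_class",  "2",  "2012-06-11T08:00"),
    ("p2", "heart_rate", "95", "2012-06-11T08:05"),
]
conn.executemany("INSERT INTO observations VALUES (?,?,?,?)", rows)

# New attributes need no schema change -- the main appeal of EAV.
def values_for(attribute):
    cur = conn.execute(
        "SELECT patient_id, value FROM observations WHERE attribute=?",
        (attribute,))
    return dict(cur.fetchall())

print(values_for("heart_rate"))
```

The trade-off noted in practice is that EAV makes writes and schema evolution easy while making wide, per-patient queries more expensive.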
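Association rule mining, mentioned in the diabetes talk, reduces to support and confidence computations over patient itemsets. This is a minimal stdlib sketch on synthetic data, not the study's actual method or cohort.

```python
# Synthetic patient records: sets of conditions (illustrative only).
patients = [
    {"obesity", "hypertension", "diabetes"},
    {"obesity", "diabetes"},
    {"hypertension"},
    {"obesity", "hypertension", "diabetes"},
    {"obesity", "hypertension"},
]

def support(itemset):
    # Fraction of patients whose record contains the whole itemset.
    return sum(itemset <= p for p in patients) / len(patients)

def confidence(antecedent, consequent):
    # Of patients matching the antecedent, how many also match the consequent.
    return support(antecedent | consequent) / support(antecedent)

# Rule {obesity, hypertension} -> {diabetes}
rule_conf = confidence({"obesity", "hypertension"}, {"diabetes"})
```

The distinction raised in the Q&A shows up here: a rule's antecedent captures an interaction among several conditions jointly, which is not the same as a set of pairwise associations.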

NLP Software Demos

  • 100PM-230PM, Innovation Lab, Rm 415
Part I: UIMA Introduction James Masanz
  • Discussion around the UIMA base underneath cTAKES and what was added to that base.
  • Serializing output to a DB rather than XML is possible, especially with the GUI discussed next.
  • There was agreement that the XML descriptors are hard to use. They do allow tools to be written on top of them, which is beginning to happen now.
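As an illustration of tooling written over descriptor XML, here is a minimal sketch that pulls configuration parameter names out of a UIMA-style descriptor. The XML below is a simplified stand-in, not the real UIMA descriptor schema.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a UIMA analysis-engine descriptor (not the real schema).
descriptor = """
<analysisEngineDescription>
  <configurationParameters>
    <configurationParameter><name>ChunkerModel</name><type>String</type></configurationParameter>
    <configurationParameter><name>WindowSize</name><type>Integer</type></configurationParameter>
  </configurationParameters>
</analysisEngineDescription>
"""

root = ET.fromstring(descriptor)
params = [p.findtext("name") for p in root.iter("configurationParameter")]
print(params)  # -> ['ChunkerModel', 'WindowSize']
```

A tool like this can surface the editable knobs of a pipeline without users ever touching the raw XML, which is the point made in the discussion.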
Part II: cTAKES Tutorial and GUI Demo Pei Chen
  • The GUI was designed to make UIMA/cTAKES easier to use. It was built on top of uimaFIT and is in beta.
  • The GUI demo covered configuration, viewing results, adding annotators, etc.
  • A plug-in framework is planned, in which a user could browse for annotators and the system would download and install anything selected. Other potential ideas were discussed, such as hooking into machine learning.
  • Hypersonic (HSQLDB) is the current out-of-the-box DB for a central repository.
  • This appears to be useful to UIMA-only users, not just cTAKES users.
  • The UMLS subset that cTAKES uses is still bundled with this. Users will be prompted for credentials at use time.
Part III: ClearTK Tutorial Steven Bethard, Ph.D.
  • ClearTK adds machine-learning support on top of standard UIMA.
  • Discussion covered the list of features, what an outcome object is, and how those are unique.
  • We saw the mechanism (a non-XML file) used to launch pipelines.
  • Other additions were discussed, such as sequence tagging, chunking, evaluation, regression and ranking, and baselining.
  • Using it adds a small amount of overhead, but that overhead is minor compared to the pipeline stages that take most of the time.
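The feature/outcome separation ClearTK adds on top of UIMA can be mimicked in a few lines: features are extracted as name→value pairs and handed, with an outcome label, to any classifier. This is a stdlib-only Python sketch of the idea, not ClearTK's actual Java API.

```python
def extract_features(token):
    # Feature extraction: name -> value pairs, decoupled from the classifier.
    return {"lower": token.lower(), "is_digit": token.isdigit(), "suffix": token[-2:]}

# Training instances pair a feature dict with an outcome label.
training = [(extract_features(t), label) for t, label in
            [("75", "DOSE"), ("mg", "UNIT"), ("12", "DOSE"), ("ml", "UNIT")]]

def classify(features):
    # Trivial 1-nearest-neighbour over shared feature values (illustration only).
    def overlap(inst):
        return sum(inst[0].get(k) == v for k, v in features.items())
    return max(training, key=overlap)[1]

print(classify(extract_features("50")))  # -> DOSE
```

Because the feature extractor knows nothing about the learner, the same pipeline can swap in a different classifier, which is the design choice ClearTK makes.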
Part IV: Evaluation Workbench Lee Christensen (Paul Rodriguez)
  • The evaluation workbench is all about reliability metrics: comparing two annotators.
  • The workbench loads metrics into a central window for you to browse through. You can configure which annotator is primary or secondary.
  • cTAKES annotations have been run through the workbench recently.
  • Discussed the future items that may be coming in a beta.
  • Is it possible to view the CAS? You may be able to, but it's not built for that purpose.
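The kind of reliability comparison the workbench performs can be sketched by scoring one annotator's spans against the other's. Exact-span matching here is a simplification of the workbench's configurable matching.

```python
def prf(primary, secondary):
    # Treat the primary annotator as the reference and score the secondary against it.
    tp = len(primary & secondary)          # exact-span matches
    precision = tp / len(secondary)
    recall = tp / len(primary)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

primary   = {(0, 4), (10, 18), (25, 30)}  # (start, end) character offsets
secondary = {(0, 4), (10, 18), (40, 44)}
p, r, f = prf(primary, secondary)
```

Swapping which annotator is "primary" simply exchanges precision and recall, which is why the workbench lets you configure the roles.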

Clinical Element Model Presentations

  • 100PM-230PM, Classroom, Rm 417

CDISC CEMs Harmonization (Guoqian Jiang)

Paper I: Harmonization of SHARPn Clinical Element Models with CDISC SHARE Clinical Study Data Standards Julie Evans; Guoqian Jiang, Ph.D.
Paper I: CSHARE CEMs Harmonization Slides

OpenCEM Wiki (Guoqian Jiang)

Poster: OpenCEM Wiki: A Semantic-Web-based Repository for Supporting Harmonization of Clinical Study Data Standards and Clinical Element Models Guoqian Jiang, Ph.D.
Poster: OpenCEMWiki Slides
Demo I: CEMs to CDISCs SHARE Metadata Repository Landen Bain
Paper II: Pharmacogenomics Data Standardization using Clinical Element Models Qian Zhu, Ph.D.; Robert Freimuth, Ph.D.
Demo II: CEM-OWL Cui Tao, Ph.D.
  • Meeting Notes:

CEMs to CDISC SHARE Demonstration (Landen Bain)

  • Wayne Kubick, CTO of CDISC, introduced the topic of metadata standards and CDISC, walking through the clinical research process from planning and design through conduct, completion, and application. The way data is represented across these stages differs. CDASH standardizes the collection of data in the clinical research process. Data management aligns data into tables for analysis. Domain-friendly, subdomain-specific business models are represented in BRIDG.
  • CDISC SHARE is a global electronic metadata library that uses BRIDG for its standardized data element definitions, relationships, and rich metadata. SHARE takes all of the different variables and represents the broader scientific concepts to enable interoperability. The goal is to link research terms and elements to healthcare and analytic concepts.
  • Demonstration in three acts
    • A research data manager creating an annotated case report form, pulling data elements from the SHARE repository. Domains included demographics, labs, and meds.
    • At the healthcare site of clinical care, with an enrolled clinical-trial patient: pre-populating the case management form and extracting data to generate a CCD.
    • The sponsor side receiving the CCD and pre-populating result forms using data contained in the CCD. The physician can reconcile.
  • Question: how do you control the quantity of data? What they think doesn't work is always including the provider in the workflow; instead, different rules are needed through site-specific algorithms. Discussed the CCD's 'relevant' medications: how do you compute 'relevant'? Time windows can be assigned within protocols.

Pharmacogenomics Data Standardization using CEMs (Zhu, Friemuth)

  • Pharmacogenomics Research Network (PGRN)
    • Diverse network of PGx research sites
    • Goal: Understand how genetic variations affect an individual's response to medications
  • Normalize data representations
    • Disease phenotypes
    • Drugs and drug classes
  • The PGRN standardization effort collected data dictionaries totaling 4,483 variables across sites. The studies in PGRN use a variety of data representations. The metadata was run through data pre-processing into a centralized database. Components were mapped and semantically annotated, and variables were categorized into demographics, disease/disorder, laboratory, medication, clinical observations, and smoking status. Categorized variables could then be mapped to SHARPn CEMs. In the study's results, 54% of the variables could be mapped to CEMs.
  • Some variables are not currently represented by PHONT (SHARP) CEMs
    • Computed research data (e.g., PK/PD)
    • Genomic data
    • Psychometric data
  • Work with SDOs to address these gaps
    • CIMI community on extant or new CEMs
    • HL7 and CDISC for clinical genomics data
    • W3C, NLM, & SNOMED PGx ontologies
  • Conclusion - demonstrated CEMS can be used to normalize genomics study data dictionaries.
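The mapping step described above, categorizing study variables and computing the fraction that land on a CEM, can be sketched as follows. The variable names are invented; the 54% figure comes from the talk, not from this toy data.

```python
# Toy study data dictionary: variable -> CEM category (None = unmapped).
variable_to_cem = {
    "pt_age":      "Demographics",
    "hba1c":       "Laboratory",
    "statin_dose": "Medication",
    "pk_auc":      None,   # computed PK/PD data: no CEM yet
    "snp_rs_x":    None,   # genomic data: no CEM yet
}

mapped = [v for v, cem in variable_to_cem.items() if cem]
coverage = len(mapped) / len(variable_to_cem)
print(f"{coverage:.0%} of variables mapped")  # -> 60% of variables mapped
```

The unmapped remainder is exactly what the talk flags as gaps (PK/PD, genomic, and psychometric data) to be taken to CIMI, HL7/CDISC, and the ontology SDOs.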

CEM-OWL Demonstration (Tao)

  • A Semantic-Web Representation of Clinical Element Models
  • Semantic Web
    • Explicit and formal semantic knowledge representation
    • Web Ontology Language (OWL) /Resource Description Framework (RDF):
      • Define relationships
      • Define classes
      • Define constraints
  • Consistency checking
  • Link to other domain terminologies
  • Harmonize with other clinical data modeling languages
  • Semantic reasoning
  • Implementation Status
    • Meta Ontology
      • Basic category classes
      • Properties
      • Cardinality constraints
      • Two OWL experts and two CEM experts have evaluated the meta-ontology to ensure it can faithfully cover the original contents
  • Automatic converter: detailed CEM ontologies
  • Conclusions and Future Direction
    • Meta-Ontology: semantically defined the basic classes, properties, their relationships, and constraints
    • Convertor: CDLOWL
    • Represent SHARPn normalized data using RDF
    • Investigate SWRL/Drools combination for phenotyping
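The converter's job, turning a CEM definition into RDF/OWL statements, can be sketched as generating triples. The namespace, property names, and CEM structure below are illustrative, not the actual CEM-OWL vocabulary.

```python
def cem_to_triples(cem):
    base = "http://example.org/cem/"   # illustrative namespace, not the real one
    s = base + cem["name"]
    triples = [(s, "rdf:type", "owl:Class")]
    for comp in cem["components"]:
        triples.append((s, "cem:hasComponent", base + comp["name"]))
        # Cardinality constraints become OWL restrictions in a real conversion.
        triples.append((base + comp["name"], "cem:maxCard", str(comp["max"])))
    return triples

cem = {"name": "PulseRateMeas",
       "components": [{"name": "PulseQuality", "max": 1}]}
triples = cem_to_triples(cem)
```

Once the models are in RDF, the benefits listed above (consistency checking, linking to terminologies, semantic reasoning) come from standard Semantic Web tooling rather than custom code.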

EHR Use Cases

  • 300PM-500PM, Lecture Hall, Rm 414
Poster (15min): Scenario-based Requirements Engineering for Developing Electronic Health Records Add-ons to Support Comparative Effectiveness Research in Patient Care Settings Junfeng Gao, Ph.D.
Regenstrief Institute (60min): Data Normalization / Clinical Data Repositories Daniel Vreeman, PT, DPT; Marc Rosenman, M.D. :(15min)
Q&A
  • Meeting Notes:
    • Junfeng Gao (EHRs to support CER)
      • Research visit scheduling is complex and under-supported, involving workflows, regulations, policies, and the diversity of clinical settings
      • Secondary use of clinical appointment data from EHRs requires understanding the requirements
      • Clinical and research visits are often separate; better coordination is needed so that clinical visits can also serve research as a byproduct, reducing workflow burden and inconvenience for patients
      • Objectives
        • Develop EHR add-ons to support CER
        • Scenario based requirements methods to deepen understanding of user requirements
      • Structured interviews and thematic analysis
        • Alignment of research and clinical visits
        • Workflow flexibility support
        • Data Governance policies
        • Constraints visualization and satisfaction
        • Interoperability with other systems
        • Reminder for upcoming visits
        • Privacy and regulatory compliance
      • Discussion
        • Results identified user opinions
        • Guided the development of EHR add-ons
        • Identified limitations
          • Community based research coordinators
          • Coders
        • Conclusion
          • Scenario based requirements engineering approach was effective for identifying user needs to improve the efficiency of patient visit scheduling
    • Daniel Vreeman (Challenges and Successes in Building a Community-Wide EHR)
      • Current systems are isolated; we are looking for opportunities to connect them, building bridges across islands of data ("canopy computing")
      • Working with the Indiana Network for Patient Care (INPC): an informatics shop that standardizes data, provides services and tools, and makes them available to others in the community
        • Participation includes over 90 institutions
        • Includes global patient and provider indexes; local terminologies are mapped to standards; processes more than a million transactions per day
      • Public Health Uses
        • Electronic Laboratory Reporting
          • Monitoring system that identifies data to be reported to health departments, including infections, STDs, etc.
          • Four times greater yield, and faster, at identifying this information and sending reports to public health officials
          • Quality Measurement: includes agreed-upon measures with a community-wide view to avoid duplication across researchers/organizations
        • Clinical Research – all researchers leverage this rich data set including review of the clinical recruitment process, etc.
        • Researched organizational connections across Indiana systems/organizations
        • Lessons Learned
          • Iterate and Leverage – start with things that you know vs. things you have to research or solicit from stakeholders
          • Build a Comprehensive View
          • Units of measure create challenges and/or are missing
          • Mapping local to standard terminologies is challenging; it requires expertise and effort that are often underestimated
        • Prioritization
          • We must push standardization upstream
        • Q&A:
          • Does 41 include across the states?
            • Yes, 110 out of 114 hospitals.
          • How many individuals opt out of the system?
            • This is very rare to opt out. INPC makes data available with constraints. A provider can ask us to fax over information for a patient and INPC will do this automatically.
            • Patients can opt out of recruitment to RCT. This is more common than opting out of the INPC
    • Marc Rosenman
      • Clinical Data Repository – research and clinical repository, and population management (Beacon and governance committee)
        • Going Live in June 2012 and measures for ONC will occur in July 2012
        • Future plans
          • Add remaining counties
          • Data feeds (as approved by governance)
          • Establish repositories
      • Q&A
        • One slide you showed different types of yellows? Could this result to the characters?
          • In the laboratory they can select and write what they want or choose yellow.
        • Was that the raw dictionary?
          • This was text for display which is not normalized.
        • There are no value sets?
          • Yes, but they did not come with a code for mapping.
          • A solution would be to create a value set for the answers you want then a thesaurus to map to the concept.
        • How much of the data goes back upstream? How much teaching do you do backwards?
          • Yes, we do. I let cardiologists know about divergences in their systems and had them look up examples to share with programmers to fix issues.
          • This is good to do, but it's a huge job because many different organizations do things differently. There is an opportunity here, but it would involve the senders and require work: first get agreement that this is a good thing, then decide where to go from here and move forward.
          • Offering this as a service would illustrate the value and help motivate others.
        • Where is the value to the individuals? The group putting in the data is not the group using it, so we need to standardize and normalize up front.
          • We have 75 institutions and obtaining consensus requires a lot of sophistication.
        • Will you be doing mapping into LOINC?
          • In general, a typical laboratory catalog will include 1000-3000 test codes.
        • How many laboratories are we talking about? Are you also asking them to use structured vs. non-structured codes?
          • Yes. They can add a concept to the test catalog without thinking about the downstream effect. The test catalog can vary: sometimes the same test at one organization has a different code at another site. A lot of our thinking is around the dictionaries, and we tend to map as we go.
        • One of the challenges statewide in Minnesota is whether or not to encourage adoption of SNOMED. At what point do you decide it's feasible versus the early mapping model, which is important? CDC is trying to move to adoption for reportable conditions at the local level.
          • This could be done in parallel with mapping at the local site; we have to balance the incentives. The scalability of this is an open question, and it aligns with having the right incentives. To your first question: I know of 80,000 local codes in the labs that are mapped to 8,000 LOINC codes.
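The many-to-one nature of local-to-LOINC mapping mentioned in the last answer (roughly 80,000 local codes onto 8,000 LOINC codes) can be sketched as a thesaurus lookup keyed by site and local code. The local codes below are invented; the two LOINC codes are real glucose and potassium serum/plasma codes.

```python
# Illustrative local lab catalogs from two institutions mapping to shared LOINC codes.
local_to_loinc = {
    ("hosp_a", "GLU"):   "2345-7",   # Glucose [Mass/volume] in Serum or Plasma
    ("hosp_a", "GLUC1"): "2345-7",   # two local codes, one LOINC concept
    ("hosp_b", "K"):     "2823-3",   # Potassium [Moles/volume] in Serum or Plasma
}

def normalize(site, local_code):
    # None signals an unmapped code that needs terminologist review.
    return local_to_loinc.get((site, local_code))

print(normalize("hosp_a", "GLUC1"))  # -> 2345-7
```

This is the "map as we go" model from the discussion: the lookup table grows incrementally, and unmapped codes surface as work items rather than silently passing through.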

cTAKES Coding Sprint

  • 300PM-500PM, Innovation Lab, Rm 415
  • Meeting Notes:
  • Many folks installed cTAKES 2.5. One issue was that using UMLS requires a user ID due to licensing. Users will need to get their own license once back at the office.

Data Normalization Deep-Dive

  • 300PM-500PM, Classroom, Rm 417
Part I (30min): Mapping EHR Data to SHARPn Use Cases Tom Oniki, Ph.D.; Hongfang Liu, Ph.D.; Susan Rea Welch, Ph.D.
Part II (15min) Standards Utilized for Source Materials Calvin Beebe
Part III (15 min): Terminology Services Harold Solbrig
Part IV (15min): Persistence DB Structure Kyle Marchant
Part V (30min): An End-To-End Evaluation Framework Peter Haug, M.D.
(15min): Facilitated Discussion on Future Build
  • Meeting Notes: this was a technical deep-dive session.

Day's Report Out

  • Standards:
    • Too many standards to choose from when implementing HL7 standards
    • Versioning of standards is crucial
      • Do not assume the mapping will be trivial even if the EMR data has already adopted the same standard as the SHARPn value sets
  • Modeling:
    • The root of all modeling questions: precoordination vs. postcoordination, and what to store in the model instance vs. leave in the terminology
    • “One model fits all” won’t work
      • Clinical Trials (e.g., CDISC CSHARE) vs Secondary Use (e.g., SHARPn)
      • Proprietary EMR (e.g., GE Qualibria) and Open Secondary Use (e.g., SHARPn)
      • value set differences
    • Code re-use across the Clinical CEM Channels proved very useful
  • Normalization Pipeline:
    • The design of the pipeline needs to be flexible enough to accommodate all kinds of changes – (agile)
    • Different requirements for different use cases: the normalization tasks are the same, but the necessary fields can differ between use cases
    • Relational structure for the Demographics data worked well
    • XML samples proved very valuable for validation against the XSDs and for providing an initial set of test messages to Channels.
    • Mirth support for XML, HL7, etc. proved very useful for traversing structures in code and for field validations.
  • NLP
    • Good discussion, very technical.
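The warning above, that adopting the "same standard" does not make mapping trivial, can be made concrete by diffing two value sets that nominally follow the same terminology. The codes below are illustrative placeholders, not actual SHARPn value-set members.

```python
# Two value sets that both claim the same standard, built at different times/sites.
sharpn_value_set = {"55561003", "73425007"}   # illustrative codes only
emr_value_set    = {"55561003", "7087005"}

# Codes the EMR sends that the SHARPn value set does not recognize.
only_in_emr = emr_value_set - sharpn_value_set
needs_mapping_work = bool(only_in_emr)
print(sorted(only_in_emr))
```

A simple set difference like this, run before go-live, turns the "mapping will be trivial" assumption into a measurable gap list.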