SHARP Project Wiki:Project Background
We propose research that will generate a framework of open-source services that can be dynamically configured to transform EHR data into standards-conforming, comparable information suitable for large-scale analyses, inferencing, and integration of disparate health data. We will apply these services to phenotype recognition (disease, risk factor, eligibility, or adverse event) in medical centers and population-based settings. Finally, we will examine data quality and repair strategies with real-world evaluations of their behavior in Clinical and Translational Science Awards (CTSAs), health information exchanges (HIEs), and Nationwide Health Information Network (NwHIN) connections.
We have assembled a federated informatics research community committed to open-source resources that can scale industrially to address barriers to the broad-based, facile, and ethical use of EHR data for secondary purposes. We will collaborate to create, evaluate, and refine informatics artifacts that advance the capacity to efficiently leverage EHR data to improve care, generate new knowledge, and address population needs. Our goal is to make these artifacts available to the community of secondary EHR data users as open-source tools, services, and scalable software. In addition, we have partnered with industry developers who can make these resources available through commercial deployment. We propose to assemble modular services and agents from existing open-source software to improve the utilization of EHR data for a spectrum of use cases, focusing on three themes: Normalization, Phenotypes, and Data Quality/Evaluation. Each of our six projects spans one or more of these themes, and together they constitute a coherent ensemble of related research and development. Finally, these services will have open-source deployments as well as commercially supported implementations.
There are six strongly intertwined, mutually dependent projects: 1) Semantic and Syntactic Normalization; 2) Natural Language Processing (NLP); 3) Phenotype Applications; 4) Performance Optimization; 5) Data Quality Metrics; and 6) Evaluation Frameworks. The first two projects align with our Normalization theme; Phenotype Applications and Performance Optimization span themes 1 and 2 (Normalization and Phenotypes); and the last two projects correspond to our third theme, Data Quality/Evaluation.
ONC & PAC Reports
- Aberdeen J. NLP techniques for clinical record de-identification, presentation to AcademyHealth Annual Research Meeting, Seattle, June 12-14, 2011.
- Chapman W, Nadkarni P, Hirschman L, D’Avolio L, Savova G, Uzuner O. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association. 2011:1-4. doi:10.1136/amiajnl-2011-000465.
- Choi J, Palmer M. Getting the most out of Transition-based Dependency Parsing. In the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011, June 19-24, 2011, Portland, OR.
- Choi J, Palmer M. Transition-based Semantic Role Labeling Using Predicate Argument Clustering. In the Proceedings of RELMS 2011: Relational Models of Semantics, held in conjunction with ACL-HLT 2011, June, 2011, Portland, OR.
- Chute CG, Pathak J, Savova GK, Bailey KR, Schor MI, Hart LA, Beebe CE, Huff SM. The SHARPn Project on Secondary Use of Electronic Medical Record Data: Progress, Plans and Possibilities. AMIA 2011 (paper).
- Clark C. Recent efforts in clinical NLP: Uncertainty discovery through NLP, presentation to Natural Language Processing Workshop, i2b2 Academic Users Group, Boston, June 28, 2011.
- Conway MA, Berg RL, Carrell D, Denny JC, Kho AN, Kullo IJ, Linneman JG, Pacheco JA, Pessig PL, Rasmussen L, Weston N, Chute CG, Pathak J. Analyzing Heterogeneity and Complexity of Electronic Health Record Oriented Phenotyping Algorithms. AMIA 2011 (paper).
- Conway MA, Pathak J. Analyzing the Prevalence of Hedges in Electronic Health Record Oriented Phenotyping Algorithms. AMIA 2011 (poster).
- Dligach D, Palmer M. Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling. In the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011, June 19-24, 2011, Portland, OR.
- Dligach D, Palmer M. Reducing the Need for Double Annotation. In the Proceedings of the Fifth Linguistic Annotation Workshop (LAW V) held in conjunction with ACL-HLT 2011, June, 2011, Portland, OR.
- Hirschman L. Evaluation as a driver in Software Communities, presentation to Workshop on Designing an Ecosystem for Clinical NLP, Integrating Data for Analysis, Anonymization and Sharing (iDASH), University of California, San Diego, May 2-3, 2011.
- Liu H, Wagholikar K, Wu S. Using SNOMED CT to encode summary level data - a corpus analysis. AMIA CRI 2012.
- MITRE System for Clinical Assertion Status Classification, JAMIA 2011; Published Online First: 22 April 2011 doi:10.1136/amiajnl-2011-000164.
- Rea S, Pathak J, Savova GK, Oniki TA, Westberg L, Beebe CE, Tao C, Parker CG, Haug PJ, Huff SM, Chute CG. Building a Robust, Scalable and Standards-Driven Infrastructure for Secondary Use of EHR Data: The SHARPn Project. Second stage of review at JAMIA.
- Savova G, Olson J, Murphy S, Cafourek V, Couch F, Goetz M, Ingle J, Suman V, Chute C, Weinshilboum R. The electronic medical record and drug response research: automated discovery of drug treatment patterns for endocrine therapy of breast cancer. Journal of the American Medical Informatics Association. 2011.
- Savova GK, Chapman WW, Elhadad N, Palmer M. Shared annotated resources for the clinical domain. AMIA Annual Symposium 2011 (panel).
- Sohn S, Kocher J-P, Chute CG, Savova GK. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. JAMIA 2011; 18:i144-i149.
- Sohn S, Wu S. Dependency Parser-based Negation Detection in Clinical Narratives. AMIA CRI 2012.
- Tao C, Parker CG, Oniki TA, Pathak J, Huff SM, Chute CG. An OWL Meta-Ontology for Representing the Clinical Element Model. AMIA 2011 (paper).
- Tao C, Welch SR, Wei WQ, Oniki TA, Parker CA, Pathak J, Huff SM, Chute CG. Normalized Representation of Data Elements for Phenotype Cohort Identification in Electronic Health Record. AMIA 2011 (poster).
- Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. JAMIA 2011 Sep-Oct; 18(5):580-7.
- Wagholikar K, Torii M, Jonnalagadda S, Liu H. Feasibility of pooling annotated corpora for clinical concept extraction. AMIA CRI 2011.
- Wu ST, Kaggal VC, Savova GK, Liu H, Dligach D, Zheng J, Chapman WW, Chute CG. Generality and Reuse in a Common Type System for Clinical Natural Language Processing. Proceedings of the First International Workshop on Managing Interoperability and compleXity in Health Systems. Glasgow, Scotland. 2011.
- Wu S, Liu H. Semantic Characteristics of NLP-extracted Concepts in Clinical Notes vs. Biomedical Literature. Proceedings of the Annual AMIA Fall Symposium. Washington DC. 2011.
- Wu S, Liu H, Li D, Tao C, Musen M, Chute CG, Shah N. UMLS Term Occurrences in Clinical Notes: A Large-scale Corpus Analysis. AMIA CRI 2012.
- Wu S, Wagholikar K, Sohn S, Kaggal V, Liu H. Empirical Ontologies for Cohort Identification. Text REtrieval Conference. 2011.
- Zheng J, Chapman W, Miller T, Lin C, Crowley R, Savova G. In Press. A system for coreference resolution for the clinical narrative. Journal of the American Medical Informatics Association.
- Open Health Natural Language Processing (OHNLP) Consortium, www.ohnlp.org, Last Access Date: January 20, 2010
- clinical Text Analysis and Knowledge Extraction System (cTAKES), www.ohnlp.org, Last Access Date: January 20, 2010
- Clinical Element Model (CEM), www.clinicalelement.com, Last Access Date: January 20, 2010
- Chute C, Beck S, Fisk T, et al.: The Enterprise Data Trust at Mayo Clinic: A semantically integrated warehouse of biomedical data. JAMIA, in press
- Health Open Source Software Collaborative, https://mi.regenstrief.org/wiki/display/hoss/Health+Open+Source+Software+Collaborative;jsessionid=C5FA8654DE95870C84B9925C66082FC8, Last Access Date: January 20, 2010
- LexBig and LexEVS, https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/LexBig_and_LexEVS, Last Access Date: January 20, 2010
- National Center For Biomedical Ontology, http://www.bioontology.org, Last Access Date: January 20, 2010
- Pathak J, Solbrig HR, Buntrock JD, et al.: LexGrid: A Framework for Representing, Storing, and Querying Biomedical Terminologies from Simple to Sublime. Journal of the American Medical Informatics Association 16:305-315, 2009
- MirthConnect, www.mirthcorp.com/community/mirth-connect, Last Access Date: January 20, 2010
- Institute (ANSI) of its Common Terminology Services (CTS): ANSI/HL7 CTS, V1-2005 Health Level Seven Standard: Common Terminology Services, Version 1, 2005
- International Organization for Standardization: ISO International Standard (IS) 27951 Common Terminology Services Version 1, 2009
- Noy NF, Shah NH, Whetzel PL, et al.: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res 37:W170-173, 2009 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=19483092
- Savova G, Masanz J, Ogren P, et al.: Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, evaluation and applications. JAMIA, under review
- Chapman W, Dowling J, Hripcsak G: Evaluation of Training with an Annotation Schema for Manual Annotation of Clinical Conditions from Emergency Department Reports. Int J Med Inf 77:107-113, 2008
- Chapman W, Dowling J, Wagner M: Generating a reliable reference standard set for syndromic case classification. J Am Med Inform Assoc 12:618-629, 2005
- Coden A, Savova G, Sominsky I, et al.: Automatically extracting cancer disease characteristics from pathology reports into a cancer disease knowledge model. Journal of Biomedical Informatics 42:937-949, 2009, doi:10.1016/j.jbi.2008.12.005
- Ogren P, Savova G, Chute C: Constructing evaluation corpora for automated clinical named entity recognition, in LREC, Marrakech, Morocco, 2008, pp 3143-3150, http://www.lrec-conf.org/proceedings/lrec2008/
- Savova G, Bethard S, Styler W, et al.: Towards temporal relation discovery from the clinical narrative, in AMIA, San Francisco, CA, 2009
- Uzuner Ö: Recognizing Obesity and Co-morbidities in Sparse Data. Journal of the American Medical Informatics Association. 16:561-570, 2009
- Uzuner Ö, Goldstein I, Luo Y, et al.: Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association 15:14-24, 2008
- Uzuner Ö, Luo T, Szolovits P: Evaluating the State-of-the-Art in Automatic De-identification. Journal of the American Medical Informatics Association 14:550-563, 2007
- Chen J, Schein A, Ungar L, et al.: An Empirical Study of the Behavior of Active Learning for Word Sense Disambiguation, in Human Language Technology conference - North American chapter of the Association for Computational Linguistics annual meeting (HLT-NAACL), New York, NY, 2006
- CLEAR-TK, http://code.google.com/p/cleartk/, Last Access Date: January 20, 2010
- Palmer M, Gildea D, Kingsbury P: The Proposition Bank: A Corpus Annotated with Semantic Roles. Computational Linguistics 31, 2005
- Pradhan S, Hacioglu K, Krugler V, et al.: Support vector learning for semantic argument classification. Machine Learning 60:11-39, 2005
- Hacioglu K, Pradhan S, Ward W, et al.: Semantic Role Labeling by Tagging Syntactic Chunks, in Proceedings of the Eighth Conference on Natural Language Learning (CONLL-2004), 2004
- Bethard S, Lu Z, Martin J, et al.: Semantic Role Labeling for Protein Transport Predicates. BMC Bioinformatics 9:277, 2008
- Bethard S, Martin J, Klingenstein S: Finding Temporal Structure in Text: Machine Learning of Syntactic Temporal Relations. International Journal of Semantic Computing (IJSC) 1, 2007
- Chapman W, Bridewell W, Hanbury P, et al.: A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics 34:301-310, 2001
- Harkema H, Thornblade T, Dowling J, et al.: ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 42:839-851, 2009
- Chapman W, Chu D, Dowling J: ConText: An algorithm for identifying contextual features from clinical text, in BioNLP Workshop of the Association for Computational Linguistics, Prague, Czech Republic, 2007, pp 81-88
- Christensen L, Harkema H, Irwin J, et al.: ONYX: A System for the Semantic Analysis of Clinical Text, in Proceedings of the BioNLP2009 Workshop of the ACL Conference, Denver, CO, 2009
- Aronsky D, Fiszman M, Chapman WW, et al.: Combining decision support methodologies to diagnose pneumonia. Proc AMIA Symp:12-16, 2001 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11825148
- He T: Coreference Resolution on Entities and Events for Hospital Discharge Summaries, in EECS, Cambridge, MA, MIT. M.Eng, 2007
- Sibanda T: Was the Patient Cured? Understanding Semantic Categories and Their Relationships in Patient Records, in EECS, Cambridge, MA, MIT, 2006
- Uzuner Ö, Mailoa J, Sibanda T: Semantic Relations for Problem-Oriented Medical Records, in Fall Symposium of the American Medical Informatics Association (AMIA 2009), San Francisco, CA, 2009, p 661
- Uzuner Ö, Zhang X, Sibanda T: Two Approaches to Assertion Classification, in Fall Symposium of the American Medical Informatics Association (AMIA 2008), Washington, DC, 2008, p 752
- Uzuner Ö, Zhang X, Sibanda T: Machine Learning and Rule-based Approaches to Assertion Classification. Journal of the American Medical Informatics Association 16:109-115, 2009, DOI 10.1197/jamia.M2950
- Unstructured Information Management Architecture (UIMA), http://incubator.apache.org/uima/, Last Access Date: January 20, 2010
- HL7, www.hl7.org/v3ballot/html/welcome/environment/index.htm, Last Access Date: January 20, 2010
- Poesio M, Vieira R: A corpus-based investigation of definite description use. Computational Linguistics 24:183-216, 1998
- Hripcsak G, Rothschild A: Agreement, the F-Measure, and Reliability in Information Retrieval. J American Medical Informatics Association 12:296-298, 2005
- Marcus M, Santorini B, Marcinkiewicz M: Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19:313-330, 1993
- Kipper K, Korhonen A, Ryant N, et al.: Extensive Classifications of English verbs, in Proceedings of the 12th EURALEX International Congress, Turin, Italy, 2006
- Uzuner Ö, Sibanda T, Luo Y, et al.: A De-identifier for Medical Discharge Summaries. International Journal Artificial Intelligence in Medicine 42:13-35, 2008
- Sowa J: Conceptual graphs for a database inference. IBM Journal of Research and Development 20:336-357, 1976
- Sowa J: Conceptual structures: information processing in mind and machine. Reading, MA, 1984
- The eMERGE Network, https://www.mc.vanderbilt.edu/victr/dcc/projects/acc/index.php/Main_Page, Last Access Date: January 20, 2010
- CDISC, http://www.cdisc.org, Last Access Date: January 20, 2010
- Biomedical Research Integrated Domain Group (BRIDG), http://www.bridgmodel.org, Last Access Date: January 20, 2010
- Liu B, Hsu W, Ma Y: Integrating classification and association rule mining, in American Association for Artificial Intelligence (AAAI), New York, 1998
- Thabtah F: A review of associative classification mining. Knowledge Engineering Review 22:37-65, 2007
- Wei W, Chute C: Identification of Type 2 Diabetes Mellitus Patients by SNOMED CT Concept Frequency. AMIA Annual Symposium, 2009
- CDISC Share, http://www.cdisc.org/cdisc-share, Last Access Date: January 20, 2010
- LexWiki, https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/LexWiki, Last Access Date: January 21, 2010
- International Health Terminology Standards Development Organization (IHTSDO), http://www.ihtsdo.org/fileadmin/user_upload/Docs_01/About_IHTSDO/Publications/CompositionalGrammar_20081223.pdf, Last Access Date: January 20, 2010
- Rector AL, Brandt S: Why do it the hard way? The case for an expressive description logic for SNOMED. J Am Med Inform Assoc 15:744-751, 2008, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=18755993
- caBIG® Vocabulary Knowledge Center, https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/Main_Page. Last Access Date: January 21, 2010
- Agrawal R, Imielinski T, Swami A: Mining Associations between Sets of Items in Large Databases. ACM SIGMOD Int'l Conf on Management of Data, Washington, DC, 1993
- Witten I, Frank E: Data Mining: Practical Machine Learning Tools and Techniques with JAVA Implementations, Morgan Kaufmann Publishers, 2000
- Apache Open Source, http://incubator.apache.org/, Last Access Date: January 20, 2010
- Apache UIMA, http://incubator.apache.org/uima/, Last Access Date: January 20, 2010
- OASIS, http://www.oasis-open.org/news/oasis-news-2009-03-19.php, Last Access Date: January 20, 2010
- Text Analytics Tools and Runtime for IBM LanguageWare, http://www.alphaworks.ibm.com/tech/lrw, Last Access Date: January 20, 2010
- Open Health Natural Language Processing (OHNLP) Consortium, https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/OHNLP, Last Access Date: January 20, 2010
- U-Compare, http://u-compare.org/, Last Access Date: January 20, 2010
- Getting Started: UIMA Asynchronous Scaleout, http://incubator.apache.org/uima/doc-uimaas-what.html, Last Access Date: January 20, 2010
- “Question Answering” is technology's next grand challenge, http://www.research.ibm.com/deepqa/index.shtml, Last Access Date: January 20, 2010
- Rubin D: Inference and missing data. Biometrika 63:581-592, 1976
- rJAVA, http://rosuda.org/rJava/, Last Access Date: January 21, 2010
- Melton LJ, 3rd: History of the Rochester Epidemiology Project. Mayo Clin Proc 71:266-274, 1996, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=8594285
- Kurland LT, Molgaard CA: The patient record in epidemiology. Sci Am 245:54-63, 1981, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=7027437
- Project Management Institute: A Guide to the Project Management Body of Knowledge (PMBOK® Guide) (4th ed.). Newtown Square, PA: Project Management Institute, Inc., 2008
- Blue Gene, http://www.research.ibm.com/bluegene. Last Access Date: January 21, 2010