
Biomedical Information Extraction

How to extract information from unstructured biomedical data and text.

What is BioIE? It includes any effort to extract structured information from unstructured (or at least inconsistently structured) biological, clinical, or other biomedical data. The data source is often some collection of text documents written in technical language. If the resulting information is verifiable and consistent across sources, we may then consider it knowledge. Extracting information and producing knowledge from bio data requires adapting methods developed for other types of unstructured data.

BioIE has undergone massive changes since the introduction of language models like BERT and the more recently created Large Language Models (LLMs; e.g., GPT-3/4, LLAMA 2/3, Gemini, etc.).

Resources included here are preferentially those available at no monetary cost and with limited license requirements. Methods and datasets should be publicly accessible and actively maintained.

See also awesome-nlp, awesome-biology and Awesome-Bioinformatics.

Please read the contribution guidelines before contributing. Please add your favourite resource by raising a pull request.

Research Overviews

LLMs in Biomedical IE

Pre-LLM Overviews

Groups Active in the Field

Organizations

  • AMIA - Many—but certainly not all—individuals studying biomedical informatics are members of the American Medical Informatics Association. AMIA publishes a journal, JAMIA (see below).
  • IMIA - The International Medical Informatics Association. Publishes the IMIA Yearbook of Medical Informatics.

Journals and Events

The interdisciplinary nature of BioIE means researchers in this space may share their findings and tools in a variety of ways. They may publish papers in journals, as is common in the biomedical and life sciences. They may publish conference papers and, upon acceptance, give a poster and/or oral presentation at an event; this is common practice in computer science and engineering fields. Conference papers are often published in collections of proceedings. Preprint publication is an increasingly popular and institutionally-accepted way to publish findings as well. Surrounding these formal, written products are the ideas of open science, open data, and open source: the code, data, and software BioIE researchers develop are valuable resources to the community.

Journals

For preprints, try arXiv, especially the subjects Computation and Language (cs.CL) and Information Retrieval (cs.IR); bioRxiv; or medRxiv, especially the Health Informatics subject area.

  • Database - Its subtitle is "The Journal of Biological Databases and Curation". Open access.
  • NAR - Nucleic Acids Research. Has a broad biomolecular focus but is particularly notable for its annual database issue.
  • JAMIA - The Journal of the American Medical Informatics Association. Concerns "articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy".
  • JBI - The Journal of Biomedical Informatics. Not open access by default, though it does have an open-access "X" version.
  • Scientific Data - An open-access Springer Nature journal publishing "descriptions of scientifically valuable datasets, and research that advances the sharing and reuse of scientific data".

Conferences and Other Events

  • ACM-BCB - The ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. Held annually since 2010.
  • BIBM - The IEEE International Conference on Bioinformatics and Biomedicine.
  • PSB - The Pacific Symposium on Biocomputing.

Challenges

Some events in BioIE are organized around formal tasks and challenges in which groups develop their own computational solutions, given a dataset.

  • BioASQ - Challenges on biomedical semantic indexing and question answering. Challenges and workshops held annually since 2013.
  • BioCreAtIvE workshop - These workshops have been organized since 2004, with BioCreative VI held in February 2017 and the BioCreative/OHNLP Challenge held in 2018. See Datasets below.
  • SemEval workshop - Tasks and evaluations in computational semantic analysis. Tasks vary by year but frequently cover scientific and/or biomedical language, e.g. the SemEval-2019 Task 12 on Toponym Resolution in Scientific Papers.
  • eHealth-KD - Challenges for encouraging "development of software technologies to automatically extract a large variety of knowledge from eHealth documents written in the Spanish Language". Previously held as part of TASS, an annual workshop for semantic analysis in Spanish.

Tutorials

The field changes rapidly enough that tutorials any older than a few years are missing crucial details. A few more recent educational resources are listed below. A good foundational understanding of text mining techniques is very helpful, as is some basic experience with the Python and/or R languages. The best option may be to learn by doing.

LLM Guides

TBD - watch this space!

Pre-LLM Guides, Lectures, and Courses

Code Libraries

  • Biopython - paper - code - Python tools primarily intended for bioinformatics and computational molecular biology purposes, but also a convenient way to obtain data, including documents/abstracts from PubMed (see Chapter 9 of the documentation).
  • Bio-SCoRes - paper - A framework for biomedical coreference resolution.
  • medaCy - A system for building predictive medical natural language processing models. Built on the spaCy framework.
  • ScispaCy - paper - A version of the spaCy framework for scientific and biomedical documents.
  • rentrez - R utilities for accessing NCBI resources, including PubMed.
  • Med7 - paper - code - a Python package and model (for use with spaCy) for doing NER with medication-related concepts.
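
As a rough illustration of obtaining PubMed abstracts programmatically, the sketch below builds a request URL for the NCBI E-utilities EFetch service using only the Python standard library. It is a stand-in for the library wrappers listed above (Biopython's `Bio.Entrez`, rentrez), not their APIs. The function names and the PMIDs are hypothetical, and `fetch_abstracts` needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, email=None):
    """Build an EFetch URL requesting plain-text abstracts for the given PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    if email:
        # NCBI asks clients to identify themselves; the address is the caller's own.
        params["email"] = email
    return EFETCH_URL + "?" + urlencode(params)


def fetch_abstracts(pmids, email=None, timeout=30):
    """Download the abstracts as a single text blob (requires network access)."""
    with urlopen(build_efetch_url(pmids, email), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder PMIDs, used here only to show the shape of the URL.
print(build_efetch_url([11111111, 22222222]))
```

For anything beyond a handful of records, the wrappers above are likely the better choice, since they also cover searching and structured (XML) results.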

Repos for Specific Datasets

  • mimic-code - Code associated with the MIMIC-III dataset (see below). Includes some helpful tutorials.

Tools, Platforms, and Services

  • cTAKES - paper - code - A system for processing the text in electronic medical records. Widely used and open source.
  • CLAMP - paper - A natural language processing toolkit intended for use with the text in clinical reports. Check out their live demo first to see what it does. Usable at no cost for academic research.
  • DeepPhe - A system for processing documents describing cancer presentations. Based on cTAKES (see above).
  • DNorm - paper - A method for disease normalization, i.e., linking mentions of disease names and acronyms to unique concept identifiers. Downloadable version includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data below).
  • PubTator Central - paper - A web platform that identifies five different types of biomedical concepts in PubMed articles and PubMed Central full texts. The full annotation sets are downloadable (see Annotated Text Data below).
  • Pubrunner - A framework for running text mining tools on the newest set(s) of documents from PubMed.
  • SemEHR - paper - an IE infrastructure for electronic health records (EHR). Built on the CogStack project.
  • TaggerOne - paper - Performs concept normalization (see also DNorm above). Can be trained for specific concept types and can perform NER independent of other normalization functions.
  • TabInOut - paper - a framework for IE from tables in the literature.
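
To make the idea of concept normalization concrete (linking mentions to unique concept identifiers, as DNorm and TaggerOne do), here is a deliberately simple dictionary-lookup sketch. It is not how DNorm or TaggerOne work internally; the lexicon, the `EX:` identifiers, and the function names are all made up for illustration.

```python
import re

# Hypothetical lexicon: surface forms mapped to made-up concept identifiers.
LEXICON = {
    "breast cancer": "EX:0001",
    "breast carcinoma": "EX:0001",
    "t2dm": "EX:0002",
    "type 2 diabetes": "EX:0002",
    "type ii diabetes": "EX:0002",
}


def normalize_key(mention):
    """Lowercase the mention, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", mention.lower())
    return " ".join(cleaned.split())


def link_mention(mention, lexicon=LEXICON):
    """Return the concept identifier for a mention, or None if it is not in the lexicon."""
    return lexicon.get(normalize_key(mention))


for mention in ["Breast-Cancer", "T2DM", "Type II  diabetes", "influenza"]:
    print(mention, "->", link_mention(mention))
```

Exact lookup like this misses spelling variants and unseen synonyms, which is the gap that trained normalization systems aim to close.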

Annotation Tools

  • Anafora - paper - An annotation tool with adjudication and progress tracking features.
  • brat - paper - code - The brat rapid annotation tool. Supports producing text annotations visually, through the browser. Not subject specific; appropriate for many annotation projects. Visualization is based on that of the stav tool.
  • MedTator - paper - code - An annotation tool designed to have minimal dependencies.
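
Annotations produced with brat are stored in a standoff `.ann` file alongside the source text. The sketch below reads only the text-bound ("T") lines of that format. The sample annotations, the labels, and the `TextBound` and `parse_text_bound` names are invented for this example.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str


def parse_text_bound(ann_content):
    """Return the text-bound ("T") annotations found in brat standoff content."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes, etc.
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ";".
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        entities.append(TextBound(ann_id, label, spans, text))
    return entities


# Made-up annotations for illustration only.
SAMPLE_ANN = (
    "T1\tDisease 0 13\tbreast cancer\n"
    "T2\tChemical 30 39\ttamoxifen\n"
    "R1\tTreats Arg1:T2 Arg2:T1\n"
)

for entity in parse_text_bound(SAMPLE_ANN):
    print(entity.ann_id, entity.label, entity.spans, entity.text)
```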

Techniques and Models

Large Language Models

TBD - watch this space!

BERT models

GPT-2 models

  • BioGPT - paper - A GPT-2 model pre-trained on 15 million PubMed abstracts, along with fine-tuned versions for several biomedical tasks.

Other models

  • Flair embeddings from PubMed - A language model available through the Flair framework and embedding method. Trained over a 5% sample of PubMed abstracts until 2015, or > 1.2 million abstracts in total.

Text Embeddings

  • This paper from Hongfang Liu's group at Mayo Clinic demonstrates how text embeddings trained on biomedical or clinical text can, but don't always, perform better on biomedical natural language processing tasks. That being said, pre-trained embeddings may be appropriate for your needs, especially as training domain-specific embeddings can be computationally intensive.
  • BioASQword2vec - paper - Word embeddings derived from biomedical text (>10 million PubMed abstracts) using the popular word2vec tool.
  • BioWordVec - paper - code - Word embeddings derived from biomedical text (>27 million PubMed titles and abstracts), including subword embedding model based on MeSH.

Datasets

Some of the datasets listed below require a UMLS Terminology Services (UTS) account to access. Please note that the license granted with the UTS account requires users to submit an annual report about their use of UMLS resources. This is less challenging than it sounds.

Biomedical Text Sources

The following resources contain indexed text documents in the biomedical sciences.

  • OHSUMED - paper - 348,566 MEDLINE entries (title and sometimes abstract) from between 1987 and 1991. Includes MeSH labels. Primarily of historical significance.
  • PubMed Central Open Access Subset - A set of PubMed Central articles usable under licenses other than traditional copyright, though the exact licenses vary by publication and source. Articles are available as PDF and XML.
  • CORD-19 - A corpus of scholarly manuscripts concerning COVID-19. Articles are primarily from PubMed Central and preprint servers, though the set also includes metadata on papers without full-text availability.

Annotated Text Data

  • SPL-ADR-200db - paper - A pilot dataset containing standardised information, and annotations of occurrence in text, about ~5,000 known adverse reactions for 200 FDA-approved drugs.
  • BioCreAtIvE 1 - paper - 15,000 sentences (10,000 training and 5,000 test) annotated for protein and gene names. 1,000 full text biomedical research articles annotated with protein names and Gene Ontology terms.
  • BioCreAtIvE 2 - paper - 15,000 sentences (10,000 training and 5,000 test, different from the first corpus) annotated for protein and gene names. 542 abstracts linked to EntrezGene identifiers. A variety of research articles annotated for features of protein–protein interactions.
  • BioCreAtIvE V CDR Task Corpus (BC5CDR) - paper - 1,500 articles (title and abstract) published in 2014 or later, annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical–disease interactions. Requires registration.
  • BioCreative VI CHEMPROT Corpus - paper - >2,400 articles annotated with chemical-protein interactions of a variety of relation types. Requires registration.
  • CRAFT - paper - 67 full-text biomedical articles annotated in a variety of ways, including for concepts and coreferences. Now on version 5, including annotations linking concepts to the MONDO disease ontology.
  • n2c2 (formerly i2b2) Data - The Department of Biomedical Informatics (DBMI) at Harvard Medical School manages data for the National NLP Clinical Challenges and the Informatics for Integrating Biology and the Bedside challenges running since 2006. They require registration before access and use. Datasets include a variety of topics. See the list of data challenges for individual descriptions.
  • NCBI Disease Corpus - paper - A corpus of 793 biomedical abstracts annotated with names of diseases and related concepts from MeSH and OMIM.
  • PubTator Central datasets - paper - Accessible through a RESTful API or FTP download. Includes annotations for >29 million abstracts and ∼3 million full text documents.
  • Word Sense Disambiguation (WSD) - paper - 203 ambiguous words and 37,888 automatically extracted instances of their use in biomedical research publications. Requires UTS account.
  • Clinical Questions Collection - also known as CQC or the Iowa collection, these are several thousand questions posed by physicians during office visits along with the associated answers.
  • BioNLP ST 2013 datasets - data from six shared tasks, though some may not be easily accessible; try the CG task set (BioNLP2013CG) for extensive entity and event annotations.
  • BioScope - paper - a corpus of sentences from medical and biological documents, annotated for negation, speculation, and linguistic scope.
  • BioRED - paper - a set of >6.5K biomedical relation annotations, plus labels for novel findings.
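
Several of the corpora above, including BC5CDR and the PubTator Central downloads, are commonly distributed in the plain-text PubTator layout: a `PMID|t|title` line, a `PMID|a|abstract` line, then one tab-separated annotation per line. The sketch below parses that layout under the assumption that each annotation line has the fields PMID, start, end, mention, type, and an optional identifier. The sample document, its `EX:` identifiers, and the `parse_pubtator` name are made up.

```python
def parse_pubtator(content):
    """Parse PubTator-style text into {pmid: {"title", "abstract", "annotations"}}."""
    docs = {}
    for line in content.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        # Title and abstract lines look like "PMID|t|text" and "PMID|a|text".
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            pmid, kind, text = head
            doc = docs.setdefault(pmid, {"title": "", "abstract": "", "annotations": []})
            doc["title" if kind == "t" else "abstract"] = text
            continue
        fields = line.split("\t")
        # Entity lines have numeric offsets; other lines (e.g. relations) are skipped.
        if len(fields) >= 5 and fields[1].isdigit() and fields[2].isdigit():
            doc = docs.setdefault(fields[0], {"title": "", "abstract": "", "annotations": []})
            doc["annotations"].append({
                "start": int(fields[1]),
                "end": int(fields[2]),
                "mention": fields[3],
                "type": fields[4],
                "identifier": fields[5] if len(fields) > 5 else "",
            })
    return docs


# Made-up document for illustration only.
SAMPLE = (
    "0000001|t|Aspirin and headache\n"
    "0000001|a|Aspirin relieved the headache.\n"
    "0000001\t0\t7\tAspirin\tChemical\tEX:C1\n"
    "0000001\t12\t20\theadache\tDisease\tEX:D1\n"
)

for pmid, doc in parse_pubtator(SAMPLE).items():
    print(pmid, doc["title"], [a["mention"] for a in doc["annotations"]])
```

Offsets in this layout count characters across the title and abstract together, so it is worth checking them against the joined text before relying on them.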

Protein-protein Interaction Annotated Corpora

Protein-protein interactions are abbreviated as PPI. The following sets are available in BioC format. The older sets (AIMed, BioInfer, HPRD50, IEPA, and LLL) are available courtesy of the WBI corpora repository and were derived from the original sets by a group at Turku University.

  • AIMed - paper - 225 MEDLINE abstracts annotated for PPI.
  • BioC-BioGRID - paper - 120 full text articles annotated for PPI and genetic interactions. Used in the BioCreative V BioC task.
  • BioInfer - paper - 1,100 sentences from biomedical research abstracts annotated for relationships (including PPI), named entities, and syntactic dependencies. Additional information and download links are here.
  • HPRD50 - paper - 50 scientific abstracts referenced by the Human Protein Reference Database, annotated for PPI.
  • IEPA - paper - 486 sentences from biomedical research abstracts annotated for pairs of co-occurring chemicals, including proteins (hence, PPI annotations).
  • LLL - paper - 77 sentences from research articles about the bacterium Bacillus subtilis, annotated for protein–gene interactions (so, fairly close to PPI annotations). Additional information is here.

Other Datasets

  • Columbia Open Health Data - paper - A database of prevalence and co-occurrence frequencies of conditions, drugs, procedures, and patient demographics extracted from electronic health records. Does not include original record text.
  • Comparative Toxicogenomics Database - paper - A database of manually curated associations between chemicals, gene products, phenotypes, diseases, and environmental exposures. Useful for assembling ontologies of the related concepts, such as types of chemicals.
  • MIMIC-III - paper - Deidentified health data from ~60,000 intensive care unit admissions. Requires completion of an online training course (CITI training) and acceptance of a data use agreement prior to use.
  • MIMIC-IV - An update to MIMIC-III's multimodal patient data, now covering more recent years of admissions, plus a new data structure, emergency department records, and links to MIMIC-CXR images.
  • eICU Collaborative Research Database - paper - a database of observations from more than 200 thousand intensive care unit admissions, with consistent structure. Requires registration, training course completion, and data use agreement.

Ontologies and Controlled Vocabularies

  • Disease Ontology - paper - An ontology of human diseases. Has cross-links to MeSH, ICD, NCI Thesaurus, SNOMED, and OMIM. Public domain. Available on GitHub and on the OBO Foundry.
  • RxNorm - paper - Normalized names for clinical drugs and drug packs, with combined ingredients, strengths, and form, and assigned types from the Semantic Network (see below). Released monthly.
  • SPECIALIST Lexicon - paper - A general English lexicon that includes many biomedical terms. Updated yearly since 1994 and still updated as of 2019. Part of UMLS but does not require UTS account to download.
  • UMLS Metathesaurus - paper - Mappings between >3.8 million concepts, 14 million concept names, and >200 sources of biomedical vocabulary and identifiers. It's big. It may help to prepare a subset of the Metathesaurus with the MetamorphoSys installation tool, but we're still talking about ~30 GB of disk space required for the 2019 release. See the manual here. Requires UTS account.
  • UMLS Semantic Network - paper - Lists of 133 semantic types and 54 semantic relationships covering biomedical concepts and vocabulary. Is the Metathesaurus too complex for your needs? Try this. Does not require UTS account to download.

Data Models

Do you need a data model? If you are working with biomedical data, then the answer is probably "Yes".

  • Biolink - code - A data model of biological entities. Provided as a YAML file.
  • BioUML - paper - An architecture for biomedical data analysis, integration, and visualization. Conceptually based on the visual modeling language UML.
  • OMOP Common Data Model - a standard for observational healthcare data.

Credits

Credits for curators and sources.

License

CC0
