Computational Biology

Awesome Computational Biology¶

A curated collection of databases, software, and papers related to computational biology.

Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, ecological, behavioural, and social systems. — Wikipedia

Interface¶

Browse and search the resources via the GitHub Pages UI: https://inoue0426.github.io/awesome-computational-biology/

Databases¶

scRNA¶

CZ CELLxGENE — Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.
Gene Expression Omnibus — Public functional genomics database.
Human Cell Atlas — Open global atlas of all cells in the human body.
Single Cell PORTAL — Broad Institute portal for sharing and exploring single-cell RNA-seq studies.
Single Cell Expression Atlas — EMBL-EBI resource for single-cell RNA-seq gene expression across species.

Compound¶

PubChem — One of the largest chemical databases (compounds, genes, and proteins).
ChEBI — Database focused on small chemical compounds.
ChEMBL — Bioactive molecules with drug-like properties.
ChemSpider — Chemical structure database.
HMDB (Human Metabolome Database) — Comprehensive database of small molecule metabolites found in the human body.
KEGG COMPOUND — Collection of small molecules and biopolymers.
LIPID MAPS — Database of lipids.
Rhea — Database of chemical reactions.
DrugCentral — Online drug compendium with drug mode of action and indication information.
Drug Repurposing Hub — Collections of drug repurposing data (drug, MoA, target, etc).
Therapeutic Target Database — Drug-target, target-disease, and drug-disease datasets.
ZINC ligand discovery database — Free database of commercially-available compounds for virtual screening.

Pathway¶

PathwayCommons — Database of pathways and interactions.
KEGG PATHWAY — Collection of pathway maps.
WikiPathways — Database of biological pathways.
Reactome — Expert-curated, peer-reviewed pathway database with detailed reaction mechanisms.
BioCyc — Collection of pathway/genome databases across thousands of organisms.
SIGNOR — Database of causal signaling interactions and pathways.
MSigDB (Molecular Signatures Database) — Curated gene sets derived from pathways and biological processes.

Mass Spectra¶

MassBank — Open source databases and tools for mass spectrometry reference spectra.
MoNA MassBank of North America — Meta-database of metabolite mass spectra, metadata, and associated compounds.

Protein¶

THE HUMAN PROTEIN ATLAS — Comprehensive human protein database (cells, tissues, organs).
PROTEIN DATA BANK (PDB) — 3D structures of proteins, nucleic acids, complexes.
UniProt — Functional information on proteins.
AlphaFold Protein Structure Database — 3D protein structure predictions.
RCSB Protein Data Bank — Repository for structural data of biological molecules.
Critical Assessment of Structure Prediction (CASP) — Assessing methods for protein structure prediction.
Uniclust — Clustered protein sequence databases.
CATH database — Hierarchical classification of protein domain structures.
SAbDab — Structural Antibody Database containing all antibody structures in the PDB.
OAS (Observed Antibody Space) — Database of antibody sequences from immune repertoire sequencing.

Genome¶

ENCODE — Encyclopedia of DNA Elements; regulatory and functional genomic elements across the genome.
Ensembl — Genome browser and annotation database for vertebrate and other eukaryotic genomes.
Human Genome Resources at NCBI — Database for genomics, proteomics, transcriptomics, and systems biology.
GenBank — NCBI's database of genetic sequences.
UCSC Genome Browser — Interactive genome browser with annotation tracks for many genome assemblies, hosted by UC Santa Cruz.
cBioPortal — Cancer genomics database; aggregating many patient datasets.
10x Genomics Dataset — Collection of single-cell datasets.
The Genotype-Tissue Expression (GTEx) — Human gene expression and regulation resource.
Dependency Map (DepMap) — CRISPR-Cas9 screens in cancer cell lines.
Catalogue Of Somatic Mutations In Cancer (COSMIC) — Resource on somatic mutations in cancers.
MGnify — Resource for metagenomic and metatranscriptomic data.
JASPAR — Database of transcription factor binding profiles.
gnomAD — Genome Aggregation Database; genetic variation from large-scale sequencing projects.
Rfam — Database of RNA families with sequence alignments and consensus structures.

Disease¶

KEGG DRUG — Comprehensive, approved drug information.
DrugBank — Database of drugs and targets (University of Alberta).
DisGeNET — Database of gene-disease associations integrating expert-curated and GWAS data.
OMIM (Online Mendelian Inheritance in Man) — Comprehensive database of human genes and genetic disorders.

Interaction¶

Drug-Gene Interaction¶

DGIdb — Drug-gene interactions and the druggable genome.
Comparative Toxicogenomics Database — Chemical-gene interactions, chemical-disease and gene-disease associations, chemical-phenotype associations.
SNAP — Dataset of drug-gene interactions.

Drug (Cell Line) Response¶

NCI60 — Drug screening data for 60 human cancer cell lines across many compounds.
Genomics of Drug Sensitivity in Cancer (GDSC) — Drug sensitivity for ~1000 human cancer cell lines and hundreds of compounds.
Cancer Cell Line Encyclopedia — Database of ~1000 cancer cell lines.
CellMiner Cross Database (CellMinerCDB) — Integrates multiple cancer cell line databases.

Chemical-Protein Interaction¶

STITCH — Chemical-protein interactions.
BindingDB — Database of measured binding affinities between small molecules and protein targets.
PDBBind — Binding affinity data for biomolecular complexes.

Protein-Protein Interaction¶

STRING — PPI networks for multiple organisms.
BioGRID — Protein, genetic, and chemical interactions.
HIPPIE — Human protein-protein interaction database.
IntAct — Open-source molecular interaction database and analysis system from EMBL-EBI.

Knowledge Graph¶

Drug Mechanism Database (DrugMechDB) — Mechanisms of action from drug to disease.
DRKG — Large-scale biological knowledge graph for drug discovery.
Hetionet — Heterogeneous network integrating genes, diseases, drugs, pathways, and more.
PrimeKG — Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.

Clinical Trial¶

ClinicalTrials.gov — Privately and publicly funded clinical studies.
ICD10 — International Classification of Diseases, 10th revision.
EU Drug Regulating Authorities Clinical Trials DB (EudraCT) — European clinical trial database.
MIMIC-IV — Freely accessible critical care database.

Benchmarks & Datasets¶

BindingDB Curated Sets — Curated binding affinity datasets for protein–ligand interaction benchmarking.
Cancer Therapeutics Response Portal (CTRP) — Drug sensitivity profiles across ~900 cancer cell lines for >400 compounds.
CrossDocked2020 — Large-scale dataset for structure-based virtual screening.
Genomics of Drug Sensitivity in Cancer (GDSC) — Drug sensitivity for ~1000 human cancer cell lines and hundreds of compounds.
GuacaMol — Benchmark suite for generative molecular design models.
MoleculeNet — Benchmark datasets for molecular machine learning.
MOSES — Benchmarking platform for molecular generation models.
NCI60 — Drug sensitivity benchmark across 60 diverse human cancer cell lines.
OpenBioLink — Benchmark datasets for biological knowledge graph completion.
Therapeutics Data Commons (TDC) — Unified benchmark suite covering ADMET, drug-target interaction, drug response, and more.

API¶

PubMed E-utilities (esearch/efetch) — APIs for searching and retrieving biomedical literature from PubMed.
NCBI E-utilities — Unified APIs for accessing NCBI databases (Gene, GEO, SRA, PubChem, etc).
UniProt REST API — Programmatic access to protein sequence and functional annotation data.
Ensembl REST API — API for genomic annotations, variants, genes, and comparative genomics.
KEGG REST API — API for accessing KEGG pathways, compounds, genes, and reactions.
ChEMBL Web Services — REST API for bioactive molecules, targets, and bioassays.
Open Targets Platform API — API for target–disease associations integrating genetics, genomics, and drug data.
ClinicalTrials.gov API — API for querying clinical trial metadata and results.
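As a rough illustration of how several of the APIs listed above are addressed, the sketch below builds request URLs for NCBI E-utilities, Ensembl, UniProt, and KEGG using only the Python standard library. The base URLs and query parameters reflect the public documentation of each service as we understand it. The function names (`esearch_url`, `ensembl_lookup_url`, `uniprot_search_url`, `kegg_get_url`, `fetch_json`) and the default values are hypothetical helpers introduced here for illustration, not part of any of these projects.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base URLs of the public REST services (check each service's docs before relying on them).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
ENSEMBL_BASE = "https://rest.ensembl.org"
UNIPROT_BASE = "https://rest.uniprot.org"
KEGG_BASE = "https://rest.kegg.jp"


def esearch_url(term, db="pubmed", retmax=5):
    """URL for an E-utilities esearch query returning JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def ensembl_lookup_url(symbol, species="homo_sapiens"):
    """URL for looking up a gene by symbol in the Ensembl REST API."""
    return (
        f"{ENSEMBL_BASE}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def uniprot_search_url(query, size=5):
    """URL for a UniProtKB search returning JSON."""
    params = {"query": query, "format": "json", "size": size}
    return f"{UNIPROT_BASE}/uniprotkb/search?{urlencode(params)}"


def kegg_get_url(entry_id):
    """URL for retrieving a KEGG entry (KEGG answers in plain text, not JSON)."""
    return f"{KEGG_BASE}/get/{quote(entry_id, safe=':')}"


def fetch_json(url, timeout=30):
    """Fetch a JSON endpoint; defined for convenience, not called in this sketch."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_json(esearch_url("drug response prediction"))` would issue a PubMed search, subject to network access and the rate limits each provider documents. Separating URL construction from fetching keeps the request logic easy to inspect and test offline.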

Preprocessing Tools¶

Chemistry Development Kit — Open-source Java library for cheminformatics and bioinformatics.
Biopython — Collection of Python tools for biological computation including sequence analysis, structure parsing, and database access.
FlashDeconv — High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).
RDKit — Open-source cheminformatics and machine learning toolkit (C++ with Python bindings).
DeepChem — Deep learning library for drug discovery, quantum chemistry, and materials science.
ChatSpatial — MCP server for spatial transcriptomics analysis via natural language.
Scanpy — Python library for scRNA-seq analysis.
Seurat — R library for scRNA-seq analysis.
scvi-tools — Probabilistic models for single-cell omics data analysis.
CellTypist — Automated cell type annotation for scRNA-seq.
Squidpy — Python library for spatial single-cell analysis.
GROMACS — Molecular dynamics simulation package for biochemical molecules.
MDAnalysis — Python library for analyzing and altering molecular dynamics simulation trajectories.
OpenMM — High-performance toolkit for molecular simulation and GPU-accelerated MD.

Machine Learning Tasks and Models¶

Drug Discovery¶

Drug Response Prediction¶

drGAT — Attention-based model for drug response prediction with gene explainability.
MOFGCN — GCN + heterogeneous network.
DeepDSC — Autoencoder + fully connected NN.
DGDRP — Multi-view embedding neural network.
DeepAEG — GNN embedding + attention mechanism.

Drug Repurposing¶

DeepPurpose — Deep learning library for drug repurposing.

Drug Target Interaction¶

NeoDTI — Library for drug-target interaction prediction.
DTINet — Network-based framework integrating heterogeneous biological data for DTI prediction.
DeepDTA — Deep learning model using CNNs on protein sequences and drug SMILES.
GraphDTA — Graph neural network-based DTI prediction using molecular graphs.
MolTrans — Transformer-based DTI model leveraging molecular substructures.
DrugBAN — Bilinear attention network for interpretable DTI prediction.

Compound-Protein Interaction¶

MCPINN — Drug discovery via compound-protein interaction and machine learning.
TransformerCPI — CPI prediction using Transformer.

Molecular Generation¶

REINVENT — Reinforcement learning for de novo drug design.
MolGPT — Transformer-based model for molecular generation.
Molecular Transformer — Sequence-to-sequence model for chemical reaction outcome prediction from reactant SMILES.
TargetDiff — 3D equivariant diffusion model for structure-based drug design.

LLM for Biology¶

AI4Chem/ChemLLM-7B-Chat — LLM for chemical & molecular science.
BioGPT — LLM for biomedical text generation.
GeneGPT — LLM for biomedical information, integrated with various APIs.
GenePT — Represents genes and cells using LLM embeddings of gene text descriptions, for single-cell analysis.
scPRINT — Pretrained on 50M cells for scRNA-seq denoising & zero imputation.
ClawBio — Bioinformatics-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.

Foundation Models¶

Single-cell Foundation Models¶

Transcriptomics Foundation Models¶

scFoundation — Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.
scGPT — Transformer-based foundation model pretrained on millions of single-cell profiles.
Geneformer — Context-aware, attention-based deep learning model pretrained on a large corpus of single-cell transcriptomes.
BulkFormer — Foundation model for bulk RNA-seq data; learns general transcriptomic representations.
scBERT — BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.
CellPLM — Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.

Spatial Foundation Models¶

GigaPath — Slide-level digital pathology foundation model pretrained on 1.3 billion pathology image tiles from whole-slide images.
UNI — General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.
CONCH — Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.
Phikon — ViT-based pathology foundation model pretrained with iBOT self-supervision on TCGA whole-slide images.

Multi-Omics Foundation Models¶

scMulan — Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.
totalVI — Probabilistic framework for joint analysis of paired scRNA-seq and protein (CITE-seq) data enabling multi-modal cell state representation across single-cell datasets.
MultiVI — Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.
MIRA — Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.
GLUE — Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.
BABEL — Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.
Multigrate — Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.
MOFA+ — Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell datasets including RNA, ATAC, proteomics, methylation, and copy number.
GeneCompass — Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.
UnitedNet — Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.
SpatialGlue — Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.
MIDAS — Mosaic integration and differential accessibility model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.

Domain Alignment¶

scArches — Transfer learning framework for mapping new single-cell datasets onto pre-trained reference atlases across batches, conditions, and modalities.
TOSICA — Transformer-based framework for one-stop interpretable cell-type annotation supporting cross-dataset and cross-species transfer.

Protein Foundation Models¶

Pre-trained Embedding¶

Evolutionary Scale Modeling (ESM) — Protein embeddings.
ChemBERTa-2 — Chemical embeddings & prediction.

Protein Structure Prediction and Design¶

AlphaFold3 — Predicts structures of proteins, nucleic acids, small molecules, and their complexes.
Boltz-1 — Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes achieving AlphaFold3-level accuracy.
Chai-1 — Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.
ESM3 — Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.
ESMFold — Fast protein structure prediction using language model embeddings.
RFdiffusion — Generative model for protein backbone design using diffusion.
ProteinMPNN — Deep learning model for protein sequence design given backbone structure.
OmegaFold — High-resolution de novo protein structure prediction from sequence.
RoseTTAFold — Three-track neural network for protein structure prediction.

Multi-Modal Foundation Models¶

CHIEF — Clinical Histopathology Imaging Evaluation Foundation model integrating histology images and clinical context for pan-cancer analysis.
BiomedCLIP — CLIP-based vision-language foundation model for biomedical images and text trained on PubMed figure–caption pairs.

Genomics Foundation Models¶

Nucleotide Transformer — Foundation model for genomic sequences across multiple species.
DNABERT — Pre-trained bidirectional encoder for DNA sequence analysis.
DNABERT-2 — Improved genome foundation model with efficient tokenization.
Enformer — Transformer model predicting gene expression from DNA sequence.
Basenji — Sequential regulatory activity prediction from DNA sequences.
Caduceus — Bidirectional, reverse-complement equivariant long-range DNA sequence model based on Mamba.
Evo — Long-context genomic foundation model (up to 1M tokens).
HyenaDNA — Long-range genomic foundation model handling sequences up to 1M tokens using sub-quadratic Hyena operators in place of attention.
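The Caduceus entry above describes the model as equivariant; as we understand it, this refers to the reverse-complement symmetry of double-stranded DNA. The toy sketch below shows what that property means for a simple per-position function. It is an illustration only and not how any listed model is implemented; `reverse_complement` and `gc_track` are hypothetical helper names.

```python
# Translation table mapping each base to its complement (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_track(seq):
    """Per-position indicator: 1 where the base is G or C, else 0."""
    return [1 if base in "GCgc" else 0 for base in seq]


seq = "ATGCCGTA"
rc = reverse_complement(seq)

# Equivariance check: applying the function to the reverse complement gives
# the reversed output of applying it to the original sequence.
is_equivariant = gc_track(rc) == gc_track(seq)[::-1]
```

Here `gc_track` satisfies the property because complementing a base never changes whether it is G or C. A learned model has to build the same symmetry into its architecture or training.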
