Computational Biology

Awesome Computational Biology

A curated collection of databases, software, and papers related to computational biology.

Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, ecological, behavioural, and social systems. — Wikipedia



Databases

scRNA

Compound

  • PubChem — One of the largest chemical databases (compounds, genes, and proteins).
  • ChEBI — Database focused on small chemical compounds.
  • ChEMBL — Bioactive molecules with drug-like properties.
  • ChemSpider — Chemical structure database.
  • KEGG COMPOUND — Collection of small molecules and biopolymers.
  • LIPID MAPS — Database of lipids.
  • Rhea — Database of chemical reactions.
  • Drug Repurposing Hub — Collections of drug repurposing data (drug, MoA, target, etc.).
  • Therapeutic Target Database — Drug-target, target-disease, and drug-disease datasets.
  • ZINC ligand discovery database — Free database of commercially available compounds for virtual screening.
  • MoleculeNet — Benchmark datasets for molecular machine learning.
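
Several of the compound databases above can be queried programmatically. As a minimal sketch, the snippet below builds a PubChem PUG REST URL that requests a compound's properties by name. The helper name `pubchem_property_url` and the default property list are assumptions made for illustration. The code only constructs the URL and sends no request; check PubChem's own documentation for current property names and usage limits.

```python
from urllib.parse import quote

# Base path of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL that looks up compound properties by name.

    `pubchem_property_url` is a hypothetical helper, not part of any library.
    """
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("compound name must not be empty")
    # Percent-encode the name so spaces and slashes are safe inside the path.
    encoded_name = quote(cleaned, safe="")
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/JSON"


if __name__ == "__main__":
    print(pubchem_property_url("aspirin"))
```

The resulting URL can be passed to any HTTP client; the response format is selected by the final path segment (`JSON` here).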

Pathway

  • PathwayCommons — Database of pathways and interactions.
  • KEGG PATHWAY — Collection of pathway maps.
  • WikiPathways — Database of biological pathways.
  • Reactome — Expert-curated, peer-reviewed pathway database with detailed reaction mechanisms.
  • BioCyc — Collection of pathway/genome databases across thousands of organisms.
  • SIGNOR — Database of causal signaling interactions and pathways.
  • MSigDB (Molecular Signatures Database) — Curated gene sets derived from pathways and biological processes.

Mass Spectra

  • MassBank — Open-source database and tools for mass spectrometry reference spectra.
  • MoNA (MassBank of North America) — Meta-database of metabolite mass spectra, metadata, and associated compounds.

Protein

Genome

Disease

  • KEGG DRUG — Comprehensive, approved drug information.
  • DrugBank — Database of drugs and targets (University of Alberta).

Interaction

Drug-Gene Interaction

  • DGIdb — Drug-gene interactions and the druggable genome.
  • Comparative Toxicogenomics Database — Chemical-gene interactions, chemical-disease and gene-disease associations, chemical-phenotype associations.
  • SNAP — Dataset of drug-gene interactions.
  • Therapeutics Data Commons — Datasets for drug-target interaction, drug response, drug-drug interaction, and related tasks.

Drug (Cell Line) Response

Chemical-Protein Interaction

  • STITCH — Chemical-protein interactions.
  • BindingDB — Database of measured binding affinities between compounds and protein targets.
  • PDBBind — Binding affinity data for biomolecular complexes.
  • CrossDocked2020 — Large-scale dataset for structure-based virtual screening.

Protein-Protein Interaction

  • STRING — PPI networks for multiple organisms.
  • BioGRID — Protein, genetic, and chemical interactions.
  • HIPPIE — Human protein-protein interaction database.

Knowledge Graph

  • Drug Mechanism Database (DrugMechDB) — Mechanisms of action from drug to disease.
  • DRKG — Large-scale biological knowledge graph for drug discovery.
  • Hetionet — Heterogeneous network integrating genes, diseases, drugs, pathways, and more.
  • OpenBioLink — Benchmark datasets for biological knowledge graph completion.
  • PrimeKG — Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.

Clinical Trial


API


Preprocessing Tools

  • Chemistry Development Kit — Open-source Java library for cheminformatics.
  • FlashDeconv — High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).
  • RDKit — Cheminformatics software & machine learning toolkit.
  • ChatSpatial — MCP server for spatial transcriptomics analysis via natural language.
  • Scanpy — Python library for scRNA-seq analysis.
  • Seurat — R library for scRNA-seq analysis.
  • Squidpy — Python library for spatial single-cell analysis.

Machine Learning Tasks and Models

Drug Response Prediction

  • drGAT — Attention-based model for drug response prediction with gene explainability.
  • MOFGCN — GCN + heterogeneous network.
  • DeepDSC — Autoencoder + fully connected NN.
  • DGDRP — Multi-view embedding neural network.
  • DeepAEG — GNN embedding + attention mechanism.

Drug Repurposing

  • DeepPurpose — Deep learning library for drug repurposing.

Drug Target Interaction

  • NeoDTI — Library for drug-target interaction prediction.
  • DTINet — Network-based framework integrating heterogeneous biological data for DTI prediction.
  • DeepDTA — Deep learning model using CNNs on protein sequences and drug SMILES.
  • GraphDTA — Graph neural network–based DTI prediction using molecular graphs.
  • MolTrans — Transformer-based DTI model leveraging molecular substructures.
  • DrugBAN — Bilinear attention network for interpretable DTI prediction.
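
Sequence-based models in this list, such as DeepDTA, take drug SMILES strings and protein sequences as integer arrays of fixed length. The sketch below shows one simple way to produce such arrays with character-level label encoding and zero padding. The function names, the vocabulary built from the input data, and the `max_len` value are assumptions for illustration, not the preprocessing code of any listed model.

```python
def build_vocab(sequences):
    """Map each distinct character to a positive integer; 0 is reserved for padding."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: index + 1 for index, ch in enumerate(chars)}


def encode(sequence, vocab, max_len):
    """Label-encode a string, truncating or zero-padding it to exactly max_len.

    Simplification: characters missing from the vocabulary are also mapped to 0.
    """
    ids = [vocab.get(ch, 0) for ch in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    smiles = ["CCO", "c1ccccc1"]  # ethanol and benzene
    vocab = build_vocab(smiles)
    for s in smiles:
        print(s, encode(s, vocab, max_len=6))
```

Note that character-level encoding splits multi-character atom symbols such as `Cl` into two tokens; a tokenizer aware of SMILES syntax would avoid this at the cost of extra complexity.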

Compound-Protein Interaction

  • MCPINN — Drug discovery via compound-protein interaction and machine learning.
  • TransformerCPI — CPI prediction using Transformer.

Pre-trained Embedding

LLM for Biology

  • AI4Chem/ChemLLM-7B-Chat — LLM for chemical & molecular science.
  • BioGPT — LLM for biomedical text generation.
  • GeneGPT — LLM for biomedical information, integrated with various APIs.
  • GenePT — Foundation LLM for single-cell data.
  • scPRINT — Pretrained on 50M cells for scRNA-seq denoising & zero imputation.

Foundation Models

  • scFoundation — Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.
  • scGPT — Transformer-based foundation model pretrained on millions of single-cell profiles.
  • BulkFormer — Foundation model for bulk RNA-seq data; learns general transcriptomic representations.