

Awesome Python Data Science


Probably the best curated list of data science software in Python

Machine Learning

General Purpose Machine Learning

  • Shogun - Machine learning toolbox.
  • xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
  • mlpack - A scalable C++ machine learning library (Python bindings).
  • dlib - Toolkit for making real-world machine learning and data analysis applications in C++ (Python bindings).
  • pyGAM - Generalized Additive Models in Python.

Gradient Boosting

  • NGBoost - Natural Gradient Boosting for Probabilistic Prediction.

Ensemble Methods

Imbalanced Datasets

Random Forests

Kernel Methods

Deep Learning

PyTorch

TensorFlow

MXNet

JAX

  • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
  • FLAX - A neural network library for JAX that is designed for flexibility.
  • Optax - A gradient processing and optimization library for JAX.

Others

  • Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
  • autograd - Efficiently computes derivatives of numpy code.
  • Caffe - A fast open framework for deep learning.
  • nnabla - Neural Network Libraries by Sony.
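
Several entries above (JAX, Tangent, autograd) compute derivatives of ordinary Python code. As a rough illustration of the idea only, the sketch below implements forward-mode automatic differentiation with dual numbers using the standard library. The `Dual` class and the `derivative` helper are hypothetical names made up for this example; the listed libraries expose their own APIs and mostly rely on reverse-mode or source-transformation techniques instead.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative (hypothetical helper, not a library type)."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    """Sine with the chain rule applied to the derivative part."""
    x = _lift(x)
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input with derivative 1."""
    return f(Dual(float(x), 1.0)).deriv


def f(x):
    return 3 * x * x + sin(x)


# Analytically, f'(x) = 6x + cos(x).
slope_at_two = derivative(f, 2.0)
```

The function `f` is written as ordinary arithmetic; only the input type changes. The libraries above offer the same convenience with far broader operator coverage and better performance.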

Automated Machine Learning

  • AutoGluon - AutoML for Image, Text, Tabular, Time-Series, and MultiModal Data.
  • MLBox - A powerful Automated Machine Learning python library.

Natural Language Processing

  • spaCy - Industrial-Strength Natural Language Processing.
  • NLTK - Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
  • CLTK - The Classical Language Toolkit.
  • gensim - Topic Modelling for Humans.
  • pyMorfologik - Python binding for Morfologik.
  • Phonemizer - Simple text-to-phonemes converter for multiple languages.
  • flair - Very simple framework for state-of-the-art NLP.

Computer Audition

  • librosa - Python library for audio and music analysis.
  • Yaafe - Audio features extraction.
  • aubio - A library for audio and music analysis.
  • Essentia - Library for audio and music analysis, description, and synthesis.
  • LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
  • Marsyas - Music Analysis, Retrieval, and Synthesis for Audio Signals.
  • muda - A library for augmenting annotated audio data.
  • madmom - Python audio and music signal processing library.

Computer Vision

  • OpenCV - Open Source Computer Vision Library.
  • Decord - An efficient video loader for deep learning with smart shuffling that's super easy to digest.
  • scikit-image - Image Processing SciKit (Toolbox for SciPy).
  • imgaug - Image augmentation for machine learning experiments.
  • imgaug_extension - Additional augmentations for imgaug.
  • Augmentor - Image augmentation library in Python for machine learning.
  • albumentations - Fast image augmentation library and easy-to-use wrapper around other libraries.
  • LAVIS - A One-stop Library for Language-Vision Intelligence.

Time Series

  • darts - A python library for easy manipulation and forecasting of time series.
  • statsforecast - Lightning fast forecasting with statistical and econometric models.
  • mlforecast - Scalable machine learning-based time series forecasting.
  • neuralforecast - Scalable and user-friendly neural forecasting algorithms.
  • greykite - A flexible, intuitive, and fast forecasting library.
  • Prophet - Automatic Forecasting Procedure.
  • PyFlux - Open source time series library for Python.
  • bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
  • luminol - Anomaly Detection and Correlation library.
  • dateutil - Powerful extensions to the standard datetime module.
  • maya - Makes it easy to parse date strings and change timezones.
  • Chaos Genius - ML-powered analytics engine for outlier/anomaly detection and root cause analysis.

Reinforcement Learning

  • Gymnasium - An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).
  • PettingZoo - An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities.
  • MAgent2 - An engine for high performance multi-agent environments with very large numbers of agents, along with a set of reference environments.
  • Stable Baselines3 - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
  • Shimmy - An API conversion tool for popular external reinforcement learning environments.
  • EnvPool - C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
  • RLlib - Scalable Reinforcement Learning.
  • Acme - A library of reinforcement learning components and agents.
  • d3rlpy - An offline deep reinforcement learning library.
  • Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
  • garage - A toolkit for reproducible reinforcement learning research.
  • Horizon - A platform for Applied Reinforcement Learning.
  • cleanrl - High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG).
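
Gymnasium and PettingZoo are described above as API standards for environments. The sketch below shows the general shape of the single-agent interaction loop, assuming the Gymnasium-style convention in which `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. `CoinGuessEnv` and `run_episode` are hypothetical, standard-library-only stand-ins written for this example; they do not come from any of the listed packages.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy environment: guess the hidden coin flip at each step."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0
        self._coin = 0

    def reset(self, seed=None):
        """Start a new episode and return (observation, info)."""
        if seed is not None:
            self._rng.seed(seed)
        self._steps = 0
        self._coin = self._rng.randint(0, 1)
        return 0, {}  # the observation is the previously revealed coin; 0 at the start

    def step(self, action):
        """Apply an action; return (observation, reward, terminated, truncated, info)."""
        reward = 1.0 if action == self._coin else 0.0
        revealed = self._coin
        self._coin = self._rng.randint(0, 1)
        self._steps += 1
        terminated = False                          # no natural end state in this toy task
        truncated = self._steps >= self.max_steps   # stop at the step limit
        return revealed, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Run one episode with `policy(observation) -> action`; return the total reward."""
    obs, info = env.reset(seed=seed)
    total = 0.0
    done = False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


total_reward = run_episode(CoinGuessEnv(max_steps=5), lambda obs: 1, seed=0)
```

Keeping `terminated` and `truncated` separate lets an agent distinguish a genuine end state from a time limit, which matters when bootstrapping value estimates.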

Graph Machine Learning

  • Auto Graph Learning - An autoML framework & toolkit for machine learning on graphs.
  • Karate Club - An unsupervised machine learning library for graph-structured data.
  • Little Ball of Fur - A library for sampling graph structured data.
  • Jraph - A Graph Neural Network Library in Jax.

Learning-to-Rank & Recommender Systems

  • LightFM - A Python implementation of LightFM, a hybrid recommendation algorithm.
  • Spotlight - Deep recommender models using PyTorch.
  • Surprise - A Python scikit for building and analyzing recommender systems.

Probabilistic Graphical Models

  • pgmpy - A python library for working with Probabilistic Graphical Models.
  • pyAgrum - A GRaphical Universal Modeler.

Probabilistic Methods

  • PyMC - Bayesian Stochastic Modelling in Python.
  • PyStan - Bayesian inference using the No-U-Turn sampler (Python interface).
  • emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
  • hsmmlearn - A library for hidden semi-Markov models with explicit durations.
  • pyhsmm - Bayesian inference in HSMMs and HMMs.

Model Explanation

  • Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
  • Alibi - Algorithms for monitoring and explaining machine learning models.
  • anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
  • aequitas - Bias and Fairness Audit Toolkit.
  • ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
  • L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
  • PDPbox - Partial dependence plot toolbox.
  • PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
  • Skater - Python Library for Model Interpretation.
  • AI Explainability 360 - Interpretability and explainability of data and machine learning models.
  • Auralisation - Auralisation of learned features in CNN (for audio).
  • CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
  • lucid - A collection of infrastructure and tools for research in neural network interpretability.
  • Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
  • FlashLight - Visualization Tool for your NeuralNetwork.
  • tensorboard-pytorch - Tensorboard for PyTorch (and chainer, mxnet, numpy, ...).

Genetic Programming

  • DEAP - Distributed Evolutionary Algorithms in Python.
  • monkeys - A strongly-typed genetic programming framework for Python.

Optimization

  • Optuna - A hyperparameter optimization framework.
  • Spearmint - Bayesian optimization.
  • scikit-opt - Heuristic Algorithms for optimization.
  • SMAC3 - Sequential Model-based Algorithm Configuration.
  • Optunity - A library containing various optimizers for hyperparameter tuning.
  • hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
  • Bayesian Optimization - A Python implementation of global optimization with Gaussian processes.
  • SafeOpt - Safe Bayesian Optimization.
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
  • Solid - A comprehensive gradient-free optimization framework written in Python.
  • PySwarms - A research toolkit for particle swarm optimization in Python.
  • Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
  • POT - Python Optimal Transport library.
  • Talos - Hyperparameter Optimization for Keras Models.
  • nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).
  • OR-Tools - An open-source software suite for optimization by Google; provides a unified programming interface to a half dozen solvers: SCIP, GLPK, GLOP, CP-SAT, CPLEX, and Gurobi.

Feature Engineering

General

  • Featuretools - Automated feature engineering.
  • OpenFE - Automated feature generation with expert-level performance.

Feature Selection

  • scikit-feature - Feature selection repository in Python.
  • zoofs - A feature selection library based on evolutionary algorithms.

Visualization

General Purposes

  • Matplotlib - Plotting with Python.
  • seaborn - Statistical data visualization using matplotlib.
  • prettyplotlib - Painlessly create beautiful matplotlib plots.
  • python-ternary - Ternary plotting library for Python with matplotlib.
  • missingno - Missing data visualization module for Python.
  • chartify - Python library that makes it easy for data scientists to create charts.
  • physt - Improved histograms.

Interactive plots

  • animatplot - A python package for animating plots built on matplotlib.
  • plotly - A Python library that makes interactive and publication-quality graphs.
  • Bokeh - Interactive Web Plotting for Python.
  • Altair - Declarative statistical visualization library for Python. Many data transformations can be performed directly in the code that creates the graph.
  • bqplot - Plotting library for IPython/Jupyter notebooks.

Map

  • folium - Makes it easy to visualize data on an interactive OpenStreetMap map.
  • geemap - Python package for interactive mapping with Google Earth Engine (GEE).

Automatic Plotting

  • HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
  • AutoViz - Visualize data automatically with 1 line of code (ideal for machine learning).
  • SweetViz - Visualize and compare datasets, target values and associations, with one line of code.

NLP

  • pyLDAvis - Interactive topic model visualization.

Deployment

  • fastapi - Modern, fast (high-performance) web framework for building APIs with Python.
  • streamlit - Makes it easy to deploy machine learning models.
  • streamsync - No-code in the front, Python in the back. An open-source framework for creating data apps.
  • gradio - Create UIs for your machine learning model in Python in 3 minutes.
  • Vizro - A toolkit for creating modular data visualization applications.
  • datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
  • binder - Enables sharing and executing Jupyter Notebooks.

Statistics

  • statsmodels - Statistical modeling and econometrics in Python.
  • stockstats - Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
  • weightedcalcs - A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
  • scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.
  • Alphalens - Performance analysis of predictive (alpha) stock factors.
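
The weightedcalcs entry above mentions weighted means and medians. For readers unfamiliar with those statistics, here is a minimal standard-library sketch of what they compute. The function names are hypothetical and this is not the weightedcalcs API. It assumes non-negative weights and uses the "lower" weighted median, meaning the smallest value at which the cumulative weight reaches half of the total; other definitions exist.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= total / 2:
            return value


# Three survey responses where the last respondent stands for twice as many people.
mean_score = weighted_mean([1, 2, 3], [1, 1, 2])
median_score = weighted_median([1, 2, 3], [1, 1, 2])
```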

Data Manipulation

Data Frames

  • pandas - Powerful Python data analysis toolkit.
  • polars - A fast multi-threaded, hybrid-out-of-core DataFrame library.
  • Arctic - High-performance datastore for time series and tick data.
  • pandas_profiling - Create HTML profiling reports from pandas DataFrame objects.
  • xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
  • swifter - A package that efficiently applies any function to a pandas dataframe or series in the fastest available manner.
  • pandas-log - A package that allows providing feedback about basic pandas operations and finds both business logic and performance issues.
  • vaex - Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
  • xarray - Xarray combines the best features of NumPy and pandas for multidimensional data selection by supplementing numerical axis labels with named dimensions for more intuitive, concise, and less error-prone indexing routines.

Pipelines

  • pdpipe - Easy pipelines for pandas DataFrames.
  • SSPipe - Python pipe (|) operator with support for DataFrames, NumPy, and PyTorch.
  • Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
  • meza - A Python toolkit for processing tabular data.
  • Prodmodel - Build system for data science pipelines.
  • Hamilton - A microframework for dataframe generation that applies Directed Acyclic Graphs specified by a flow of lazily evaluated Python functions.
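
Several of the pipeline tools above chain processing steps, and SSPipe does so with the `|` operator. The sketch below shows one way such an operator can be built in plain Python through `__ror__`. The `Pipe` class and the `keep_above` helper are hypothetical and written only for this example; SSPipe's actual API and implementation are different, so treat this as an illustration of the idea rather than of the library.

```python
class Pipe:
    """Wrap a function so that `value | Pipe(func, ...)` calls `func(value, ...)`."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __ror__(self, value):
        # Python falls back to this method when the left operand does not
        # know how to apply `|` to a Pipe instance.
        return self.func(value, *self.args, **self.kwargs)


def keep_above(rows, min_score):
    """Keep only rows whose score is at least `min_score`."""
    return [row for row in rows if row["score"] >= min_score]


rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 1},
    {"name": "c", "score": 5},
]

# Read left to right: filter, sort by score (descending), then extract the names.
top_names = (
    rows
    | Pipe(keep_above, min_score=2)
    | Pipe(sorted, key=lambda row: row["score"], reverse=True)
    | Pipe(lambda kept: [row["name"] for row in kept])
)
```

Compared with nested calls such as `names(sorted(keep_above(rows)))`, the piped form lists the steps in the order they run, which is the main appeal of this style.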

Data-centric AI

  • cleanlab - The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
  • snorkel - A system for quickly generating training data with weak supervision.
  • dataprep - Collect, clean, and visualize your data in Python with a few lines of code.

Synthetic Data

Distributed Computing

  • Veles - Distributed machine learning platform.
  • Jubatus - Framework and Library for Distributed Online Machine Learning.
  • DMTK - Microsoft Distributed Machine Learning Toolkit.
  • PaddlePaddle - PArallel Distributed Deep LEarning.
  • Distributed - Distributed computation in Python.

Experimentation

  • mlflow - Open source platform for the machine learning lifecycle.
  • Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
  • dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
  • envd - 🏕️ machine learning development environment for data science and AI/ML engineering teams.
  • Sacred - A tool to help you configure, organize, log, and reproduce experiments.

Data Validation

  • great_expectations - Always know what to expect from your data.
  • pandera - A lightweight, flexible, and expressive statistical data testing library.
  • evidently - Evaluate and monitor ML models from validation to production.
  • TensorFlow Data Validation - Library for exploring and validating machine learning data.

Evaluation

  • recmetrics - Library of useful metrics and plots for evaluating recommender systems.
  • Metrics - Machine learning evaluation metrics.
  • AI Fairness 360 - Fairness metrics for datasets and ML models, explanations, and algorithms to mitigate bias in datasets and models.

Computations

  • numpy - The fundamental package needed for scientific computing with Python.
  • bottleneck - Fast NumPy array functions written in C.
  • CuPy - NumPy-like API accelerated with CUDA.
  • scikit-tensor - Python library for multilinear algebra and tensor factorizations.
  • numdifftools - Solve automatic numerical differentiation problems in one or more variables.
  • quaternion - Add built-in support for quaternions to numpy.
  • adaptive - Tools for adaptive and parallel sampling of mathematical functions.
  • NumExpr - A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.

Web Scraping

  • BeautifulSoup - The easiest library for beginners to scrape static websites.
  • Scrapy - Fast and extensible scraping library. Lets you write rules and build a customized scraper without touching the core.
  • Selenium - Use the Selenium Python API to access all functionality of Selenium WebDriver in an intuitive way, like a real user.
  • Pattern - High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also includes NLP, machine learning algorithms, and visualization.
  • twitterscraper - Efficient library to scrape Twitter.

Spatial Analysis

  • PySal - Python Spatial Analysis Library.

Quantum Computing

  • qiskit - Qiskit is an open-source SDK for working with quantum computers at the level of circuits, algorithms, and application modules.
  • cirq - A python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
  • PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
  • QML - A Python Toolkit for Quantum Machine Learning.

Conversion

  • sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
  • ONNX - Open Neural Network Exchange.
  • MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
  • treelite - Universal model exchange and serialization format for decision tree forests.

Contributing

Contributions are welcome! 😎
Read the contribution guideline.

License

This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0