Awesome Python Data Science
Probably the best curated list of data science software in Python
Machine Learning¶
General Purpose Machine Learning¶
- Shogun - Machine learning toolbox.
- xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
- mlpack - A scalable C++ machine learning library (Python bindings).
- dlib - Toolkit for making real-world machine learning and data analysis applications in C++ (Python bindings).
- pyGAM - Generalized Additive Models in Python.
Gradient Boosting¶
- NGBoost - Natural Gradient Boosting for Probabilistic Prediction.
Ensemble Methods¶
Imbalanced Datasets¶
Random Forests¶
Kernel Methods¶
- liquidSVM - An implementation of SVMs.
Deep Learning¶
PyTorch¶
TensorFlow¶
MXNet¶
JAX¶
- JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
- FLAX - A neural network library for JAX that is designed for flexibility.
- Optax - A gradient processing and optimization library for JAX.
Others¶
- Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
- autograd - Efficiently computes derivatives of numpy code.
- Caffe - A fast open framework for deep learning.
- nnabla - Neural Network Libraries by Sony.
Automated Machine Learning¶
- AutoGluon - AutoML for Image, Text, Tabular, Time-Series, and MultiModal Data.
- MLBox - A powerful Automated Machine Learning python library.
Natural Language Processing¶
- spaCy - Industrial-Strength Natural Language Processing.
- NLTK - Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
- CLTK - The Classical Language Toolkit.
- gensim - Topic Modelling for Humans.
- pyMorfologik - Python binding for Morfologik.
- Phonemizer - Simple text-to-phonemes converter for multiple languages.
- flair - Very simple framework for state-of-the-art NLP.
Computer Audition¶
- librosa - Python library for audio and music analysis.
- Yaafe - Audio features extraction.
- aubio - A library for audio and music analysis.
- Essentia - Library for audio and music analysis, description, and synthesis.
- LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
- Marsyas - Music Analysis, Retrieval, and Synthesis for Audio Signals.
- muda - A library for augmenting annotated audio data.
- madmom - Python audio and music signal processing library.
Computer Vision¶
- OpenCV - Open Source Computer Vision Library.
- Decord - An efficient video loader for deep learning with smart shuffling that's super easy to digest.
- scikit-image - Image Processing SciKit (Toolbox for SciPy).
- imgaug - Image augmentation for machine learning experiments.
- imgaug_extension - Additional augmentations for imgaug.
- Augmentor - Image augmentation library in Python for machine learning.
- albumentations - Fast image augmentation library and easy-to-use wrapper around other libraries.
- LAVIS - A One-stop Library for Language-Vision Intelligence.
Time Series¶
- skforecast - Time series forecasting with machine learning models
- darts - A python library for easy manipulation and forecasting of time series.
- statsforecast - Lightning fast forecasting with statistical and econometric models.
- mlforecast - Scalable machine learning-based time series forecasting.
- neuralforecast - Scalable and user-friendly neural forecasting algorithms.
- greykite - A flexible, intuitive, and fast forecasting library.
- Prophet - Automatic Forecasting Procedure.
- PyFlux - Open source time series library for Python.
- bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
- luminol - Anomaly Detection and Correlation library.
- dateutil - Powerful extensions to the standard datetime module.
- maya - Makes it easy to parse date strings and change timezones.
- Chaos Genius - ML-powered analytics engine for outlier/anomaly detection and root cause analysis.
Reinforcement Learning¶
- Gymnasium - An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).
- PettingZoo - An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities.
- MAgent2 - An engine for high performance multi-agent environments with very large numbers of agents, along with a set of reference environments.
- Stable Baselines3 - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
- Shimmy - An API conversion tool for popular external reinforcement learning environments.
- EnvPool - C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
- RLlib - Scalable Reinforcement Learning.
- Acme - A library of reinforcement learning components and agents.
- d3rlpy - An offline deep reinforcement learning library.
- Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
- garage - A toolkit for reproducible reinforcement learning research.
- Horizon - A platform for Applied Reinforcement Learning.
- cleanrl - High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG).
Graph Machine Learning¶
- Auto Graph Learning - An autoML framework & toolkit for machine learning on graphs.
- Karate Club - An unsupervised machine learning library for graph-structured data.
- Little Ball of Fur - A library for sampling graph structured data.
- Jraph - A Graph Neural Network Library in Jax.
Learning-to-Rank & Recommender Systems¶
- LightFM - A Python implementation of LightFM, a hybrid recommendation algorithm.
- Spotlight - Deep recommender models using PyTorch.
- Surprise - A Python scikit for building and analyzing recommender systems.
Probabilistic Graphical Models¶
- pgmpy - A python library for working with Probabilistic Graphical Models.
- pyAgrum - A GRaphical Universal Modeler.
Probabilistic Methods¶
- PyMC - Bayesian Stochastic Modelling in Python.
- PyStan - Bayesian inference using the No-U-Turn sampler (Python interface).
- emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
- hsmmlearn - A library for hidden semi-Markov models with explicit durations.
- pyhsmm - Bayesian inference in HSMMs and HMMs.
Model Explanation¶
- Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
- Alibi - Algorithms for monitoring and explaining machine learning models.
- anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
- aequitas - Bias and Fairness Audit Toolkit.
- ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
- L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
- PDPbox - Partial dependence plot toolbox.
- PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
- Skater - Python Library for Model Interpretation.
- AI Explainability 360 - Interpretability and explainability of data and machine learning models.
- Auralisation - Auralisation of learned features in CNN (for audio).
- CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
- lucid - A collection of infrastructure and tools for research in neural network interpretability.
- Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
- FlashLight - Visualization Tool for your NeuralNetwork.
- tensorboard-pytorch - Tensorboard for PyTorch (and chainer, mxnet, numpy, ...).
Genetic Programming¶
- DEAP - Distributed Evolutionary Algorithms in Python.
- monkeys - A strongly-typed genetic programming framework for Python.
Optimization¶
- Optuna - A hyperparameter optimization framework.
- pymoo - Multi-objective Optimization in Python.
- pycma - Python implementation of CMA-ES.
- Spearmint - Bayesian optimization.
- scikit-opt - Heuristic Algorithms for optimization.
- SMAC3 - Sequential Model-based Algorithm Configuration.
- Optunity - A library containing various optimizers for hyperparameter tuning.
- hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
- Bayesian Optimization - A Python implementation of global optimization with gaussian processes.
- SafeOpt - Safe Bayesian Optimization.
- scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
- Solid - A comprehensive gradient-free optimization framework written in Python.
- PySwarms - A research toolkit for particle swarm optimization in Python.
- Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
- POT - Python Optimal Transport library.
- Talos - Hyperparameter Optimization for Keras Models.
- nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).
- OR-Tools - An open-source software suite for optimization by Google; provides a unified programming interface to a half dozen solvers: SCIP, GLPK, GLOP, CP-SAT, CPLEX, and Gurobi.
Feature Engineering¶
General¶
- Featuretools - Automated feature engineering.
- OpenFE - Automated feature generation with expert-level performance.
Feature Selection¶
- scikit-feature - Feature selection repository in Python.
- zoofs - A feature selection library based on evolutionary algorithms.
Visualization¶
General Purposes¶
- Matplotlib - Plotting with Python.
- seaborn - Statistical data visualization using matplotlib.
- prettyplotlib - Painlessly create beautiful matplotlib plots.
- python-ternary - Ternary plotting library for Python with matplotlib.
- missingno - Missing data visualization module for Python.
- chartify - Python library that makes it easy for data scientists to create charts.
- physt - Improved histograms.
Interactive plots¶
- animatplot - A python package for animating plots built on matplotlib.
- plotly - A Python library that makes interactive and publication-quality graphs.
- Bokeh - Interactive Web Plotting for Python.
- Altair - Declarative statistical visualization library for Python. Data transformations can be expressed directly in the chart specification.
- bqplot - Plotting library for IPython/Jupyter notebooks.
Map¶
- folium - Makes it easy to visualize data on an interactive open street map
- geemap - Python package for interactive mapping with Google Earth Engine (GEE)
Automatic Plotting¶
- HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
- AutoViz - Visualize data automatically with one line of code (ideal for machine learning).
- SweetViz - Visualize and compare datasets, target values, and associations with one line of code.
NLP¶
- pyLDAvis - Interactive topic model visualization.
Deployment¶
- fastapi - Modern, fast (high-performance) web framework for building APIs with Python.
- streamlit - Makes it easy to deploy machine learning models.
- streamsync - No-code in the front, Python in the back. An open-source framework for creating data apps.
- gradio - Create UIs for your machine learning model in Python in 3 minutes.
- Vizro - A toolkit for creating modular data visualization applications.
- datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
- binder - Enables sharing and executing Jupyter Notebooks.
Statistics¶
- statsmodels - Statistical modeling and econometrics in Python.
- stockstats - Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
- weightedcalcs - A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
- scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.
- Alphalens - Performance analysis of predictive (alpha) stock factors.
Data Manipulation¶
Data Frames¶
- pandas - Powerful Python data analysis toolkit.
- polars - A fast multi-threaded, hybrid-out-of-core DataFrame library.
- Arctic - High-performance datastore for time series and tick data.
- pandas_profiling - Create HTML profiling reports from pandas DataFrame objects
- xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
- swifter - A package that efficiently applies any function to a pandas dataframe or series in the fastest available manner.
- pandas-log - A package that allows providing feedback about basic pandas operations and finds both business logic and performance issues.
- vaex - Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
- xarray - Xarray combines the best features of NumPy and pandas for multidimensional data selection by supplementing numerical axis labels with named dimensions for more intuitive, concise, and less error-prone indexing routines.
Pipelines¶
- pdpipe - Easy pipelines for pandas DataFrames.
- SSPipe - Python pipe (|) operator with support for DataFrames, NumPy, and PyTorch.
- Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
- meza - A Python toolkit for processing tabular data.
- Prodmodel - Build system for data science pipelines.
- Hamilton - A microframework for dataframe generation that applies Directed Acyclic Graphs specified by a flow of lazily evaluated Python functions.
Data-centric AI¶
- cleanlab - The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
- snorkel - A system for quickly generating training data with weak supervision.
- dataprep - Collect, clean, and visualize your data in Python with a few lines of code.
Synthetic Data¶
Distributed Computing¶
- Veles - Distributed machine learning platform.
- Jubatus - Framework and Library for Distributed Online Machine Learning.
- DMTK - Microsoft Distributed Machine Learning Toolkit.
- PaddlePaddle - PArallel Distributed Deep LEarning.
- Distributed - Distributed computation in Python.
Experimentation¶
- mlflow - Open source platform for the machine learning lifecycle.
- Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
- dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
- envd - 🏕️ machine learning development environment for data science and AI/ML engineering teams.
- Sacred - A tool to help you configure, organize, log, and reproduce experiments.
Data Validation¶
- great_expectations - Always know what to expect from your data.
- pandera - A lightweight, flexible, and expressive statistical data testing library.
- evidently - Evaluate and monitor ML models from validation to production.
- TensorFlow Data Validation - Library for exploring and validating machine learning data.
- DataComPy - A library to compare Pandas, Polars, and Spark data frames. It provides stats and lets users adjust for match accuracy.
Evaluation¶
- recmetrics - Library of useful metrics and plots for evaluating recommender systems.
- Metrics - Machine learning evaluation metrics.
- AI Fairness 360 - Fairness metrics for datasets and ML models, explanations, and algorithms to mitigate bias in datasets and models.
Computations¶
- numpy - The fundamental package needed for scientific computing with Python.
- bottleneck - Fast NumPy array functions written in C.
- CuPy - NumPy-like API accelerated with CUDA.
- scikit-tensor - Python library for multilinear algebra and tensor factorizations.
- numdifftools - Solve automatic numerical differentiation problems in one or more variables.
- quaternion - Add built-in support for quaternions to numpy.
- adaptive - Tools for adaptive and parallel sampling of mathematical functions.
- NumExpr - A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.
Web Scraping¶
- BeautifulSoup - The easiest library for beginners to scrape static websites.
- Scrapy - Fast and extensible scraping library. You can write rules and build a customized scraper without touching the core.
- Selenium - Use the Selenium Python API to access all functionalities of Selenium WebDriver in an intuitive way, like a real user.
- Pattern - High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also includes NLP, machine learning algorithms, and visualization.
- twitterscraper - Efficient library to scrape Twitter.
Spatial Analysis¶
- PySal - Python Spatial Analysis Library.
Quantum Computing¶
- qiskit - Qiskit is an open-source SDK for working with quantum computers at the level of circuits, algorithms, and application modules.
- cirq - A python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
- PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
- QML - A Python Toolkit for Quantum Machine Learning.
Conversion¶
- sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
- ONNX - Open Neural Network Exchange.
- MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
- treelite - Universal model exchange and serialization format for decision tree forests.
Contributing¶
Contributions are welcome! Read the contribution guideline.
License¶
This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0