Best data science cheat sheets

We’ve put together a collection of cheat sheets to help you get to grips with the main libraries used in data science.

They are grouped by the field each library is designed for: Basics, Databases, Data Manipulation, Data Visualization, Machine Learning, Deep Learning and Natural Language Processing (NLP).

Basics

If you're just starting out in data science, it's important to understand at least two fundamentals: the Python language and the NumPy library. Both are used throughout the entire development process. A third library, SciPy, is a mathematical tool that can handle more complex calculations than NumPy.

Python basics

Level: Beginner - Intermediate
Area: Basics
Description: Python is the programming language on which much of the data science methodology has been developed. The way we tackle and structure a project is inherited from how we work in Python.
Source: DataQuest

NumPy basics

Level: Beginner - Intermediate
Area: Basics
Description: NumPy is the mathematical Python library par excellence (its name is taken from Numerical Python). It allows us to work more efficiently with vectors and matrices.
Source: DataCamp
Cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
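As a rough illustration of what working with vectors and matrices looks like in NumPy, here is a minimal sketch. The array values and variable names are made up for the example.

```python
import numpy as np

# A vector and a 2x3 matrix stored as NumPy arrays
v = np.array([1.0, 2.0, 3.0])
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])

# Arithmetic is applied to every element at once, with no explicit Python loop
scaled = v * 2

# Matrix-vector product: a (2, 3) matrix times a length-3 vector gives a length-2 vector
product = M @ v

print(scaled)    # [2. 4. 6.]
print(product)   # [7. 2.]
print(M.shape)   # (2, 3)
```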

SciPy

Level: Advanced
Area: Basics
Description: The SciPy library has been developed to work with NumPy and is designed for more complex numerical calculations, more closely related to scientific computing.
Source: DataCamp
Cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_SciPy_Cheat_Sheet_Linear_Algebra.pdf
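To give a flavour of the linear algebra covered by this cheat sheet, the sketch below solves a small system of equations with scipy.linalg. The matrix and right-hand side are illustrative values only.

```python
import numpy as np
from scipy import linalg

# Solve the linear system A x = b:
#   3x + 1y = 9
#   1x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = linalg.solve(A, b)   # solution vector, here x = 2 and y = 3
det_A = linalg.det(A)    # determinant of A, here 3*2 - 1*1 = 5

print(x, det_A)
```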

Databases

Data can be stored in standalone data sets or, in many cases, in relational or non-relational databases, from which it is imported into the working platform.

SQL

Level: Beginner - Intermediate
Area: Relational databases
Description: Relational databases store data in separate tables and link those tables to one another using keys, which avoids duplicating information. SQL is the standard language for querying data stored in these tables, and it is versatile enough to cover most everyday needs.
Source: sqltutorial
Cheat sheet: https://www.sqltutorial.org/sql-cheat-sheet/
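The idea of separate tables related through keys can be tried out from Python with the built-in sqlite3 module. This is only a sketch: the customers and orders tables, their columns and their rows are invented for the example.

```python
import sqlite3

# An in-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables related through a key: orders.customer_id points at customers.id
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    )
""")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0)])

# Join the tables on the key and total the order amounts per customer
rows = cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

print(rows)   # [('Ana', 55.0), ('Ben', 10.0)]
conn.close()
```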

MongoDB

Level: Beginner - Intermediate
Area: Non-relational databases
Description: Non-relational databases are increasingly popular, especially with the rise of big data companies and apps, because they avoid the rigid data structures imposed by relational databases. MongoDB is one of the most widely used document-oriented databases.
Source: codecentric
Cheat sheet: https://blog.codecentric.de/files/2012/12/MongoDB-CheatSheet-v1_0.pdf

Data Manipulation

Before getting started with data analytics, it's essential to organise the data set's information so that it's easier to perform the necessary analytical operations. This process is known as data manipulation.

Pandas

Level: Beginner - Intermediate
Area: Data manipulation
Description: Pandas is the library par excellence for processing data in DataFrames. In other words, it allows us to read records, manipulate data, group them and organise them in a way that facilitates our analysis. This cheat sheet shows you some essential steps to help you use the library.
Source: DataCamp
Cheat sheet: http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3
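As a small taste of reading, filtering and grouping data in a DataFrame, here is a hedged sketch. The city, year and sales columns are hypothetical sample data.

```python
import pandas as pd

# A small DataFrame built by hand; in practice it would usually come from a file
df = pd.DataFrame({
    "city": ["Madrid", "Madrid", "Lisbon", "Lisbon"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 120, 80, 95],
})

# Filter rows with a boolean mask
recent = df[df["year"] == 2023]

# Group by city, sum the sales and sort from highest to lowest
totals = df.groupby("city")["sales"].sum().sort_values(ascending=False)

print(recent)
print(totals)   # Madrid 220, Lisbon 175
```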

Data Wrangling

Level: Beginner - Intermediate
Area: Data manipulation
Description: Before conducting an analysis, it's important to clean the DataFrame and organise our data, since we sometimes find duplicate, missing or invalid records. The process of cleaning the DataFrame so we can use it for our analysis is known as data cleaning or data wrangling.
Source: pandas
Cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
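The three kinds of problem record mentioned above (duplicate, missing and invalid) can each be handled with a pandas method. The following is a minimal sketch on made-up data, assuming that a negative age counts as invalid.

```python
import numpy as np
import pandas as pd

# Sample data with a duplicated row (id 2), a missing age (id 3)
# and an invalid negative age (id 4)
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "age": [34, 29, 29, np.nan, -5],
})

clean = (
    raw.drop_duplicates()          # remove the repeated row
       .dropna(subset=["age"])     # drop rows where age is missing
       .query("age >= 0")          # keep only valid, non-negative ages
       .reset_index(drop=True)     # renumber the remaining rows from 0
)

print(clean)   # two rows remain: id 1 and id 2
```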

Data Visualization

Data visualization is the graphic representation of data and is particularly important for conducting analyses or portraying analysis results, which can help us discover trends, outliers and patterns in the data.

Matplotlib

Level: Beginner
Area: Data visualization
Description: Matplotlib is one of the earliest and most widely used plotting libraries in Python. It offers a huge range of options for drawing graphs and customising them, from the simplest to the most complicated visualizations.
Source: DataCamp
Cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf
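A simple customised graph might look like the sketch below. The data, labels and output filename (line_plot.png) are assumptions made for the example, and the Agg backend is selected only so that the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file instead of a window
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

# Create a figure with one set of axes, draw the line and customise it
fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.legend()

# Write the figure to an image file in the current directory
fig.savefig("line_plot.png")
```

In an interactive session you would typically drop the matplotlib.use("Agg") line and call plt.show() instead of saving the figure.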

Seaborn

Level: Intermediate
Area: Data Visualization
Description: The Seaborn library is built on top of matplotlib and offers a higher-level interface designed for statistical graphics, making it easier to show the statistical analysis of data directly on graphs.
Source: DataCamp
Cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf

Folium

Level: Intermediate
Area: Data visualization
Description: Within the field of visualization, maps are a very useful form of representation that allows us to depict geospatial positioning and distances. Folium is a library that allows us to generate maps and easily depict data from a data set, rendering a base map such as Mapbox or OpenStreetMap and adding layers of visual data like clustered points or a heatmap.
Source: AndrewChallis

Machine Learning

Machine learning algorithms allow us to make predictions based on available data. Predictive algorithms are known as regression or classification algorithms, depending on whether the value to be predicted is numerical or categorical. Learning can be supervised or unsupervised, depending on whether or not the model is trained using labelled data. These labels are known as the 'ground truth'.

Scikit-Learn

Level: Advanced
Area: Machine learning
Description: Scikit-Learn is a library built on top of NumPy and SciPy and designed for data modelling: classification, regression, clustering, feature manipulation, outlier detection, model selection and validation. It is known for being robust and easy to integrate with other Python libraries.
Source: DataCamp
Cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf
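A typical supervised workflow (split the labelled data, fit a model, validate it) can be sketched as follows. The choice of the bundled iris data set, logistic regression and a 25% test split are assumptions made for illustration, not recommendations from the cheat sheet.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labelled data: X holds the features, y holds the class labels (the ground truth)
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data to validate the model on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train a classifier on the training split only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare the predictions on the test split with the true labels
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```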

Deep Learning

Within the field of machine learning, there is a more specific field known as deep learning, which uses artificial neural networks to make predictions.

Keras

Level: Advanced
Area: Deep learning
Description: The Keras library is written in Python and is capable of running on top of CNTK, TensorFlow and Theano, making it possible to generate and evaluate neural network models.
Source: DataCamp
Cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf

TensorFlow

Level: Advanced
Area: Deep learning
Description: This is a second-generation deep learning library developed by Google. It allows users to create models using either a low-level or a high-level API, defining individual mathematical operations or whole neural networks, depending on the user's preference.
Source: Altoros
Cheat sheet: https://cdn-images-1.medium.com/max/2000/1*dtOZSuYDonyyBvEULpJALw.png

PyTorch

Level: Advanced
Area: Deep learning
Description: PyTorch is a deep learning library developed by Facebook. It is one of the newer libraries in the field and offers an interface for working with tensors that many users find more accessible than that of TensorFlow or Keras, for example.
Source: PyTorch
Cheat sheet: https://pytorch.org/tutorials/beginner/ptcheat.html
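To show what working with tensors looks like in practice, the sketch below creates a tensor and lets PyTorch compute a gradient automatically. The values are illustrative only.

```python
import torch

# A tensor that tracks the operations applied to it, so gradients can be computed
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# A scalar function of x: y = x1^2 + x2^2 + x3^2
y = (x ** 2).sum()

# Backpropagation fills x.grad with dy/dx, which is 2 * x for this function
y.backward()

print(y.item())   # 14.0
print(x.grad)     # tensor([2., 4., 6.])
```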

Natural Language Processing (NLP)

Within the field of data science, language analysis is an area that's increasingly gaining ground, with algorithms that have been developed to help us analyse text.

NLTK

Level: Beginner - Intermediate
Area: NLP
Description: NLTK is one of the first libraries developed for natural language analysis. It allows users to carry out processes such as tokenization, stemming and lemmatization (reducing words to a root or dictionary form), and character or word counts, in order to read and understand the text under analysis.
Source: Cheatography
Cheat sheet: https://cheatography.com/murenei/cheat-sheets/natural-language-processing-with-python-and-nltk/
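Tokenization, stemming and word counting can be combined in a few lines. This is a minimal sketch on a made-up sentence. It uses wordpunct_tokenize, a regex-based tokenizer chosen here because it needs no extra NLTK data downloads, together with the Porter stemmer.

```python
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "The cats were running while the dog runs."

# Tokenization: split the lower-cased text into word and punctuation tokens
tokens = wordpunct_tokenize(text.lower())

# Keep alphabetic tokens only, dropping punctuation such as the final full stop
words = [token for token in tokens if token.isalpha()]

# Stemming: reduce each word to its stem, so "running" and "runs" both become "run"
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

# Word count on the stems
counts = Counter(stems)
print(counts.most_common(3))
```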

spaCy

Level: Advanced
Area: NLP
Description: spaCy is a natural language processing library that analyses texts at different levels, such as NER (named entity recognition), parsing (syntactic analysis) or similarity, using a model trained in one language. It also allows us to create models from scratch with our own examples, so that they recognise the entities we define.
Source: DataCamp
Cheat sheet: http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06

These cheat sheets contain each library's most useful functions and working methods to help you in your day-to-day development tasks. Happy Coding!

