Data Visualization Using Seaborn
Seaborn is used for data visualization, and it is based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Data visualization helps surface meaningful insights from data. It is used to visualize the distribution of the data and the relationships between variables. When data are visualized properly, the human visual system can see trends and patterns that indicate a relationship.
Let’s learn about different types of seaborn plots in this article.
Table of contents
1. Visualizing the distribution of the dataset
2. Visualizing associations among two or more quantitative variables
3. Plotting categorical data
I have taken a small dataset for easy understanding.
Visualizing the distribution of the dataset.
1. Univariate distribution
2. Bivariate distribution
- joint plot
A histogram is used for visualizing the distribution of a single variable (univariate distribution). A histogram is a bar plot where the axis representing the data variable is divided into a set of bins, and the count of observations falling into each bin is shown on the other axis.
Data variable vs count
Importing libraries and dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Now we will create histogram plots.
Creating histogram for “Marks” variable. Let’s see the distribution of marks in this dataset.
- From the plot, we can see the range of marks (5 to 100).
- The plot also clearly shows that more students scored above 80 than in any lower range.
In seaborn, the hue parameter determines which column in the data frame should be used for color encoding.
We can include the “Grade” variable as a hue parameter.
Now, after adding the hue parameter, we get more information like which range of marks belongs to which grade.
2. KDE plot
A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, similar to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions.
Kernel density estimation (KDE) is a way to estimate the probability density function of a continuous variable.
Data variable vs density
By using the KDE plot, we can infer the probability density function of the continuous variable.
hue parameter in KDE plot
Distplot combines a histogram with a density curve drawn over it. Like a histogram, it is used for visualizing the distribution of a single variable (univariate distribution).
In distplot, the y-axis represents density. So the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
To visualize only a density plot, we can give
To visualize only the histogram, we can give
A Jointplot displays the relationship between two numeric variables. It is a combination of scatterplot and histogram.
The joint plot also draws a regression line if we pass kind="reg".
Using hue as a parameter
Pairplot describes the pairwise relationships in a dataset: it shows the univariate distribution of every variable along with all of their pairwise relationships. For n variables, it produces an n × n grid.
By default, the diagonal plots are histograms and all the other plots are scatter plots.
Look for trends in the off-diagonal plots. In this example, Marks vs Study_hours shows a linear relationship (positive correlation).
The Student_Id column shows no relationship with either “Marks” or “Study_hours”.
The Student_Id column can therefore be dropped from the dataset.
Using hue parameter in pairplot.
Visualizing associations among two or more quantitative variables
A KDE plot can also be used for a bivariate distribution.
Let’s see the distribution of data for “Study_hours” vs “Marks”
2. Scatter plot
The scatterplot shows the relationship between two numerical variables.
From a scatterplot, we can judge the correlation between the variables:
- Positive Correlation: The two variables move in the same direction.
- Negative Correlation: The two variables move in opposite directions.
- Zero Correlation: There is no linear relationship between the two variables.
Example 1: Let’s see the relationship between “Marks” and “Study_hours”
The plot indicates a positive correlation: “Marks” increases as “Study_hours” increases.
Example 2: Let’s see the relationship between “Student_Id” and “Marks”.
We can see that there is zero correlation between “Student_Id” and “Marks”. We can drop the column “Student_Id” from the dataset since it’s not related to the “Marks” variable.
Example 3: Using the hue parameter in a scatterplot.
In a scatterplot, we can add a third variable by passing it to the hue parameter. It will be encoded as color.
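A minimal sketch of a scatterplot with hue, on a hypothetical stand-in dataset. The correlation coefficient is computed with pandas only to put a number on what the plot shows.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the article's dataset (illustrative values only).
df = pd.DataFrame({
    "Marks":       [95, 72, 80, 97, 75, 100, 90, 98, 55, 60, 65, 50, 5, 20, 35],
    "Study_hours": [9, 7, 8, 10, 7, 10, 9, 10, 5, 6, 6, 5, 1, 2, 3],
    "Grade":       ["A"] * 8 + ["B"] * 4 + ["C"] * 3,
})

# Two numeric variables on the axes, a third (categorical) variable as color.
ax = sns.scatterplot(data=df, x="Study_hours", y="Marks", hue="Grade")

# Pearson correlation between the two numeric columns.
r = df["Marks"].corr(df["Study_hours"])
```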
3. Line plot
The relationship between the two variables can be shown by a line plot.
Example 1: Let’s see the relationship between “Marks” and “Study_hours”
“Marks” increases when “Study_hours” increases.
Using the hue parameter in lineplot
An lmplot is a scatterplot with a trend line: it fits and plots a regression line through the points.
“Marks” and “Study_hours” have a linear relationship. The shaded band around the regression line is the confidence interval for the regression estimate (95% by default), not the spread of the data itself.
Using hue parameter in lmplot
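A short sketch of lmplot with and without hue, on a hypothetical stand-in dataset. The np.polyfit call is an addition for illustration: it fits the same kind of straight line so the slope can be read off as a number.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the article's dataset (illustrative values only).
df = pd.DataFrame({
    "Marks":       [95, 72, 80, 97, 75, 100, 90, 98, 55, 60, 65, 50, 5, 20, 35],
    "Study_hours": [9, 7, 8, 10, 7, 10, 9, 10, 5, 6, 6, 5, 1, 2, 3],
    "Grade":       ["A"] * 8 + ["B"] * 4 + ["C"] * 3,
})

# Scatterplot plus one regression line with its confidence band.
g = sns.lmplot(data=df, x="Study_hours", y="Marks")

# One regression line per grade.
g_hue = sns.lmplot(data=df, x="Study_hours", y="Marks", hue="Grade")

# Least-squares straight line: Marks ~ slope * Study_hours + intercept.
slope, intercept = np.polyfit(df["Study_hours"], df["Marks"], deg=1)
```

lmplot is a figure-level function, so each call returns a FacetGrid with its own figure.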
“Marks” increases as “Study_hours” increases. The vertical line shows the range of “Marks” values for that particular “Study_hours” value.
In many scenarios, we may want to visualize the magnitude of some set of numbers, such as the total number of students in each class or the total number of employees working in different companies.
To visualize amounts, bar plots are used.
A bar plot represents an estimate of central tendency for a numeric variable [Mean] with the height of each rectangle and provides some indication of the uncertainty around that estimate using error bars.
Barplot will only show the mean value of a numerical variable for each level of the categorical variable.
If we want the distribution of values at each level of the categorical variable, we can use a boxplot or violin plot.
Example 1: Let’s do a barplot between “Grade” and “Marks” (categorical vs numeric variable).
For each grade level, the mean value of “Marks” is shown.
Let’s calculate the mean value of Marks in grade A.
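In grade level A, “Marks” are 95, 72, 80, 97, 75, 100, 90, 98.
Mean = (95+72+80+97+75+100+90+98)/8 = 707/8 = 88.375
By default, seaborn draws the error bar as a 95% confidence interval around the mean (estimated by bootstrapping), so it is narrower than the full range of “Marks” in Grade A, which runs from 72 to 100.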
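The Grade A arithmetic can be checked in a few lines and compared with the bar heights that barplot draws. The Grade A marks are the ones listed above; the Grade B and Grade C rows are hypothetical values added so that the plot has more than one bar.

```python
import pandas as pd
import seaborn as sns

grade_a_marks = [95, 72, 80, 97, 75, 100, 90, 98]
mean_a = sum(grade_a_marks) / len(grade_a_marks)  # 707 / 8 = 88.375

# Grade A rows come from the text; the B and C rows are illustrative only.
df = pd.DataFrame({
    "Marks": grade_a_marks + [55, 60, 65, 50] + [5, 20, 35],
    "Grade": ["A"] * 8 + ["B"] * 4 + ["C"] * 3,
})

# Bar height = mean of "Marks" per grade; error bars use seaborn's default.
ax = sns.barplot(data=df, x="Grade", y="Marks")
bar_heights = [bar.get_height() for bar in ax.patches]

# The same per-grade means, computed directly with pandas.
group_means = df.groupby("Grade")["Marks"].mean()
```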
Countplot shows the count of observations in each categorical bin using bars.
We come to know the count of students who got A grade, B grade, and C grade.
Adding hue parameter in countplot
We come to know the count of male and female students in each grade.
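A minimal sketch of countplot with hue, on a hypothetical stand-in dataset (the "Gender" column name and its values are assumptions). pd.crosstab gives the same counts as a table.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the article's dataset (illustrative values only).
df = pd.DataFrame({
    "Grade":  ["A"] * 8 + ["B"] * 4 + ["C"] * 3,
    "Gender": ["F", "M", "F", "M", "M", "F", "F", "M",
               "M", "F", "M", "F", "M", "F", "M"],
})

# One bar per (grade, gender) pair; bar height is the number of students.
ax = sns.countplot(data=df, x="Grade", hue="Gender")

# The same counts as a grade-by-gender table.
counts = pd.crosstab(df["Grade"], df["Gender"])
```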
Boxplot is used to describe how the data is distributed in the dataset. This graph represents the five-number summary (minimum, maximum, median, lower quartile, and upper quartile) and is used to identify outliers.
- whiskers: denote the spread of the data
- The upper whisker extends to the largest value that is no greater than the third quartile (Q3) plus 1.5 times the interquartile range (IQR).
- The data points above the upper whisker or below the lower whisker are flagged as outliers.
- box: represents the IQR; the middle 50% of the data lies within this range
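The whisker rule can be worked through by hand with pandas, as sketched below on hypothetical stand-in data. The lower bound (Q1 minus 1.5 times the IQR) is the matching rule for the lower whisker and is an addition here; the 1.5 multiplier is seaborn's default.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the article's dataset (illustrative values only).
df = pd.DataFrame({
    "Marks": [95, 72, 80, 97, 75, 100, 90, 98, 55, 60, 65, 50, 5, 20, 35],
    "Grade": ["A"] * 8 + ["B"] * 4 + ["C"] * 3,
})

marks = df["Marks"]
q1, q3 = marks.quantile([0.25, 0.75])
iqr = q3 - q1                      # height of the box
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
upper_whisker = marks[marks <= upper_fence].max()
lower_whisker = marks[marks >= lower_fence].min()

# Anything outside the fences would be drawn as an outlier point.
outliers = marks[(marks < lower_fence) | (marks > upper_fence)]

# Box plot of "Marks" for each grade.
ax = sns.boxplot(data=df, x="Grade", y="Marks")
```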
A violinplot combines a kernel density estimate of the data with a box plot, so both views of the distribution are visible at once.
The white dot in the middle is the median value and the thick black bar in the center represents the interquartile range. The thin black line extending from it reaches the lower and upper adjacent values, that is, the most extreme data points within 1.5 times the IQR of the quartiles.
The density plot is rotated and kept on each side to show the distribution of data.
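A minimal sketch of a violin plot on a hypothetical stand-in dataset. inner="box" is seaborn's default and is spelled out only to make the miniature box plot explicit.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the article's dataset (illustrative values only).
df = pd.DataFrame({
    "Marks": [95, 72, 80, 97, 75, 100, 90, 98, 55, 60, 65, 50, 5, 20, 35],
    "Grade": ["A"] * 8 + ["B"] * 4 + ["C"] * 3,
})

# One violin per grade: mirrored KDE outline with a small box plot inside.
ax = sns.violinplot(data=df, x="Grade", y="Marks", inner="box")
```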
A Stripplot is a one-dimensional scatterplot of the given data where one variable is categorical. This is usually used when the sample size is small.
Swarmplot is similar to stripplot, but the points are adjusted along the categorical axis so that they do not overlap. This makes the distribution of values easier to read than in a stripplot, especially where many points share similar values.
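The two plots can be compared side by side, as in this sketch on a hypothetical stand-in dataset:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the article's dataset (illustrative values only).
df = pd.DataFrame({
    "Marks": [95, 72, 80, 97, 75, 100, 90, 98, 55, 60, 65, 50, 5, 20, 35],
    "Grade": ["A"] * 8 + ["B"] * 4 + ["C"] * 3,
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Strip plot: points are jittered randomly, so some may overlap.
sns.stripplot(data=df, x="Grade", y="Marks", ax=ax_strip)

# Swarm plot: points are shifted sideways so that none overlap.
sns.swarmplot(data=df, x="Grade", y="Marks", ax=ax_swarm)

# Both plots draw exactly one point per row of the data.
n_strip = sum(len(c.get_offsets()) for c in ax_strip.collections)
n_swarm = sum(len(c.get_offsets()) for c in ax_swarm.collections)
```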
Heatmap is a two-dimensional graphical representation of data where individual values that are contained in a matrix are represented using colors.
Let’s see how to check correlation using a heatmap.
Correlation is a statistical technique that is used to check how two variables are related.
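If the correlation coefficient is close to 1 or -1, the two variables are strongly correlated (positively or negatively, respectively); a value near 0 means there is little linear relationship. In this dataset, “Marks” and “Study_hours” have a strong correlation.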
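A correlation heatmap can be sketched by passing the result of DataFrame.corr() to heatmap. The dataset is a hypothetical stand-in, and annot, vmin, vmax, and cmap are optional styling choices rather than requirements.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the article's dataset (illustrative values only).
df = pd.DataFrame({
    "Student_Id":  [3, 11, 7, 14, 1, 9, 5, 12, 15, 2, 8, 13, 6, 10, 4],
    "Marks":       [95, 72, 80, 97, 75, 100, 90, 98, 55, 60, 65, 50, 5, 20, 35],
    "Study_hours": [9, 7, 8, 10, 7, 10, 9, 10, 5, 6, 6, 5, 1, 2, 3],
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

# Each cell is colored by its coefficient; annot=True prints the value in the cell.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

In this stand-in data the “Marks” and “Study_hours” cell is close to 1, while the Student_Id cells stay near 0.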
- Stripplot and Swarmplot are categorical scatterplots
- Box plot and violin plot are categorical distribution plots
- Barplot and Countplot are categorical estimate plots
- Histogram, kdeplot, and distplot are univariate distribution plots
- jointplot is a bivariate distribution plot.
The relationship between x and y can be shown for different subsets of the data using the hue parameter.
If a numeric variable is passed to the hue parameter, its values are represented with a sequential colormap by default.
If a categorical variable is passed to the hue parameter, each category is represented with a different color.
This covers some of the data visualizations using seaborn.
I hope that you have found this article helpful. Thanks for reading!