Data Visualization Using Seaborn

Photo by Marcin Dampc on Pexels.com

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Data visualization helps uncover meaningful insights in data. It can show how a single variable is distributed and how two variables relate to each other. When data are visualized properly, the human visual system can pick out trends and patterns that indicate a relationship.

Let’s learn about different types of seaborn plots in this article.

Table of contents

1. Visualizing the distribution of the dataset

  • histogram
  • kdeplot
  • distplot
  • jointplot
  • pairplot

2. Visualizing associations among two or more quantitative variables

  • scatterplot
  • lineplot
  • lmplot
  • pointplot

3. Plotting categorical data

  • barplot
  • countplot
  • boxplot
  • violinplot
  • stripplot
  • swarmplot

Dataset

I have taken a small dataset for easy understanding.

Visualizing the distribution of the dataset.

1. Univariate distribution

  • histogram
  • kdeplot
  • distplot

2. Bivariate distribution

  • joint plot
  • pairplot

Univariate distribution

1. Histogram

A histogram visualizes the distribution of a single variable (a univariate distribution). It is a bar plot in which the axis representing the data variable is divided into a set of bins, and the count of observations falling in each bin is shown on the other axis.
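To make the binning step concrete, here is a minimal sketch in plain Python. It is not how seaborn computes its bins internally; the bin edges (70 to 100 in three bins) and the helper name histogram_counts are assumptions made for illustration. The sample is the list of Grade A marks quoted in the bar plot section of this article.

```python
# Grade A marks quoted in the bar plot section, reused here as a small sample.
marks = [95, 72, 80, 97, 75, 100, 90, 98]


def histogram_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right so that v == high is counted.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical bins: 70-80, 80-90, 90-100.
counts = histogram_counts(marks, low=70, high=100, n_bins=3)
print(counts)
```

Each entry of counts is the height of one bar. The helper assumes every value lies between low and high.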

Data variable vs count

Importing libraries and dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
df=pd.read_csv("Results.csv")
df.head(3)

Now we will create histogram plots.

Let’s create a histogram for the “Marks” variable to see how marks are distributed in this dataset.

sns.histplot(x="Marks",data=df)

Inference

  1. From the plot, we can see that marks range from 5 to 100.
  2. The plot also shows that more students scored above 80 than in any other range.

hue

In seaborn, the hue parameter determines which column in the data frame should be used for color encoding.

We can include the “Grade” variable as a hue parameter.

sns.histplot(x="Marks",data=df,bins=10,hue="Grade")

Inference

After adding the hue parameter, the plot also shows which range of marks belongs to which grade.

2. KDE plot

A kernel density estimate (KDE) plot visualizes the distribution of observations in a dataset, much like a histogram. Instead of bars, KDE represents the data with a continuous probability density curve in one or more dimensions.

KDE → Kernel density estimation is a way to estimate the probability density function of a continuous variable.
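The idea behind the estimate can be sketched in a few lines of plain Python: place a small Gaussian bump on every observation and average them. This is only an illustration, not seaborn's implementation. The fixed bandwidth of 5 is a hypothetical choice (seaborn picks a bandwidth automatically), and the sample is the list of Grade A marks quoted in the bar plot section.

```python
import math

# Grade A marks quoted in the bar plot section, reused here as a small sample.
marks = [95, 72, 80, 97, 75, 100, 90, 98]


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the probability density at x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian kernels, one centred on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


density = gaussian_kde(marks, bandwidth=5.0)  # bandwidth is an assumption

# Numerically integrate the curve: a probability density has total area 1.
step = 0.1
xs = [20 + i * step for i in range(1301)]  # grid from 20 to 150
area = sum(density(x) for x in xs) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve, and a smaller one follows the individual observations more closely.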

Data variable vs density

sns.kdeplot(x="Marks",data=df)

Inference

Using the KDE plot, we can estimate the probability density function of the continuous variable.

hue parameter in KDE plot

sns.kdeplot(x="Marks",data=df,hue="Grade")

3. Distplot

Distplot combines a histogram with a density curve (KDE) drawn over it. Like the previous two plots, it visualizes the distribution of a single variable (a univariate distribution). Note that distplot is deprecated since seaborn 0.11; histplot with kde=True, or displot, is the suggested replacement.

In distplot, the y-axis represents density, so the height of each bar shows a density rather than a count. The histogram is normalized this way whenever a KDE or fitted density is drawn.
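The difference between a count and a density is a simple rescaling: each count is divided by the number of observations times the bin width, so the bar areas add up to 1. The sketch below assumes three hypothetical bins of width 10 (70-80, 80-90, 90-100) filled with the eight Grade A marks quoted in the bar plot section.

```python
# Observations per bin for the hypothetical bins 70-80, 80-90, 90-100.
counts = [2, 1, 5]
bin_width = 10
n = sum(counts)

# Density = count / (number of observations * bin width).
densities = [c / (n * bin_width) for c in counts]

# The bar areas (height * width) of a density histogram sum to 1.
total_area = sum(d * bin_width for d in densities)
print(densities, total_area)
```

Because the KDE curve also has total area 1, the two can share the same y-axis.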

sns.distplot(df["Marks"])

To visualize only a density plot, we can pass hist=False.

sns.distplot(df["Marks"],hist=False)

To visualize only the histogram, we can pass kde=False.

sns.distplot(df["Marks"],kde=False)

Bivariate distribution

1. jointplot

A jointplot displays the relationship between two numeric variables. It combines a scatterplot of the two variables with a histogram of each variable on the margins.

sns.jointplot(x="Marks",y="Study_hours",data=df)

The joint plot also draws a regression line if we pass kind="reg".

sns.jointplot(x="Marks",y="Study_hours",data=df,kind="reg")

Using hue as a parameter

sns.jointplot(x="Marks",y="Study_hours",data=df,hue="Grade")

2. pairplot

A pairplot shows the pairwise relationships in a dataset: the univariate distribution of every numeric variable together with all of their pairwise relationships. For n numeric variables, it produces an n × n grid. The diagonal plots are histograms and all the other plots are scatter plots.
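The layout of that grid can be spelled out with a short sketch. It only enumerates which kind of plot lands in which cell, as described above, and draws nothing. The column list assumes the three numeric columns this article mentions.

```python
from itertools import product

# Numeric columns mentioned in this article (assumed to be the only ones).
columns = ["Student_Id", "Marks", "Study_hours"]

# One cell per (row variable, column variable) pair: n * n cells in total.
grid = {}
for row_var, col_var in product(columns, repeat=2):
    # Diagonal cells pair a variable with itself, so they show its distribution.
    grid[(row_var, col_var)] = "histogram" if row_var == col_var else "scatter"

for cell, kind in grid.items():
    print(cell, kind)
```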

sns.pairplot(df)

Inference

A relationship between two variables shows up as a trend in their scatter plot. In this example, Marks vs Study_hours shows a linear relationship (positive correlation).

The Student_Id column shows no relationship with either “Marks” or “Study_hours”.

So the Student_Id column can be dropped from the dataset.

Using the hue parameter in pairplot

sns.pairplot(df,hue="Grade")


Visualizing associations among two or more quantitative variables

1. KDE plot

A KDE plot can also be used for a bivariate distribution.

Let’s see the joint distribution of “Study_hours” and “Marks”.

sns.kdeplot(x="Study_hours",y="Marks",data=df)

2. Scatter plot

The scatterplot shows the relationship between two numerical variables.

Inference

From a scatterplot, we can determine the correlation between the variables:

  • Positive Correlation: the two variables move in the same direction.
  • Negative Correlation: the two variables move in opposite directions.
  • Zero Correlation: there is no linear relationship between the two variables.
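These three cases can be checked numerically with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand in plain Python. All of the values are hypothetical and chosen only to produce one example of each case.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values, for illustration only.
study_hours = [1, 2, 3, 4, 5]
marks_up = [40, 55, 60, 75, 90]    # rises with study hours
marks_down = [90, 75, 60, 55, 40]  # falls as study hours rise
marks_flat = [60, 40, 80, 40, 60]  # no linear trend

r_positive = pearson_r(study_hours, marks_up)
r_negative = pearson_r(study_hours, marks_down)
r_zero = pearson_r(study_hours, marks_flat)
print(r_positive, r_negative, r_zero)
```

A value near 1 or -1 means the points lie close to a straight line; a value near 0 means no linear trend, although a non-linear pattern may still exist.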

Example 1: Let’s see the relationship between “Marks” and “Study_hours”

sns.scatterplot(x="Marks",y="Study_hours",data=df)

Inference:

The plot indicates a positive correlation: “Marks” increases as “Study_hours” increases.

Example 2: Let’s see the relationship between “Student_Id” and “Marks”.

sns.scatterplot(x="Marks",y="Student_Id",data=df)

Inference:

We can see that there is zero correlation between “Student_Id” and “Marks”. We can drop the column “Student_Id” from the dataset since it’s not related to the “Marks” variable.

Example 3: Using the hue parameter in a scatterplot.

In a scatterplot, we can add a third variable by passing it to the hue parameter. Its values are shown as different colors.

sns.scatterplot(x="Marks",y="Study_hours",data=df,hue="Grade")

3. Line plot

The relationship between the two variables can be shown by a line plot.

Example 1: Let’s see the relationship between “Marks” and “Study_hours”

sns.lineplot(x="Marks",y="Study_hours",data=df)

Inference

“Marks” increases when “Study_hours” increases.

Using the hue parameter in lineplot

sns.lineplot(x="Marks",y="Study_hours",data=df,hue="Grade",style="Grade")

4. lmplot

An lmplot is a scatterplot with a trend line. It is used to plot the regression line fitted to the data.

sns.lmplot(x="Marks",y="Study_hours",data=df)

Inference:

“Marks” and “Study_hours” have a linear relationship. The shaded band around the regression line is the confidence interval for the regression estimate (95% by default).
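By default the trend line in an lmplot is a straight line fitted by ordinary least squares. The sketch below shows that fit in plain Python for a handful of hypothetical points; it is an illustration of the technique rather than seaborn's own code, and the helper name fit_line is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values, with Marks on x and Study_hours on y as in the plot.
marks = [40, 50, 60, 70, 80]
study_hours = [1, 3, 2, 4, 5]

slope, intercept = fit_line(marks, study_hours)
print(slope, intercept)
```

A positive slope corresponds to the upward trend line seen in the plot.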

Using the hue parameter in lmplot

sns.lmplot(x="Marks",y="Study_hours",data=df,hue="Grade")

5. pointplot

sns.pointplot(x="Study_hours",y="Marks",data=df)

Inference:

“Marks” increases when “Study_hours” increases. Each point is the mean of “Marks” for that particular “Study_hours” value, and the vertical line is the confidence interval around that mean (95% by default).
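By default, each point in a pointplot is the mean of the y values that share one x value. The grouping step can be sketched in plain Python as follows. The rows are hypothetical, since the contents of Results.csv are not reproduced in this article.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (Study_hours, Marks) rows, for illustration only.
rows = [(2, 40), (2, 50), (4, 60), (4, 70), (6, 85), (6, 95)]

# Collect the marks recorded for each distinct number of study hours.
groups = defaultdict(list)
for hours, marks in rows:
    groups[hours].append(marks)

# One point per x value: the mean of the marks in that group.
mean_marks = {hours: mean(values) for hours, values in sorted(groups.items())}
print(mean_marks)
```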

sns.pointplot(x="Study_hours",y="Marks",data=df,hue="Grade")


Categorical Plots

Visualizing Amounts

In many scenarios, we may want to visualize the magnitude of some set of numbers, such as the total number of students in each class or the total number of employees working in different companies.

To visualize amounts, bar plots are used.

1. Bar plot

A bar plot represents an estimate of central tendency for a numeric variable (the mean, by default) with the height of each rectangle, and gives some indication of the uncertainty around that estimate using error bars.

Barplot will only show the mean value of a numerical variable for each level of the categorical variable.

If we want the distribution of values at each level of the categorical variable, we can use a boxplot or violin plot.

Example 1: Let’s do a barplot between “Grade” and “Marks” (categorical vs numeric variable)

sns.barplot(x="Grade",y="Marks",data=df)

Inference:

For each grade level, the mean value of “Marks” is shown.

Let’s calculate the mean value of Marks in grade A.

In grade level A, “Marks” are 95, 72, 80, 97, 75, 100, 90, 98.

Mean = (95+72+80+97+75+100+90+98)/8 = 707/8
Mean = 88.375

The error bar shows the confidence interval around this mean (95% by default, estimated by bootstrapping). It is not the full range of “Marks” in Grade A, which runs from 72 to 100.
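The bar height and its error bar can be reproduced approximately in plain Python. By default, seaborn's barplot draws a 95% confidence interval for the mean, estimated by bootstrapping. The sketch below is a simplified percentile bootstrap written for illustration, not seaborn's own routine, so its interval will not match the plot exactly. The helper name bootstrap_ci and the seed are assumptions.

```python
import random
from statistics import mean

# Grade A marks listed above.
grade_a_marks = [95, 72, 80, 97, 75, 100, 90, 98]

# Bar height: the mean of the group.
bar_height = mean(grade_a_marks)


def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    boot_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    tail = (100 - level) / 200  # 0.025 in each tail for a 95% interval
    low = boot_means[int(tail * n_boot)]
    high = boot_means[int((1 - tail) * n_boot) - 1]
    return low, high


ci_low, ci_high = bootstrap_ci(grade_a_marks)
print(bar_height, ci_low, ci_high)
```

The interval describes the uncertainty of the mean, so it is narrower than the 72 to 100 range of the individual marks.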

2. Countplot

Countplot shows the count of observations in each categorical bin using bars.

sns.countplot(x="Grade",data=df)

Inference:

The plot shows the number of students who got grade A, grade B, and grade C.
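The numbers behind a countplot are plain frequency counts, which can be sketched with collections.Counter. The rows below are hypothetical (grade, gender) pairs; the second counter mirrors what adding a hue variable does, namely counting each combination separately.

```python
from collections import Counter

# Hypothetical (Grade, Gender) rows, for illustration only.
rows = [
    ("A", "F"), ("A", "M"), ("B", "F"),
    ("A", "F"), ("C", "M"), ("B", "M"),
]

# One bar per grade: how many rows fall in each category.
grade_counts = Counter(grade for grade, _ in rows)

# With a hue variable, each (grade, gender) combination gets its own bar.
grade_gender_counts = Counter(rows)

print(grade_counts)
print(grade_gender_counts)
```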

Adding the hue parameter in countplot

sns.countplot(x="Grade",data=df,hue="Gender")

Inference:

The plot shows the number of male and female students in each grade.

3. Boxplot

A boxplot describes how the data is distributed. It represents the five-number summary (minimum, lower quartile, median, upper quartile, and maximum) and is also used to identify outliers.

  • whiskers: denote the spread of the data
  • The upper whisker extends to the largest value that is no greater than the third quartile (Q3) plus 1.5 times the interquartile range (IQR); the lower whisker extends to the smallest value that is no less than the first quartile (Q1) minus 1.5 times the IQR
  • The data points above the upper whisker or below the lower whisker are flagged as outliers
  • box: represents the IQR; the middle 50% of the data lies within this range
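The whisker and outlier rules above can be worked through in plain Python. The sample below is the Grade A marks quoted in the bar plot section plus one hypothetical low score of 5, added only so that an outlier appears. The quartile method ("inclusive", which interpolates linearly between data points) is an assumption; plotting libraries may use a slightly different convention.

```python
from statistics import quantiles

# Grade A marks from the article plus one hypothetical low score (5).
marks = sorted([95, 72, 80, 97, 75, 100, 90, 98, 5])

# Quartiles: Q1, median, Q3.
q1, median, q3 = quantiles(marks, n=4, method="inclusive")
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
lower_whisker = min(v for v in marks if v >= lower_fence)
upper_whisker = max(v for v in marks if v <= upper_fence)

# Anything beyond the fences is drawn as an individual outlier point.
outliers = [v for v in marks if v < lower_fence or v > upper_fence]

print(q1, median, q3, iqr)
print(lower_whisker, upper_whisker, outliers)
```

Note that the whiskers end at actual data points, not at the fences themselves.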

sns.boxplot(x="Gender",y="Marks",data=df)

4. Violinplot

A violinplot shows the distribution of the data as both a kernel density estimate and a box plot.

sns.violinplot(x="Grade",y="Marks",data=df)

Inference

The white dot in the middle is the median value and the thick black bar in the center represents the interquartile range. The thin black line extending from it reaches the lowest and highest values that lie within 1.5 times the IQR of the quartiles, which are the min and max of the data when there are no outliers.

The density plot is rotated and mirrored on each side to show the distribution of data.

5. Stripplot

A Stripplot is a one-dimensional scatterplot of the given data where one variable is categorical. This is usually used when the sample size is small.

sns.stripplot(x="Grade",y="Marks",data=df)

6. Swarmplot

Swarmplot is similar to stripplot, but the points are adjusted along the categorical axis so that they do not overlap. Because no point hides another, it gives a clearer picture of how the values are distributed than stripplot does.

sns.swarmplot(x="Grade",y="Marks",data=df)


Heatmap

A heatmap is a two-dimensional graphical representation of data in which the individual values of a matrix are represented using colors.
Let’s see how to check correlation using a heatmap.

Correlation is a statistical measure of how strongly two variables are related.

# Correlation is only defined for numeric columns, so select those first.
sns.heatmap(df.select_dtypes(include="number").corr(),annot=True,vmin=-1,vmax=1)

Inference:

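If the correlation is 1 or close to 1, the two variables have a strong positive correlation. A value close to -1 means a strong negative correlation, and a value close to 0 means little or no linear relationship. In this dataset, “Marks” and “Study_hours” have a strong correlation.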


Key Takeaways

  1. Stripplot and Swarmplot are categorical scatterplots.
  2. Box plot and violin plot are categorical distribution plots.
  3. Barplot and Countplot are categorical estimate plots.
  4. Histogram, kdeplot, and distplot are univariate distribution plots.
  5. jointplot is a bivariate distribution plot.

Hue Parameter

The relationship between x and y can be shown for different subsets of the data using the hue, size, and style parameters.

If the hue parameter is given a numeric variable, its values are represented with a sequential colormap by default.

If the hue parameter is given a categorical variable, each category is represented by a different color.


This covers some of the data visualizations using seaborn.

I hope that you have found this article helpful. Thanks for reading!
