Descriptive Statistics in Python
Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution.
- Measure of central tendency
- Measure of spread/dispersion
- Measure of symmetry [ will save this for the future post]
Imported all the libraries needed for statistical plots and created a dataframe from the dataset given in
This dataset contains Height, Weight, Age, BMI, and Gender columns. Let’s calculate descriptive statistics for this dataset.
The code used in this project is available as a Jupyter Notebook on GitHub.
import numpy as np import pandas as pd import matplotlib.pyplot as plt % matplotlib inline
Measure of Central Tendency
Measure of central tendency is used to describe the middle/center value of the data.
Mean, Median, Mode are measures of central tendency.
- Mean is the
average valueof the dataset.
- Mean is calculated by adding all values in the dataset divided by the number of values in the dataset.
- We can calculate the mean for only numerical variables
Formula to calculate mean
- Calculating the mean of the “Age” column in our dataset.
- Mathematical Calculation
- Calculating Mean for a particular variable (“Age”) using Python.
- Calculating the mean for all the columns in the dataframe.
Calculated mean only for numerical datatype. “Gender” column in the dataset is excluded.
- The Median is the
middle numberin the dataset.
- Median is the best measure when we have outliers.
Find the median of the “Age” column in our dataset.
If we have an even number of data, find the average of the middle two items.
Age → 4,12,24,8,16,20
Sort → 4,8,12,16,20,24
Pick the middles ones → 12,16
Find the average → 28/2 =14
Calculating the median for a particular variable (“Age”) using Python.
Calculating the median for all the columns in the dataframe.
Calculated median only for the numerical datatype. The “Gender” column in the dataset is excluded.
The mode is used to find the common number in the dataset.
Calculating mode for a particular variable (“Age”) using a mathematical calculation
Calculating the mode for a particular variable (“Age”) using Python.
Calculating the mode for all the columns in the dataframe.
bmi column, all numbers are unique. So, all numbers are displayed.
Weight column,70.0, and 80.0 are repeated more times, so both are displayed.
Measure of spread
- The measure of spread/dispersion is used to describe how data is spread. It also describes the variability of the dataset.
- Standard Deviation, Variance, Range, IQR, are used to describe the measure of spread/dispersion
- The measure of spread can be shown in graphs like boxplot.
Variance is used to describe how far each number in the dataset is from the mean.
Formula to calculate population variance
σ2 — Population Variance
μ -Population Mean
N -Total number of data in the dataset.
Calculating variance for “Age” column in the dataset
Calculating variance for all columns in the dataframe
Calculated variance only for numeric data types. The “Gender” column is excluded.
- Standard Deviation is the measure of the spread of data from the mean.
- Standard deviation is the square root of variance.
- More the standard deviation, more the spread.
Calculating the standard deviation for
“Age” column in the dataset
Calculating the standard deviation for all columns in the dataset.
- The range is the difference between the largest number and the smallest number
- Larger the range, the more the dispersion.
Calculating the range of “Age” column in the dataset
m1=df["Age"].max() m1 #Output:32 m2=df["Age"].min() m2 #Output:25 range=m1-m2 range #Output: 7
4. Interquartile range (IQR)
- Quartiles describe the spread of data by breaking into quarters. The median exactly divides the data into two parts.
- Q1(Lower quartile) is the middle value in the first half of the sorted dataset.
- Q2– is the median value
- Q3 (Upper quartile) is the middle value in the second half of the sorted dataset
- The interquartile range is the difference between the 75th percentile(Q3) and the 25th percentile(Q1).
- 50% of data fall within this range.
Calculating IQR for
“Age” column in the dataset.
Q1=df["Age"].quantile(0.25) Q1 #Output : 27.0 Q3=df["Age"].quantile(0.75) Q3 #Output: 30.0 IQR=Q3-Q1 IQR #Output: 3.0
describe() function generates descriptive statistics.It is used to view some basic statistical details like mean, median, min, max, percentiles, count of a dataframe, or series of numeric values.
25% → Q1-Lower quartile
50% → Median
75% → Q3-Upper quartile
All columns of the input will be included in the output.
The five-point summary consists of five values
- Minimum value
- Q1 -Lower quartile
- Q3-Upper quartile
- Maximum value
Statistical plots are used to identify outliers, visualize distributions, discover relationships, and the correlation between variables in a dataset.
Boxplot is used to describe how the data is distributed in the dataset. This graph represents
five-point summary(minimum, maximum, median, lower quartile, and upper quartile). This graph is used to identify outliers.
- whiskers — denote the spread of data
- box — represents the IQR- 50% of data lies within this range
Explaining the boxplot
In the “Age” category, no outliers are there.
Boxplot grouped by gender category
Here we have an outlier in the “Male” category.
Let’s calculate a five-point summary for
Calculating Mean of “Age” of Male alone.
df1=df.set_index("Gender") df2=df1.loc["Male","Age"] df2.mean() #Output: 28.0
Calculating Q1, Q3, and IQR
q1=df2.quantile(0.25) print ("Q1 :",q1) q3=df2.quantile(0.75) print ("Q3 :",q3) IQR=q3-q1 print ("IQR :",IQR
Q1 : 27.0 Q3 : 28.75 IQR : 1.75
Calculate the length of the upper whisker
The length of the upper whisker is the largest value that is no greater than the
third quartile(Q3) plus 1.5 times the
whisker=(IQR*1.5)+Q3 print("Length of Upper Whisker :",whisker) #Output:Length of Upper Whisker : 31.375
The length of the upper whisker should not be greater than 31.375
Age of Male → 25,27,28,29,32,27
32 falls above the upper whisker range. So, it’s an
So the length of the upper whisker is taken as 29(second maximum) which falls under the upper whisker range.
Boxplot using seaborn
Boxplot grouped by gender category
Visualizing distributions of data
Plotting univariate distributions
- This plot will show the distribution of univariate(single variable).
- A histogram is a bar plot where the axis representing the data variable is divided into a set of discrete bins and the count of observations falling within each bin is shown using the height of the corresponding bar.
- Histogram → plotting variable vs their count/frequencies in each bin.
Different ways to plot a histogram
- Using pandas
Histogram for all columns in the dataframe
Histogram for a particular column “Age”
In the above graph, only age 27 appears twice, so the count of age 27 is shown as 2. Remaining all ages appear once only. So all other age count is shown as 1.
Choosing the bin size
2. Using matplotlib
Plotting Multivariate distribution
Pairplot or Scatterplot
Pairplot is used to describe pairwise relationships in a dataset. Pairplot is used to visualize the univariate distribution of all variables in a dataset along with all of their pairwise relationships.
The diagonal plots are histograms and all the other plots are scatter plots.