Data Cleaning — How to Handle Missing Values in Pandas

Drop Values, Fill Values, Replace Values

Photo by Gabby K from Pexels

Handling Missing Values in Pandas

Data Cleaning is one of the important steps in EDA. Data cleaning can be done in many ways. One of them is handling missing values.

Let’s learn about how to handle missing values in a dataset.

Table of Content

  1. Identify Missing Values
  2. Replace Missing Values
  3. Fill missing values
  4. Drop missing values

Identify Missing Values

Different types of missing values:

  • Standard Missing Values
  • Non-standard Missing Values
  • Unexpected Missing Values

Standard Missing Values

Blank cells, NaN, n/a → These will be treated by default as null values in Pandas.

Standard missing values only can be detected by pandas.

Example: I have created a simple dataset having different types of null values

 

student.csv(Image by Author)

Let’s import the dataset

df=pd.read_csv(“student.csv”)
df.head(10)

2. df.isnull().sum() → It will return the count of null values in each column.It will count only the standard null values.

df.isnull().sum()

Only 3 standard missing values are counted. [Blank cells, NaN, n/a]

3. df[“Math”].isnull() → It will return a boolean array. True -means null values.

df["Math"].isnull()

Non-Standard Missing Values

Sometimes missing values will be entered like .., __, — , missing, na, @,??,***, etc.

 

Different types of missing values — highlighted.

Now let’s check all the unique values in the “Math” and “English” columns. By checking this, we can identify all the non-standard missing values in the dataset.

df[“Math”].unique()
df[“English”].unique()

It will return all the unique values in that particular column. We can identify what are the other values, which are to be treated as “missing values”.

From the two columns unique values, we can identify some missing values like na and ??

These are non-standard missing values.

Unexpected Missing values

Unexpected missing values are identified based on the context of the dataset. Here marks range from 0 to 100 only. So 999999 and X also identified as missing values.

Replace Missing Values

  1. df.replace(old_value, new_value) old_value will be replaced by new_value
missing_values=['??','na','X','999999']
df=df.replace(missing_values,np.NaN)
df

Now all the missing values are replaced with NaN. Now it’s easy to treat all the null values. We can either drop all null values or fill those by mean/median.

Fill Missing Values

Different ways to fill the missing values

  1. Mean/Median, Mode
  2. bfill,ffill
  3. interpolate
  4. replace

1. Mean/Median, Mode

  • Numerical Data →Mean/Median
  • Categorical Data →Mode

In columns having numerical data, we can fill the missing values by mean/median.

Mean — When the data has no outliers. Mean is the average value. Mean will be affected by outliers.

[Example. If we are calculating, mean salary of the employees in a room and if the company CEO walks in, the mean will tend to be higher. It can’t be the representative amount. In that scenario, we can choose median]

Median — When the data has more outliers, it’s best to replace them with the median value. Median is the middle value (50%)

In columns having categorical data, we can fill the missing values by mode

Mode — Most common value.

Let’s fill the missing values by mean

  1. First, we have to check the data type of the column.
df.dtypes

“Math” and “English” are in “object” datatype. We have to convert to “float” datatype to calculate mean.

2. Changing the datatype of the column and calculating mean

df["Math"]=df["Math"].astype("float64")
m=round(df["Math"].mean(),2)
m

Output: 75.83

3. Now filling the missing values in the “Math” column by mean -75.83

df["Math"].fillna(m,inplace=True)
df

Categorical Data

If we want to replace missing values in categorical data, we can replace them with mode(most common value)

 

employee.csv
  1. Let’s replace the null value in the “Role” column
df=pd.read_csv("employee.csv")
m=df["Role"].mode()
m
#Output
0    Developer
dtype: object

2. It returns a series object, containing the most common value. We have to select the particular value in the series.

m=df["Role"].mode()[0]
m
#Output:
'Developer'

3. Fill in the value

df["Role"].fillna(m,inplace=True)
df

 

The null value is replaced with “Developer” in the “Role” column

2. bfill,ffill

bfill — backward fill — It will propagate the first observed non-null value backward.
ffill — forward fill — it propagates the last observed non-null value forward.

If we have temperature recorded for consecutive days in our dataset, we can fill the missing values by bfill or ffill. It will be more accurate when compared to the mean/median for this type of dataset.

 

weather.csv
  1. bfill
df1=pd.read_csv("weather.csv")
df1.bfill(inplace=True)
df1

 

bfill

2. ffill

df1.ffill(inplace=True)
df1

 

ffill

3. Interpolate

interpolate() is one of the methods to fill the missing values.

 

weather2.csv

We have missing values for three consecutive days. Instead of filling all three rows with the same value, we can use interpolate method.

bfill/ffill will fill the same value for consequent missing values.

df2=pd.read_csv("weather1.csv")
df2=df2.interpolate()
df2

 

interpolate vs bfill [Image by Author]

Let’s visualize the dataset after filling consecutive missing values with interpolate method vs bfill

import seaborn as sns
sns.pointplot(x="Day",y="Temp",data=df3,color='darkorange')

From the plot, we could see how the missing values are filled by interpolate method [ by default linear method is used]

4. replace

Sometimes we can replace the specific missing values by using replace method.

Like we can get data from an external source and replace it.

Drop Missing Values

  1. Removing all the null values in the dataset
df.dropna()

2. Removing the null values in a specific column

df.dropna(subset=["Math"])

Conclusion

Data cleaning involves many steps. I have covered only how to handle missing values in the dataset. There are many more like fixing invalid values, splitting columns, merging columns, filtering subset, standardizing data, scaling data.

Thanks for reading!


Watch this space for more articles on Python and DataScience. If you like to read more of my tutorials, follow me on Medium, LinkedIn, Twitter.

One-Time
Monthly
Yearly

Make a one-time donation

Make a monthly donation

Make a yearly donation

Choose an amount

$5.00
$15.00
$100.00
$5.00
$15.00
$100.00
$5.00
$15.00
$100.00

Or enter a custom amount

$

Your contribution is appreciated.

Your contribution is appreciated.

Your contribution is appreciated.

Buy Me a CoffeeBuy Me a CoffeeBuy Me a Coffee

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s