#### Assumptions of linear regression, residuals, Q-Q plot

### Residual Analysis in Linear Regression

Assumptions in Linear regression are about residuals. Let’s learn about residuals and assumptions in linear regression about residuals.

### Residuals:

Residuals in Linear Regression are the difference between the actual value and the predicted value.

How is the predicted value calculated?

**ε → Residuals or Error term.**

### Assumptions in Linear Regression are about residuals:

- Residuals should be independent of each other.
- Residuals should have constant variance.
- The expected value or mean of the residuals should be zero.
**E[ε]=0** - Residuals should follow a normal distribution

### Residual Plots

- Residual vs target variable
- Residual vs predicted variable
- Distplot of residuals

### Scenario 1: All assumptions are satisfied

Example: I have taken a simple dataset.

x- independent variable, y-target variable.

Before building a linear regression model, let’s check scatterplot,regplot, and heatmap.

df=pd.read_csv("ex1.csv")

**Scatterplot**

sns.scatterplot(df['x'],df['y'],color='darkorange')

**regplot**

sns.regplot(df['x'],df['y'],color='darkorange')

`df.corr()`

**Building model and calculating residuals**

importstatsmodels.apiassm

X_train_sm = sm.add_constant(X)

fit1 = sm.OLS(y, X_train_sm).fit()

#Calculating y_predict and residuals

y_predict=fit1.predict(x_train_sm)

residual=fit1.resid

#### Assumption 1: Residuals are independent of each other.

Assumption 2: Residuals should have a constant variance.

To check these assumptions, we have to plot residual plot [y_predict vs residuals’]

sns.residplot(y_predict,residual,color='darkorange')

plt.xlabel("y_predict")

plt.ylabel("Residuals")

plt.title("Residual Plot")

From the above residual plot, we could infer that the residuals didn’t form any pattern. So, the residuals are independent of each other.

And also, the residuals have constant variance. Variance doesn’t seem to increase/decrease constantly with the y_predict value.

#### Assumption 3: Residuals are normally distributed

To check whether the residuals are normally distributed from Q-Q plot,distplot

**Q-Q plot → quantile-quantile plot**

If the residuals are normally distributed, then the Q-Q plot of residuals will be a straight line

fromscipyimportstatsimportstatsmodels.apiassm

residual=fit1.resid

probplot=sm.ProbPlot(residual,stats.norm,fit=True)

fig=probplot.qqplot(line='45')

plt.title('qqplot')

**2. distplot**

sns.distplot(residual,color='darkorange')

### Scenario 2: Residuals are not independent of each other

I have created a small data set that contains x and y and **y = x²** with some noise added to it.

y →target variable

x →independent variable

**Scatterplot**

df1=pd.read_csv("ex2.csv")

sns.scatterplot(df1['x'],df1['y'],color='darkorange')

**Regplot**

sns.regplot(df1['x'],df1['y'],color='darkorange')

**Correlation value**

**Building Model and calculating y_predict and residuals**

X1 = np.array(df1['x']).reshape(-1,1)# predictor variabley1 = np.array(df1['y']).reshape(-1,1)# response variableimportstatsmodels.apiassm

x1_train_sm = sm.add_constant(X1)

fit2 = sm.OLS(y1, x1_train_sm).fit()

y1_predict=fit2.predict(x1_train_sm)

residual_1=fit2.resid

**Residual plot**

sns.residplot(y1_predict,residual_1,color='darkorange')

plt.xlabel("y1_predict")

plt.ylabel("Residuals")

plt.title("Residual Plot")

From the residual plot, we could see that the residuals follow a pattern. They are dependent on each other. Non-linearity is present in the data.

Since the residuals are dependent on each other, we can now build a slightly different model.

*Let’s build a degree 2 polynomial model*

fromsklearn.preprocessingimportPolynomialFeatures

polynomial_features= PolynomialFeatures(degree=2)

xp = polynomial_features.fit_transform(X1)

xp_train=sm.add_constant(xp)

fit_p= sm.OLS(y,xp_train).fit()

yp_predict=fit_p.predict(xp_train)

residual_p=fit_p.resid

**Let’s check the residual plot for the new model**

sns.residplot(yp_predict,residual_p,color='darkorange')

plt.xlabel("yp_predict")

plt.ylabel("Residuals")

plt.title("Residual Plot")

Now, the residuals are independent of each other.

### Scenario 3: Residuals doesn’t have constant variance

If the residuals don’t have constant variance, we can try transforming independent variables (log-transformation,box-cox transformation)

If the residuals don’t have constant variance, we can infer it from the residual plot. If we get a residual plot like the one mentioned below, it means that residuals don’t have constant variance. Here, the residuals spread is not constant.

### Conclusion:

In this article, I have covered residuals, the assumptions of residuals in linear regression, and plots to check the assumptions of residuals.

Thanks for reading!

*If you like to read more of my tutorials on Python and Data Science,follow me on *

*Medium*

*,**Twitter*

*https://indhumathychelliah.medium.com/membership*

#### Make a one-time donation

#### Make a monthly donation

#### Make a yearly donation

Choose an amount

Or enter a custom amount

Your contribution is appreciated.

Your contribution is appreciated.

Your contribution is appreciated.

Buy Me a CoffeeBuy Me a CoffeeBuy Me a Coffee