Assumptions Of Linear Regression – How to Validate and Fix

Introduction

I will assume that you have a fair understanding of Linear Regression. If not, I have written a simple, easy-to-follow post with a Python example here. Read it before continuing.

Linear Regression makes certain assumptions about the data and produces predictions based on them. Naturally, if we don’t take care of those assumptions, Linear Regression will penalise us with a bad model (you can’t really blame it!).
We will take a dataset, work through all the assumptions, check the metrics, and compare them with the metrics from the case where we hadn’t worked on the assumptions.
So, without further ado, let’s jump right into it.

Linear Regression without Assumption Validation

Let’s take an example of the famous Advertisement dataset.

Advertising Dataset

Let’s perform Linear Regression on this dataset without validating the assumptions.

# Imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Creating a DataFrame out of Advertising.csv
df = pd.read_csv("Advertising.csv")
df.drop("Unnamed: 0", axis=1, inplace=True)

# Separating independent and dependent variables
X = df.drop(['sales'], axis=1)
Y = df.sales

# Fit Linear Regression
lr = LinearRegression()
model = lr.fit(X, Y)
y_pred1 = model.predict(X)
print("R-squared: {0}".format(metrics.r2_score(Y, y_pred1)))

Output: R-squared: 0.8972106381789522

Let’s plot the Residuals vs Fitted Values to see if there is any pattern.

plt.scatter(y_pred1, (Y - y_pred1))
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

We can see a pattern in the Residuals vs Fitted values plot, which means the model has not captured the non-linearity of the data well.

Now let’s work on the assumptions and see whether the R-squared value and the Residuals vs Fitted values plot improve.


Assumption 1:

The Dependent variable and Independent variable must have a linear relationship.

How to Check?

A simple pairplot of the dataframe can help us see whether the independent variables exhibit a linear relationship with the dependent variable.

How to Fix?

To fix non-linearity, one can apply a log transformation of the independent variable, log(X), or another non-linear transformation such as √X or X².
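As a quick illustration (on made-up numbers, not the Advertising data), the transformations mentioned above look like this in NumPy:

```python
import numpy as np

# Hypothetical skewed predictor values (not from the Advertising dataset)
x_demo = np.array([1.0, 10.0, 100.0, 1000.0])

x_log = np.log(x_demo)    # log(X): strongly compresses large values
x_sqrt = np.sqrt(x_demo)  # sqrt(X): milder compression than log
x_sq = x_demo ** 2        # X^2: useful when the curve bends upward

print(x_log, x_sqrt, x_sq)
```

If the variable can contain zeros, np.log1p is a safer choice than np.log.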

Let’s plot a pair plot to check the relationship between Independent and dependent variables.

import seaborn as sns

sns.pairplot(df)

We can clearly see that Radio has a somewhat linear relationship with sales, but not newspaper and TV.

An equation of first order will not capture the non-linearity completely, which would result in a sub-par model. To add squared and interaction terms and fit the model, we will use Linear Regression with Polynomial Features.

import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

# Add squared and interaction terms (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Fit OLS on the expanded feature matrix
X_poly = sm.add_constant(X_poly)
results = sm.OLS(Y, X_poly).fit()
y_pred2 = results.predict(X_poly)

print(results.summary())

Output: R-squared: 0.987 and Durbin-Watson: 2.136

Assumption 2:

No Autocorrelation in residuals.

How to Check?

Use the Durbin-Watson test:
  • DW = 2: no autocorrelation (the ideal case)
  • 0 < DW < 2: positive autocorrelation
  • 2 < DW < 4: negative autocorrelation
statsmodels’ linear regression summary gives us the DW value among other useful insights.
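If you only need the DW statistic itself, statsmodels also exposes a durbin_watson helper; here is a small sketch on synthetic residuals (not the residuals from this post):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# White-noise residuals have no autocorrelation, so DW should land near 2
rng = np.random.default_rng(0)
resid = rng.normal(size=500)

dw = durbin_watson(resid)
print(round(dw, 2))
```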

How to Fix?

  • Add a lagged copy of the variable as an extra column.
  • Center the variable (subtract the column’s mean from every value).
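Both fixes are one-liners in pandas; here is a minimal sketch on a hypothetical column x (not part of the Advertising dataset):

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [10.0, 12.0, 11.0, 13.0, 14.0]})

# Lagged copy of the variable (the first row becomes NaN)
df_demo["x_lag1"] = df_demo["x"].shift(1)

# Centering: subtract the column mean from every value
df_demo["x_centered"] = df_demo["x"] - df_demo["x"].mean()

print(df_demo)
```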

As we can see, Durbin-Watson ≈ 2 (taken from the results.summary() output above), which is very close to the ideal case. So we don’t have to do anything here.


Assumption 3:

No Heteroskedasticity.

How to Check?

A Residuals vs Fitted values plot can tell whether heteroskedasticity is present. If the plot shows a funnel-shaped pattern, heteroskedasticity is present.

Residuals are simply the differences between the actual and fitted values.

How to fix?

We could apply a non-linear transformation of the dependent variable, such as log(Y) or √Y. Alternatively, the weighted least squares method can be used to tackle heteroskedasticity.

plt.subplots(figsize=(10,5))
plt.subplot(1,2,1)
plt.title("Before")
plt.scatter(y_pred1, (Y - y_pred1))
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

plt.subplot(1,2,2)
plt.title("After")
plt.scatter(y_pred2, (Y - y_pred2))
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

Here, we have plots of Residuals vs Fitted values both Before and After working on the assumptions. We don’t see a funnel-like pattern in either the Before or the After plot, so there is no heteroskedasticity.


Assumption 4:

No Perfect Multicollinearity.

I have written a post regarding multicollinearity and how to fix it. Please read it if you’re not familiar with multicollinearity.

How to Check?

With only a few variables one could use a heatmap, but that isn’t feasible with a large number of columns.
Another common check is to calculate VIF (Variance Inflation Factor) values:
  • VIF = 1: very little multicollinearity
  • VIF < 5: moderate multicollinearity
  • VIF > 5: extreme multicollinearity (this is what we have to avoid)

How to fix?

Variables with high multicollinearity can be removed altogether, or if you can identify which two or more variables are highly correlated with each other, you can simply merge them into one. Make sure that VIF < 5 afterwards.

import pandas as pd
import statsmodels.api as sm

# Function to calculate VIF for each column
def calculate_vif(data):
    vif_df = pd.DataFrame(columns=['Var', 'Vif'])
    x_var_names = data.columns
    for i in range(0, x_var_names.shape[0]):
        y = data[x_var_names[i]]
        # Regress each variable on all the others (with an intercept)
        x = sm.add_constant(data[x_var_names.drop([x_var_names[i]])])
        r_squared = sm.OLS(y, x).fit().rsquared
        vif = round(1 / (1 - r_squared), 2)
        vif_df.loc[i] = [x_var_names[i], vif]
    return vif_df.sort_values(by='Vif', ascending=False)

X=df.drop(['sales'],axis=1)
calculate_vif(X)
VIF<5 for all Independent variables

Great! All VIFs are < 5. If you want to know what to do in case of higher VIF values, check this out.


Assumption 5:

Residuals must be normally distributed.

How to Check?

Use Distribution plot on the residuals and see if it is normally distributed.

How to Fix?

If the Residuals are not normally distributed, non–linear transformation of the dependent or independent variables can be tried.

from scipy.stats import norm

plt.subplots(figsize=(8,4))

plt.subplot(1,2,1)
plt.title("Before")
sns.distplot(Y - y_pred1, fit=norm)
plt.xlabel('Residuals')

plt.subplot(1,2,2)
plt.title("After")
sns.distplot(Y - y_pred2, fit=norm)
plt.xlabel('Residuals')
Before vs After Residual Distributions

The black line in the graph shows what a normal distribution should look like and the blue line shows the current distribution.

‘Before’ section shows a slight shift in the distribution from normal distribution, whereas ‘After’ section is almost aligned with normal distribution.

Another way to check the same thing is a Q-Q (Quantile-Quantile) plot.

from scipy import stats

plt.subplots(figsize=(8,4))

plt.subplot(1,2,1)
stats.probplot(Y - y_pred1, plot=plt)

plt.subplot(1,2,2)
stats.probplot(Y - y_pred2, plot=plt)
plt.show()

In the ‘Before’ plot, the residual quantiles don’t follow the straight line as closely as they should, which means the distribution isn’t quite normal.
After working on the assumptions, the residual quantiles follow the straight line closely, meaning the distribution is approximately normal.
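If you’d rather have a number than a plot, the Shapiro-Wilk test in SciPy returns a p-value for normality; here is a sketch on synthetic residuals (not this post’s residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=200)       # should look normal
skewed_resid = rng.exponential(size=200)  # clearly non-normal

# Shapiro-Wilk: a small p-value is evidence against normality
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)
print(f"normal p={p_normal:.3g}, skewed p={p_skewed:.3g}")
```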


That marks the end of Assumption validation. Now let’s compare metrics of both the models.

Comparison

Let’s compare the two models and see if there is any improvement.

Before

R-squared: 0.8972

plt.title("Before")
plt.plot(Y,Y, color="red")
plt.scatter(y_pred1, Y)
plt.xlabel("Fitted values")
plt.ylabel("Actuals")

After

R-squared: 0.987

plt.title("After")
plt.plot(Y,Y, color="red")
plt.scatter(y_pred2, Y)
plt.xlabel("Fitted values")
plt.ylabel("Actuals")

The R-squared value has improved, and the plots above show the Actual vs Fitted values before and after assumption validation.
With an R-squared of 0.987, the model now explains about 98.7% of the variance in sales, which means it has captured and learned from the non-linearity of the dataset.

Conclusion

We have now made sure that all the assumptions of Linear Regression are taken care of, and we can see that doing so clearly pays off in the metrics.

So, if your Linear Regression model is giving sub-par results, make sure these assumptions hold; if you fix your data to satisfy them, your model will very likely improve.


That’s it for this post!

Here’s my GitHub for Jupyter Notebooks on Linear Regression. Look for the notebook used for this post: media-sales-linear-regression-verify-assumptions.ipynb.
Please feel free to check it out and suggest more ways to improve the metrics in the responses.

Thank you for reading!

Please check out my posts at Medium and follow me.
