In statistical hypothesis testing, the p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct. (Wikipedia)
OK, let’s break that down. Some of you may be wondering what hypothesis testing and the null hypothesis in Wikipedia’s definition above actually mean. If you already know, feel free to jump to the good stuff right away!
A hypothesis is simply an assumption that has not been tested yet, and hypothesis testing is checking whether the data support that assumption or not.
Take school scores as an example.
Your teacher says that students score an average of 70% or more, and you want to show that this is total rubbish and that the real average is much lower than that.
As a general rule, we set the null hypothesis (H0) to be the opposite of what we want to show, and the alternative hypothesis (Ha) to be what we want to show.
In our case,
H0: Students score an average of 70% or more
Ha: Students score an average of less than 70%.
At first, H0 is assumed to be true, just like how an accused in a court trial is innocent until proven guilty. It is the prosecutor’s job to prove that the accused is guilty. Right now H0 is on trial, and we have to provide evidence to reject the null hypothesis H0. But what if we don’t find any evidence to support our claim? In that case, we say that we have ‘failed to reject the null hypothesis’. We don’t say that H0 is true just because we have not found suitable evidence. It could be because we haven’t looked in the right place!
So, how do we find that evidence?
Let’s say we have the school scores of the past couple of years (we call this population data, or simply the population). From the population, we take samples of data and look for evidence that the students score an average of less than 70%.
Let’s go deep down into how exactly we do it.
Assume that we take 1000 samples at random (so that nobody can accuse us of foul play!). Then we take the average of each of those 1000 samples and plot the averages.
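The sampling step can be sketched in a few lines of Python. This is only an illustration under assumptions: the population below is made up (exponentially distributed, so clearly not bell-shaped), and the sample size of 50 is an arbitrary choice that the post does not specify.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical, deliberately non-normal population (not the real school scores).
population = [random.expovariate(1 / 70) for _ in range(100_000)]
population_mean = statistics.fmean(population)


def sample_means(data, n_samples=1000, sample_size=50):
    """Draw n_samples random samples and return the mean of each one."""
    return [
        statistics.fmean(random.sample(data, sample_size))
        for _ in range(n_samples)
    ]


means = sample_means(population)
mean_of_means = statistics.fmean(means)
spread_of_means = statistics.stdev(means)

print(f"population mean:      {population_mean:.2f}")
print(f"mean of sample means: {mean_of_means:.2f}")
print(f"std of sample means:  {spread_of_means:.2f}")
```

Plotting a histogram of `means` is the picture the next paragraphs talk about.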
Now, the interesting part is that even if the population is not normally distributed, the means of the samples will be approximately normally distributed, with a mean very close to the population mean, as long as each sample is reasonably large (for more info, read about the Central Limit Theorem).
Here μ refers to the mean and σ refers to the standard deviation. The range from μ-2σ to μ+2σ covers about 95% of a normal curve. This holds for every normal distribution (check out the 68-95-99.7 rule).
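The 68-95-99.7 rule is easy to check numerically. The small sketch below uses `NormalDist` from Python's standard library; the helper name `coverage` is just an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

The values come out close to 0.68, 0.95 and 0.997, which is where the rule gets its name. Note that ±2σ is a rounded figure: it covers slightly more than 95%.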
Now that we have plotted all the sample means, what next?
Now we need evidence to reject our null hypothesis. Enter the p-value.
The p-value is a probability computed under the assumption that the null hypothesis is true (remember, innocent until proven guilty). It tells us how likely it is to get a sample average at least as far below 70 as the one we actually observed, just by random chance.
So, if this value is high, the observed result could easily be down to random chance, and we ‘fail to reject the null hypothesis’. But if this value is low, it is highly unlikely that such a low observed value came about just by random chance, and we reject the null hypothesis.
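But this p-value is quite elusive. To find the p-value we must first find the z-value.
The z-value tells us how many standard deviations away from the mean the observed value is:
z = (x - μ) / σ
where x is the observed value, μ is the mean and σ is the standard deviation. (When x is a sample average, σ here is the standard deviation of the sample averages, also known as the standard error.)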
After calculating the z-value, we look up the p-value associated with it in a Z-table.
Let’s say the p-value came out as 0.15, or even 0.05. What do we consider as a threshold? Even before the experiment starts, we need to set a significance level α. Generally, α = 0.05 is the significance level used in most business scenarios. Use this if it’s not specified in the problem statement.
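Instead of a printed Z-table, the lookup can be done in code. The sketch below uses made-up numbers (an observed average of 68, a mean of 70 under H0 and a standard error of 1) purely for illustration, and it computes the left-tail probability because our alternative hypothesis is "less than 70".

```python
from statistics import NormalDist


def z_value(x, mu, sigma):
    """Number of standard deviations that x lies away from mu."""
    return (x - mu) / sigma


def left_tail_p_value(z):
    """P(Z <= z) for a standard normal Z: the Z-table lookup for a 'less than' test."""
    return NormalDist().cdf(z)


# Hypothetical numbers, not taken from any real data set.
z = z_value(x=68.0, mu=70.0, sigma=1.0)
p = left_tail_p_value(z)

print(f"z = {z:.2f}, p-value = {p:.4f}")
```

An observed value two standard deviations below the mean gives a p-value of roughly 0.023.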
If the p-value comes out as, let’s say, 0.20 (> α), it means that if the null hypothesis were true, there would still be a 20% chance of seeing a sample average this low just by random chance. The evidence is therefore not significant, and we fail to reject the null hypothesis.
And if the p-value comes out as 0.03 (< α), it means that if the null hypothesis were true, there would be only a 3% chance of seeing a sample average this low just by random chance, so the claim that students score less than 70% has some truth to it.
In this case, we reject the null hypothesis in favour of the alternative hypothesis that students score an average of less than 70%. Strictly speaking, this is strong evidence for the alternative rather than proof of it.
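Putting the pieces together, the whole test might look like the sketch below. The function name `left_tailed_z_test` and the scores are hypothetical. It also assumes the sample is large enough that the sample standard deviation can stand in for the population one; for small samples a t-test would be the more appropriate tool.

```python
import math
import statistics
from statistics import NormalDist


def left_tailed_z_test(sample, mu0, alpha=0.05):
    """Test H0: mean >= mu0 against Ha: mean < mu0 using a z-test."""
    n = len(sample)
    sample_mean = statistics.fmean(sample)
    # Standard error: the spread we expect in the sample mean itself.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_value = NormalDist().cdf(z)  # left tail, because Ha says "less than"
    return {
        "mean": sample_mean,
        "z": z,
        "p_value": p_value,
        "reject_h0": p_value < alpha,
    }


# Hypothetical scores, made up for illustration.
scores = [62, 71, 65, 68, 74, 59, 66, 70, 63, 67] * 4
result = left_tailed_z_test(scores, mu0=70)

verdict = "reject H0" if result["reject_h0"] else "fail to reject H0"
print(f"mean = {result['mean']:.1f}, z = {result['z']:.2f}, "
      f"p-value = {result['p_value']:.2g}: {verdict}")
```

Note that the function never returns "accept H0": a large p-value only means we failed to find evidence against it.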
Now that you have a fair idea of what a p-value is, what it signifies and how to use it in hypothesis testing, we can go ahead and see how it helps us with feature elimination while fitting a linear regression model.
Feature Elimination using p-value
Let’s take a medical insurance dataset and try to predict the medical expenses of an individual based on factors like age, sex, bmi etc., so that the insurance company can set the premium accordingly.
How do hypothesis testing and the p-value fit into this?
We want to find out if the columns/features do indeed affect the medical expenses.
H0: Column/Feature does not affect medical expenses.
H1: Column/Feature affects medical expenses.
So, if a column shows p-value <=0.05 then we reject the null hypothesis and say that ‘Column/Feature affects medical expenses.‘
We don’t have to calculate the p-value for each and every column by hand. We can simply use OLS from statsmodels.api, which fits a linear regression model and also reports the p-value of each coefficient.
Let’s jump right into the code. But if you feel you need to brush up on hypothesis testing and p-values, go here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn import metrics
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

# Let's load our csv data into DataFrame
df = pd.read_csv("insurance.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age         1338 non-null int64
sex         1338 non-null object
bmi         1338 non-null float64
children    1338 non-null int64
smoker      1338 non-null object
region      1338 non-null object
expenses    1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
# Take a peek into data
df.head()
   age     sex   bmi  children smoker     region  expenses
0   19  female  27.9         0    yes  southwest  16884.92
1   18    male  33.8         1     no  southeast   1725.55
2   28    male  33.0         3     no  southeast   4449.46
3   33    male  22.7         0     no  northwest  21984.47
4   32    male  28.9         0     no  northwest   3866.86
After cleaning the data, encoding sex and smoker as 0/1, one-hot encoding the region column and removing outliers in the dependent column (expenses), we get the data below.
age  sex   bmi  children  smoker  expenses  region_northwest  region_southeast  region_southwest
 19    0  27.9         0       1  16884.92                 0                 0                 1
 18    1  33.8         1       0   1725.55                 0                 1                 0
 28    1  33.0         3       0   4449.46                 0                 1                 0
 33    1  22.7         0       0  21984.47                 1                 0                 0
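The encoding that produces a table like the one above might look like the sketch below. It is an assumption about how the cleaning was done, not the notebook's actual code: the 0/1 mappings are read off the two tables, the helper name `encode` is made up, and northeast is assumed to be the dropped baseline region because it has no column in the encoded table. Outlier removal is left out.

```python
import pandas as pd

# The five rows printed by df.head() above.
raw = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "expenses": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
})

# Assumed full list of regions; the first one becomes the dropped baseline.
REGIONS = ["northeast", "northwest", "southeast", "southwest"]


def encode(frame):
    """Turn the text columns into numbers that OLS can work with."""
    out = frame.copy()
    out["sex"] = out["sex"].map({"female": 0, "male": 1})
    out["smoker"] = out["smoker"].map({"no": 0, "yes": 1})
    # Declaring the categories keeps the dummy columns stable even when a
    # region happens to be missing from the rows at hand.
    out["region"] = pd.Categorical(out["region"], categories=REGIONS)
    return pd.get_dummies(out, columns=["region"], drop_first=True, dtype=int)


encoded = encode(raw)
print(encoded)
```

Dropping one region avoids a redundant column: once the other three are 0, the model already knows the person is in the baseline region.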
Now we will try to fit a model to this data and try to predict the expenses (dependent variable).
x = df[df.columns[df.columns != 'expenses']]
y = df.expenses

# Statsmodels.OLS requires us to add a constant.
x = sm.add_constant(x)
model = sm.OLS(y, x)
results = model.fit()
print(results.summary())
As we can see,
Adj. R-squared: 0.752
p-values can be found under P>|t|
The columns sex and region_northwest have p-values > 0.05. We will remove these columns one by one and check how the metrics of the model change.
x.drop('sex', axis=1, inplace=True)
model = sm.OLS(y, x)
results = model.fit()
print(results.summary())
Adj. R-squared: 0.752
R-squared remains the same, and Adj. R-squared increased slightly (although it still rounds to 0.752). That is because Adj. R-squared takes the number of columns into consideration, whereas R-squared does not. So it’s always good to look at Adj. R-squared while removing or adding columns. In this case, removing sex has improved the model slightly, since Adj. R-squared increased and moved closer towards R-squared.
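The standard formula behind Adj. R-squared shows why dropping an uninformative column can nudge it upwards. The numbers in the example are hypothetical and are not taken from the model summaries in this post.

```python
def adjusted_r_squared(r_squared, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept.

    n_features counts the predictor columns, excluding the constant.
    """
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)


# Hypothetical case: the same R-squared, with one column fewer.
with_eight_columns = adjusted_r_squared(0.75, n_obs=1000, n_features=8)
with_seven_columns = adjusted_r_squared(0.75, n_obs=1000, n_features=7)

print(f"8 columns: {with_eight_columns:.5f}")
print(f"7 columns: {with_seven_columns:.5f}")
```

With R-squared held fixed, fewer columns means a smaller penalty, so the adjusted value moves closer to R-squared itself.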
x.drop('region_northwest', axis=1, inplace=True)
model = sm.OLS(y, x)
results = model.fit()
print(results.summary())
Adj. R-squared: 0.752
We can see that region_southwest and region_southeast have p-values of 0.056 and 0.053. We can choose to keep these columns, since their p-values are very close to α (0.05).
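predicted_expense = const + (age × 255.3) + (bmi × 318.62) + (children × 509.21) + (smoker × 23240) - (region_southeast × 777.08) - (region_southwest × 765.40)
where const is the intercept that the model summary reports in the const row.
So, as we can see, the factor with the biggest effect is whether the person is a smoker or not! With everything else held equal, the model predicts that a smoker incurs about 23,240 more in medical expenses than a non-smoker.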
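To see the smoker effect in action, the fitted equation can be wrapped in a small function. The coefficients are the ones quoted above; the intercept is not given in the post, so the `intercept` parameter defaults to a placeholder of 0. That makes the absolute predictions meaningless, but differences between two people are unaffected. The function name and the example person are made up.

```python
COEFFICIENTS = {
    "age": 255.3,
    "bmi": 318.62,
    "children": 509.21,
    "smoker": 23240.0,
    "region_southeast": -777.08,
    "region_southwest": -765.40,
}


def predict_expense(features, intercept=0.0):
    """Apply the linear model; features missing from the dict count as 0.

    The default intercept is a placeholder, not the fitted value.
    """
    return intercept + sum(
        coefficient * features.get(name, 0)
        for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical person, once as a non-smoker and once as a smoker.
person = {"age": 40, "bmi": 28.0, "children": 2, "region_southeast": 1}
non_smoker = predict_expense({**person, "smoker": 0})
smoker = predict_expense({**person, "smoker": 1})

print(f"extra predicted expense for a smoker: {smoker - non_smoker:,.2f}")
```

Whatever the other features are, the gap between the two predictions is always the smoker coefficient.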
If you’ve learnt nothing else from this post, at least you will have learnt that smoking burns not only your lungs, but your wallet too!
Until next time.
The notebook, along with the dataset, can be found on my GitHub.