Naive Bayes Classifier

Introduction

In this post, part of the Classification Series, we will discuss the Naive Bayes Classifier. First, we will look at what the Naive Bayes Classifier is, a little bit of the math behind it, the applications it is typically used for, and finally an example of an SMS spam filter built with a Naive Bayes Classifier.

What is Naive Bayes Classifier?

The Naive Bayes classifier is based on Bayes' theorem, from which it gets its name. It is an easy-to-understand probabilistic model that gives very quick predictions.

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

Bayes' Theorem

where A and B are events and P(B) ≠ 0.

  • P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
  • P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
  • P(A) and P(B) are the probabilities of observing A and B independently of each other; these are known as the marginal probabilities.
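As a quick illustration with made-up numbers: suppose 20% of all messages are spam, the word "free" appears in 30% of spam messages, and "free" appears in 10% of all messages. Then

P(\text{spam} | \text{free}) = \frac{P(\text{free} | \text{spam})P(\text{spam})}{P(\text{free})} = \frac{0.3 \times 0.2}{0.1} = 0.6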

Why is it called ‘Naive’?

The fundamental assumption of the Naive Bayes Classifier is that all the features/predictor variables are independent of each other given the class. In other words, changing the value of one feature does not directly influence or change the value of any other feature in the dataset. For example, in spam filtering the presence of the word "free" is assumed to tell us nothing about whether the word "winner" also appears, once we know the class of the message.

Math behind Naive Bayes Classifier

A classifier's job is to assign data to classes. The Naive Bayes classifier computes the probability of a data point falling into a particular class, and we then make a decision based on a threshold for that class.

So, from the above theorem and definitions, let's say y = class 1 and x1, ..., xn are the predictor variables. Then the conditional probability of the data falling into class 1 given the predictor variables is

 P(y|x_1,...,x_n) = \frac{ P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1)P(x_2)...P(x_n)}

where P(y) is the class probability (prior) and P(xi | y) is the conditional probability of feature xi given class y. The above equation can be written more compactly as

 P(y|x_1,...,x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1)P(x_2)...P(x_n)}

Since the denominator P(x1)...P(xn) does not depend on the class y, it is the same for every class for a given input. So we can drop it and write the equation as below

 P(y|x_1,...,x_n)\propto P(y)\prod_{i=1}^{n}P(x_i|y)

P(y | x1, ..., xn) is the output of the Naive Bayes algorithm: the predicted class is simply the y for which this quantity is largest.
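To make this decision rule concrete, here is a minimal sketch in plain Python (not the scikit-learn implementation we use later). It assumes we already have the priors P(y) and the per-feature conditionals P(xi | y) stored in two hypothetical lookup tables, and simply picks the class with the largest product.

def predict(features, priors, cond_probs):
    """Pick the class y that maximizes P(y) * prod_i P(x_i | y)."""
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(features):
            # Naive assumption: multiply the per-feature conditionals.
            score *= cond_probs.get((y, i, value), 1e-9)  # tiny floor for unseen values
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Made-up probabilities for a single binary feature x_0 ("does the word appear?")
priors = {'spam': 0.2, 'ham': 0.8}
cond_probs = {('spam', 0, 1): 0.3, ('ham', 0, 1): 0.05}
print(predict([1], priors, cond_probs))  # 'spam', since 0.2 * 0.3 > 0.8 * 0.05

In practice one works with log-probabilities instead of raw products to avoid numerical underflow when there are many features, which is what scikit-learn's implementation does.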

Where is Naive Bayes Classifier Used?

Naive Bayes classifiers are mostly used in text classification (since they give good results in multi-class problems where the features are roughly independent) and often have a higher success rate than other simple algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment), as well as in recommender systems that suggest products to customers based on historical data.

Now let's get some hands-on experience with the Naive Bayes algorithm.

Naive Bayes Algorithm – Hands On

Let's take a very simple example: data on children playing outside depending on the weather, and use the Naive Bayes algorithm to predict whether children will play outside given a particular weather condition.

Weather / Play | Yes | No | P(Weather | Yes) | P(Weather | No)
Sunny          |  3  |  2 |       3/8        |       2/9
Windy          |  3  |  4 |       3/8        |       4/9
Rainy          |  2  |  3 |       2/8        |       3/9
Total          |  8  |  9 |                  |

Likelihood Table

Here Weather is the predictor variable and Play is the class that we have to predict (Yes / No).

Let's try to predict whether children will play on a sunny day.

From the Naive Bayes equation,

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/8 = 0.375, P(Sunny) = 5/17 ≈ 0.294 and P(Yes) = 8/17 ≈ 0.47.

Now, P(Yes | Sunny) = 0.375 * 0.47 / 0.294 ≈ 0.6. [If we have set the threshold at 0.5, then we say the data belongs to class Play = Yes.]

This was a very simple example with only a single predictor. In the real world we usually have many predictors (often hundreds) and classes, so doing this exercise by hand would take a good amount of time.
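As a quick sanity check, the same hand calculation can be written in a few lines of Python (the counts come straight from the likelihood table above):

p_sunny_given_yes = 3 / 8   # P(Sunny | Yes) from the likelihood table
p_yes = 8 / 17              # P(Yes): 8 "Yes" days out of 17 total
p_sunny = 5 / 17            # P(Sunny): 5 sunny days out of 17 total

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 3))  # 0.6 -> above the 0.5 threshold, so Play = Yes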

Now that we are done with the basics and the math behind the Naive Bayes Classifier, let's go ahead and write Python code to filter spam SMS using a Naive Bayes Classifier.

SMS Spam Filter – Python

Let's start off by importing the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, naive_bayes, metrics
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

Exploring the Dataset

data = pd.read_csv('../input/spam.csv', encoding='latin-1')
data.head(10)
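A note on the columns: in this Kaggle dataset, v1 holds the label (ham or spam) and v2 holds the message text. Depending on the CSV you download, there may also be a few empty "Unnamed" columns, which can optionally be dropped:

# Optional cleanup (assumes the layout: v1 = label, v2 = message text)
data = data[['v1', 'v2']]
data.shape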

Distribution of spam/non-spam messages

count_Class = data["v1"].value_counts(sort=True)
count_Class.plot(kind= 'bar', color= ["blue", "orange"])
plt.title('Bar chart')
plt.show()

Text Analytics

We want to find the frequencies of words in the spam and non-spam messages. The words of the messages will be model features.

# 20 most common words in non-spam (ham) messages
count1 = Counter(" ".join(data[data['v1']=='ham']["v2"]).split()).most_common(20)
df1 = pd.DataFrame.from_dict(count1)
df1 = df1.rename(columns={0: "words in non-spam", 1 : "count"})
# 20 most common words in spam messages
count2 = Counter(" ".join(data[data['v1']=='spam']["v2"]).split()).most_common(20)
df2 = pd.DataFrame.from_dict(count2)
df2 = df2.rename(columns={0: "words in spam", 1 : "count_"})
df1.plot.bar(legend = False)
y_pos = np.arange(len(df1["words in non-spam"]))
plt.xticks(y_pos, df1["words in non-spam"])
plt.title('More frequent words in non-spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()
df2.plot.bar(legend = False, color = 'orange')
y_pos = np.arange(len(df2["words in spam"]))
plt.xticks(y_pos, df2["words in spam"])
plt.title('More frequent words in spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()

We can see that the majority of the frequent words in both classes are stop words such as 'to', 'a', 'or' and so on. Stop words are the most common words in a language; there is no single, universal list of them.
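For reference, scikit-learn ships with a built-in English stop-word list (this is the list CountVectorizer uses when it is given stop_words='english' below); we can take a quick look at it:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(len(ENGLISH_STOP_WORDS))           # size of the built-in English stop-word list
print(sorted(ENGLISH_STOP_WORDS)[:10])   # a few examples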

Feature engineering

Text pre-processing, tokenizing and filtering of stopwords are included in a high level component that is able to build a dictionary of features and transform documents to feature vectors.

We remove the stop words in order to improve the analysis.

f = feature_extraction.text.CountVectorizer(stop_words = 'english')
X = f.fit_transform(data["v2"])
np.shape(X)

We have created more than 8,400 new features. Feature j in row i holds the number of times word wj appears in text example i (so it is zero if the word does not appear).
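To get a feel for these features, we can peek at the fitted vectorizer's vocabulary (a dict mapping each word to its column index) and at the sparse row of a single message:

print(len(f.vocabulary_))                 # number of features (distinct words)
print(list(f.vocabulary_.items())[:5])    # a few (word, column index) pairs
print(X[0])                               # sparse counts for the first message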

Predictive Analysis

My goal is to predict whether a new SMS is spam or non-spam. I assume that it is much worse to misclassify a non-spam message than to misclassify a spam message (I don't want to have false positives).

The reason is that I normally don't check the spam folder.

The two possible situations are:

  1. A new spam SMS lands in my inbox (false negative).
    OUTCOME: I delete it.
  2. A new non-spam SMS lands in my spam folder (false positive).
    OUTCOME: I probably don't read it.

I prefer the first option!
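In terms of the metrics computed below, this preference means we care most about the precision on the spam class (we will in fact look for models with 100% test precision), while recall tells us how many spam messages we actually catch:

\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}

where spam is treated as the positive class, so FP counts non-spam messages flagged as spam.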

First we transform the spam/non-spam variable into a binary variable, then we split our data set into a training set and a test set.

data["v1"]=data["v1"].map({'spam':1,'ham':0})
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, data['v1'], test_size=0.33, random_state=42)
print([np.shape(X_train), np.shape(X_test)])

Naive Bayes Classifier

We train several Naive Bayes models, varying the smoothing parameter alpha, and evaluate the accuracy, recall and precision of each model on the test set.

list_alpha = np.arange(1/100000, 20, 0.11)   # candidate smoothing values
score_train = np.zeros(len(list_alpha))
score_test = np.zeros(len(list_alpha))
recall_test = np.zeros(len(list_alpha))
precision_test= np.zeros(len(list_alpha))
count = 0
for alpha in list_alpha:
    bayes = naive_bayes.MultinomialNB(alpha=alpha)   # alpha = additive (Laplace/Lidstone) smoothing
    bayes.fit(X_train, y_train)
    score_train[count] = bayes.score(X_train, y_train)
    score_test[count]= bayes.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, bayes.predict(X_test))
    precision_test[count] = metrics.precision_score(y_test, bayes.predict(X_test))
    count = count + 1

matrix = np.matrix(np.c_[list_alpha, score_train, score_test, recall_test, precision_test])
models = pd.DataFrame(data = matrix, columns = 
             ['alpha', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(10)

Let's get the model with the highest precision.

best_index = models['Test Precision'].idxmax()
models.iloc[best_index, :]

My best model does not produce any false positives, which is our goal.

Let's see if there is more than one model with 100% precision!

models[models['Test Precision']==1].head(5)

Among the models with the highest possible precision, we select the one with the highest test accuracy.

best_index = models[models['Test Precision']==1]['Test Accuracy'].idxmax()
bayes = naive_bayes.MultinomialNB(alpha=list_alpha[best_index])
bayes.fit(X_train, y_train)
models.iloc[best_index, :]

Let's get the confusion matrix.

m_confusion_test = metrics.confusion_matrix(y_test, bayes.predict(X_test))
pd.DataFrame(data = m_confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

We have misclassified 56 spam messages as ham, and 0 ham messages as spam (which is what we wanted).
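As a final check, the fitted CountVectorizer and the selected model can be used together to classify brand-new messages; the two SMS strings below are made up purely for illustration:

new_sms = ["WINNER!! You have won a free prize, call now to claim",
           "Are we still meeting for lunch tomorrow?"]
X_new = f.transform(new_sms)     # reuse the vectorizer fitted on the training texts
print(bayes.predict(X_new))      # labels: 1 = spam, 0 = ham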

This concludes the post on Naive Bayes Classifier.

The complete notebook along with the dataset for the above example can be found on my GitHub. Please do check out the other projects as well.

My handle on Medium is @fahadanwar10. Do let me know your feedback.
