example_train.csv: Contains all the training sentences.
example_test.csv: Contains all the test sentences.
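A minimal sketch of loading the training file with pandas; the column names 'Document' (the sentence) and 'Class' (the label) are assumptions for illustration, not necessarily the author's:

```python
import pandas as pd

# Load the training sentences; 'Document' and 'Class' are assumed column names
train = pd.read_csv('example_train.csv')
print(train.head())
```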
Convert the label to a numerical variable:
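One way to do this is with scikit-learn's LabelEncoder (a sketch; 'Class' is the assumed label column from above):

```python
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers 0, 1, ...
le = LabelEncoder()
train['label'] = le.fit_transform(train['Class'])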
Split the dataframe into X (the sentences) and y (the labels):
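For example, continuing the sketch above:

```python
# X holds the raw sentences, y the numeric labels
X_train = train['Document']
y_train = train['label']
```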
Now we have to convert the data into a format which can be used for training the model.
We’ll use the bag of words representation for each sentence (document).
Imagine breaking X into individual words and putting them all in a bag. Then we pick all the unique words from the bag, one by one, and make a dictionary of unique words.
This is called vectorization of words. scikit-learn provides the class CountVectorizer() to vectorize the words.
Here "vec" is an object of the class CountVectorizer(). It has a method called fit(), which learns the vocabulary from a corpus of documents (the matrix of token counts itself is built later, by transform()).
CountVectorizer() converts the documents into a set of unique words, alphabetically sorted and indexed.
Listing the fitted vocabulary returns all of these words; in this example there are 39 of them (see the sketch below).
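Put together, these steps might look as follows (a sketch; note that in scikit-learn versions before 1.0 the vocabulary listing method is get_feature_names() rather than get_feature_names_out()):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(X_train)  # learn the vocabulary from the training corpus

vocab = vec.get_feature_names_out()  # all unique words, alphabetically sorted
print(vocab)
print(len(vocab))  # 39 in this example
```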
So what are stop words here?
We can see a few trivial words such as ‘and’,’is’,’of’, etc.
These words don’t really make any difference in classifying a document. These are called **stop words**, so it is recommended to get rid of them.
We can remove them by passing the parameter stop_words=’english’ when instantiating CountVectorizer(), as mentioned above:
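Continuing the sketch:

```python
# Re-instantiate the vectorizer so that English stop words are dropped
vec = CountVectorizer(stop_words='english')
vec.fit(X_train)
print(len(vec.get_feature_names_out()))  # 23 after removing stop words
```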
This eliminates all the stop words, and the vocabulary size drops from 39 to 23.
So our final dictionary is made of 23 words (after discarding the stop words). Now, to do classification, we need to represent all the documents with these words (or tokens) as features.
Every document will be converted into a feature vector representing the presence of these words in that document. Let’s convert each of our training documents into a feature vector:
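A sketch of that step (X_train_dtm is a name chosen here for illustration):

```python
# Use the fitted vocabulary to build the training document-term matrix
X_train_dtm = vec.transform(X_train)
print(X_train_dtm)  # a sparse matrix of token counts
```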
Convert this sparse matrix into a more easily interpretable array:
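For example:

```python
# Densify the sparse matrix so every row (document) and column (word) is visible
X_train_dtm.toarray()
```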
To make the dataset more readable, let us examine the vocabulary and the document-term matrix together in a pandas dataframe. The way to convert a matrix into a dataframe is as follows:
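Roughly, using the names from the sketches above:

```python
import pandas as pd

# One row per document, one column per vocabulary word
pd.DataFrame(X_train_dtm.toarray(), columns=vec.get_feature_names_out())
```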
Now import and transform the test data. It contains the test sentences together with their labels:
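Continuing the sketch (same assumed column names as the training file):

```python
# Load the test sentences and their labels
test = pd.read_csv('example_test.csv')
print(test.head())
```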
As with the training data: convert the label to a numerical variable, convert the sentences to a NumPy array, and transform the test data with the already-fitted vectorizer.
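A sketch of these three steps, reusing the fitted LabelEncoder and CountVectorizer from above:

```python
# Encode the test labels with the encoder fitted on the training labels
test['label'] = le.transform(test['Class'])

# Convert the sentences and labels to NumPy arrays
X_test = test['Document'].values
y_test = test['label'].values

# Build the test document-term matrix with the vocabulary learned on train
X_test_dtm = vec.transform(X_test)
```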
For the training dataset:
- vec.fit(train): learns the vocabulary of the training data
- vec.transform(train): uses the fitted vocabulary to build a document-term matrix from the training data

For the test dataset:
- vec.transform(test): uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn’t seen before), as the illustration below shows
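A tiny self-contained illustration of that last point, with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer()
demo.fit(['red car', 'blue car'])  # vocabulary: ['blue', 'car', 'red']
print(demo.transform(['green car']).toarray())
# [[0 1 0]] -- 'green' was never seen during fit(), so it is ignored
```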