example_train.csv: Contains all the training sentences.
example_test.csv: Contains all the test sentences.
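A minimal sketch of loading the training file with pandas; the column names 'Document' (the sentence) and 'Class' (the label) are assumptions for illustration, not necessarily the author's:

```python
import pandas as pd

# Load the training sentences; 'Document' and 'Class' are assumed column names
train = pd.read_csv('example_train.csv')
print(train.head())
```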
Convert the label to a numerical variable:
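One way to do this is with scikit-learn's LabelEncoder (a sketch; 'Class' is the assumed label column from above):

```python
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers 0, 1, ...
le = LabelEncoder()
train['label'] = le.fit_transform(train['Class'])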
Split the dataframe into X (the sentences) and y (the labels):
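For example, continuing the sketch above:

```python
# X holds the raw sentences, y the numeric labels
X_train = train['Document']
y_train = train['label']
```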
Now we have to convert the data into a format which can be used for training the model.
We’ll use the bag of words representation for each sentence (document).
Imagine breaking X into individual words and putting them all in a bag. Then we pick all the unique words from the bag, one by one, and make a dictionary of unique words.
This is called vectorization of words. scikit-learn provides the class CountVectorizer() to vectorize the words.
Here "vec" is an object of the class CountVectorizer(). It has a method called fit(), which learns the vocabulary from a corpus of documents (the matrix of token counts itself is built later, by transform()).
CountVectorizer() converts the documents into a set of unique words, alphabetically sorted and indexed.
Listing the fitted vocabulary returns all of these words; in this example there are 39 of them (see the sketch below).
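Put together, these steps might look as follows (a sketch; note that in scikit-learn versions before 1.0 the vocabulary listing method is get_feature_names() rather than get_feature_names_out()):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(X_train)  # learn the vocabulary from the training corpus

vocab = vec.get_feature_names_out()  # all unique words, alphabetically sorted
print(vocab)
print(len(vocab))  # 39 in this example
```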
So what are stop words here?
We can see a few trivial words such as ‘and’,’is’,’of’, etc.
These words don’t really make any difference in classifying a document. These are called **stop words**, so it is recommended to get rid of them.
We can remove them by passing the parameter stop_words=’english’ when instantiating CountVectorizer(), as mentioned above:
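Continuing the sketch:

```python
# Re-instantiate the vectorizer so that English stop words are dropped
vec = CountVectorizer(stop_words='english')
vec.fit(X_train)
print(len(vec.get_feature_names_out()))  # 23 after removing stop words
```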
This eliminates all the stop words, and the vocabulary size drops from 39 to 23.
So our final dictionary is made of 23 words (after discarding the stop words). Now, to do classification, we need to represent all the documents with these words (or tokens) as features.
Every document will be converted into a feature vector representing the presence of these words in that document. Let’s convert each of our training documents into a feature vector:
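A sketch of that step (X_train_dtm is a name chosen here for illustration):

```python
# Use the fitted vocabulary to build the training document-term matrix
X_train_dtm = vec.transform(X_train)
print(X_train_dtm)  # a sparse matrix of token counts
```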
Convert this sparse matrix into a more easily interpretable array:
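For example:

```python
# Densify the sparse matrix so every row (document) and column (word) is visible
X_train_dtm.toarray()
```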
To make the dataset more readable, let us examine the vocabulary and the document-term matrix together in a pandas dataframe. The way to convert a matrix into a dataframe is as follows:
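Roughly, using the names from the sketches above:

```python
import pandas as pd

# One row per document, one column per vocabulary word
pd.DataFrame(X_train_dtm.toarray(), columns=vec.get_feature_names_out())
```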
Now import and transform the test data. It contains the test sentences together with their labels:
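Continuing the sketch (same assumed column names as the training file):

```python
# Load the test sentences and their labels
test = pd.read_csv('example_test.csv')
print(test.head())
```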
As with the training data: convert the label to a numerical variable, convert the sentences to a NumPy array, and transform the test data with the already-fitted vectorizer.
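A sketch of these three steps, reusing the fitted LabelEncoder and CountVectorizer from above:

```python
# Encode the test labels with the encoder fitted on the training labels
test['label'] = le.transform(test['Class'])

# Convert the sentences and labels to NumPy arrays
X_test = test['Document'].values
y_test = test['label'].values

# Build the test document-term matrix with the vocabulary learned on train
X_test_dtm = vec.transform(X_test)
```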
For the training dataset:
- vec.fit(train): learns the vocabulary of the training data
- vec.transform(train): uses the fitted vocabulary to build a document-term matrix from the training data

For the test dataset:
- vec.transform(test): uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn’t seen before), as the illustration below shows
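A tiny self-contained illustration of that last point, with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer()
demo.fit(['red car', 'blue car'])  # vocabulary: ['blue', 'car', 'red']
print(demo.transform(['green car']).toarray())
# [[0 1 0]] -- 'green' was never seen during fit(), so it is ignored
```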