In a previous post we talked about the three standard ways to represent text in NLP:
One-hot encoding.
Count vectorizer.
Word embeddings.
Having already delved into one-hot encoding, in this article we will look at the count vectorizer.
What is a count vectorizer?
The count vectorizer is a way of representing text in natural language processing that converts a collection of documents into a document-term matrix. Encoding is therefore done at the document level, rather than at the token level.
Since it is a bag-of-words model, information about the position of the tokens or their context is not encoded; only whether they appear, and how often.
Let's walk through a short exercise that illustrates the idea in the simplest way possible.
We are going to work with a short corpus made up of three sentences:
'I like dogs'
'there are dogs and dogs'
'there are many breeds of dogs'

#count vectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sent_1 = 'I like dogs'
sent_2 = 'there are dogs and dogs'
sent_3 = 'there are many breeds of dogs'
We put the three sentences together in a corpus:
#count vectorizer
corpus = [sent_1, sent_2, sent_3]
Basic count vectorizer example
Once we have the corpus, the count vectorizer will first give us the vocabulary of that corpus.
The first thing we will do is instantiate the object and apply fit_transform. This method learns the vocabulary from the corpus and, in the same step, transforms that same corpus into a document-term matrix, which we store in the variable X.
#count vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
Now let's get the names of the features (in scikit-learn versions before 1.0 this method was called get_feature_names):

#count vectorizer
print(vectorizer.get_feature_names_out())
The displayed result is our vocabulary: the unique words that exist within our corpus (and, are, breeds, dogs, like, many, of, there). Note that "I" does not appear: CountVectorizer's default tokenizer drops single-character tokens.
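Besides the feature names, the fitted vectorizer exposes a vocabulary_ attribute, a dict that maps each word to its column index in the matrix. A self-contained sketch with the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I like dogs',
          'there are dogs and dogs',
          'there are many breeds of dogs']

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each word to its column index; words are
# indexed in alphabetical order, and 'I' is absent because the
# default token pattern ignores single-character tokens
print(sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
```

This mapping is what lets us label the columns of the dataframe below with the right words.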
Now we can convert the matrix into a dataframe and display it:
#count vectorizer
doc_term_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix
   and  are  breeds  dogs  like  many  of  there
0    0    0       0     1     1     0   0      0
1    1    1       0     2     0     0   0      1
2    0    1       1     1     0     1   1      1
What we get is the table above. Each row is a sentence and each column is a word from the vocabulary. For the first sentence, row 0, which corresponds to "I like dogs", the vectorizer simply counts, since that is all a count vectorizer does: "like" is marked as 1 and "dogs" is also marked as 1 ("I" gets no column because, as noted above, single-character tokens are dropped).
In the second sentence, "there are dogs and dogs", the words "there", "are" and "and" are marked with a 1 and "dogs" with a 2, since that word appears twice in the sentence and its occurrences are added up.
The same goes for the last sentence, "there are many breeds of dogs".
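Once fitted, the same vectorizer can also encode sentences it has never seen, using the same columns; words outside the learned vocabulary are simply ignored. A small self-contained sketch (the new sentence is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I like dogs',
          'there are dogs and dogs',
          'there are many breeds of dogs']

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# 'cats' was never seen during fit, so it gets no column and
# contributes nothing to the vector
new = vectorizer.transform(['there are many cats and dogs'])
print(new.toarray())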
Every sentence we add to the corpus is processed in exactly the same way.
These models are also computationally efficient: downstream algorithms do not need to work with raw strings of text, but operate directly on numeric matrices.
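Part of that efficiency comes from the return type: fit_transform does not return a dense array but a SciPy sparse matrix, which stores only the non-zero counts. A quick sketch with the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I like dogs',
          'there are dogs and dogs',
          'there are many breeds of dogs']

X = CountVectorizer().fit_transform(corpus)

# Only the non-zero counts are stored: 12 entries instead of the
# full 3 x 8 = 24 cells of the dense document-term matrix
print(type(X))
print(X.nnz)
```

With a realistic vocabulary of tens of thousands of words, most cells are zero, so the sparse representation saves a great deal of memory; we only called .toarray() above to build a readable dataframe.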
Do you want to continue moving forward?
If you want to continue learning about the count vectorizer and other techniques for representing text in NLP, we invite you to keep training. To access job opportunities in Big Data, one of the best-paid and most in-demand areas in tech, we have our Big Data, Artificial Intelligence & Machine Learning Full Stack Bootcamp for you.
In just a few months, with this intensive and comprehensive training you will acquire all the theoretical and practical knowledge you need to get the job of your dreams. You will be accompanied at all times by expert teachers and, with our Talent Pool, the doors of the job market will be open to you as soon as you finish. Don't wait any longer to boost your career; request more information now!