In this article we will see how to do text preprocessing for NLP in Python. It is not a very complicated task, since you only need a basic knowledge of the language. If you also know some linguistics, even better!
Text preprocessing for NLP in Python
Initial steps
! wget https://transfer.sh/J4o169/requirements.txt
! pip install -r requirements.txt
# Text preprocessing for NLP
! wget https://transfer.sh/KWNxRn/datasets.zip
! wget https://transfer.sh/XoRKUW/utils.py
! unzip datasets.zip
from utils import load_cinema_reviews
If you look closely, there are several wget commands, each of which downloads a file previously uploaded to the repository. The first one downloads the requirements file, and pip then installs the dependencies listed in it. We can execute the cells one by one.
Once all the dependencies are installed, we download the datasets archive and the utilities file, and then decompress the archive. Once it is unzipped, we have a datasets folder containing all the datasets available in the repository.
Now, from the utilities file we import the load_cinema_reviews function, which lets us load a set of movie reviews from the dataset.
Data reading
# Path to the directory where we have the datasets with the reviews
# Decompress before!
datasets_path = "./datasets"
corpus_cine_folder = "corpusCine"
The first step in text preprocessing for NLP is to specify the directory where the dataset is located. In this case it is the corpusCine folder inside datasets.
We run the cell to load the data and, once it is loaded, check the size of the dataset with the len function:
reviews_dict = load_cinema_reviews(datasets_path, corpus_cine_folder)
print(len(reviews_dict))
3878
This is the number of reviews in the dataset. What we have is a dictionary containing all of those reviews. For example, to get the review stored under key 10, we would write the following:
reviews_dict.get(10)
{'author': 'Javier Moreno',
 'review_text': "No. This time I'm not going to use a movie as a pretext to express my most insane/rational ideas or thoughts…",
 'sentiment': '4',
 'summary': "Interesting adaptation of Rowling's novel",
 'title': 'Harry Potter and the Goblet of Fire'}
This tells us who the author is and what the text of the review is. Next comes a label called sentiment, which here holds a rating of 4. After that comes the summary and, finally, the title of the film.
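As a minimal sketch of how such a record can be accessed, the snippet below builds a small stand-in dictionary. The name sample_reviews and all of its values are hypothetical placeholders; only a subset of the key names is taken from the record shown above.

```python
# Hypothetical stand-in for reviews_dict, with made-up placeholder values
sample_reviews = {
    10: {
        "author": "Jane Doe",
        "sentiment": "4",
        "summary": "Placeholder summary",
        "title": "Placeholder title",
    }
}

# Look up one review by its key, then read individual fields
record = sample_reviews.get(10)
print(record["title"])

# .get returns None for a missing key instead of raising KeyError
missing = sample_reviews.get(99999)
print(missing)

# The sentiment is stored as a string, so convert it before numeric use
rating = int(record["sentiment"])
print(rating)
```

Using get rather than square brackets is convenient when you are not sure a key exists, since you can check for None before going on.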
String class
The first thing we will work with is the string class (str), a native Python class that incorporates text processing utilities useful for NLP.
Python has a number of built-in methods that can be used on strings. Let's see some of them:
capitalize(), casefold(), center(), endswith(), find(), index(), lower(), replace(), strip(), title()
There are many others, which you can consult in the Python String Methods reference.
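To make a few of these methods concrete, here is a short sketch applied to a made-up string (the sample text is an assumption, not taken from the dataset):

```python
text = "  Harry Potter and the GOBLET of fire.  "

cleaned = text.strip()                    # remove leading and trailing whitespace
lowered = cleaned.lower()                 # "harry potter and the goblet of fire."
titled = lowered.title()                  # "Harry Potter And The Goblet Of Fire."
replaced = lowered.replace("goblet", "chamber")
position = lowered.find("potter")         # 6: index of the first match
not_found = lowered.find("dragon")        # -1 when the substring is absent
ends_with_period = lowered.endswith(".")  # True

print(cleaned, lowered, titled, replaced, sep="\n")
print(position, not_found, ends_with_period)
```

Note that strings are immutable: each method returns a new string and leaves the original untouched.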
Tokenization
Now we will keep only the text of the movie review. Let's print its first 1000 characters:
review = reviews_dict.get(10).get('review_text')
print(review[:1000])
When a colon appears inside the square brackets after a string or a list, it denotes a slice:
If the colon is to the left of the number, the slice runs from the first element up to, but not including, the indicated position; in this case that is 1000 → [:1000]. If the colon is to the right of the number, the slice runs from the indicated position to the end → [1000:].
We execute and the result is:
No. This time I am not going to use a movie as a pretext to express my most insane/rational/insane ideas or thoughts. This time, and I swear before Lovecraft’s sacred Necronomicon, I plan to talk about the film itself…
These are the first 1000 characters of the review.
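The slicing behaviour can be checked on a short made-up string (the sample sentence is an assumption used only for illustration):

```python
sample = "No. This time I plan to talk about the film itself."

first_part = sample[:3]    # from the start up to, but not including, index 3
rest = sample[3:]          # from index 3 to the end

# The two slices together rebuild the original string
rebuilt = first_part + rest

# Slicing past the end does not raise an error; it returns what is available
whole = sample[:1000]

print(first_part)
print(rest)
print(rebuilt == sample, whole == sample)
```

This is why review[:1000] is safe even for reviews shorter than 1000 characters.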
Sentences
To extract the sentences from the whole review, we would do the following:
sentences = review.split('.')
print(sentences[:5])
We pass split the character on which we want it to separate the text, which here is the period.
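As a rough illustration on a made-up text, splitting on the period drops the periods themselves and leaves leading spaces and a trailing empty string. The second variant, based on the standard re module, is not part of the original walkthrough; it is one possible alternative that keeps the period attached to each sentence.

```python
import re

sample = "No. This time I plan to talk about the film. It is a good adaptation."

# Plain split on the period
naive = sample.split(".")
# ['No', ' This time I plan to talk about the film', ' It is a good adaptation', '']

# Alternative (an assumption, not the article's method):
# split on the whitespace that follows a period
pattern_based = re.split(r"(?<=\.)\s+", sample)
# ['No.', 'This time I plan to talk about the film.', 'It is a good adaptation.']

print(naive)
print(pattern_based)
```

Neither approach handles abbreviations or decimal numbers, so both should be seen as simple heuristics.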
Tokens
We can also use the strip method to remove leading and trailing whitespace:
sentence = sentences[0]  # for example, the first sentence obtained above
sentence.strip()
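A minimal sketch of how strip might be combined with split to go from sentences to word tokens is shown below. The sample list is hypothetical, and whitespace splitting is only one simple way to tokenize.

```python
# Hypothetical output of a sentence split, with stray spaces and an empty entry
sentences = ["No", " This time I plan to talk about the film", ""]

# Strip each sentence and discard the ones that end up empty
clean_sentences = [s.strip() for s in sentences if s.strip()]

# With no argument, split() separates on any run of whitespace
tokens = clean_sentences[1].split()

print(clean_sentences)
print(tokens)
```

Punctuation stays attached to the neighbouring word with this approach, so it usually needs to be handled in a later step.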