NLP in Python
Python has become very popular in recent years because it combines the power of a general-purpose programming language with the ease of use of domain-specific languages such as MATLAB and R (designed for mathematics and statistics). It has many libraries for data loading, visualization, NLP, image processing, statistics, and more, and it offers some of the most powerful libraries for text processing and machine learning algorithms.
Natural Language Toolkit (NLTK)
NLTK is the most widely used toolkit for working with human language data in Python. It includes a set of libraries and programs for statistical natural language processing. NLTK is commonly used as a learning tool and for carrying out research.
This library provides interfaces and methods for over 50 corpora and lexical resources. NLTK can classify text and perform other functions, such as tokenization, stemming (extracting the stem of a word), tagging (labeling each word with a tag, such as its part of speech or an entity type like person or city), and parsing (syntax analysis).
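As a quick illustration of that corpus access, here is a minimal sketch (the Brown corpus is used purely as an example and must be downloaded first):
import nltk
nltk.download('brown')
from nltk.corpus import brown
print(brown.words()[:10]) # the first ten tokens of the corpus
print(brown.categories()) # the document categories available in the corpus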
Exercise 10: Introduction to NLTK
In this exercise, we will review the most basic concepts of the NLTK library. As we said before, this library is one of the most widely used tools for NLP. It can be used to analyze and study text while discarding irrelevant information. These techniques can be applied to any text data, for example, to extract the most important keywords from a set of tweets or to analyze an article in a newspaper:
Note
All the exercises in this chapter will be executed in Google Colab.
- Open up your Google Colab interface.
- Create a folder for the book.
- Here, we are going to process a sentence with basic methods of the NLTK library. First of all, let's import the necessary methods (stopwords, word_tokenize, and sent_tokenize):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
- Now we create a sentence and apply the methods:
example_sentence = "This course is great. I'm going to learn deep learning; Artificial Intelligence is amazing and I love robotics..."
sent_tokenize(example_sentence) # Divide the text into sentences
Figure 3.4: Sentence divided into sub-sentences
word_tokenize(example_sentence)
Figure 3.5: Tokenizing a sentence into words
Note
sent_tokenize returns a list of the different sentences. One of the disadvantages of NLTK is that sent_tokenize does not analyze the semantic structure of the whole text; it just splits the text at punctuation marks such as periods.
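A small illustration of this behavior (the text is hypothetical): the two sentences below express a single connected idea, but sent_tokenize still splits them at the period, because the split is driven by punctuation patterns rather than meaning:
connected_text = "He said it was raining. So he took an umbrella."
print(sent_tokenize(connected_text)) # returns two sentences, even though they form one idea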
- With the sentence tokenized into words, let's remove the stop words. Stop words are a set of words that do not carry relevant information about the text. Before using stopwords, we will need to download the resource:
nltk.download('stopwords')
- Now, we set the language of our stopwords as English:
stop_words = set(stopwords.words("english"))
print(stop_words)
The output is as follows:
Figure 3.6: Stopwords set as English
- Process the sentence, deleting stopwords:
print(word_tokenize(example_sentence))
print([w for w in word_tokenize(example_sentence.lower()) if w not in stop_words])
The output is as follows:
Figure 3.7: Sentence without stop words
- We can now modify the set of stopwords and check the output:
stop_words = stop_words - set(('this', 'i', 'and'))
print([w for w in word_tokenize(example_sentence.lower()) if w not in stop_words])
Figure 3.8: Setting stop words
- Stemmers remove morphological affixes from words. Let's define a stemmer and process our sentence. The Porter stemmer is one algorithm for performing this task:
from nltk.stem.porter import * # importing the Porter stemmer
stemmer = PorterStemmer() # creating a stemmer instance
print([stemmer.stem(w) for w in word_tokenize(example_sentence)])
The output is as follows:
Figure 3.9: The stemmed sentence
- Finally, let's classify each word by its type. To do this, we will use a POS tagger:
nltk.download('averaged_perceptron_tagger')
t = nltk.pos_tag(word_tokenize(example_sentence)) #words with each tag
t
The output is as follows:

Figure 3.10: POS tagger
Note
The averaged perceptron tagger is an algorithm trained to predict the grammatical category (part of speech) of a word.
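The tags returned by pos_tag follow the Penn Treebank tagset. If a tag is unfamiliar, NLTK can describe it for you; this sketch assumes the extra 'tagsets' resource has been downloaded:
nltk.download('tagsets')
nltk.help.upenn_tagset('JJ') # prints the definition and examples for the 'JJ' (adjective) tag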
As you may have noticed in this exercise, NLTK can easily process a sentence, and it can analyze a huge set of text documents without any problem. It supports many languages, its tokenization process is faster than that of similar libraries, and it has many methods for each NLP problem.
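For instance, here is a brief sketch of that language support, reusing the resources already downloaded in Exercise 10 (the Spanish sentence is just an illustrative example):
print(stopwords.words('spanish')[:10]) # stop word lists exist for several languages
print(sent_tokenize("Me gusta la robótica. Es fascinante.", language='spanish')) # punkt includes models for several languages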
spaCy
spaCy is another library for NLP in Python. It looks similar to NLTK, but you will see some differences in the way it works.
spaCy was developed by Matt Honnibal and is designed for data scientists to clean and normalize text easily. It's the quickest library in terms of preparing text data for a machine learning model. It includes built-in word vectors and some methods for comparing the similarity between two or more texts (these methods are trained with neural networks).
Its API is easy to use and more intuitive than NLTK's. In the NLP world, spaCy is often compared to NumPy. It provides methods and functions for performing tokenization, lemmatization, POS tagging, NER, dependency parsing, sentence and document similarity, text classification, and more.
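As a small taste of two features that the following exercise does not cover, lemmatization and dependency parsing, here is a minimal sketch (it assumes the en_core_web_sm English model is installed, as in Exercise 11):
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I am learning new things")
for token in doc:
    print(token.text, token.lemma_, token.dep_, token.head.text) # lemma_ is the base form; dep_ is the syntactic relation to token.head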
As well as having linguistic features, it also has statistical models. This means you can predict some linguistic annotations, such as whether a word is a verb or a noun. Depending on the language you want to make predictions in, you will need to load a different model. These models also include word vectors in the style of Word2Vec, which we will discuss in Chapter 4, Neural Networks with NLP.
spaCy has many advantages, as we said before, but there are some cons too; for instance, it supports only 8 languages (NLTK supports 17), its tokenization process is slow (which can be critical on a long corpus), and overall, it is not flexible (that is, it just provides API methods without the possibility of modifying their parameters).
Before starting with the exercise, let's review the architecture of spaCy. The most important data structures of spaCy are the Doc and the Vocab.
The Doc structure is the text you are loading; it is not a string. It is composed of a sequence of tokens and their annotations. The Vocab structure is a set of lookup tables, but what are lookup tables and why is this structure important? In computing, a lookup table is an array that replaces a runtime computation with a simpler array-indexing operation. spaCy uses the Vocab to centralize information so that it is available across documents, which makes it more efficient and saves memory. Without these structures, spaCy would be slower.
However, the Doc structure is different from the Vocab because the Doc is a container of data. A Doc object owns the data and is composed of a sequence of tokens or spans. There are also lexemes, which are related to the Vocab structure because they do not have context (unlike the token container); a short sketch of this difference follows the note below.
Note
A lexeme is a unit of lexical meaning without inflectional endings. The area of study for this is morphological analysis.
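Here is that sketch of the difference between a token and a lexeme (it assumes an English model has already been loaded as nlp, as in the upcoming exercise):
doc = nlp("I love robotics")
token = doc[2] # "robotics" as it appears in this Doc, with context
lexeme = nlp.vocab["robotics"] # the same word as a context-free Vocab entry
print(token.text, token.pos_) # the token carries contextual annotations such as its part of speech
print(lexeme.text, lexeme.is_alpha, lexeme.is_stop) # the lexeme only has context-free attributes; it has no .pos_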
Figure 3.11 shows the spaCy architecture.

Figure 3.11: spaCy architecture
Depending on the language model you are loading, you will have a different pipeline and a different Vocab.
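A quick way to inspect that pipeline and Vocab for a loaded model (component names can vary between models and spaCy versions):
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names) # the processing pipeline, for example ['tagger', 'parser', 'ner'] for the small English model
print(len(nlp.vocab)) # the number of lexemes currently stored in the shared Vocab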
Exercise 11: Introduction to spaCy
In this exercise, we will apply the same transformations that we performed in Exercise 10, Introduction to NLTK, to the same sentence, but with the spaCy API. This exercise will help you understand and learn the differences between these libraries:
- Open up your Google Colab interface.
- Create a folder for the book.
- Then, import the package to use all its features:
import spacy
- Now we are going to initialize our nlp object. This object is part of the spaCy methods. By executing this line of code, we are loading the model named inside the parentheses:
import en_core_web_sm # makes sure the small English model package is available
nlp = spacy.load('en_core_web_sm')
- Let's take the same sentence as in Exercise 10, Introduction to NLTK, and create the Doc container:
example_sentence = "This course is great. I'm going to learn deep learning; Artificial Intelligence is amazing and I love robotics..."
doc1 = nlp(example_sentence)
- Now, print doc1, its type, the 5th and 11th tokens, and the span between the 5th and the 11th token. You will see this:
print("Doc structure: {}".format(doc1))
print("Type of doc1: {}".format(type(doc1)))
print("5th and 11th tokens of the Doc: {}, {}".format(doc1[5], doc1[11]))
print("Span between the 5th token and the 11th: {}".format(doc1[5:11]))
The output is as follows:
Figure 3.12: Output of a spaCy document
- As we saw in Figure 3.11, documents are composed of tokens and spans. First, we are going to see the spans of doc1, and then its tokens.
Print the spans:
for s in doc1.sents:
    print(s)
The output is as follows:
Figure 3.13: Printing the spans of doc1
Print the tokens:
for i in doc1:
    print(i)
The output is as follows:
Figure 3.14: Printing the tokens of doc1
- Once we have the document divided into tokens, the stop words can be removed.
First, we need to import them:
from spacy.lang.en.stop_words import STOP_WORDS
print("Some stopwords of spaCy: {}".format(list(STOP_WORDS)[:10]))
type(STOP_WORDS)
The output is as follows:
Figure 3.15: 10 stop words in spaCy
But the token container has the is_stop attribute:
for i in doc1[0:5]:
    print("Token: {} | Stop word: {}".format(i, i.is_stop))
The output is as follows:
Figure 3.16: The is_stop attribute of tokens
- To add new stop words, we must modify the vocab container:
nlp.vocab["This"].is_stop = True doc1[0].is_stop
The output here would be as follows:
True
- To perform part-of-speech tagging, we check the pos_ attribute of each token in the container:
for i in doc1[0:5]:
    print("Token: {} | Tag: {}".format(i.text, i.pos_))
The output is as follows:
Figure 3.17: The .pos_ attribute of tokens
- The document container has the ents attribute, which holds the entities found in the text. To have more entities in our document, let's declare a new one:
doc2 = nlp("I live in Madrid and I am working in Google from 10th of December.")
for i in doc2.ents:
    print("Word: {} | Entity: {}".format(i.text, i.label_))
The output is as follows:

Figure 3.18: The .label_ attribute of tokens
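If any of the tags or entity labels in these outputs are unclear, spacy.explain returns a short human-readable description of them (the two labels below are just examples):
print(spacy.explain('PROPN')) # proper noun
print(spacy.explain('GPE')) # countries, cities, states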
Note
As you can see in this exercise, spaCy is much easier to use than NLTK, but NLTK provides more methods for performing different operations on text. spaCy is perfect for production; that means you can perform basic text processing in the least amount of time.
The exercise has ended! You can now pre-process a text using NLTK or spaCy. Depending on the task you want to perform, you will be able to choose one of these libraries to clean your data.