Why do we need to preprocess data?

NLP software analyzes text by breaking it into sentences and words. We need a reliable NLP pipeline in which a text is split into sentences and then into words before we can analyze it. A text preprocessing pipeline typically includes the following steps:

  • Sentence Tokenization
  • Word Tokenization
  • Stemming
  • Lemmatization
  • Part of Speech Tagging

Sentence Tokenization

Once we have obtained textual data and cleaned it as described in my previous post, Text Cleaning, we perform sentence tokenization to convert a corpus of text into a list of sentences. This step is done using the following code:

import nltk
nltk.download('punkt')  # tokenizer models required by sent_tokenize

from nltk.tokenize import sent_tokenize

def sentence_tokenizer(text):
    """Function to tokenize a text into sentences."""
    sentences = sent_tokenize(text)
    return sentences

text = "nlp is a process where we deal with textual data. also it is one of the most emerging technology in artificial intelligence."
sentence_tokenizer(text)
['nlp is a process where we deal with textual data.',
 'also it is one of the most emerging technology in artificial intelligence.']

Word Tokenization

Once we have a list of sentences, we perform word tokenization to convert each sentence into a list of words. This step is done using the following code:

from nltk.tokenize import word_tokenize
def word_tokenizer(sentence):
    """
    Function to tokenize sentence into words
    """
    words = word_tokenize(sentence)
    return words

sentences = sentence_tokenizer(text)
for sentence in sentences:
    print(word_tokenizer(sentence))
['nlp', 'is', 'a', 'process', 'where', 'we', 'deal', 'with', 'textual', 'data', '.']
['also', 'it', 'is', 'one', 'of', 'the', 'most', 'emerging', 'technology', 'in', 'artificial', 'intelligence', '.']

Stemming

Words are formed as prefix + morpheme + suffix, and stemming is the process of stripping the prefix and suffix from a word to extract the morpheme. This is also called normalization; for example, the words car and cars both reduce to car. However, stemming is not an ideal normalization step, since it sometimes produces stems that are not dictionary words. This step is done using the following code:

from nltk.stem.porter import PorterStemmer
def stemming(sentence):
    """
    Function to stem word
    """
    ps = PorterStemmer()
    words = word_tokenize(sentence)
    for w in words:
        print(w, " : ", ps.stem(w))

sentence = "sketching and dancing is interesting"
stemming(sentence)
sketching  :  sketch
and  :  and
dancing  :  danc
is  :  is
interesting  :  interest
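Both points from the paragraph above can be seen directly: cars reduces to the real word car, while other words (studies is an extra illustrative example, not from the text above) are cut down to stems that are not in any dictionary:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
print(ps.stem("cars"))     # -> car   (a dictionary word)
print(ps.stem("studies"))  # -> studi (not a dictionary word)
```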

Lemmatization

Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. While this seems close to the definition of stemming, they are in fact different. For example, the adjective "better", when stemmed, remains the same; upon lemmatization, however, it should become "good". Lemmatization requires more linguistic knowledge, and modeling and developing efficient lemmatizers remains an open problem in NLP research even now. Lemmatization is done using the following code:

import nltk
nltk.download('wordnet')  # lexical database required by WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("cats:", lemmatizer.lemmatize("cats"))
cats: cat

Part of Speech Tagging

POS tagging can be really useful, particularly when a word or token can take multiple POS tags. For instance, the word "google" can be used as both a noun and a verb, depending on the context. When processing natural language, it is important to identify this difference. Fortunately, libraries such as NLTK (used below) and spaCy ship with pretrained statistical taggers that, depending on the context (the surrounding words), can return the correct POS tag for a word.

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
def pos_tagger(text):
    """
    Function to tag the part of speech of a word
    """
    words = word_tokenize(text)
    tagged = nltk.pos_tag(words)
    for word, tag in tagged:
        print(word, " : ", tag)
        
text = "nlp is a process where we deal with textual data. also it is one of the most emerging technology in artificial intelligence."
pos_tagger(text)
nlp  :  NN
is  :  VBZ
a  :  DT
process  :  NN
where  :  WRB
we  :  PRP
deal  :  VB
with  :  IN
textual  :  JJ
data  :  NNS
.  :  .
also  :  RB
it  :  PRP
is  :  VBZ
one  :  CD
of  :  IN
the  :  DT
most  :  RBS
emerging  :  JJ
technology  :  NN
in  :  IN
artificial  :  JJ
intelligence  :  NN
.  :  .

Summary

In this blog post we have seen how text is preprocessed in NLP. In the next blog post we will see how to use the preprocessed text in our NLP pipeline.