Text Preprocessing in Natural Language Processing
All about how to prepare text data so that it can be converted into numbers.
- Why do we need to preprocess data?
- Sentence Tokenization
- Word Tokenization
- Stemming
- Lemmatization
- Part of Speech Tagging
- Summary
Why do we need to preprocess data?
NLP software analyzes text by breaking it into sentences and words. We need a reliable NLP pipeline in which a text is split into sentences and then into words so that we can analyze it. A text preprocessing pipeline involves several steps (a minimal preview sketch follows the list):
- Sentence Tokenization
- Word Tokenization
- Stemming
- Lemmatization
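Each of these steps is covered in its own section below. As a rough preview, here is a minimal sketch of how the first two steps chain together, using NLTK (the pipeline helper name is my own, not a library function):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time download of the Punkt sentence tokenizer models.
nltk.download('punkt')

def pipeline(text):
    """Hypothetical helper: split a text into sentences, then each sentence into words."""
    return [word_tokenize(sentence) for sentence in sent_tokenize(text)]

pipeline("nlp deals with textual data. it is an emerging technology.")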
Sentence Tokenization
Once we get textual data and clean it as mentioned in my previous post Text Cleaning, we have to perform sentence tokenization to convert a corpus of text into a list of sentences. This step is done using the following code:
import nltk
from nltk.tokenize import sent_tokenize

# One-time download of the Punkt models used by sent_tokenize.
nltk.download('punkt')

def sentence_tokenizer(text):
    """Function to tokenize a text into sentences."""
    sentences = sent_tokenize(text)
    return sentences
text = "nlp is a process where we deal with textual data. also it is one of the most emerging technology in artificial intelligence."
sentence_tokenizer(text)
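If the Punkt model behaves as expected on this lowercase example, the call should return a list of two sentences: ['nlp is a process where we deal with textual data.', 'also it is one of the most emerging technology in artificial intelligence.'].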
Word Tokenization
After splitting a text into sentences, the next step is to split each sentence into individual words, or tokens. This step is done using the following code:
from nltk.tokenize import word_tokenize

def word_tokenizer(sentence):
    """Function to tokenize a sentence into words."""
    words = word_tokenize(sentence)
    return words

sentences = sentence_tokenizer(text)
for sentence in sentences:
    print(word_tokenizer(sentence))
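For the first sentence this should print ['nlp', 'is', 'a', 'process', 'where', 'we', 'deal', 'with', 'textual', 'data', '.']; note that word_tokenize treats the trailing punctuation as its own token.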
Stemming
Words are formed as prefix + morpheme + suffix, and stemming is the process of removing the prefix and suffix from a word to extract the morpheme. This is also called normalization; for example, the words car and cars both reduce to car. But stemming is not an ideal way to normalize, since it sometimes produces words that are not in the dictionary. This step is done using the following code:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

def stemming(sentence):
    """Function to stem each word in a sentence."""
    ps = PorterStemmer()
    words = word_tokenize(sentence)
    for w in words:
        print(w, " : ", ps.stem(w))
sentence = "sketching and dancing is interesting"
stemming(sentence)
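With the Porter stemmer this should print mappings along the lines of sketching : sketch, dancing : danc, and interesting : interest. Notice that danc is not a dictionary word, which illustrates the limitation mentioned above.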
Lemmatization
Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. While this seems close to the definition of stemming, they are, in fact, different. For example, the adjective “better,” when stemmed, remains the same. However, upon lemmatization, it should become “good.” Lemmatization requires more linguistic knowledge, and modeling and developing efficient lemmatizers remains an open problem in NLP research even now. Lemmatization is done using the following code:
import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data used by the lemmatizer.
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print("cats", ":", lemmatizer.lemmatize("cats"))
Part of Speech Tagging
POS tagging can be really useful, particularly if you have words or tokens that can have multiple POS tags. For instance, the word "google" can be used as both a noun and a verb, depending upon the context. While processing natural language, it is important to identify this difference. Fortunately, libraries such as NLTK and spaCy come with pre-trained models that, depending upon the context (the surrounding words), can return the correct POS tag for a word. The code below uses NLTK's averaged perceptron tagger:
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize

def pos_tagger(text):
    """Function to tag the part of speech of each word in a text."""
    words = word_tokenize(text)
    tagged = nltk.pos_tag(words)
    for word, tag in tagged:
        print(word, " : ", tag)
text = "nlp is a process where we deal with textual data. also it is one of the most emerging technology in artificial intelligence."
pos_tagger(text)
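The printed tags come from the Penn Treebank tagset: for example, NN marks a singular noun, VBZ a third-person singular present verb, and JJ an adjective.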