Text Cleaning in Natural Language Processing
A practical guide to cleaning text data for natural language processing.
- Importance of Text Cleaning
- Types of text cleaning
- Remove newlines & Tabs
- Remove Punctuation
- Remove Numbers
- Remove Stopwords
- Remove HTML Tags
- Remove URLs
- Remove Whitespaces
- Remove Accented Characters
- Make all words lowercase
- Remove Email
- Summary
Importance of Text Cleaning
Natural Language Processing deals with textual data, and raw text usually contains noise and redundant characters that hurt model performance. Properly cleaned data enables better text analysis and more reliable decisions for business problems, which makes text cleaning an essential preprocessing step.
Remove newlines & Tabs
Scraped text often contains newlines and tabs that structure content on a website but serve no purpose in a dataset; they frequently survive as literal \n and \t sequences. The function below removes them.
def remove_newlines_tabs(text):
    """
    This function will remove all the occurrences of newlines, tabs,
    and escaped combinations like \\n and \\t.
    arguments:
        input_text: "text" of type "String".
    return:
        value: "text" after removal of newlines, tabs, \\n, \\ characters.
    Example:
        Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
        Output : This is her first day at this place. Please, Be nice to her.
    """
    # Replace all occurrences of \n, \\n, \t, and \\ with a single space.
    formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t', ' ').replace('\\', ' ')
    # Repair domains that were split apart by the replacements above, e.g. ". com" -> ".com".
    formatted_text = formatted_text.replace('. com', '.com')
    return formatted_text
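For consistency with the later sections, here is a quick call on the string from the docstring example (the leftover double spaces can be collapsed afterwards with remove_whitespace, defined below):

text = "This is her \\ first day at this place.\n Please,\t Be nice to her.\\n"
remove_newlines_tabs(text)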
Remove Punctuation

Punctuation marks usually add noise rather than meaning, so they are commonly stripped before tokenization. Python's built-in string.punctuation constant lists the standard ASCII punctuation characters.

import string
string.punctuation  # -> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

def remove_punctuation(text):
    # Keep only the characters that are not listed in string.punctuation.
    punctuation_free = "".join([i for i in text if i not in string.punctuation])
    return punctuation_free
text = " I love you @ but # you % dont love me"
remove_punctuation(text)
Remove Numbers

text = 'abcd1234efg567'
# Keep only the characters that are not digits.
new_string = ''.join([i for i in text if not i.isdigit()])
print(new_string)  # abcdefg
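An equivalent regex-based sketch, if you prefer re over a character-by-character comprehension (the function name here is illustrative):

import re

def remove_numbers(text):
    # \d+ matches one or more consecutive digit characters.
    return re.sub(r'\d+', '', text)

remove_numbers('abcd1234efg567')  # -> 'abcdefg'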
Remove Stopwords

Stopwords are high-frequency words such as "is", "the", and "a" that usually carry little meaning for text analysis, so removing them shrinks the vocabulary a model has to deal with. NLTK ships a ready-made list of English stopwords.

import nltk
from nltk.corpus import stopwords
from collections import Counter

def remove_stopwords(text):
    stop_words = stopwords.words('english')
    # A Counter is hash-based, so membership checks are O(1) instead of a linear list scan.
    stopwords_dict = Counter(stop_words)
    # Keep only the words that are not stopwords.
    text = ' '.join([word for word in text.split() if word not in stopwords_dict])
    return text
text = " Is there a stopword in this text? "
remove_stopwords(text)
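Two caveats worth sketching: the stopword corpus must be downloaded once before first use, and NLTK's list is all lowercase, so capitalized words like "Is" slip through unless the comparison is lowercased. A minimal variant covering both (the function name is illustrative, not from the original post):

import nltk
nltk.download('stopwords')  # one-time download of the stopword corpus

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set also gives O(1) lookups

def remove_stopwords_case_insensitive(text):
    # Compare in lowercase so capitalized stopwords like "Is" are caught too.
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

remove_stopwords_case_insensitive(" Is there a stopword in this text? ")
# -> 'stopword text?'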
Remove HTML Tags

Text scraped from the web often carries leftover HTML markup. BeautifulSoup parses the markup and returns only the text content.

from bs4 import BeautifulSoup

def strip_html_tags(text):
    """
    This function will remove all the occurrences of html tags from the text.
    arguments:
        input_text: "text" of type "String".
    return:
        value: "text" after removal of html tags.
    Example:
        Input : This is a nice place to live. <IMG>
        Output : This is a nice place to live.
    """
    # Initiate a BeautifulSoup object with Python's built-in html parser.
    soup = BeautifulSoup(text, "html.parser")
    # Get all the text other than html tags.
    stripped_text = soup.get_text(separator=" ")
    return stripped_text
text = "This is a <b>nice</b> place to live. <IMG>"
strip_html_tags(text)
Remove URLs

Links rarely help downstream analysis, so full URLs and bare .com domains are stripped here.

import re

def remove_links(text):
    """
    This function will remove all the occurrences of links.
    arguments:
        input_text: "text" of type "String".
    return:
        value: "text" after removal of all types of links.
    Example:
        Input : To know more about this website: kajalyadav.com visit: https://kajalyadav.com//Blogs
        Output : To know more about this website: visit:
    """
    # Remove all the occurrences of links that start with http or https.
    remove_https = re.sub(r'http\S+', '', text)
    # Remove all the occurrences of bare domains that end with .com.
    remove_com = re.sub(r"\s[A-Za-z]*\.com", " ", remove_https)
    return remove_com
text = "To know more about this website: kajalyadav.com visit: https://kajalyadav.com//Blogs"
remove_links(text)
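The .com pattern above is a deliberate heuristic and misses other top-level domains such as .org or .io. A slightly broader sketch, assuming domains follow the common name.tld shape (the TLD list and function name here are illustrative, not exhaustive):

def remove_links_broad(text):
    # Strip anything that starts with http(s):// or www.
    text = re.sub(r'(https?://\S+|www\.\S+)', '', text)
    # Strip bare domains like example.org (a heuristic, not a full URL grammar).
    text = re.sub(r'\b[A-Za-z0-9-]+\.(com|org|net|io|co)\b', '', text)
    return text

remove_links_broad("See example.org and https://kajalyadav.com//Blogs")
# -> 'See  and '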
Remove Whitespaces

Extra spaces often remain after the earlier cleaning steps, so they are collapsed into single spaces here.

def remove_whitespace(text):
    """
    This function will remove extra whitespaces from the text.
    arguments:
        input_text: "text" of type "String".
    return:
        value: "text" after extra whitespaces are removed.
    Example:
        Input : How are   you   doing ?
        Output : How are you doing ?
    """
    pattern = re.compile(r'\s+')
    without_whitespace = re.sub(pattern, ' ', text)
    # There are some instances where there is no space after '?' or ')',
    # so these are padded with a space so that two words are not treated as one token.
    text = without_whitespace.replace('?', ' ? ').replace(')', ') ')
    return text
text = "How are you doing ?"
remove_whitespace(text)
Remove Accented Characters

Accented characters such as á or ö can confuse models that expect plain ASCII. The unidecode package (pip install unidecode) transliterates them to their closest ASCII equivalents.

import unidecode

def accented_characters_removal(text):
    """
    The function will remove accented characters from the
    text contained within the Dataset.
    arguments:
        input_text: "text" of type "String".
    return:
        value: "text" with removed accented characters.
    Example:
        Input : Málaga, àéêöhello
        Output : Malaga, aeeohello
    """
    # unidecode() takes Unicode data and tries to represent it in ASCII characters.
    text = unidecode.unidecode(text)
    return text
text = "Málaga, àéêöhello"
accented_characters_removal(text)
Make all words lowercase

Lowercasing ensures that words like "World" and "world" are treated as the same token.

def lower_casing_text(text):
    """
    The function will convert text into lower case.
    arguments:
        input_text: "text" of type "String".
    return:
        value: text in lowercase.
    Example:
        Input : The World is Full of Surprises!
        Output : the world is full of surprises!
    """
    # lower() converts all uppercase letters of the given string to lowercase.
    text = text.lower()
    return text
text = "The World is Full of Surprises!"
lower_casing_text(text)
Remove Email

Email addresses are personally identifiable information, so they are stripped with a regular expression.

def remove_email(text):
    """
    The function will remove email addresses from the text.
    arguments:
        input_text: "text" of type "String".
    return:
        value: "text" after removal of email addresses.
    """
    # Match a simple name@domain.tld pattern and drop it.
    required_output = re.sub(r'[A-Za-z0-9]*@[A-Za-z]*\.?[A-Za-z0-9]*', "", text)
    # Collapse any double spaces left behind by the removal.
    required_output = " ".join(required_output.split())
    return required_output
text = "I am dipeshsilwal@gmail.com"
remove_email(text)
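Summary

Putting it all together, the functions above can be chained into one cleaning pipeline. Below is a minimal sketch, assuming all the functions defined earlier in this post are in scope; the ordering is deliberate, since markup, links, and emails should be stripped before punctuation removal mangles their patterns.

def clean_text(text):
    # Structural noise first: escaped characters, HTML, links, emails.
    text = remove_newlines_tabs(text)
    text = strip_html_tags(text)
    text = remove_links(text)
    text = remove_email(text)
    # Character-level normalization next.
    text = accented_characters_removal(text)
    text = lower_casing_text(text)
    text = remove_punctuation(text)
    # Word-level cleanup last.
    text = remove_stopwords(text)
    text = remove_whitespace(text)
    return text

sample = "Visit <b>kajalyadav.com</b>\t or email me@example.com for More Détails!"
print(clean_text(sample))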