Importance of Text Cleaning

Natural Language Processing works with textual data, and raw text usually carries redundant content (markup, links, stray characters) that adds noise and hurts model performance. Cleaning the text before analysis leads to more reliable results and, in turn, better decisions for the business problem at hand. That makes text cleaning an important first step.

Types of text cleaning

  • Remove newlines & tabs
  • Remove punctuation
  • Remove numbers
  • Remove stop words
  • Remove HTML tags
  • Remove URLs
  • Remove whitespaces
  • Remove accented characters
  • Make all words lowercase
  • Remove emails
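
The order in which these steps run matters: URLs and email addresses have to be removed before punctuation, because stripping punctuation first destroys the "://", "@", and "." that the patterns rely on. The sketch below shows one possible ordering using only the standard library. The function name clean_text and the simplified regular expressions are assumptions made for illustration, not the functions defined in the rest of this post, and the stop word and accent steps are left out to keep it short.

```python
import re
import string

# Simplified, illustrative patterns (assumptions, not exhaustive matchers).
TAG_PATTERN = re.compile(r'<[^>]+>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EMAIL_PATTERN = re.compile(r'\S+@\S+\.\S+')


def clean_text(text):
    """Apply the cleaning steps in an order that keeps each pattern usable."""
    text = TAG_PATTERN.sub(' ', text)      # 1. HTML tags
    text = URL_PATTERN.sub(' ', text)      # 2. URLs, while '://' is still intact
    text = EMAIL_PATTERN.sub(' ', text)    # 3. emails, while '@' is still intact
    text = text.lower()                    # 4. lowercase
    # 5. punctuation
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    # 6. numbers
    text = ''.join(ch for ch in text if not ch.isdigit())
    # 7. newlines, tabs, and extra spaces
    return ' '.join(text.split())


sample = "<p>Mail me@example.com or visit https://example.com NOW!\n\t42 times</p>"
print(clean_text(sample))
```

Running the punctuation step before steps 2 and 3 would turn "https://example.com" into "httpsexamplecom", which neither pattern can recognize any more.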

Remove newlines & Tabs

Scraped text often contains many newlines and tabs. They structure the content on the web page, but in a dataset they serve no purpose and show up as characters such as \n and \t, or as stray backslashes left behind by escaping. The function below replaces all of them with a space.

def remove_newlines_tabs(text):
    r"""
    This function will remove all the occurrences of newlines, tabs, and
    literal backslash sequences such as \n and \.

    arguments:
        text: input text of type "String".

    return:
        value: "text" with newlines, tabs, \n and \ replaced by a space.

    Example:
    Input : This is her \ first day at this place.<newline> Please,<tab> Be nice to her.\n
    Output (once extra spaces are collapsed) : This is her first day at this place. Please, Be nice to her.

    """

    # Replace the literal text \n first, then real newlines, tabs, and any
    # remaining backslashes, each with a single space.
    formatted_text = (text.replace('\\n', ' ').replace('\n', ' ')
                          .replace('\t', ' ').replace('\\', ' '))
    # Rejoin domains that were written as ". com".
    return formatted_text.replace('. com', '.com')

Remove Punctuation

In this step, all punctuation is removed from the text. Python's string module provides a predefined set of ASCII punctuation characters in string.punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

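import string

# string.punctuation holds the ASCII punctuation characters.
string.punctuation

def remove_punctuation(text):
    # Keep every character that is not ASCII punctuation.
    punctuation_free = "".join(ch for ch in text if ch not in string.punctuation)
    return punctuation_free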

text = " I love you @ but # you % dont love me"
remove_punctuation(text)
' I love you  but  you  dont love me'
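
For long texts, the character-by-character comprehension can be swapped for a translation table. The sketch below is an alternative to the function above, not a replacement for it; the name remove_punctuation_fast is a placeholder chosen for illustration. Like string.punctuation itself, it only covers ASCII punctuation, so characters such as curly quotes are left in place.

```python
import string

# Build the table once: every ASCII punctuation character is mapped to None,
# which tells str.translate to delete it.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation_fast(text):
    """Delete all ASCII punctuation using str.translate."""
    return text.translate(PUNCT_TABLE)


print(remove_punctuation_fast("Hello, world! (again)"))
```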

Remove Numbers

Text often contains numbers that served a purpose in the original source but carry little meaning for the analysis. When that is the case, we remove them.

# Named "text" rather than "string" so the string module imported earlier is not shadowed.
text = 'abcd1234efg567'
new_text = ''.join([ch for ch in text if not ch.isdigit()])
print(new_text)
abcdefg
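
The same result can be had with a regular expression, which also makes it easy to drop whole number tokens such as 3.14 or 1,000 in one pass. This is a minimal sketch, assuming decimal points and thousands separators should disappear along with the digits; both function names are illustrative.

```python
import re


def remove_digits(text):
    """Delete every digit character, leaving the surrounding letters in place."""
    return re.sub(r'\d', '', text)


def remove_number_tokens(text):
    """Delete whole numbers, including inner '.' or ',' separators."""
    return re.sub(r'\d+(?:[.,]\d+)*', '', text)


print(remove_digits('abcd1234efg567'))
print(remove_number_tokens('Price: 1,000.50 for 3 items'))
```

The second function leaves double spaces where a number used to be, so it is best followed by the whitespace step.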

Remove Stopwords

Stop words such as a, an, and the are needed to form sentences but contribute little information to a model. So, we remove them.

import nltk
from nltk.corpus import stopwords

# The stop word list must be downloaded once: nltk.download('stopwords')

def remove_stopwords(text):
    # A set gives fast membership checks.
    stop_words = set(stopwords.words('english'))
    # The comparison is case-sensitive, so "Is" is kept while "is" would be dropped.
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

text = " Is there a stopword in this text? "
remove_stopwords(text)
'Is stopword text?'
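
The comparison above is case-sensitive, which is why "Is" survives in the output. The sketch below shows one way to match case-insensitively. To stay self-contained it uses a tiny hand-written stop word set (DEMO_STOP_WORDS, an assumption made for illustration); in practice the NLTK list would take its place.

```python
# Hypothetical miniature stop word list; swap in stopwords.words('english').
DEMO_STOP_WORDS = {'is', 'there', 'a', 'in', 'this', 'the', 'an'}


def remove_stopwords_ci(text, stop_words=DEMO_STOP_WORDS):
    """Drop stop words, comparing in lowercase so 'Is' and 'is' both match."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)


print(remove_stopwords_ci(" Is there a stopword in this text? "))
```

Words with attached punctuation, such as "text?", are still compared as-is, so removing punctuation first gives cleaner matches.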

Remove HTML Tags

When we scrape text from a website, HTML tags come along with it. They play no important role in model training, so we remove them.

from bs4 import BeautifulSoup
def strip_html_tags(text):
    """ 
    This function will remove all the occurrences of html tags from the text.
    
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after removal of html tags.
        
    Example:
    Input : This is a nice place to live. <IMG>
    Output : This is a nice place to live.  
    """
    # Parse the text with Python's built-in "html.parser".
    soup = BeautifulSoup(text, "html.parser")
    # Collect the text between the tags, joining the pieces with a space.
    stripped_text = soup.get_text(separator=" ")
    return stripped_text
text = "This is a <b>nice</b> place to live. <IMG>"
strip_html_tags(text)
'This is a  nice  place to live. '
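
If adding BeautifulSoup as a dependency is not an option, the standard library's html.parser module can do a rough version of the same job. This is a sketch only: the class name _TextExtractor and the function name strip_html_tags_stdlib are made up for this example, and the sketch does not skip the contents of script or style elements.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of plain text found between tags.
        self.chunks.append(data)


def strip_html_tags_stdlib(text):
    """Return the text content of an HTML fragment, pieces joined by a space."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return ' '.join(parser.chunks)


print(strip_html_tags_stdlib("This is a <b>nice</b> place to live. <IMG>"))
```

As with the BeautifulSoup version, joining the pieces with a space can leave double spaces, which the whitespace step cleans up.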

Remove URLs

As noted above, scraped text also contains URLs. They play no important role in model training and tend to degrade model performance, so we remove them.

import re
def remove_links(text):
    """
    This function will remove all the occurrences of links.
    
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after removal of all types of links.
        
    Example:
    Input : To know more about this website: kajalyadav.com  visit: https://kajalyadav.com//Blogs
    Output : To know more about this website: visit:     
    
    """
    
    # Remove every link that starts with http (this also covers https).
    without_http = re.sub(r'http\S+', '', text)
    # Replace bare domains that follow a space and end with .com by a single space.
    without_com = re.sub(r" [A-Za-z]*\.com", " ", without_http)
    return without_com
text = "To know more about this website: kajalyadav.com  visit: https://kajalyadav.com//Blogs"
remove_links(text)
'To know more about this website:   visit: '
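
The second pattern above only catches bare domains that end in .com and follow a space. A slightly broader sketch is shown below. The names SCHEME_URL, BARE_DOMAIN, and remove_links_broad are illustrative, and the list of top-level domains is an assumption that should be extended to match the data at hand.

```python
import re

# Links with a scheme (http/https) or a leading "www.".
SCHEME_URL = re.compile(r'(?:https?://|www\.)\S+')
# Bare domains such as "example.io/repo"; the TLD list is illustrative only.
BARE_DOMAIN = re.compile(r'\b[\w-]+(?:\.[\w-]+)*\.(?:com|org|net|io)\b(?:/\S*)?')


def remove_links_broad(text):
    """Remove scheme-prefixed URLs first, then bare domain names."""
    text = SCHEME_URL.sub('', text)
    return BARE_DOMAIN.sub('', text)


sample = "Docs at www.example.org, code on example.io/repo and https://example.com/x"
print(remove_links_broad(sample))
```

Because BARE_DOMAIN also matches the domain part of an email address, email removal should run before this step.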

Remove Whitespaces

Extra whitespace can be removed with a short function, shown below. Doing this before further NLP tasks keeps tokenization consistent.

def remove_whitespace(text):
    """ This function will remove 
        extra whitespaces from the text
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after extra whitespaces removed .
        
    Example:
    Input : How   are   you   doing   ?
    Output : How are you doing ?     
        
    """
    # Some texts have no space after '?' or ')', so pad them first;
    # otherwise two words would be read as one token.
    padded = text.replace('?', ' ? ').replace(')', ') ')
    # Collapse every run of whitespace into a single space and trim the ends.
    return re.sub(r'\s+', ' ', padded).strip()
text = "How   are   you   doing   ?"
remove_whitespace(text)
'How are you doing ?'
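
When only the whitespace needs normalizing, without any punctuation padding, str.split() with no arguments already splits on any run of whitespace and discards leading and trailing blanks. A minimal sketch (the function name is illustrative):

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return ' '.join(text.split())


print(normalize_whitespace("  How   are\tyou \n doing   ?  "))
```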

Remove Accented Characters

This step converts accented characters into their plain ASCII equivalents so that, for example, "Málaga" and "Malaga" are treated as the same word in later steps. Accented characters are characters such as â, î, or ô that carry diacritics.

import unidecode
def accented_characters_removal(text):
    """
    The function will remove accented characters from the 
    text contained within the Dataset.
       
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" with removed accented characters.
        
    Example:
    Input : Málaga, àéêöhello
    Output : Malaga, aeeohello    
        
    """
    # Remove accented characters from text using unidecode.
    # unidecode() takes Unicode text and transliterates it to the closest ASCII characters.
    text = unidecode.unidecode(text)
    return text
text = "Málaga, àéêöhello"
accented_characters_removal(text)
'Malaga, aeeohello'
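
unidecode is a third-party package. Where only Latin letters with diacritics are expected, the standard library's unicodedata module allows a dependency-free sketch of the same idea: decompose each character, then drop the combining marks. Unlike unidecode, this leaves characters such as ß or ø unchanged, because they have no decomposition. The function name strip_accents is illustrative.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Málaga, àéêöhello"))
```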

Make all words lowercase

To a machine, the same word in different cases (for example "World" and "world") is two different tokens, so we convert all words to lowercase.

def lower_casing_text(text):
    
    """
    The function will convert text into lower case.
    
    arguments:
         input_text: "text" of type "String".
         
    return:
         value: text in lowercase
         
    Example:
    Input : The World is Full of Surprises!
    Output : the world is full of surprises!
    
    """
    # Convert text to lower case
    # lower() - It converts all uppercase letters of the given string to lowercase.
    text = text.lower()
    return text
text = "The World is Full of Surprises!"
lower_casing_text(text)
'the world is full of surprises!'
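
For text that is not purely English, str.casefold() is a more aggressive variant of lower() intended for caseless comparison. The comparison below is a small illustration; the helper name same_ignoring_case is made up for the example.

```python
def same_ignoring_case(a, b):
    """Compare two strings without regard to case, using casefold()."""
    return a.casefold() == b.casefold()


print("Straße".lower())     # lower() keeps the 'ß'
print("Straße".casefold())  # casefold() expands it to 'ss'
print(same_ignoring_case("STRASSE", "Straße"))
```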

Remove Email

Email addresses are rarely useful to a model, so we remove them.

def remove_email(text):
    """
    The function will remove email addresses from the text.
    
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after removal of email addresses.
    """
    # Match a local part, '@', and a domain containing at least one dot.
    required_output = re.sub(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+', "", text)
    required_output = " ".join(required_output.split())
    return required_output
text = "I am dipeshsilwal@gmail.com"
remove_email(text)
'I am'
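
Deleting an address outright can leave a sentence without its object. A variant worth considering is to replace each address with a placeholder token so the sentence structure is kept. The sketch below assumes a simplified address pattern; the names EMAIL_PATTERN and mask_email and the token EMAIL are chosen for illustration.

```python
import re

# Simplified pattern (an assumption): local part, '@', domain with at least one dot.
EMAIL_PATTERN = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')


def mask_email(text, token='EMAIL'):
    """Replace each email address with a placeholder token."""
    return EMAIL_PATTERN.sub(token, text)


print(mask_email("Write to first.last@mail.example.com or admin@example.org today"))
```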

Summary

In this post we saw how to clean textual data. In the next post I will describe the different pre-processing techniques that can be applied to it.