Data Acquisition in Natural Language Processing
All about how text data is collected and generated.
General NLP Pipeline
It consists of the steps needed to complete the whole NLP life cycle, and those steps are:
- Data Acquisition
- Text Cleaning
- Pre-Processing
- Feature Engineering
- Modelling
- Evaluation
- Deployment
- Monitoring and model updating
Block Diagram of General NLP pipeline
Data Acquisition
We all know that without data we cannot imagine AI, and since NLP is a part of AI, it obviously needs data. The process of collecting that data is known as Data Acquisition, and it is the first step of the general NLP pipeline we have already seen. There are several methods we can use to obtain data for training our model. We begin with the first one:
Use a Public Dataset
- We can use Google’s specialized search engine for datasets (Google Dataset Search) to look for a dataset. If there is data that meets your requirements, you can download it, use it, and evaluate your model. If not, then what?
- We could find a source of relevant data on the internet—for example, a consumer or discussion forum where people have posted queries (sales or support)—scrape the data from there, and get it labeled by human annotators (a minimal scraping sketch follows this list). The problem is that we cannot always find data that matches what an organisation requires. So what should we do in that situation?
- AI models seldom exist by themselves. They’re developed mostly to serve users via a feature or product. In all such cases, the AI team should work with the product team to collect more and richer data by building better instrumentation into the product. In the tech world, this is called product intervention. But what if we still don’t have enough data?
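As a rough illustration of the scraping route, here is a minimal sketch assuming the requests and beautifulsoup4 packages are installed; the forum URL and the "post-body" CSS class are hypothetical and would need to be adapted to the real site (and its terms of use).

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical forum URL; replace with the real forum you want to scrape.
URL = "https://forum.example.com/support"

response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# The "post-body" class is an assumption; inspect the real page to find
# the element that wraps each user post.
posts = [div.get_text(strip=True)
         for div in soup.find_all("div", class_="post-body")]

# These raw posts would then be sent to human annotators for labeling.
for post in posts[:5]:
    print(post)
```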
While instrumenting products is a great way to collect data, it takes time. Even if you instrument the product today, it can take anywhere between three to six months to collect a decent-sized, comprehensive dataset. So, can we do something in the meantime? NLP has a bunch of techniques through which we can take a small dataset and use some tricks to create more data. These tricks are also called data augmentation, and they try to exploit language properties to create text that is syntactically similar to the source text data. The main tricks are listed below, and small code sketches for each one follow the list:
- Randomly choose a number of words that are not stop words and replace them with their synonyms. For synonyms, we can use Synsets in WordNet. This is called Synonym Replacement.
- Let’s say we have a sentence S1 in English and we want to translate it into a French sentence S2. We can use a translation API to do this. We then translate the French sentence S2 back to English, say S3, and compare it with the original text. We get a sentence that is similar to the original but with variation in the words, and we can add this sentence S3 to our dataset. This is called Back Translation.
- Back translation can lose or change words that carry the original meaning. So, instead, we can use TF-IDF scores to decide which words to replace with synonyms, keeping the informative words intact. I will explain the TF-IDF technique in the next blog; for now, it is a technique to convert text into a numeric representation. This is called TF-IDF–based Word Replacement.
- Let’s take the sentence “I am a nlp engineer”. Now take a bigram, say “nlp engineer”, flip it, and put it back in the sentence in place of the original pair: “I am a engineer nlp”. This is called Bigram Flipping.
- Replace entities like person name, location, organization, etc., with other entities in the same category. That is, replace person name with another person name, city with another city, etc. For example, in “I live in California,” replace “California” with “London.” This is called Replacing Entities.
- In many NLP applications, the incoming data contains spelling mistakes. This is primarily due to the characteristics of the platform where the data is being generated (for example, Twitter). In such cases, we can add a bit of noise to the data to train robust models. For example, randomly choose a word in a sentence and replace it with another word that is close in spelling to the first word. Another source of noise is the “fat finger” problem on mobile keyboards: simulate a QWERTY keyboard error by replacing a few characters with their neighboring characters on the QWERTY keyboard. This is called Adding Noise to the Data.
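A minimal sketch of Synonym Replacement using NLTK’s WordNet Synsets; it assumes NLTK is installed and downloads the WordNet and stopword corpora on first run.

```python
import random
import nltk
from nltk.corpus import wordnet, stopwords

nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def synonym_replace(sentence, n=2):
    """Replace up to n non-stop-words with a random WordNet synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() not in STOP_WORDS]
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(words[i])
            for lemma in syn.lemmas()
            if lemma.name().lower() != words[i].lower()
        }
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
        if replaced >= n:
            break
    return " ".join(words)

print(synonym_replace("The support team resolved my billing issue quickly"))
```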
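A sketch of Back Translation. The translate() helper here is only a placeholder, not a real API; plug in whichever translation service you have access to.

```python
def translate(text, source, target):
    """Placeholder for a real translation API call; the name and
    signature are assumptions, not a specific library's interface."""
    raise NotImplementedError("plug in your translation provider")

def back_translate(s1, pivot="fr"):
    """English -> pivot language -> English paraphrase."""
    s2 = translate(s1, source="en", target=pivot)   # S2: French version
    s3 = translate(s2, source=pivot, target="en")   # S3: paraphrased English
    return s3

# s3 = back_translate("I am unable to log in to my account")
# The paraphrase s3 can then be added back to the training data.
```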
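One way to sketch TF-IDF–based Word Replacement with scikit-learn, under the assumption that words with the lowest IDF scores are the least informative and therefore the safest to replace; the toy corpus, filler words, and the number k are purely illustrative.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "my order arrived late and the package was damaged",
    "the support agent resolved my billing issue quickly",
    "the update broke the login page on my phone",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
# Map each vocabulary word to its IDF score; words shared by many documents
# (e.g. "the", "my") score low and are treated as the least informative.
idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

def replace_uninformative(sentence, fillers, k=2):
    """Replace the k lowest-scoring words, keeping informative words intact."""
    words = sentence.lower().split()
    lowest = set(sorted(words, key=lambda w: idf.get(w, float("inf")))[:k])
    return " ".join(random.choice(fillers) if w in lowest else w for w in words)

print(replace_uninformative(corpus[0], fillers=["a", "that", "some"]))
```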
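Bigram Flipping is just a swap of one adjacent word pair, so the sketch is short:

```python
import random

def bigram_flip(sentence):
    """Pick a random adjacent word pair (bigram) and swap its two words."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(bigram_flip("I am a nlp engineer"))  # e.g. "I am a engineer nlp"
```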
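A sketch of Replacing Entities using spaCy’s named entity recognizer; it assumes the en_core_web_sm model has been downloaded, and the substitution table is only an illustrative example.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Illustrative substitutes; in practice draw them from a larger entity list.
SUBSTITUTES = {"GPE": "London", "PERSON": "Alice", "ORG": "Acme Corp"}

def replace_entities(text):
    """Replace each recognized entity with another entity of the same type."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in SUBSTITUTES:
            out.append(text[last:ent.start_char])
            out.append(SUBSTITUTES[ent.label_])
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(replace_entities("I live in California"))  # -> "I live in London"
```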
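Finally, a sketch of Adding Noise to the Data that simulates the “fat finger” problem; the QWERTY neighbour map below covers only a handful of keys and would need to be extended for real use.

```python
import random

# Partial QWERTY neighbour map; extend it to cover the full keyboard.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "e": "wsdr", "i": "ujko",
    "n": "bhjm", "o": "iklp", "r": "edft", "t": "rfgy",
}

def add_keyboard_noise(sentence, n=1):
    """Replace n characters with a neighbouring key to simulate typos."""
    chars = list(sentence)
    positions = [i for i, c in enumerate(chars) if c.lower() in QWERTY_NEIGHBOURS]
    for i in random.sample(positions, min(n, len(positions))):
        chars[i] = random.choice(QWERTY_NEIGHBOURS[chars[i].lower()])
    return "".join(chars)

print(add_keyboard_noise("my order has not arrived", n=2))
```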
Summary
In today’s post we saw the methods that can be used to collect data or to generate it artificially through data augmentation, using tricks like Synonym Replacement, Back Translation, TF-IDF–based Word Replacement, Bigram Flipping, Replacing Entities, and Adding Noise to the Data. In the next post we will see how to clean the data.