Let's deal with some text data
In recent posts, I mentioned one indisputable fact about current machine learning models: they deal with numbers. Images, for example, exist digitally as matrices of numbers (pixel intensities ranging from 0 to 1 or from 0 to 255). Models don't understand string variables, Boolean variables (unless encoded as 0/1), or categorical variables. The input is always numbers. That's why we come up with preprocessing steps to manage this.
Now you might say: hey, wait, Google's translation AI takes text as input. That's true. Well, partially true. It preprocesses the text you feed it before running any inference on it. Just like any other non-numeric variable, text data must be preprocessed.
This brings us to a fairly hyped branch of AI: Natural Language Processing (NLP). This branch of AI deals with analyzing and processing text data in the hope of extracting information from text files and datasets. Some applications include text summarization, sentiment classification, speech recognition, etc. The more the NLP field advances, the better a machine can interpret a human being.
In this post, we'll explore two parts of the NLP process:
Preprocessing text data: This covers some tips and steps for cleaning up the text data before transforming it.
Transforming text data for model input: This covers a few ways of transforming the text data into a form that machine learning models can interpret.
I- Preprocessing text data.
1- Lowercasing the data:
If I were to ask you what the difference is between "Word" and "word" in terms of meaning, chances are you'd say they're the same thing. To a machine that doesn't understand words, they aren't the same: they are two different strings. One way to get around this is to make sure that all text is fed to the transformation steps in lowercase only. This step removes redundant variants of the same word. To achieve this, this sample code is more than enough.
text = text.lower()
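To see what this buys you, here is a minimal sketch in plain Python. The sample sentence is made up for illustration, and splitting on whitespace is only an assumption to keep the example short (real tokenization is usually more careful).

```python
# Hypothetical sample sentence, invented for this illustration.
text = "Word embeddings map each word to a vector. Word order matters."

# Naive whitespace split (an assumption; real tokenizers do more than this).
tokens_raw = text.split()
tokens_lower = text.lower().split()

# Build the set of distinct tokens with and without lowercasing.
vocab_raw = set(tokens_raw)
vocab_lower = set(tokens_lower)

# Without lowercasing, "Word" and "word" are counted as two different tokens.
print("Word" in vocab_raw and "word" in vocab_raw)
print(len(vocab_raw), len(vocab_lower))
```

The lowercased vocabulary comes out smaller because the two spellings collapse into a single entry, which is exactly the redundancy this step is meant to remove.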
In a dataframe, you simply map this function to each row in the column of interest.
df["text&q