
Introduction to Natural Language Processing (NLP)

"Natural language processing is an area of research in computer science and artificial intelligence (AI) concerned with processing natural languages such as English or Mandarin. This processing generally involves translating natural language into data (numbers) that a computer can use to learn about the world. And this understanding of the world is sometimes used to generate natural language text that reflects that understanding."- Natural Language Processing in action, pg. 4.

This definition sets the scope of NLP within artificial intelligence: we translate natural language into data, analyse that data, and draw insights from it.

We will import two packages: re and nltk.tokenize. From nltk.tokenize we grab the word_tokenize and sent_tokenize functions.

# Libraries
import re
from nltk.tokenize import word_tokenize, sent_tokenize

We create a sample text to work with.

sample_text = "My name is Ntandoyenkosi Matshisela, a 30 year old data analyst from Zimbabwe. I did my Masters degree at the National University of Science and Technology, Zimbabwe, where I specialised in Operations Research and Statistics, 2018. Prior to this I did a BSc in Operations Research and Statistics finishing it in 2015. My research interests are statistical computing and machine learning.😊 \n Most of the times, like 5 times a week, I tweet about #Python, #Rstats and #R. The tweeter handle is @matshisela😂. \n ◘Ŧ, ₦ ✔ \n I love gifts🎁, pizza 🍕, sandwich🥪"

Search and Match

We can search or match for words and numbers. For word characters we use “\w” and for digits we use “\d”. Adding a plus sign (+) matches one or more consecutive characters rather than a single one.

# Let us match words
word_regex = r"\w+"
print(re.match(word_regex, sample_text))
number_regex = r"\d+"
print(re.search(number_regex, sample_text))
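It helps to see the difference between re.match, which only matches at the start of the string, and re.search, which scans the whole string for the first match. A quick sketch on a made-up string:

```python
import re

text = "I am 30 years old"

# re.match anchors at the start of the string: it fails here
print(re.match(r"\d+", text))  # None, because the text starts with a letter

# re.search scans the whole string and finds the first digits
print(re.search(r"\d+", text).group())  # '30'
```

This is why the word pattern above works with re.match (the text starts with a word character) while the digit pattern needs re.search.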


We can search for capitalised words by including [A-Z] in our pattern; the \w+ that follows captures the rest of each word.

# Match words that start with a capital letter
capital_words = r"[A-Z]\w+"
print(re.findall(capital_words, sample_text))

We can look for lowercase words by:

lower_cases = r"[a-z]\w+"
print(re.findall(lower_cases, sample_text))

Similarly digits can be found by:

digits = r"\d+"
print(re.findall(digits, sample_text))

# Looking for words with hashtags
hashtag_regex = r"#\w+"
print(re.findall(hashtag_regex, sample_text))
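Regex can also split text into sentences before we turn to nltk. A small sketch on a made-up string, assuming sentences end in ., ! or ?:

```python
import re

text = "My name is Ntandoyenkosi. I tweet about #Python! Do you?"

# Split on sentence-ending punctuation
sentence_endings = r"[.!?]"
print(re.split(sentence_endings, text))
```

Note that re.split drops the punctuation itself and leaves a trailing empty string; the nltk sentence tokenizer below handles these edge cases for us.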

Word Tokenization

The nltk library can do everything we have done above, and more. We download the required tokenizer models by running the following.

import nltk

We tokenize the text into words, and can inspect the first 20 tokens, by:

word_tokens = word_tokenize(sample_text)
print(word_tokens[:20])

Another cool thing is that we can split the text into sentences:

sent_tokens = sent_tokenize(sample_text)

As you can see, we have split the text by sentences.

Advanced Tokenization

Another source of data is Twitter, where people post a lot of short text. People tag others using @, for example @matshisela, and use # to highlight a topic, e.g. #Rstats. To analyse such text we use the following:

from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

We can use the following pattern to filter out the hashtagged topics:

pattern1 = r"#\w+"
hashtags = regexp_tokenize(sample_text, pattern1)

We can capture both the # hashtags and the @ mentions with one pattern:

pattern2 = r"[@#]\w+"
mention_hashtags = regexp_tokenize(sample_text, pattern2)

The TweetTokenizer can be instantiated with TweetTokenizer() and applied to the text directly:

tknzr = TweetTokenizer()
all_tokens = tknzr.tokenize(sample_text)
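When you have a collection of tweets rather than one string, you tokenize each one in a list comprehension. A sketch with a hypothetical, made-up list of tweets:

```python
from nltk.tokenize import TweetTokenizer

# Hypothetical tweets, made up for illustration
tweets = ["I love #Rstats!", "Thanks @matshisela :)"]

tknzr = TweetTokenizer()
# One token list per tweet; hashtags, mentions and emoticons stay intact
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)
```

Unlike word_tokenize, the TweetTokenizer keeps #Rstats and @matshisela as single tokens instead of splitting off the # and @.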

Emojis can also be captured, using their Unicode ranges:

## Non-ASCII tokenization
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(sample_text, emoji))

Charting Words

We can chart the distribution of word lengths. The following code plots a histogram of this distribution.

import matplotlib.pyplot as plt
words = word_tokenize(sample_text)
word_lengths = [len(w) for w in words]
plt.hist(word_lengths)

Word counts with bag of words

We can also count how many times each word has been used, and rank the words from most used to least used.

## Word counts with bag of words
from collections import Counter
# Let's convert words to lower case so that the same word is counted consistently
tokens = word_tokenize(sample_text)
lower_tokens = [t.lower() for t in tokens]
word_number = Counter(lower_tokens)
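Counter's most_common method returns the tokens sorted by frequency, which is how we rank the words. A minimal sketch on a toy string, using a plain split instead of the full tokenizer:

```python
from collections import Counter

toy_text = "research is fun and research is rewarding"
counts = Counter(toy_text.lower().split())

# Most frequent tokens come first; ties keep first-seen order
print(counts.most_common(2))  # [('research', 2), ('is', 2)]
```

Running most_common on word_number above works the same way, just over the full token list.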

The counts are dominated by commas and other punctuation, which doesn't tell us much. Text preprocessing will give us a more conceptual picture.

Simple Text Preprocessing

There are three preprocessing tasks one can do:

  1. Lemmatization

  2. Lowercasing

  3. Removing unwanted tokens

We can keep only the alphabetic tokens by using:

# Let us look for alphabetic words
alpha_only = [t for t in lower_tokens if t.isalpha()]

As you can see, we have removed the commas and other non-alphabetic tokens.

We can remove English stop words, such as those in the list below, to get a sense of the usage of the content words:

# Let us remove stop words, shall we
english_stops = ['the', 'they', 'i', 'my', 'to', 'and', 'a', 'in', 'is', 'did', 'of']
no_stops = [t for t in alpha_only if t not in english_stops]

We can then use the WordNet lemmatizer to reduce words to their base form and count the most used words:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
# Bag of words
word_number2 = Counter(lemmatized)
print(word_number2.most_common(10))

As you can see, the most used word is research, which reflects what I do. Another striking one is Zimbabwe, where I come from. Now imagine a larger text file: many more insights could be derived.

The full code and output are available here

The concepts covered here come from the DataCamp NLP course.

