Tokenization, Stopwords, Stemming, and PoS Tagging (with code) — Part 1
Nov 24, 2020 · 6 min read
NLTK (Natural Language Toolkit) is the go-to API for NLP (Natural Language Processing) with Python. It is a really powerful tool for preprocessing text data for further analysis, for example with ML models. It helps convert text into a form that can easily be turned into numbers for a model to work with. This is the first part of a basic introduction to NLTK, meant to get your feet wet, and it assumes some basic knowledge of Python.
First, you want to install NLTK using pip (or conda). The command is pretty straightforward on both Mac and Windows: pip install nltk. If this does not work, take a look at the installation page of the NLTK documentation. Note that NLTK requires at least version 3.5 of Python.
To check if NLTK is installed properly, just type import nltk in your IDE. If it runs without any error, congrats! But hold up, there is still a bunch of stuff to download and install. After importing, continue to the next line, type nltk.download(), and run the script. An installation window will pop up. Select all and click 'Download' to download and install the additional bundles. This fetches all the corpora, grammars, and models necessary for full NLTK functionality. NLTK supports English most extensively; other languages like Spanish or French are not covered as thoroughly. Now we are ready to process our first natural language.
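If you prefer to skip the pop-up window, the same downloads can be triggered straight from code. This is just a quick sketch: grabbing everything is the simplest option, and the individual packages listed are the standard resource names for the pieces this article relies on.

import nltk

nltk.download('all')  # equivalent to selecting 'all' in the download window

# ...or fetch only what this article uses:
nltk.download('punkt')                       # tokenizer models
nltk.download('stopwords')                   # stop-word lists
nltk.download('averaged_perceptron_tagger')  # PoS tagger model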
One of the most basic things we want to do is divide a body of text into words or sentences. This is called tokenization.
from nltk import word_tokenize, sent_tokenize

sent = "I will walk 500 miles and I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door."

print(word_tokenize(sent))
print(sent_tokenize(sent))

output:
['I', 'will', 'walk', '500', 'miles', 'and', 'I', 'would', 'walk', '500', 'more', ',', 'just', 'to', 'be', 'the', 'man', 'who', 'walks', 'a', 'thousand', 'miles', 'to', 'fall', 'down', 'at', 'your', 'door', '.']
['I will walk 500 miles and I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door.']
We get the body of text elegantly converted into a list. Writing this tokenization by hand would mean a lot of fiddly work with regular expressions! You may wonder about the punctuation marks, though; that is something we will have to take care of separately. Also note that since our example is a single sentence, sent_tokenize simply returns it as a one-element list. We could also use other tokenizers like the PunktSentenceTokenizer, a pre-trained unsupervised ML model that we can even train on our own dataset (a quick sketch follows below). Keep an eye out for my future articles. **insert shameless self-promoting call to follow** :3
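As a rough illustration of that idea, here is a minimal sketch of training a Punkt tokenizer on our own text. The training string here is just a toy placeholder; in practice you would feed it a large, domain-specific corpus.

from nltk.tokenize import PunktSentenceTokenizer

# Toy training text (placeholder); a real use case would train on a big corpus
train_text = "Dr. Smith went home. He was very tired. Mr. Jones stayed at work."

# The constructor trains the unsupervised Punkt model on the text we pass in
custom_tokenizer = PunktSentenceTokenizer(train_text)

print(custom_tokenizer.tokenize("I met Dr. Smith today. He says hi."))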
Stop-words are basically words that don't carry much meaning on their own, for instance 'and', 'a', 'it', 'they', etc. They matter when we communicate with each other, but for analysis by a computer they are not really that useful (well, they probably could be, but to be honest, algorithms are not yet clever enough to decipher their contextual impact accurately). Let's see an example:
from nltk.corpus import stopwords   # the corpus module is an
                                    # extremely useful one.
                                    # More on that later.

stop_words = stopwords.words('english')   # the full list of all
                                          # stop-words stored in NLTK

token = word_tokenize(sent)
cleaned_token = []
for word in token:
    if word not in stop_words:
        cleaned_token.append(word)

print("This is the unclean version:", token)
print("This is the cleaned version:", cleaned_token)

output:
This is the unclean version: ['I', 'will', 'walk', '500', 'miles', 'and', 'I', 'would', 'walk', '500', 'more', ',', 'just', 'to', 'be', 'the', 'man', 'who', 'walks', 'a', 'thousand', 'miles', 'to', 'fall', 'down', 'at', 'your', 'door', '.']
This is the cleaned version: ['I', 'walk', '500', 'miles', 'I', 'would', 'walk', '500', ',', 'man', 'walks', 'thousand', 'miles', 'fall', 'door', '.']
As you can see, many of the words like 'will' and 'and' are removed. This saves a lot of computation, and hence time, when we shove large bodies of text full of "fluff" words into an ML model. Notice, though, that 'I' survived: the stop-word list is all lowercase and the membership check is case-sensitive, so capitalised tokens slip through (a quick fix is sketched below).
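A minimal tweak, assuming we are happy to compare tokens in lowercase, is to normalise each word before checking it against the list. Turning the list into a set also makes the lookup faster.

# Case-insensitive stop-word filtering, written as a list comprehension
stop_words = set(stopwords.words('english'))
cleaned_token = [word for word in token if word.lower() not in stop_words]
print(cleaned_token)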
Next up is stemming. This is when 'fluff' letters (not words) are removed from a word, so that words get grouped together under their "stem form". For instance, the words 'play', 'playing', and 'plays' convey essentially the same meaning (again, not exactly, but for analysis with a computer that level of detail is still not a viable option). So instead of keeping them as different words, we can put them together under the same umbrella term 'play'.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['play', 'playing', 'plays', 'played',
         'playfullness', 'playful']
stemmed = [stemmer.stem(word) for word in words]
print(stemmed)

output:
['play', 'play', 'play', 'play', 'playful', 'play']
We used the PorterStemmer, a pre-written stemmer class, and sort of the simplest one available. 'Play' and 'playful' should have been recognised as two different words, however. Notice how the last 'playful' got reduced to 'play' and not 'playful'. This is where the simplicity of the PorterStemmer is undesirable. You can also train your own stemmer using unsupervised clustering or supervised classification ML models. NLTK ships other stemmers too, like the SnowballStemmer and the LancasterStemmer; a quick sketch of how they are used follows below.
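For reference, here is a minimal sketch of the other two stemmers. SnowballStemmer implements the improved "Porter2" algorithm, while LancasterStemmer is the most aggressive of the three, so the exact stems they produce will differ slightly from PorterStemmer's.

from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer('english')   # Snowball needs a language argument
lancaster = LancasterStemmer()          # Lancaster tends to cut the most off

for word in ['playing', 'playful', 'playfullness']:
    print(word, '->', snowball.stem(word), '|', lancaster.stem(word))

Now let's stem an actual sentence with the PorterStemmer!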
sent2 = "I played the play playfully as the players were playing in
the play with playfullness"
token = word_tokenize(sent2)
stemmed = ""
for word in token:
stemmed += stemmer.stem(word) + " "
print(stemmed)output:
I play the play play as the player were play in the play with playful .
This stemmed text can now be processed or analysed further much more efficiently. Pretty neat, right?! (As a side note, the same loop can be written more compactly, as sketched below.)
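A minimal one-liner version, assuming we just want the stems back as a single space-separated string:

# Stem every token and join the results back into one string
stemmed = " ".join(stemmer.stem(word) for word in token)
print(stemmed)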
The next essential thing we want to do is tag each word in our corpus (a corpus is just a 'bag' of words, here the tokens we created from our sentences) with its part of speech.
from nltk import pos_tag

token = word_tokenize(sent) + word_tokenize(sent2)
tagged = pos_tag(token)
print(tagged)

output:
[('I', 'PRP'), ('will', 'MD'), ('walk', 'VB'), ('500', 'CD'), ('miles', 'NNS'), ('and', 'CC'), ('I', 'PRP'), ('would', 'MD'), ('walk', 'VB'), ('500', 'CD'), ('more', 'JJR'), (',', ','), ('just', 'RB'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('man', 'NN'), ('who', 'WP'), ('walks', 'VBZ'), ('a', 'DT'), ('thousand', 'NN'), ('miles', 'NNS'), ('to', 'TO'), ('fall', 'VB'), ('down', 'RP'), ('at', 'IN'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.'), ('I', 'PRP'), ('played', 'VBD'), ('the', 'DT'), ('play', 'NN'), ('playfully', 'RB'), ('as', 'IN'), ('the', 'DT'), ('players', 'NNS'), ('were', 'VBD'), ('playing', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('play', 'NN'), ('with', 'IN'), ('playfullness', 'NN'), ('.', '.')]
The pos_tag() method takes in a list of tokenized words and returns a list of tuples, pairing each word with its part-of-speech identifier. For example, VB refers to 'verb', NNS to 'plural noun', and DT to 'determiner'. These are the Penn Treebank tags; refer to the Penn Treebank tag set for the full list (or ask NLTK itself, as shown below). The tags are usually pretty accurate, but we should be aware that they can be wrong at times, and pre-trained models generally assume the English being used is written properly, following the grammatical rules.
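If you don't want to leave your IDE, NLTK can describe a tag for you. This relies on the 'tagsets' resource, which the full download we did earlier already includes.

import nltk

# Print the definition and example words for a given Penn Treebank tag
nltk.help.upenn_tagset('VB')
nltk.help.upenn_tagset('NNS')

# A regular expression lists several tags at once, e.g. all noun tags
nltk.help.upenn_tagset('NN.*')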
This can be a problem when analyzing informal texts, like posts from the internet. Remember the data we downloaded after installing NLTK? It contains the corpora these models were originally trained on. To apply the models to our own area of interest, we would first need to train them on new datasets containing that kind of informal language.