Intro to NLTK for NLP with Python (2024)

Tokenization, Stopwords, Stemming, and PoS Tagging (with code) — Part 1

NLTK (Natural Language Toolkit) is the go-to API for NLP (Natural Language Processing) with Python. It is a really powerful tool for preprocessing text data for further analysis, for instance with ML models. It helps prepare text so it can be converted into numbers, which a model can then easily work with. This is the first part of a basic introduction to NLTK, meant to get your feet wet, and it assumes some basic knowledge of Python.

First, you want to install NLTK using pip (or conda). The command is pretty straightforward for both Mac and Windows: pip install nltk. If this does not work, take a look at this page from the documentation. Note that you need at least Python 3.5 for NLTK.

To check that NLTK is installed properly, just type import nltk in your IDE. If it runs without any error, congrats! But hold up, there is still a bunch of stuff to download and install. In your IDE, after importing, continue to the next line, type nltk.download(), and run the script. An installation window will pop up. Select 'all' and click 'Download' to fetch and install the additional bundles. This downloads all the dictionaries, corpora, and model data necessary for full NLTK functionality. NLTK fully supports English, while other languages like Spanish or French are not supported as extensively. Now we are ready to process our first natural language.
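If you would rather not download everything, you can also fetch just the specific resources used in this article. Here is a minimal sketch; the exact resource names assume a reasonably recent NLTK release.

import nltk
nltk.download('punkt')                       # tokenizer models used by word_tokenize / sent_tokenize
nltk.download('stopwords')                   # stop-word lists
nltk.download('averaged_perceptron_tagger')  # the default PoS tagger
nltk.download('tagsets')                     # optional: documentation for the PoS tags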

One of the most basic things we want to do is divide a body of text into words or sentences. This is called tokenization.

from nltk import word_tokenize, sent_tokenize

sent = "I will walk 500 miles and I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door."
print(word_tokenize(sent))
print(sent_tokenize(sent))
output:
['I', 'will', 'walk', '500', 'miles', 'and', 'I', 'would', 'walk', '500', 'more', ',', 'just', 'to', 'be', 'the', 'man', 'who', 'walks', 'a', 'thousand', 'miles', 'to', 'fall', 'down', 'at', 'your', 'door', '.']
['I will walk 500 miles and I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door.']

We get the body of text elegantly converted into a list. Doing the above tokenization without NLTK would take hours and hours of coding with regular expressions! You may wonder about the punctuation marks, though. That is something we will have to take care of separately. We could also use other tokenizers, like the PunktSentenceTokenizer, which is a pre-trained unsupervised ML model. We can even train it ourselves using our own dataset. Keep an eye out for my future articles. **insert shameless self-promoting call to follow** :3
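For example, one simple way to take care of the punctuation marks, assuming you just want to drop them, is to filter the tokens against Python's built-in string.punctuation. This is only a sketch (NLTK does not do this for you), re-using the sent variable defined above.

import string
from nltk import word_tokenize

tokens = word_tokenize(sent)
# keep only tokens that are not pure punctuation
no_punct = [tok for tok in tokens if tok not in string.punctuation]
print(no_punct)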

Stop-words are basically words that don't carry strong meaningful connotations on their own, for instance 'and', 'a', 'it's', 'they', etc. They have a meaningful impact when we use them to communicate with each other, but for analysis by a computer they are not really that useful (well, they probably could be, but computer algorithms are not yet clever enough to decipher their contextual impact accurately, to be honest). Let's see an example:

from nltk.corpus import stopwords  # the corpus module is an extremely useful one. More on that later.

stop_words = stopwords.words('english')  # this is the full list of all stop-words stored in NLTK
token = word_tokenize(sent)
cleaned_token = []
for word in token:
    if word not in stop_words:
        cleaned_token.append(word)
print("This is the unclean version:", token)
print("This is the cleaned version:", cleaned_token)
output:
This is the unclean version: ['I', 'will', 'walk', '500', 'miles', 'and', 'I', 'would', 'walk', '500', 'more', ',', 'just', 'to', 'be', 'the', 'man', 'who', 'walks', 'a', 'thousand', 'miles', 'to', 'fall', 'down', 'at', 'your', 'door', '.']
This is the cleaned version: ['I', 'walk', '500', 'miles', 'I', 'would', 'walk', '500', ',', 'man', 'walks', 'thousand', 'miles', 'fall', 'door', '.']

As you can see, many of the words like 'will' and 'and' are removed. This saves a massive amount of computation, and hence time, when we shove large bodies of text full of "fluff" words into an ML model.
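One thing worth noticing in the output above: the comparison is case-sensitive, which is why the capitalised 'I' survived even though 'i' is in the stop-word list. A small variation (my own addition, not part of the example above) lowercases each token before checking and uses a set for faster lookups, re-using the token list and stopwords import from the snippet above.

stop_set = set(stopwords.words('english'))  # sets make membership checks much faster than lists
cleaned_ci = [w for w in token if w.lower() not in stop_set]
print(cleaned_ci)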

Stemming is when 'fluff' letters (not words) are removed from a word so that it can be grouped together with its "stem form". For instance, the words 'play', 'playing', and 'plays' convey the same meaning (although, again, not exactly, but for analysis with a computer that sort of detail is still not a viable option). So instead of keeping them as different words, we can put them together under the same umbrella term 'play'.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['play', 'playing', 'plays', 'played',
         'playfullness', 'playful']
stemmed = [stemmer.stem(word) for word in words]
print(stemmed)
output:
['play', 'play', 'play', 'play', 'playful', 'play']

We used the PorterStemmer, which is a pre-written stemmer class. There are other stemmers like SnowballStemmer and LancasterStemmer, but PorterStemmer is more or less the simplest one. 'Play' and 'playful' should have been recognized as two different words, however. Notice how the last 'playful' got reduced to 'play' and not kept as 'playful'. This is where the simplicity of the PorterStemmer is undesirable. You can also train your own stemmer using unsupervised clustering or supervised classification ML models. Now let's stem an actual sentence!

sent2 = "I played the play playfully as the players were playing in
the play with playfullness"
token = word_tokenize(sent2)
stemmed = ""
for word in token:
stemmed += stemmer.stem(word) + " "
print(stemmed)
output:
I play the play play as the player were play in the play with playful .

The stemmed text can now be passed on efficiently for further processing or analysis. Pretty neat, right?!
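Since SnowballStemmer and LancasterStemmer were mentioned above, here is a quick side-by-side sketch. The exact outputs depend on your NLTK version, so I won't promise them here; Lancaster is generally the most aggressive of the three.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # sometimes called Porter2
lancaster = LancasterStemmer()

for word in ['playing', 'playful', 'playfullness', 'miles']:
    # print each word next to its three stems for comparison
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))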

The next essential thing we want to do is tag each word in the corpus we created by tokenizing our sentences (a corpus is just a 'bag' of words) with its part of speech.

from nltk import pos_tag

token = word_tokenize(sent) + word_tokenize(sent2)
tagged = pos_tag(token)
print(tagged)
output:
[('I', 'PRP'), ('will', 'MD'), ('walk', 'VB'), ('500', 'CD'), ('miles', 'NNS'), ('and', 'CC'), ('I', 'PRP'), ('would', 'MD'), ('walk', 'VB'), ('500', 'CD'), ('more', 'JJR'), (',', ','), ('just', 'RB'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('man', 'NN'), ('who', 'WP'), ('walks', 'VBZ'), ('a', 'DT'), ('thousand', 'NN'), ('miles', 'NNS'), ('to', 'TO'), ('fall', 'VB'), ('down', 'RP'), ('at', 'IN'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.'), ('I', 'PRP'), ('played', 'VBD'), ('the', 'DT'), ('play', 'NN'), ('playfully', 'RB'), ('as', 'IN'), ('the', 'DT'), ('players', 'NNS'), ('were', 'VBD'), ('playing', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('play', 'NN'), ('with', 'IN'), ('playfullness', 'NN'), ('.', '.')]

The pos_tag() method takes in a list of tokenized words and tags each of them with a corresponding part-of-speech identifier, returning a list of (word, tag) tuples. For example, VB refers to 'verb', NNS refers to 'plural noun', and DT refers to 'determiner'. Refer to this website for the full list of tags. These tags are usually pretty accurate, but we should be aware that they can be wrong at times. Pre-trained models also usually assume the English being used is written properly, following the grammatical rules.
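If you do not want to leave Python to look up a tag, NLTK can print the tagset documentation for you (this needs the 'tagsets' data package from nltk.download()):

import nltk

# prints the definition and example words for each tag
nltk.help.upenn_tagset('VB')
nltk.help.upenn_tagset('NNS')
nltk.help.upenn_tagset('DT')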

That assumption can be a problem when analyzing informal text, like posts from the internet. Remember the corpora and model data we downloaded through nltk.download() after installing NLTK? Those contain the datasets that were used to train these models in the first place. To apply the models in the context of our own interests, we would first need to train them on new datasets containing that informal language.
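As a rough sketch of what "training on a new dataset" looks like, here is a simple UnigramTagger trained on the Penn Treebank sample that ships with NLTK (needs nltk.download('treebank')). In practice you would swap in your own list of tagged sentences, for example an annotated corpus of informal text.

from nltk import word_tokenize
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

train_sents = treebank.tagged_sents()   # list of [(word, tag), ...] sentences
my_tagger = UnigramTagger(train_sents)  # learns the most frequent tag per word
# words never seen during training simply get tagged as None
print(my_tagger.tag(word_tokenize("I played the play playfully")))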


FAQs

Is learning NLP difficult?

Yes, NLP is easy to learn as long as you are learning it from the right resources. The questions below cover some good starting points and resources.

How to start with NLTK?

To install the NLTK library, open the command terminal and type:
  1. pip install nltk
  2. import nltk, then nltk. ...
  3. # import libraries: import pandas as pd, import nltk, from nltk. ...
  4. # create a preprocess_text function: def preprocess_text(text): # Tokenize the text: tokens = word_tokenize(text. ...
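The snippets in that answer are truncated. A minimal, runnable version of the same steps might look like this (the body of preprocess_text is my own assumption, reconstructed from the fragments above):

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text):
    # Tokenize the text and drop stop words
    tokens = word_tokenize(text)
    stop_set = set(stopwords.words('english'))
    return [t for t in tokens if t.lower() not in stop_set]

print(preprocess_text("I will walk 500 miles and I would walk 500 more."))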

What is the difference between Python NLP and NLTK?

Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP. A lot of the data that you could be analyzing is unstructured data and contains human-readable text.

How does Python NLTK work?

NLTK (Natural Language Toolkit) is the go-to API for NLP (Natural Language Processing) with Python. It is a really powerful tool to preprocess text data for further analysis like with ML models for instance. It helps convert text into numbers, which the model can then easily work with.

Does NLP require a lot of math?

You need to be familiar with some math principles to use natural language processing, but only the fundamentals are required to get started. Even if studying the math feels challenging, the right approach makes it manageable.

Can NLP be self taught?

The entire field has been transformed in the last 8-10 years, and traditional approaches, which relied on a lot of hand-built features and linguistics knowledge, are being replaced by deep learning techniques. The good news is that anyone can learn all of this by putting in just a little bit of effort.

Do people still use NLTK?

NLTK was originally designed for research and development due to its vast libraries. Today, it is used in prototyping and creating text processing software and can still be used in production environments. spaCy is a newer NLP tool that is currently trending among NLP libraries.

Should I use spaCy or NLTK?

spaCy has support for word vectors whereas NLTK does not. As spaCy uses recent, well-optimized algorithms, its performance is usually better than NLTK's. In word tokenization and POS tagging spaCy performs better, but in sentence tokenization NLTK outperforms spaCy.

How to learn NLP from scratch?

To start with, you need a sound knowledge of Python and libraries such as Keras and NumPy. You should also learn the basics of cleaning text data, manual tokenization, and NLTK tokenization. The next step is picking up the bag-of-words model (with scikit-learn or Keras) and more.

Which is best NLP library for Python?

Top 10 Python NLP Libraries [And Their Applications in 2024]
  • Natural Language Toolkit (NLTK)
  • Gensim.
  • CoreNLP.
  • spaCy.
  • TextBlob.
  • Pattern.
  • PyNLPl.
  • Polyglot.

What is the best language for NLP?

Although languages such as Java and R are used for natural language processing, Python is favored, thanks to its numerous libraries, simple syntax, and its ability to easily integrate with other programming languages.

What is the best NLP algorithm?

The Top NLP Algorithms
  1. Latent Dirichlet Allocation (LDA) LDA is a generative statistical model used for topic modeling. ...
  2. Conditional Random Fields (CRF) ...
  3. Porter Stemmer. ...
  4. Hidden Markov Model (HMM) ...
  5. TextRank. ...
  6. Naive Bayes. ...
  7. Support Vector Machines (SVM) ...
  8. Maximum Entropy.

Is NLTK easy to use?

It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

What is the advantage of NLTK in Python?

Advantages and disadvantages
  • NLTK proves to be highly suitable for carrying out NLP tasks.
  • It is convenient to access external resources, and all the models have been trained on dependable datasets.
  • Texts are often supplied with annotations.

How to learn NLP in Python?

You'll experiment with and compare two simple methods, bag-of-words and TF-IDF, using NLTK and a newer library, Gensim.
  1. Word counts with bag-of-words.
  2. Bag-of-words picker.
  3. Building a Counter with bag-of-words.
  4. Simple text preprocessing.
  5. Text preprocessing steps.
  6. Text preprocessing practice.
  7. Introduction to gensim.
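In the spirit of the first three steps above, here is a tiny bag-of-words example using collections.Counter and NLTK's word_tokenize (my own illustration, not taken from that course):

from collections import Counter
from nltk import word_tokenize

text = "The play was a play within a play."
bow = Counter(word_tokenize(text.lower()))   # token -> count
print(bow.most_common(3))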

How long does it take to learn NLP?

The content is designed so that you spend 6 hours per week for around 15 weeks, making it about 90 hours in total (assuming good familiarity with general ML algorithms and Python). Of course, you are free to speed up or take it easy!

Why is NLP considered hard?

NLP is not easy. There are several factors that make this process hard. For example, there are hundreds of natural languages, each of which has different syntax rules. Words can be ambiguous where their meaning is dependent on their context.

Can I learn NLP on my own?

Once you are clear about your outcome, decide which NLP techniques are relevant for you and start practicing them. The best way to start practicing is to apply them on yourself as it allows you to directly experience the effects of the techniques and refine your skills before applying them to others.

Is NLP harder than computer vision?

Natural language processing tasks are considered more technically diverse than computer vision procedures. The diversity ranges from syntax identification, morphology, and segmentation to semantics and the study of abstract meaning.
