Python Code to Begin Part-of-Speech Tagging Using a Web Scrapped Website

Part-of-speech tagging, also known as POS tagging or grammatical tagging, is a method of annotating words in a text with their corresponding grammatical categories, such as noun, verb, adjective, adverb, and sometimes this is referred to as data mining. This process is important for natural language processing (NLP) tasks such as text classification, machine translation, and information retrieval.

There are two main approaches to POS tagging: rule-based and statistical. Rule-based tagging uses a set of hand-written rules to assign POS tags to words, while statistical tagging uses machine learning algorithms to learn the POS tag of a word based on its context.

Statistical POS tagging is more accurate and widely used because it can take into account the context in which a word is used and learn from a large corpus of annotated text. The most common machine learning algorithm used for POS tagging is the Hidden Markov Model (HMM), which uses a set of states and transition probabilities to predict the POS tag of a word.

One of the most popular POS tagging tools is the Natural Language Toolkit (NLTK) library in Python, which provides a set of functions for tokenizing, POS tagging, and parsing text. NLTK also includes a pre-trained POS tagger based on the Penn Treebank POS tag set, which is a widely used standard for POS tagging.

In addition to NLTK, other popular POS tagging tools include the Stanford POS Tagger, the OpenNLP POS Tagger, and the spaCy library.

POS tagging is an important step in many NLP tasks, and it is used as a pre-processing step for other NLP tasks such as named entity recognition, sentiment analysis, and text summarization. It is a crucial step in understanding the meaning of text, as the POS tags provide important information about the syntactic structure of a sentence.

In conclusion, Part-of-Speech tagging is a technique that assigns grammatical category to words in a text, which is important for natural language processing tasks. Statistical approach is more accurate and widely used, and there are several libraries and tools available to perform POS tagging. It serves as a pre-processing step for other NLP tasks and it is crucial in understanding the meaning of text.

Using NLTK for the First Time

Here’s a quick walkthrough to allow you to begin POS tagging.

First, you’ll want to install NLTK completely.

NLTK is an open source software. The source code is distributed under the terms of the Apache License Version 2.0. The documentation is distributed under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license. The corpora are distributed under various licenses, as documented in their respective README files.
Quote from; https://github.com/nltk/nltk/wiki/FAQ

If you have pycharm available or a python IDE, begin by opening the terminal and running.

pip install nltk

Next you want to use their downloader.

Here’s the python to run next. It will open their downloader on your computer.

import nltk
nltk.download()

The following window will open.

Go ahead and download everything.

Here is an example of a Python script that uses the Natural Language Toolkit (NLTK) library to perform part-of-speech tagging on the text scraped from a website:

Find the code from the youtube video above, here on github, explained line by line below.

import requests
from bs4 import BeautifulSoup
import nltk

# Work-around for mod security, simulates you being a real user

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Scrape the website's HTML
url = "https://dev3lop.com"
page = requests.get(url,  headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# Extract the text from the website
text = soup.get_text()

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform part-of-speech tagging on the tokens
tagged_tokens = nltk.pos_tag(tokens)

# Print the tagged tokens
print(tagged_tokens)

This script uses the requests library to scrape the HTML of the website specified in the url variable. It then uses the BeautifulSoup library to extract the text from the HTML. The text is tokenized using the word_tokenize() function from NLTK, and then part-of-speech tagging is performed on the tokens using the pos_tag() function. The resulting list of tagged tokens is then printed to the console.

Filtering out common words

If you’re digging deeper, you may want to see what “NN” for nouns, “VB” for verbs, and “JJ” for adjectives are in usage.

We can quickly filter out the POS tags that are not useful for our analysis, such as punctuation marks or common function words like “is” or “the”. For example, you can use a list comprehension to filter out the POS tags that are not in a certain list of POS tags that you are interested in analyzing:

# List of POS tags to include in the analysis
include_pos = ["NN", "VB", "JJ"]

# Filter the tagged tokens to include only the specified POS tags
filtered_tokens = [(token, pos) for token, pos in tagged_tokens if pos in include_pos]

# Print the filtered tokens
print(filtered_tokens)

Counting occurrences

# Count filtered tokens
token_counts = Counter(filtered_tokens)

# Print counts
print(token_counts)

Final output will look like the following;

Now that you’re done counting occurrences, you can inspect the print of token_counts and notice this method also helped you sort the information from largest to smallest. We hope this lesson on Part-of-Speech Tagging using a Web Scrapped Website is a solution you’re able to take into consideration when generating your next python data pipeline!

If you need assistance creating these tools, you can count on our data engineering consulting services to help elevate your python engineering needs!