The Basics of Natural Language Processing (NLP)

Unstructured text can be processed in several ways. A simple approach is to split paragraphs on new lines and split sentences into words on spaces. To get meaning out of those words, however, you then need to source your own word scoring data and join it to your unstructured data source, which requires data warehousing services, internal or external. Finding, cleaning, and processing word scoring data sources becomes a large part of that kind of project. In a relational database, this style of NLP amounts to a row-level relationship: an ACID-compliant database joins each word to a “word scoring” table, which requires scored data for every word and can be time-consuming to build. NLTK provides a Python alternative that does not require modeling the problem the way a relational database would. There are more precise and efficient methods available, such as those showcased in this blog post, which includes YouTube content, Python code, code walkthroughs, and a cloud version of Jupyter notebook code that your digital marketing team can use to start solving problems immediately.

Row-level scoring per word and NLP are both powerful tools when trying to understand data. Data engineering services open up a new level of data solution development and let you quickly harness different capabilities across your internal and external data sources.
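As a rough illustration of row-level word scoring, here is a minimal Python sketch. The lexicon is a made-up, tiny stand-in for a real word scoring table, and `score_text` is a hypothetical helper name; treat it as a picture of the join between words and scores, not a usable sentiment model.

```python
import re

# Hypothetical word scoring table; a real one would live in a database
# or come from a published sentiment lexicon.
WORD_SCORES = {"great": 2, "good": 1, "slow": -1, "broken": -2}

def score_text(text):
    """Sum the scores of every known word in `text` (unknown words score 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WORD_SCORES.get(word, 0) for word in words)

print(score_text("Great dashboard, but the export is slow and broken."))
```

Every word missing from the table silently scores zero, which is why sourcing and maintaining the scoring data ends up being most of the work.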

Natural Language Processing, or NLP for short, is a branch of artificial intelligence that deals with the interaction between computers and human languages. It is a field that has seen tremendous growth in recent years, with applications ranging from language translation to sentiment analysis, and even to building intelligent virtual assistants.

At its core, NLP is about teaching computers to understand and process human language. This is a challenging task, as human language is highly ambiguous and context-dependent. For example, the word “bass” can refer to a type of fish or a low-frequency sound, and the word “bat” can refer to an animal or a piece of sports equipment. Understanding the intended meaning in a given context requires a deep understanding of the language and the context in which it is used.

There are several key techniques that are used in NLP, including:

  • Tokenization: This is the process of breaking down a sentence or text into individual words or phrases. This is typically the first step in an NLP pipeline, as it allows the computer to work with the individual elements of the text.
  • Part-of-speech tagging: This is the process of identifying the role of each word in a sentence, such as whether it is a noun, verb, adjective, etc. This helps the computer understand the grammar and structure of the sentence.
  • Named Entity Recognition: This is the process of identifying proper nouns and entities in a sentence such as people, places, and organizations. This can be used to extract structured information from unstructured text.
  • Sentiment Analysis: This is the process of determining the emotional tone of a piece of text. This can be used to understand how people feel about a particular topic or product.
  • Machine Translation: This is the process of converting text from one language to another. This can be used to translate documents, websites or even speech.
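To make the first technique in the list concrete, here is a minimal tokenization sketch using only Python's standard library. The function name `simple_tokenize` and the regular expression are assumptions chosen for illustration; library tokenizers such as NLTK's `word_tokenize` apply more rules and will split some text differently (for example, contractions).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:'\w+)? keeps contractions like "didn't" together;
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(simple_tokenize("The bass player didn't catch a bass."))
```

Note that the tokenizer treats both occurrences of “bass” identically; telling the instrument from the fish is left to later steps that look at context.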

These are just a few examples of the many techniques used in NLP. The field is constantly evolving, with new techniques and algorithms being developed all the time. As the amount of text data available on the internet continues to grow, the importance of NLP will only increase. It is a fascinating field that has the potential to revolutionize how we interact with technology, and understanding the basics of NLP is essential for anyone working in technology or data science.

In conclusion, NLP is a rapidly growing field that deals with teaching computers to understand human languages. It encompasses a wide range of techniques and applications, from tokenization and part-of-speech tagging to sentiment analysis and machine translation. With the increasing amount of text data available, understanding the basics of NLP is essential for anyone working in technology or data science.

10 Blog Resources Related to NLP:

  1. The Stanford Natural Language Processing Group: http://nlp.stanford.edu/blog/
  2. Google AI Blog: https://ai.googleblog.com/category/natural-language-processing/
  3. Hugging Face: https://huggingface.co/blog/
  4. SpaCy: https://spacy.io/blog/
  5. NLP News: http://nlpnews.com/
  6. OpenAI: https://openai.com/blog/tag/natural-language-processing/
  7. KDNuggets: https://www.kdnuggets.com/tag/natural-language-processing
  8. NLP Progress: https://nlpprogress.com/
  9. NLP Overview: https://nlpoverview.com/
  10. The NLP Newsletter: http://nlpnewsletter.com/

Python Code to Begin Part-of-Speech Tagging Using a Web-Scraped Website

Part-of-speech tagging, also known as POS tagging or grammatical tagging, is a method of annotating words in a text with their corresponding grammatical categories, such as noun, verb, adjective, and adverb. It is often used as one step in text and data mining workflows. This process is important for natural language processing (NLP) tasks such as text classification, machine translation, and information retrieval.

There are two main approaches to POS tagging: rule-based and statistical. Rule-based tagging uses a set of hand-written rules to assign POS tags to words, while statistical tagging uses machine learning algorithms to learn the POS tag of a word based on its context.
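The rule-based approach can be sketched with a handful of suffix patterns. The rules below are invented for illustration and are far too crude for real text; the tag names follow the Penn Treebank convention, and `rule_based_tag` is a hypothetical helper name.

```python
import re

# Hand-written (pattern, tag) rules, checked in order; the first match wins.
RULES = [
    (r"^(the|a|an)$", "DT"),   # determiners
    (r".*ing$", "VBG"),        # gerunds / present participles
    (r".*ed$", "VBD"),         # simple past
    (r".*ly$", "RB"),          # adverbs
    (r".*s$", "NNS"),          # plural nouns
    (r".*", "NN"),             # default: singular noun
]

def rule_based_tag(tokens):
    """Tag each token with the first rule whose pattern matches it."""
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if re.match(pattern, token.lower()):
                tagged.append((token, tag))
                break
    return tagged

print(rule_based_tag(["The", "dogs", "barked", "loudly"]))
```

Rules like these ignore context entirely, so a word such as “building” is always tagged the same way whether it is used as a noun or a verb. That limitation is what statistical taggers address.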

Statistical POS tagging is generally more accurate and more widely used because it can take into account the context in which a word is used and learn from a large corpus of annotated text. A classic model for statistical POS tagging is the Hidden Markov Model (HMM), which treats POS tags as hidden states and uses transition and emission probabilities to predict the most likely sequence of tags for a sentence.
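To show how an HMM picks tags, here is a toy Viterbi decoder in plain Python. Every probability below is made up for illustration rather than learned from a corpus, the model has only two tags, and the function assumes a non-empty list of words; a real tagger estimates these tables from annotated text.

```python
# Toy model: two tags and invented probabilities (not learned from data).
STATES = ["NN", "VB"]
START_P = {"NN": 0.7, "VB": 0.3}
TRANS_P = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.8, "VB": 0.2}}
EMIT_P = {
    "NN": {"dogs": 0.5, "fish": 0.4, "chase": 0.1},
    "VB": {"dogs": 0.1, "fish": 0.3, "chase": 0.6},
}

def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][s] = (probability of the best path ending in tag s at word i,
    #               previous tag on that path)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s].get(word, 0.0), prev)
        best.append(column)
    # Trace back from the most probable final tag.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["dogs", "chase", "fish"], STATES, START_P, TRANS_P, EMIT_P))
```

With these numbers, “fish” is tagged as a noun because a noun is likely to follow the verb “chase”, even though “fish” can also be emitted by the verb state. That use of neighboring tags is the context sensitivity that rule-based suffix matching lacks.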

One of the most popular POS tagging tools is the Natural Language Toolkit (NLTK) library in Python, which provides a set of functions for tokenizing, POS tagging, and parsing text. NLTK also includes a pre-trained POS tagger based on the Penn Treebank POS tag set, which is a widely used standard for POS tagging.

In addition to NLTK, other popular POS tagging tools include the Stanford POS Tagger, the OpenNLP POS Tagger, and the spaCy library.

POS tagging is an important step in many NLP tasks, and it is used as a pre-processing step for other NLP tasks such as named entity recognition, sentiment analysis, and text summarization. It is a crucial step in understanding the meaning of text, as the POS tags provide important information about the syntactic structure of a sentence.

In conclusion, part-of-speech tagging is a technique that assigns a grammatical category to each word in a text, which is important for natural language processing tasks. The statistical approach is generally more accurate and more widely used, and there are several libraries and tools available to perform POS tagging. It serves as a pre-processing step for other NLP tasks and is crucial in understanding the meaning of text.

Using NLTK for the First Time

Here’s a quick walkthrough to allow you to begin POS tagging.

First, you’ll want to install NLTK completely.

NLTK is an open source software. The source code is distributed under the terms of the Apache License Version 2.0. The documentation is distributed under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license. The corpora are distributed under various licenses, as documented in their respective README files.

Quote from: https://github.com/nltk/nltk/wiki/FAQ

If you have PyCharm or another Python IDE available, begin by opening the terminal and running:

pip install nltk

Next, use the NLTK downloader. Running the following Python will open the downloader on your computer.

import nltk
nltk.download()

The NLTK downloader window will open.

Go ahead and download everything.

Here is an example of a Python script that uses the Natural Language Toolkit (NLTK) library to perform part-of-speech tagging on the text scraped from a website:

The code from the YouTube video is also available on GitHub, and it is explained line by line below.

import requests
from bs4 import BeautifulSoup
import nltk

# Work-around for mod_security: send a browser-like User-Agent header
# so the request is not rejected as coming from a bot
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Scrape the website's HTML
url = "https://dev3lop.com"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# Extract the text from the website
text = soup.get_text()

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform part-of-speech tagging on the tokens
tagged_tokens = nltk.pos_tag(tokens)

# Print the tagged tokens
print(tagged_tokens)

This script uses the requests library to scrape the HTML of the website specified in the url variable. It then uses the BeautifulSoup library to extract the text from the HTML. The text is tokenized using the word_tokenize() function from NLTK, and then part-of-speech tagging is performed on the tokens using the pos_tag() function. The resulting list of tagged tokens is then printed to the console.
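To experiment with the extract-and-tokenize steps without a network call or third-party packages, the sketch below runs the same idea on a hard-coded HTML string using only the standard library. The `TextExtractor` class, the sample HTML, and the simple regular-expression tokenizer are all assumptions for illustration; they stand in for BeautifulSoup and NLTK and will not produce identical output.

```python
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# Hypothetical page content standing in for a scraped response body.
html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Data Engineering</h1><p>We build pipelines.</p>"
    "<script>var x = 1;</script></body></html>"
)

parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)

# Rough tokenization: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

Working from a fixed string like this makes it easier to check each step before pointing the script at a live site.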

Filtering out common words

If you’re digging deeper, you may want to see how “NN” (nouns), “VB” (verbs), and “JJ” (adjectives) are used in the text.

We can quickly filter out tokens whose POS tags are not useful for our analysis, such as punctuation marks or common function words like “is” or “the”. For example, you can use a list comprehension to keep only the tokens whose POS tag appears in a list of tags that you are interested in analyzing:

# List of POS tags to include in the analysis
include_pos = ["NN", "VB", "JJ"]

# Filter the tagged tokens to include only the specified POS tags
filtered_tokens = [(token, pos) for token, pos in tagged_tokens if pos in include_pos]

# Print the filtered tokens
print(filtered_tokens)
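One thing to be aware of: the filter above is an exact match, and the Penn Treebank tag set uses longer tags for variants, such as “NNS” for plural nouns, “VBD” for past-tense verbs, and “JJR” for comparative adjectives. If you want those included, a prefix match is one option. The sketch below uses a small hand-written sample in the same (token, tag) shape that `nltk.pos_tag` returns; the names `sample_tagged`, `include_prefixes`, and `filtered_by_prefix` are made up for this example.

```python
# Hand-written sample in the (token, tag) shape that nltk.pos_tag returns.
sample_tagged = [
    ("Engineers", "NNS"), ("built", "VBD"), ("a", "DT"),
    ("reliable", "JJ"), ("pipeline", "NN"), (".", "."),
]

# Tag prefixes to keep: all noun, verb, and adjective variants.
include_prefixes = ("NN", "VB", "JJ")

# str.startswith accepts a tuple, so one call checks every prefix.
filtered_by_prefix = [
    (token, pos) for token, pos in sample_tagged
    if pos.startswith(include_prefixes)
]

print(filtered_by_prefix)
```

With an exact match on “NN”, “VB”, and “JJ”, only “reliable” and “pipeline” from this sample would survive; the prefix match also keeps “Engineers” and “built”.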

Counting occurrences

from collections import Counter

# Count how many times each (token, tag) pair appears
token_counts = Counter(filtered_tokens)

# Print counts
print(token_counts)
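Two small refinements are often useful when counting: lowercasing tokens so that “Data” and “data” are counted together, and asking for only the top few entries. The sketch below shows both on a hand-written sample; `sample_filtered` and `word_counts` are hypothetical names, and the data is invented for illustration.

```python
from collections import Counter

# Hand-written sample in the same shape as the filtered (token, tag) pairs.
sample_filtered = [
    ("Data", "NN"), ("data", "NN"), ("pipeline", "NN"),
    ("build", "VB"), ("data", "NN"),
]

# Lowercase tokens so "Data" and "data" count as the same word.
word_counts = Counter((token.lower(), pos) for token, pos in sample_filtered)

# most_common(n) returns the n entries with the highest counts, largest first.
print(word_counts.most_common(2))
```

Whether to lowercase depends on the analysis: it merges sentence-initial words with their lowercase forms, but it also merges proper nouns with common nouns that share a spelling.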

The final output is a Counter of (token, tag) pairs and their counts.

Now that you’re done counting occurrences, you can inspect the printed token_counts and notice that this method also sorts the information from largest to smallest count. We hope this lesson on part-of-speech tagging using a web-scraped website is a solution you can take into consideration when building your next Python data pipeline!

If you need assistance creating these tools, you can count on our data engineering consulting services to help elevate your Python engineering needs!