An n-gram is a sequence of "n" adjacent items in text or speech. These items are used in text mining and natural language processing tasks and can be syllables, letters, phonemes, words, or adjacent base pairs from a genome.
N-Grams Overview
N-grams are crucial when dealing with text data in Natural Language Processing (NLP) tasks. Their applications include semantic feature extraction, spelling correction, text mining, machine translation, text prediction, and more.
"N" is a positive integer with values like 1, 2, 3, 4, and so on. For example, in text analysis, if "n" is 1, the n-gram is called a unigram. If it is 2, it is called a bigram; if it is 3, it is a trigram.
Unigram
A unigram is a type of n-gram when the value of n is 1. It refers to single words that represent the most basic units of text.
Example:
Text = “Follow good SEO practices”
Here’s the unigram for the above text:
("Follow", "good", "SEO", "practices")
From this example, you can see that unigram means taking one word at a time.
Bigram
In Bigrams, the value of n is 2. Let’s look at the example below:
Text = “Follow good SEO practices”
The bigrams will be as follows:
("Follow good", "good SEO", "SEO practices")
In other words, bigrams mean taking two words at a time.
Trigram
The value of n in a trigram is 3. Here’s an example.
Text = “Follow good SEO practices”
The trigram will be like this:
("Follow good SEO", "good SEO practices")
It is important to note that the value of n isn't limited to three. Depending on the text, we can have 4-grams, 5-grams, 6-grams, and so on. Look at the text below, for example.
Text = “Always work with the best SEO agency”
The 4-grams for this text will appear like this:
("Always work with the", "work with the best", "with the best SEO", "the best SEO agency")
So, What is an N-Gram in Corpus Linguistics?
An n-gram in corpus linguistics is a sequence of n items (bigram = two items, trigram = three items, four-gram = four items, and so on), where n refers to the number of items. In corpus linguistics, the items are typically tokens (words), and recurring n-grams are also known as multiword expressions (MWEs).
Creating a list of the most frequently occurring n-grams helps us discover patterns in language use that are not so obvious with other methods of linguistic analysis. Analyzing n-grams helps educators find and emphasize important phrases that students should memorize as complete units for more effective learning.
N-grams in NLP
N-grams in NLP are sequences of "n" items taken from a text that help capture contextual information and the relationships between words. An item can be a single word (unigram) or a longer sequence of words, characters, or syllables (bigram, trigram, four-gram, and so on).
You can generate n-grams by sliding a window of n words across a given text or corpus. Once you extract these n-grams, you can analyze how often certain word sequences appear, identify commonly co-occurring words, and understand language patterns in a text. N-grams also play a role in training machine learning models for tasks like text classification and sentiment analysis.
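Here is a minimal sketch of that sliding-window process in plain Python (no external libraries; the generate_ngrams helper is defined here purely for illustration):
Python
from collections import Counter

def generate_ngrams(text, n):
    # Slide a window of n words across the text and join each window into a string
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "Follow good SEO practices"
print(generate_ngrams(text, 1))  # ['Follow', 'good', 'SEO', 'practices']
print(generate_ngrams(text, 2))  # ['Follow good', 'good SEO', 'SEO practices']
print(generate_ngrams(text, 3))  # ['Follow good SEO', 'good SEO practices']

# Counting how often word sequences appear across a (toy) corpus
corpus = ["follow good seo practices", "good seo practices pay off"]
bigram_counts = Counter(bg for doc in corpus for bg in generate_ngrams(doc, 2))
print(bigram_counts.most_common(2))  # e.g. [('good seo', 2), ('seo practices', 2)]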
Why do N-grams in NLP Matter?
Here are the key benefits of using n-grams in NLP:
- Language modeling: N-grams improve applications like speech recognition, machine translation, and auto-completion by helping capture the probability distribution over word sequences in a given language.
- Enhance text prediction: Looking at the most frequent n-grams can help predict the next word in a sequence, which is useful in text generation and autocomplete applications (see the sketch after this list).
- Information retrieval: N-grams can help match and rank documents to get relevant results in information retrieval tasks.
- Capturing context and semantics: N-grams help capture the context and meaning within a sequence of words, making it easier to understand the language.
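As a toy illustration of the text-prediction point above, the sketch below builds a table of bigram counts from a made-up corpus and uses it to suggest the next word; the corpus and the predict_next helper are invented for this example and are not part of any library:
Python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real system would be trained on far more text
corpus = [
    "i want to buy milk and bread",
    "i want to buy milk and eggs",
    "i want to learn nlp",
]

# For every word, count which words follow it (i.e., bigram counts)
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        next_word_counts[w1][w2] += 1

def predict_next(word, k=2):
    # Return the k words most frequently seen after `word`
    return [w for w, _ in next_word_counts[word].most_common(k)]

print(predict_next("and"))   # ['bread', 'eggs']
print(predict_next("want"))  # ['to']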
Key Use Cases of N-grams in NLP
Machine Translation
N-grams help understand and translate phrases within a larger context to improve the overall quality of machine translation.
For example, consider the sentence "She went to the bank to withdraw money." An n-gram model could help the system understand that "the bank" in this sentence refers to a financial institution rather than the side of a river.
If the system comes across the word "bank" in another context, the n-gram model can use the surrounding words, such as "deposit money," to predict the correct meaning.
Speech Recognition
N-grams can help speech recognition systems choose the correct words based on word sequences and context. For example, recognition of the phrase “ice cream” in speech helps the system predict that the word “cream” follows “ice” based on how common the phrase is.
Text Classification
In machine learning models, n-grams provide features that help categorize text by capturing patterns associated with different categories. For example, suppose you are classifying movie reviews into two categories, positive and negative, and one review reads "The movie was absolutely fantastic and enjoyable." The bigrams in this review are as follows:
“The movie”
“movie was”
“was absolutely”
“absolutely fantastic”
“fantastic and”
“and enjoyable”
The machine learning model can use these bigrams to learn that phrases like “absolutely fantastic” and “enjoyable” are commonly associated with positive reviews. The model can compare these n-grams with previous reviews to predict whether new reviews are positive or negative.
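As a rough sketch of this idea, the snippet below uses unigram and bigram counts as features for a Naive Bayes classifier. It assumes scikit-learn (not used elsewhere in this article), and the tiny labeled dataset is invented purely for illustration:
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up labeled reviews
reviews = [
    "The movie was absolutely fantastic and enjoyable",
    "An absolutely fantastic and enjoyable experience",
    "The movie was boring and disappointing",
    "A dull and disappointing film",
]
labels = ["positive", "positive", "negative", "negative"]

# ngram_range=(1, 2) extracts both unigrams and bigrams as features
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["What a fantastic and enjoyable movie"]))  # expected: ['positive']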
Predictive Text Input
Predictive text input on mobile devices and keyboards uses n-grams to suggest the next word in a sequence based on the words already typed. For example, when you type "I want to buy milk and…" on your smartphone, the predictive text system suggests the next word based on context as you type. Suggestions could include "bread," "bacon," "eggs," "groceries," etc.
Named Entity Recognition (NER)
N-grams can help NER systems identify and classify named entities such as names, locations, organizations, dates, etc.
Topic Modeling
N-grams can help identify underlying themes or topics within a collection of documents for easier clustering and categorization based on content. For example, suppose you have a large set of articles and want to identify the main topics that the articles discuss. In that case, n-grams can help the modeling algorithm identify common word sequences such as health & wellness, climate change, artificial intelligence, etc.
Based on the frequency of these n-grams, the algorithm can group articles that mention “health & wellness” together, those that mention “artificial intelligence” together, and so on. After that, you can categorize your documents as “health,” “technology,” “climate,” etc.
Search Engine Algorithms
N-grams help search engines index and retrieve relevant content. Ideally, search engines use n-grams to break down user queries and documents into smaller word sequences to provide more accurate results for a search query. For example, if a user searches for the "best CBD marketing agency in Los Angeles," Google uses n-grams to break down the query into sequences like:
"best CBD"
"CBD marketing"
"marketing agency"
"agency in"
"in Los"
"Los Angeles"
The search engine will then look for articles, reviews, listings, or blogs containing these n-grams to provide highly relevant results for the searcher's query.
N-Gram Language Modeling with NLTK
Language modeling is used in speech recognition, spam filtering, and other applications. Here are the two key methods of language modeling:
- Statistical Language Modeling
- Neural Language Modeling
Statistical Language Modeling
Statistical Language Modeling in NLP is used to predict the probability of a sequence of words. These models are key in many NLP applications, including:
- Speech recognition: Help predict the next word sequence for more accurate speech recognition.
- Machine translation: Predict word sequences to generate more natural translations.
- Text generation: Create relevant text based on preceding words.
- Spell checkers: Analyze word sequences to identify and correct misspelled words.
Key Concepts in Statistical Language Models
- N-grams: As stated, an n-gram is a sequence of N items from a text or speech. In a bigram, for example, the probability of a word is based on the previous word.
- Probability Estimation: Here, the idea is to estimate the probability of a word sequence. For instance, in an n-gram model, the likelihood of a sequence of words is calculated as the product of the conditional probabilities of each word based on the previous N-1 words.
- Smoothing: These techniques help solve the zero probabilities problem for unseen n-grams. Common techniques include Good-Turing discounting, Laplace smoothing, and Kneser-Ney smoothing.
Types of Statistical Language Models
Unigram Model: This model considers each word independent of the words before it. P(w1, w2, …, wn) = P(w1) P(w2) … P(wn)
Bigram Model: Assumes that the probability of a word depends only on the previous word. P(wn | w1, w2, …, wn−1) ≈ P(wn | wn−1)
Trigram Model: This is an extension of the bigram model. It considers the two previous words. P(wn | w1, w2, …, wn−1) ≈ P(wn | wn−2, wn−1)
Higher-Order N-gram Models: Consider more previous words. They are more expressive but also more complex and require more data for accurate estimation.
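To make the bigram formula above concrete, here is a small worked sketch; the unigram and bigram counts are made-up toy numbers, and the start-of-sentence term is ignored for brevity:
Python
# Toy counts, invented for illustration
unigram_counts = {"follow": 4, "good": 3, "seo": 3, "practices": 2}
bigram_counts = {("follow", "good"): 2, ("good", "seo"): 2, ("seo", "practices"): 2}

def bigram_prob(w1, w2):
    # P(w2 | w1) estimated as count(w1, w2) / count(w1)
    return bigram_counts.get((w1, w2), 0) / unigram_counts[w1]

sentence = ["follow", "good", "seo", "practices"]

# Under the bigram assumption, P(sentence) is the product of P(w_i | w_{i-1})
prob = 1.0
for w1, w2 in zip(sentence, sentence[1:]):
    prob *= bigram_prob(w1, w2)

print(prob)  # 0.5 * 0.667 * 0.667 ≈ 0.222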
Steps to Create an N-Gram Model with NLTK
1. Install NLTK
bash
pip install nltk
2. Import Necessary Libraries
Python
import nltk
from nltk.util import ngrams
from collections import Counter, defaultdict
3. Tokenize Text
Python
text = "This is a simple example to demonstrate N-gram language modeling using NLTK."
nltk.download("punkt")  # download the tokenizer data on first run
tokens = nltk.word_tokenize(text.lower())
4. Generate N-Grams
Python
n = 2 # for bigrams
bigrams = list(ngrams(tokens, n))
5. Calculate Frequencies
Python
bigram_freqs = Counter(bigrams)
6. Calculate Probabilities
Python
unigram_freqs = Counter(tokens)
bigram_probs = {k: v / unigram_freqs[k[0]] for k, v in bigram_freqs.items()}
7. Smoothing (Optional)
Python
def add_one_smoothing(bigram_counts, unigram_counts):
    # Add-one (Laplace) smoothing: add 1 to every bigram count and the vocabulary
    # size to every denominator so unseen bigrams get a small, non-zero probability
    vocabulary_size = len(unigram_counts)
    smoothed_bigram_probs = {}
    for bigram in bigram_counts:
        smoothed_bigram_probs[bigram] = (bigram_counts[bigram] + 1) / (unigram_counts[bigram[0]] + vocabulary_size)
    return smoothed_bigram_probs
Example Code for Statistical Language Modeling
Python
import nltk
from nltk.util import ngrams
from collections import Counter
nltk.download("punkt")  # download the tokenizer data on first run
# Example text
text = "This is a simple example to demonstrate N-gram language modeling using NLTK."
# Tokenize text
tokens = nltk.word_tokenize(text.lower())
# Generate bigrams
bigrams = list(ngrams(tokens, 2))
# Calculate frequencies
bigram_freqs = Counter(bigrams)
unigram_freqs = Counter(tokens)
# Calculate probabilities
bigram_probs = {k: v / unigram_freqs[k[0]] for k, v in bigram_freqs.items()}
# Print bigram probabilities
for bigram, prob in bigram_probs.items():
    print(f"Bigram: {bigram}, Probability: {prob:.4f}")
# Add-one smoothing
def add_one_smoothing(bigram_counts, unigram_counts):
    # Add-one (Laplace) smoothing so unseen bigrams get a small, non-zero probability
    vocabulary_size = len(unigram_counts)
    smoothed_bigram_probs = {}
    for bigram in bigram_counts:
        smoothed_bigram_probs[bigram] = (bigram_counts[bigram] + 1) / (unigram_counts[bigram[0]] + vocabulary_size)
    return smoothed_bigram_probs
smoothed_probs = add_one_smoothing(bigram_freqs, unigram_freqs)
# Print smoothed probabilities
print("\nWith Add-One Smoothing:")
for bigram, prob in smoothed_probs.items():
    print(f"Bigram: {bigram}, Smoothed Probability: {prob:.4f}")
Neural Language Modeling
These models use neural networks to predict the probability of a word sequence. They can capture longer and more complex patterns in data, making them more powerful than traditional statistical models. Below is a simple example using a Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN), with TensorFlow/Keras.
Key Concepts in Neural Language Modeling
- Word Embeddings: Represent words as continuous vectors in a high-dimensional space. Pre-trained embeddings like Word2Vec, GloVe, or FastText can be used.
- Neural Networks: Use layers of neurons to learn patterns in the data. Common architectures include RNNs, LSTMs, and Transformers.
- Training: Neural networks require training on large datasets to learn the relationships between words.
Steps to Create a Neural Language Model
1. Install TensorFlow/Keras
bash
pip install tensorflow
2. Import Necessary Libraries
Python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
3. Prepare the Dataset
Python
text = "This is a simple example to demonstrate neural language modeling using TensorFlow."
corpus = text.lower().split(".")
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
# Create input sequences
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
# Pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding="pre"))
# Split data into input and output
X, y = input_sequences[:,:-1], input_sequences[:,-1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
4. Build the Model
Python
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(LSTM(150))
model.add(Dense(total_words, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
5. Train the Model
Python
model.fit(X, y, epochs=100, verbose=1)
6. Generate Text
Python
seed_text = "This is a"
next_words = 5
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding="pre")
    predicted = model.predict(token_list, verbose=0)
    predicted_word_index = np.argmax(predicted, axis=1)
    output_word = tokenizer.index_word[predicted_word_index[0]]
    seed_text += " " + output_word
print(seed_text)
Example Code for Neural Language Modeling
Here is a complete example:
Python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample text
text = "This is a simple example to demonstrate neural language modeling using TensorFlow."
corpus = text.lower().split(".")
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
# Create input sequences
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
# Pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding="pre"))
# Split data into input and output
X, y = input_sequences[:,:-1], input_sequences[:,-1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
# Build the model
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(LSTM(150))
model.add(Dense(total_words, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# Train the model
model.fit(X, y, epochs=100, verbose=1)
# Generate text
seed_text = "This is a"
next_words = 5
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding="pre")
    predicted = model.predict(token_list, verbose=0)
    predicted_word_index = np.argmax(predicted, axis=1)
    output_word = tokenizer.index_word[predicted_word_index[0]]
    seed_text += " " + output_word
print(seed_text)
BERT (Bidirectional Encoder Representations from Transformers) and NLP
BERT is a landmark model in Natural Language Processing (NLP). Unlike traditional models that read text sequentially (left-to-right or right-to-left), BERT processes text bidirectionally, considering the context from both directions simultaneously.
BERT is built on the Transformer architecture, which uses self-attention mechanisms to understand the relationships between words in a sentence. This bidirectional approach allows BERT to interpret the meaning of words based on their context within a sentence, improving performance on various NLP tasks.
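To see this bidirectional behavior in practice, here is a minimal sketch using the Hugging Face transformers library (not otherwise used in this article); it asks a pretrained BERT model to fill in a masked word using the context on both sides of the gap:
Python
from transformers import pipeline

# Load a masked-word prediction pipeline backed by a pretrained BERT model
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT ranks candidates for [MASK] using the words on BOTH sides of it
for prediction in unmasker("She went to the [MASK] to withdraw money."):
    print(prediction["token_str"], round(prediction["score"], 3))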
BERT is adopted in numerous NLP applications, including:
- Search Engines: BERT helps search engines understand user queries and enhance the relevance of search results.
- Question Answering Systems: BERT understands contextual information in the input text and can therefore improve question-answering tasks.
- Text Classification: BERT can improve the accuracy of sentiment analysis, spam detection, and other classification tasks.
BERT and n-grams
Unlike n-grams, BERT goes beyond simple word sequences. It can capture the meaning of sentences and phrases in a way that traditional n-gram models cannot, making it a powerful tool for more sophisticated NLP tasks.
RAG (Retrieval-Augmented Generation)
RAG is an advanced NLP model that consists of two main components:
- Retriever: This component fetches relevant documents or passages from a large corpus based on the query.
- Generator: This component generates a coherent and contextually accurate response using the retrieved information.
While n-grams are primarily used to model local patterns in text, RAG uses external information to generate richer and more informed responses. The outputs are not only contextually accurate but also contain relevant details from external sources.