N-gram and its use in text generation
In natural language processing, it is important to make sense not only of individual words but also of their context.
N-grams are one way to model that context, which helps pin down the meaning of written or spoken words.
For example, “I need to book a ticket to Australia.” vs. “I want to read a book of Shakespeare.”
Here the word “book” has two entirely different meanings.
In the first sentence it is used as a verb (an action), while in the second it is used as a noun (an object).
We can tell the difference because, since childhood, we have learned to read a word in the context of its sentence, that is, the words used before and after “book”.
Now the question arises: how does an NLP system work out the context of a word?
The answer is through n-grams.
Machines learn context by looking at the words before and after a given word.
Bigrams split a sentence into pairs of two consecutive words to expose the context.
A machine can learn that if an article comes before “book”, it is a noun, and that if “read” appears in the sentence, it is about a book for reading.
Trigrams split a sentence into sequences of three consecutive words. The bigger the window, the harder it is to find matching word sequences in the vocabulary, because longer sequences occur less often.
The n in n-gram defines the number of words one looks at together to see the context.
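The sliding-window idea can be sketched in a few lines of plain Python. This is only an illustration: the helper name `ngrams` and the whitespace tokenization are assumptions made here, not part of any library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I want to read a book of Shakespeare".split()

bigram_list = ngrams(sentence, 2)   # windows of two words
trigram_list = ngrams(sentence, 3)  # windows of three words

print(bigram_list[:3])   # [('I', 'want'), ('want', 'to'), ('to', 'read')]
print(trigram_list[:2])  # [('I', 'want', 'to'), ('want', 'to', 'read')]

# The word just before "book" hints at its role in the sentence.
before_book = [left for left, right in bigram_list if right == "book"]
print(before_book)  # ['a']
```

Here the bigram ('a', 'book') exposes the article in front of “book”, which is the kind of cue that points to a noun.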
Another example is the word “bull”, which can refer to an animal in one sentence and to the share market in another.
N-grams can also reveal the negative context of words, as in:
“The movie was not nice, awful really.”
The bigrams around “nice” (“not nice” and “nice, awful”) cancel the positive meaning of the word.
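A minimal sketch of how bigrams could flag such a negation is shown below. The word lists `NEGATORS` and `POSITIVE` and the helper `negated_positives` are hypothetical and hand-picked for this example; a real system would need far richer resources.

```python
# Tiny, hand-picked word lists, assumed for illustration only.
NEGATORS = {"not", "never", "no"}
POSITIVE = {"nice", "good", "great"}

def negated_positives(tokens):
    """Return positive words that directly follow a negator."""
    pairs = zip(tokens, tokens[1:])  # consecutive word pairs, i.e. bigrams
    return [word for prev, word in pairs
            if prev in NEGATORS and word in POSITIVE]

tokens = "the movie was not nice awful really".split()
print(negated_positives(tokens))  # ['nice']
```

Looking at the single word “nice” would suggest a positive review; the bigram ('not', 'nice') is what reverses it.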
N-grams can also help capture some simple forms of sarcasm. Sarcasm is an ironic or satirical remark tempered by humor.
An example sentence is:
“You are intelligent…not.” It actually means you are not intelligent.
Many other types of sarcasm, such as those carried by tone, by remarks, or by questions, are still being explored.
We use n-grams, or short sequences of words, to capture the broader context of the text, which is then fed to a machine learning model to get at the real meaning of the text.
N-grams are a simple yet effective approach in natural language processing for capturing the context of words.
Now let’s see how to implement this in Python.
Install the following packages:
!pip install -U pip
!pip install -U dill
!pip install -U nltk==3.4

from nltk.util import pad_sequence
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten

text = [['I', 'need', 'to', 'book', 'ticket', 'to', 'Australia'],
        ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare']]
Bigrams can be seen as:
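A minimal sketch of the call, assuming the first tokenized sentence from `text` and NLTK's `bigrams` helper. The `except` branch is only a stand-in so the snippet still runs where NLTK is not installed; it yields the same pairs.

```python
try:
    from nltk.util import bigrams
except ImportError:
    # Stand-in so the sketch runs without NLTK; same pairs as the helper.
    def bigrams(tokens):
        return zip(tokens, tokens[1:])

first_sentence = ['I', 'need', 'to', 'book', 'ticket', 'to', 'Australia']

# bigrams() returns a lazy iterator, so wrap it in list() to inspect it.
sentence_bigrams = list(bigrams(first_sentence))
print(sentence_bigrams)
# [('I', 'need'), ('need', 'to'), ('to', 'book'),
#  ('book', 'ticket'), ('ticket', 'to'), ('to', 'Australia')]
```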
N-grams can be seen as:
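The same idea with a window of three words, sketched with NLTK's `ngrams` helper on the second tokenized sentence. As before, the `except` branch is an assumption added here so the snippet runs without NLTK.

```python
try:
    from nltk.util import ngrams
except ImportError:
    # Stand-in so the sketch runs without NLTK; same tuples for unpadded input.
    def ngrams(tokens, n):
        return zip(*(tokens[i:] for i in range(n)))

second_sentence = ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare']

# n=3 gives trigrams; change n to get any other window size.
trigrams = list(ngrams(second_sentence, 3))
for gram in trigrams:
    print(gram)
```

The fourth trigram, ('read', 'a', 'book'), places both “read” and the article next to “book” in a single window.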
Now let's try n-grams for text generation. For this, let's load the Trump tweet dataset from Kaggle into a data frame.
import pandas as pd
df = pd.read_csv('../input/trump-tweets/realdonaldtrump.csv')
df.head()
Import the tokenizers:
from nltk import word_tokenize, sent_tokenize
Now tokenize the tweet column:
trump_corpus = list(df['content'].apply(word_tokenize))
Import the padded everygram pipeline:
from nltk.lm.preprocessing import padded_everygram_pipeline
Apply the n-gram preprocessing to the corpus:
# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, trump_corpus)
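To get a feel for what this preprocessing produces, the sketch below pads one short sentence and enumerates every 1-gram to 3-gram by hand. It is a plain-Python approximation of the idea, not NLTK's implementation; the helper names `pad` and `all_grams` are hypothetical.

```python
def pad(tokens, n):
    """Add n-1 start and end symbols around a sentence for an order-n model."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)

def all_grams(tokens, max_n):
    """Enumerate every 1-gram up to max_n-gram in the token list."""
    grams = []
    for size in range(1, max_n + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(tuple(tokens[i:i + size]))
    return grams

padded = pad(["I", "want", "to", "read"], n=3)
print(padded)
# ['<s>', '<s>', 'I', 'want', 'to', 'read', '</s>', '</s>']

grams = all_grams(padded, max_n=3)
print(len(grams))  # 21 (8 unigrams + 7 bigrams + 6 trigrams)
```

The padding symbols give the model a context for the first real word and a way to learn where sentences end.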
Define a maximum likelihood estimation (MLE) model:
from nltk.lm import MLE
trump_model = MLE(n)  # Let's train a 3-gram model; previously we set n=3
trump_model.fit(train_data, padded_sents)
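Conceptually, a maximum likelihood trigram model scores a word by relative frequency: the number of times the word followed a two-word context, divided by the number of times that context was followed by anything. The sketch below works this out by hand on a toy corpus invented for this example; it illustrates the estimate and is not how NLTK stores its counts.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "i want to read a book".split(),
    "i want to book a ticket".split(),
    "i want to read a paper".split(),
]

trigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for i in range(len(sent) - 2):
        context = (sent[i], sent[i + 1])
        trigram_counts[context + (sent[i + 2],)] += 1
        context_counts[context] += 1

def mle_score(word, context):
    """Relative frequency of `word` after the two-word `context`."""
    if context_counts[context] == 0:
        return 0.0  # unseen context: the estimate has nothing to go on
    return trigram_counts[context + (word,)] / context_counts[context]

print(mle_score("read", ("want", "to")))   # 2 of the 3 continuations
print(mle_score("book", ("want", "to")))   # 1 of the 3 continuations
print(mle_score("paper", ("want", "to")))  # 0.0, never seen after this context
```

The zero for unseen combinations is a known weakness of plain maximum likelihood estimates: anything not in the training text gets no probability at all.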
Define a helper that generates tokens from the model and detokenizes them into a sentence:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue  # skip start-of-sentence padding
        if token == '</s>':
            break  # stop at the first end-of-sentence symbol
        content.append(token)
    return detokenize(content)
Generate sentences from the model trained on the text in the data frame:
generate_sent(trump_model, num_words=20, random_seed=42)
generate_sent(trump_model, num_words=10, random_seed=0)
Text generation through n-grams is a basic method. RNNs and LSTMs are used for more refined generation. The context captured by n-grams can carry a lot of noise; stop words can be removed to make the text cleaner.
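Stop-word removal before building n-grams could look like the sketch below. The `STOP_WORDS` set is a small, hand-picked assumption for this example, and `remove_stop_words` is a hypothetical helper; a real project would use a fuller list.

```python
# Small, hand-picked stop-word set, assumed for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "of", "i"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare']
content_tokens = remove_stop_words(tokens)
print(content_tokens)  # ['want', 'read', 'book', 'Shakespeare']

# Bigrams over the cleaned tokens link the content words directly.
content_bigrams = list(zip(content_tokens, content_tokens[1:]))
print(content_bigrams)
# [('want', 'read'), ('read', 'book'), ('book', 'Shakespeare')]
```

There is a trade-off: the cleaned bigrams are less noisy, but cues such as the article in front of “book” are lost, so whether to filter depends on the task.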
Advantages of n-grams
They give insight at different levels (bigram, trigram, n-gram).
Simple and conceptually easy to understand.
Disadvantages of n-grams
We may need to remove stop words to avoid noise in the results.
A count does not necessarily indicate the importance of a word or entity in the text.
Thanks for reading!