Text Summarization Using the spaCy Library

Text summarization in NLP is the task of condensing a long text into a limited number of words while preserving its key message.

There are many strategies for shortening a long text while keeping the most important information. One of them is to calculate word frequencies and then normalize them by dividing each frequency by the maximum frequency.

The sentences containing the highest-frequency words are then scored, and the most important sentences are selected to convey the message.
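As a toy illustration of the normalization step (the sentence below is invented for the example), word frequencies can be computed and scaled like this:

```python
# Toy example of frequency normalization (the text is made up).
text = "the cat sat on the mat because the cat was tired"

freq = {}
for w in text.split():
    freq[w] = freq.get(w, 0) + 1

max_freq = max(freq.values())          # "the" occurs 3 times
normalized = {w: f / max_freq for w, f in freq.items()}

print(normalized["the"])               # 1.0
print(round(normalized["cat"], 2))     # 0.67
```

After normalization, the most frequent word scores 1.0 and every other word falls between 0 and 1, which makes sentence scores comparable regardless of text length.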


Why do we need automatic summarization?

Time optimization: it takes less time to get the gist of a text from its summary.

Indexing can be improved with automatic summarization.

More documents can be processed when summarization is automated.

Automatic summaries tend to be less biased than manual ones.


Text summarization can be of various types:

1) Based on Input Type: the summary can be drawn from a single document or from multiple documents.

2) Based on Purpose: the summary may answer specific queries, be domain-specific, or be generic.

3) Based on Output Type: the output can be abstractive (newly generated sentences) or extractive (sentences taken verbatim from the source).


Steps to Text Summarization:

1) Text Cleaning: remove stop words and punctuation marks, and convert words to lower case.

2) Word Tokenization: tokenize each word in the sentences.

3) Word Frequency Table: count the frequency of each word, then divide each frequency by the maximum frequency to get the normalized word frequencies.

4) Sentence Tokenization and Scoring: split the text into sentences and score each sentence by the frequencies of the words it contains.

5) Summarization: select the highest-scoring sentences as the summary.




Now let's see how to do this using the spaCy library.

First, install and import spaCy, and import its stop-word list.


pip install spacy   # run in a terminal (or prefix with ! in a notebook)
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

Convert the stop words to a list, import the punctuation marks from the string module, and append a newline character to them.

stopwords=list(STOP_WORDS)
from string import punctuation
punctuation=punctuation+ '\n'
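A quick standalone check confirms the newline is now treated like a punctuation mark (string.punctuation is Python's standard set of ASCII punctuation characters):

```python
from string import punctuation

# append the newline so it is filtered out along with punctuation
punctuation = punctuation + '\n'

print('\n' in punctuation)   # True
print('.' in punctuation)    # True
print('a' in punctuation)    # False
```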

Store the text to be summarized in a variable called text.


text="""The human coronavirus was first diagnosed in 1965 by Tyrrell and Bynoe from the respiratory tract sample of an adult with a common cold cultured on human embryonic trachea.1 Naming the virus is based on its crown-like appearance on its surface.2 Coronaviruses (CoVs) are a large family of viruses belonging to the Nidovirales order, which includes Coronaviridae, Arteriviridae, and Roniviridae families.3 Coronavirus contains an RNA genome and belongs to the Coronaviridae family.4 This virus is further subdivided into four groups, ie, the α, β, γ, and δ coronaviruses.5 α- and β-coronavirus can infect mammals, while γ- and δ- coronavirus tend to infect birds.6 Coronavirus in humans causes a range of disorders, from mild respiratory tract infections, such as the common cold to lethal infections, such as the severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS) and Coronavirus disease 2019 (COVID-19). The coronavirus first appeared in the form of severe acute respiratory syndrome coronavirus (SARS-CoV) in Guangdong province, China, in 20027 followed by Middle East respiratory syndrome coronavirus (MERS-CoV) isolated from the sputum of a 60-year-old man who presented symptoms of acute pneumonia and subsequent renal failure in Saudi Arabia in 2012.8 In December 2019, a β-coronavirus was discovered in Wuhan, China. The World Health Organization (WHO) has named the new disease as Coronavirus disease 2019 (COVID-19), and Coronavirus Study Group (CSG) of the International Committee has named it as SARS-CoV-2.9,10 Based on the results of sequencing and evolutionary analysis of the viral genome, bats appear to be responsible for transmitting the virus to humans"""

Load the English model and tokenize the words in the text.


nlp = spacy.load('en_core_web_sm')
doc= nlp(text)
tokens=[token.text for token in doc]
print(tokens)

Calculate word frequencies from the text after removing stop words and punctuation.



word_frequencies = {}
for word in doc:
    # skip stop words and punctuation; store lower-cased keys so that
    # the lookup during sentence scoring (word.text.lower()) matches
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            key = word.text.lower()
            if key not in word_frequencies:
                word_frequencies[key] = 1
            else:
                word_frequencies[key] += 1

Print the word frequencies to see which words are important.


print(word_frequencies)
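If preferred, the same counting can be written more compactly with collections.Counter from Python's standard library; the toy token list and stop-word set below are invented for illustration:

```python
from collections import Counter
from string import punctuation

stopwords = {"the", "a", "is"}                      # toy stop-word set
tokens = ["The", "virus", "is", "a", "virus", ","]  # toy token list

# count lower-cased tokens, skipping stop words and punctuation
word_frequencies = Counter(
    t for t in (tok.lower() for tok in tokens)
    if t not in stopwords and t not in punctuation
)
print(word_frequencies)   # Counter({'virus': 2})
```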

Calculate the maximum frequency and divide each frequency by it to get the normalized word frequencies.


max_frequency=max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word]=word_frequencies[word]/max_frequency

Print normalized word frequencies.

print(word_frequencies)

Get sentence tokens.


sentence_tokens= [sent for sent in doc.sents]
print(sentence_tokens)

Score each sentence by adding up the normalized frequencies of the words it contains; the highest-scoring sentences are the most important.


sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies:
            if sent not in sentence_scores:
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]

Print the sentence scores.


sentence_scores
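To see what this scoring does, here is a toy version with invented sentences and frequencies (plain strings stand in for the spaCy sentence spans):

```python
# Toy sentence scoring: a sentence's score is the sum of the
# normalized frequencies of the words it contains (values invented).
word_frequencies = {"virus": 1.0, "infect": 0.5, "birds": 0.25}

sentences = ["the virus can infect birds", "nothing relevant here"]

sentence_scores = {}
for sent in sentences:
    for word in sent.split():
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

print(sentence_scores)   # {'the virus can infect birds': 1.75}
```

A sentence with no scored words never enters the dictionary, so irrelevant sentences simply drop out of consideration.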

From heapq, import nlargest and select the 30% of sentences with the highest scores.


from heapq import nlargest
select_length=int(len(sentence_tokens)*0.3)
select_length
summary=nlargest(select_length, sentence_scores,key=sentence_scores.get)
summary
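heapq.nlargest simply returns the n keys with the highest values under the given key function; a small standalone check (scores invented):

```python
from heapq import nlargest

scores = {"s1": 0.4, "s2": 1.3, "s3": 0.9}     # invented sentence scores
top_two = nlargest(2, scores, key=scores.get)  # pick the 2 best keys
print(top_two)   # ['s2', 's3']
```

Note that the result comes back in descending score order, not in the order the sentences appear in the original text.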

Join the selected sentences to get the final summary.


final_summary = [sent.text for sent in summary]
final_summary
summary = ' '.join(final_summary)  # join with spaces so sentences don't run together
summary

We get output like the following:

'and β-coronavirus can infect mammals, while γ- and δ- coronavirus tend to infect birds.6 Coronavirus in humans causes a range of disorders, from mild respiratory tract infections, such as the common cold to lethal infections, such as the severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS) and Coronavirus disease 2019 (COVID-19). The coronavirus first appeared in the form of severe acute respiratory syndrome coronavirus (SARS-CoV) in Guangdong province, China, in 20027 followed by Middle East respiratory syndrome coronavirus (MERS-CoV) isolated from the sputum of a 60-year-old man who presented symptoms of acute pneumonia and subsequent renal failure in Saudi Arabia in 2012.8 The World Health Organization (WHO) has named the new disease as Coronavirus disease 2019 (COVID-19), and Coronavirus Study Group (CSG) of the International Committee has named it as SARS-CoV-2.9,10 Based on the results of sequencing and evolutionary analysis of the viral genome, bats appear to be responsible for transmitting the virus to humans'

Conclusion

This is just one way to summarize text: count the most frequently used words and use them to score and select the most important sentences.


There are various other approaches, for example using the nltk library with lexical analysis, part-of-speech tagging and n-grams. We will talk more about those in my next blog.




© Numpy Ninja.