
Using NLP to create a summary

Author: Louis Teo | Compiled by: VK | Source: Towards Data Science

Do you read lots of reports and just want a quick summary of each one? Have you ever been in that situation?

Summarization has become a very helpful way of tackling the data overload of the 21st century. In this article, I'll show you how to create your own personal text summary generator using natural language processing (NLP) in Python.

Preface: a personal text summarizer is not hard to create, and beginners can do it easily!

What is text summarization?

Basically, it is the task of generating an accurate summary that keeps the key information intact without losing the overall meaning.

There are two general types of summarization:

  • Abstractive summarization >> generates new sentences from the original text.
  • Extractive summarization >> identifies the important sentences and uses those sentences to create the summary.

Which summarization method should we use?

I use extractive summarization because I can apply this method to many documents without having to perform a lot of (daunting) machine learning model training.

Besides, extractive summarization gives better summaries than abstractive summarization, because abstractive summarization has to generate new sentences from the original text, which is much harder than identifying important sentences in a data-driven way.

How to create your own text summarizer

We will use a word histogram to rank the importance of sentences and then create the summary. The advantage of this approach is that you don't need to train a model before applying it to your documents.

Text summarization workflow

Here's the workflow we are going to follow…

Import text >> clean the text and split it into sentences >> remove stop words >> build a word histogram >> rank the sentences >> select the top N sentences to form the summary

(1) Sample text

I used the text of a news article titled "Apple Acquires AI Startup For $50 Million To Advance Its Apps". You can find the original news article here: https://analyticsindiamag.com/apple-acquires-ai-startup-for-50-million-to-advance-its-apps/

You can also download the text file from my GitHub: https://github.com/louisteo9/personal-text-summarizer

(2) Import libraries

# Natural Language Toolkit (NLTK)
import nltk
nltk.download('stopwords')

# Regular expressions for text preprocessing
import re

# Heap queue algorithm to find the top sentences
import heapq

# NumPy for numerical computation
import numpy as np

# pandas for creating data frames
import pandas as pd

# matplotlib for plotting
from matplotlib import pyplot as plt
%matplotlib inline

(3) Import text and perform preprocessing

There are many ways to do this. The goal here is to have clean text that we can feed into our model.

# Load the text file
with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
    file_data = f.read()

Here, we use regular expressions for text preprocessing. We will:

(A) replace reference numbers such as [1], [10], [20] with spaces (if there are any), and

(B) replace one or more spaces with a single space.

text = file_data
# Replace reference numbers with spaces, if any
text = re.sub(r'\[[0-9]*\]',' ',text) 

#  Replace one or more spaces with a single space 
text = re.sub(r'\s+',' ',text)

Then, we form a clean, lowercase text (without special characters, numbers and extra spaces) and split it into individual words, to be used for word score calculation and for building the word histogram.

The reason for forming a clean text is so that the algorithm won't treat, say, "Understanding" and "understanding" as two different words.

#  Convert all uppercase characters to lowercase characters 
clean_text = text.lower()

# Replace characters other than [a-zA-Z0-9] with a space
clean_text = re.sub(r'\W',' ',clean_text) 

#  Replace numbers with spaces 
clean_text = re.sub(r'\d',' ',clean_text) 

#  Replace one or more spaces with a single space 
clean_text = re.sub(r'\s+',' ',clean_text)

(4) Split the text into sentences

We use NLTK's sent_tokenize method to split the text into sentences. We will evaluate the importance of each sentence and then decide whether it should be included in the summary.

sentences = nltk.sent_tokenize(text)
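
A quick, optional sanity check to confirm the split looks reasonable:

# Inspect the number of sentences and the first one
print(len(sentences))
print(sentences[0])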

(5) Remove stop words

Stop words are English words that do not add much meaning to a sentence. They can be safely ignored without sacrificing the meaning of the sentence. We already downloaded the file containing the English stop words when we ran nltk.download('stopwords') earlier.

Here, we'll load the list of stop words and store it in the stop_words variable.

# Get the list of stop words
stop_words = nltk.corpus.stopwords.words('english')
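
To get a feel for what will be filtered out, you can peek at the list (the exact contents depend on your NLTK version):

# Inspect the stop word list
print(len(stop_words))   # typically around 179 words for English
print(stop_words[:10])   # e.g. ['i', 'me', 'my', 'myself', 'we', ...]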

(6) Build the word histogram

Let's evaluate the importance of each word according to the number of times it appears throughout the text .

We will do this by (1) splitting the clean text into words, (2) removing the stop words, and (3) counting the frequency of each word in the text.

# Create an empty dictionary to store the word counts
word_count = {}

# Loop through the tokenized words, remove stop words and save the word counts in the dictionary
for word in nltk.word_tokenize(clean_text):
    # Remove stop words
    if word not in stop_words:
        # Save the word count in the dictionary
        if word not in word_count.keys():
            word_count[word] = 1
        else:
            word_count[word] += 1
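
As an aside, the same counts can be built more compactly with collections.Counter from the standard library. This is just an equivalent, more idiomatic sketch of the loop above, not what the original notebook uses:

from collections import Counter

# Count every tokenized word that is not a stop word
word_count_alt = Counter(
    word for word in nltk.word_tokenize(clean_text)
    if word not in stop_words
)

Since Counter is a dict subclass, word_count_alt could be used everywhere word_count is used below.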

Let's plot the word histogram and look at the result.

plt.figure(figsize=(16,10))
plt.xticks(rotation = 90)
plt.bar(word_count.keys(), word_count.values())
plt.show()

Let's convert it into a horizontal bar chart that shows only the top 20 words, using a helper function.

# Helper function to plot the top words
def plot_top_words(word_count_dict, show_top_n=20):
    word_count_table = pd.DataFrame.from_dict(word_count_dict, orient = 'index').rename(columns={0: 'score'})
    
    word_count_table.sort_values(by='score').tail(show_top_n).plot(kind='barh', figsize=(10,10))
    plt.show()

Let's display the top 20 words.

plot_top_words(word_count, 20)

From the chart above, we can see that the words "ai" and "apple" appear at the top. That makes sense, since the article is about Apple's acquisition of an AI startup.

(7) Rank the sentences by score

Now, we will rank the importance of each sentence by its sentence score. We will:

  • discard sentences with more than 30 words, on the grounds that long sentences are not always meaningful;

  • then, add up the score of each word in a sentence to form the sentence score.

Sentences with high scores will rank at the top. The top sentences will form our summary.

Note: in my experience, any cutoff between 25 and 30 words should give you a good summary.

# Create an empty dictionary to store the sentence scores
sentence_score = {}

# Loop through the tokenized sentences, take only sentences with fewer than 30 words, then add the word scores to form the sentence scores
for sentence in sentences:
    # Check whether the words in the sentence appear in the word count dictionary
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count.keys():
            # Only accept sentences with fewer than 30 words
            if len(sentence.split(' ')) < 30:
                # Add the word score to the sentence score
                if sentence not in sentence_score.keys():
                    sentence_score[sentence] = word_count[word]
                else:
                    sentence_score[sentence] += word_count[word]

We will convert the sentence-score dictionary into a data frame and display sentence_score.

Note: a dictionary gives us no convenient way to display the sentences sorted by score, so we convert the data stored in the dictionary into a DataFrame.

df_sentence_score = pd.DataFrame.from_dict(sentence_score, orient = 'index').rename(columns={0: 'score'})
df_sentence_score.sort_values(by='score', ascending = False)
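
If you prefer to stay in plain Python, you can also sort the dictionary items directly; a minimal sketch, not in the original notebook:

# Sort (sentence, score) pairs by score, highest first, and preview the top 5
for sentence, score in sorted(sentence_score.items(), key=lambda item: item[1], reverse=True)[:5]:
    print(score, '|', sentence)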

(8) Select the top sentences as the summary

We use the heap queue algorithm to select the top 3 sentences and store them in the best_sentences variable.

Usually 3 to 5 sentences are enough. Depending on the length of the document, you can change the number of top sentences to display.

In this case, I chose 3 because our text is relatively short.

#  Show the best three sentences as a summary          
best_sentences = heapq.nlargest(3, sentence_score, key=sentence_score.get)

Let's use print and a for loop to display the summary text.

print('SUMMARY')
print('------------------------')

# Display the top sentences following their order in the original text
for sentence in sentences:
    if sentence in best_sentences:
        print(sentence)

You can get the Jupyter notebook from my GitHub. There you'll also find an executable Python file that you can use right away to summarize your own text: https://github.com/louisteo9/personal-text-summarizer
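
For reference, here is a minimal sketch that condenses the steps above into a single reusable function (the standalone file on GitHub may differ slightly):

import re
import heapq
import nltk

def summarize(text, top_n=3, max_words=30):
    # Clean the text: remove reference numbers and normalize whitespace
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'\s+', ' ', text)

    # Lowercase copy without special characters, digits or extra spaces, used for scoring
    clean_text = text.lower()
    clean_text = re.sub(r'\W', ' ', clean_text)
    clean_text = re.sub(r'\d', ' ', clean_text)
    clean_text = re.sub(r'\s+', ' ', clean_text)

    sentences = nltk.sent_tokenize(text)
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # Word histogram over the clean text, excluding stop words
    word_count = {}
    for word in nltk.word_tokenize(clean_text):
        if word not in stop_words:
            word_count[word] = word_count.get(word, 0) + 1

    # Score each sufficiently short sentence by summing its word counts
    sentence_score = {}
    for sentence in sentences:
        if len(sentence.split(' ')) < max_words:
            for word in nltk.word_tokenize(sentence.lower()):
                if word in word_count:
                    sentence_score[sentence] = sentence_score.get(sentence, 0) + word_count[word]

    # Pick the top_n sentences, then restore their original order
    best = set(heapq.nlargest(top_n, sentence_score, key=sentence_score.get))
    return '\n'.join(s for s in sentences if s in best)

# Usage:
# with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
#     print(summarize(f.read()))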

Let's see the algorithm in action!

Here is the original text of the news article titled "Apple Acquires AI Startup For $50 Million To Advance Its Apps" (you can find the original article here): https://analyticsindiamag.com/apple-acquires-ai-startup-for-50-million-to-advance-its-apps/

In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup — Vilynx for approximately $50 million.

Reported by Bloomberg, the AI startup — Vilynx is headquartered in Barcelona, which is known to build software using computer vision to analyse a video’s visual, text, and audio content with the goal of “understanding” what’s in the video. This helps it categorising and tagging metadata to the videos, as well as generate automated video previews, and recommend related content to users, according to the company website.

Apple told the media that the company typically acquires smaller technology companies from time to time, and with the recent buy, the company could potentially use Vilynx’s technology to help improve a variety of apps. According to the media, Siri, search, Photos, and other apps that rely on Apple are possible candidates as are Apple TV, Music, News, to name a few that are going to be revolutionised with Vilynx’s technology.

With CEO Tim Cook’s vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.

The purchase will also advance Apple’s AI expertise, adding up to 50 engineers and data scientists joining from Vilynx, and the startup is going to become one of Apple’s key AI research hubs in Europe, according to the news.

Apple has made significant progress in the space of artificial intelligence over the past few months, with this purchase of UK-based Spectral Edge last December, Seattle-based Xnor.ai for $200 million and Voysis and Inductiv to help it improve Siri. With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space. In 2018, CEO Tim Cook said in an interview that the company had bought 20 companies over six months, while only six were public knowledge.

And here is the summarized text:

SUMMARY
------------------------
In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup — Vilynx for approximately $50 million.
With CEO Tim Cook’s vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.
With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space.

Conclusion

Congratulations! You have created your own personal text summarizer in Python. I hope the summary looks good.

Link to the original article: https://towardsdatascience.com/report-is-too-long-to-read-use-nlp-to-create-a-summary-6f5f7801d355
