letiantian / textrank4zh: automatically extract keywords and summaries from Chinese text

Posted by truschel at 2020-02-27
The TextRank algorithm can be used to extract keywords and summaries (the most important sentences) from a text. TextRank4ZH is a Python implementation of TextRank for Chinese text.

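Install

Option 1 (from the source directory, for the current user):

$ python setup.py install --user

Option 2 (from the source directory, system-wide):

$ sudo python setup.py install

Option 3 (with pip, for the current user):

$ pip install textrank4zh --user

Option 4 (with pip, system-wide):

$ sudo pip install textrank4zh

On Python 3, replace python with python3 and pip with pip3 in the commands above.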

Uninstall

$ pip uninstall textrank4zh

Dependencies

jieba >= 0.35
numpy >= 1.7.1
networkx >= 1.9.1

Compatibility

The tests pass on Python 2.7.9 and Python 3.4.3.

Principle

For the details of how TextRank works, see:

Mihalcea R, Tarau P. TextRank: Bringing order into texts[C]. Association for Computational Linguistics, 2004.

An introduction to the principle and application of TextRank: Using the TextRank algorithm to generate keywords and summaries for text

Keyword extraction

Split the original text into sentences, optionally filter the stop words out of each sentence, and optionally keep only the words with the specified parts of speech. This yields a set of sentences and a set of words.

Each word is a node in the PageRank graph. Let the window size be k, and suppose a sentence consists of the following words in order:

w1, w2, w3, w4, w5, ..., wn

Then each of the following runs of k consecutive words is a window:

w1, w2, ..., wk
w2, w3, ..., wk+1
w3, w4, ..., wk+2

There is an undirected, unweighted edge between the nodes corresponding to any two words in the same window.

From this graph, the importance of each word node can be computed with PageRank. The most important words are taken as keywords.
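The steps above can be sketched in plain Python. This is a minimal illustration, not the library's code: textrank4zh itself depends on networkx, while this sketch uses a hand-written power iteration. The function names, the damping factor of 0.85, and the fixed iteration count are assumptions made for the example, and the input is assumed to be already segmented and filtered.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, window=2):
    """Link every pair of distinct words that fall inside the same window."""
    graph = defaultdict(set)
    for words in sentences:
        # A sentence shorter than the window is treated as one single window.
        last_start = max(len(words) - window, 0)
        for start in range(last_start + 1):
            for a, b in combinations(words[start:start + window], 2):
                if a != b:
                    graph[a].add(b)
                    graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, unweighted graph."""
    nodes = list(graph)
    if not nodes:
        return {}
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / len(nodes)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in nodes
        }
    return scores


# Already segmented and filtered sentences (one list of words per sentence).
sentences = [
    ["酒店", "位于", "北京", "东三环", "摆放", "雕塑", "文艺", "气息"],
    ["答谢", "宴于", "晚上"],
]
graph = build_cooccurrence_graph(sentences, window=2)
scores = pagerank(graph)

# Print the five highest-scoring words as keyword candidates.
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(word, round(score, 4))
```

With a window of 2, only directly neighbouring words are linked, so a word in the middle of a sentence scores higher than one at either end. A larger window links words that are further apart.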

Key phrase extraction

First extract a number of keywords as described under keyword extraction. If several of these keywords are adjacent to each other in the original text, they can be combined into a key phrase.

For example, in an article that introduces the support vector machine (支持向量机), the keywords support (支持), vector (向量) and machine can be found. Key phrase extraction then yields support vector machine (支持向量机).
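The merging step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's implementation: the function name, the min_occur threshold, and the joiner parameter are hypothetical, and the keywords are assumed to have been extracted already.

```python
def extract_keyphrases(words, keywords, min_occur=1, joiner=" "):
    """Join runs of two or more adjacent keywords into candidate key phrases."""
    keyword_set = set(keywords)
    counts = {}
    run = []
    # The trailing None is a sentinel that flushes a run ending at the last word.
    for word in list(words) + [None]:
        if word in keyword_set:
            run.append(word)
            continue
        if len(run) > 1:
            phrase = joiner.join(run)
            counts[phrase] = counts.get(phrase, 0) + 1
        run = []
    # Keep only phrases that occur at least min_occur times in the text.
    return [phrase for phrase, n in counts.items() if n >= min_occur]


words = (
    "a support vector machine finds the separating hyperplane ; "
    "the support vector machine is a classifier"
).split()
keywords = {"support", "vector", "machine", "classifier"}

phrases = extract_keyphrases(words, keywords, min_occur=2)
print(phrases)  # → ['support vector machine']
```

For Chinese text, where words are written without spaces, passing joiner="" would produce phrases such as 支持向量机.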

Summary generation

Each sentence is treated as a node in the graph. If two sentences are similar, there is an undirected, weighted edge between the corresponding nodes, and the weight of the edge is the similarity.

The sentences ranked most important by the PageRank algorithm can be used as the summary.
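A rough sketch of this ranking is shown below. The text above does not say which similarity measure is used, so the sketch assumes the word-overlap measure from the TextRank paper cited under Principle: the number of shared words divided by the sum of the logarithms of the two sentence lengths. The function names, damping factor, and iteration count are likewise assumptions for illustration.

```python
import math


def sentence_similarity(words_a, words_b):
    """Word overlap normalised by log sentence lengths (assumed measure)."""
    if not words_a or not words_b:
        return 0.0
    common = len(set(words_a) & set(words_b))
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if common == 0 or denominator <= 0:
        return 0.0
    return common / denominator


def rank_sentences(tokenized, damping=0.85, iterations=100):
    """Weighted PageRank over the sentence-similarity graph."""
    n = len(tokenized)
    if n == 0:
        return [], []
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = sentence_similarity(
                tokenized[i], tokenized[j]
            )
    total_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each neighbour j passes on its score in proportion to the edge weight.
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / total_weight[j] * scores[j]
                for j in range(n)
                if weights[j][i] > 0
            )
            for i in range(n)
        ]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


# Each sentence is given as a list of words (already segmented).
tokenized = [
    ["textrank", "builds", "a", "graph"],
    ["the", "graph", "links", "similar", "sentences"],
    ["textrank", "ranks", "sentences", "in", "the", "graph"],
    ["lunch", "was", "good"],
]
order, scores = rank_sentences(tokenized)
print(order[:2])  # indices of the two sentences to use as the summary
```

In this toy input the third sentence shares words with both of the first two, so it ranks first, while the unrelated last sentence has no edges and ranks last.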

Example

See the example and test directories.

example/example01.py:

The output is as follows:

关键词:
媒体 0.02155864734852778
高圆圆 0.020220281898126486
微 0.01671909730824073
宾客 0.014328439104001788
赵又廷 0.014035488254875914
答谢 0.013759845912857732
谢娜 0.013361244496632448
现身 0.012724133346018603
记者 0.01227742092899235
新人 0.01183128428494362
北京 0.011686712993089671
博 0.011447168887452668
展示 0.010889176260920504
捧场 0.010507502237123278
礼物 0.010447275379792245
张杰 0.009558332870902892
当晚 0.009137982757893915
戴 0.008915271161035208
酒店 0.00883521621207796
外套 0.008822082954131174

关键短语:
微博

摘要:
摘要:
0 0.0709719557171 中新网北京12月1日电(记者 张曦) 30日晚,高圆圆和赵又廷在京举行答谢宴,诸多明星现身捧场,其中包括张杰(微博)、谢娜(微博)夫妇、何炅(微博)、蔡康永(微博)、徐克、张凯丽、黄轩(微博)等
6 0.0541037236415 高圆圆身穿粉色外套,看到大批记者在场露出娇羞神色,赵又廷则戴着鸭舌帽,十分淡定,两人快步走进电梯,未接受媒体采访
27 0.0490428312984 记者了解到,出席高圆圆、赵又廷答谢宴的宾客近百人,其中不少都是女方的高中同学

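Usage notes

When processing a piece of text, TextRank4Keyword and TextRank4Sentence split it into four forms:

sentences: the list of sentences.
words_no_filter: the words of each sentence after segmentation, with nothing filtered out.
words_no_stop_words: words_no_filter with the stop words removed.
words_all_filters: words_no_stop_words with only the words of the specified parts of speech kept.

For example, given the text:

这间酒店位于北京东三环,里面摆放很多雕塑,文艺气息十足。答谢宴于晚上8点开始。

The result is as follows: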

sentences:
这间酒店位于北京东三环,里面摆放很多雕塑,文艺气息十足
答谢宴于晚上8点开始

words_no_filter
这/间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点/开始

words_no_stop_words
间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点

words_all_filters
酒店/位于/北京/东三环/摆放/雕塑/文艺/气息
答谢/宴于/晚上
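How each form is derived from the one before it can be sketched as below. Everything specific here is an assumption made for illustration: the sentence delimiters, the stop-word set, the allowed part-of-speech tags, and the hand-assigned tags are hypothetical, not the library's defaults or real segmenter output. Word segmentation itself is assumed to be done already (the library relies on jieba for it).

```python
import re

# Hypothetical settings, chosen only so that the sketch mirrors the result above.
SENTENCE_DELIMITERS = "。！？!?；;"
STOP_WORDS = {"这", "开始"}
ALLOWED_POS = {"n", "ns", "v", "t"}

text = "这间酒店位于北京东三环,里面摆放很多雕塑,文艺气息十足。答谢宴于晚上8点开始。"

# sentences: split on sentence-ending punctuation and drop empty pieces.
sentences = [
    s for s in re.split("[" + re.escape(SENTENCE_DELIMITERS) + "]", text) if s
]

# Hand-assigned (word, tag) pairs standing in for segmenter output.
tagged_sentences = [
    [("这", "r"), ("间", "q"), ("酒店", "n"), ("位于", "v"), ("北京", "ns"),
     ("东三环", "ns"), ("里面", "f"), ("摆放", "v"), ("很多", "m"),
     ("雕塑", "n"), ("文艺", "n"), ("气息", "n"), ("十足", "a")],
    [("答谢", "v"), ("宴于", "v"), ("晚上", "t"), ("8", "m"), ("点", "q"),
     ("开始", "v")],
]

# words_no_filter: every segmented word, nothing removed.
words_no_filter = [[w for w, _ in sent] for sent in tagged_sentences]

# words_no_stop_words: drop stop words but keep every part of speech.
no_stop_tagged = [
    [(w, tag) for w, tag in sent if w not in STOP_WORDS]
    for sent in tagged_sentences
]
words_no_stop_words = [[w for w, _ in sent] for sent in no_stop_tagged]

# words_all_filters: additionally keep only the allowed parts of speech.
words_all_filters = [
    [w for w, tag in sent if tag in ALLOWED_POS] for sent in no_stop_tagged
]

for name, rows in [
    ("words_no_filter", words_no_filter),
    ("words_no_stop_words", words_no_stop_words),
    ("words_all_filters", words_all_filters),
]:
    print(name)
    for row in rows:
        print("/".join(row))
```

Each form is a stricter filter of the previous one, so the choice of form is a trade-off between keeping context and removing noise.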

API

TODO.

For the implementation of each class and the parameters of each function, please refer to the comments in the source code.

License

MIT