The TextRank algorithm can be used to extract keywords and summaries (important sentences) from text. TextRank4ZH is a Python implementation of the TextRank algorithm for Chinese text.
Install
Method 1:
$ python setup.py install --user
Method 2:
$ sudo python setup.py install
Method 3:
$ pip install textrank4zh --user
Method 4:
$ sudo pip install textrank4zh
For Python 3, replace python with python3 and pip with pip3 in the commands above.
Uninstall
$ pip uninstall textrank4zh
Dependencies
jieba >= 0.35
numpy >= 1.7.1
networkx >= 1.9.1
Compatibility
Tested on Python 2.7.9 and Python 3.4.3.
Principle
For a detailed description of TextRank, see:
Mihalcea R, Tarau P. TextRank: Bringing order into texts[C]. Association for Computational Linguistics, 2004.
For the principle and application of TextRank, see: Using the TextRank algorithm to generate keywords and summaries for text
Keyword extraction
Split the original text into sentences, optionally filter out the stop words in each sentence, and optionally keep only the words with the specified parts of speech. This yields a set of sentences and a set of words.
Each word is a node in the graph that PageRank runs on. Let the window size be k, and suppose a sentence consists of the following words in order:
w1, w2, w3, w4, w5, ..., wn
Then [w1, w2, ..., wk], [w2, w3, ..., wk+1], [w3, w4, ..., wk+2], and so on are all windows. There is an undirected, unweighted edge between the nodes corresponding to any two words in the same window.
Based on the graph built above, the importance of each word node can be computed. The most important words are used as keywords.
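The window and graph construction described above can be sketched in plain Python. This is a minimal illustration, not the code used by TextRank4ZH: the function names build_cooccurrence_graph and pagerank, the damping factor of 0.85, and the fixed iteration count are all assumptions made for the example, and the input is assumed to be already segmented and filtered.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, window=2):
    """Link every pair of distinct words that share a window of `window` consecutive words."""
    graph = defaultdict(set)
    for words in sentences:
        for word in words:
            graph[word]  # touching the key registers the word as a node
        # A sentence shorter than the window forms a single window.
        last_start = max(len(words) - window, 0)
        for start in range(last_start + 1):
            for a, b in combinations(words[start:start + window], 2):
                if a != b:
                    graph[a].add(b)  # undirected, unweighted edge
                    graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected, unweighted graph."""
    nodes = list(graph)
    if not nodes:
        return {}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Toy input: already segmented and filtered sentences (an assumption for this sketch).
sentences = [
    ["酒店", "位于", "北京", "东三环", "摆放", "雕塑", "文艺", "气息"],
    ["答谢", "宴于", "晚上"],
]
scores = pagerank(build_cooccurrence_graph(sentences, window=2))
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(word, round(score, 4))
```

A larger window links more word pairs and makes the graph denser. Words with no neighbours keep only the baseline score in this sketch.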
Key phrase extraction
First extract several keywords as described under keyword extraction. If several of these keywords are adjacent to each other in the original text, they can be combined into a key phrase.
For example, in an article introducing support vector machines (支持向量机), the keywords 支持 (support), 向量 (vector), and 机 (machine) may be found. Key phrase extraction then yields 支持向量机.
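The merging step can be sketched as follows. This is an illustrative sketch rather than the library's implementation: the function name extract_keyphrases and its min_occurrences parameter are hypothetical, and the input is assumed to be segmented sentences plus a set of keywords that has already been extracted.

```python
def extract_keyphrases(sentences, keywords, min_occurrences=1):
    """Join runs of two or more adjacent keywords into candidate key phrases.

    sentences: list of word lists; keywords: set of keyword strings.
    """
    counts = {}
    for words in sentences:
        run = []
        for word in words + [None]:  # the None sentinel flushes the final run
            if word in keywords:
                run.append(word)
            else:
                if len(run) > 1:
                    # Chinese words are joined without a separator.
                    phrase = "".join(run)
                    counts[phrase] = counts.get(phrase, 0) + 1
                run = []
    return [phrase for phrase, count in counts.items() if count >= min_occurrences]


sentences = [["支持", "向量", "机", "是", "一种", "分类", "模型"]]
keywords = {"支持", "向量", "机"}
print(extract_keyphrases(sentences, keywords))  # prints ['支持向量机']
```

Raising min_occurrences keeps only phrases that appear repeatedly, which filters out accidental adjacency.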
Summary generation
Each sentence is treated as a node in the graph. If two sentences are similar, there is an undirected weighted edge between the corresponding nodes, and the weight is their similarity.
The most important sentences, as ranked by the PageRank algorithm, can be used as the summary.
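A minimal sketch of sentence ranking follows. It assumes the word-overlap similarity from the TextRank paper cited above (shared words divided by the sum of the logarithms of the two sentence lengths); whether TextRank4ZH uses exactly this measure is not stated here. The function names, the damping factor, and the iteration count are assumptions for illustration.

```python
import math


def sentence_similarity(words_a, words_b):
    """Overlap similarity: shared words / (log|A| + log|B|)."""
    shared = len(set(words_a) & set(words_b))
    if shared == 0:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator <= 0:  # both sentences contain a single word
        return 0.0
    return shared / denominator


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph.

    Returns (index, score) pairs, most important sentence first.
    """
    n = len(sentences)
    if n == 0:
        return []
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = sentence_similarity(sentences[i], sentences[j])
    total_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence passes its score to neighbours in proportion to edge weight.
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / total_weight[j] * scores[j]
                for j in range(n) if weights[j][i] > 0
            )
            for i in range(n)
        ]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy input: segmented sentences (an assumption for this sketch).
sentences = [
    ["高圆圆", "赵又廷", "答谢", "宾客"],
    ["高圆圆", "赵又廷", "媒体", "记者"],
    ["酒店", "位于", "北京"],
]
for index, score in rank_sentences(sentences):
    print(index, round(score, 4))
```

The third sentence shares no words with the others, so it has no edges and receives only the baseline score.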
Example
See the example and test directories.
Running example/example01.py produces the following output (关键词 = keywords, 关键短语 = key phrases, 摘要 = summary):
关键词:
媒体 0.02155864734852778
高圆圆 0.020220281898126486
微 0.01671909730824073
宾客 0.014328439104001788
赵又廷 0.014035488254875914
答谢 0.013759845912857732
谢娜 0.013361244496632448
现身 0.012724133346018603
记者 0.01227742092899235
新人 0.01183128428494362
北京 0.011686712993089671
博 0.011447168887452668
展示 0.010889176260920504
捧场 0.010507502237123278
礼物 0.010447275379792245
张杰 0.009558332870902892
当晚 0.009137982757893915
戴 0.008915271161035208
酒店 0.00883521621207796
外套 0.008822082954131174
关键短语:
微博
摘要:
0 0.0709719557171 中新网北京12月1日电(记者 张曦) 30日晚,高圆圆和赵又廷在京举行答谢宴,诸多明星现身捧场,其中包括张杰(微博)、谢娜(微博)夫妇、何炅(微博)、蔡康永(微博)、徐克、张凯丽、黄轩(微博)等
6 0.0541037236415 高圆圆身穿粉色外套,看到大批记者在场露出娇羞神色,赵又廷则戴着鸭舌帽,十分淡定,两人快步走进电梯,未接受媒体采访
27 0.0490428312984 记者了解到,出席高圆圆、赵又廷答谢宴的宾客近百人,其中不少都是女方的高中同学
Usage
When processing a piece of text, TextRank4Keyword and TextRank4Sentence split it into the following four formats:
- sentences: a list of sentences.
- words_no_filter: a two-level list obtained by segmenting each sentence in sentences into words.
- words_no_stop_words: a two-level list obtained by removing the stop words from words_no_filter.
- words_all_filters: a two-level list obtained by keeping only the words with the specified parts of speech in words_no_stop_words.
For example, given the text:
这间酒店位于北京东三环,里面摆放很多雕塑,文艺气息十足。答谢宴于晚上8点开始。
The results are as follows:
sentences:
这间酒店位于北京东三环,里面摆放很多雕塑,文艺气息十足
答谢宴于晚上8点开始
words_no_filter:
这/间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点/开始
words_no_stop_words:
间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点
words_all_filters:
酒店/位于/北京/东三环/摆放/雕塑/文艺/气息
答谢/宴于/晚上
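The relationship between the four formats can be sketched without the library. Everything in this sketch is an assumption made for illustration: the delimiter set, the stop word set, the allowed parts of speech, and the part-of-speech tags, which are assigned by hand here rather than produced by jieba. Segmentation itself is skipped, so the tagged words are written out directly with punctuation already dropped.

```python
import re

TEXT = "这间酒店位于北京东三环,里面摆放很多雕塑,文艺气息十足。答谢宴于晚上8点开始。"

# Hypothetical settings, chosen only so the sketch reproduces the example above.
STOP_WORDS = {"这", "开始"}
ALLOWED_POS = {"n", "ns", "v", "t"}


def split_sentences(text):
    """Split on sentence-ending punctuation and drop empty pieces."""
    return [s for s in re.split(r"[。！？；!?;\n]+", text) if s.strip()]


# Hand-tagged (word, part of speech) pairs; real tags would come from a segmenter.
TAGGED = [
    [("这", "r"), ("间", "q"), ("酒店", "n"), ("位于", "v"), ("北京", "ns"),
     ("东三环", "ns"), ("里面", "f"), ("摆放", "v"), ("很多", "m"),
     ("雕塑", "n"), ("文艺", "n"), ("气息", "n"), ("十足", "z")],
    [("答谢", "v"), ("宴于", "v"), ("晚上", "t"), ("8", "m"), ("点", "q"),
     ("开始", "v")],
]

sentences = split_sentences(TEXT)
words_no_filter = [[word for word, _ in sent] for sent in TAGGED]
# Stop words are removed first; the part-of-speech filter applies to what remains.
no_stop = [[(w, p) for w, p in sent if w not in STOP_WORDS] for sent in TAGGED]
words_no_stop_words = [[w for w, _ in sent] for sent in no_stop]
words_all_filters = [[w for w, p in sent if p in ALLOWED_POS] for sent in no_stop]

for name, value in [("words_no_filter", words_no_filter),
                    ("words_no_stop_words", words_no_stop_words),
                    ("words_all_filters", words_all_filters)]:
    print(name + ":")
    for words in value:
        print("/".join(words))
```

Each format is derived from the previous one, so words_all_filters is always a subset of words_no_stop_words, which in turn is a subset of words_no_filter.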
API
TODO.
For now, please refer to the comments in the source code for the implementation of the classes and the parameters of the functions.
License
MIT