"Dumb ha" Chinese participle, faster or more accurate, is defined by you. Through simple customization, make the segmentation module more suitable for your needs. "Yaha" You can custom your Chinese Word Segmentation efficiently by using Yaha
P.S. There is also a crfseg that wraps CRF++; at present it is rough but suitable for learning.
The word-discovery feature used to be implemented in extra/seqword.cpp; it has since been upgraded and optimized into a standalone project (see the project link).
Using multithreading and MapReduce-like processing, it can handle 50 MB+ of text and automatically extract technical terms, personal names, place names, and other words, which can then be added to the segmentation dictionary.
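The standalone word-discovery project is written in C++, but the chunked, MapReduce-style flow described above can be sketched in Python for illustration. Everything below is a conceptual sketch only (the input file name, the n-gram counting, and the scoring step are hypothetical), not the project's actual code:

```python
# Conceptual sketch of MapReduce-style candidate-word counting over a large file.
# This is NOT the standalone project's code; names and the scoring step are illustrative.
from collections import Counter
from multiprocessing import Pool

def map_chunk(text):
    """Map step: count candidate substrings (here simply 2-4 character n-grams)."""
    counts = Counter()
    for n in range(2, 5):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-chunk counters."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == '__main__':
    with open('corpus.txt', encoding='utf-8') as f:   # hypothetical input file
        text = f.read()
    chunks = [text[i:i + 1000000] for i in range(0, len(text), 1000000)]
    with Pool() as pool:
        merged = reduce_counts(pool.map(map_chunk, chunks))
    # A real implementation would now score candidates (e.g. by frequency,
    # cohesion, and boundary entropy) and keep only likely new words.
    print(merged.most_common(20))
```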
pip install yaha
QQ discussion group (also the discussion group for a VxWorks-like kernel project): 274983126
Code deployed on GAE: http://yahademo.appspot.com
Code deployed on SAE: http://yaha.sinaapp.com
The original address is no longer used: http://yaha.v-find.com/
Example code: https://github.com/jannson/yaha/blob/master/tests/test_cutter.py
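For quick orientation, basic usage looks roughly like the sketch below. It assumes the `Cuttor` class and its `cut()` method as exercised in the linked test_cutter.py; treat the exact names and return types as assumptions and consult the test file for the authoritative API.

```python
# -*- coding: utf-8 -*-
# Minimal usage sketch, assuming the Cuttor API used in tests/test_cutter.py.
from yaha import Cuttor

cuttor = Cuttor()
sentence = u'我爱北京天安门'
# cut() is assumed to yield the segmented words in order.
print('/'.join(cuttor.cut(sentence)))
```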
Basic functions:
- Precise mode: cuts a sentence into the most reasonable sequence of words.
- Full segmentation: extracts all possible words, without resolving ambiguity.
- Search-engine mode: on top of the precise cut, long words are segmented again to improve recall, which is suitable for building a search-engine index (see the sketch after this list).
- Alternative paths: generates the best several segmentation paths, from which a more accurate segmentation can be chosen using additional information.
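To make the search-engine mode concrete: after a precise cut, each long token is scanned again and any shorter dictionary words found inside it are also emitted, which raises recall when building an index. The sketch below only illustrates that idea; it is not yaha's internal code, and the tiny dictionary is hypothetical.

```python
# Conceptual sketch of "search engine mode": re-segment long tokens so that
# shorter dictionary words inside them are also indexed. Not yaha's actual code.
WORD_DICT = {u'中华', u'人民', u'共和国', u'中华人民共和国'}   # hypothetical dictionary

def cut_for_search(tokens, min_len=2):
    for tok in tokens:
        yield tok                              # always keep the precise-mode token
        if len(tok) <= min_len:
            continue
        for n in range(min_len, len(tok)):     # emit shorter in-dictionary substrings
            for i in range(len(tok) - n + 1):
                sub = tok[i:i + n]
                if sub in WORD_DICT:
                    yield sub

print(list(cut_for_search([u'中华人民共和国'])))
# ['中华人民共和国', '中华', '人民', '共和国']
```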
Available plug-ins:
- Regular expression plug-in
- Name-prefix plug-in
- Place-name-suffix plug-in
- Custom functions: the segmentation process has four stages, each of which can be customized (a conceptual stage-1 sketch follows this list).
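As an illustration of what the regular-expression plug-in and stage 1 do conceptually, the sketch below splits numbers and English runs out of a sentence as fixed tokens before any dictionary-based segmentation would run. It is not yaha's plug-in API; the regex and function names are made up for the example.

```python
# Conceptual stage-1 illustration: split out numbers and English words with a
# regex so they become fixed tokens that later stages leave untouched.
# This is not yaha's plug-in API, just the idea behind it.
import re

STAGE1_RE = re.compile(r'[A-Za-z]+|\d+(?:\.\d+)?')

def stage1_precut(sentence):
    """Yield (text, is_fixed) pieces; fixed pieces skip dictionary segmentation."""
    pos = 0
    for m in STAGE1_RE.finditer(sentence):
        if m.start() > pos:
            yield sentence[pos:m.start()], False   # still needs normal segmentation
        yield m.group(0), True                     # fixed token (number / English)
        pos = m.end()
    if pos < len(sentence):
        yield sentence[pos:], False

print(list(stage1_precut(u'价格是3.5元或5USD')))
# [('价格是', False), ('3.5', True), ('元或', False), ('5', True), ('USD', True)]
```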
Additional features:
- New-word learning: feed in large blocks of text and learn both new and existing words from the content. (A C++ version of the maximum-entropy new-word discovery, implemented by a friend, has been added; it is about 10 times faster than the Python version.)
- Extract keywords from a large piece of text.
- Extract a summary of a large piece of text.
- Word correction (new! often used in search to correct a user's misspelled input; see the sketch after this list).
- Support for user-defined dictionaries (TODO: not yet implemented well).
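The word-correction idea can be illustrated with a plain closest-match lookup against a dictionary, as in the sketch below. This is only a conceptual example (yaha's actual correction logic is not shown here), and the small word list is hypothetical.

```python
# Conceptual sketch of query correction: suggest the dictionary word closest to
# a (possibly mistyped) user query. Not yaha's implementation; word list is hypothetical.
import difflib

USER_DICT = [u'segmentation', u'dictionary', u'tiananmen', u'beijing']

def correct(query, cutoff=0.6):
    matches = difflib.get_close_matches(query.lower(), USER_DICT, n=1, cutoff=cutoff)
    return matches[0] if matches else query    # fall back to the original query

print(correct(u'dictonary'))   # -> dictionary
print(correct(u'beijng'))      # -> beijing
```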
- The core approach is to segment by finding the maximum-probability path through the sentence.
- While keeping efficiency, every stage of segmentation is open for customization, so users can plug in their own segmentation methods (regular-expression, name-prefix, and place-name-suffix plug-ins are provided by default).
- Users can choose dynamic programming or Dijkstra's algorithm to obtain the best path or the best several paths, and then pick the final path using other information such as part of speech (similar to the ICTCLAS approach).
- Using "maximum entropy" algorithm to realize the ability of finding new words in large text is very suitable for creating custom dictionaries, or data mining in SNS and other occasions.
- Compared with the existing stuttering segmentation, the trie tree structure which consumes a lot of memory is removed, as well as the HMM model which does not have a strong ability to discover new words (this model may be added to this module as an alternative plug-in in in the future).
- Stage 1 runs at clause splitting: regular expressions split out numbers and English words directly as independent tokens that no longer take part in the following steps.
- Stage 2 runs before the directed acyclic graph (DAG) is built: it pre-scans the clause, adds some possible words, and assigns them a probability.
- Stage 3 runs while the DAG is being built: word probabilities are taken from the dictionary, or possible words are produced by matching patterns and given a probability.
- Stage 4 runs after the maximum-probability path of the DAG is obtained (implemented as a shortest path in the program): it handles the remaining characters that could not form words, or takes the several shortest paths and chooses the final one according to the user's needs; part-of-speech analysis could be implemented in this step as well (a minimal path-finding sketch follows this list).
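The core of stages 3 and 4, building a DAG of candidate words and taking the maximum-probability path through it, can be sketched with straightforward dynamic programming. The snippet below is a minimal illustration, not yaha's internal code; the dictionary and probabilities are hypothetical.

```python
# Minimal sketch of the core idea: build a DAG of candidate words over the
# sentence, then pick the maximum-probability path by dynamic programming.
# Not yaha's internal code; the dictionary and probabilities are hypothetical.
import math

WORD_PROB = {u'北京': 0.02, u'天安门': 0.01, u'北': 0.001, u'京': 0.001,
             u'天': 0.001, u'安': 0.001, u'门': 0.001}
MIN_PROB = 1e-8                       # fallback for single characters not in the dict

def build_dag(sentence):
    """dag[i] = list of end indices j such that sentence[i:j] is a candidate word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in WORD_PROB]
        dag[i] = ends or [i + 1]      # unknown single char is its own candidate
    return dag

def best_path(sentence):
    """Dynamic programming from right to left over log probabilities."""
    n = len(sentence)
    dag = build_dag(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(WORD_PROB.get(sentence[i:j], MIN_PROB)) + route[j][0], j)
            for j in dag[i])
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print('/'.join(best_path(u'北京天安门')))   # -> 北京/天安门
```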
I have been using it continuously and have run into no problems. Finally, thanks to the author of jieba: the current dictionary is copied directly from the jieba project.