code4craft/webmagic: A scalable web crawler framework for Java.

Posted by harmelink at 2020-04-09

A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler.

Features:
Simple core with high flexibility.
Simple API for HTML extraction.
Annotations on POJOs to customize a crawler, with no configuration.
Multi-thread and distribution support.
Easy to integrate.

Install:
Add the WebMagic dependencies to your pom.xml. WebMagic uses slf4j with the slf4j-log4j12 implementation. If you use your own slf4j implementation, please exclude slf4j-log4j12.

Get Started:
First crawler: write a class that implements PageProcessor. For example, I wrote a crawler for GitHub repository information.
page.addTargetRequests(links): adds URLs for crawling.
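The loop behind a PageProcessor can be sketched without the library. What follows is a minimal JDK-only model, not WebMagic's own classes: SketchPage, SketchProcessor, crawl and the in-memory site map are hypothetical names invented for illustration, and the "download" is a map lookup rather than an HTTP request. It shows how a processor might extract fields and call addTargetRequests, and how the surrounding loop can skip URLs it has already seen.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, JDK-only model of a crawl loop; these are not WebMagic's classes.
class CrawlLoopSketch {

    /** A downloaded page plus the fields and follow-up URLs the processor produces. */
    static final class SketchPage {
        final String url;
        final String html;
        final List<String> targetRequests = new ArrayList<>();
        final Map<String, String> fields = new LinkedHashMap<>();

        SketchPage(String url, String html) {
            this.url = url;
            this.html = html;
        }

        void addTargetRequests(List<String> links) {
            targetRequests.addAll(links);
        }

        void putField(String key, String value) {
            fields.put(key, value);
        }
    }

    /** Plays the role the README gives to PageProcessor: extract, then enqueue. */
    interface SketchProcessor {
        void process(SketchPage page);
    }

    static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    /** Extracts the page title and requests every linked URL for crawling. */
    static final SketchProcessor TITLE_PROCESSOR = page -> {
        Matcher title = TITLE.matcher(page.html);
        if (title.find()) {
            page.putField("title", title.group(1));
        }
        List<String> links = new ArrayList<>();
        Matcher link = LINK.matcher(page.html);
        while (link.find()) {
            links.add(link.group(1));
        }
        page.addTargetRequests(links);
    };

    /** Runs download, process, and schedule steps until the queue is empty. */
    static Map<String, Map<String, String>> crawl(String startUrl,
                                                 Map<String, String> site,
                                                 SketchProcessor processor) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        Map<String, Map<String, String>> results = new LinkedHashMap<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = site.get(url);       // stand-in for an HTTP download
            if (html == null) {
                continue;                      // unknown URL: nothing to process
            }
            SketchPage page = new SketchPage(url, html);
            processor.process(page);
            results.put(url, page.fields);     // stand-in for a persistence step
            for (String next : page.targetRequests) {
                if (seen.add(next)) {          // URL management: skip duplicates
                    queue.add(next);
                }
            }
        }
        return results;
    }

    /** Two tiny pages that link to each other, used only for the demo. */
    static Map<String, String> demoSite() {
        Map<String, String> site = new LinkedHashMap<>();
        site.put("/a", "<title>A</title><a href=\"/b\">b</a><a href=\"/a\">self</a>");
        site.put("/b", "<title>B</title><a href=\"/a\">back</a>");
        return site;
    }

    public static void main(String[] args) {
        System.out.println(crawl("/a", demoSite(), TITLE_PROCESSOR));
    }
}
```

Running main visits each of the two demo pages once, even though they link to each other, because the seen set filters repeated URLs before they reach the queue.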


You can also use the annotation approach.

Docs and samples:
Documents: http://webmagic.io/docs/
The architecture of WebMagic (modeled on Scrapy).
There are more examples in the webmagic-samples package.
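To illustrate the idea of customizing a crawler with annotations on a POJO, here is a rough JDK-only sketch. The ExtractByRegex annotation, the Repo class and the bind helper are hypothetical and are not WebMagic's annotations. The sketch only shows how a field-level annotation can carry an extraction rule that a generic binder applies by reflection, so the POJO itself needs no separate configuration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, JDK-only illustration of annotation-driven extraction.
class AnnotatedPojoSketch {

    /** Hypothetical annotation: a regex whose first group fills the field. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ExtractByRegex {
        String value();
    }

    /** The POJO declares what to extract on its own fields. */
    static class Repo {
        @ExtractByRegex("<h1>(.*?)</h1>")
        String name;

        @ExtractByRegex("<span class=\"author\">(.*?)</span>")
        String author;

        String notExtracted; // no annotation, so the binder leaves it null
    }

    /** Creates an instance of type and fills each annotated String field. */
    static <T> T bind(Class<T> type, String html) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                ExtractByRegex rule = field.getAnnotation(ExtractByRegex.class);
                if (rule == null || field.getType() != String.class) {
                    continue; // only annotated String fields are handled here
                }
                Matcher m = Pattern.compile(rule.value(), Pattern.DOTALL).matcher(html);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(target, m.group(1));
                }
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot bind " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        String html = "<h1>webmagic</h1><span class=\"author\">code4craft</span>";
        Repo repo = bind(Repo.class, html);
        System.out.println(repo.name + " by " + repo.author);
    }
}
```

A real framework would likely support selectors richer than regular expressions and more field types; the regex rule is used here only to keep the sketch free of third-party dependencies.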


License:
Licensed under the Apache 2.0 license.

Thanks:
To write WebMagic, I referred to the projects below:
Scrapy: a crawler framework in Python. http://scrapy.org/
Spiderman: another crawler framework in Java. http://git.oschina.net/l-weiwei/spiderman

Mail-list:
https://groups.google.com/forum/#!forum/webmagic-java
http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988

QQ Group: 373225642 542327088

Related Project:
Gather Platform: a web console based on WebMagic for Spider configuration and management.