Readme in Chinese

A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler.

## Features:

* Simple core with high flexibility.
* Simple API for HTML extraction.
* Annotation with POJO to customize a crawler, no configuration needed.
* Multi-thread and distributed crawling support.
* Easy to integrate.

## Install:

Add the dependencies to your pom.xml.

WebMagic uses slf4j with the slf4j-log4j12 implementation. If you have customized your slf4j implementation, please exclude slf4j-log4j12.

## Get Started:

### First crawler:

Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information. Inside `process`, `page.addTargetRequests(links)` adds URLs for crawling.
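A minimal sketch of such a PageProcessor, modeled on the GitHub example (the XPath expression, regexes, and start URL are illustrative, not the exact ones from the samples):

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Site holds crawl-level settings such as retry count and politeness delay.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Discover further repository pages and queue them for crawling.
        page.addTargetRequests(page.getHtml().links()
                .regex("(https://github\\.com/\\w+/\\w+)").all());
        // Extract fields from the current page into the result item.
        page.putField("author",
                page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name",
                page.getHtml().xpath("//h1/strong/a/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                .thread(5)
                .run();
    }
}
```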
You can also use the annotation way, customizing a crawler by annotating a POJO.

## Docs and samples:

Documents: http://webmagic.io/docs/

The architecture of WebMagic (modeled on Scrapy) is described in the documents. There are more examples in the `webmagic-samples` package.
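The annotation way mentioned above describes extraction declaratively on a POJO instead of implementing PageProcessor; a sketch for the same GitHub example (field names and extraction expressions are illustrative):

```java
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.pipeline.ConsolePageModelPipeline;

// Pages whose URL matches TargetUrl are mapped onto this POJO.
@TargetUrl("https://github.com/\\w+/\\w+")
public class GithubRepo {

    // Extracted from the page body with XPath (expression is illustrative).
    @ExtractBy("//h1/strong/a/text()")
    private String name;

    // Extracted from the page URL with a regex capture group.
    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft")
                .thread(5)
                .run();
    }
}
```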
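For reference, the Install step above normally pulls in the `webmagic-core` and `webmagic-extension` artifacts from Maven Central; a sketch of the pom.xml entries (the version shown is one known release — check for the latest, and add an exclusion for slf4j-log4j12 if you use your own slf4j binding):

```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
```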