
Recording a batch crawl of exps

Posted by chiappelli at 2020-04-12


Recently I needed to collect some exps, so I visited several exp-collection websites such as Exploit-DB, various domestic exp search aggregators, Seebug, and so on. Since I had to obtain vulnerability information and the corresponding exp content in batch, I decided to write a crawler to collect vulnerability exps automatically.

Choose a target

The three websites above all have rich exp resources, but I'm not going to crawl them. There is a better site: https://cn.0day.today/ (requires bypassing the Great Firewall to access). I chose it because its exps are updated faster and are more plentiful, and its anti-crawler measures are relatively ordinary.

Analyze URL structure

After selecting the target, first analyze the structure of its web pages, for example whether they are dynamic or static. This website is dynamic, and its vulnerability-list URLs are structured as follows:

cn.0day.today/webapps/1 (page 1 of the web vulnerability list)

cn.0day.today/webapps/2 (page 2 of the web vulnerability list)

cn.0day.today/remote/1 (page 1 of the remote exploit list)

cn.0day.today/local/1 (page 1 of the local exploit list)

...

Each list page contains 30 vulnerability entries, and each entry links to a vulnerability URL with the following structure:

cn.0day.today/exploit/30029

cn.0day.today/exploit/30030

Note: the content at each such URL is the exp for one vulnerability. Roughly speaking, there are about 600 pages of web vulnerabilities at 30 entries per page, for a total of about 18,000 vulnerability exps.
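Given this pattern, enumerating the list-page URLs is straightforward. A minimal sketch (the category names come from the URLs above; the page count is a placeholder to adjust to the site's actual totals):

```python
def list_page_urls(category, pages):
    """Yield vulnerability-list URLs such as cn.0day.today/webapps/1 ... /webapps/N."""
    base = "https://cn.0day.today"
    for page in range(1, pages + 1):
        yield f"{base}/{category}/{page}"
```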

Analyze web content

After analyzing the URL structure, the crawler's plan emerges: traverse the vulnerability-list pages to collect all vulnerability URLs -> crawl each vulnerability URL to obtain its exp. How, then, do we extract the vulnerability URLs from a list page, and the exp from a vulnerability page? This requires analyzing the page structure: we can either write regular expressions or extract the target content from page elements.
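The two-stage plan can be sketched as a small driver loop. The fetch/parse functions here are illustrative placeholders, not the actual code used later:

```python
def crawl_exps(categories, pages, fetch, parse_list, parse_exp):
    """Two-stage crawl: list pages -> vulnerability URLs -> exp content.

    fetch(url) -> html; parse_list(html) -> vulnerability URLs;
    parse_exp(html) -> exp text.
    """
    exps = {}
    for category in categories:
        for page in range(1, pages + 1):
            list_html = fetch(f"https://cn.0day.today/{category}/{page}")
            for vul_url in parse_list(list_html):
                exps[vul_url] = parse_exp(fetch(vul_url))
    return exps
```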

Get vulnerability URL

Page structure:

For this page I did not use regular expressions; instead I used the BeautifulSoup module to extract the content from the page elements. The code is as follows:

from bs4 import BeautifulSoup

# content: HTML of a vulnerability-list page, fetched earlier
soup = BeautifulSoup(content, "html.parser")
n = soup.find_all("div", {"class": "ExploitTableContent"})
if n:
    for i in n:
        m = i.find_all("div", {"class": "td allow_tip "})
        for j in m:
            y = j.find_all("a")
            for x in y:
                vul_name = x.text
                vul_url = x.attrs.get("href")
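To sanity-check this extraction logic offline, you can run it against a minimal HTML fragment that mimics the list page's structure. The markup below is hypothetical, based only on the class names used above (trailing whitespace in the class string omitted for simplicity):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the vulnerability-list markup
html = """
<div class="ExploitTableContent">
  <div class="td allow_tip">
    <a href="/exploit/30029">Some CMS SQL Injection</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = []
for block in soup.find_all("div", {"class": "ExploitTableContent"}):
    for cell in block.find_all("div", {"class": "td allow_tip"}):
        for a in cell.find_all("a"):
            results.append((a.text, a.attrs.get("href")))
```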

Get vulnerability exp

Page structure:

For this page, again, I did not use regular expressions but used the BeautifulSoup module to extract the content from the page elements. The code is as follows:

from bs4 import BeautifulSoup

# content: HTML of a vulnerability exp page, fetched earlier
soup = BeautifulSoup(content, "html.parser")
m = soup.find_all("div", {"class": "container"})
n = m[0].find_all("div")
exp_info = ""
for i in n:
    exp_info += i.text + "\n"
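As before, the logic can be checked against a hypothetical fragment mimicking the exp page's container div (the markup is my own illustration, not the site's real HTML):

```python
from bs4 import BeautifulSoup

# Hypothetical exp-page fragment
html = '<div class="container"><div>Exploit Title: Demo</div><div>print("poc")</div></div>'
soup = BeautifulSoup(html, "html.parser")
container = soup.find_all("div", {"class": "container"})[0]
exp_info = ""
for block in container.find_all("div"):
    exp_info += block.text + "\n"
```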

Anti-crawler strategies

After visiting the website many times in a row, I found that it employs several anti-crawler measures, which I had to study and work around in order to collect more exp content.

CDN anti-DDoS strategy

First, I found that the website uses Cloudflare's CDN. After a user keeps visiting for a while (the check appears to be based on IP + headers), an anti-DDoS interstitial page appears. If an ordinary crawler visits at this point, the page source it obtains is that of the anti-DDoS page rather than the target page.

Solution

When we open the vulnerability page in a browser, we wait a few seconds on the anti-DDoS page and are then automatically redirected to the target page. Based on this behavior, I decided to use a headless browser and simply set a waiting time. I chose PhantomJS for this experiment; other headless browsers work the same way.

import time
from selenium import webdriver

# vul_api: URL of the target vulnerability page
d = webdriver.PhantomJS()
d.get(vul_api)
time.sleep(5)
print(d.page_source)

After waiting 5 seconds on the page, the printed page source is that of the target vulnerability exp page.
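A fixed sleep works but is brittle, so it also helps to verify that what came back is no longer the interstitial. A small helper of my own (the marker strings are assumptions based on typical Cloudflare challenge pages, not taken from this site):

```python
# Assumed markers of a typical Cloudflare challenge page, not site-specific
CHALLENGE_MARKERS = ("Checking your browser", "cf-browser-verification")

def is_cloudflare_challenge(html):
    """Return True if the HTML still looks like the anti-DDoS interstitial."""
    return any(marker in html for marker in CHALLENGE_MARKERS)
```

If the check still returns True after the wait, sleep again and re-read the page source before giving up.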

User click to confirm

After bypassing the anti-DDoS strategy, I found that the website itself has another anti-crawler measure: it requires the user to click a confirmation button before continuing to the original target. If an ordinary crawler visits at this point, the page source it obtains is that of the confirmation page rather than the target page.

Solution

Since this page requires the user to click the "OK" button to jump to the target page, we can again use a headless browser to access the page and operate on its elements, that is, simulate clicking the "OK" button.

import time
from selenium import webdriver

# vul_api: URL of the target vulnerability page
d = webdriver.PhantomJS()
d.get(vul_api)
time.sleep(5)
d.find_element_by_name("agree").click()
time.sleep(5)
content = d.page_source
d.quit()
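Putting both workarounds together, a single hypothetical helper can wrap the fetch: wait out the anti-DDoS delay, click the confirmation button if it is present, and return the page source. The function and its fallback behavior are my sketch, not the original script:

```python
import time

def fetch_exp_page(driver, vul_url, wait=5):
    """Fetch a vulnerability page through a headless browser, handling both hurdles."""
    driver.get(vul_url)
    time.sleep(wait)  # sit out the anti-DDoS countdown
    try:
        driver.find_element_by_name("agree").click()  # user-confirm page, if shown
        time.sleep(wait)
    except Exception:
        pass  # no confirmation page this time
    return driver.page_source
```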

Summary

To crawl a website's content, you must analyze its URL structure, page content, anti-crawler strategy, and so on. For this website, the difficulty lies in bypassing the anti-crawler measures; here a headless browser is used to simulate human access. In short, writing a crawler takes patience and care, and analyzing the whole access flow step by step is sometimes more important than the programming itself. Perhaps this is what the saying means: "sharpening the axe doesn't delay the cutting of firewood."


This article is from my blog: https://thief.one