Notes on a crawler for collecting exp

Sharpening the axe does not delay the cutting of firewood.

Recently I needed to collect some exploits (exp), so I browsed [exploit-db](https://www.exploit-db.com/), [a domestic exp search aggregator](http://expku.com/), [seebug](https://www.seebug.org/), and several other sites that collect exp. Since I needed to fetch vulnerability information and the corresponding exp content in batches, I decided it was worth writing a crawler to collect the exploits automatically.

Select a target

The first three sites all have plenty of exp resources, but I don't plan to crawl them. Instead, here is another, more powerful site: 0day.today (accessing it requires getting past the GFW). I chose it because its exploits are updated faster and are richer, and its anti-crawling strategies are fairly typical.

Analyze URL structure

After selecting the target, the first step is to analyze the structure of the site, for example to determine whether its pages are dynamic or static. This site is dynamic, and its vulnerability-list URL structure is as follows:

  • cn.0day.today/webapps/1 (page 1 of the web vulnerability list)
  • cn.0day.today/webapps/2 (page 2 of the web vulnerability list)
  • cn.0day.today/remote/1 (page 1 of the remote exploit list)
  • cn.0day.today/local/1 (page 1 of the local exploit list)
  • ……

Each list page contains 30 vulnerability entries, and each entry links to a vulnerability URL. The structure is as follows:

  • cn.0day.today/exploit/30029
  • cn.0day.today/exploit/30030

Description: each of these URLs contains the exp for one vulnerability. Roughly, the web vulnerability list has 600 pages at 30 entries per page, about 18,000 vulnerabilities in total.
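As a minimal sketch, the list-page URLs can be generated from this pattern. The category names come from the list above; the 600-page figure for webapps is the rough count mentioned in this post, not an exact number:

# Minimal sketch: generate the vulnerability-list URLs observed above.
# The page counts are rough figures from this post, not exact numbers.
BASE = "https://cn.0day.today"
pages_per_category = {"webapps": 600, "remote": 1, "local": 1}

list_urls = []
for category, pages in pages_per_category.items():
    for page in range(1, pages + 1):
        list_urls.append("{0}/{1}/{2}".format(BASE, category, page))

print(len(list_urls))  # number of list pages to crawl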

Analyze web content

After analyzing the URL structure, the rough idea of the crawler becomes clear: traverse the list-page numbers to collect all the vulnerability URLs -> crawl each vulnerability URL to get its exp.
So how do we get the vulnerability URLs from a list page, and how do we extract the exp from a vulnerability page? This is where page-structure analysis comes in: you can either write regular expressions or extract the page elements to get the target content. A sketch of the overall flow follows.
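Here is the two-stage flow as a sketch; get_vul_urls() and get_exp() are hypothetical helpers standing in for the fetching and parsing snippets shown in the following sections:

# Sketch of the two-stage crawl. get_vul_urls() and get_exp() are
# hypothetical helpers implemented by the snippets in later sections.
def crawl(list_urls):
    exps = {}
    for list_url in list_urls:
        # stage 1: list page -> (vulnerability name, vulnerability URL) pairs
        for vul_name, vul_url in get_vul_urls(list_url):
            # stage 2: vulnerability page -> exp text
            exps[vul_name] = get_exp(vul_url)
    return exps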

Get the vulnerability URL

Page structure:

I didn't write regular expressions for this page. Instead, I used the BeautifulSoup module to extract the page elements. The code is as follows:

import requests
from bs4 import BeautifulSoup

# content: HTML of one vulnerability list page, e.g. cn.0day.today/webapps/1
# (a plain fetch works until the anti-crawler measures described below kick in)
content = requests.get("https://cn.0day.today/webapps/1").text

soup = BeautifulSoup(content, "html.parser")
n = soup.find_all("div", {"class": "ExploitTableContent"})
if n:
    for i in n:
        m = i.find_all("div", {"class": "td allow_tip "})
        for j in m:
            y = j.find_all("a")
            for x in y:
                vul_name = x.text              # vulnerability name
                vul_url = x.attrs.get("href")  # vulnerability URL

Get the vulnerability exp

Page structure:

I didn't write regular expressions for this page either. Again I used the BeautifulSoup module to extract the page elements. The code is as follows:

from bs4 import BeautifulSoup

# content: page source of one vulnerability page, e.g. cn.0day.today/exploit/30029
# (fetched with the headless browser described in the next section)
soup = BeautifulSoup(content, "html.parser")
m = soup.find_all("div", {"class": "container"})
n = m[0].find_all("div")
exp_info = ""
for i in n:
    exp_info += i.text + "\n"  # concatenate each div's text into the exp

Anti-crawler strategies

After visiting the site many times in a row, I found that it has some anti-crawler strategies, which I had to study and work around before I could get at the exp content.

CDN anti-DDoS strategy

First, I found that the site uses the Cloudflare CDN, and after a user has been accessing it for a while (the check is presumably based on IP + headers), an anti-DDoS interstitial page appears. If an ordinary crawler visits at this point, the page source it gets is that of the anti-DDoS page itself.
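Before parsing, a crawler can at least detect when it has been served the interstitial instead of the real page. A minimal sketch, assuming the anti-DDoS page contains Cloudflare's usual marker strings; the exact markers are an assumption and should be adjusted to whatever the real interstitial contains:

# Hypothetical check: the marker strings are assumptions about the
# Cloudflare interstitial, not taken from the actual page.
def looks_like_antiddos(page_source):
    markers = ("Checking your browser", "cf-browser-verification")
    return any(marker in page_source for marker in markers)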

Solution

When we open the vulnerability page in a browser, the anti-DDoS page waits a few seconds and then automatically redirects to the target page. Based on this behavior, I decided to use a headless browser for access and set a wait time. Here I used PhantomJS for the experiment; other headless browsers work the same way.

import time
from selenium import webdriver

d = webdriver.PhantomJS()
d.get(vul_api)        # vul_api: URL of the target vulnerability page
time.sleep(5)         # wait 5s for the anti-DDoS page to redirect
print(d.page_source)  # output the page source

After waiting 5 seconds, the printed page source is that of the target exp page.
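A fixed 5-second sleep works but is fragile: if the redirect takes longer, the crawler reads the wrong page. One alternative is to poll until the interstitial disappears, as in this sketch, which reuses the hypothetical looks_like_antiddos() check from above (max_wait is an arbitrary cap):

import time

# Sketch: poll the page source until the anti-DDoS interstitial is gone.
def wait_for_target(driver, max_wait=30):
    deadline = time.time() + max_wait
    while time.time() < deadline:
        if not looks_like_antiddos(driver.page_source):
            return driver.page_source
        time.sleep(1)
    raise RuntimeError("anti-DDoS page did not redirect in time")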

Click-to-confirm check

After bypassing the anti-DDoS strategy, I found that the site itself also has an anti-crawling measure: the user must click a confirmation button before continuing to the original target. If an ordinary crawler visits at this point, the page source it gets is that of the confirmation page.

Solution

This page requires the user to click the “OK” button to jump to the target page, so we can use the headless browser to manipulate the page elements, i.e., simulate clicking the OK button.

import time
from selenium import webdriver

d = webdriver.PhantomJS()
d.get(vul_api)                           # vul_api: URL of the target vulnerability page
time.sleep(5)                            # wait 5s (bypass the anti-DDoS policy)
d.find_element_by_name("agree").click()  # click the OK button (bypass the click-to-confirm check)
time.sleep(5)                            # wait 5s for the target page to load
content = d.page_source                  # page source of the target page
d.quit()
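Putting the pieces together, fetching one exp end to end might look like the following sketch. The confirmation button may not appear on every visit, so the click is wrapped in a try/except (NoSuchElementException is what old Selenium raises when an element is missing):

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# End-to-end sketch: fetch the page source of one vulnerability page,
# bypassing both anti-crawler measures described above.
def get_exp(vul_url):
    d = webdriver.PhantomJS()
    try:
        d.get(vul_url)
        time.sleep(5)  # anti-DDoS wait
        try:
            d.find_element_by_name("agree").click()  # confirm button, if shown
            time.sleep(5)
        except NoSuchElementException:
            pass
        return d.page_source
    finally:
        d.quit()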

Summary

To crawl a website, you must analyze its URL structure, its page content, its anti-crawling strategies, and so on. The tricky part of this site is bypassing the anti-crawling strategies, which I did with a headless browser that simulates human access. In short, writing crawlers takes patience and care, and analyzing the whole access flow step by step sometimes matters more than the programming itself. Perhaps this is what the saying means: “sharpening the axe does not delay the cutting of firewood”!

Title: Notes on a crawler for collecting exp

Author: nmask

Published: March 27, 2018 - 10:03

Last updated: August 16, 2019 - 15:08

Original link: https://thief.one/2018/03/27/1/en/

License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Please retain the original link and author when reposting.
