Crawling Search Engines: Seeking You a Thousand Times, Baidu

There are plenty of resources online about anti-crawler techniques, and the countermeasures boil down to a handful of methods (proxies, captcha recognition, distributed architectures, browser emulation, switching IPs over ADSL, and so on). Those are not the focus of this article. This article covers only the anti-crawler measures I ran into while crawling the Baidu search engine, along with some solutions.

Why crawl Baidu?

  • Baidu does not provide an API
  • Baidu has a wealth of resources available to query
  • Baidu's anti-crawler measures are not that extreme

Baidu's anti-crawler measures

In general, a single-threaded crawler with a request interval of >2s should not get blocked in the short term, though long-term crawling at that pace still fails eventually. A multi-threaded crawler with no interval will certainly be blocked, for roughly 30 minutes.
I tried adding headers and even simulating a browser with PhantomJS, and all of it ended in failure. Baidu is a search engine company, and crawling is one of its core technologies, so playing anti-crawler games with it on that level is like throwing an egg at a rock (browser emulation, modified headers, and similar tricks are ineffective).
However, we can change our thinking. Baidu does not forbid crawler access outright; it only limits the crawl frequency, and places no obvious restrictions on things like request headers. In other words, Baidu's anti-crawler mechanism is really about controlling the access frequency of a single IP, so we can get around it with a distributed architecture or by switching IPs.
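
As a minimal sketch of what staying below the frequency limit looks like in practice (Python 3 with the requests library; the keyword, interval, and parameters are illustrative, not values from the article):

```python
import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # headers are not restricted, so anything sane works

def search_pages(keyword, pages=5, interval=2.5):
    """Fetch several result pages for one keyword, keeping above a ~2s interval."""
    for page in range(pages):
        # Baidu's pn parameter is the zero-based result offset: 0, 10, 20, ...
        resp = requests.get('http://www.baidu.com/s',
                            params={'wd': keyword, 'pn': page * 10},
                            headers=HEADERS, timeout=5)
        yield resp.content
        time.sleep(interval)  # single-IP request frequency is what Baidu watches
```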

Symptoms of being blocked

Before discussing how to get around a block, let us first look at what happens when Baidu blocks you. Generally, when Baidu detects unusually heavy traffic from a single IP, it first serves a warning page (visible in the returned source); if access continues, it blocks access outright.
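
A small sketch of telling the two stages apart (my own helper, assuming the warning page always references verify.baidu.com, as in the source dump below):

```python
import requests

def classify(url):
    """Roughly classify Baidu's reaction: normal page, captcha warning, or hard block."""
    try:
        resp = requests.get(url, timeout=5)
    except requests.exceptions.RequestException:
        return 'blocked'  # hard block: the connection itself fails
    if 'verify.baidu.com' in resp.text:
        return 'captcha'  # the warning page shown in the source below
    return 'ok'
```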

The returned source reports a network exception

Web page source code:

```html
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>Baidu--Your visit encountered an error</title>
<style>
body{text-align:center;margin-top:3px}
#wrap{width:650px;text-align:left;margin:auto}
#logo{float:left;margin:0 3px 0 0}
#logo img{border:0}
#title{float:left;width:510px}
#intitle{margin:20px 0 0 0;background-color:#e5ecf9;width:100%;font-weight:bold;
font-size:14px;padding:3px 0 4px 10px}
#content{clear:left;padding-top:60px;line-height:200%}
#vf{margin-top:10px}
#vf img{float:left;border:1px solid #000}
#kw{font:16px Verdana;height:1.78em;padding-top:2px}
#vf form{float:left;margin:12px 0 0 5px;padding:0}
#ft{text-align:center}
#ft,#ft a{color:#666;font-size:14px}
</style>
</head>
<body>
<div id="wrap">
<div id="logo"><a href="http://www.baidu.com"><img alt="to Baidu homepage" title="to Baidu homepage"></a></div>
<div id="title"><div id="intitle">Your visit encountered an error</div></div>
<div id="content">We are sorry: access to Baidu from your computer or local network appears abnormal, and we cannot respond to your request for the moment.<br>Please enter the verification code below to resume use.</div>
<div id="vf">
<img src="http://verify.baidu.com/cgi-bin/genimg?6D8B74BFF43F7AE5457E1E8DA8C63355C8F00514C99AC6AD0182FCD695A4FED003A2592509E05792FF7A137E4184B4D9D9F5366F" width="120" height="40">
<form action="http://verify.baidu.com/verify">
<input type="hidden" name="url" value="http://www.baidu.com/s?wd=.gov.cn&pn=0&vif=1">
<input type="hidden" name="vcode" value="6D8B74BFF43F7AE5457E1E8DA8C63355C8F00514C99AC6AD0182FCD695A4FED003A2592509E05792FF7A137E4184B4D9D9F5366F">
<input type="hidden" name="id" value="1488861310">
<input type="hidden" name="di" value="ad617386491a359a">
<input type="text" size="6" maxlength="10" name="verifycode" id="kw">
</form>
</div>
<div style="clear:left;height:90px"></div>
<div id="ft"><a>Disclaimer</a></div>
</div>
<script>
(function(){
var rfr = window.document.location.href,
p = encodeURIComponent(rfr),
img = new Image(),
imgzd = new Image(),
re = /\/vcode\?http:\/\/(\S+)\.baidu/ig,r="";
img.src = "http://nsclick.baidu.com/v.gif?pid=201&pj=vcode&path="+p+"&t="+new Date().getTime();
r = re.exec(rfr);
if(r&&r[1]){imgzd.src = "http://"+r[1]+".baidu.com/v.gif?fr=vcode&url="+p+"&t="+new Date().getTime();}
})();
</script>
</body>
</html>
```

Blocking the IP address outright

In this case, any attempt to access Baidu simply returns an error.

General Solutions

Given how Baidu's anti-crawler works, we can gather resources through a distributed deployment of crawler servers; personally, I think ADSL servers work even better. But distributed deployments, especially ADSL server deployments, get very expensive and need maintenance. So can a single server get around the blocking problem?
The answer is yes: a single machine + multiple threads + IP proxies. This approach is more affordable, but it stands or falls on the stability of the proxies. In my own testing, most domestic proxies (paid, free, dynamic, and so on) are not very stable, so this is a compromise. Is there a better way?
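
A sketch of the single machine + multi-threading + IP proxy idea (Python 3; the proxy addresses and worker count are placeholders, not values from the article):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import requests

PROXY_POOL = cycle(['http://1.2.3.4:8080', 'http://5.6.7.8:8080'])  # hypothetical proxies

def fetch(keyword):
    proxy = next(PROXY_POOL)  # rotate to a different exit IP per request
    return requests.get('http://www.baidu.com/s',
                        params={'wd': keyword},
                        proxies={'http': proxy},
                        timeout=5).content

with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, ['keyword1', 'keyword2', 'keyword3']))
```

As noted above, this only holds up as long as the proxies themselves stay alive.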

Alternative Solutions

As a search engine company, Baidu's own crawlers must be deployed in a distributed fashion; and since Baidu holds a large share of the domestic market, its search service must be distributed as well, meaning Baidu has many servers deployed across the country.
When we open a browser and visit Baidu, the server answering us is usually the one closest to us, and we can assume it is also the server that blocks us. A bold thought: if we could freely switch between Baidu servers in different regions, could we bypass a block imposed by any single server?

Of course, this solution has two prerequisites:

  • We have a large number of Baidu server IP addresses
  • Baidu allows access directly by IP address (without having to tamper with the Host)

Fortunately, neither point is hard. Lists of Baidu server IPs can be found online, and you can also collect them by pinging Baidu from different regions. As for accessing Baidu directly by IP address, that works by default (I don't know why Baidu allows it).
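
A sketch of querying one regional server directly by IP (the IP below is a placeholder, not a verified Baidu server; collect real ones online or by pinging Baidu from different regions):

```python
import requests

def search_via_server(server_ip, keyword):
    """Send the query to one specific Baidu front-end server instead of www.baidu.com."""
    url = 'http://%s/s' % server_ip  # direct-by-IP access works by default
    return requests.get(url, params={'wd': keyword}, timeout=5).content

# html = search_via_server('180.76.3.151', 'test')  # placeholder IP for illustration
```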

Baidu's killer move

Through the approaches above, we should be able to bypass Baidu's anti-crawler mechanism. But Baidu is no pushover: it has its own unique countermeasure, which is perhaps more accurately called a "search limit" or "resource protection" measure.

Maximum number of search results

When you search a keyword on Baidu, the reported number of results has an upper limit.

The displayed count tops out at 100 million; the real number of matches is far larger, so the reported figure is not accurate.

Search page number cap

Look at the number of search results pages:

It displays at most 76 pages, i.e. at most about 760 results (10 per page), which is just the tip of the iceberg of all the matches.

Cookies affect search results

While crawling, I noticed by accident that whether or not a cookie is included in the headers changes the final search results returned.
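
A quick way to observe this for yourself (a hypothetical demonstration, not the author's code; the cookie value must come from a real browser session):

```python
import requests

params = {'wd': 'test'}
bare = requests.get('http://www.baidu.com/s', params=params, timeout=5)
with_cookie = requests.get('http://www.baidu.com/s', params=params, timeout=5,
                           headers={'Cookie': 'BAIDUID=...'})  # paste a real BAIDUID here
# The two result pages generally differ:
print(len(bare.text), len(with_cookie.text))
```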

Strictly speaking, these points are not anti-crawler technology but Baidu's way of protecting its own resources; the intent is self-evident.

By fetching the source of a Baidu results page and applying a regular expression, we can extract links for the search results. These links are not the original URLs of the target sites; they come in the following two forms:

```
http://www.baidu.com/link?url=1qIAIIh_2N7LUQpI0AARembLK2en4QpGjaRqKZ3BxYtzoZYevC5jA2jq6XMwgEKF&wd=&eqid=9581fbec0007eae00000000458200ad4
http://www.baidu.com/link?url=1qIAIIh_2N7LUQpI0AARembLK2en4QpGjaRqKZ3BxYtzoZYevC5jA2jq6XMwgEKF
```

For now I will call these "Baidu links"; they basically take the two forms above. The first, usually carrying the eqid parameter, is what you get by right-clicking a result and copying the link address; eqid appears to represent the referer. The second, without the wd and eqid parameters, is what appears in the page source. The eqid value changes on every page refresh, so it may be a parameter Baidu added to hinder black-hat SEO.
Comparing the two: when we access these two links, the responses we get back differ.
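
For context, a sketch of how such Baidu links can be pulled from a results page in the first place (the href pattern is my assumption about the markup, not taken from the article):

```python
import re
import requests

html = requests.get('http://www.baidu.com/s', params={'wd': 'test'}, timeout=5).text
baidu_links = re.findall(r'href="(http://www\.baidu\.com/link\?url=[^"]+)"', html)
```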

With the eqid parameter

The first form, with the eqid parameter, returns 200; the real link is in the response body and can be extracted with this regex:

```python
res_baidu=r"window\.location\.replace\(\"([^\"]*)\"\)"
```
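
For instance, applied to a made-up response body:

```python
import re

res_baidu = r"window\.location\.replace\(\"([^\"]*)\"\)"
body = 'window.location.replace("http://example.com/page")'  # illustrative body only
print(re.findall(res_baidu, body))  # ['http://example.com/page']
```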

Without the eqid parameter

The second form, without those parameters, returns a 302 redirect; the real link is in the location field of the response headers, which can be read with the requests module (a HEAD request).

```python
#! -*- coding:utf-8 -*-
'''
@ Analysis baidu_link
'''
__author__ = "nMask"
__Blog__ = "http://thief.one"
__Date__ = "20170301"
import requests
import re

res_baidu = r"window\.location\.replace\(\"([^\"]*)\"\)"


class anbaidulink:

    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
               'Referer': 'http://www.baidu.com/link?url='}

    def __init__(self):
        pass

    def run(self, url, one_proxy=""):
        '''
        Entry function: accepts a baidu_link and a proxy address (default "");
        the proxy should be in the form http://xx.xx.xx.xx:xx
        '''
        if "&eqid=" in url:
            url = self.have_eqid(url, one_proxy)
        else:
            url = self.noeqid(url, one_proxy)
        return url

    def noeqid(self, url, one_proxy):
        '''
        baidu_link without the eqid parameter: HEAD request, real link in the location header
        '''
        try:
            h = requests.head(url, proxies={'http': one_proxy}, headers=anbaidulink.headers, timeout=5).headers
        except Exception, e:
            print e
        else:
            url = h["location"]  # 302 redirect target is the real link
        return url

    def have_eqid(self, url, one_proxy):
        '''
        baidu_link with the eqid parameter: GET request, real link in the body
        '''
        try:
            body = requests.get(url, proxies={'http': one_proxy}, headers=anbaidulink.headers, timeout=5).content
        except Exception, e:
            print e
        else:
            p = re.compile(res_baidu)
            url = p.findall(body)  # extract window.location.replace("...") target
            if len(url) > 0:
                url = url[0]
        return url


if __name__ == "__main__":
    cur = anbaidulink()
    url = cur.run(url='https://www.baidu.com/link?url=1qIAIIh_2N7LUQpI0AARembLK2en4QpGjaRqKZ3BxYtzoZYevC5jA2jq6XMwgEKF&wd=&eqid=9581fbec0007eae00000000458200ad4', one_proxy="")
    # url = cur.run(url='http://www.baidu.com/link?url=1qIAIIh_2N7LUQpI0AARembLK2en4QpGjaRqKZ3BxYtzoZYevC5jA2jq6XMwgEKF', one_proxy="")
    print url
```
  • Disclaimer: this article lists only the problems I ran into while crawling Baidu resources; it does not cover all of Baidu's anti-crawler technology. The solutions given here are time-sensitive and need to be verified by yourself. If you have a better solution, feel free to leave a message.

Link to this article: [http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)
Please credit reprints to: [nMask's Blog](http://thief.one)

Portal

[Crawling the Search Engine: Sogou](http://thief.one/2017/03/19/Crawling search engine Sogou/)
[Crawling Search Engines: Seeking You a Thousand Times, Baidu](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)

Title: Crawling Search Engines: Seeking You a Thousand Times, Baidu

Author: nmask

Published: 2017-03-17 16:03

Last updated: 2019-07-11 16:07

Original link: https://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please retain the original link and author when reprinting.
