There are plenty of resources online about anti-crawler technology, and the counter-measures boil down to the usual suspects (proxies, CAPTCHA recognition, distributed architectures, browser emulation, ADSL IP switching, and so on). Those are not the focus of this article. This article only covers the anti-crawler measures I ran into while crawling the Baidu search engine, along with some solutions.
- Baidu does not provide an API
- Baidu has a wealth of resources available for querying
- Baidu's anti-crawler measures are not that aggressive
In general, a single-threaded crawler with a request interval of more than 2 seconds should not be blocked in the short term. Of course, long-term crawling will still get you blocked eventually, and multi-threaded crawling with no interval at all will definitely get you blocked, typically for about 30 minutes.
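As a minimal sketch of that pacing rule: the `wd` query parameter is the keyword parameter mentioned later in this article, while `pn` is assumed here to be Baidu's result-offset parameter; the base delay and jitter values are illustrative, not tuned.

```python
import random
import time
from urllib.parse import urlencode

import requests  # third-party: pip install requests

BASE = "https://www.baidu.com/s"

def build_search_url(keyword: str, page: int = 0) -> str:
    """Build a result-page URL; `pn` is assumed to count 10 results per page."""
    return BASE + "?" + urlencode({"wd": keyword, "pn": page * 10})

def polite_fetch(keywords, interval=2.5):
    """Fetch each keyword's first result page, sleeping >2s between requests."""
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"  # minimal browser-like header
    pages = []
    for kw in keywords:
        pages.append(session.get(build_search_url(kw), timeout=10).text)
        time.sleep(interval + random.random())  # jitter on top of the base delay
    return pages
```

The jitter makes the traffic pattern slightly less mechanical; it does not change the fundamental rate-limit trade-off described above.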
I tried adding headers and even simulating a browser with PhantomJS, but it all ended in failure. Baidu is a search-engine company, and crawling is one of its core technologies, so trying to out-play it at anti-crawler tricks is like striking a stone with an egg: methods such as simulating a browser or modifying headers are bound to be ineffective.
However, we can change our thinking. Baidu does not forbid crawler access outright; it only limits the crawl frequency, and it places no obvious restrictions on things like request headers. In other words, Baidu's anti-crawler mechanism is really just rate-limiting per IP, so we can get around it with a distributed architecture or by switching IPs.
Before discussing how to get around the block, let us first look at what actually happens when Baidu blocks you. Generally speaking, when Baidu detects unusually heavy traffic from one IP, it first serves a warning visible in the page source; if the access does not stop, it blocks the IP outright.
Web page source code:
Once that happens, requests to Baidu simply return an error.
Given how Baidu's anti-crawler works, we can collect resources by deploying crawler servers in a distributed fashion; personally I think ADSL servers work even better. But distributed deployment, especially with ADSL servers, gets expensive and needs maintenance. So can a single server get around the blocking?
The answer is yes: a single machine + multiple threads + IP proxies. This approach is more affordable, but it stands or falls on the stability of the proxies. In my own testing, most domestic proxies (paid, free, dynamic, and so on) are not very stable, so this is only a compromise. Is there a better way?
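The proxy-rotation compromise can be sketched roughly as follows. The proxy addresses in the pool are placeholders, and the retry budget is an arbitrary choice; the round-robin-with-retry shape is the point, not the specific values.

```python
import itertools

import requests  # third-party: pip install requests

# Placeholder pool -- replace with your own (paid or free) proxy addresses.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
    "http://127.0.0.1:8003",
]

def rotating_proxies(pool):
    """Yield proxy dicts in round-robin order, in the format requests expects."""
    for addr in itertools.cycle(pool):
        yield {"http": addr, "https": addr}

def fetch_with_proxies(url, pool, max_tries=5):
    """Try successive proxies until one returns HTTP 200 or tries run out."""
    proxies = rotating_proxies(pool)
    for _ in range(max_tries):
        try:
            resp = requests.get(url, proxies=next(proxies), timeout=5)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            continue  # unstable proxy: skip to the next one in the cycle
    raise RuntimeError("all proxies failed")
```

The try/except-and-continue loop is exactly the instability tax described above: with flaky proxies you spend most of your time cycling past dead ones.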
As a search-engine company, Baidu's own crawlers must be deployed in a distributed fashion; and because Baidu has such a large share of the domestic market, its search service must be distributed as well, which means Baidu has servers deployed all over the country.
When we open a browser and visit Baidu, the server answering the search request is usually the one closest to us, so that is also the server doing the blocking. Here is a bold idea: if we can freely switch between Baidu servers in different regions, can we sidestep being blocked by any single server?
Of course, this solution has two prerequisites:
- We need a large pool of Baidu server IP addresses
- Baidu must answer requests sent directly to an IP address (without needing to change the Host)
Fortunately, neither point is hard. Lists of Baidu server IPs can be found online, and you can also collect IPs by pinging Baidu from different regions. As for accessing Baidu directly by IP address, that works by default (I do not know why Baidu allows it).
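A minimal sketch of the IP-switching idea follows. The addresses in `BAIDU_IPS` are placeholders, not verified Baidu servers; build your own pool, for example by running `ping www.baidu.com` from machines in different regions, as described above.

```python
import random

import requests  # third-party: pip install requests

# Placeholder IPs -- replace with addresses you collected yourself.
BAIDU_IPS = ["180.97.33.107", "180.97.33.108", "115.239.210.27"]

def build_ip_url(ip: str) -> str:
    """Search endpoint addressed by server IP instead of the domain name."""
    return "http://%s/s" % ip

def search_via_ip(keyword: str, ip: str = None):
    """Send the query straight to one regional Baidu server.

    Picking a random IP per request spreads traffic across servers, so no
    single regional server sees enough volume to trip its rate limit.
    """
    ip = ip or random.choice(BAIDU_IPS)
    return requests.get(build_ip_url(ip), params={"wd": keyword}, timeout=10)
```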
Through the approaches above, we should be able to bypass Baidu's anti-crawler mechanism. But Baidu is no pushover; it also has its own distinctive last line of defense, which is perhaps better described as "search restrictions" or "resource protection" than as anti-crawler technology.
When you search for a keyword on Baidu, the reported number of results has an upper limit.
The maximum displayed count is 100,000,000; the real total can be far larger, so the displayed figure is not accurate.
Now look at the number of result pages:
At most 76 pages are displayed, and this is just the tip of the iceberg of all the results.
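One way to observe this page cap is to walk the `pn` offset until a page comes back empty. This is a sketch under two assumptions not stated in the article: that `pn` is the result-offset parameter counting 10 results per page, and that organic results carry `class="result"` in the page source; verify both against a live response, as Baidu's markup changes over time.

```python
import time

import requests  # third-party: pip install requests

# Assumption: organic result blocks carry this class in the page source.
RESULT_MARKER = 'class="result'

def page_urls(keyword, max_pages=80):
    """Yield result-page URLs; `pn` is assumed to offset by 10 per page."""
    for page in range(max_pages):
        yield "https://www.baidu.com/s?wd=%s&pn=%d" % (keyword, page * 10)

def count_available_pages(keyword, max_pages=80, interval=2.5):
    """Walk the pages until one comes back without results (~page 76)."""
    for page, url in enumerate(page_urls(keyword, max_pages)):
        if RESULT_MARKER not in requests.get(url, timeout=10).text:
            return page
        time.sleep(interval)  # stay under the rate limit while probing
    return max_pages
```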
Over several crawls I also happened to notice that whether or not a cookie is carried in the request headers affects the final search results that come back.
Strictly speaking, the points above are not anti-crawler technology but ways for Baidu to protect its own resources; the intent is self-evident.
By fetching the source of a Baidu results page and applying a regular expression, we can extract the result links. These links, however, are not the sites' original URLs; they come in the following two forms:
I will call these "Baidu links" for now; they basically take the two forms above. The first form is what you get by right-clicking a result and copying the link address; it usually carries an eqid parameter, which encodes the referer. The second form is what appears in the page source, without the wd and eqid parameters. The eqid value changes every time the page is refreshed; it may be a parameter Baidu uses to curb black-hat SEO.
Comparing the two: when we request these two kinds of links, the responses are different.
The first form, with the eqid parameter, returns 200, and the real link appears in the response body, where it can be pulled out with a regular expression.
The second form, without parameters, returns a 302 redirect with a Location field in the response header, which can be retrieved with the requests module (using a HEAD request).
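The two resolution paths can be sketched as follows. The regex is an assumption about the eqid-style page's markup, namely that it embeds the target in a JavaScript redirect like `window.location.replace("...")`; adjust the pattern if a live response differs.

```python
import re

import requests  # third-party: pip install requests

# Assumption: the eqid-style page redirects via window.location.replace("...").
REAL_URL_RE = re.compile(r'window\.location\.replace\("([^"]+)"\)')

def extract_real_url(body):
    """Form 1 (with eqid): the 200 response body contains the real link."""
    match = REAL_URL_RE.search(body)
    return match.group(1) if match else None

def resolve_redirect(baidu_link):
    """Form 2 (no eqid): HEAD returns a 302 whose Location is the real link."""
    resp = requests.head(baidu_link, allow_redirects=False, timeout=10)
    return resp.headers.get("Location")
```

Using HEAD with `allow_redirects=False` for the second form avoids downloading the target page at all; only the redirect header round-trip is paid.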
- Disclaimer: this article only lists the problems I personally ran into while crawling Baidu; it does not cover all of Baidu's anti-crawler technology. The solutions provided here are time-sensitive, so you will need to verify them yourself. If you have a better solution, feel free to leave a message.
This article's address: [http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)
When reprinting, please credit: [nMask's Blog](http://thief.one)
[Crawling the Search Engine: Sogou](http://thief.one/2017/03/19/Crawling search engine Sogou/)
[Crawling the Search Engine: Looking for You Among a Thousand Baidus](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)