What is the loneliest line of verse? "Eight or nine things out of ten go against one's wishes; barely two or three of them can be shared with anyone."
The previous article described the problems I ran into when crawling Baidu search results, and how I solved them. This article stays with search engines: what problems come up when crawling Sogou, and how can they be solved? In China, Sogou is far less well known than Baidu, but it is a rising star. Its search results are reasonably accurate and its crawling algorithm is decent, so it is a good alternative to Baidu. For the Baidu side of the story, see: [Crawling search engines: Baidu](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)
There are plenty of resources online about getting past anti-crawler measures, and the usual methods are proxies, CAPTCHA recognition, distributed architectures, browser emulation, ADSL IP switching, and so on. Those are not the focus here. This article covers only the anti-crawler measures I encountered while crawling Sogou, along with some solutions.
- Search results are accurate and comprehensive, and there are no resource-protection measures like Baidu's (the reported number of results is relatively accurate)
- It also has a wealth of resources
- Its anti-crawler measures are comparatively lenient
When crawling Sogou search results, the first problem to solve is cookies. Sogou checks whether the HTTP request carries a cookie; without one, only a very limited number of requests are allowed. To solve this, we first need to understand what the Sogou cookie contains and what each part does.
In my tests, a few cookie parameters turned out to be extremely important, and they are the key parameters behind Sogou's anti-crawler measures: SUID, SNUID, and SUV.
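As a minimal sketch of what "carrying the cookie" means in practice, the snippet below prepares (but does not send) a search request whose `Cookie` header contains these parameters. The endpoint `https://www.sogou.com/web`, the `query` parameter name, and the helper names are assumptions made for illustration, not something confirmed by this article.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed search endpoint and parameter name; adjust to what you observe.
SEARCH_URL = "https://www.sogou.com/web"


def build_cookie_header(cookies):
    """Join cookie name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_search_request(query, cookies):
    """Prepare a search request that carries the given cookies (not sent here)."""
    url = f"{SEARCH_URL}?{urlencode({'query': query})}"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": build_cookie_header(cookies),
    }
    return Request(url, headers=headers)
```

The prepared request can then be passed to `urllib.request.urlopen`, with the SUID and SNUID values obtained as described in the rest of this article.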
You can look up the exact meaning of SUID yourself; only how it is generated is described here. When we visit the Sogou search home page, the SUID value is issued in the response's Set-Cookie header. It does not change for a short time unless the browser is restarted. The value appears to be randomly assigned by the Sogou server and is only refreshed when a new session is opened.
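A small sketch of pulling the SUID out of the response headers, using only the standard library. The helper name `extract_cookie` and the sample header values in the usage note are hypothetical; the real headers Sogou sends may carry additional attributes.

```python
from http.cookies import SimpleCookie


def extract_cookie(set_cookie_headers, name):
    """Return the value of cookie `name` from a list of Set-Cookie
    header strings, or None if no header sets it."""
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)  # parses "NAME=value; attr=...; ..."
        if name in jar:
            return jar[name].value
    return None
```

With `urllib`, the list of headers can be obtained from a response via `response.headers.get_all("Set-Cookie")`, so a call such as `extract_cookie(headers, "SUID")` would yield the value to reuse in later requests.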
Likewise for SNUID, let's first look at what happens when the anti-crawler measure is triggered. Once a given SNUID has reached its access limit, further requests to Sogou are redirected to a CAPTCHA page.
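A crawler needs to notice this redirect so it can stop using the exhausted SNUID. The sketch below compares the URL that was requested with the URL that was finally served. The path marker `antispider` is an assumption about what the verification (CAPTCHA) page URL looks like; check it against the redirect you actually observe.

```python
from urllib.parse import urlparse

# Assumed marker in the path of the verification page; verify before relying on it.
CAPTCHA_PATH_MARKER = "antispider"


def is_captcha_redirect(requested_url, final_url):
    """True if the request ended up somewhere else and the final path
    contains the assumed verification-page marker."""
    final_path = urlparse(final_url).path
    return final_url != requested_url and CAPTCHA_PATH_MARKER in final_path
```

With `urllib`, the final URL after redirects is available as `response.geturl()`.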
Knowing how the SNUID value is generated is not enough by itself: the anti-crawler limit can only be bypassed if new SNUIDs are generated automatically.
After opening the CAPTCHA page and solving the CAPTCHA, a new SNUID is issued. That verification request can then be resent repeatedly without entering the CAPTCHA again, and each resend yields a new SNUID.
Alternatively, you can use PhantomJS to load a Sogou page and read the SNUID value from it.
- Getting the SUID is relatively simple: just visit Sogou directly.
- Once you have the SUID, obtain an SNUID (using any of the methods above).
- Once you have an SNUID, save it to a queue.
Note: an unused SNUID can be stored for a long time; it only becomes invalid once it has been used up to the limit. SUID generally has no usage limit and can be reused indefinitely.
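The queue of SNUID values could be sketched as follows. This is an illustration under stated assumptions, not the implementation used for this article: the class name `SnuidPool` is hypothetical, and `max_uses` stands in for the usage limit, whose real value is not given here and has to be found by experiment.

```python
from collections import deque


class SnuidPool:
    """FIFO pool of SNUID values, each usable a limited number of times."""

    def __init__(self, max_uses):
        self.max_uses = max_uses  # assumed per-SNUID request limit
        self._queue = deque()     # entries are [snuid, uses_so_far]

    def add(self, snuid):
        """Store a freshly obtained SNUID for later use."""
        self._queue.append([snuid, 0])

    def acquire(self):
        """Return an SNUID for one request, or None if the pool is empty."""
        if not self._queue:
            return None
        entry = self._queue[0]
        entry[1] += 1
        if entry[1] >= self.max_uses:
            self._queue.popleft()  # used up to the limit: retire it
        return entry[0]

    def discard(self, snuid):
        """Drop an SNUID early, e.g. after it led to the verification page."""
        self._queue = deque(e for e in self._queue if e[0] != snuid)

    def __len__(self):
        return len(self._queue)
```

A producer would call `add` whenever a new SNUID is obtained, and the crawler would call `acquire` before each request, falling back to fetching a new SNUID when it returns `None`.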
Besides cookies, crawling Sogou also requires solving the IP problem. The approach used for Baidu applies here as well; see: [Crawling search engines: Baidu](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)
*Disclaimer: this article only lists the problems I ran into when crawling Sogou. It does not cover all of Sogou's anti-crawler techniques. The solutions described here are time-sensitive, so you will need to verify them yourself. If you have a better solution, feel free to leave a message.*
Link to this article: [http://thief.one/2017/03/19/ Crawling search engine Sogou/](http://thief.one/2017/03/19/ Crawling search engine Sogou/)
When reprinting, please credit: [nMask's Blog](http://thief.one)