Crawling search engine Sogou

What is the loneliest line of verse? "Eight or nine things in ten go against one's wishes, yet barely two or three of them can be shared with anyone."

The previous article described the problems I ran into when crawling Baidu search results, and how to solve them. This article continues the search-engine theme: what problems come up when crawling Sogou, and how can they be solved? Sogou is far less well known in China than Baidu, but it is a rising star, and both the accuracy of its results and its crawling algorithm are respectable. It is fair to call Sogou a good alternative to Baidu for search in China. If you want the Baidu side of the story, see: [Crawling search engines: Baidu](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)

There is plenty of material online about anti-crawler techniques, and the usual countermeasures come down to proxies, CAPTCHA recognition, distributed architectures, browser emulation, switching IPs over ADSL, and so on. Those are not the focus here. This article covers only the anti-crawler measures I ran into while crawling the Sogou search engine, along with some solutions.

Why crawl Sogou?

  • Its search results are fairly accurate and comprehensive, and it has nothing like Baidu's resource-protection measures (the reported number of search results is relatively accurate)
  • It also indexes a wealth of resources
  • Its anti-crawler measures are comparatively lenient

Sogou's anti-crawler measures

When crawling Sogou search results, the first problem to solve is cookies. Sogou checks whether each HTTP request carries a Cookie header; without one, the number of requests you can make is very limited. To get around this, we first need to understand what the Sogou cookie contains and what each part does.

Cookie:
ABTEST=3|1489908642|v17;
IPLOC=CN3301;
SUID=899F006F2208990A0000000058CE33A3;
SUV=1489908643339695;
browerV=3;
osV=1;
sct=1;
SNUID=1B0D93FD9297D882F63E3C8D93692285;
ld=E@n5Llllll2Y80nclllllV0nGEklllllbZjKAyllll9lllll9Zlll5@@@@@@@@@@

In my tests, three of these turned out to matter most and are the key parameters behind Sogou's anti-crawler measures: SUID, SNUID and SUV.
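To make the cookie's structure concrete, here is a minimal Python 3 sketch that splits a Cookie header of this shape into name/value pairs and picks out the three parameters. The helper name `parse_cookie_header` is hypothetical, and it only handles the simple `name=value; name=value` form shown in this article.

```python
def parse_cookie_header(header):
    """Split a Cookie header value into a {name: value} dict.

    Hypothetical helper for illustration only; it handles the plain
    "name=value; name=value" form and nothing more exotic.
    """
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            cookies[name.strip()] = value.strip()
    return cookies


# Sample values taken from the cookie listed in this article.
sample = (
    "ABTEST=3|1489908642|v17; IPLOC=CN3301; "
    "SUID=899F006F2208990A0000000058CE33A3; SUV=1489908643339695; "
    "SNUID=1B0D93FD9297D882F63E3C8D93692285"
)
parsed = parse_cookie_header(sample)
key_params = {name: parsed.get(name) for name in ("SUID", "SNUID", "SUV")}
print(key_params)
```

Keeping the three values in a dict makes it easy to swap in a fresh SNUID later without touching the rest of the cookie.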

SUID

You can look up the exact meaning of SUID yourself; only how it is generated is described here. When we visit the Sogou home page, the response sets the SUID value through a Set-Cookie header. The SUID stays the same for a while unless the browser is restarted. Its value appears to be assigned randomly by the Sogou server, and it is only refreshed when a new session is opened.
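The step of reading the SUID out of the Set-Cookie headers can be sketched as follows, using only the Python 3 standard library. This is a rough illustration, not a tested recipe: the function names are hypothetical, and it assumes the server still answers a cookie-less request to the home page with a Set-Cookie header for SUID, as it did in my tests.

```python
import urllib.request


def cookie_from_set_cookie(set_cookie_values, name):
    """Return the value of cookie `name` from a list of Set-Cookie header values."""
    for raw in set_cookie_values:
        pair = raw.split(";", 1)[0]  # drop attributes such as path/expires
        key, _, value = pair.partition("=")
        if key.strip() == name:
            return value.strip()
    return None


def fetch_suid(url="https://www.sogou.com/", timeout=10):
    """Visit the home page without cookies and read SUID from the response.

    Assumption: SUID is set on the final response. urlopen follows
    redirects, so a cookie set on an intermediate response would be missed.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        set_cookie_values = response.headers.get_all("Set-Cookie") or []
    return cookie_from_set_cookie(set_cookie_values, "SUID")
```

Calling `fetch_suid()` needs network access and returns `None` if no SUID cookie was set, so check the result before using it.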

SNUID

SNUID is the core of Sogou's anti-crawler mechanism. Sogou limits how many requests can be made with the same SNUID. Once the limit is exceeded, requests are redirected to a verification code (CAPTCHA) page; after the code is entered correctly, the SNUID is refreshed and access resumes. So how is the SNUID generated? My tests suggest it is generated by JavaScript, and only once an SUID exists: the SUID is the basis for generating the SNUID.

SUV

The SUV value is generated by JavaScript. My tests found no effect on the anti-crawler behaviour, so it is not covered in detail here.

What being blocked looks like

As before, to deal with the anti-crawler measures, let's first look at what happens when they are triggered. Once a given SNUID has hit its access limit, further requests to Sogou are redirected to a verification code page.
URL:

http://www.sogou.com/antispider/?from=%2fweb%3Fquery%3d152512wqe%26ie%3dutf8%26_ast%3d1488957312%26_asf%3dnull%26w%3d01029901%26p%3d40040100%26dp%3d1%26cid%3d%26cid%3d%26sut%3d578%26sst0%3d1488957299160%26lkt%3d3%2C1488957298718%2C1488957298893
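A crawler can recognise this redirect from the URL alone. The sketch below (Python 3, standard library only, hypothetical function names) checks for the `/antispider` path and recovers the blocked request from the `from` parameter; the sample is a shortened version of the URL above.

```python
from urllib.parse import parse_qs, urlsplit


def is_antispider_url(url):
    """True if `url` points at Sogou's verification code page."""
    return urlsplit(url).path.startswith("/antispider")


def original_request(url):
    """Return the blocked path and query, taken from the `from` parameter."""
    values = parse_qs(urlsplit(url).query).get("from")
    return values[0] if values else None


# Shortened version of the redirect URL shown in this article.
blocked = "http://www.sogou.com/antispider/?from=%2fweb%3Fquery%3d152512wqe%26ie%3dutf8"
print(is_antispider_url(blocked))
print(original_request(blocked))
```

With `requests`, the final URL after redirects is available as `response.url`, so the check can be run on every response before parsing results.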

Page source code:

HTTP/1.1 200 OK
Server: nginx
Date: Thu, 27 Oct 2016 04:41:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.3.3
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Length: 5130
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<link rel="shortcut icon" href="//www.sogou.com/images/logo2014/new/favicon.ico" type="image/x-icon">
<title>Sogou Search</title>
<link rel="stylesheet" href="static/css/anti.min.css?v=1"/>
<script src="//dl.web.sogoucdn.com/common/lib/jquery/jquery-1.11.0.min.js"></script>
<script src="static/js/antispider.min.js?v=2"></script>
<script>
var domain = getDomain();
window.imgCode = -1;
(function() {
function checkSNUID() {
var cookieArr = document.cookie.split('; '),
count = 0;
for(var i = 0, len = cookieArr.length; i < len; i++) {
if (cookieArr[i].indexOf('SNUID=') > -1) {
count++;
}
}
return count > 1;
}
if(checkSNUID()) {
var date = new Date(), expires;
date.setTime(date.getTime() -100000);
expires = date.toGMTString();
document.cookie = 'SNUID=1;path=/;expires=' + expires;
document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.www.sogou.com';
document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.weixin.sogou.com';
document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.sogou.com';
document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.snapshot.sogoucdn.com';
sendLog('delSNUID');
}
if(getCookie('seccodeRight') === 'success') {
sendLog('verifyLoop');
setCookie('seccodeRight', 1, getUTCString(-1), location.hostname, '/');
}
if(getCookie('refresh')) {
sendLog('refresh');
}
})();
function setImgCode(code) {
try {
var t = new Date().getTime() - imgRequestTime.getTime();
sendLog('imgCost',"cost="+t);
} catch (e) {
}
window.imgCode = code;
}
sendLog('index');
function changeImg2() {
if(window.event) {
window.event.returnValue=false
}
}
</script>
</head>
<body>
<div class="header">
<div class="logo"><a href="/"><img width="180" height="60" src="//www.sogou.com/images/logo2014/error180x60.png"></a></div>
<div class="other"><span class="s1">Something went wrong with your visit</span><span class="s2"><a href="/">Back to Home&gt;&gt;</a></span></div>
</div>
<div class="content-box">
<p class="ip-time-p">IP: 183.129.218.233<br>Visit Time: 2016.10.27 12:41:19</p>
<p class="p2">Dear user, your visits are too frequent. To confirm that this is normal user activity, please help us by completing the verification.</p>
<p class="p3"><label for="seccodeInput">Verification code:</label></p>
<form name="authform" method="POST" id="seccodeForm" action="/">
<p class="p4">
<input type=text name="c" value="" placeholder="Please enter the verification code" id="seccodeInput">
<input type="hidden" name="tc" id="tc" value="">
<input type="hidden" name="r" id="from" value="%2Fweb%3Fquery%3D%E6%9F%90%E8%8D%A3%26ie%3Dutf8%26_ast%3D1477536768%26_asf%3Dnull%26w%3D01029901%26cid%3D" >
<input type="hidden" name="m" value="0" > <span class="s1">
<script>imgRequestTime=new Date();</script>
<a onclick="changeImg2();" href="javascript:void(0)">
<img id="seccodeImage" onload="setImgCode(1)" onerror="setImgCode(0)" src="util/seccode.php?tc=1477543279" width="100" height="40" alt="Please enter the verification code in the picture" title="Please enter the verification code in the picture">
</a>
</span>
<span class="s2" id="error-tips" style="display: none;"></span>
</p>
</form>
<p class="p5">
<span>Problem not solved after submitting? Please send us <a href="http://fankui.help.sogou.com/index.php/web/web/index?type=10&anti_time=1477543279&domain=www.sogou.com" target="_blank">feedback</a>.</span>
</p>
</div>
<div id="ft"><a href="http://fuwu.sogou.com/" target="_blank">Corporate Promotion</a><a href="http://corp.sogou.com/" target="_blank">About Sogou</a><a href="/docs/terms.htm?v=1" target="_blank">Disclaimer</a><a href="http://fankui.help.sogou.com/index.php/web/web/index?type=10&anti_time=1477543279&domain=www.sogou.com" target="_blank">Feedback</a><br>&nbsp;&copy; &nbsp;2016<span id="footer-year"></span>&nbsp;SOGOU&nbsp;-&nbsp;<a href="http://www.miibeian.gov.cn" target="_blank" class="g">Beijing ICP Certificate No. 050897</a>&nbsp;-&nbsp;Jinggong Net Security 1100<span class="ba">00000025#</span></div>
<script src="static/js/index.min.js?v=0.1.3"></script>
</body>
</html><!-- bad -->
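The parts of this page that matter to a crawler are the form fields (`c` for the typed code, plus the hidden `tc`, `r` and `m`) and the verification image at `util/seccode.php?tc=...`. As a rough sketch, they can be pulled out with Python 3's standard `html.parser`. The class name `AntispiderFormParser` is hypothetical, and the snippet fed to it is a trimmed copy of the form above with a shortened `r` value.

```python
from html.parser import HTMLParser


class AntispiderFormParser(HTMLParser):
    """Collect the form fields and verification image URL from the block page."""

    def __init__(self):
        super().__init__()
        self.fields = {}       # input name -> value
        self.image_src = None  # src of <img id="seccodeImage">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""
        elif tag == "img" and attrs.get("id") == "seccodeImage":
            self.image_src = attrs.get("src")


# Trimmed copy of the form from the page source; the "r" value is shortened.
snippet = '''
<form name="authform" method="POST" id="seccodeForm" action="/">
<input type=text name="c" value="" id="seccodeInput">
<input type="hidden" name="tc" id="tc" value="">
<input type="hidden" name="r" id="from" value="%2Fweb%3Fquery%3Dabc%26ie%3Dutf8" >
<input type="hidden" name="m" value="0" >
<img id="seccodeImage" src="util/seccode.php?tc=1477543279" width="100" height="40">
</form>
'''
parser = AntispiderFormParser()
parser.feed(snippet)
print(parser.fields)
print(parser.image_src)
```

The image URL is relative, so it has to be resolved against the `/antispider/` page address before it can be downloaded.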

Automating SNUID generation

Knowing how the SNUID is generated is not enough: only by generating it automatically can we get past the anti-crawler limit.

Through the verification code page

After visiting the verification code page and submitting the correct code, a new SNUID is generated. The verification request can then be resent repeatedly (without entering the code again), and each resend yields a new SNUID.

Through an emulated browser that executes JavaScript

You can use PhantomJS to load a Sogou page and read the SNUID value.
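Selenium's `get_cookies()` returns the browser's cookies as a list of dicts with `name` and `value` keys, and the position of a given cookie in that list is not something to rely on. A small lookup by name is sturdier; the sketch below (Python 3, hypothetical helper name) uses a hand-written list in place of a live browser session.

```python
def cookie_value(cookies, name, default=""):
    """Return the value of the cookie called `name`.

    `cookies` is a list of dicts with "name" and "value" keys, which is
    the shape Selenium's get_cookies() returns.
    """
    for cookie in cookies:
        if cookie.get("name") == name:
            return cookie.get("value", default)
    return default


# Stand-in for driver.get_cookies(); values come from the sample cookie earlier.
sample_cookies = [
    {"name": "IPLOC", "value": "CN3301", "domain": ".sogou.com"},
    {"name": "SNUID", "value": "1B0D93FD9297D882F63E3C8D93692285", "domain": ".sogou.com"},
]
print(cookie_value(sample_cookies, "SNUID"))
```

An empty result means the page has not set an SNUID yet, which is worth treating as a failed attempt rather than a valid value.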

Code for obtaining the SNUID

# -*- coding: utf-8 -*-
"""
Get the value of SNUID
"""
import json
import random
import re
import time

import requests

try:
    import Queue as queue  # Python 2
except ImportError:
    import queue  # Python 3

# Cookie sent with the plain HTTP requests below; it must contain a SUID
BASE_COOKIE = ("ABTEST=0|1488956269|v17; "
               "IPLOC=CN3301; "
               "SUID=E9DA81B7290B940A0000000058BFAB6D; "
               "PHPSESSID=rfrcqafv5v74hbgpt98ah20vf3; "
               "SUIR=1488956269")


def phantomjs_getsnuid():
    """Method (1): load the Sogou search result page in PhantomJS and read the SNUID cookie."""
    from selenium import webdriver
    d = webdriver.PhantomJS(r'D:\python27\Scripts\phantomjs.exe',
                            service_args=['--load-images=no', '--disk-cache=yes'])
    Snuid = ""
    try:
        d.get("https://www.sogou.com/web?query=")
        # Look the cookie up by name rather than by its position in the list
        for cookie in d.get_cookies():
            if cookie["name"] == "SNUID":
                Snuid = cookie["value"]
    except Exception:
        Snuid = ""
    finally:
        d.quit()
    return Snuid


def Method_one():
    """Method (2): request a specific URL and read the id from the JSON body."""
    url = "http://www.sogou.com/antispider/detect.php?sn=E9DA81B7290B940A0000000058BFAB0&wdqz22=12&4c3kbr=12&ymqk4p=37&qhw71j=42&mfo5i5=7&3rqpqk=14&6p4tvk=27&eiac26=29&iozwml=44&urfya2=38&1bkeul=41&jugazb=31&qihm0q=8&lplrbr=10&wo65sp=11&2pev4x=23&4eyk88=16&q27tij=27&65l75p=40&fb3gwq=27&azt9t4=45&yeyqjo=47&kpyzva=31&haeihs=7&lw0u7o=33&tu49bk=42&f9c5r5=12&gooklm=11&_=1488956271683"
    headers = {"Cookie": BASE_COOKIE}
    try:
        f = json.loads(requests.get(url, headers=headers).text)
        Snuid = f["id"]
    except Exception:
        Snuid = ""
    return Snuid


def Method_two():
    """Method (3): request a specific URL and inspect the response headers."""
    url = "https://www.sogou.com/web?query=333&_asf=www.sogou.com&_ast=1488955851&w=01019900&p=40040100&ie=utf8&from=index-nologin"
    headers = {"Cookie": BASE_COOKIE}
    f = requests.head(url, headers=headers).headers
    print(f)


def Method_three():
    """Method (4): submit the verification code on the blocked page; the response carries a new SNUID.

    Verification code image: http://www.sogou.com/antispider/util/seccode.php?tc=1488958062
    Blocked page:
    http://www.sogou.com/antispider/?from=%2fweb%3Fquery%3d152512wqe%26ie%3dutf8%26_ast%3d1488957312%26_asf%3dnull%26w%3d01029901%26p%3d40040100%26dp%3d1%26cid%3d%26cid%3d%26sut%3d578%26sst0%3d1488957299160%26lkt%3d3%2C1488957298718%2C1488957298893
    Open that page and fill in the verification code; the browser then sends
    the request reproduced below, and the SNUID can be read from its response.
    """
    res = r'id": "([^"]*)"'
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Cookie": "CXID=65B8AE6BEE1CE37D4C63855D92AF339C; SUV=006B71D7B781DAE95800816584135075; IPLOC=CN3301; pgv_pvi=3190912000; GOTO=Af12315; ABTEST=8|1488945458|v17; PHPSESSID=f78qomvob1fq1robqkduu7v7p3; SUIR=D0E3BB8E393F794B2B1B02733A162729; SNUID=B182D8EF595C126A7D67E4E359B12C38; sct=2; sst0=958; ld=AXrrGZllll2Ysfa1lllllVA@rLolllllHc4zfyllllYllllljllll5@@@@@@@@@@; browerV=3; osV=1; LSTMV=673%2C447; LCLKINT=6022; ad=6FwTnyllll2g@popQlSGTVA@7VCYx98tLueNukllll9llllljpJ62s@@@@@@@@@@; SUID=EADA81B7516C860A57B28911000DA424; successCount=1|Wed, 08 Mar 2017 07:51:18 GMT; seccodeErrorCount=1|Wed, 08 Mar 2017 07:51:45 GMT",
    }
    data = "c=6exp2e&r=%252Fweb%253Fquery%253Djs%2B%25E6%25A0%25BC%25E5%25BC%258F%25E5%258C%2596%2526ie%253Dutf8%2526_ast%253D1488957312%2526_asf%253Dnull%2526w%253D01029901%2526p%253D40040100%2526dp%253D1%2526cid%253D%2526cid%253D&v=5"
    try:
        buf = requests.post("http://www.sogou.com/antispider/thank.php",
                            data=data, headers=headers).text
    except requests.RequestException:
        return ""
    L = re.findall(res, buf)
    if len(L) > 0:
        Snuid = L[0]
    else:
        Snuid = ""
    return Snuid


def getsnuid(q):
    """Keep the queue topped up with fresh SNUID values."""
    methods = [Method_one, Method_three, phantomjs_getsnuid]
    while 1:
        if q.qsize() < 10:
            # Pick one method at random and call only that one
            Snuid = random.choice(methods)()
            if Snuid != "":
                q.put(Snuid)
                print(Snuid)
        time.sleep(0.5)


if __name__ == "__main__":
    q = queue.Queue()
    getsnuid(q)

In summary:

  • Getting the SUID is fairly simple: just visit Sogou directly.
  • Once you have the SUID, obtain an SNUID (any of the approaches above will do).
  • Once you have an SNUID, store it in a queue.

Note: an unused SNUID can be kept for a long time; it only becomes invalid once it has been used up to the limit. The SUID is generally not subject to a usage limit and can be reused indefinitely.
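On the consuming side, the crawler needs to take an SNUID from the queue, reuse it for a number of requests, and move on to the next one before it hits the limit. Below is a minimal Python 3 sketch of that idea. The class `SnuidPool` and its `max_uses` parameter are hypothetical, and the default of 50 is an arbitrary placeholder, since I did not measure the exact per-SNUID limit.

```python
import queue


class SnuidPool:
    """Hand out SNUID values and retire each one after `max_uses` requests.

    `max_uses` is an assumed, tunable number; the real per-SNUID limit
    is not stated in this article.
    """

    def __init__(self, max_uses=50):
        self.max_uses = max_uses
        self._fresh = queue.Queue()  # unused SNUIDs waiting their turn
        self._current = None
        self._uses = 0

    def add(self, snuid):
        """Store a freshly generated SNUID."""
        self._fresh.put(snuid)

    def get(self):
        """Return the SNUID to use for the next request."""
        if self._current is None or self._uses >= self.max_uses:
            # Raises queue.Empty when no fresh SNUID is left
            self._current = self._fresh.get_nowait()
            self._uses = 0
        self._uses += 1
        return self._current

    def discard_current(self):
        """Drop the current SNUID early, e.g. after a redirect to the verification page."""
        self._current = None


pool = SnuidPool(max_uses=2)
pool.add("SNUID-A")
pool.add("SNUID-B")
print([pool.get() for _ in range(3)])  # ['SNUID-A', 'SNUID-A', 'SNUID-B']
```

Calling `discard_current()` as soon as a request lands on the verification page keeps a burnt SNUID from being reused.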

Solving the IP problem

Besides cookies, crawling Sogou also means dealing with IP restrictions. The approach used for crawling Baidu applies here as well; see: [Crawling search engines: Baidu](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)

Disclaimer: this article only lists the problems I ran into when crawling Sogou; it does not cover all of Sogou's anti-crawler techniques. The solutions given here are time-sensitive, so you will need to verify them yourself. If you have a better solution, feel free to leave a message.

Address of this article: [http://thief.one/2017/03/19/ Crawling search engine Sogou/](http://thief.one/2017/03/19/ Crawling search engine Sogou/)
When reprinting, please credit the source: [nMask's Blog](http://thief.one)

Related articles

[Crawling search engines: Sogou](http://thief.one/2017/03/19/Crawling search engine Sogou/)
[Crawling search engines: Baidu](http://thief.one/2017/03/17/%E7%88%AC%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E5%AF%BB%E4%BD%A0%E5%8D%83%E7%99%BE%E5%BA%A6/)

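Article title: Crawling search engine Sogou

Author: nmask

Published: 2017-03-19 15:03

Last updated: 2019-08-16 15:08

Original link: https://thief.one/2017/03/19/Crawling search engine Sogou/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reprinting.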
