Use it, and hide it
I wrote a subdomain scanning tool a few months ago. It has worked well in the time since, so I would like to share the ideas behind it. There are of course many excellent open-source subdomain scanners on GitHub, and they all rest on the same few techniques: dictionary brute forcing, crawling third-party platforms, crawling DNS resolution records, querying search engines, and using certificate information. This article is not a comparison of tools. Its main purpose is to share how to use the Baidu search engine to collect subdomains efficiently.
Before getting to search engines, I will cover the shortcomings of dictionary brute forcing and my own workarounds.
Most subdomain scanners rely on dictionary brute forcing: prepare a dictionary of common subdomain names, then query a DNS server for each candidate and see whether it resolves. This is very efficient, since the requests can be scripted and sent quickly, and with a good enough dictionary the results are good. There is one problem, though. If the domain uses wildcard (pan-domain) resolution, almost every name in the dictionary resolves successfully, which produces a lot of useless subdomain data.
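As a rough illustration of dictionary brute forcing, here is a minimal Python sketch. It is not the code from my project; the names `resolve_or_none` and `brute_force`, the thread count, and the injectable `resolve` parameter are all assumptions made for this example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_or_none(hostname):
    """Return the IPv4 address for hostname, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def brute_force(domain, words, resolve=resolve_or_none, workers=20):
    """Try every word as a subdomain label; return the (subdomain, ip) pairs that resolve.

    `resolve` is a hypothetical hook so the lookup can be swapped out or faked.
    """
    candidates = [f"{word}.{domain}" for word in words]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ips = pool.map(resolve, candidates)  # results come back in input order
        return [(name, ip) for name, ip in zip(candidates, ips) if ip is not None]
```

Calling `brute_force("example.com", ["www", "mail", "dev"])` would issue one lookup per word and keep only the names that resolve. On a wildcard domain every candidate would come back, which is exactly the problem described above.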
I handle this in two ways. The first is to store the brute-forced subdomain + IP results in a list and then count how many subdomains map to each IP. If the count exceeds a threshold, those entries are treated as useless data (because under normal circumstances a single IP does not have a particularly large number of domain names bound to it). The second is to construct a special subdomain before brute forcing, for example "iamisnmask.thief.one": a random-looking string that nobody would actually use as a subdomain. If it resolves successfully, the domain is using wildcard resolution.
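The two checks above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names, the default threshold of 10, the number of probes, and the injectable `resolve` parameter are all made up for the example and are not taken from my project.

```python
import random
import socket
import string
from collections import Counter


def random_label(length=16):
    """Return a random lowercase label that is very unlikely to be a real subdomain."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))


def has_wildcard(domain, resolve=socket.gethostbyname, probes=3):
    """Return True if any random subdomain of `domain` resolves (second method)."""
    for _ in range(probes):
        try:
            resolve(f"{random_label()}.{domain}")
            return True  # a name nobody registered resolved: wildcard is likely
        except OSError:  # lookup failed, which is the expected non-wildcard case
            continue
    return False


def drop_overloaded_ips(results, threshold=10):
    """Drop (subdomain, ip) pairs whose IP is shared by more than `threshold` names (first method)."""
    counts = Counter(ip for _, ip in results)
    return [(sub, ip) for sub, ip in results if counts[ip] <= threshold]
```

The probe is cheap, so it makes sense to run it before brute forcing. The threshold filter works after the fact and also catches the case where only part of the namespace is wildcarded, at the cost of having to pick a threshold.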
The search-engine approach stems from the fact that a search engine is itself a huge crawler system that holds a great deal of information about websites, including their subdomains, so it can be used to collect them. There are many search engines in China; here I only use Baidu. I have also written separate articles on how to crawl Baidu and Sogou, for anyone interested: Crawling the Search Engine: Finding You Thousands of Times on Baidu, and [Crawling Search Engine Sogou](https://thief.one/2017/03/19/%E7%88%AC%E5%8F%96%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%B9%8B%E6%90%9C%E7%8B%97/)
Crawling Baidu first requires collecting a list of Baidu IPs for distributed crawling, which both speeds up the crawl and helps avoid being blocked by Baidu. The next question is how to search so as to get as many subdomain results as possible. I list two approaches here; additions are welcome.
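One way to spread requests across a Baidu IP list is to rotate through the IPs and build each request URL against the next one. The Python sketch below only constructs the requests and sends nothing. It rests on several assumptions that should be checked before use: the addresses are placeholders from a documentation range rather than real Baidu IPs, `wd` and `pn` are taken to be Baidu's query and result-offset parameters with 10 results per page, and the `Host` header value is a guess at what a front-end server expects. `build_requests` is a hypothetical helper name.

```python
import itertools
from urllib.parse import urlencode

# Placeholder addresses from the TEST-NET-1 documentation range, not real Baidu IPs.
BAIDU_IPS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]


def build_requests(queries, ips=BAIDU_IPS, pages=1):
    """Yield (url, headers) pairs, assigning IPs to requests in round-robin order."""
    pool = itertools.cycle(ips)
    for query in queries:
        for page in range(pages):
            # Assumed parameters: wd = search terms, pn = result offset.
            params = urlencode({"wd": query, "pn": page * 10})
            yield f"http://{next(pool)}/s?{params}", {"Host": "www.baidu.com"}
```

Each pair can then be handed to whatever HTTP client and worker pool the crawler uses, so that no single IP receives all of the traffic.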
First, you can use the site and link operators in Baidu's search syntax. For example, searching Baidu for the query site:ctrip.com link:ctrip.com returns subdomains related to ctrip.com. However, Baidu only displays the first 76 pages of search results, so the subdomains obtained this way are certainly incomplete.
Second, to work around that drawback, we can combine site with inurl to split the search into chunks. For example, search Baidu for the query site:ctrip.com inurl:login. The keyword after inurl can be drawn from a dictionary of common words, so that many chunked searches together uncover as many subdomains as possible.
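The chunked search can be sketched in Python as two small steps: generate one query per dictionary keyword, then pull the distinct hostnames out of whatever result URLs come back. Fetching and parsing the Baidu result pages is left out. The names `build_queries` and `extract_subdomains` are made up for this illustration and are not the functions in my project.

```python
from urllib.parse import urlparse


def build_queries(domain, keywords):
    """Return one plain site: query plus one site: + inurl: query per keyword."""
    return [f"site:{domain}"] + [f"site:{domain} inurl:{kw}" for kw in keywords]


def extract_subdomains(urls, domain):
    """Collect the distinct hostnames under `domain` from a list of result URLs."""
    found = set()
    for url in urls:
        # Result entries may lack a scheme; add one so urlparse can find the host.
        host = urlparse(url if "://" in url else "http://" + url).hostname
        if host and (host == domain or host.endswith("." + domain)):
            found.add(host)
    return sorted(found)
```

Because every inurl keyword gets its own set of result pages, the per-query page limit applies to each chunk separately rather than to the domain as a whole. Merging the hostnames from all chunks into one set removes the duplicates that the overlapping searches inevitably produce.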
The project code is fairly simple, so I will not walk through it here. This article mainly shares some personal ideas for collecting as many subdomains as possible through the Baidu search engine. To make discussion easier, I have uploaded the code to GitHub; feel free to leave a comment.
Project address: https://github.com/tengzhangchao/subdomain_baidu_search
WeChat official account article: [When Subdomains Meet Search Engines](https://mp.weixin.qq.com/s?__biz=MzI5NTQ5MTAzMA==&mid=2247483958&idx=1&sn=c63ecc94f0415aa836600c454fb3a793&chksm=ec53868fdb240f99b0695bd270cad1b81e1ade4e0e18270d6e2c395ae2fb6d63fcf55704b6a8&token=192719934&lang=zh_CN#rd)