Maybe the old accent is my sadness
Anyone who has written a crawler or a vulnerability scanner has probably run into the same problem: how to determine whether the page behind a URL is a 404 page. The answer matters a great deal for all of the logic that follows. Because of certain special cases, however, 404 detection is not as simple as it seems, and the difficulty is usually tied to how the server is configured. This article is a branch of [Discussion on the Design and Development of Vulnerability Scanners](https://thief.one/2018/03/16/1/) and focuses on how to determine whether a page is a 404 page.
Under normal circumstances, whether a web page is a 404 page is decided mainly by the response code. A response code of 404 means the page does not exist; any other code means it does. For the sake of user-friendliness, however, many websites customize their 404 handling, and there are several common ways of doing so.
The first method: when the user accesses a page that does not exist, the server redirects to a specified URL, often the home page or the landing page of the website. In this case, the request for the non-existent page returns 302 followed by 200 (server-side redirect), or returns 200 directly (client-side redirect, which the user can notice). The content of the page is that of the home page, the landing page, or whichever page was specified.
The second method: when the user accesses a page that does not exist, the server redirects to a 404 page. The difference from the first method is that the page after the redirect really is a 404 page, only one that has been specially customized. In this case, the request for the non-existent page returns 302 followed by 200 (server-side redirect), or returns 200 directly (client-side redirect), and the content of the page is the customized 404 page.
The third method: when the user accesses a page that does not exist, the 404 page is displayed directly, with no redirect (the server default). In this case, the response code may be 404 (the default) or 200, and the content is either the server's default 404 page or a customized one.
In summary, the response code of a 404 page may be 404, 302, or 200 (other cases exist as well), and the content of a 404 page may be the home page (or another designated page), a customized 404 page, or the server's default 404 page.
From this we can derive a rough judgment logic (pseudocode as follows):
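The original listing is not reproduced here, so the following is only a minimal Python sketch of the logic described above. The function name `is_404_page`, the `similarity` callable, and the default threshold of 0.9 are assumptions made for illustration; they do not come from the original article.

```python
from typing import Callable, Iterable


def is_404_page(
    status_code: int,
    body: str,
    known_404_bodies: Iterable[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.9,  # assumed value; the article does not give one
) -> bool:
    """Rough 404 judgment: trust the status code first, then compare content."""
    # Case 1: the server says so explicitly.
    if status_code == 404:
        return True
    # Case 2: the status is 200 (or another code), but the content looks like
    # one of the 404 pages collected from this site in advance.
    return any(similarity(body, ref) >= threshold for ref in known_404_bodies)
```

The similarity function is passed in as a parameter so that the comparison method can be swapped without touching the judgment logic.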
To apply this logic, however, we have to solve two problems. Problem 1: how to collect the 404 page content of the website in advance. Problem 2: how to judge whether the content of the target page is similar to the content of the website's 404 page.
The first problem is the easier one. We can construct some paths that do not exist (such as /this_is_a_404_nmask_page), request them, and keep the page content that comes back.
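As a hedged sketch of this collection step, the Python below builds a few random paths that almost certainly do not exist and fetches them with the standard library. The helper names `build_probe_urls` and `fetch`, the path prefix, and the probe count are hypothetical choices, not part of the original article.

```python
import urllib.error
import urllib.request
import uuid
from urllib.parse import urljoin


def build_probe_urls(base_url: str, count: int = 3) -> list:
    """Return `count` URLs on the same host that should not exist."""
    # A random suffix makes an accidental hit on a real page very unlikely.
    return [
        urljoin(base_url, f"/this_is_a_404_probe_{uuid.uuid4().hex}")
        for _ in range(count)
    ]


def fetch(url: str, timeout: float = 10.0):
    """Return (status_code, body). urlopen follows server-side redirects."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions but still carry a body.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Requesting several probes rather than one gives a few samples of the site's 404 content, which helps when that content varies between requests.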
The second problem is more troublesome. Note first that we are asking whether the pages are similar, not whether they are identical. Why not simply test for equality? Because some 404 pages contain variable elements, such as the current time or promotional content, so the content differs from one request to the next. Judging whether the target page is similar to the website's 404 page, rather than identical to it, is therefore the more reliable way to identify a 404 page.
So how do we judge whether two web pages are similar? Here we borrow an algorithm used for judging the similarity of articles: cosine similarity. What is cosine similarity, and how can it be used to judge the similarity of web pages? Read on.
Suppose we need to decide whether two articles are similar. One approach is:
(1) Use the TF-IDF algorithm to find the keywords of the two articles;
(2) Take several keywords from each article (say 20), merge them into one set, and calculate each article's word frequency for the words in that set (relative word frequency can be used to offset differences in article length);
(3) Generate a word frequency vector for each of the two articles;
(4) Calculate the cosine similarity of the two vectors. The larger the value, the more similar the articles.
Sentence A: I / like / watch / TV, no / like / watch / movie.
Sentence B: I / no / like / watch / TV, also / no / like / watch / movie.
All the words are: I, like, watch, TV, movie, no, also.
Calculate word frequency: (number of occurrences)
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, no 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, no 2, also 1.
Calculate the word frequency vector:
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
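We can think of the two vectors as two line segments in space, both starting from the origin, and judge their similarity by the size of the angle between them. The smaller the angle, the more similar they are. The cosine of the angle θ between vectors A and B is cos θ = (A · B) / (|A| × |B|).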
Note: The closer the cosine value is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are. This is called "cosine similarity".
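To make the calculation concrete, here is a small Python sketch that applies the standard cosine formula to the two word frequency vectors from the example. The function name `cosine_similarity` is chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector: treat as not similar
    return dot / (norm_a * norm_b)


sentence_a = [1, 2, 2, 1, 1, 1, 0]
sentence_b = [1, 2, 2, 1, 1, 2, 1]

# dot = 13, |A| = sqrt(12), |B| = 4, so the result is 13 / (4 * sqrt(12))
print(round(cosine_similarity(sentence_a, sentence_b), 3))  # 0.938
```

A value of about 0.938 is close to 1, which matches the intuition that the two sentences are very similar.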
Two candidate algorithms are cosine similarity and Hamming distance. In my tests, cosine similarity was the more accurate of the two for judging the similarity of web pages.
Note: Hamming distance is more sensitive to order, for example whether the elements of two web pages appear in a similar sequence, while cosine similarity looks at the overall counts of the tags and is not sensitive to order. Hamming distance can be viewed as the distance between points, and cosine similarity as the angle between lines.
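The original listings are not reproduced here, so the sketch below is only one possible way to illustrate the difference, under the assumption that pages are compared by their HTML tags. It counts tags for the cosine measure and compares tag sequences position by position for a Hamming-style measure. Strictly speaking, Hamming distance is defined for sequences of equal length; treating any extra tags as mismatches is a simplification made for this example. All names are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def extract_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_cosine_similarity(html_a, html_b):
    """Cosine similarity of tag counts; ignores the order of the tags."""
    count_a, count_b = Counter(extract_tags(html_a)), Counter(extract_tags(html_b))
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tag_hamming_distance(html_a, html_b):
    """Position-wise mismatches between tag sequences; sensitive to order."""
    tags_a, tags_b = extract_tags(html_a), extract_tags(html_b)
    mismatches = sum(x != y for x, y in zip(tags_a, tags_b))
    return mismatches + abs(len(tags_a) - len(tags_b))


# Two hypothetical 404 pages: same tags, different order and timestamp.
page_a = "<html><body><h1>Not Found</h1><p>10:01</p></body></html>"
page_b = "<html><body><p>10:02</p><h1>Not Found</h1></body></html>"
print(tag_cosine_similarity(page_a, page_b))  # 1.0
print(tag_hamming_distance(page_a, page_b))   # 2
```

Here the two pages use the same tags in a different order, so the cosine measure treats them as identical while the position-wise comparison reports two mismatches.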
With the cosine similarity algorithm we can roughly calculate how similar two web pages are, so the judgment logic above should now be able to identify 404 pages. In practice, however, things are more complicated. One issue is how to set the similarity threshold, which requires a large amount of labeled data to calibrate. Another is how to reduce the false positives caused by certain special URLs, such as the home page and the landing page of the website: because a request for a non-existent page may be redirected to one of these pages, they can end up looking similar to the collected 404 content. The solutions to these problems are not covered here.
Based on the theory above, I have deployed an API that judges whether a page is a 404 page, so that you can test its accuracy.
API interface address: http://api.nmask.cn/not_exist_page_calculation/?target_url=http://www.baidu.com/nmask
If you find a URL that is judged incorrectly, you can leave a message below or send an email to email@example.com.