404 page recognition based on cosine similarity

Maybe the old accent is my sadness

Anyone who has written a crawler or a vulnerability scanner has probably run into the same problem: how to decide whether the page behind a URL is a 404 page. Much of the subsequent logic depends on that decision. Because of how servers are configured, however, detecting a 404 page is often not as simple as it seems. This article is a spin-off of "[Discussion on the Design and Development of Vulnerability Scanners](https://thief.one/2018/03/16/1/)" and focuses on how to determine whether a page is a 404 page.

404 page

Normally, whether a web page is a 404 page is decided mainly by the response code. A response code of 404 means the page does not exist; any other code means it exists. For the sake of user-friendliness, however, some websites customize their 404 handling, and they do so in several ways.

Jump to the specified page

The first approach: when a user requests a page that does not exist, the server redirects to a specified URL, often the website's home page or landing page. In this case the request for the non-existent page either returns 302 followed by 200 (a server-side redirect) or returns 200 directly (a client-side redirect, which the user can notice). The page content is that of the home page, the landing page, or whichever page was specified.
Example: http://didichuxing.com/nmask

Jump to the optimized 404 page

The second approach: when a user requests a page that does not exist, the server redirects to a 404 page. Unlike the first approach, the page shown after the redirect really is a 404 page, but one that has been specially customized. In this case the request for the non-existent page either returns 302 followed by 200 (a server-side redirect) or returns 200 directly (a client-side redirect), and the page content is that of the customized 404 page.
Example: https://www.jd.com/nmask

Directly displaying 404 pages

The third approach: when a user requests a page that does not exist, a 404 page is displayed directly, without any redirect (the server default). In this case the response code may be 404 (the default) or 200, and the page content is either the server's default 404 page or a customized one.
Example: http://www.alibaba.com/nmask

Summary of the characteristics of the 404 page

In summary, the response code of a 404 page may be 404, 302, or 200 (other cases exist as well). Its content may be the home page (or another designated page), a customized 404 page, or the server's default 404 page.

How to scientifically judge a 404 page?

From the above we can derive roughly the following decision logic (pseudocode):

    if response_code == 404:
        return this_is_404_page
    elif target_page_content is similar to site_404_page_content:
        return this_is_404_page
    return this_is_not_404_page
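
The pseudocode above can be turned into a small runnable sketch. The function name `is_404_page`, the pluggable `similarity` callback, and the threshold of 0.9 are all assumptions made for illustration, not values from the original scanner; `exact_match` is only a stand-in until a real similarity measure is plugged in.

```python
from typing import Callable


def is_404_page(
    status_code: int,
    page_content: str,
    known_404_content: str,
    similarity: Callable[[str, str], float],
    threshold: float = 0.9,  # hypothetical threshold, needs tuning on real data
) -> bool:
    """Return True if the response looks like a 404 page."""
    # A real 404 status code settles the question immediately.
    if status_code == 404:
        return True
    # Otherwise compare the page with a sample 404 page from the same site.
    return similarity(page_content, known_404_content) >= threshold


def exact_match(a: str, b: str) -> float:
    """Placeholder similarity: 1.0 for identical text, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


print(is_404_page(404, "anything", "Not Found", exact_match))   # True
print(is_404_page(200, "Not Found", "Not Found", exact_match))  # True
print(is_404_page(200, "Welcome", "Not Found", exact_match))    # False
```

Keeping the similarity measure as a parameter makes it easy to swap the placeholder for a fuzzier comparison later.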

To apply this logic, however, we need to solve two problems. Question 1: how to collect the website's 404 page content in advance. Question 2: how to judge whether the content of the target page is similar to the website's 404 page content.
The first question is the easier one: we can construct some paths that do not exist (such as /this_is_a_404_nmask_page) and request them to obtain the page content.
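
As a rough sketch of the probing idea, the snippet below builds a few URLs whose paths almost certainly do not exist. The function name `build_probe_urls`, the random-token scheme, and the `example.com` host are assumptions for illustration; fetching the pages with an HTTP client is left out.

```python
import uuid
from typing import List
from urllib.parse import urljoin


def build_probe_urls(base_url: str, count: int = 3) -> List[str]:
    """Build URLs whose paths almost certainly do not exist on the site."""
    probes = []
    for _ in range(count):
        # A random hex token makes a collision with a real path very unlikely.
        token = uuid.uuid4().hex
        probes.append(urljoin(base_url, f"/this_is_a_404_{token}_page"))
    return probes


for url in build_probe_urls("http://example.com/index.html"):
    print(url)
```

Requesting more than one probe gives several 404 samples, which helps when the 404 page contains random content.
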
The second question is more troublesome. Note first that we are asking whether the pages are similar, not whether they are identical. Why not simply test for equality? Because some 404 pages contain random elements, such as the current time or promotional content, so the content differs each time the 404 page is fetched. Judging whether the target page is similar to the website's 404 page, rather than identical to it, is therefore the sound way to identify a 404 page.
So how do we judge whether two web pages are similar? Here we borrow an algorithm used to judge the similarity of articles: cosine similarity. What is cosine similarity, and how can it be used to judge the similarity of web pages? Read on.

Introduction of Cosine Similarity Algorithm

Suppose we need to answer the question: are two articles similar?
Implementation plan:
(1) Use the TF-IDF algorithm to find the keywords of the two articles;
(2) Take several keywords from each article (say 20), merge them into one set, and calculate each article's word frequency for the words in the set (relative word frequency can be used to compensate for differences in article length);
(3) Generate a word frequency vector for each of the two articles;
(4) Calculate the cosine similarity of the two vectors. The larger the value, the more similar the articles.
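
The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the articles are already split into word lists, the corpus consists of just the two articles, and a smoothed IDF is used so that words shared by both articles keep a non-zero weight. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def top_keywords(doc: List[str], corpus: List[List[str]], k: int) -> List[str]:
    """Step 1: pick the k words of doc with the highest TF-IDF score."""
    scores: Dict[str, float] = {}
    for word, count in Counter(doc).items():
        df = sum(1 for d in corpus if word in d)
        # Smoothed IDF (an assumption of this sketch).
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1
        scores[word] = (count / len(doc)) * idf
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def cosine(u: List[float], v: List[float]) -> float:
    """Step 4: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def article_similarity(doc_a: List[str], doc_b: List[str], k: int = 20) -> float:
    if not doc_a or not doc_b:
        return 0.0
    corpus = [doc_a, doc_b]
    # Step 2: merge the keywords of both articles into one set.
    vocab = sorted(set(top_keywords(doc_a, corpus, k)) | set(top_keywords(doc_b, corpus, k)))
    # Step 3: relative word frequency vectors over the merged set.
    count_a, count_b = Counter(doc_a), Counter(doc_b)
    vec_a = [count_a[w] / len(doc_a) for w in vocab]
    vec_b = [count_b[w] / len(doc_b) for w in vocab]
    return cosine(vec_a, vec_b)


a = "page not found please check the address".split()
b = "page not found return to the home page".split()
print(article_similarity(a, b))
```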

Specific examples:
Sentence A: I / like / watch / TV, no / like / watch / movie.
Sentence B: I / no / like / watch / TV, also / no / like / watch / movie.

All the words are: I, like, watch, TV, movie, no, also.

Calculate word frequency: (number of occurrences)
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, no 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, no 2, also 1.

Calculate the word frequency vector:
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]

We can think of the two vectors as two line segments in space, both starting at the origin. Their similarity can be judged by the angle between them: the smaller the angle, the more similar the vectors.

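Calculation formula:

$$
\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Calculation results:

$$
\cos\theta = \frac{1\cdot1+2\cdot2+2\cdot2+1\cdot1+1\cdot1+1\cdot2+0\cdot1}{\sqrt{1^2+2^2+2^2+1^2+1^2+1^2+0^2}\,\sqrt{1^2+2^2+2^2+1^2+1^2+2^2+1^2}} = \frac{13}{\sqrt{12}\cdot 4} \approx 0.938
$$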

Note: the closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are. This is called "cosine similarity".
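
For reference, here is a minimal sketch that computes the cosine similarity of the two word frequency vectors for sentences A and B. The function name `cosine_similarity` is chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sentence_a = [1, 2, 2, 1, 1, 1, 0]
sentence_b = [1, 2, 2, 1, 1, 2, 1]
print(round(cosine_similarity(sentence_a, sentence_b), 3))  # 0.938
```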

Web page similarity judgment method based on cosine similarity algorithm

Both the cosine similarity algorithm and the Hamming distance algorithm are illustrated below. In testing, cosine similarity proved more accurate for judging the similarity of web pages.

a) Web page tag similarity (extract all the tags of the web page and keep only the tag names)
First compute the tag vectors of the two web pages, A and B.
1) Calculate the Hamming distance between A and B. Comparing the two tag sequences position by position (1 marks a mismatch) gives:
0 1 1 0 0 0 0 0 1 1
The Hamming distance between A and B is 1+1+1+1=4, so the similarity is (10-4)/10=60%.
2) Calculate the cosine similarity between A and B. Count how often each tag occurs:
A: a 2 b 2 c 2 d 1 e 1 f 1 g 1
B: a 2 b 1 c 2 d 1 e 1 f 1 g 1
Written as vectors:
A: [2,2,2,1,1,1,1]
B: [2,1,2,1,1,1,1]
Cosine similarity:

$$
\cos\theta = \frac{2\cdot2+2\cdot1+2\cdot2+1\cdot1+1\cdot1+1\cdot1+1\cdot1}{\sqrt{2^2+2^2+2^2+1^2+1^2+1^2+1^2}\,\sqrt{2^2+1^2+2^2+1^2+1^2+1^2+1^2}} = \frac{14}{4\sqrt{13}} \approx 0.97
$$

That is, the similarity is 97%.
b) Web page text similarity
The procedure is the same as for tags, except that the text of the web page is extracted and split into words (word segmentation) before the vectors are built.
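
A minimal sketch of the tag-based comparison, using only the Python standard library, might look like this. The class and function names (`TagCollector`, `tag_counts`, `counter_cosine`) are hypothetical; the tag counts for A and B are the ones from the example above.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of all opening tags, ignoring attributes and text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_counts(html: str) -> Counter:
    """Count how often each tag name occurs in an HTML document."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return Counter(collector.tags)


def counter_cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity of two count vectors stored as Counters."""
    dot = sum(c1[t] * c2[t] for t in set(c1) | set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values()) * sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


# Tag counts of pages A and B from the example above.
page_a = Counter({"a": 2, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1, "g": 1})
page_b = Counter({"a": 2, "b": 1, "c": 2, "d": 1, "e": 1, "f": 1, "g": 1})
print(round(counter_cosine(page_a, page_b), 2))  # 0.97

# Counting tags in real markup.
print(tag_counts("<div><p>Not found</p><p>12:01</p></div>"))
```

The same `counter_cosine` function could be applied to word counts instead of tag counts for the text-based variant, once the page text has been segmented into words.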

Note: Hamming distance is sensitive to order, for example whether the tags of two web pages appear in a similar sequence. Cosine similarity looks at the overall counts of the tags and is insensitive to order. The Hamming distance can be seen as the distance between two points, and the cosine similarity as the angle between two lines.
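
The difference in order sensitivity can be seen in a small sketch. The two tag sequences below are made up for illustration: they contain the same tags, with two of them swapped.

```python
import math
from collections import Counter


def hamming_similarity(seq_a, seq_b):
    """Share of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance needs sequences of equal length")
    distance = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return (len(seq_a) - distance) / len(seq_a)


def count_cosine(seq_a, seq_b):
    """Cosine similarity of the item counts, ignoring item order."""
    c1, c2 = Counter(seq_a), Counter(seq_b)
    dot = sum(c1[t] * c2[t] for t in set(c1) | set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values()) * sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


tags_a = ["html", "body", "div", "p", "a"]
tags_b = ["html", "body", "p", "div", "a"]  # same tags, two of them swapped

print(hamming_similarity(tags_a, tags_b))  # 0.6
print(count_cosine(tags_a, tags_b))        # 1.0
```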

More rigorous scientific judgment

With the cosine similarity algorithm we can roughly calculate how similar two web pages are, so the decision logic above should in principle be able to identify 404 pages. The actual situation is more complicated, however. One issue is how to set the similarity threshold, which requires a large amount of labeled data. Another is how to reduce the false positives caused by certain special URLs, such as the website's home page or landing page: because requests for non-existent pages may be redirected to these pages, the collected 404 sample looks just like them, and the real home page or landing page is then wrongly judged to be a 404 page. The solutions to these problems are not covered here.

A test API for 404 page detection

Based on the theory above, I have deployed an API that judges whether a page is a 404 page, so that you can test its accuracy.
API address: http://api.nmask.cn/not_exist_page_calculation/?target_url=http://www.baidu.com/nmask
If you find a URL that is judged incorrectly, you can leave a comment below or send an email to tzc@maskghost.com.

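Published: April 12, 2018, 16:04

Last updated: August 16, 2019, 15:08

License: Attribution-NonCommercial-NoDerivatives 4.0 International. When reposting, please keep the link to the original article and the author's name.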
