Exploring technology means constantly making assumptions and then overturning them.
Recently, while writing a crawler with phantomjs together with my colleagues, I ran into a number of interesting pitfalls. After analyzing them we reached some conclusions and found some workarounds, which I share here.
The problem arose because we wanted phantomjs to visit a batch of websites and collect each page's source code and url. When we looked at the output, we found that the requested url did not always match the url reported after the visit. For example, I used phantomjs to access baidu, yet the result showed the current url as bing. This led to a series of conjectures. Because there are relatively few resources about this online, we could only guess and test for ourselves.
I will provisionally describe this mismatch as the phantomjs state being polluted, or overwritten. Put simply: we first visit website a and get its results, then visit website b and expect the results of b, but what comes back for b is actually the data from a. Our first thought was that when phantomjs processed website b its state was not updated, so the result reported for b was still that of a.
So what causes the phantomjs state not to be updated?
My colleague's blog describes two causes in detail; see https://eth.space/phantomjs-debug/. I will not repeat them here.
As a supplement, here are the test cases we used, for reference.
First, we use phantomjs to open cn.bing.com and then 123.114.com; note that 123.114.com is not reachable.
Result:
As the output shows, when we read the information for 123.114.com, we got cn.bing.com back.
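The first case can be reproduced with something like the following. This is only a minimal sketch, not the original test code: the helper name `visit_in_sequence` is made up for illustration, and it assumes a Selenium-style driver object exposing `get()` and `current_url`.

```python
def visit_in_sequence(driver, urls):
    """Visit each URL with the same driver and record what the driver reports.

    `driver` is assumed to be a Selenium-style WebDriver (get(), current_url).
    Returns a list of (requested_url, reported_current_url) pairs.
    """
    results = []
    for url in urls:
        try:
            driver.get(url)
        except Exception as exc:  # e.g. a timeout on an unreachable host
            print("get(%s) raised %r" % (url, exc))
        # current_url is read after every get(), even one that failed
        results.append((url, driver.current_url))
    return results


# Usage sketch (assumes an older Selenium release that still ships PhantomJS):
#   from selenium import webdriver
#   d = webdriver.PhantomJS()
#   pairs = visit_in_sequence(d, ["http://cn.bing.com", "http://123.114.com"])
#   for requested, reported in pairs:
#       print(requested, "->", reported)
#   d.quit()
```

If the state is polluted, the second pair shows a reported url that still belongs to the first site.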
Second, we visit a page whose source contains an onbeforeunload handler.
Result:
Both of the cases above lead to phantomjs state pollution; whether other conditions trigger it as well remains to be observed and tested.
One fix is to call d.quit() after every d.get(), which closes the phantomjs process, and to start a new process for the next request. (This is very resource intensive.)
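A minimal sketch of this quit-after-every-get approach might look as follows. The helper `fetch_with_fresh_driver` and the `make_driver` factory argument are hypothetical names; the factory is assumed to return a Selenium-style driver.

```python
def fetch_with_fresh_driver(make_driver, url):
    """Open a brand-new driver for one URL and always quit it afterwards.

    `make_driver` is a zero-argument callable returning a Selenium-style
    driver (get(), current_url, page_source, quit()).
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.current_url, driver.page_source
    finally:
        # quit() runs even if get() raised, so no process is left behind
        # and no state can carry over to the next request
        driver.quit()


# Usage sketch (assumes an older Selenium release that still ships PhantomJS):
#   from selenium import webdriver
#   url, source = fetch_with_fresh_driver(webdriver.PhantomJS, "http://cn.bing.com")
```

Passing a factory instead of a driver keeps the start-and-quit cycle in one place, which makes the per-request cost explicit.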
Another is to check, before every get, whether the url can be resolved via DNS and whether it can be opened. (This also costs some resources.)
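The DNS half of this check could be sketched with the standard library as below. The function name `can_resolve` and its `resolver` parameter are illustrative assumptions; checking whether the page actually opens would need a separate HTTP request, which is not shown.

```python
import socket
from urllib.parse import urlparse


def can_resolve(url, resolver=socket.gethostbyname):
    """Return True if the host part of `url` resolves via DNS.

    `resolver` defaults to socket.gethostbyname and can be swapped out
    for testing. URLs without a scheme (e.g. "123.114.com") have no
    parsable host here and are reported as unresolvable.
    """
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        resolver(host)
    except (OSError, UnicodeError):  # socket.gaierror is a subclass of OSError
        return False
    return True


# Usage sketch: only hand resolvable urls to the driver
#   if can_resolve(url):
#       d.get(url)
```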
A third is to save the value of current_url after each get and compare it with the value after the next request; if the two are the same, the state has not changed.
(Some special cases are exceptions, of course, such as requesting the same website on every get, or a batch that contains the same address more than once.)
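The comparison could be sketched like this. The helper `get_checked` is a made-up name, and the driver is again assumed to be Selenium-style; the flag it returns is only a hint, because of the special cases just mentioned.

```python
def get_checked(driver, url, previous_url):
    """Load `url` and flag the result if current_url did not change.

    Returns (current_url, suspicious). `suspicious` is True when the url
    reported after this get equals the one saved after the previous get,
    which suggests the state was not updated. Visiting the same address
    twice in a row also triggers the flag, so treat it as a hint only.
    """
    driver.get(url)
    current = driver.current_url
    suspicious = previous_url is not None and current == previous_url
    return current, suspicious


# Usage sketch:
#   previous = None
#   for url in urls:
#       previous, suspicious = get_checked(d, url, previous)
#       if suspicious:
#           print("state may be stale for", url)
```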
Finally, after each get of a target url, issue a get("about:blank") before the next one to reset the state.
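A minimal sketch of the about:blank reset, assuming a Selenium-style driver; the helper name `get_then_reset` is hypothetical.

```python
def get_then_reset(driver, url):
    """Load `url`, capture the result, then navigate to about:blank.

    The reset sits in `finally`, so it also runs when get() raises.
    If the next get fails silently, the driver then reports about:blank
    instead of the previous site.
    """
    try:
        driver.get(url)
        return driver.current_url, driver.page_source
    finally:
        driver.get("about:blank")


# Usage sketch:
#   for url in urls:
#       current, source = get_then_reset(d, url)
#       if current == "about:blank":
#           print("nothing was loaded for", url)
```

Compared with quitting the process after every request, this keeps a single phantomjs instance alive, at the cost of one extra navigation per url.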
[[phantomjs series] phantomjs correctly opened](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)
[[phantomjs series] phantomjs api introduction](http://thief.one/2017/03/13/Phantomjs-Api%E4%BB%8B%E7%BB%8D/)
[[phantomjs series] those pits that selenium+phantomjs climbed](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/)
[[phantomjs series] selenium+phantomjs performance optimization](http://thief.one/2017/03/01/Phantomjs%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/)