[phantomjs series] Pitfalls we hit with Selenium + PhantomJS

Exploring technology means constantly making assumptions and then overturning them.

Recently, while writing crawlers with my colleagues using phantomjs, I ran into a lot of interesting pitfalls. After analyzing them, we reached some conclusions and solutions, which I share here.

The problem arose because we wanted to use phantomjs to visit a batch of websites and collect each page's source code and url. When we looked at the output, we found that the requested url did not always match the url reported after the visit. For example, I used phantomjs to visit baidu, yet the result reported the current url as bing. This led to a series of conjectures. Because there are relatively few resources on this online, we could only guess and test them ourselves.
I provisionally describe this mismatch as the phantomjs state being polluted or overwritten. Put simply: we first visit website a and get its result, then visit website b and read the result for b, but the result we get for b is actually that of a. Our first thought was that when phantomjs went on to process website b, its state was not updated, so the result obtained for b was still a's.
So what causes the phantomjs state not to be updated?
My colleague's blog describes two causes in detail; see https://eth.space/phantomjs-debug/. They will not be repeated here.

As a supplement, I am posting the test code here for reference.

phantomjs state pollution test

Test code

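from selenium import webdriver

# Raw string so the backslashes in the Windows path are kept literally
d = webdriver.PhantomJS(r"D:\python27\Scripts\phantomjs.exe",
                        service_args=['--load-images=no', '--disk-cache=yes'])
d.implicitly_wait(10)        # implicit wait for element lookups, in seconds
d.set_page_load_timeout(10)  # page load timeout, in seconds

def gethttp(url):
    try:
        d.get(url)
    except Exception as e:
        print(e)
    # Printed even when get() failed, so a stale url shows up in the output
    print(d.current_url)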

Case 1: we first visit cn.bing.com with phantomjs, then visit 123.114.com. Note that 123.114.com is not reachable.

gethttp("http://cn.bing.com")   # site opens normally
gethttp("http://123.114.com")   # DNS resolution fails, the site cannot be opened

Result:

http://cn.bing.com/
http://cn.bing.com/

As we can see, when we requested 123.114.com, the url reported was still cn.bing.com.

Case 2: we first visit a page whose source contains onbeforeunload, then visit a normal site.

gethttp("http://www.zzxzxyey.com")   # page source contains onbeforeunload
gethttp("http://cn.bing.com")        # site opens normally

Result:

http://www.zzxzxyey.com/
http://www.zzxzxyey.com/

Both of the cases above lead to phantomjs state pollution. Whether other conditions trigger it as well is still to be observed and tested.

Solutions

Thorough method

After each d.get() request, call d.quit() to close the phantomjs process, and start a new one for the next request. (Very resource intensive.)
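A minimal sketch of this approach, assuming the driver is created through a factory function. The names `fetch_with_fresh_driver` and `make_driver` are hypothetical and not from the original test code:

```python
def fetch_with_fresh_driver(url, make_driver):
    """Visit url in a brand-new browser and always quit it afterwards.

    make_driver is a hypothetical zero-argument factory that returns a
    Selenium-style driver (anything with get(), current_url and quit()).
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.current_url
    except Exception as e:
        print(e)
        return None  # no state is carried over, so report a clean failure
    finally:
        driver.quit()  # close the phantomjs process after every request
```

With the setup from the test code, `make_driver` could be something like `lambda: webdriver.PhantomJS(r"D:\python27\Scripts\phantomjs.exe")`. Because every request pays the cost of starting a process, this is only reasonable for small batches.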

Before each get, check whether the url can be resolved by DNS and whether it can be opened. (Also somewhat resource intensive.)
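The DNS half of this check could look roughly like the sketch below. It uses only the standard library; the function name `can_resolve` and its `resolve` parameter are assumptions made for illustration, and it does not check whether the site actually answers HTTP requests:

```python
import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def can_resolve(url, resolve=socket.gethostbyname):
    """Return True if the host part of url resolves to an address."""
    host = urlparse(url).hostname
    if not host:
        return False  # no host part, e.g. a malformed url
    try:
        resolve(host)
        return True
    except (socket.error, UnicodeError):
        # socket.gaierror (resolution failure) is a subclass of socket.error
        return False
```

A crawler could then skip the request entirely, for example `if can_resolve(url): gethttp(url)`, so that a DNS failure never leaves stale state behind in phantomjs.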

Elegant Method

After each get, save the value of current_url and compare it with the value after the next request. If they are the same, the state has not changed.
(Except, of course, in some special cases, such as requesting the same website on every get, or the same address appearing more than once in the batch.)
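As a rough sketch of this comparison, the loop below remembers the previous current_url and marks a result as stale when the value does not change. The function name `crawl_with_stale_check` is hypothetical, and `driver` is assumed to be a Selenium-style object such as the `d` from the test code:

```python
def crawl_with_stale_check(driver, urls):
    """Return a list of (requested_url, current_url or None if stale)."""
    results = []
    last_url = None
    for url in urls:
        try:
            driver.get(url)
        except Exception as e:
            print(e)
        current = driver.current_url
        # Same value as after the previous request: the state was not updated
        stale = (current == last_url)
        results.append((url, None if stale else current))
        last_url = current
    return results
```

As the special cases above suggest, a batch that legitimately requests the same address twice in a row would be reported as stale by this sketch, so such batches need separate handling.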

Ultimate method

After each get of a target url, follow it with a get("about:blank") to reset the state.
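One way this reset might be wrapped up is sketched below. The helper name `get_then_reset` is an assumption, and `driver` is again any Selenium-style object such as `d`:

```python
def get_then_reset(driver, url):
    """Visit url, read current_url, then load about:blank to reset state."""
    try:
        driver.get(url)
    except Exception as e:
        print(e)
    current = driver.current_url
    try:
        # Reset so a later failed request cannot report this page's url
        driver.get("about:blank")
    except Exception as e:
        print(e)
    return current
```

With this in place, a request that fails after a successful one would be expected to report `about:blank` rather than the previous site's url, which is easy to filter out of the results.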

Related links

[[phantomjs series] the correct way to use phantomjs](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)
[[phantomjs series] phantomjs api introduction](http://thief.one/2017/03/13/Phantomjs-Api%E4%BB%8B%E7%BB%8D/)
[[phantomjs series] pitfalls we hit with selenium+phantomjs](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/)
[[phantomjs series] selenium+phantomjs performance optimization](http://thief.one/2017/03/01/Phantomjs%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/)

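Title: [phantomjs series] Pitfalls we hit with Selenium + PhantomJS

Author: nmask

Published: March 1, 2017, 16:03

Last updated: July 11, 2019, 18:07

Original link: https://thief.one/2017/03/01/Those pits that Phantomjs climbed/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please retain the original link and author when reposting.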
