[phantomjs series] Selenium+PhantomJS performance optimization

The road of life requires persistence, and so does the way of technology.

Anyone who has written a crawler has probably used a headless browser: PhantomJS. The reasons for using it are simple: it closely simulates real browser access (useful against anti-crawling measures) and runs without a display (which saves resources). Its most common use is still executing JavaScript: write a JS script, run it with PhantomJS, and you can take page screenshots, do web performance testing, and so on.

PhantomJS is something of a secret weapon in the crawler world. I originally used it to crawl dynamically loaded web pages, and it works well. Of course, PhantomJS is not perfect: although as a headless browser it is much faster than tools built on full browser kernels, it is still far slower than an ordinary crawler.
There are plenty of installation guides for PhantomJS online, so I won't repeat them here. The focus of this article is PhantomJS performance optimization. Since Python is the language I know best, I will use it to walk through the optimization techniques.

Using PhantomJS from Python requires the Selenium module. Selenium is itself a web automation testing framework; it wraps PhantomJS, so we can drive PhantomJS through it. Installation is likewise not covered here. PhantomJS accepts parameters at startup, so let's see how performance can be tuned by setting them.

Code Testing

Default configuration:

from selenium import webdriver
d=webdriver.PhantomJS("D:\python27\Scripts\phantomjs.exe",service_args=[])
d.get("http://thief.one")
d.quit()

Test result: 3.2s

Changed configuration:

from selenium import webdriver
service_args=[]
service_args.append('--load-images=no')          ## disable image loading
service_args.append('--disk-cache=yes')          ## enable disk caching
service_args.append('--ignore-ssl-errors=true')  ## ignore SSL errors
d=webdriver.PhantomJS("D:\python27\Scripts\phantomjs.exe",service_args=service_args)
d.get("http://thief.one")
d.quit()

Test result: 2.9s

Note: for a single site, setting these parameters sensibly saved about 0.3s (the more images and other resources a site loads, the bigger the gain).
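The timings above can be reproduced with a small stopwatch helper. This is just a sketch: `fetch` is a hypothetical stand-in for the Selenium calls (it merely sleeps), not real PhantomJS.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args)
    return result, time.time() - start

def fetch(url):
    # stand-in for d.get(url); simulates a 10ms page load
    time.sleep(0.01)
    return url

result, elapsed = timed(fetch, "http://thief.one")
print("%s loaded in %.3fs" % (result, elapsed))
```

Wrapping each configuration's `d.get(...)` call this way gives comparable numbers run to run.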

Setting timeout

When a crawler visits a batch of websites, slow-loading sites block for a long time, and unreachable sites block forever, which seriously hurts crawler throughput. Ordinary crawler modules such as requests and urllib let you set a timeout, and PhantomJS can be configured the same way.

from selenium import webdriver
service_args=[]
service_args.append('--load-images=no')
service_args.append('--disk-cache=yes')
service_args.append('--ignore-ssl-errors=true')
d=webdriver.PhantomJS("D:\python27\Scripts\phantomjs.exe",service_args=service_args)
d.implicitly_wait(10)        ## set implicit wait
d.set_page_load_timeout(10)  ## set page load timeout
d.get("http://www.baidu.com")
d.quit()

Description: if PhantomJS takes more than 10s to load a page, an exception is raised. Even after the exception, current_url still returns the current URL and the page source can still be read, it just won't be fully loaded. This applies only to slow sites; completely unreachable sites are another matter.
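That behaviour can be wrapped in a try/except. The sketch below uses a made-up `DummyDriver` that mimics a driver whose `get` raises on timeout while its state stays readable; with real Selenium you would catch `selenium.common.exceptions.TimeoutException` instead.

```python
class PageLoadTimeout(Exception):
    """Stands in for selenium.common.exceptions.TimeoutException."""
    pass

class DummyDriver:
    """Mimics a driver whose page load times out mid-navigation."""
    def __init__(self):
        self.current_url = "about:blank"

    def get(self, url):
        self.current_url = url       # navigation has started...
        raise PageLoadTimeout(url)   # ...but loading never finishes

d = DummyDriver()
try:
    d.get("http://thief.one")
except PageLoadTimeout:
    pass  # swallow the timeout and salvage whatever loaded
print(d.current_url)  # the url is still readable after the exception
```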

Intermediate (open and close sensibly)

While using PhantomJS, through repeated debugging I found that its main cost lies in starting the PhantomJS process. Using PhantomJS from Python amounts to launching phantomjs.exe (on Windows) and driving it, so frequently starting and stopping the process wastes a lot of time. The open/close operations therefore need to be managed sensibly.

Code Testing

Single-threaded, visiting Baidu 10 times:

Before optimization:

from selenium import webdriver
def phantomjs_req(url):
    service_args=[]
    service_args.append('--load-images=no')
    service_args.append('--disk-cache=yes')
    service_args.append('--ignore-ssl-errors=true')
    d=webdriver.PhantomJS("D:\python27\Scripts\phantomjs.exe",service_args=service_args)
    d.get(url)
    print d.current_url
    d.quit()
url_list=["http://www.baidu.com"]*10
for i in url_list:
    phantomjs_req(i)

Test result: 28.2s; while running, PhantomJS processes were constantly starting and stopping.

Optimized:

from selenium import webdriver
def phantomjs_req(url):
    d.get(url)
    print d.current_url
service_args=[]
service_args.append('--load-images=no')
service_args.append('--disk-cache=yes')
service_args.append('--ignore-ssl-errors=true')
d=webdriver.PhantomJS("D:\python27\Scripts\phantomjs.exe",service_args=service_args)
url_list=["http://www.baidu.com"]*10
for i in url_list:
    phantomjs_req(i)
d.quit()

Test result: 4.2s

Description: the only difference between the two versions is that the open and close of PhantomJS are moved outside the loop, so the process starts and stops exactly once. The performance gap is huge, which confirms how expensive starting and stopping PhantomJS is.
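The effect can be demonstrated without PhantomJS at all. In this sketch, `FakeBrowser` is a made-up class whose constructor sleeps to simulate the cost of launching phantomjs.exe; reusing one instance amortizes that cost exactly as the optimized code does.

```python
import time

STARTUP_COST = 0.02  # pretend launching the browser process takes 20ms

class FakeBrowser:
    def __init__(self):
        time.sleep(STARTUP_COST)  # simulated process launch
    def get(self, url):
        pass  # page loads themselves are cheap in this model
    def quit(self):
        pass

urls = ["http://www.baidu.com"] * 10

start = time.time()
for url in urls:             # before: launch and quit per request
    b = FakeBrowser()
    b.get(url)
    b.quit()
per_request = time.time() - start

start = time.time()
b = FakeBrowser()            # after: launch once, reuse for every request
for url in urls:
    b.get(url)
b.quit()
reused = time.time() - start

print("launch per request: %.2fs, launch once: %.2fs" % (per_request, reused))
```

The per-request version pays the startup cost ten times; the reused version pays it once, mirroring the 28.2s vs 4.2s measurements above.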

Note: although this approach saves a great deal of time, it triggers another PhantomJS quirk (call it a bug for now): state carry-over between requests. When you visit a batch of websites this way, you will find that some returned results do not correspond to the requested site. For details, see [The pits selenium+phantomjs stepped in](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/).

Advanced (PhantomJS concurrency)

The previous optimizations improved PhantomJS performance considerably, but the code above is only optimized for a single thread. For large batches of websites, concurrency becomes a necessity, so how do we use and optimize PhantomJS concurrently?

The road of optimization

Optimizing PhantomJS for concurrency was not smooth sailing for me; along the way I consulted a lot of material and stepped in plenty of pits.

Immature Optimization (1)

At first I tried the most straightforward approach: run one PhantomJS process and drive it from multiple threads.

import threading
from selenium import webdriver
d=webdriver.PhantomJS()
def test(url):
    d.get(url)
url_list=["http://www.baidu.com"]*10
for url in url_list:
    threading.Thread(target=test,args=(url,)).start()
d.quit()

Running this produced connection errors. After checking the official site and other sources, I learned that PhantomJS is single-threaded, so multiple threads cannot drive one instance at the same time. This optimization failed!

Immature Optimization (2)

Since one PhantomJS instance only supports a single thread, the next idea was to open multiple PhantomJS instances.

import threading
from selenium import webdriver
def test(url):
    d=webdriver.PhantomJS()
    d.get(url)
    d.quit()
url_list=["http://www.baidu.com"]*10
for url in url_list:
    threading.Thread(target=test,args=(url,)).start()

This time 10 PhantomJS processes started simultaneously and 10 requests ran concurrently. But what happens with 50 websites: do we really run 50 PhantomJS processes at once? No, that would overwhelm the server. This optimization failed too!

Immature Optimization (3)

After those two failures, I started thinking about how to open only 10 PhantomJS processes and have each one work through requests in turn, giving 10-way concurrency.
One night I finally came up with the following code:

import threading
from selenium import webdriver
def test():
    d=webdriver.PhantomJS()
    for i in url_list:
        d.get(i)
    d.quit()
url_list=["http://www.baidu.com"]*50
for i in range(10):
    threading.Thread(target=test).start()

This successfully opened 10 PhantomJS processes, each executing website requests in sequence. But wait: with this design, every PhantomJS process visits Baidu all 50 times, which is not the original intent at all. Oh no!

Not mature but still optimizable

Stage three produced the skeleton of the concurrent design; what remained was sharing the work across threads as a common resource, which the Queue module solves. Here is the optimized concurrent code:

__author__="nMask"
__Date__="20170224"
__Blog__="http://thief.one"
import Queue
from selenium import webdriver
import threading
import time
class conphantomjs:
    phantomjs_max=1   ## number of phantomjs processes to open at the same time
    jiange=0.00001    ## interval between opening phantomjs processes
    timeout=20        ## phantomjs timeout
    path="D:\python27\Scripts\phantomjs.exe"   ## phantomjs path
    service_args=['--load-images=no','--disk-cache=yes']   ## parameter settings
    def __init__(self):
        self.q_phantomjs=Queue.Queue()   ## queue storing the phantomjs processes
    def getbody(self,url):
        '''
        Use phantomjs to fetch the page source and url
        '''
        d=self.q_phantomjs.get()
        try:
            d.get(url)
        except:
            print "Phantomjs Open url Error"
        url=d.current_url
        self.q_phantomjs.put(d)
        print url
    def open_phantomjs(self):
        '''
        Open the phantomjs processes with multiple threads
        '''
        def open_threading():
            d=webdriver.PhantomJS(conphantomjs.path,service_args=conphantomjs.service_args)
            d.implicitly_wait(conphantomjs.timeout)        ## set implicit wait
            d.set_page_load_timeout(conphantomjs.timeout)  ## set page load timeout
            self.q_phantomjs.put(d)   ## save the phantomjs process in the queue
        th=[]
        for i in range(conphantomjs.phantomjs_max):
            t=threading.Thread(target=open_threading)
            th.append(t)
        for i in th:
            i.start()
            time.sleep(conphantomjs.jiange)   ## interval between opens
        for i in th:
            i.join()
    def close_phantomjs(self):
        '''
        Close the phantomjs processes with multiple threads
        '''
        th=[]
        def close_threading():
            d=self.q_phantomjs.get()
            d.quit()
        for i in range(self.q_phantomjs.qsize()):
            t=threading.Thread(target=close_threading)
            th.append(t)
        for i in th:
            i.start()
        for i in th:
            i.join()
if __name__=="__main__":
    '''
    usage:
    1. Instantiate the class
    2. Run open_phantomjs to open the phantomjs processes
    3. Run getbody, passing in each url
    4. Run close_phantomjs to close the phantomjs processes
    '''
    cur=conphantomjs()
    conphantomjs.phantomjs_max=10
    cur.open_phantomjs()
    print "phantomjs num is ",cur.q_phantomjs.qsize()
    url_list=["http://www.baidu.com"]*50
    th=[]
    for i in url_list:
        t=threading.Thread(target=cur.getbody,args=(i,))
        th.append(t)
    for i in th:
        i.start()
    for i in th:
        i.join()
    cur.close_phantomjs()
    print "phantomjs num is ",cur.q_phantomjs.qsize()

Code test:

Single-threaded optimized code, visiting Baidu 50 times: 10.3s.
10 concurrent PhantomJS processes, visiting Baidu 50 times: 8.1s.

Description: the concurrent version also opens 10 PhantomJS processes to handle the 50 Baidu requests. Since one PhantomJS instance cannot handle two URLs at once (it does not support multi-threading), the number of PhantomJS processes opened equals the program's concurrency level. Excluding the time spent opening the 10 processes, the 50 visits themselves take about 2s, which is much faster.
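The heart of this design — borrow a driver from the queue, use it, put it back — can be exercised without Selenium. Below, `FakeDriver` is a hypothetical stand-in for webdriver.PhantomJS, and the import fallback lets the sketch run under Python 3 as well as the Python 2 used in the article.

```python
import threading
try:
    import queue               # Python 3
except ImportError:
    import Queue as queue      # Python 2, as in the article

POOL_SIZE = 3

class FakeDriver:
    def get(self, url):
        return url  # stand-in for a real page load

pool = queue.Queue()
for _ in range(POOL_SIZE):
    pool.put(FakeDriver())

results = []
results_lock = threading.Lock()

def getbody(url):
    d = pool.get()       # borrow a driver; blocks while all are busy
    try:
        page = d.get(url)
    finally:
        pool.put(d)      # always return it, even if the load failed
    with results_lock:
        results.append(page)

threads = [threading.Thread(target=getbody, args=("http://www.baidu.com",))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), pool.qsize())  # all 10 urls handled, all 3 drivers returned
```

The blocking `pool.get()` is what caps concurrency at the pool size; note the try/finally here is a small hardening over the article's version, which would lose a driver if `d.get` raised before the `put`.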

Final stage

The advanced stage solved concurrency by, in effect, using multiple processes: no matter how many Python threads drive PhantomJS at once, one PhantomJS process can only execute one access request at a time. Concurrency therefore depends on the number of PhantomJS processes opened, and PhantomJS runs as a process.
Now that the performance bottleneck is clear, the final step is to combine distribution with multi-process PhantomJS concurrency to push performance further.

Alternative plan

The optimizations above do not fundamentally solve PhantomJS's performance problem. For a better alternative, see:
[The correct way to open PhantomJS](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)

Portal

[[phantomjs series] The correct way to open PhantomJS](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)
[[phantomjs series] PhantomJS API introduction](http://thief.one/2017/03/13/Phantomjs-Api%E4%BB%8B%E7%BB%8D/)
[[phantomjs series] The pits selenium+phantomjs stepped in](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/)
[[phantomjs series] selenium+phantomjs performance optimization](http://thief.one/2017/03/01/Phantomjs%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/)

Post title: [phantomjs series] Selenium+PhantomJS performance optimization

Author: nmask

Published: March 1, 2017, 14:03

Last updated: August 16, 2019, 15:08

Original link: https://thief.one/2017/03/01/Phantomjs%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.
