The road of life takes persistence, and so does the road of technology.
Anyone who has written a crawler has probably used a headless browser such as PhantomJS. The reasons are simple: it closely simulates real browser access (which helps against anti-crawling measures), and it renders pages without a display (which saves resources). The most common use of PhantomJS is still executing JavaScript: write a JS script, run it with PhantomJS, and you can take page screenshots, run web performance tests, and so on.
PhantomJS is also a killer tool in the crawler world. I originally used it to crawl dynamically loaded pages, and it works well. Of course, PhantomJS is not perfect: although as a headless browser it is much faster than tools built on a full browser kernel, it is still far slower than an ordinary HTTP crawler.
There are plenty of PhantomJS installation guides online, so I will not repeat them here. The focus of this article is performance optimization. Since Python is the language I am most familiar with, I will use it to walk through ways of optimizing PhantomJS.
Using PhantomJS from Python requires the Selenium module. Selenium is itself a web automation testing framework that wraps PhantomJS, so we can drive PhantomJS through it; the installation steps are likewise not covered here. PhantomJS accepts parameters at startup, so let's first see how setting those parameters can improve performance.
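The original code for this test is not preserved in this copy. Below is a minimal sketch of how startup flags are passed to PhantomJS through Selenium's `service_args`; the flag names are standard PhantomJS command-line options, though not necessarily the exact set the author used:

```python
def tuned_service_args():
    """PhantomJS startup flags that typically speed up page loads."""
    return [
        "--load-images=false",       # don't download images
        "--disk-cache=true",         # cache static resources across loads
        "--ignore-ssl-errors=true",  # don't stall on bad certificates
    ]

if __name__ == "__main__":
    import time
    from selenium import webdriver  # Selenium 2/3 expose webdriver.PhantomJS

    driver = webdriver.PhantomJS(service_args=tuned_service_args())
    start = time.time()
    driver.get("http://www.baidu.com")
    print("loaded in %.1fs" % (time.time() - start))
    driver.quit()
```

Skipping image downloads usually gives the biggest win on image-heavy pages, which matches the note below about resource-rich sites.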
Test result (default parameters): 3.2s
Test result (optimized parameters): 2.9s
Note: for a single site, setting the parameters sensibly saves about 0.3s (the gain is larger for sites with many images and other resources).
When a crawler visits a batch of websites, slow-loading sites tend to block it for a long time, and unreachable sites block it forever, which seriously hurts the crawler's performance. Ordinary crawler libraries such as requests and urllib let you set a timeout; PhantomJS can be configured the same way.
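A sketch of the timeout setting, using Selenium's standard `set_page_load_timeout` API (the 10s value matches the note below; the helper name is my own, and the import fallback just lets the sketch run without Selenium installed):

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:                     # lets the sketch run without selenium
    class TimeoutException(Exception):  # stand-in with the same name
        pass

def fetch_with_timeout(driver, url, seconds=10):
    """Give up on a page load after `seconds`; returns True on a full load.

    After a timeout, driver.current_url and driver.page_source are still
    usable, but the source may be incomplete.
    """
    driver.set_page_load_timeout(seconds)
    try:
        driver.get(url)
        return True
    except TimeoutException:
        return False
```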
Explanation: if PhantomJS takes more than 10s to load a page, an exception is raised. Even after the exception, current_url still returns the current URL and the page source can still be read, but the source may not be fully loaded. This applies only to slow sites; completely unreachable sites are another matter.
During my time with PhantomJS, through constant debugging, I found that its main performance cost lies in starting the PhantomJS process. Using PhantomJS from Python amounts to launching phantomjs.exe (on Windows) and driving it, so frequently starting and stopping the process wastes a great deal of time. The open and close operations therefore need to be placed sensibly.
Single-threaded, visiting Baidu 10 times:
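The code for this test is not preserved here; reconstructed from the description, the slow version opens and closes a PhantomJS process on every iteration. The `make_driver` factory parameter is my addition (it defaults to a real PhantomJS):

```python
def crawl_reopening(urls, make_driver=None):
    """Visit each url with a FRESH PhantomJS process - the slow pattern."""
    if make_driver is None:
        from selenium import webdriver
        make_driver = webdriver.PhantomJS
    for url in urls:
        driver = make_driver()   # start a new phantomjs process...
        driver.get(url)
        driver.quit()            # ...and kill it again, every iteration

if __name__ == "__main__":
    import time
    start = time.time()
    crawl_reopening(["http://www.baidu.com"] * 10)
    print("%.1fs" % (time.time() - start))
```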
Test result: 28.2s; while it runs, the PhantomJS process is repeatedly started and killed.
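The faster variant, reconstructed from the description, moves the driver's creation outside the loop so the process starts once and is reused for the whole batch (again, the `make_driver` factory parameter is my addition):

```python
def crawl_reusing(urls, make_driver=None):
    """Visit all urls with ONE PhantomJS process, opened and closed once."""
    if make_driver is None:
        from selenium import webdriver
        make_driver = webdriver.PhantomJS
    driver = make_driver()       # pay the startup cost a single time
    try:
        for url in urls:
            driver.get(url)
    finally:
        driver.quit()            # single teardown, even on error
```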
Test result: 4.2s
Explanation: the difference between the two versions is that the optimized code moves the PhantomJS open and close operations outside the loop, so the process is started and stopped only once. The performance gap is enormous, which confirms that starting and stopping PhantomJS is very expensive.
Note: although this approach saves a lot of time, it can trigger another PhantomJS problem (call it a bug for now): state carry-over. When visiting a batch of sites this way, you may find that the returned results do not match the requested URLs. For that issue, see [The pits of PhantomJS](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/).
With the previous optimizations, PhantomJS performs much better, but so far only in a single thread. For large batches of sites, concurrency is a must, so how do we use and optimize PhantomJS concurrently?
Optimizing PhantomJS's concurrent performance was not smooth sailing for me: I read a lot of material along the way and stepped into plenty of pits.
At first I tried the most straightforward approach: run a single PhantomJS process and open multiple threads inside it.
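Reconstructed from the description, that failed attempt looked roughly like this: one shared driver, one thread per URL. With a real PhantomJS the concurrent `driver.get()` calls collide, because the process handles only one command at a time:

```python
import threading

def crawl_shared_driver(urls, driver):
    """Failed attempt #1: many threads hammering ONE PhantomJS process."""
    threads = [threading.Thread(target=driver.get, args=(url,))
               for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # with a real PhantomJS this raises connection errors
```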
However, running it produced connection errors. After checking the official site and other sources, I found that PhantomJS is single-threaded, so multiple threads cannot drive one process at the same time. This optimization failed!
Since one PhantomJS instance supports only a single thread, the next idea was to open multiple PhantomJS processes.
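A sketch of that second attempt: one thread, and one PhantomJS process, per URL (the `make_driver` factory parameter is my addition). It works for 10 URLs but clearly cannot scale to 50 simultaneous processes:

```python
import threading

def crawl_process_per_url(urls, make_driver=None):
    """Failed attempt #2: one PhantomJS process (and thread) PER URL."""
    if make_driver is None:
        from selenium import webdriver
        make_driver = webdriver.PhantomJS

    def worker(url):
        driver = make_driver()   # every url spawns its own process
        driver.get(url)
        driver.quit()

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```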
This did start 10 PhantomJS processes at once and executed 10 website requests concurrently. But what happens when there are 50 websites? Run 50 PhantomJS processes at the same time? No, that would bring the server to its knees. This optimization failed too!
After these two failures, I began to think about opening only 10 PhantomJS processes and having each process work through its requests in order, which amounts to 10-way concurrency.
Finally, late one night, I came up with the following code:
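Reconstructed from the description that follows, this third attempt starts 10 worker threads, each owning a PhantomJS process, but forgets to divide the URL list between them, so every worker walks the whole list (`make_driver` is again my factory addition):

```python
import threading

def crawl_ten_workers_buggy(urls, make_driver=None, n_workers=10):
    """Third attempt: n_workers processes, but NO division of work -
    every worker requests EVERY url."""
    if make_driver is None:
        from selenium import webdriver
        make_driver = webdriver.PhantomJS

    def worker():
        driver = make_driver()
        for url in urls:         # bug: the full list, not a share of it
            driver.get(url)
        driver.quit()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```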
It successfully opened 10 PhantomJS processes, but each one executed all 50 website requests in sequence. Wait: with this design, every PhantomJS process visits Baidu 50 times, which is not at all the original intent. Oh no!
The third stage produced the skeleton of the concurrent design, but it still had a multi-thread shared-resource problem, which the Queue module solves. Here is the optimized concurrent code:
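A sketch of the Queue-based design: all URLs go into one shared queue, and each of the 10 workers owns one PhantomJS process and pulls URLs until the queue runs dry, so each URL is fetched exactly once and each process is opened exactly once (`make_driver` is my factory addition):

```python
import threading
try:
    from queue import Queue, Empty   # Python 3
except ImportError:
    from Queue import Queue, Empty   # Python 2, as in the original era

def crawl_with_queue(urls, make_driver=None, n_workers=10):
    """n_workers PhantomJS processes share one queue of urls: every url
    is fetched exactly once, every process is opened/closed exactly once."""
    if make_driver is None:
        from selenium import webdriver
        make_driver = webdriver.PhantomJS

    q = Queue()
    for url in urls:
        q.put(url)

    def worker():
        driver = make_driver()           # one process per worker
        try:
            while True:
                try:
                    url = q.get_nowait()  # empty queue -> worker is done
                except Empty:
                    break
                driver.get(url)
        finally:
            driver.quit()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The queue is the single source of truth for what remains to be fetched, which is what removes the duplicated work of the previous version.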
Single-thread-optimized code, 50 visits to Baidu: 10.3s.
10 concurrent PhantomJS processes, 50 visits to Baidu: 8.1s.
Explanation: the optimized concurrent code also opens 10 PhantomJS processes to handle the 50 requests to Baidu. Since one PhantomJS instance cannot handle two URLs at once (it does not support multithreading), the number of PhantomJS processes opened equals the program's degree of concurrency. Excluding the time spent starting the 10 processes, the 50 visits themselves take about 2s, which is much faster.
The concurrency achieved in the advanced section actually relies on multiple processes: no matter how many Python threads drive PhantomJS, one PhantomJS process can execute only one access request at a time. The degree of concurrency therefore depends on the number of PhantomJS processes started, and PhantomJS runs as a process.
Now that we understand the performance bottleneck, the final section can combine distribution with multiple concurrent PhantomJS processes to improve performance further.
The optimizations above do not fundamentally solve PhantomJS's performance problems. For a better alternative, see:
[How to use PhantomJS correctly](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)
[[Phantomjs series] How to use PhantomJS correctly](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)
[[Phantomjs series] PhantomJS API introduction](http://thief.one/2017/03/13/Phantomjs-Api%E4%BB%8B%E7%BB%8D/)
[[Phantomjs series] The pits of selenium+phantomjs](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/)
[[Phantomjs series] selenium+phantomjs performance optimization](http://thief.one/2017/03/01/Phantomjs%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/)