[phantomjs series] Using Phantomjs the right way

How did you get out of the haze of life?
Take a few more steps

Some time ago I analyzed [Selenium+Phantomjs usage and performance optimization](http://thief.one/2017/03/01/Phantomjs%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/), and along the way wrote up [the pitfalls I ran into while crawling with Selenium+Phantomjs](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/). In practice, though, none of that tuning really improved the performance of phantomjs, and the crawler was no faster. After a reader pointed it out, I realized that the problem was the way I was using phantomjs, so no amount of optimization could fundamentally improve performance. This article covers how to use Phantomjs correctly.

Abandon selenium+phantomjs

I used to drive phantomjs through Selenium. Selenium wraps some of the functionality of phantomjs and provides a Python module, so phantomjs can be used indirectly from Python. However, it is time to drop selenium+phantomjs. First, this wrapper has not been updated for a long time (no one maintains it). Second, selenium implements only part of what phantomjs can do.

phantomjs API

The official phantomjs documentation shows that phantomjs is far more capable than the subset that selenium wraps. Phantomjs provides a variety of APIs; see [Phantomjs API introduction](http://thief.one/2017/03/13/Phantomjs-Api%E4%BB%8B%E7%BB%8D/). The most commonly used are Phantomjs WebService (the `webserver` module) and Phantomjs WebPage (the `webpage` module): the former starts an HTTP service, and the latter issues HTTP requests to load pages.

Using Phantomjs correctly

Design Flow:

Python sends each task as an HTTP request; the Phantomjs Webservice receives the task, processes it, and returns the result to Python. Task scheduling, storage, and other complex logic stay in Python, which can call the Phantomjs Webservice asynchronously. Note that a single Phantomjs Webservice currently handles only 10 concurrent requests. To go beyond that, run several Phantomjs Webservice instances on one server, each on its own port, or build a cluster of servers and put nginx in front as a reverse proxy.
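As a rough sketch of the multi-port idea, the Python side could hand out tasks to several local instances in round-robin order. The function name `assign_endpoints` and the ports 8080 to 8082 are made up for illustration; the only assumption is that one Phantomjs Webservice is already listening on each port.

```python
from itertools import cycle


def assign_endpoints(urls, ports, host="localhost"):
    """Pair each target URL with a service endpoint, round-robin.

    `ports` lists the ports on which Phantomjs Webservice instances are
    assumed to be listening (hypothetical values, chosen by the operator).
    """
    endpoints = cycle("http://{}:{}".format(host, port) for port in ports)
    return [(url, next(endpoints)) for url in urls]


if __name__ == "__main__":
    tasks = assign_endpoints(["http://thief.one"] * 5, [8080, 8081, 8082])
    for target, endpoint in tasks:
        print(target, "->", endpoint)
```

With three instances, each limited to 10 concurrent requests, the client could keep up to 30 requests in flight. An nginx reverse proxy does the same balancing on the server side, so the client would then only need one address.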

Phantomjs Webservice

Create a new test.js and write the following code:

    // This script fetches a page and returns its final URL (or its source)
    var system = require('system'); // command-line arguments
    var args = system.args;
    var port;
    if (args.length === 2) {
        port = Number(args[1]);
    } else {
        port = 8080;
    }
    var webserver = require('webserver');
    var server = webserver.create();
    var service = server.listen(port, function (request, response) {
        try {
            // The POST body has the form "<url-encoded url>=<md5>"
            var parts = request.postRaw.split('=');
            var url = decodeURIComponent(parts[0]);
            var md5_url = parts[1];
            // Create the page
            var webPage = require('webpage');
            var page = webPage.create();
            page.settings.resourceTimeout = 20000; // timeout is 20 s
            // Catch errors raised by scripts running in the page
            page.onError = function (msg, trace) {
                console.log('[Warning]This is page.onError');
                var msgStack = ['ERROR: ' + msg];
                if (trace && trace.length) {
                    msgStack.push('TRACE:');
                    trace.forEach(function (t) {
                        msgStack.push(' -> ' + t.file + ': ' + t.line + (t.function ? ' (in function "' + t.function + '")' : ''));
                    });
                }
                // console.error(msgStack.join('\n'));
            };
            // Catch errors raised by phantomjs itself
            phantom.onError = function (msg, trace) {
                console.log('[Warning]This is phantom.onError');
                var msgStack = ['PHANTOM ERROR: ' + msg];
                if (trace && trace.length) {
                    msgStack.push('TRACE:');
                    trace.forEach(function (t) {
                        msgStack.push(' -> ' + (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function + ')' : ''));
                    });
                }
                console.error(msgStack.join('\n'));
                phantom.exit(1);
            };
            // Open the web page and get the source code
            page.open(url, function (status) {
                console.log('Target_url is ' + url); // the URL being checked
                var body = '';
                var current_url = '';
                if (status === 'success') {
                    current_url = page.url;
                    body = page.content;
                }
                response.statusCode = 200;
                // response.write(body); // return the source of the fetched page
                response.write(current_url); // return the current page URL
                // Close the page inside the callback, after the load has finished
                page.close();
                response.close();
            });
        } catch (e) {
            console.log('[Error] ' + e.message + ' happened at line ' + e.lineNumber);
            // Answer the client anyway so it does not wait for its timeout
            response.statusCode = 500;
            response.write('');
            response.close();
        }
    });

Purpose: handles HTTP requests and extracts the URL from each one, so that the page can be screenshotted or its source fetched.
Usage:

phantomjs.exe test.js 8080

This starts the web service locally on port 8080.
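The service reads the raw POST body and splits it on `=`, so it expects a form-encoded body of the shape `<url-encoded url>=<token>`. The sketch below shows what such a body looks like on the wire and what the split recovers. The helper names `build_body` and `parse_body` are illustrative only; `parse_body` mimics the server-side parsing in Python and is not part of phantomjs.

```python
from urllib.parse import urlencode, unquote_plus


def build_body(url, token):
    """Form-encode a single `url=token` pair, as a form POST would."""
    return urlencode({url: token})


def parse_body(body):
    """Mimic the service: split on the first '=' and decode the URL part."""
    encoded_url, _, token = body.partition("=")
    return unquote_plus(encoded_url), token


if __name__ == "__main__":
    body = build_body("http://thief.one", "abc")
    print(body)              # http%3A%2F%2Fthief.one=abc
    print(parse_body(body))  # ('http://thief.one', 'abc')
```

Because any `=` inside the target URL is percent-encoded, splitting the raw body on `=` still yields exactly two parts. One caveat: form encoding turns spaces into `+`, which `decodeURIComponent` on the phantomjs side leaves unchanged, so target URLs are best sent without literal spaces.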

Python Client

Create a new http_request.py and write the following code:

    # -*- coding: utf-8 -*-
    import base64
    import hashlib
    from multiprocessing.dummy import Pool  # thread-based Pool

    import requests


    class http_request:
        def __init__(self, port="8080"):
            self.url = "http://localhost:" + port

        def getwebbody(self, domain):
            '''
            Send the URL to the Phantomjs service and return its response
            '''
            base_domain = base64.b64encode(domain.encode("utf-8"))
            md5_domain = hashlib.md5(base_domain).hexdigest()
            # Sent as the form body "<url-encoded domain>=<md5>"
            payload = {domain: md5_domain}
            try:
                response = requests.post(self.url, data=payload, timeout=30).text
                return response
            except requests.exceptions.ConnectionError:
                print("requests connection error")
            except Exception as e:
                print(e)
            return None


    if __name__ == "__main__":
        port = "8080"
        cur = http_request(port)
        domain_list = ["http://thief.one"] * 10

        def test(domain):
            print("Result_url is ", cur.getwebbody(domain))

        pool = Pool(processes=10)
        for domain in domain_list:  # deliver tasks concurrently
            # The pool keeps at most 10 workers busy; a queued task starts
            # as soon as a worker finishes.
            pool.apply_async(test, args=(domain,))
        pool.close()
        pool.join()

Purpose: delivers tasks concurrently and asynchronously.
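The client prints each result inside the worker. If the caller needs the results back, for storage or scheduling, one option is `pool.map`, which returns them in the same order as the input. This is only a sketch: `fetch_all` is a made-up helper, and `fetch` stands for any callable that takes a URL, such as the `getwebbody` method of the client.

```python
from multiprocessing.dummy import Pool  # thread-based Pool


def fetch_all(fetch, urls, workers=10):
    """Run fetch(url) for every URL on a thread pool and collect the results.

    `workers` defaults to 10 to match the concurrency limit of a single
    Phantomjs Webservice mentioned earlier.
    """
    with Pool(processes=workers) as pool:
        return pool.map(fetch, urls)


if __name__ == "__main__":
    # Stand-in for a real request, so the sketch runs without a server.
    def fake_fetch(url):
        return "fetched " + url

    for line in fetch_all(fake_fetch, ["http://thief.one"] * 3):
        print(line)
```

With the real client, the call would look like `fetch_all(cur.getwebbody, domain_list)`.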

Run result

Running the Python script delivers 10 tasks asynchronously. The Phantomjs server receives each URL, processes the 10 tasks, and returns the results.

Exception handling

Symptom: the screenshot is a black screen
Cause: the screenshot was taken before the page had finished loading.
Solution: check the status value passed to the page.open callback to confirm that the page has loaded before taking the screenshot.

Symptom: the program fails with a Windows error
Solution: upgrade to the latest version of phantomjs.

Symptom: memory usage grows until the phantomjs process stops with an error
Cause: phantomjs did not release the memory held by the page.
Solution: call page.close() once the page.open callback has finished with the page.

Symptom: no screenshot succeeds
Cause: page.close() is called too early. Page loading is non-blocking (onLoadFinished fires asynchronously), so page.close() must be placed inside the page.open callback.
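On the client side, a failed load shows up as an empty response: test.js writes an empty string when the status is not 'success', and `getwebbody` returns None when the request itself fails. A minimal retry sketch under those assumptions follows; `fetch_with_retry` and the default of 3 attempts are hypothetical choices, not something the service requires.

```python
def fetch_with_retry(fetch, url, attempts=3):
    """Call fetch(url) until it returns a non-empty result.

    Returns None if every attempt comes back empty or None.
    """
    for _ in range(attempts):
        result = fetch(url)
        if result:
            return result
    return None


if __name__ == "__main__":
    # Stand-in for a real request: fails twice, then succeeds.
    replies = iter(["", None, "http://thief.one/"])
    print(fetch_with_retry(lambda url: next(replies), "http://thief.one"))
```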

When reprinting, please credit the source: [Using Phantomjs the right way | nMask's Blog](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)

Related links

[[phantomjs series] Using Phantomjs the right way](http://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/)
[[phantomjs series] Phantomjs API introduction](http://thief.one/2017/03/13/Phantomjs-Api%E4%BB%8B%E7%BB%8D/)
[[phantomjs series] Pitfalls of crawling with selenium+phantomjs](http://thief.one/2017/03/01/Phantomjs%E7%88%AC%E8%BF%87%E7%9A%84%E9%82%A3%E4%BA%9B%E5%9D%91/)
[[phantomjs series] selenium+phantomjs performance optimization](http://thief.one/2017/03/01/Phantomjs%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/)

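Title: [phantomjs series] Using Phantomjs the right way

Author: nmask

Published: 2017-03-31, 11:03

Last updated: 2019-08-16, 15:08

Original link: https://thief.one/2017/03/31/Phantomjs%E6%AD%A3%E7%A1%AE%E6%89%93%E5%BC%80%E6%96%B9%E5%BC%8F/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. When reprinting, please keep the original link and the author's name.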
