Python2 encoding

Exploring technology is like weaving a story. The fun lies in occasionally telling it to others and earning a little approval!

Python encoding problems have bothered me for a long time. I have written a few summaries before, but they were messy and unsystematic. To be fair, encoding in python2.x is itself a tangle that is hard to straighten out. This article introduces the encoding problems you are likely to meet in python2.x programming and gives solutions. Based on this exploration I also wrote a transcoding module, [Transcode](https://github.com/tengzhangchao/Transcode), which should solve most of the stubborn problems newcomers run into. Python experts can skip this article; Python 3.x will be covered in a later one.

Python programming regularly runs into operating system encoding, file encoding, console input and output encoding, web page encoding, source code encoding, and Python's own string encoding. This article covers them one by one. First, let's look at some encodings that are commonly queried:
import sys
import locale
print sys.getdefaultencoding()      # system default encoding
print sys.getfilesystemencoding()   # file system encoding
print locale.getdefaultlocale()     # current system locale and encoding
print sys.stdin.encoding            # terminal input encoding
print sys.stdout.encoding           # terminal output encoding

Run the code above on Windows and on Linux and compare the output.
Windows terminal results:

ascii
mbcs
('zh_CN', 'cp936')
cp936
cp936

Linux terminal results:

ascii
UTF-8
('zh_CN', 'UTF-8')
UTF-8
UTF-8

Operating system encoding

The default encoding can be obtained with sys.getdefaultencoding(). Both Windows and Linux report ascii, and ascii cannot represent Chinese. Strictly speaking, this value is the encoding Python 2 uses when it converts implicitly between str and unicode, rather than a setting of the operating system itself. So where does a Python program use it, and when does it cause trouble?

Triggering the exception

Testing shows that when a unicode string is written to a file, Python first converts it implicitly to a str using the default encoding and only then writes it. With the default of ascii, this step easily raises an encoding error.
The example proves:

#! -*- coding:utf-8 -*-
a = u"中文"                 # a unicode string containing Chinese
f = open("test.txt", "w")
f.write(a)                  # implicitly encodes a with the ascii codec and fails

Error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1……
Note: ascii cannot represent Chinese, and the variable a is a unicode string containing Chinese, so it cannot be encoded and an exception is raised.
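The implicit step can be reproduced explicitly. The sketch below is only an illustration: `try_encode` is a hypothetical helper, not something from Python or from the Transcode module, and it is written so that it runs under both Python 2.7 and Python 3. It shows that the ascii codec rejects Chinese text while utf-8 and gbk accept it.

```python
# -*- coding: utf-8 -*-

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

a = u"中文"
print(try_encode(a, "ascii") is None)       # True: ascii has no Chinese characters
print(try_encode(a, "utf-8") is not None)   # True
print(try_encode(a, "gbk") is not None)     # True
```

Writing a unicode string to a plain file object in Python 2 behaves like the first call, except that the failure surfaces as an exception instead of None.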

Solution

Set the default encoding to utf-8 or gbk.

import sys
reload(sys)                      # restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('gbk')    # use 'utf-8' on Linux

Description: Set it to gbk on Windows and to utf-8 on Linux.

Terminal encoding

On Windows, the terminal means the console. Console input and output have their own encoding; on a Windows console it is cp936. I had never seen this name before, so I looked it up and found that it is essentially the familiar GBK. On a Linux terminal, input and output are UTF-8. If a program does no terminal input or output, this encoding can be ignored; otherwise it matters a great deal.

Where garbling occurs

When running Python scripts in a terminal, Chinese output often comes out garbled. The usual cause is that the encoding of the string being printed does not match the encoding of the console.
The example proves:

#! -*- coding:utf-8 -*-
a = "中文"      # a plain str; its bytes are UTF-8 because the source file is UTF-8
print a
print type(a)

Windows console output:

(garbled characters)
<type 'str'>

Linux terminal output:

中文
<type 'str'>

The difference arises because the Windows console interprets output as GBK, while the bytes in variable a are UTF-8.
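To see why the two encodings clash, it helps to look at the bytes. The following is a small sketch, written to run under both Python 2.7 and Python 3, that encodes the same text both ways and then misreads the UTF-8 bytes as GBK, which is roughly what a GBK console does.

```python
# -*- coding: utf-8 -*-

text = u"中文"
utf8_bytes = text.encode("utf-8")   # what a UTF-8 source file stores
gbk_bytes = text.encode("gbk")      # what a GBK console expects

# Same text, different byte sequences.
print(repr(utf8_bytes))
print(repr(gbk_bytes))

# Reading UTF-8 bytes as if they were GBK does not give the text back.
misread = utf8_bytes.decode("gbk", "replace")
print(misread == text)                   # False
print(gbk_bytes.decode("gbk") == text)   # True
```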

Solution

#! -*- coding:utf-8 -*-
a = '你好'
b = a.decode("utf-8").encode("gbk")   # UTF-8 str -> unicode -> GBK str
print b

The variable a is converted from UTF-8 to GBK before it is printed.

Python encoding

python2.x has two string types: str and unicode. Content obtained from outside arrives as str, which is a sequence of bytes in some encoding such as UTF-8, GBK or GB2312. To avoid errors caused by mixing encodings, it is best to convert everything to unicode inside Python and to convert back to str only on output. The decode() and encode() methods convert between the two: decode() turns a str into unicode, and encode() turns unicode into a str.
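As a rough illustration of that round trip, here is a sketch that runs under both Python 2.7 and Python 3. The function name `recode` is my own and purely illustrative. It decodes bytes to unicode and re-encodes them, and the lengths show the difference between counting bytes and counting characters.

```python
# -*- coding: utf-8 -*-

def recode(data, src, dst):
    """Decode bytes from `src` to unicode, then encode them to `dst`."""
    text = data.decode(src)    # str (bytes) -> unicode
    return text.encode(dst)    # unicode -> str (bytes)

utf8_data = u"你好".encode("utf-8")
gbk_data = recode(utf8_data, "utf-8", "gbk")

print(len(utf8_data))                # 6 bytes in UTF-8
print(len(gbk_data))                 # 4 bytes in GBK
print(len(gbk_data.decode("gbk")))   # 2 characters once decoded
```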

Triggering the exception

Problems typically appear when a string held inside Python in one encoding meets the console encoding, or some other encoding.
The example proves:

#! -*- coding:utf-8 -*-
a = "中文"      # a plain str; its bytes are UTF-8 because the source file is UTF-8
print a
print type(a)

Result (Windows console):

(garbled characters)
<type 'str'>

Description: The Windows console reads and writes GBK, while the variable a defined in the code is a str holding UTF-8 bytes, so the output is garbled. To create a unicode string instead, put a u in front of the quotes, for example a=u"123".

Solution

#! -*- coding:utf-8 -*-
a = '你好'
print a.decode("utf-8").encode("gbk")   # UTF-8 str -> unicode -> GBK str

Description: The variable a is a str holding UTF-8 bytes, and we want a str holding GBK bytes. Python cannot convert between the two directly; the conversion has to go through unicode. decode() first turns the UTF-8 str into unicode, and encode() then turns the unicode into a GBK str. On Unix systems the problem does not arise, because the terminal is UTF-8 as well. When print is given a unicode string and the output is a terminal, it encodes the string with the terminal's encoding (gbk on Windows, utf-8 on Unix) before writing it, so printing unicode strings to the console does not normally raise an error.
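Instead of hard-coding "gbk", the target encoding can be taken from the output stream. The sketch below assumes that approach; `encode_for_stream`, `FakeConsole` and `FakePipe` are hypothetical names used only for this illustration, and the fallback of utf-8 is my own choice. It runs under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import sys

def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode unicode `text` with the stream's encoding, or with `fallback`
    when the stream reports none (for example when output is piped)."""
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, "replace")

class FakeConsole(object):
    """Hypothetical stand-in for a Windows console that uses cp936."""
    encoding = "cp936"

class FakePipe(object):
    """Hypothetical stand-in for piped output that reports no encoding."""
    encoding = None

print(repr(encode_for_stream(u"你好", FakeConsole())))   # GBK bytes
print(repr(encode_for_stream(u"你好", FakePipe())))      # UTF-8 bytes
print(sys.stdout.encoding)                               # the real value on this machine
```

In a real script the second argument would be sys.stdout. The "replace" error handler is there so that characters the console cannot show degrade to "?" instead of raising.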

Source code encoding

Source code encoding refers to the encoding of the Python source file itself. In python2.x the default is ascii.

Triggering the exception

Before a Python program runs, the interpreter parses it and compiles it to bytecode, and this is where the exception is raised. The reason is again that ascii cannot represent Chinese: when Chinese appears anywhere in the source file, even in a comment, the interpreter stops with a SyntaxError complaining about a non-ASCII character with no encoding declared.

Solution

#! -*- coding:utf-8 -*-

Put this line at the top of the program (it must be on the first or second line) to declare that the source file is encoded as utf-8.
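For the curious, the declaration is recognized by a regular expression given in PEP 263. The sketch below applies that pattern to the first two lines of a source file held as bytes. `declared_encoding` is a hypothetical helper written for this article, not an interpreter API, and it ignores details such as a byte order mark. It runs under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Pattern from PEP 263 for the encoding declaration comment.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(source_bytes):
    """Return the encoding declared on the first two lines, or None."""
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line.decode("latin-1"))
        if match:
            return match.group(1)
    return None

print(declared_encoding(b"#! -*- coding:utf-8 -*-\nprint 1\n"))   # utf-8
print(declared_encoding(b"#!coding=utf-8\nimport urllib2\n"))     # utf-8
print(declared_encoding(b"print 1\n"))                            # None
```

Both spellings used in this article match the pattern, which is why either works as a declaration.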

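File encoding

File encoding refers to the encoding of the content that a Python program reads from a file. sys.getfilesystemencoding() returns mbcs on Windows and utf-8 on Linux. Strictly speaking, that value is the encoding Python uses for file names; the bytes inside a file are in whatever encoding the file was saved with. mbcs ("multi-byte character set") is Python's name for the current Windows ANSI code page, which is cp936 (GBK) on a Simplified Chinese system.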

Triggering the exception (reading file contents)

When a Python program reads content from a file and prints it, garbled output or an exception is easy to trigger.
The example proves:

#! -*- coding:utf-8 -*-
f = open("test.txt", "r")
content = f.read()     # raw bytes, in whatever encoding the file was saved with
f.close()
print type(content)
print content

Result:

<type 'str'>
你好

Solution

On Windows it is best to convert the file content to unicode, which can be done with codecs:

import codecs
content = codecs.open("test.txt", encoding='gbk').read()   # content is unicode

This decodes the GBK file content to unicode. You can also use open("test.txt","r").read().decode("gbk").
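Here is a self-contained sketch of the same idea that runs under both Python 2.7 and Python 3. It first creates a GBK file in a temporary directory to stand in for test.txt (the content is made up for the example), then reads it once as raw bytes and once through codecs.open.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Create a GBK-encoded file to stand in for test.txt (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(u"你好".encode("gbk"))

# Raw read: bytes in whatever encoding the file was saved with.
with open(path, "rb") as f:
    raw = f.read()

# Decoding read: codecs.open hands back unicode text.
with codecs.open(path, "r", encoding="gbk") as f:
    text = f.read()

print(repr(raw))
print(len(raw))                    # 4 bytes
print(len(text))                   # 2 characters
print(text == raw.decode("gbk"))   # True
```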

Triggering the exception (writing file contents)

As described in the operating system encoding section, writing a unicode string that contains Chinese to a file easily triggers an exception.

Solution

Use the fix from the operating system encoding section, or manually encode the unicode string to str before writing.
The example proves:

#! -*- coding:utf-8 -*-
a = u"中文"                  # a unicode string containing Chinese
f = open("test.txt", "w")
f.write(a.encode("gbk"))     # encode explicitly, so no implicit ascii step occurs
f.close()

Of course, if a is already a str, no error is raised. But if UTF-8 bytes are written to a file that Windows programs then read as GBK, the content will display as garbled.
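An alternative to calling encode() by hand is to let the file object do it. The sketch below assumes codecs.open for that purpose and writes to a temporary path rather than a real test.txt. It runs under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

a = u"中文"
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# codecs.open encodes unicode on write, so the ascii default is never used.
with codecs.open(path, "w", encoding="gbk") as f:
    f.write(a)

# Read the file back as raw bytes to see what was stored.
with open(path, "rb") as f:
    stored = f.read()

print(repr(stored))
print(stored == a.encode("gbk"))   # True
```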

Web page encoding

Web page encoding comes up constantly when writing crawlers. Combined with system encoding, Python encoding and file encoding, it often leads to a mess. In the program we should handle each of these encodings separately and convert everything to unicode inside Python. So which encodings do web pages use?
Common ones: utf-8, gbk, gb2312

Triggering the exception

As before, the cause is that the encoding of the page source fetched from the web does not match the terminal encoding or the encoding used inside Python.
The example proves:

#!coding=utf-8
import urllib2
body = urllib2.urlopen('http://thief.one').read()   # raw bytes of the page
print type(body)
print body

Result:

<type 'str'>
(page body, with the Chinese text garbled)

Description: This website is encoded as utf-8. The content Python fetches from the page is a str, so printing it on a Windows console produces garbled output.

Solution

Following the earlier practice, decode the page to unicode first. Regular expressions applied to it can be unicode as well, for example res=r""+u"新成员". The chardet module can guess a page's encoding; it returns a dictionary containing the detected encoding and a confidence value.
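When chardet is not available, one simple alternative is to try the common encodings in order. The sketch below takes that approach; it is not what chardet does internally. `to_unicode` is a hypothetical helper, and the sample pages are built in memory instead of being fetched. It runs under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def to_unicode(data, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in turn and return (text, encoding)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

# Made-up page bodies standing in for urllib2.urlopen(...).read().
utf8_page = u"<title>中文</title>".encode("utf-8")
gbk_page = u"<title>中文</title>".encode("gbk")

print(to_unicode(utf8_page)[1])                              # utf-8
print(to_unicode(gbk_page)[1])                               # gbk
print(to_unicode(utf8_page)[0] == to_unicode(gbk_page)[0])   # True
```

utf-8 is tried first because it is strict and rejects most GBK byte sequences, whereas gbk will often accept UTF-8 bytes and silently produce garbage. GB2312 is a subset of GBK, so the gbk entry covers it as well.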

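Detecting encodings

Checking a string's type

isinstance(obj, unicode)

Returns True if obj is a unicode string and False if it is a str. To test whether obj is a string of either kind, use isinstance(obj, (str, unicode)).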

Detecting web page encoding

import chardet
import urllib2
body = urllib2.urlopen("http://thief.one").read()
print chardet.detect(body)   # a dict with 'encoding' and 'confidence' keys

chardet.detect returns the guessed encoding together with a confidence value. It works reasonably well for web pages.

Detecting system encoding

import sys
import locale
print sys.getdefaultencoding()      # system default encoding
print sys.getfilesystemencoding()   # file system encoding
print locale.getdefaultlocale()     # current system locale and encoding
print sys.stdin.encoding            # terminal input encoding
print sys.stdout.encoding           # terminal output encoding

python2.x encoding suggestions

  • Try to program on a Linux system. As the sections above show, Linux uses UTF-8 consistently and causes far fewer encoding problems than Windows.
  • Use unicode inside Python code. When content comes in from outside, decode it to unicode first, and encode it to str only when it is output.
  • Define variables and regular expressions as unicode too, for example a=u"中文"; res=r""+u"正则".
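The three suggestions can be sketched together as decode on input, work in unicode, encode on output. The input string and the names in it are made up for the example, and the code runs under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Made-up bytes standing in for content that arrived from outside.
raw = u"新成员: 张三, 新成员: 李四".encode("utf-8")

text = raw.decode("utf-8")                            # decode once at the boundary
pattern = re.compile(u"新成员: (\\w+)", re.UNICODE)   # unicode pattern
names = pattern.findall(text)                         # work on unicode internally
output = u",".join(names).encode("utf-8")             # encode once on output

print(len(raw) > len(text))   # True: more bytes than characters
print(len(names))             # 2
```

Matching on unicode means \w sees whole Chinese characters; on a UTF-8 str the same pattern would be working on individual bytes.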

Other tricky cases

Example 1:

a="\\u8fdd\\u6cd5\\u8fdd\\u89c4"
print a

The content of variable a is text written as literal \u escape sequences. How can it be displayed as readable Chinese?
Solution:

a="\\u8fdd\\u6cd5\\u8fdd\\u89c4"   # literal \u escape sequences
b=a.decode('unicode-escape')       # convert the escapes to a unicode string
print b
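The same conversion, written so that it runs under both Python 2.7 and Python 3, with a check on the lengths to show what changes. This is only a sketch of the technique above, and the variable names are my own.

```python
# -*- coding: utf-8 -*-

escaped = b"\\u8fdd\\u6cd5\\u8fdd\\u89c4"   # literal backslash-u sequences
text = escaped.decode("unicode_escape")     # escapes -> unicode text

print(len(escaped))                         # 24 ASCII bytes
print(len(text))                            # 4 characters
print(text == u"\u8fdd\u6cd5\u8fdd\u89c4")  # True
```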



If this article has added to your understanding of Python encoding issues, I will be pleased. If you run into a Python encoding problem, you can leave a message below.
If you have read this far and still don't know how to solve a garbled-text problem in Python, don't worry; please continue with [Transcode solve python encoding problem](https://github.com/tengzhangchao/Transcode)

  • To stress the point once more: the key to solving python2.x encoding problems is that, wherever a string comes from, it should be converted to unicode before it circulates inside Python. (Python3.x has improved greatly in this respect.)

Related articles

[Continuation of Python 2 encoding](http://thief.one/2017/04/14/1/)
[The beauty of Python3 encoding](http://thief.one/2017/04/18/1/)

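Title: Python2 encoding

Author: nmask

Published: 2017-02-16, 12:02

Last updated: 2019-08-16, 15:08

Original link: https://thief.one/2017/02/16/Solve the problem of Python2-x encoding/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.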
