Python2 encoding sequel

Butterflies are beautiful, but in the end a butterfly cannot fly across the sea.

The story begins like this

a="\\u8fdd\\u6cd5\\u8fdd\\u89c4"

At first glance this string looks like unicode-encoded content, yet something seems to be missing, so I started a series of experiments.
I wanted to figure out what this string really is. First, I tested a unicode string to see what it looks like.

>>> a=u"你好"
>>> a
u'\u4f60\u597d' # (unicode encoding)
>>> print type(a)
<type 'unicode'>
>>> print a
你好

The experiment shows that a unicode string looks like this: u'\u4f60\u597d', which actually represents the Chinese text 你好. Why does entering a display the escaped unicode content while print a displays the Chinese text? Entering a shows the repr of the object, whereas the print statement in Python 2 automatically encodes the unicode object into a str for output. If you are not familiar with unicode and str, please read my earlier article on encoding first; I think it will help.
Now that we know what a unicode string looks like, we can rule out the strange string being a unicode string. Why? Simply because there is no u in front of it.

Are you a little confused at this point? Although it has no u prefix (as in u"\u4f60..."), it really does look like unicode content. Don't worry; let me introduce the difference between string variable encoding and string content encoding.

Note: I coined these two terms myself and they are not official terminology. Please bear with me if they are imprecise.

String variable encoding is what we usually mean by encoding: the variable is either str or unicode, and a str can in turn hold utf-8, gbk, gb2312 bytes and so on. It is easy to check whether a variable is str or unicode with the type function.

>>> a=u"你好"
>>> print type(a)
<type 'unicode'>
>>> a="你好"
>>> print type(a)
<type 'str'>

We can see that unicode and str describe the encoding format of the string variable, regardless of its content. Defining a="test" makes a a str, while defining a=u"test" makes a a unicode object. So let me ask: what is the encoding of test? (The question is about test, not a.)
Some will say that test is an ordinary string, and indeed it is; it is the content of a. By the same reasoning, take

a="\\u8fdd\\u6cd5\\u8fdd\\u89c4"

Here a itself is a variable of type str. Then what about

\\u8fdd\\u6cd5\\u8fdd\\u89c4

the content itself? That's right: the content itself is unicode-encoded text. OK, let's test this experimentally.

Let’s take a look at the string content encoded by several common encoding formats:

>>> a=u"你好".encode("gbk")
>>> a
'\xc4\xe3\xba\xc3' # content is gbk encoded
>>> a=u"你好".encode("utf-8")
>>> a
'\xe4\xbd\xa0\xe5\xa5\xbd' # content is utf-8 encoded
>>> a=u"你好".encode("gb2312")
>>> a
'\xc4\xe3\xba\xc3' # content is gb2312 encoded
>>> a=u"你好"
>>> a
u'\u4f60\u597d' # content is unicode encoded
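For readers on Python 3, where text is str and encoded data is bytes, the same comparison can be sketched roughly as follows. This is a translation of the Python 2 session above under that assumption, not part of the original experiment, and the variable names are only illustrative.

```python
# Python 3 sketch (assumption: the article's own sessions are Python 2).
text = "你好"

gbk_bytes = text.encode("gbk")        # 2 bytes per character
utf8_bytes = text.encode("utf-8")     # 3 bytes per character
gb2312_bytes = text.encode("gb2312")  # same bytes as gbk for these two characters

print(type(text), type(gbk_bytes))    # <class 'str'> <class 'bytes'>
print(gbk_bytes)                      # b'\xc4\xe3\xba\xc3'
print(utf8_bytes)                     # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(gb2312_bytes == gbk_bytes)      # True
```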

Please pay attention to the encoded content above and observe its characteristics; then we will look at the strange string.

>>> a="\\u8fdd\\u6cd5\\u8fdd\\u89c4"
>>> a
'\\u8fdd\\u6cd5\\u8fdd\\u89c4'
>>> print type(a)
<type 'str'>
>>> print a
\u8fdd\u6cd5\u8fdd\u89c4

We see that the variable a is of type str.

>>> a=u"\\u8fdd\\u6cd5\\u8fdd\\u89c4" # add a u in front to make the variable a unicode
>>> print type(a)
<type 'unicode'>
>>> print a # roughly equivalent to print a.encode("utf-8") on a utf-8 terminal
\u8fdd\u6cd5\u8fdd\u89c4

Adding a u in front of the quotes makes the variable a a unicode string, and its content is

\\u8fdd\\u6cd5\\u8fdd\\u89c4

Printing a gives the same result as in the previous step. print converts a from unicode to str for output and shows the stored characters themselves, so each doubled backslash from the literal appears as the single backslash it stands for.
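The doubled backslash is only how a single backslash is written in source code and displayed by repr. A small sketch, written for Python 3 as an assumption (the sessions in this article are Python 2), makes the difference between the literal and the stored content visible:

```python
# "\\u8fdd" in source code stores one backslash followed by "u8fdd".
a = "\\u8fdd\\u6cd5\\u8fdd\\u89c4"

print(len(a))          # 24: four groups of backslash + 'u' + 4 hex digits
print(a)               # the stored characters, with single backslashes
print(repr(a))         # the literal form, with the backslashes doubled again
print(a.count("\\"))   # 4
```

The content never changes between the last three lines; only the way it is displayed does.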

>>> b=u"\u8fdd\u6cd5\u8fdd\u89c4"
>>> print type(b)
<type 'unicode'>
>>> print b
违法违规

Then I assigned the content of a, i.e. \u8fdd\u6cd5\u8fdd\u89c4, to the variable b, again adding a u before the quotes to make it unicode. When I printed b, something magical happened: the output was Chinese. The reason is that inside a u"" literal each \uXXXX is an escape sequence for the character itself, so b really holds four Chinese characters, and print encodes them into a str that the terminal can display.

>>> a="你好"
>>> a
'\xc4\xe3\xba\xc3'
>>> a=u"\xc4\xe3\xba\xc3"
>>> print a

The example above defines the variable a as unicode, but its content is the gbk-encoded bytes of a str. When printing a, the print statement tries to convert the content of a into a str, but since the content was already str-encoded, garbled characters appear. The reverse works:

>>> a="你好"
>>> a
'\xc4\xe3\xba\xc3'
>>> b="\xc4\xe3\xba\xc3"
>>> b.decode("gbk")
u'\u4f60\u597d'
>>> print b.decode("gbk")
你好
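Both directions can be reproduced deliberately. In this Python 3 sketch (an assumption, since the original sessions are Python 2), decoding with latin-1 stands in for u"\xc4\xe3\xba\xc3", because both turn every byte into the code point with the same number:

```python
raw = b"\xc4\xe3\xba\xc3"            # gbk bytes of 你好

wrong = raw.decode("latin-1")        # one code point per byte, like u"\xc4\xe3\xba\xc3"
right = raw.decode("gbk")            # bytes read with the codec that produced them

print(len(wrong), len(right))        # 4 2
print(wrong == "\xc4\xe3\xba\xc3")   # True
print(right == "\u4f60\u597d")       # True
# The damage is reversible as long as no byte was lost:
print(wrong.encode("latin-1").decode("gbk") == right)  # True
```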

You may find this strange. In the experiment above we used the content \u8fdd\u6cd5\u8fdd\u89c4, but the strange string looks like this:

\\u8fdd\\u6cd5\\u8fdd\\u89c4

It seems to have extra backslashes. Don't panic: the following test shows the difference between the two.

>>> b="\\xc4\\xe3\\xba\\xc3"
>>> b.decode("gbk")
u'\\xc4\\xe3\\xba\\xc3'
>>> print b.decode("gbk")
\xc4\xe3\xba\xc3
>>> c="\xc4\xe3\xba\xc3"
>>> print c.decode("gbk")
你好
#################
>>> print a # a=u"\\u8fdd\\u6cd5\\u8fdd\\u89c4" from the earlier session
\u8fdd\u6cd5\u8fdd\u89c4
>>> print b # b=u"\u8fdd\u6cd5\u8fdd\u89c4" from the earlier session
违法违规

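Simply put, the strange string is content that has gone through unicode encoding twice: its \uXXXX sequences are stored as literal characters instead of the characters they stand for.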
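One way such a string can come into existence is sketched below in Python 3 (an assumption; the article uses Python 2): text is escaped with the unicode_escape codec and the result is then handled as ordinary characters. Where the strange string in this story actually came from is not stated, so this is only an illustration.

```python
original = "\u8fdd\u6cd5\u8fdd\u89c4"             # four Chinese characters

escaped = original.encode("unicode_escape")       # bytes: backslash, 'u', hex digits
strange = escaped.decode("ascii")                 # same characters as the strange string

print(escaped)                                    # b'\\u8fdd\\u6cd5\\u8fdd\\u89c4'
print(strange == "\\u8fdd\\u6cd5\\u8fdd\\u89c4")  # True
print(len(original), len(strange))                # 4 24
```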

Built-in function use

Of course, turning it into Chinese can be done with a built-in codec. I walked through the demo step by step only to show more clearly what is going on.
Convert unicode-escaped content to Chinese (note: the content, not the string variable):

a="\\u8fdd\\u6cd5\\u8fdd\\u89c4" # the content of a is unicode-escaped; the variable a is a str (no u before the quotes)
b=a.decode('unicode-escape')
print b

Convert string-escaped content to Chinese (note: the content, not the string variable):

a="\\xe5\\x85\\xb3\\xe4\\xba\\x8e\\xe4" # the content of a is string-escaped; the variable a is a str (no u before the quotes)
b=a.decode('string-escape')
print b
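Python 3 has no str.decode and no string-escape codec, so the two conversions need a different spelling there. The helper names below are hypothetical, and the latin-1 round trip is one common workaround rather than an official replacement. The sample input for the second helper leaves out the trailing \\xe4 of the example above, because on its own it is an incomplete utf-8 sequence.

```python
def unescape_unicode(text):
    """Turn literal \\uXXXX sequences in ASCII text into the characters they name."""
    return text.encode("ascii").decode("unicode_escape")


def unescape_bytes(text, encoding="utf-8"):
    """Turn literal \\xNN sequences into bytes, then decode them with `encoding`."""
    raw = text.encode("ascii").decode("unicode_escape").encode("latin-1")
    return raw.decode(encoding)


print(unescape_unicode("\\u8fdd\\u6cd5\\u8fdd\\u89c4") == "\u8fdd\u6cd5\u8fdd\u89c4")  # True
print(unescape_bytes("\\xe5\\x85\\xb3\\xe4\\xba\\x8e") == "\u5173\u4e8e")              # True
```

Both helpers assume the input is pure ASCII; text that already contains non-ASCII characters would need different handling.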

The difference between unicode-escape and utf-8

Added on April 27, 2017

>>> a="\u4e0a\u4f20\u6210\u529f"
>>> b=a.decode('utf-8')
>>> print type(b)
<type 'unicode'>
>>> b
u'\\u4e0a\\u4f20\\u6210\\u529f'
>>> print b
\u4e0a\u4f20\u6210\u529f

When a is decoded with 'utf-8', the type of the variable changes from str to unicode, but the content is only utf-8 decoded: in a Python 2 str literal \u is not an escape, so the content is still the literal backslash sequences, which is why the repr shows doubled backslashes.

>>> a="\u4e0a\u4f20\u6210\u529f"
>>> c=a.decode("unicode-escape")
>>> print type(c)
<type 'unicode'>
>>> c
u'\u4e0a\u4f20\u6210\u529f'
>>> print c
上传成功

When a is decoded with 'unicode-escape', the variable again becomes unicode, and this time each \uXXXX sequence in the content is interpreted as the character it names.

print encodes a unicode object into a str for output. That is why the second example outputs Chinese, while the first example still outputs the escape sequences (with the doubled backslashes of the repr shown as single ones): its content still needs to be unicode-escape decoded once more.
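The contrast between the two codecs can be checked side by side. This is a Python 3 sketch and assumes the data arrives as bytes (in Python 2 the str a plays that role):

```python
data = b"\\u4e0a\\u4f20\\u6210\\u529f"          # 24 ASCII bytes, as in the sessions above

as_utf8 = data.decode("utf-8")                  # bytes -> text, content untouched
as_escape = data.decode("unicode_escape")       # bytes -> text, escapes interpreted

print(len(as_utf8), len(as_escape))             # 24 4
print(as_utf8.startswith("\\u4e0a"))            # True: still backslash, 'u', digits
print(as_escape == "\u4e0a\u4f20\u6210\u529f")  # True
```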
Of course, this particular conversion can be done more simply, as follows:

>>> d=u"\u4e0a\u4f20\u6210\u529f" # add a u in front when defining d to make it unicode
>>> print d
上传成功

That was quite a fast ride; I hope you didn't get motion sickness. If you can't untangle the encoding relationships above, just remember the last two functions.

The story ends like this

Watching the familiar Chinese characters appear on the screen, I excitedly threw the transcoded content over to a certain division and eagerly awaited my reward, looking forward to the sunset from the top-floor room of the Super 8 Hotel and savoring that strange smirk. Until a final line popped up on the screen: 8:00, see you at the theater.

Supplement

April 21, 2017
There is a list whose elements are unicode strings. When the list is output, its contents look like this:

[u'\u827a\u672f\u9986', u'\u5b58\u50a8\u7ba1\u7406', u'\u609f\u8005', u'\u827a\u54c1', u'\u7ca4\u5907\u4eac', u'\u767e\u79cd', u'\u5fae\u55b7', u'\u827a\u672f\u4f5c\u54c1', u'\u57f9\u690d', u'\u6444\u5f71\u5bb6', u'\u666e\u53ca\u6559\u80b2', u'\u5927\u9053\u81f3\u7b80', u'\u88c5\u5e27', u'\u96c5\u660c\u4ee5', u'\u9274\u8bc1', u'\u4e07\u6377', u'\u6838\u5fc3\u6280\u672f', u'\u884d\u751f\u54c1']

How can the contents of the list be shown as Chinese? My guess was that when a list is output, the Chinese inside is automatically escaped, so you can do this:

Look at the output:

['艺术馆', '存储管理', '悟者', '艺品', '粤备京', '百种', '微喷', '艺术作品', '培植', '摄影家', '普及教育', '大道至简', '装帧', '雅昌以', '鉴证', '万捷', '核心技术', '衍生品']
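One way to get readable output for a list, not necessarily the approach used in the original post, is json.dumps from the standard library with ensure_ascii=False. The sketch below uses Python 3 syntax and a shortened list; both choices are assumptions made for illustration.

```python
import json

items = ["\u827a\u672f\u9986", "\u5b58\u50a8\u7ba1\u7406", "\u609f\u8005"]

escaped = json.dumps(items)                       # non-ASCII written as \uXXXX
readable = json.dumps(items, ensure_ascii=False)  # characters kept as they are

print(escaped)
print(len(escaped) > len(readable))                  # True
print(readable == '["' + '", "'.join(items) + '"]')  # True
```

The result is a JSON string with double quotes rather than a Python list repr, which is usually fine when the goal is only to read the contents.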

Portal

[The beauty of Python3 encoding](http://thief.one/2017/04/18/1/)

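Title: Python2 encoding sequel

Author: nmask

Published: April 14, 2017 - 17:04

Last updated: August 16, 2019 - 15:08

Original link: https://thief.one/2017/04/14/01/en/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please retain the original link and author when reposting.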
