Python regular expression

Speak quietly, the daughter is not beautiful

I haven't updated the blog for a long time, mainly because I didn't know what to write. I have been busy with security development recently and my learning has stalled for the moment, so I have had nothing new to share. I remember once being asked how to write regular expressions, so in this article I will use Python to summarize how they should be written.

Regular expression rules

If you want to write regular expressions, learn the rules first!

[]: Specifies a character set; '[abc\s\d]' matches any one of a, b, c, a whitespace character, or a digit.
{}: Repetition; a{8} matches aaaaaaaa, and b{1,3} matches b repeated 1 to 3 times.
^: Matches the beginning of the line; '^hello' matches a line that starts with hello. Inside a character set it negates instead: '[^abc]' matches any character except a, b, c.
$: Matches the end of the line; 'boy$' matches a line that ends with boy.
\: Escape character.
\d: [0-9]
\w: [a-zA-Z0-9_]
\s: Whitespace: a space, tab, or newline, including \r and \n.
*: Matches the previous character 0 or more times; 'ab*' matches a, ab, abb, abbb, and so on.
+: Matches the previous character 1 or more times.
?: Matches the previous character 0 or 1 times. Placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy, so it matches as little as possible.
.: Matches any character except the newline \n. With re.DOTALL it matches every character, including newlines: re.compile(r".*", re.DOTALL)
(?i): Adding this to the regular expression makes it ignore case, for example: res=r"(?i)[abc]"
(asp|php|jsp): Alternation. The difference from [abc] is that it matches whole strings, whereas [] matches only a single character.
r'((?:[\d]*){2,3})': Adding ?: at the start of the inner parentheses makes that group non-capturing, so the result contains only what the outer parentheses match.
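
The rules above can be tried out directly in an interpreter. The following is a minimal sketch in Python 3; the sample strings and variable names are made up for illustration and do not come from the original article.

```python
import re

# [] with {}: a character set repeated exactly twice.
two_digits = re.findall(r"[\d]{2}", "a1b22c333")

# ^ and $ anchor the match to the start and end of the line.
starts_with_hello = bool(re.search(r"^hello", "hello boy"))
ends_with_boy = bool(re.search(r"boy$", "hello boy"))

# Alternation matches whole strings; [] matches a single character.
extensions = re.findall(r"asp|php|jsp", "a.asp b.php")

# (?i) makes the pattern case-insensitive.
letters = re.findall(r"(?i)[abc]", "AbCd")

# . stops at a newline unless re.DOTALL is given.
first_line = re.match(r".*", "ab\ncd").group()
everything = re.match(r".*", "ab\ncd", re.DOTALL).group()

# (?:...) groups without capturing, so only the outer group is returned.
outer_only = re.findall(r"((?:\d\.){2,3})", "1.2.3.x")

print(two_digits)        # ['22', '33']
print(extensions)        # ['asp', 'php']
print(letters)           # ['A', 'b', 'C']
print(repr(first_line))  # 'ab'
print(repr(everything))  # 'ab\ncd'
print(outer_only)        # ['1.2.3.']
```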

Using regular expressions in Python

# Import the module
import re
res = r"[\d]{2}"  # regular expression
content = "abcd12345678abcd"  # string to search
a = re.search(res, content)
if a:
    print(a.group())
p_tel = re.compile(res)  # compiles the regular expression to improve speed
p_tel.findall(content)  # finds all substrings that the RE matches and returns them as a list
p_tel.match(content)  # determines whether the RE matches at the beginning of the string
p_tel.search(content)  # scans the string for the first location where the RE matches
a.group(0)  # group() belongs to the match object: 0 returns the whole match, a.group(num) returns subgroup num
p_tel.sub("#", content)  # sub(repl, string): replaces every match of the RE in string with repl
p_tel.split(content)  # split(string): splits string into a list, using the RE matches as separators
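
To see what each of these calls returns, here is a small sketch in Python 3 that reuses the article's pattern and sample string. The second pattern with two groups and the "#" replacement are illustrative additions, not part of the original.

```python
import re

content = "abcd12345678abcd"
p_tel = re.compile(r"[\d]{2}")

pairs = p_tel.findall(content)     # every non-overlapping match
at_start = p_tel.match(content)    # None, because content starts with a letter
first = p_tel.search(content)      # match object for the first hit
masked = p_tel.sub("#", content)   # replace each match with "#"
parts = p_tel.split(content)       # split on each match

# group() lives on the match object, not on the compiled pattern.
m = re.search(r"(\d{2})(\d{2})", content)

print(pairs)                                # ['12', '34', '56', '78']
print(at_start)                             # None
print(first.group(), first.span())          # 12 (4, 6)
print(masked)                               # abcd####abcd
print(parts)                                # ['abcd', '', '', '', 'abcd']
print(m.group(0), m.group(1), m.group(2))   # 1234 12 34
```

Note that split() produces empty strings wherever two matches sit next to each other.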

Greedy mode

Note that regular expression matching is greedy by default, which means it matches as many characters as possible.
Example:

>>> re.match(r'^(\d+)(0*)$', '123400').groups()
('123400', '')

Note: Because \d+ is greedy, it consumes all the trailing 0s, so 0* can only match the empty string. For 0* to match the trailing zeros, \d+ must be made non-greedy (matching as few characters as possible), which is done by appending a ?:

>>> re.match(r'^(\d+?)(0*)$', '123400').groups()
('1234', '00')
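
The same difference shows up when extracting text between delimiters, which is where greedy matching most often surprises people. A minimal sketch, using a made-up HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the LAST </b>, swallowing the tags in between.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the FIRST </b> it can.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']
```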

Regular expression examples

Regular expression matching an IP address (it does not check that each number is at most 255): res=r"(\d+)\.(\d+)\.(\d+)\.(\d+)"
Regular expression matching a blank line: res=r"\n\s*\r"
Regular expression matching HTML tags: res=r"<(.*)>.*</\1>|<(.*) />"
Regular expression matching leading and trailing whitespace: res=r"(^\s*)|(\s*$)"
Regular expression matching Chinese characters: res=u"[\u4e00-\u9fa5]" # in Python 2 the pattern must be a unicode string
Regular expression matching double-byte characters (including Chinese characters): res=r"[^\x00-\xff]"
user_phone=r"((?:13[0-9]|14[579]|15[0-35-9]|17[0135678]|18[0-9])\d{8})" # mobile phone number
user_id_18=r"([1-9]\d{5}(?:1[89]\d{2}|20[01]\d)(?:0[1-9]|1[0-2])(?:0[1-9]|1[0-9]|2[0-9]|3[01])\d{3}[\dxX])" # 18-digit ID number
user_id_15=r"([1-9]\d{7}(?:0[1-9]|1[0-2])(?:0[1-9]|1[0-9]|2[0-9]|3[01])\d{2}[\dxX])" # 15-digit ID number
car_id=u"([京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼]{1}[A-Z]{1}[A-Z0-9]{4}[A-Z0-9挂学警港澳]{1})" # license plate number
email=r"([a-z_0-9.-]{2,64}@[a-z0-9-]{2,200}\.[a-z]{2,6})"
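Writing a Unicode regular expression in Python 2 (the ur prefix is a syntax error in Python 3, where a plain r"[\u4e00-\u9fa5]" works):
res=ur"[\u4e00-\u9fa5]"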
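
The IP pattern above accepts any run of digits, so 999.1.1.1 also matches. One possible way to tighten it, sketched here in Python 3, is to keep the pattern and check the captured numbers afterwards. The helper name is_ipv4 and the sample strings are hypothetical, and this sketch still accepts leading zeros such as 01.2.3.4. The last lines show the Chinese-character pattern written for Python 3.

```python
import re

# Four dot-separated groups of ASCII digits; re.ASCII keeps \d to 0-9.
IP_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", re.ASCII)

def is_ipv4(text):
    """Return True if text is four dot-separated numbers, each 0 to 255."""
    m = IP_RE.fullmatch(text)  # fullmatch rejects leading or trailing extras
    if not m:
        return False
    return all(int(part) <= 255 for part in m.groups())

print(is_ipv4("192.168.1.1"))  # True
print(is_ipv4("999.1.1.1"))    # False
print(is_ipv4("1.2.3"))        # False

# In Python 3 every str is Unicode, so no u/ur prefix is needed.
chinese = re.findall(r"[\u4e00-\u9fa5]", "regex 正则 ok")
print(chinese)                 # ['正', '则']
```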

Regular expression testing platform

It is recommended to test a regular expression before using it in code. For convenience, I wrote an online tool for this: http://tool.nmask.cn/python_re/
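
If an online tool is not at hand, the same kind of check can be done with a few lines of code. The following is only a sketch in Python 3: check_pattern is a hypothetical helper, and the sample addresses are invented. It runs the email pattern from the examples above against strings that should and should not match.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the samples on which the pattern behaves unexpectedly."""
    compiled = re.compile(pattern)
    failures = []
    for sample in should_match:
        if not compiled.fullmatch(sample):
            failures.append(sample)
    for sample in should_not_match:
        if compiled.fullmatch(sample):
            failures.append(sample)
    return failures

email = r"([a-z_0-9.-]{2,64}@[a-z0-9-]{2,200}\.[a-z]{2,6})"
failures = check_pattern(
    email,
    should_match=["user_1@example.com"],
    should_not_match=["no-at-sign.example.com", "a@b.c"],
)
print(failures)  # []
```

An empty list means every sample behaved as expected; any string that comes back is a case the pattern gets wrong.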

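Title: Python regular expression

Author: nmask

Published: October 30, 2017 - 19:10

Last updated: August 16, 2019 - 15:08

Original link: https://thief.one/2017/10/30/1/en/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.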
