@Pigmon 2016-11-30T05:07:32.000000Z

Python Regular Expressions

Python

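pattern	Description
.	Any single character except \n
\w	One word character: a letter, digit, or underscore ([a-zA-Z0-9_] for ASCII text)
\W	One non-word character
\b	Boundary between a word character and a non-word character
\s	One whitespace character [ \n\r\t\f\v]
\S	One non-whitespace character
\d	One digit [0-9]
^	Start of the string
$	End of the string
+	One or more occurrences of the pattern to its left
*	Zero or more occurrences of the pattern to its left
?	Zero or one occurrence of the pattern to its left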
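The boundary, whitespace, and anchor patterns in the table are easiest to see side by side. The following is a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = 'cat concat cat_1 9 lives'

# \b matches only at a word/non-word boundary, so the 'cat' inside
# 'concat' and 'cat_1' is skipped ('_' counts as a word character).
whole_words = re.findall(r'\bcat\b', text)

# \S+ grabs runs of non-whitespace; \s+ splits on runs of whitespace.
tokens = re.findall(r'\S+', text)
fields = re.split(r'\s+', 'a \t b\nc')

# ^ and $ anchor the pattern to the start and end of the string.
is_all_digits = re.search(r'^\d+$', '2016') is not None

print(whole_words)    # ['cat']
print(tokens)         # ['cat', 'concat', 'cat_1', '9', 'lives']
print(fields)         # ['a', 'b', 'c']
print(is_all_digits)  # True
```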

Basic examples

import re
str1 = 'a123b'
match = re.search(r'\w\w', str1)    # match.group() = 'a1'
match = re.search(r'\d\d', str1)    # match.group() = '12'
match = re.search(r'\d[a-z]', str1) # match.group() = '3b'
match = re.search(r'[0-9]+', str1)  # match.group() = '123'
match = re.search(r'i+', 'piigiii') # match.group() = 'ii'
match = re.search(r'\w\s*\w\w\w\w\s*', 'I have a pin.') # match.group() = 'I have ' (\s* matches zero or more whitespace characters)
text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w\s\.-]+\@[\w\s\.-]+', text) # match.group() is the whole string, because \s lets both character classes span spaces
# No match (match is None): ^ requires the match to begin at the start of the string
match = re.search(r'^b\w+', 'foobar')
# match.group() = 'bar'
match = re.search(r'b\w+', 'foobar')
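Because re.search returns None when nothing matches, calling match.group() on a failed search raises AttributeError. A minimal sketch of guarding against that follows; first_match is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the matched text, or `default` when the pattern is not found."""
    match = re.search(pattern, text)
    return match.group() if match else default

print(first_match(r'^b\w+', 'foobar'))  # None: ^ forces the match to start at index 0
print(first_match(r'b\w+', 'foobar'))   # bar
```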

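Group examples

text = 'purple alice-b@google.com monkey dishwasher'
# Unlike the example above, the patterns on both sides of @ are wrapped in parentheses; without them, group(n) for n > 0 would not exist
match = re.search(r'([\w\s\.-]+)\@([\w\s\.-]+)', text)
# group(0) = group() = 'purple alice-b@google.com monkey dishwasher'
# group(1) = 'purple alice-b', the contents of the first pair of parentheses
# group(2) = 'google.com monkey dishwasher', the contents of the second pair of parentheses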
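The groups above swallow the neighbouring words because \s sits inside both character classes. A sketch of the same search with \s removed, assuming only the address itself is wanted:

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'

# Without \s in the classes, neither side of @ can cross a space.
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   # alice-b@google.com
    print(match.group(1))  # alice-b
    print(match.group(2))  # google.com
    print(match.groups())  # ('alice-b', 'google.com')
```

match.groups() returns all captured groups at once as a tuple, which is handy when there are several.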

findall

text = 'a3b25c002d'
match = re.findall(r'[a-z][0-9]+', text) # match = ['a3', 'b25', 'c002']

findall and Groups

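A plain findall returns a list of matched strings. If the pattern contains groups, findall returns the groups instead of the whole match: with two or more groups, each list element is a tuple, one tuple per match of the pattern, and the tuple's elements are the groups of that match. (With exactly one group, each element is just that group's string.)
For example: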

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
# tuples = [('alice', 'google.com'), ('bob', 'abc.com')]
# ('alice', 'google.com') holds group(1) and group(2) of the pattern's first match
# Without groups (no parentheses), i.e. re.findall(r'[\w\.-]+\@[\w\.-]+', text),
# the result is ['alice@google.com', 'bob@abc.com']
# Collecting the groups as tuples in a list makes it easy to pull one component out of every match,
# for example every username:
for pair in tuples:
    print(pair[0])  # username

Options

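option	Description
IGNORECASE	Ignore case when matching
DOTALL	Let . match newlines as well, so a pattern such as .* can run past the end of a line
MULTILINE	For multi-line strings, let ^ and $ match at the start and end of every line, not only of the whole string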
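The options are passed as the flags argument, written re.IGNORECASE, re.DOTALL, and re.MULTILINE. The sketch below compares each search with and without its flag on a made-up two-line string.

```python
import re

text = 'Hello\nworld'

# IGNORECASE: lowercase 'hello' still matches 'Hello'.
ignore_case = re.search(r'hello', text, re.IGNORECASE).group()

# Without DOTALL, '.' stops at the newline; with it, '.' crosses lines.
without_dotall = re.search(r'Hello.*', text).group()
with_dotall = re.search(r'Hello.*', text, re.DOTALL).group()

# Without MULTILINE, '^' matches only at the start of the whole string.
without_multiline = re.findall(r'^\w+', text)
with_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# Flags can be combined with '|'.
combined = re.findall(r'^[a-z]+$', text, re.IGNORECASE | re.MULTILINE)

print(repr(without_dotall), repr(with_dotall))  # 'Hello' 'Hello\nworld'
print(without_multiline, with_multiline)        # ['Hello'] ['Hello', 'world']
```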

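Non-greedy matching

text = '<b>foo</b> and <i>so on</i>'
match = re.findall(r'<.*?>', text) # match = ['<b>', '</b>', '<i>', '</i>']
# Without the trailing ?, i.e.
# re.findall(r'<.*>', text)
# .* keeps going to the farthest possible match, which here is the whole string:
# ['<b>foo</b> and <i>so on</i>'] (the greedy form)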

Additional notes

Appending '?' to a repetition quantifier such as +, *, ?, {n}, or {m,n} makes it match as few times as possible.

match = re.findall(r'fo+', 'fooobarfoo')    # match = ['fooo', 'foo']
match = re.findall(r'fo+?', 'fooobarfoo')   # with ?: match = ['fo', 'fo']

Substitution

# re.sub(pattern, replacement, string)
text = 'id_032'
# \1 stands for group(1), \2 for group(2), and so on
output = re.sub(r'(id)_(\d+)', r'\1_008', text) # output = 'id_008'
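Backreferences can also reorder what was captured. The following sketch uses made-up input strings; the \g<1> form is the re module's explicit group reference, useful when a digit follows the group number, since a plain \1008 would not be parsed as group 1 followed by '008'.

```python
import re

# \1 and \2 refer to group(1) and group(2) of each match, so they can be swapped.
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'alice@google bob@abc')

# \g<1> marks where the group number ends before the literal digits.
renumbered = re.sub(r'(id)_(\d+)', r'\g<1>008', 'id_032')

print(swapped)     # google@alice abc@bob
print(renumbered)  # id008
```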

Repetition counts {m,n}

match = re.findall(r'fo{2}', 'fooobarfoo')   # ['foo', 'foo']
match = re.findall(r'fo{2,}', 'fooobarfoo')  # ['fooo', 'foo'] {2,} means 2 or more times
match = re.findall(r'fo{3,5}', 'fooobarfoo') # ['fooo']
match = re.findall(r'fo{,2}', 'fooobarfoof') # ['foo', 'foo', 'f'] {,2} means 0 to 2 times; note the extra 'f' at the end of this string