
[Experience Sharing] Python Regex

Posted on 2017-4-21 09:02:35






1.6 Python


Python provides a rich, Perl-like regular expression syntax in the re module. The re module uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2. This chapter covers the version of re included with Python 2.2, although the module has been available in similar form since Python 1.5.
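
One visible consequence of a Traditional NFA engine is that alternation is ordered. The sketch below illustrates that behavior; it assumes a current Python 3 interpreter rather than the Python 2.2 release described here, and the variable names are only illustrative.

```python
import re

# Assumes Python 3. Alternatives are tried left to right; the first one that
# lets the overall match succeed wins, even if a later one would match more.
first_wins = re.match(r'a|ab', 'ab').group()

# Reordering the alternatives changes the result.
longer_first = re.match(r'ab|a', 'ab').group()

print(first_wins)    # a
print(longer_first)  # ab
```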

1.6.1 Supported Metacharacters

The re module supports the metacharacters and metasequences listed in Table 1-21 through Table 1-25. For expanded definitions of each metacharacter, see Section 1.2.1.
  


Table 1-21. Character representations

Sequence      Meaning
\a            Alert (bell), \x07.
\b            Backspace, \x08, supported only in character class.
\n            Newline, \x0A.
\r            Carriage return, \x0D.
\f            Form feed, \x0C.
\t            Horizontal tab, \x09.
\v            Vertical tab, \x0B.
\octal        Character specified by up to three octal digits.
\xhh          Character specified by a two-digit hexadecimal code.
\uhhhh        Character specified by a four-digit hexadecimal code.
\Uhhhhhhhh    Character specified by an eight-digit hexadecimal code.
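
A few of these escapes can be checked with a short sketch. It assumes Python 3 (3.3 or later, where the engine itself understands \u in str patterns); the variable names are hypothetical.

```python
import re

# Assumes Python 3.3+. \x41, \101 (octal), and \u0041 all denote 'A'.
hex_match = re.match(r'\x41', 'A')
octal_match = re.match(r'\101', 'A')
unicode_match = re.match(r'\u0041', 'A')

# Inside a character class, \b is a backspace rather than a word boundary.
backspace_match = re.search(r'[\b]', 'a\x08b')

# \t matches a horizontal tab.
tab_match = re.search(r'\t', 'a\tb')

print(backspace_match.start(), tab_match.start())
```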

  


Table 1-22. Character classes and class-like constructs

Class     Meaning
[...]     Any character listed or contained within a listed range.
[^...]    Any character that is not listed and is not contained within a listed range.
.         Any character, except a newline (unless DOTALL mode).
\w        Word character, [a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\W        Non-word character, [^a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\d        Digit character, [0-9].
\D        Non-digit character, [^0-9].
\s        Whitespace character, [ \t\n\r\f\v].
\S        Nonwhitespace character, [^ \t\n\r\f\v].
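
The classes above might be exercised as follows. This is a minimal sketch assuming Python 3; the sample text and variable names are made up for illustration.

```python
import re

# Assumes Python 3; the sample text is invented for this sketch.
text = 'Order #42 shipped\nto Bob_7'

words = re.findall(r'\w+', text)    # runs of word characters
digits = re.findall(r'\d+', text)   # runs of digits

# A negated class: anything that is neither a vowel nor whitespace.
no_vowels = re.findall(r'[^aeiou\s]+', 'hello world')

# Dot stops at a newline unless DOTALL is set.
first_line = re.match(r'.*', text).group()
everything = re.match(r'.*', text, re.DOTALL).group()

print(words)
print(no_vowels)
```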


  


Table 1-23. Anchors and zero-width tests

Sequence    Meaning
^           Start of string, or after any newline if in MULTILINE match mode.
\A          Start of search string, in all match modes.
$           End of search string or before a string-ending newline, or before any newline in MULTILINE match mode.
\Z          End of string, in any match mode.
\b          Word boundary.
\B          Not-word-boundary.
(?=...)     Positive lookahead.
(?!...)     Negative lookahead.
(?<=...)    Positive lookbehind.
(?<!...)    Negative lookbehind.
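
The difference between the anchors is easiest to see on a string that ends with a newline. The following sketch assumes Python 3; the sample strings and names are illustrative only.

```python
import re

# Assumes Python 3; the sample strings are invented for this sketch.
text = 'one\ntwo\nthree\n'

# ^ matches only at the start of the string unless MULTILINE is set.
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# $ can match before a string-ending newline; \Z matches only at the very end.
dollar = re.search(r'three$', text)
slash_z = re.search(r'three\Z', text)

# Lookbehind and lookahead test the context without consuming it.
prices = re.findall(r'(?<=\$)\d+', 'cost $15, not 20')
q_without_u = re.findall(r'\bq(?!u)\w*', 'quit qatar iraq')

print(starts_multiline, prices, q_without_u)
```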


  


Table 1-24. Comments and mode modifiers

Modifier/sequence            Mode character    Meaning
I or IGNORECASE or (?i)      i                 Case-insensitive matching.
L or LOCALE or (?L)          L                 Cause \w, \W, \b, and \B to use current locale's definition of alphanumeric.
M or MULTILINE or (?m)       m                 ^ and $ match next to embedded \n.
S or DOTALL or (?s)          s                 Dot (.) matches newline.
U or UNICODE or (?u)         u                 Cause \w, \W, \b, and \B to use Unicode definition of alphanumeric.
X or VERBOSE or (?x)         x                 Ignore whitespace and allow comments (#) in pattern.
(?mode)                                        Turn listed modes (iLmsux) on for the entire regular expression.
(?#...)                                        Treat substring as a comment.
#...                                           Treat rest of line as a comment in VERBOSE mode.
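
Modes can be supplied either as flag arguments or inline at the start of the pattern. A minimal sketch, assuming Python 3 and using made-up sample data:

```python
import re

# Assumes Python 3; sample data is invented for this sketch.
text = 'Spam\nEGGS'

# Modes passed as a flags argument, combined with |.
by_flag = re.findall(r'^[a-z]+$', text, re.IGNORECASE | re.MULTILINE)

# The same modes set inline at the start of the pattern.
inline = re.findall(r'(?im)^[a-z]+$', text)

# VERBOSE ignores whitespace in the pattern and allows # comments.
verbose = re.compile(r'''
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
''', re.VERBOSE)
phone = verbose.match('555-1234').groups()

# (?#...) is a comment in any mode.
commented = re.match(r'ab(?#ignored)c', 'abc').group()

print(by_flag, phone, commented)
```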


  


Table 1-25. Grouping, capturing, conditional, and control

Sequence         Meaning
(...)            Group subpattern and capture submatch into \1, \2, ...
(?P<name>...)    Group subpattern and capture submatch into named capture group, name.
(?P=name)        Match text matched by earlier named capture group, name.
\n               Match the text captured by the nth earlier group.
(?:...)          Groups subpattern, but does not capture submatch.
...|...          Try subpatterns in alternation.
*                Match 0 or more times.
+                Match 1 or more times.
?                Match 1 or 0 times.
{n}              Match exactly n times.
{x,y}            Match at least x times but no more than y times.
*?               Match 0 or more times, but as few times as possible.
+?               Match 1 or more times, but as few times as possible.
??               Match 0 or 1 time, but as few times as possible.
{x,y}?           Match at least x times, no more than y times, and as few times as possible.
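
The grouping constructs and the greedy and lazy quantifiers can be compared side by side. The sketch below assumes Python 3; all inputs and names are hypothetical.

```python
import re

# Assumes Python 3; inputs are invented for this sketch.

# Numbered and named capture groups in one pattern.
m = re.match(r'(?P<key>\w+)=(\w+)', 'color=red')
pair = (m.group('key'), m.group(2))

# \1 and (?P=name) match the same text an earlier group matched.
doubled = re.search(r'\b(\w+) \1\b', 'it was was fine').group(1)
quoted = re.match(r'(?P<q>["\']).*?(?P=q)', '"a" and "b"').group()

# (?:...) groups without capturing, so findall returns only group 1.
units = re.findall(r'(\d+)(?:px|em)', '12px 3em 7pt')

# Greedy versus lazy quantifiers on the same input.
greedy = re.match(r'<.+>', '<b>bold</b>').group()
lazy = re.match(r'<.+?>', '<b>bold</b>').group()

print(pair, doubled, quoted, units, greedy, lazy)
```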



1.6.2 re Module Objects and Functions

The re module defines all regular expression functionality. Pattern matching is done directly through module functions, or patterns are compiled into regular expression objects that can be used for repeated pattern matching. Information about the match, including captured groups, is retrieved through match objects.

Python's raw string syntax, r'' or r"", allows you to specify regular expression patterns without having to escape embedded backslashes. The raw-string pattern, r'\n', is equivalent to the regular string pattern, '\\n'. Python also provides triple-quoted raw strings for multiline regular expressions: r'''text''' and r"""text""".

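The equivalence between raw and regular string patterns can be checked directly. This is a small sketch assuming Python 3, with hypothetical variable names.

```python
import re

# Assumes Python 3. Both literals hold a backslash followed by the letter n.
raw = r'\n'
escaped = '\\n'
same = (raw == escaped)

# Matching one literal backslash takes two characters in a raw string
# and four in a regular string.
path = 'C:\\temp'
raw_hit = re.search(r'\\', path).start()
plain_hit = re.search('\\\\', path).start()

print(same, raw_hit, plain_hit)
```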




Module Functions



The re module defines the following functions and one exception.

compile(pattern[, flags])
    Return a regular expression object with the optional mode modifiers, flags.
match(pattern, string[, flags])
    Search for pattern at starting position of string, and return a match object or None if no match.
search(pattern, string[, flags])
    Search for pattern in string, and return a match object or None if no match.
split(pattern, string[, maxsplit=0])
    Split string on pattern. Limit the number of splits to maxsplit. Submatches from capturing parentheses are also returned.
sub(pattern, repl, string[, count=0])
    Return a string with all or up to count occurrences of pattern in string replaced with repl. repl may be either a string or a function that takes a match object argument.
subn(pattern, repl, string[, count=0])
    Perform sub() but return a tuple of the new string and the number of replacements.
findall(pattern, string)
    Return matches of pattern in string. If pattern has one capturing group, return a list of its submatches; if it has several, return a list of tuples of submatches.
finditer(pattern, string)
    Return an iterator over matches of pattern in string. For each match, the iterator returns a match object.
escape(string)
    Return string with non-alphanumerics backslashed so that string can be matched literally.
exception error
    Exception raised if an error occurs during compilation or matching. This is common if a string passed to a function is not a valid regular expression.
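
The module functions listed above can be tried on small inputs. The sketch assumes Python 3 (keyword arguments are used for maxsplit and count); every input string and variable name is invented for illustration.

```python
import re

# Assumes Python 3; all inputs are invented for this sketch.
parts = re.split(r'\s*,\s*', 'a, b ,c')
# Capturing parentheses in the pattern add the separators to the result.
with_seps = re.split(r'([,;])', 'a,b;c')
limited = re.split(r',', 'a,b,c', maxsplit=1)

replaced = re.sub(r'\d+', '#', 'a1b22c333')
first_only = re.sub(r'\d+', '#', 'a1b22c333', count=1)
# repl may be a function that receives the match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1b22')
counted = re.subn(r'\d+', '#', 'a1b22c333')

pairs = re.findall(r'(\w)=(\d)', 'x=1 y=2')
spans = [m.span() for m in re.finditer(r'\d+', 'a1b22')]

# escape() makes a string safe to use as a literal pattern.
literal = re.escape('1+1=2')
literal_hit = re.search(literal, 'so 1+1=2 holds')

# An invalid pattern raises re.error.
try:
    re.compile('(')
    compiled = True
except re.error:
    compiled = False

print(parts, with_seps, counted, pairs, spans, compiled)
```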


RegExp



Regular expression objects are created with the re.compile function.

flags
    Return the flags argument used when the object was compiled or 0.
groupindex
    Return a dictionary that maps symbolic group names to group numbers.
pattern
    Return the pattern string used when the object was compiled.
match(string[, pos[, endpos]])
search(string[, pos[, endpos]])
split(string[, maxsplit=0])
sub(repl, string[, count=0])
subn(repl, string[, count=0])
findall(string)
    Same as the re module functions, except pattern is implied. pos and endpos give start and end string indexes for the match.
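
A compiled object's attributes and the pos and endpos arguments might be used like this. The sketch assumes Python 3, where the reported flags value can include defaults beyond those passed to compile; names and data are illustrative.

```python
import re

# Assumes Python 3; the pattern and text are invented for this sketch.
regex = re.compile(r'(?P<word>[a-z]+)', re.IGNORECASE)

source = regex.pattern                  # the original pattern string
names = dict(regex.groupindex)          # group name -> group number
ignorecase_set = bool(regex.flags & re.IGNORECASE)

text = '123 abc def'
# pos and endpos restrict the part of the string that is examined.
from_start = regex.match(text)               # text starts with digits
from_four = regex.match(text, 4).group()     # matching begins at index 4
bounded = regex.search(text, 4, 6).group()   # only text[4:6] is examined
words = regex.findall(text)

print(names, from_four, bounded, words)
```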





Match Objects



Match objects are created by the match and search functions.

pos
endpos
    Value of pos or endpos passed to search or match.
re
    The regular expression object whose match or search method returned this object.
string
    String passed to match or search.
group([g1, g2, ...])
    Return one or more submatches from capturing groups. Groups may be either numbers corresponding to capturing groups or strings corresponding to named capturing groups. Group zero corresponds to the entire match. If no arguments are provided, this function returns the entire match. Capturing groups that did not match have a result of None.
groups([default])
    Return a tuple of the results of all capturing groups. Groups that did not match have the value None or default.
groupdict([default])
    Return a dictionary of named capture groups, keyed by group name. Groups that did not match have the value None or default.
start([group])
    Index of start of substring matched by group (or start of entire matched string if no group).
end([group])
    Index of end of substring matched by group (or end of entire matched string if no group).
span([group])
    Return a tuple of starting and ending indexes of group (or matched string if no group).
expand(template)
    Return a string obtained by doing backslash substitution on template. Character escapes, numeric backreferences, and named backreferences are expanded.
lastgroup
    Name of the last matching capture group, or None if no match or if the group had no name.
lastindex
    Index of the last matching capture group, or None if no match.
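
The match object methods can be seen together on one match with an optional group that does not participate. This is a sketch assuming Python 3; the pattern and input are made up.

```python
import re

# Assumes Python 3; the pattern and input are invented for this sketch.
regex = re.compile(r'(?P<year>\d{4})-(?P<month>\d\d)(?:-(?P<day>\d\d))?')
m = regex.search('released 1969-12 in print')

whole = m.group()                 # same as m.group(0)
year_month = m.group('year', 2)   # names and numbers can be mixed
all_groups = m.groups()           # the unmatched day group is None
with_default = m.groupdict('01')  # default replaces None
position = (m.start(), m.end(), m.span('month'))
expanded = m.expand(r'\g<month>/\1')
last = (m.lastindex, m.lastgroup)

print(whole, all_groups, position, expanded, last)
```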

1.6.3 Unicode Support

re provides limited Unicode support. Strings may contain Unicode characters, and individual Unicode characters can be specified with \u. Additionally, the UNICODE flag causes \w, \W, \b, and \B to recognize all Unicode alphanumerics. However, re does not provide support for matching Unicode properties, blocks, or categories.
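
The effect of Unicode-aware \w can be sketched as follows. Note the assumption: this targets Python 3, where str patterns behave as if UNICODE were already set and the re.ASCII flag (not part of the version covered here) restores the narrow definition.

```python
import re

# Assumes Python 3, where Unicode matching is the default for str patterns.
text = 'na\u00efve caf\u00e9'

# \w covers accented letters by default; re.ASCII restores [a-zA-Z0-9_].
unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)

# A single code point can be written with \u in the pattern itself.
e_acute = re.search(r'\u00e9', text).start()

print(unicode_words, ascii_words, e_acute)
```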

1.6.4 Examples


Example 1-13. Simple match

#Match Spider-Man, Spiderman, SPIDER-MAN, etc.
import re
dailybugle = 'Spider-Man Menaces City!'
pattern    = r'spider[- ]?man.'
if re.match(pattern, dailybugle, re.IGNORECASE):
    print(dailybugle)

Example 1-14. Match and capture group

#Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
import re
date = '12/30/1969'
regex = re.compile(r'(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)')
match = regex.match(date)
if match:
    month = match.group(1) #12
    day   = match.group(2) #30
    year  = match.group(3) #1969

Example 1-15. Simple substitution

#Convert <br> to <br /> for XHTML compliance
import re
text  = 'Hello world. <br>'
regex = re.compile(r'<br>', re.IGNORECASE)
repl  = r'<br />'
result = regex.sub(repl, text)

Example 1-16. Harder substitution

#urlify - turn URLs into HTML links
import re
text = 'Check the website, http://www.oreilly.com/catalog/repr.'
pattern = r'''
    \b                          # start at word boundary
    (                           # capture to \1
    (https?|telnet|gopher|file|wais|ftp) :
                                # resource and colon
    [\w/#~:.?+=&%@!\-] +?       # one or more valid chars
                                # take as little as possible
    )
    (?=                         # lookahead
    [.:?\-] *                   #  for possible punc
    (?: [^\w/#~:.?+=&%@!\-]     #  invalid character
      | $ )                     #  or end of string
    )'''
regex = re.compile(pattern, re.IGNORECASE | re.VERBOSE)
result = regex.sub(r'<a href="\1">\1</a>', text)

1.6.5 Other Resources

  • Python's online documentation at http://www.python.org/doc/current/lib/module-re.html.





