Python Regex

yanfangsheng123 · Posted 2017-4-21 09:02:35

1.6 Python

Python provides a rich, Perl-like regular expression syntax in the re module. The re module uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2. This chapter covers the version of re included with Python 2.2, although the module has been available in similar form since Python 1.5.

1.6.1 Supported Metacharacters

The re module supports the metacharacters and metasequences listed in Table 1-21 through Table 1-25. For expanded definitions of each metacharacter, see Section 1.2.1.

Table 1-21. Character representations

Sequence	Meaning
\a	Alert (bell), x07 .
\b	Backspace, x08 , supported only in character class.
\n	Newline, x0A .
\r	Carriage return, x0D .
\f	Form feed, x0C .
\t	Horizontal tab, x09 .
\v	Vertical tab, x0B .
\octal	Character specified by up to three octal digits.
\xhh	Character specified by a two-digit hexadecimal code.
\uhhhh	Character specified by a four-digit hexadecimal code.
\Uhhhhhhhh	Character specified by an eight-digit hexadecimal code.
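
As a rough illustration of the escapes in Table 1-21, the sketch below checks a few of them against short sample strings. It assumes a current Python 3 interpreter rather than the Python 2.2 release the chapter describes (re.fullmatch and \u inside a pattern are later additions), and the sample strings are made up for the example.

```python
import re

# \x41 (hex), \101 (three octal digits) and \u0041 (four hex digits)
# all denote the letter 'A'.
same_letter = re.fullmatch(r'\x41\101\u0041', 'AAA')

# \b means backspace (0x08) only inside a character class.
backspace = re.search(r'[\b]', 'a\x08b')

# In a raw string, \t and \n are passed through and interpreted by re.
tab_newline = re.search(r'\t\n', 'x\t\ny')

print(same_letter is not None)   # True
print(backspace.start())         # 1
print(tab_newline.span())        # (1, 3)
```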

Table 1-22. Character classes and class-like constructs

Class	Meaning
[...]	Any character listed or contained within a listed range.
[^...]	Any character that is not listed and is not contained within a listed range.
.	Any character, except a newline (unless DOTALL mode).
\w	Word character, [a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\W	Non-word character, [^a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\d	Digit character, [0-9].
\D	Non-digit character, [^0-9].
\s	Whitespace character, [ \t\n\r\f\v].
\S	Nonwhitespace character, [^ \t\n\r\f\v].
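
The character classes in Table 1-22 can be tried out with a short sketch like the following. It assumes Python 3, and the sample text is invented for illustration.

```python
import re

text = 'id_42: ok\nnext line'

words = re.findall(r'\w+', text)            # runs of word characters
digits = re.findall(r'\d', text)            # single digits
non_space = re.findall(r'\S+', 'a b\tc')    # runs of non-whitespace

# Dot stops at the newline unless DOTALL is set.
first_line = re.match(r'.*', text).group()
everything = re.match(r'.*', text, re.DOTALL).group()

print(words)        # ['id_42', 'ok', 'next', 'line']
print(digits)       # ['4', '2']
print(non_space)    # ['a', 'b', 'c']
print(first_line)   # id_42: ok
```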

　　

Table 1-23. Anchors and zero-width tests

Sequence	Meaning
^	Start of string, or after any newline if in MULTILINE match mode.
\A	Start of search string, in all match modes.
$	End of search string or before a string-ending newline, or before any newline in MULTILINE match mode.
\Z	End of string, in any match mode.
\b	Word boundary.
\B	Not-word-boundary.
(?=...)	Positive lookahead.
(?!...)	Negative lookahead.
(?<=...)	Positive lookbehind.
(?<!...)	Negative lookbehind.
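
A minimal sketch of several anchors and zero-width tests from Table 1-23, assuming Python 3. The sample strings are illustrative only.

```python
import re

text = 'one\ntwo\n'

# ^ matches only at the very start unless MULTILINE is set.
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# \A ignores MULTILINE and always means the start of the string.
start_only = re.findall(r'\A\w+', text, re.MULTILINE)

# $ can match just before a string-ending newline.
dollar = re.search(r'two$', text)

# \b keeps 'cat' from matching inside 'concat'.
whole_word = re.findall(r'\bcat\b', 'cat concat cat.')

# Lookbehind and lookahead test the context without consuming it.
price = re.search(r'(?<=\$)\d+(?= USD)', 'cost: $25 USD').group()
not_followed = re.findall(r'foo(?!bar)', 'foobar foobaz')

print(starts_default)     # ['one']
print(starts_multiline)   # ['one', 'two']
print(dollar.span())      # (4, 7)
print(price)              # 25
```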

　　

Table 1-24. Comments and mode modifiers

Modifier/sequence	Mode character	Meaning
I or IGNORECASE	i	Case-insensitive matching.
L or LOCALE	L	Cause \w, \W, \b, and \B to use current locale's definition of alphanumeric.
M or MULTILINE or (?m)	m	^ and $ match next to embedded \n.
S or DOTALL or (?s)	s	Dot (.) matches newline.
U or UNICODE or (?u)	u	Cause \w, \W, \b, and \B to use Unicode definition of alphanumeric.
X or VERBOSE or (?x)	x	Ignore whitespace and allow comments (#) in pattern.
(?mode)		Turn listed modes (iLmsux) on for the entire regular expression.
(?#...)		Treat substring as a comment.
#...		Treat rest of line as a comment in VERBOSE mode.
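
The mode modifiers in Table 1-24 can be passed as flag arguments or embedded in the pattern. The following is a small sketch of both styles, assuming Python 3. The key=value sample text and the variable names are hypothetical.

```python
import re

# The same mode, once as a flag argument and once embedded in the pattern.
by_flag = re.search(r'spider', 'SPIDER-MAN', re.IGNORECASE)
by_inline = re.search(r'(?i)spider', 'SPIDER-MAN')

# Several flags are combined with the bitwise OR operator.
verbose = re.compile(r'''
    ^ (\w+)     # a word at the start of a line
    \s* = \s*   # equals sign with optional spaces
    (\d+) $     # digits running to the end of that line
    ''', re.VERBOSE | re.MULTILINE)
pairs = verbose.findall('a = 1\nb=22\n')

# (?#...) is an inline comment that works without VERBOSE.
commented = re.search(r'\d+(?#digits)px', 'width: 10px').group()

print(by_flag.group(), by_inline.group())   # SPIDER SPIDER
print(pairs)                                # [('a', '1'), ('b', '22')]
print(commented)                            # 10px
```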

　　

Table 1-25. Grouping, capturing, conditional, and control

Sequence	Meaning
(...)	Group subpattern and capture submatch into \1, \2, ...
(?P<name>...)	Group subpattern and capture submatch into named capture group, name.
(?P=name)	Match text matched by earlier named capture group, name.
\n	Contains the results of the nth earlier submatch.
(?:...)	Groups subpattern, but does not capture submatch.
...|...	Try subpatterns in alternation.
*	Match 0 or more times.
+	Match 1 or more times.
?	Match 1 or 0 times.
{n}	Match exactly n times.
{x,y}	Match at least x times but no more than y times.
*?	Match 0 or more times, but as few times as possible.
+?	Match 1 or more times, but as few times as possible.
??	Match 0 or 1 time, but as few times as possible.
{x,y}?	Match at least x times, no more than y times, and as few times as possible.
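
To make the grouping and quantifier rows of Table 1-25 more concrete, here is a hedged sketch assuming Python 3. The sample sentences and the HTML fragment are invented for the example.

```python
import re

# A named group and a backreference to it find a doubled word.
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'it was the the best')

# A numbered backreference does the same job.
numbered = re.search(r'\b(\w+) \1\b', 'it was the the best')

# (?:...) groups without capturing, so only one group is reported.
scheme = re.match(r'(?:https?|ftp)://(\w+)', 'http://example')

# Greedy * takes as much as it can; lazy *? takes as little as it can.
html = '<b>one</b><b>two</b>'
greedy = re.findall(r'<b>.*</b>', html)
lazy = re.findall(r'<b>.*?</b>', html)

# {x,y} bounds the number of repetitions.
bounded = re.findall(r'\d{2,3}', '1 22 333 4444')

print(doubled.group('word'))   # the
print(scheme.groups())         # ('example',)
print(greedy)                  # ['<b>one</b><b>two</b>']
print(lazy)                    # ['<b>one</b>', '<b>two</b>']
print(bounded)                 # ['22', '333', '444']
```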

1.6.2 re Module Objects and Functions

The re module defines all regular expression functionality. Pattern matching is done directly through module functions, or patterns are compiled into regular expression objects that can be used for repeated pattern matching. Information about the match, including captured groups, is retrieved through match objects.

Python's raw string syntax, r'' or r"", allows you to specify regular expression patterns without having to escape embedded backslashes. The raw-string pattern, r'\n', is equivalent to the regular string pattern, '\\n'. Python also provides triple-quoted raw strings for multiline regular expressions: r'''text''' and r"""text""".

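The equivalence between raw and regular string patterns described above can be checked directly. This is a small sketch assuming Python 3, and the Windows-style path is only a made-up sample.

```python
import re

# Both spellings produce the same two-character pattern: backslash, 'n'.
same_pattern = (r'\n' == '\\n')
pattern_length = len(r'\n')

# Matching one literal backslash needs r'\\' as a raw string,
# or '\\\\' as a regular string.
raw = re.search(r'\\', 'C:\\temp')
plain = re.search('\\\\', 'C:\\temp')

print(same_pattern, pattern_length)   # True 2
print(raw.start(), plain.start())     # 2 2
```
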
Module Functions

The re module defines the following functions and one exception.

compile(pattern [, flags])
Return a regular expression object with the optional mode modifiers, flags.

match(pattern, string [, flags])
Search for pattern at starting position of string, and return a match object or None if no match.

search(pattern, string [, flags])
Search for pattern in string, and return a match object or None if no match.

split(pattern, string [, maxsplit=0])
Split string on pattern. Limit the number of splits to maxsplit. Submatches from capturing parentheses are also returned.

sub(pattern, repl, string [, count=0])
Return a string with all or up to count occurrences of pattern in string replaced with repl. repl may be either a string or a function that takes a match object argument.

subn(pattern, repl, string [, count=0])
Perform sub( ) but return a tuple of the new string and the number of replacements.

findall(pattern, string)
Return matches of pattern in string. If pattern has capturing groups, returns a list of submatches or a list of tuples of submatches.

finditer(pattern, string)
Return an iterator over matches of pattern in string. For each match, the iterator returns a match object.

escape(string)
Return string with non-alphanumerics backslashed so that string can be matched literally.

exception error
Exception raised if an error occurs during compilation or matching. This is common if a string passed to a function is not a valid regular expression.
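
The module functions listed above can be exercised together on one sample string. The sketch below assumes Python 3; the sample text and variable names are illustrative and not part of the re module.

```python
import re

text = 'a1b22c333'

parts = re.split(r'\d+', text)                       # split on digit runs
parts_kept = re.split(r'(\d+)', text, maxsplit=1)    # captured separator is kept
replaced = re.sub(r'\d+', '#', text, count=2)        # only the first two runs
# repl may be a function that receives the match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
new_text, n = re.subn(r'\d', '', text)               # new string plus count
nums = re.findall(r'\d+', text)                      # no groups: whole matches
pairs = re.findall(r'([a-z])(\d+)', text)            # two groups: tuples
spans = [m.span() for m in re.finditer(r'\d+', text)]
literal = re.search(re.escape('1+1'), 'is 1+1 two?')

# An invalid pattern raises re.error at compile time.
try:
    re.compile('(')
    compiled_ok = True
except re.error:
    compiled_ok = False

print(parts)          # ['a', 'b', 'c', '']
print(parts_kept)     # ['a', '1', 'b22c333']
print(replaced)       # a#b#c333
print(doubled)        # a2b44c666
print(new_text, n)    # abc 6
print(pairs)          # [('a', '1'), ('b', '22'), ('c', '333')]
```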

RegExp

Regular expression objects are created with the re.compile function.

flags
Return the flags argument used when the object was compiled or 0.

groupindex
Return a dictionary that maps symbolic group names to group numbers.

pattern
Return the pattern string used when the object was compiled.

match(string [, pos [, endpos]])
search(string [, pos [, endpos]])
split(string [, maxsplit=0])
sub(repl, string [, count=0])
subn(repl, string [, count=0])
findall(string)
Same as the re module functions, except pattern is implied. pos and endpos give start and end string indexes for the match.
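
A short sketch of a compiled regular expression object, assuming Python 3. The pattern and the sample text are hypothetical; the point is how pos and endpos narrow the region that match and search examine.

```python
import re

word = re.compile(r'(?P<first>\w)\w*', re.IGNORECASE)

# Attributes of the compiled object.
pattern_text = word.pattern
first_index = word.groupindex['first']
ignores_case = bool(word.flags & re.IGNORECASE)

text = 'alpha beta gamma'

from_six = word.search(text, 6).group()     # search begins at index 6
clipped = word.match(text, 6, 8).group()    # endpos cuts the string at index 8
at_space = word.match(text, 5)              # index 5 is a space, so no match

print(first_index, ignores_case)   # 1 True
print(from_six)                    # beta
print(clipped)                     # be
print(at_space)                    # None
```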

Match Objects

Match objects are created by the match and search functions.

pos
endpos
Value of pos or endpos passed to search or match.

re
The regular expression object whose match or search returned this object.

string
String passed to match or search.

group([g1, g2, ...])
Return one or more submatches from capturing groups. Groups may be either numbers corresponding to capturing groups or strings corresponding to named capturing groups. Group zero corresponds to the entire match. If no arguments are provided, this function returns the entire match. Capturing groups that did not match have a result of None.

groups([default])
Return a tuple of the results of all capturing groups. Groups that did not match have the value None or default.

groupdict([default])
Return a dictionary of named capture groups, keyed by group name. Groups that did not match have the value None or default.

start([group])
Index of start of substring matched by group (or start of entire matched string if no group).

end([group])
Index of end of substring matched by group (or end of entire matched string if no group).

span([group])
Return a tuple of starting and ending indexes of group (or matched string if no group).

expand(template)
Return a string obtained by doing backslash substitution on template. Character escapes, numeric backreferences, and named backreferences are expanded.

lastgroup
Name of the last matching capture group, or None if no match or if the group had no name.

lastindex
Index of the last matching capture group, or None if no match.
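
The match object attributes and methods above can be seen side by side in a sketch like this one, assuming Python 3. The date pattern and the sample sentence are made up; the optional year group is left unmatched on purpose to show the default handling.

```python
import re

regex = re.compile(r'(?P<month>\d\d)/(?P<day>\d\d)(?:/(?P<year>\d{4}))?')
m = regex.search('due 12/30 or later')

whole = m.group()                        # group zero: the entire match
mixed = m.group(1, 'day')                # numbers and names can be mixed
all_groups = m.groups()                  # unmatched year is None
with_default = m.groups('????')          # or the supplied default
as_dict = m.groupdict(default='n/a')
positions = (m.start(), m.end(), m.span('day'))
expanded = m.expand(r'\g<day>.\1')       # named and numeric backreferences

print(whole)                     # 12/30
print(all_groups)                # ('12', '30', None)
print(positions)                 # (4, 9, (7, 9))
print(expanded)                  # 30.12
print(m.lastgroup, m.lastindex)  # day 2
```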

1.6.3 Unicode Support

re provides limited Unicode support. Strings may contain Unicode characters, and individual Unicode characters can be specified with \u. Additionally, the UNICODE flag causes \w, \W, \b, and \B to recognize all Unicode alphanumerics. However, re does not provide support for matching Unicode properties, blocks, or categories.
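
The effect of Unicode-aware word matching can be sketched as follows. One assumption to keep in mind: this uses Python 3, where str patterns behave as if UNICODE were always set and the re.ASCII flag restores the narrower [a-zA-Z0-9_] definition; on the Python 2 releases this chapter covers, the UNICODE flag had to be requested explicitly. The sample text is invented.

```python
import re

text = 'caf\u00e9 na\u00efve'   # 'café naïve'

# Unicode-aware \w treats accented letters as word characters.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w falls back to [a-zA-Z0-9_] and the words are split.
ascii_words = re.findall(r'\w+', text, re.ASCII)

# A single character can be specified with a \u escape.
e_acute_at = re.search('\u00e9', text).start()

print(unicode_words)   # ['café', 'naïve']
print(ascii_words)     # ['caf', 'na', 've']
print(e_acute_at)      # 3
```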

1.6.4 Examples

Example 1-13. Simple match

#Match Spider-Man, Spiderman, SPIDER-MAN, etc.
import re
dailybugle = 'Spider-Man Menaces City!'
pattern = r'spider[- ]?man.'
if re.match(pattern, dailybugle, re.IGNORECASE):
    print(dailybugle)

Example 1-14. Match and capture group

#Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
import re
date = '12/30/1969'
regex = re.compile(r'(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)')
match = regex.match(date)
if match:
    month = match.group(1)  #12
    day = match.group(2)    #30
    year = match.group(3)   #1969

Example 1-15. Simple substitution

#Convert <br> to <br /> for XHTML compliance
import re
text = 'Hello world. <br>'
regex = re.compile(r'<br>', re.IGNORECASE)
repl = r'<br />'
result = regex.sub(repl, text)

Example 1-16. Harder substitution

#urlify - turn URLs into HTML links
import re
text = 'Check the website, http://www.oreilly.com/catalog/repr.'
pattern = r'''
    \b                          # start at word boundary
    (                           # capture to \1
    (https?|telnet|gopher|file|wais|ftp) :
                                # resource and colon
    [\w/#~:.?+=&%@!\-] +?       # one or more valid chars
                                # take as little as possible
    )
    (?=                         # lookahead
    [.:?\-] *                   # for possible punc
    (?: [^\w/#~:.?+=&%@!\-]     # invalid character
    | $ )                       # or end of string
    )'''
regex = re.compile(pattern, re.IGNORECASE | re.VERBOSE)
result = regex.sub(r'<a href="\1">\1</a>', text)

1.6.5 Other Resources

Python's online documentation at http://www.python.org/doc/current/lib/module-re.html.
