
[Experience Sharing] Python Regular Expression HOWTO

Posted on 2017-4-28 10:33:15
  

Regular Expressions in Python 3


  • By Mark Summerfield
  • Jan 13, 2009
  • Sample Chapter is provided courtesy of Addison-Wesley Professional



Original article: http://www.informit.com/articles/article.aspx?p=1310965






This chapter introduces and explains all the key regular expression concepts and shows pure regular expression syntax, and then shows how to use regular expressions in the context of Python programming.


  • Python’s Regular Expression Language
  • The Regular Expression Module

A regular expression is a compact notation for representing a collection of strings. What makes regular expressions so powerful is that a single regular expression can represent an unlimited number of strings, providing they meet the regular expression’s requirements. Regular expressions (which we will mostly call “regexes” from now on) are defined using a mini-language that is completely different from Python, but Python includes the re module through which we can seamlessly create and use regexes.

Regexes are used for four main purposes:


  • Validation: checking whether a piece of text meets some criteria, for example, contains a currency symbol followed by digits
  • Searching: locating substrings that can have more than one form, for example, finding any of “pet.png”, “pet.jpg”, “pet.jpeg”, or “pet.svg” while avoiding “carpet.png” and similar
  • Searching and replacing: replacing everywhere the regex matches with a string, for example, finding “bicycle” or “human powered vehicle” and replacing either with “bike”
  • Splitting strings: splitting a string at each place the regex matches, for example, splitting everywhere “: ” or “=” is encountered

At its simplest a regular expression is an expression (e.g., a literal character), optionally followed by a quantifier. More complex regexes consist of any number of quantified expressions and may include assertions and may be influenced by flags.
This chapter’s first section introduces and explains all the key regular expression concepts and shows pure regular expression syntax; it makes minimal reference to Python itself. Then the second section shows how to use regular expressions in the context of Python programming, drawing on all the material covered in the earlier sections. Readers familiar with regular expressions who just want to learn how they work in Python could skip to the second section. The chapter covers the complete regex language offered by the re module, including all the assertions and flags. We indicate regular expressions in the text using bold, show where they match using underlining, and show captures using shading.


Python’s Regular Expression Language

In this section we look at the regular expression language in four subsections. The first subsection shows how to match individual characters or groups of characters, for example, match a, or match b, or match either a or b. The second subsection shows how to quantify matches, for example, match once, or match at least once, or match as many times as possible. The third subsection shows how to group subexpressions and how to capture matching text, and the final subsection shows how to use the language’s assertions and flags to affect how regular expressions work.


Characters and Character Classes

The simplest expressions are just literal characters, such as a or 5, and if no quantifier is explicitly given it is taken to be “match one occurrence”. For example, the regex tune consists of four expressions, each implicitly quantified to match once, so it matches one t followed by one u followed by one n followed by one e, and hence matches the string tune and the tune inside attuned.
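A minimal sketch of literal matching with Python's re module follows; the sample strings are chosen only for illustration.

```python
import re

# Each literal character is implicitly quantified to match exactly once.
pattern = re.compile(r"tune")

for text in ("tune", "attuned", "tun"):
    match = pattern.search(text)
    # span() gives the start and end offsets of the matched text.
    print(text, "->", match.span() if match else None)
```

The regex finds tune anywhere in the searched text, which is why attuned matches but the shorter tun does not.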

Although most characters can be used as literals, some are “special characters”: these are symbols in the regex language and so must be escaped by preceding them with a backslash (\) to use them as literals. The special characters are \.^$?+*{}[]()|. Most of Python’s standard string escapes can also be used within regexes, for example, \n for newline and \t for tab, as well as hexadecimal escapes for characters using the \xHH, \uHHHH, and \UHHHHHHHH syntaxes.


 



In many cases, rather than matching one particular character we want to match any one of a set of characters. This can be achieved by using a character class: one or more characters enclosed in square brackets. (This has nothing to do with a Python class, and is simply the regex term for “set of characters”.) A character class is an expression, and like any other expression, if not explicitly quantified it matches exactly one character (which can be any of the characters in the character class). For example, the regex r[ea]d matches both red and the rad in radar, but not read. Similarly, to match a single digit we can use the regex [0123456789]. For convenience we can specify a range of characters using a hyphen, so the regex [0-9] also matches a digit. It is possible to negate the meaning of a character class by following the opening bracket with a caret, so [^0-9] matches any character that is not a digit.
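The character classes described above can be tried out as follows. This is only a sketch; the word list and the sample string are illustrative assumptions.

```python
import re

# r[ea]d needs exactly one of "e" or "a" between the "r" and the "d".
words = ["red", "radar", "read", "rod"]
matching_words = [w for w in words if re.search(r"r[ea]d", w)]

sample = "a1b22c"
digits = re.findall(r"[0-9]", sample)       # any single ASCII digit
non_digits = re.findall(r"[^0-9]", sample)  # negated class: anything else

print(matching_words, digits, non_digits)
```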

Note that inside a character class, apart from \, the special characters lose their special meaning, although in the case of ^ it acquires a new meaning (negation) if it is the first character in the character class, and otherwise is simply a literal caret. Also, - signifies a character range unless it is the first or last character, in which case it is a literal hyphen.

Since some sets of characters are required so frequently, several have shorthand forms; these are shown in Table 12.1. With one exception the shorthands can be used inside character sets, so for example, the regex [\dA-Fa-f] matches any hexadecimal digit. The exception is . which is a shorthand outside a character class but matches a literal . inside a character class.


Table 12.1 Character Class Shorthands

Symbol    Meaning
.         Matches any character except newline; or any character at all with the re.DOTALL flag; or inside a character class matches a literal .
\d        Matches a Unicode digit; or [0-9] with the re.ASCII flag
\D        Matches a Unicode nondigit; or [^0-9] with the re.ASCII flag
\s        Matches a Unicode whitespace; or [ \t\n\r\f\v] with the re.ASCII flag
\S        Matches a Unicode nonwhitespace; or [^ \t\n\r\f\v] with the re.ASCII flag
\w        Matches a Unicode “word” character; or [a-zA-Z0-9_] with the re.ASCII flag
\W        Matches a Unicode non-“word” character; or [^a-zA-Z0-9_] with the re.ASCII flag
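The effect of the shorthands and of the two flags mentioned in Table 12.1 can be sketched like this. The test strings, including the Arabic-Indic digit, are assumptions picked to make the differences visible.

```python
import re

# A shorthand used inside a character class: any hexadecimal digit.
hex_digits = re.findall(r"[\dA-Fa-f]", "0x1F, g7")

# "." skips newlines unless re.DOTALL is given.
dot_default = re.match(r"a.b", "a\nb")
dot_all = re.match(r"a.b", "a\nb", re.DOTALL)

# \d matches any Unicode digit unless re.ASCII restricts it to [0-9].
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digit = re.match(r"\d", arabic_three)
ascii_digit = re.match(r"\d", arabic_three, re.ASCII)

print(hex_digits, bool(dot_default), bool(dot_all),
      bool(unicode_digit), bool(ascii_digit))
```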



 




Quantifiers

A quantifier has the form {m,n} where m and n are the minimum and maximum times the expression the quantifier applies to must match. For example, both e{1,1}e{1,1} and e{2,2} match the ee in feel, but neither matches felt.

Writing a quantifier after every expression would soon become tedious, and is certainly difficult to read. Fortunately, the regex language supports several convenient shorthands. If only one number is given in the quantifier it is taken to be both the minimum and the maximum, so e{2} is the same as e{2,2}. And as we noted in the preceding section, if no quantifier is explicitly given, it is assumed to be one (i.e., {1,1} or {1}); therefore, ee is the same as e{1,1}e{1,1} and e{1}e{1}, so both e{2} and ee match the ee in feel but not felt.

Having a different minimum and maximum is often convenient. For example, to match travelled and traveled (both legitimate spellings), we could use either travel{1,2}ed or travell{0,1}ed. The {0,1} quantification is so often used that it has its own shorthand form, ?, so another way of writing the regex (and the one most likely to be used in practice) is travell?ed.
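The three equivalent spellings of this regex can be compared with a short sketch; the misspelled third word is an added assumption used as a negative case.

```python
import re

candidates = ("traveled", "travelled", "travellled")
patterns = (r"travel{1,2}ed", r"travell{0,1}ed", r"travell?ed")

# fullmatch() requires the whole string to match the regex.
results = {
    pattern: [bool(re.fullmatch(pattern, word)) for word in candidates]
    for pattern in patterns
}

for pattern, flags in results.items():
    print(pattern, flags)
```

All three patterns accept the two legitimate spellings and reject the one with three l characters.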

Two other quantification shorthands are provided: + which stands for {1,n} (“at least one”) and * which stands for {0,n} (“any number of”); in both cases n is the maximum possible number allowed for a quantifier, usually at least 32 767. All the quantifiers are shown in Table 12.2.


Table 12.2 Regular Expression Quantifiers

Syntax              Meaning
e? or e{0,1}        Greedily match zero or one occurrence of expression e
e?? or e{0,1}?      Nongreedily match zero or one occurrence of expression e
e+ or e{1,}         Greedily match one or more occurrences of expression e
e+? or e{1,}?       Nongreedily match one or more occurrences of expression e
e* or e{0,}         Greedily match zero or more occurrences of expression e
e*? or e{0,}?       Nongreedily match zero or more occurrences of expression e
e{m}                Match exactly m occurrences of expression e
e{m,}               Greedily match at least m occurrences of expression e
e{m,}?              Nongreedily match at least m occurrences of expression e
e{,n}               Greedily match at most n occurrences of expression e
e{,n}?              Nongreedily match at most n occurrences of expression e
e{m,n}              Greedily match at least m and at most n occurrences of expression e
e{m,n}?             Nongreedily match at least m and at most n occurrences of expression e




The + quantifier is very useful. For example, to match integers we could use \d+ since this matches one or more digits. This regex could match in two places in the string 4588.91: the 4588 before the decimal point and the 91 after it. Sometimes typos are the result of pressing a key too long. We could use the regex bevel+ed to match the legitimate beveled and bevelled, and the incorrect bevellled. If we wanted to standardize on the one l spelling, and match only occurrences that had two or more ls, we could use bevell+ed to find them.
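Both uses of + can be checked with re.findall(), as in this small sketch built on the strings from the text.

```python
import re

# \d+ matches each maximal run of digits.
numbers = re.findall(r"\d+", "4588.91")

spellings = "beveled bevelled bevellled"
one_or_more_l = re.findall(r"bevel+ed", spellings)   # any number of l's >= 1
two_or_more_l = re.findall(r"bevell+ed", spellings)  # only the doubled forms

print(numbers, one_or_more_l, two_or_more_l)
```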

The * quantifier is less useful, simply because it can so often lead to unexpected results. For example, supposing that we want to find lines that contain comments in Python files, we might try searching for #*. But this regex will match any line whatsoever, including blank lines, because the meaning is “match any number of #s”, and that includes none. As a rule of thumb for those new to regexes, avoid using * at all, and if you do use it (or if you use ?), make sure there is at least one other expression in the regex that has a nonzero quantifier, so at least one quantifier other than * or ? since both of these can match their expression zero times.
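The pitfall is easy to reproduce. In the sketch below the four sample lines are hypothetical stand-ins for the lines of a Python file.

```python
import re

lines = ["x = 1", "", "# a comment", "y = 2  # trailing"]

# "#*" can match zero "#" characters, so it succeeds on every line,
# including the empty one.
star_hits = [line for line in lines if re.search(r"#*", line)]

# "#+" requires at least one "#", which is what was intended.
plus_hits = [line for line in lines if re.search(r"#+", line)]

print(len(star_hits), plus_hits)
```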

It is often possible to convert * uses to + uses and vice versa. For example, we could match “tasselled” with at least one l using tassell*ed or tassel+ed, and match those with two or more ls using tasselll*ed or tassell+ed.

If we use the regex \d+ it will match 136. But why does it match all the digits, rather than just the first one? By default, all quantifiers are greedy: they match as many characters as they can. We can make any quantifier nongreedy (also called minimal) by following it with a ? symbol. (The question mark has two different meanings: on its own it is a shorthand for the {0,1} quantifier, and when it follows a quantifier it tells the quantifier to be nongreedy.) For example, \d+? can match the string 136 in three different places: the 1, the 3, and the 6. Here is another example: \d?? matches zero or one digits, but prefers to match none since it is nongreedy; on its own it suffers the same problem as * in that it will match nothing, that is, any text at all.
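Greedy and nongreedy matching on the string 136 can be compared directly; this is a minimal sketch using only the examples from the text.

```python
import re

greedy = re.findall(r"\d+", "136")   # one match covering all the digits
lazy = re.findall(r"\d+?", "136")    # three one-digit matches

# \d?? prefers to match nothing, so the match at the start is empty.
empty = re.match(r"\d??", "136").group()

print(greedy, lazy, repr(empty))
```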

Nongreedy quantifiers can be useful for quick and dirty XML and HTML parsing. For example, to match all the image tags, writing <img.*> (match one “<”, then one “i”, then one “m”, then one “g”, then zero or more of any character apart from newline, then one “>”) will not work because the .* part is greedy and will match everything including the tag’s closing >, and will keep going until it reaches the last > in the entire text.

Three solutions present themselves (apart from using a proper parser). One is <img[^>]*> (match <img, then any number of non-> characters and then the tag’s closing > character), another is <img.*?> (match <img, then any number of characters, but nongreedily, so it will stop immediately before the tag’s closing >, and then the >), and a third combines both, as in <img[^>]*?>. None of them is correct, though, since they can all match <img>, which is not valid. Since we know that an image tag must have a src attribute, a more accurate regex is <img\s+[^>]*?src=\w+[^>]*?>. This matches the literal characters <img, then one or more whitespace characters, then nongreedily zero or more of anything except > (to skip any other attributes such as alt), then the src attribute (the literal characters src= then at least one “word” character), and then any other non-> characters (including none) to account for any other attributes, and finally the closing >.
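The behavior of these image-tag regexes can be compared on a small fragment. The HTML below is an invented example with unquoted attribute values, since the src=\w+ part of the refined regex does not allow for quotes.

```python
import re

text = "<p><img src=cat.png alt=cat> and <img src=dog.png> but not <img></p>"

# Greedy: one match running from the first "<img" to the last ">".
greedy = re.findall(r"<img.*>", text)

# Two of the simple fixes: each finds the tags, including the invalid <img>.
negated = re.findall(r"<img[^>]*>", text)
lazy = re.findall(r"<img.*?>", text)

# The refined regex insists on whitespace and a src attribute.
refined = re.findall(r"<img\s+[^>]*?src=\w+[^>]*?>", text)

print(greedy, negated, lazy, refined, sep="\n")
```

For anything beyond quick scripts a proper HTML parser remains the safer choice, as the text notes.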


Grouping and Capturing

In practical applications we often need regexes that can match any one of two or more alternatives, and we often need to capture the match or some part of the match for further processing. Also, we sometimes want a quantifier to apply to several expressions. All of these can be achieved by grouping with (), and in the case of alternatives using alternation with |.

Alternation is especially useful when we want to match any one of several quite different alternatives. For example, the regex aircraft|airplane|jet will match any text that contains “aircraft” or “airplane” or “jet”. The same thing can be achieved using the regex air(craft|plane)|jet. Here, the parentheses are used to group expressions, so we have two outer expressions, air(craft|plane) and jet. The first of these has an inner expression, craft|plane, and because this is preceded by air the first outer expression can match only “aircraft” or “airplane”.

Parentheses serve two different purposes: to group expressions and to capture the text that matches an expression. We will use the term group to refer to a grouped expression whether it captures or not, and capture and capture group to refer to a captured group. If we used the regex (aircraft|airplane|jet) it would not only match any of the three expressions, but would also capture whichever one was matched for later reference. Compare this with the regex (air(craft|plane)|jet) which has two captures if the first expression matches (“aircraft” or “airplane” as the first capture and “craft” or “plane” as the second capture), and one capture if the second expression matches (“jet”). We can switch off the capturing effect by following an opening parenthesis with ?:, so for example, (air(?:craft|plane)|jet) will have only one capture if it matches (“aircraft” or “airplane” or “jet”).
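In Python the captures are available from the match object's groups() method. The sketch below uses made-up sentences; note that a capture group that did not take part in the match is reported as None.

```python
import re

nested = re.compile(r"(air(craft|plane)|jet)")
non_capturing = re.compile(r"(air(?:craft|plane)|jet)")

plane_groups = nested.search("an airplane landed").groups()
jet_groups = nested.search("a jet landed").groups()
single_group = non_capturing.search("an airplane landed").groups()

print(plane_groups, jet_groups, single_group)
```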

A grouped expression is an expression and so can be quantified. Like any other expression the quantity is assumed to be one unless explicitly given. For example, if we have read a text file with lines of the form key=value, where each key is alphanumeric, the regex (\w+)=(.+) will match every line that has a nonempty key and a nonempty value. (Recall that . matches anything except newlines.) And for every line that matches, two captures are made, the first being the key and the second being the value.

For example, the key=value regular expression will match the entire line topic= physical geography, with topic as the first capture and “ physical geography” (including its leading space) as the second. Notice that the second capture includes some whitespace, and that whitespace before the = is not accepted. We could refine the regex to be more flexible in accepting whitespace, and to strip off unwanted whitespace using a somewhat longer version:


[ \t]*(\w+)[ \t]*=[ \t]*(.+)


This matches the same line as before and also lines that have whitespace around the = sign, but with the first capture having no leading or trailing whitespace, and the second capture having no leading whitespace. For example, in the line topic = physical geography the captures are topic and physical geography.
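The difference between the two key=value regexes can be seen in a short sketch; the input lines are the ones used in the text, with extra spaces added as an assumption to exercise the whitespace handling.

```python
import re

simple = re.compile(r"(\w+)=(.+)")
flexible = re.compile(r"[ \t]*(\w+)[ \t]*=[ \t]*(.+)")

# The simple version keeps the space after "=" in the value...
simple_groups = simple.match("topic= physical geography").groups()
# ...and rejects whitespace before the "=".
simple_rejects = simple.match("topic = physical geography")

# The flexible version tolerates and discards the surrounding whitespace.
flexible_groups = flexible.match("  topic =  physical geography").groups()

print(simple_groups, simple_rejects, flexible_groups)
```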

We have been careful to keep the whitespace matching parts outside the capturing parentheses, and to allow for lines that have no whitespace at all. We did not use \s to match whitespace because that matches newlines (\n) which could lead to incorrect matches that span lines (e.g., if the re.MULTILINE flag is used). And for the value we did not use \S to match nonwhitespace because we want to allow for values that contain whitespace (e.g., English sentences). To avoid the second capture having trailing whitespace we would need a more sophisticated regex; we will see this in the next subsection.


 




Captures can be referred to using backreferences, that is, by referring back to an earlier capture group. One syntax for backreferences inside regexes themselves is \i where i is the capture number. Captures are numbered starting from one and increasing by one going from left to right as each new (capturing) left parenthesis is encountered. For example, to simplistically match duplicated words we can use the regex (\w+)\s+\1 which matches a “word”, then at least one whitespace, and then the same word as was captured. (Capture number 0 is created automatically without the need for parentheses; it holds the entire match, that is, what we show underlined.) We will see a more sophisticated way to match duplicate words later.
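A brief sketch of the numbered backreference follows, including a case that shows why the text calls this regex simplistic. Both sentences are invented for the example.

```python
import re

duplicate = re.compile(r"(\w+)\s+\1")

match = duplicate.search("it is is wrong")
whole = match.group(0)  # capture 0: the entire match
word = match.group(1)   # capture 1: the repeated word

# Without word boundaries the regex also fires inside "this is".
false_positive = duplicate.search("this is fine").group(0)

print(repr(whole), repr(word), repr(false_positive))
```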

In long or complicated regexes it is often more convenient to use names rather than numbers for captures. This can also make maintenance easier since adding or removing capturing parentheses may change the numbers but won’t affect names. To name a capture we follow the opening parenthesis with ?P<name>. For example, (?P<key>\w+)=(?P<value>.+) has two captures called "key" and "value". The syntax for backreferences to named captures inside a regex is (?P=name). For example, (?P<word>\w+)\s+(?P=word) matches duplicate words using a capture called "word".
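Named captures can be read back by name from the match object, as this sketch shows; the input strings are illustrative.

```python
import re

key_value = re.match(r"(?P<key>\w+)=(?P<value>.+)", "color=dark red")
key = key_value.group("key")
captures = key_value.groupdict()  # all named captures as a dict

# A named backreference: (?P=word) must repeat what "word" captured.
duplicate = re.search(r"(?P<word>\w+)\s+(?P=word)", "it is is wrong")
repeated = duplicate.group("word")

print(key, captures, repeated)
```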


Assertions and Flags

One problem that affects many of the regexes we have looked at so far is that they can match more or different text than we intended. For example, the regex aircraft|airplane|jet will match “waterjet” and “jetski” as well as “jet”. This kind of problem can be solved by using assertions. An assertion does not match any text, but instead says something about the text at the point where the assertion occurs.

One assertion is \b (word boundary), which asserts that the character that precedes it must be a “word” (\w) and the character that follows it must be a non-“word” (\W), or vice versa; the start and end of the text also count as non-“word” positions. For example, although the regex jet can match twice in the text the jet and jetski are noisy (once at the standalone jet and once at the start of jetski), the regex \bjet\b will match only once, at the standalone jet. In the context of the original regex, we could write it either as \baircraft\b|\bairplane\b|\bjet\b or more clearly as \b(?:aircraft|airplane|jet)\b, that is, word boundary, noncapturing expression, word boundary.
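The word-boundary assertion can be checked against the sentence from the text. The second test string is an added assumption.

```python
import re

text = "the jet and jetski are noisy"

plain_count = len(re.findall(r"jet", text))
# Record where the bounded regex matches: only the standalone word.
bounded_starts = [m.start() for m in re.finditer(r"\bjet\b", text)]

# The noncapturing group lets one pair of \b assertions cover all alternatives.
vehicles = re.findall(r"\b(?:aircraft|airplane|jet)\b", "waterjet, jetski, jet")

print(plain_count, bounded_starts, vehicles)
```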

Many other assertions are supported, as shown in Table 12.3. We could use assertions to improve the clarity of a key=value regex, for example, by changing it to ^(\w+)=([^\n]+) and setting the re.MULTILINE flag to ensure that each key=value is taken from a single line with no possibility of spanning lines. (The flags are shown in Table 12.5, and the syntaxes for using them are described at the end of this subsection and are shown in the next section.) And if we also want to strip leading and trailing whitespace and use named captures, the full regex becomes:


^[ \t]*(?P<key>\w+)[ \t]*=[ \t]*(?P<value>[^\n]+)(?<![ \t])
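As a sketch of how this regex behaves over several lines, the snippet below compiles it with re.MULTILINE and runs it on a made-up block of text. The name KEY_VALUE and the sample text are assumptions for illustration.

```python
import re

KEY_VALUE = re.compile(
    r"^[ \t]*(?P<key>\w+)[ \t]*=[ \t]*(?P<value>[^\n]+)(?<![ \t])",
    re.MULTILINE,  # let ^ match at the start of every line
)

text = "topic = physical geography  \n  size=10\nno equals here\n"

# The negative lookbehind forces the value to end on a non-space character,
# so trailing spaces and tabs are left out of the capture.
pairs = [(m.group("key"), m.group("value")) for m in KEY_VALUE.finditer(text)]

print(pairs)
```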


Table 12.3 Regular Expression Assertions

Symbol    Meaning
^         Matches at the start; also matches after each newline with the re.MULTILINE flag
$         Matches at the end; also matches before each newline with the re.MULTILINE flag
\A        Matches at the start
\b        Matches at a “word” boundary; influenced by the re.ASCII flag; inside a character class this is the escape for the backspace character
\B        Matches at a non-“word” boundary; influenced by the re.ASCII flag
\Z        Matches at the end
(?=e)     Matches if the expression e matches at this assertion but does not advance over it; called lookahead or positive lookahead
(?!e)     Matches if the expression e does not match at this assertion and does not advance over it; called negative lookahead
(?<=e)    Matches if the expression e matches immediately before this assertion; called positive lookbehind
(?<!e)    Matches if the expression e does not match immediately before this assertion; called negative lookbehind
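The four lookaround assertions in Table 12.3 can be illustrated with a small sketch. The sentences about prices are invented, and the \b assertions are added here so that a regex cannot match only part of a number.

```python
import re

amounts = "pay 20 dollars or 15 euros"
prices = "cost $30 and 40"

# Positive lookahead: digits that are followed by " dollars".
followed = re.findall(r"\d+(?= dollars)", amounts)
# Negative lookahead: whole numbers that are not followed by " dollars".
not_followed = re.findall(r"\b\d+\b(?! dollars)", amounts)

# Positive lookbehind: digits that come right after a "$".
preceded = re.findall(r"(?<=\$)\d+", prices)
# Negative lookbehind: whole numbers that do not come right after a "$".
not_preceded = re.findall(r"(?<!\$)\b\d+", prices)

print(followed, not_followed, preceded, not_preceded)
```

In every case the asserted text itself is excluded from the match, because assertions do not consume characters.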



 




Even though this regex is designed for a fairly simple task, it looks quite complicated. One way to make it more maintainable is to include comments in it. This can be done by adding inline comments using the syntax (?#the comment), but in practice comments like this can easily make the regex even more difficult to read. A much nicer solution is to use the re.VERBOSE flag, which allows us to freely use whitespace and normal Python comments in regexes, with the one constraint that if we need to match whitespace we must either use \s or a character class such as [ ]. Here’s the key=value regex with comments:


^[ \t]*             # start of line and optional leading whitespace
(?P<key>\w+)        # the key text
[ \t]*=[ \t]*       # the equals with optional surrounding whitespace
(?P<value>[^\n]+)   # the value text
(?<![ \t])          # negative lookbehind to avoid trailing whitespace
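One way the commented regex might be used from Python is sketched below. The name KEY_VALUE_VERBOSE and the input line are assumptions; re.VERBOSE is combined with re.MULTILINE as the text suggests.

```python
import re

KEY_VALUE_VERBOSE = re.compile(r"""
    ^[ \t]*             # start of line and optional leading whitespace
    (?P<key>\w+)        # the key text
    [ \t]*=[ \t]*       # the equals with optional surrounding whitespace
    (?P<value>[^\n]+)   # the value text
    (?<![ \t])          # negative lookbehind to avoid trailing whitespace
    """, re.VERBOSE | re.MULTILINE)

match = KEY_VALUE_VERBOSE.search("  topic =  physical geography   ")
fields = match.groupdict()

print(fields)
```

Whitespace inside the [ \t] character classes is still significant in verbose mode, which is why the pattern can be laid out freely without changing what it matches.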


