  Python regular expressions: how to use regular expressions
     
  Add Date : 2018-11-21      
         
         
         
  Regular expressions (abbreviated RE) can be seen as a small, highly specialized programming language that is available in Python through the re module. With regular expressions you specify the rules for the set of strings you want to match; that set might contain English sentences, e-mail addresses, TeX commands, or anything else you like. You can then ask questions such as "Does this string match the pattern?" or "Is there a match for the pattern anywhere in this string?" You can also use regular expressions to modify a string or to split it apart.
Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay attention to how the engine will execute a given RE, and to write the RE in a way that produces faster bytecode. This article does not cover optimization, because that requires a good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not every string processing task can be solved with regular expressions. Some tasks can be done with regular expressions, but the resulting expressions are very complicated. In those cases it may be better to write Python code to do the processing; the Python code will be slower than an elaborate regular expression, but it will probably be easier to understand.

Regular expression syntax

We will start with the simplest regular expressions. Since regular expressions are used to operate on strings, we begin with the most common task: matching characters.

Matching characters

Most letters and characters simply match themselves. For example, the regular expression test matches the string test exactly (you can enable a case-insensitive mode so that the RE also matches Test or TEST).
There are exceptions to this rule: some characters are special metacharacters and do not match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other parts of the RE by repeating them or changing their meaning. Much of this article is devoted to the various metacharacters and what they do.
Here is the complete list of metacharacters; their meanings are discussed in the rest of the article:
. ^ $ * + ? { } [ ] \ | ( )

First we look at [ and ]. They are used to specify a character class, which is a set of characters that you want to match. Characters can be listed individually, or a range of characters can be indicated with '-'. For example, [abc] matches any one of the characters a, b, or c; [a-c] expresses the same set using a range. If you want to match only lowercase letters, your RE would be [a-z].
Metacharacters (except \) are not active inside classes. For example, [akm$] matches any one of 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it is treated as an ordinary character.
You can also match the characters not listed in a class by complementing the set, which is done by putting '^' as the first character of the class. A '^' anywhere else in the class has no special meaning and simply matches the '^' character. For example, [^5] matches any character except '5'.
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal special sequences. It is also used to escape metacharacters so that they can be matched literally; for example, to match a [ or a \, precede it with a backslash to remove its special meaning: \[ or \\.
Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that is not whitespace.
Let's take an example: \w matches any alphanumeric character. If the regular expression pattern is expressed in bytes, this is equivalent to the class [a-zA-Z0-9_]. If the pattern is a string, \w matches all the characters marked as letters in the Unicode database provided by the unicodedata module. You can get the more restricted definition of \w in a string pattern by supplying the re.ASCII flag when compiling the regular expression.
Here are some of the special sequences for reference (the equivalent classes shown apply to bytes patterns, or to string patterns compiled with re.ASCII):
\d: Matches any decimal digit; equivalent to [0-9];
\D: Matches any non-digit character; equivalent to [^0-9];
\s: Matches any whitespace character; equivalent to [ \t\n\r\f\v];
\S: Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v];
\w: Matches any alphanumeric character; equivalent to [a-zA-Z0-9_];
\W: Matches any non-alphanumeric character; equivalent to [^a-zA-Z0-9_].
These sequences can be included inside a character class. For example, [\s,.] is a character class that matches any whitespace character, or ',' or '.'.
The final metacharacter in this section is '.'. It matches any character except a newline, and there is an alternate mode (re.DOTALL) in which it matches even a newline. '.' is often used where you want to match "any character".
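
The character classes and special sequences described above can be tried out directly. The following is a minimal sketch that uses re.findall(), which is introduced later in this article; the sample strings and variable names are invented for illustration.

```python
import re

# A range inside a character class: any one of a, b, or c.
in_range = re.findall(r"[a-c]", "abcdef")

# A complemented class: every character except '5'.
not_five = re.findall(r"[^5]", "1525")

# '$' is an ordinary character inside a class.
with_dollar = re.findall(r"[akm$]", "make $5")

# Special sequences can appear inside a class: whitespace, ',' or '.'.
separators = re.findall(r"[\s,.]", "a, b.c")

# '.' skips newlines unless re.DOTALL is given.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", re.DOTALL)

print(in_range)      # ['a', 'b', 'c']
print(not_five)      # ['1', '2']
print(with_dollar)   # ['m', 'a', 'k', '$']
print(separators)    # [',', ' ', '.']
print(dot_default)   # ['a', 'b']
print(dot_all)       # ['a', '\n', 'b']
```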

Repeating things

Matching varying sets of characters is the first thing regular expressions can do. Another capability is specifying that portions of the RE must be repeated a certain number of times.
The first repetition metacharacter we will look at is '*'. '*' does not match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times.
For example, ca*t matches ct (0 a characters), cat (1 a), caaat (3 a characters), and so forth. The RE engine has internal limits on the number of repetitions it can match, but the limit is large enough for ordinary use.
Repetitions such as * are greedy: when repeating a RE, the matching engine tries to repeat it as many times as possible. If later portions of the pattern do not match, the engine backs up and tries again with fewer repetitions.
For example, consider the expression a[bcd]*b. It matches the letter 'a', then zero or more letters from the class [bcd], and finally ends with a 'b'. Here is how the RE is matched against the string abcbd:
1) Matched so far: a. The a in the RE matches;
2) Matched so far: abcbd. The engine matches [bcd]*, going as far as it can, which is to the end of the string;
3) Failure. The engine tries to match b, but the current position is at the end of the string, so it fails;
4) Matched so far: abcb. The engine backs up so that [bcd]* matches one character less;
5) Failure. The engine tries b again, but the character at the current position is the final d;
6) Matched so far: abc. The engine backs up again so that [bcd]* matches only bc;
7) Matched so far: abcb. The engine tries b again; this time the character at the current position is b, so the match succeeds.
The RE finally matches abcb. This demonstrates how the matching engine goes as far as it can at first, and if no match is found, progressively backs up and retries the rest of the RE. It will back up until it has tried zero matches for [bcd]*, and if that also fails, the engine concludes that the string does not match the RE at all.
Another repetition metacharacter is +, which matches one or more times. Note the difference between * and +: * matches zero or more times, so whatever is being repeated may not be present at all, while + requires at least one occurrence. For example, ca+t matches cat (1 a) and caaat (3 a characters), but does not match ct.
There are two more repetition qualifiers. The question mark, ?, matches either once or zero times; you can think of it as marking something as optional. For example, home-?brew matches either homebrew or home-brew.
The most complicated repetition qualifier is {m,n}, where m and n are decimal integers. It means there must be at least m repetitions and at most n. For example, a/{1,3}b matches a/b, a//b, and a///b. It does not match ab, which has no slashes, or a////b, which has four.
You can omit either m or n. Omitting m is interpreted as a lower limit of 0, while omitting n means there is no upper limit.
You may have noticed that the other three qualifiers can all be expressed with this notation: {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. Why use *, +, or ? then? Mainly because the shorter expression is easier to read and understand.
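
The behaviour of the repetition qualifiers can be checked with a short script. This is only a sketch: fully_matches is a hypothetical helper written for this example (it wraps re.fullmatch(), which requires the whole string to match), and the test strings are taken from the examples above.

```python
import re

def fully_matches(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# '*' allows zero repetitions, '+' requires at least one.
star = [s for s in ["ct", "cat", "caaat"] if fully_matches(r"ca*t", s)]
plus = [s for s in ["ct", "cat", "caaat"] if fully_matches(r"ca+t", s)]

# '?' makes the hyphen optional.
optional = [s for s in ["homebrew", "home-brew", "home--brew"]
            if fully_matches(r"home-?brew", s)]

# {1,3} accepts between one and three slashes.
counted = [s for s in ["ab", "a/b", "a//b", "a///b", "a////b"]
           if fully_matches(r"a/{1,3}b", s)]

# Greedy matching with backtracking: [bcd]* first consumes 'bcbd', then
# gives characters back until the final 'b' in the pattern can match.
greedy = re.match(r"a[bcd]*b", "abcbd").group()

print(star)      # ['ct', 'cat', 'caaat']
print(plus)      # ['cat', 'caaat']
print(optional)  # ['homebrew', 'home-brew']
print(counted)   # ['a/b', 'a//b', 'a///b']
print(greedy)    # abcb
```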

Using regular expressions

Now that we have looked at the basic syntax of regular expressions, let's see how to use them in Python. The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

Compiling a regular expression

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing substitutions.
>>> import re
>>> p = re.compile('ab*')
>>> p
re.compile('ab*')

re.compile() also accepts an optional flags argument, used to enable various special features. The available flags are described later; for now a single example will do:

>>> p = re.compile('ab*', re.IGNORECASE)

The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions are not part of the core Python language, and no special syntax was created for expressing them. Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.
Putting REs in strings keeps the Python language simpler, but it has one disadvantage, which is the topic of the next section.

The backslash problem

As mentioned earlier, regular expressions use the backslash to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's use of the same character for the same purpose in string literals.
Suppose you want to write a RE that matches the string \section. Start with the desired string to be matched, \section. Next, escape the backslash and any other metacharacters by preceding them with a backslash, which gives \\section. Finally, this string must be passed to re.compile(); to express \\section as a Python string literal, each backslash must be escaped once more, so the literal you write in Python is "\\\\section".
In short, to match a literal backslash, you have to write '\\\\' as the RE string. This leads to lots of repeated backslashes and makes the resulting strings difficult to understand.
The solution is to use Python's raw string notation for regular expressions. Backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions in Python code are usually written using this raw string notation.
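
To make the difference concrete, here is a small sketch comparing the ordinary and raw spellings of the \section pattern. The variable names and the sample text are assumptions made for this example.

```python
import re

# The text to search contains one literal backslash.
text = r"\section{Introduction}"

plain_pattern = "\\\\section"   # ordinary literal: four backslashes typed
raw_pattern = r"\\section"      # raw literal: two backslashes typed

# Both literals produce the same pattern string: \\section
same = plain_pattern == raw_pattern

# The pattern matches a single literal backslash followed by 'section'.
found = re.search(raw_pattern, text).group()

# r"\n" is two characters; "\n" is a single newline character.
raw_len = len(r"\n")
plain_len = len("\n")

print(same)                # True
print(found)               # \section
print(raw_len, plain_len)  # 2 1
```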

Performing matches

Once you have an object representing a compiled regular expression, what do you do with it? Pattern objects have several methods and attributes. Only the most significant ones are covered here:
1) match()
Determines whether the RE matches at the beginning of the string.
2) search()
Scans through the string, looking for any location where the RE matches.
3) findall()
Finds all substrings where the RE matches and returns them as a list.
4) finditer()
Finds all substrings where the RE matches and returns them as an iterator.
match() and search() return None if no match can be found. If they succeed, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and so on.
Let's see how to use these methods in Python.
First, run the Python interpreter, import the re module, and compile a RE:
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')

Now you can try matching various strings against the RE [a-z]+. An empty string should not match at all, since + means 'one or more repetitions'. match() returns None in this case, which causes the interpreter to print no output; you can print the result explicitly to make this clear:

>>> p.match("")
>>> print(p.match(""))
None

Next, let's try a string that should match, such as tempo. In this case match() returns a match object, so you should store the result in a variable for later use:

>>> m = p.match('tempo')
>>> m
<_sre.SRE_Match object; span=(0, 5), match='tempo'>

Now you can query the match object for information about the matching string. Match objects also have several methods and attributes; the most important ones are:
1) group()
Returns the string matched by the RE
2) start()
Returns the starting position of the match
3) end()
Returns the ending position of the match
4) span()
Returns a tuple containing the (start, end) positions of the match
Here are some examples of these methods in use:

>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)

Since the match() method only checks whether the RE matches at the start of a string, start() will always be zero. However, the search() method scans through the whole string, so the match may not start at zero in that case:

>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<_sre.SRE_Match object; span=(4, 11), match='message'>
>>> m.group()
'message'
>>> m.span()
(4, 11)

In actual programs, the most common style is to store the match object in a variable and then check whether it is None.
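
The store-then-check idiom might look like the following sketch. The helper name describe and its return strings are assumptions made for this example; only the compiled pattern comes from the session above.

```python
import re

p = re.compile(r'[a-z]+')

def describe(text):
    """Hypothetical helper: report whether text starts with lowercase letters."""
    m = p.match(text)      # store the result first...
    if m:                  # ...then test it; None is falsy, a match object is truthy
        return 'Match found: ' + m.group()
    return 'No match'

print(describe('tempo'))        # Match found: tempo
print(describe('::: message'))  # No match
```

Testing the variable before calling group() avoids the AttributeError that would be raised by calling a method on None.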

findall() returns a list of matching strings:

>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']

findall() has to create the entire list before it can be returned as the result, whereas finditer() returns a sequence of match object instances as an iterator:

>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable_iterator object at 0x...>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)

Module-level functions

You do not have to create a pattern object and call its methods; the re module also provides module-level functions called match(), search(), findall(), sub(), and so on. These functions take the same arguments as the corresponding pattern method, with the RE string added as the first argument, and still return either None or a match object instance:
>>> print(re.match(r'From\s+', 'Fromage amk'))
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<_sre.SRE_Match object; span=(0, 5), match='From '>

Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so later calls using the same RE will not need to compile the pattern again.
Should you use these module-level functions, or should you create the pattern object and call its methods yourself? If you are using the RE inside a loop, pre-compiling it will save a few function calls. Outside of loops, there is not much difference between the two approaches.
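
The two styles can be compared side by side. This is a sketch with made-up input lines; the names lines, via_module, from_re and via_object are assumptions for illustration.

```python
import re

lines = ['From amk Thu May 14 19:12:10 1998', 'Fromage amk', 'From guido Fri']

# Module-level function: the pattern string is passed (and looked up in the
# re module's cache) on every call.
via_module = [line for line in lines if re.match(r'From\s+', line)]

# Precompiled pattern: compile once, then call the method inside the loop.
from_re = re.compile(r'From\s+')
via_object = [line for line in lines if from_re.match(line)]

print(via_module == via_object)  # True
print(via_object)
```

Both approaches select the same lines; the precompiled form just does less work per iteration.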

Compile flags

Compile flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names: a long name such as IGNORECASE and a short, one-letter form such as I. Multiple flags can be specified by bitwise OR-ing them; for example, re.I | re.M sets both the I and M flags.
Here is a list of the available flags, with a short explanation of each:
1) ASCII, A
Makes escapes such as \w, \b, \s and \d match only ASCII characters with the respective property;
2) DOTALL, S
Makes '.' match any character, including a newline;
3) IGNORECASE, I
Performs case-insensitive matching;
4) LOCALE, L
Performs locale-aware matching;
5) MULTILINE, M
Performs multi-line matching, affecting ^ and $;
6) VERBOSE, X (for 'extended')
Enables verbose REs, which can be organized more cleanly and understandably.
For example, here is a RE that uses re.VERBOSE; notice how much easier it is to read:
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

Without re.VERBOSE, the RE would look like this:

charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")

In the above example, Python's automatic concatenation of string literals has been used to break the RE into smaller pieces, but it is still more difficult to understand than the version using re.VERBOSE.
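
The flags can be exercised with a few short calls. The following is a sketch only; the sample text and variable names are assumptions, and the charref pattern is the verbose version shown above applied to invented input.

```python
import re

# IGNORECASE: 'test' also matches 'TEST'.
ci = re.match(r'test', 'TEST', re.IGNORECASE).group()

# MULTILINE: '^' matches at the start of every line, not just the string.
text = 'From here\nTo there\nFrom afar'
starts_default = re.findall(r'^From', text)
starts_multiline = re.findall(r'^From', text, re.MULTILINE)

# Flags are combined with the bitwise OR operator.
both = re.findall(r'^from', text, re.I | re.M)

# VERBOSE: whitespace and comments inside the pattern are ignored.
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)
refs = [m.group(1) for m in charref.finditer('&#65; &#x41; &#0101;')]

print(ci)                # TEST
print(starts_default)    # ['From']
print(starts_multiline)  # ['From', 'From']
print(both)              # ['From', 'From']
print(refs)              # ['65', 'x41', '0101']
```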

More regular expression features

So far we have covered only part of what regular expressions can do. This section explores some further features.

More metacharacters

Here we introduce more metacharacters.
1) |
Alternation, or the 'or' operator. If A and B are regular expressions, A|B matches any string that matches either A or B. | has very low precedence so that it works sensibly when you alternate multi-character strings: Crow|Servo matches either Crow or Servo, not Cro, a 'w' or an 'S', and ervo.
To match a literal '|', use \|, or enclose it inside a character class, as in [|].
2) ^
Matches at the beginning of lines. Unless the MULTILINE flag has been set, this matches only at the beginning of the string. In MULTILINE mode, it also matches immediately after each newline within the string.
For example, to match the word From only at the beginning of a line, the RE to use is ^From.
>>> print(re.search('^From', 'From Here to Eternity'))
<_sre.SRE_Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
None

3) $
Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.

>>> print(re.search('}$', '{block}'))
<_sre.SRE_Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))
<_sre.SRE_Match object; span=(6, 7), match='}'>

To match a literal '$', use \$ or enclose it inside a character class, as in [$].
4) \A
Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode they differ: \A still matches only at the beginning of the string, but ^ may match at any location inside the string that follows a newline character.
5) \Z
Matches only at the end of the string.
6) \b
Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
The following example matches class only when it is a complete word; it does not match when it is contained inside another word:

>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<_sre.SRE_Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None

There are two subtleties to remember when using this special sequence. First, in Python string literals \b is the backspace character, ASCII value 8. If you are not using raw strings, Python converts the \b to a backspace, and your RE will not match as you expect. The following example looks the same as the previous RE, but omits the 'r' in front of the RE string:

>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))
<_sre.SRE_Match object; span=(0, 7), match='\x08class\x08'>

Second, inside a character class, where there is no use for this assertion, \b represents the backspace character, for compatibility with Python's string literals.
7) \B
Another zero-width assertion. It is the opposite of \b, matching only when the current position is not at a word boundary.
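
The differences between ^ and \A, between $ and \Z, and between \b and \B are easier to see side by side. The following sketch uses invented sample strings; the variable names are assumptions for illustration.

```python
import re

text = 'first line\nsecond line'

# Under MULTILINE, '^' matches at every line start; '\A' only at the string start.
caret = re.findall(r'^\w+', text, re.MULTILINE)
string_start = re.findall(r'\A\w+', text, re.MULTILINE)

# '$' also matches just before a trailing newline; '\Z' only at the very end.
dollar = re.search(r'line$', 'second line\n')
string_end = re.search(r'line\Z', 'second line\n')

# '\b' requires a word boundary, '\B' requires that there is none.
sample = 'class subclass classy class.'
whole_word = [m.span() for m in re.finditer(r'\bclass\b', sample)]
inside_word = [m.span() for m in re.finditer(r'\Bclass\b', sample)]

print(caret)               # ['first', 'second']
print(string_start)        # ['first']
print(dollar is not None)  # True
print(string_end)          # None
print(whole_word)          # [(0, 5), (22, 27)]
print(inside_word)         # [(9, 14)]
```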

Grouping

Groups are marked by the '(' and ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions: they group together the expressions contained inside them, and you can repeat the contents of a group with a repetition qualifier such as *, +, ?, or {m,n}. For example, (ab)* matches zero or more repetitions of ab.
>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)

Groups indicated with '(' and ')' also capture the starting and ending index of the text they match; this can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting with 0. Group 0 is always present; it is the whole RE, so match object methods all have group 0 as their default argument.

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Subgroups are numbered from left to right, starting at 1. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

group() can be passed multiple group numbers at a time, in which case it returns a tuple containing the corresponding values for those groups:

>>> m.group(2,1,2)
('b', 'abc', 'b')

The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are:

>>> m.groups()
('abc', 'b')

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 succeeds if the exact contents of group 1 can be found at the current position, and fails otherwise. Because Python string literals also use a backslash followed by digits for their own purposes, be sure to use a raw string when a RE contains backreferences.
For example, the following RE detects doubled words in a string:

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'

Backreferences like this are not often useful for just searching through a string, but they are very useful when performing string substitutions.

Non-capturing and Named Groups

Elaborate REs may use many groups, both to capture substrings of interest and to group and structure the RE itself. In complex REs it becomes difficult to keep track of the group numbers. There are two features that help with this problem; we will look at the first one now.
Sometimes you will want to use a group to denote a part of a regular expression, but are not interested in retrieving the group's contents. You can make this explicit by using a non-capturing group: (?:...), where you can replace the ... with any other regular expression.
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()

Except for the fact that you cannot retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as *, and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing pattern, since you can add new groups without changing how the existing groups are numbered. Note that there is no performance difference in searching between capturing and non-capturing groups.
A more significant feature is named groups: instead of referring to them by number, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions: (?P<name>...), where name is the name of the group. Match object methods accept either the group number or the group name, so you can retrieve information about a group in two ways:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search('((((Lots of punctuation)))')
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'

Named groups are handy because names are easier to remember than numbers. Here is an example RE from the imaplib module:

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

It is obviously much easier to retrieve m.group('zonem') than to remember to retrieve group 9.
The syntax for backreferences in an expression such as (...)\1 refers to the number of the group. There is a variant that uses the group name instead of the number. This is another Python extension: (?P=name) indicates that the contents of the group called name should again be matched at the current point. The regular expression for finding doubled words, (\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word):

>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
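
The grouping features can be combined in one short script. This is a sketch: date_re and its group names are hypothetical, chosen for this example, and groupdict() is a match object method not otherwise discussed in this article that returns the named groups as a dictionary.

```python
import re

# Numbered groups are counted by their opening parenthesis, left to right.
numbered = re.match(r'(a(b)c)d', 'abcd').groups()

# A non-capturing group structures the RE without taking a group number.
non_capturing = re.match(r'(?:a(b)c)d', 'abcd').groups()

# Named groups (hypothetical pattern for an ISO-style date).
date_re = re.compile(r'(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})')
d = date_re.match('2018-11-21')
by_name = d.group('year')
by_number = d.group(1)
as_dict = d.groupdict()

# Backreference by name: find a doubled word.
doubled = re.search(r'(?P<word>\b\w+)\s+(?P=word)',
                    'Paris in the the spring').group()

print(numbered)        # ('abc', 'b')
print(non_capturing)   # ('b',)
print(by_name)         # 2018
print(by_number)       # 2018
print(as_dict)         # {'year': '2018', 'month': '11', 'day': '21'}
print(doubled)         # the the
```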

Lookahead assertion

Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:
1) (?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But once the contained expression has been tried, the matching engine does not advance at all; the rest of the pattern is tried right where the assertion started.
2) (?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression does not match at the current position in the string.
To make this concrete, let's look at a case where a lookahead is useful. Consider a simple pattern to match a filename and split it apart into a base name and an extension, separated by a '.'. For example, in news.rc, news is the base name and rc is the filename's extension.
The pattern to match this is quite simple:

.*[.].*$

Notice that the '.' needs to be treated specially because it is a metacharacter, so it is placed inside a character class to match only that specific character. Also notice the trailing $; it is added to ensure that all the rest of the string is included in the extension. This regular expression matches foo.bar, autoexec.bat, sendmail.cf and printers.conf.
Now, consider complicating the problem a bit: what if you want to match filenames where the extension is not bat? Here are some incorrect attempts:
1) .*[.][^b].*$
This first attempt tries to exclude bat by requiring that the first character of the extension is not a b. This is wrong, because the pattern also fails to match foo.bar.
2) .*[.]([^b]..|.[^a].|..[^t])$
The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension is not b, the second character is not a, or the third character is not t. This accepts foo.bar and rejects autoexec.bat, but it requires a three-letter extension and will not accept a filename with a two-letter extension such as sendmail.cf. We will complicate the pattern again in an effort to fix it.
3) .*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$
In the third attempt, the second and third letters are all made optional in order to allow matching extensions shorter than three characters, such as sendmail.cf.
The pattern is getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you want to exclude both bat and exe as extensions, the pattern would get even more complicated and confusing.
A negative lookahead assertion cuts through all this confusion.
.*[.](?!bat$).*$
The negative lookahead means: if the expression bat does not match at this point, try the rest of the pattern; if bat$ does match, the whole pattern will fail. The trailing $ is required to ensure that something like sample.batch, where the extension only starts with bat, is allowed.
Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either bat or exe:
.*[.](?!bat$|exe$).*$
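
The two lookahead patterns can be checked against a handful of filenames. This is a minimal sketch; the list of names and the variable names are assumptions for illustration, and only filenames with a single dot are tested.

```python
import re

not_bat = re.compile(r'.*[.](?!bat$).*$')
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'sample.batch', 'setup.exe']

# Keep only the names each pattern accepts.
kept_1 = [n for n in names if not_bat.match(n)]
kept_2 = [n for n in names if not_bat_or_exe.match(n)]

print(kept_1)  # ['foo.bar', 'sendmail.cf', 'sample.batch', 'setup.exe']
print(kept_2)  # ['foo.bar', 'sendmail.cf', 'sample.batch']
```

Note that sample.batch is accepted by both patterns, because the $ inside the assertion only rejects extensions that are exactly bat or exe.
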
Modifying strings

So far we have only used regular expressions to search strings. Regular expressions are also commonly used to modify strings, using the following pattern methods:
1) split()
Splits the string into a list, splitting it wherever the RE matches;
2) sub()
Finds all substrings where the RE matches, and replaces them with a different string;
3) subn()
Does the same thing as sub(), but returns the new string and the number of replacements.

Splitting strings

The split() method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It is similar to the split() method of strings but provides much more generality in the delimiters you can split by; string split() only supports splitting by whitespace or by a fixed string. There is a module-level re.split() function, too.
split(string[, maxsplit=0])
Splits string by the matches of the regular expression. If capturing parentheses are used in the RE, their contents are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.
You can limit the number of splits made by passing a value for maxsplit. When maxsplit is nonzero, at most maxsplit splits are made, and the remainder of the string is returned as the final element of the list. In the following example, the delimiter is any sequence of non-alphanumeric characters:
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']

Sometimes you are interested not only in the text between delimiters, but also in what the delimiter was. If capturing parentheses are used in the RE, their values are also returned as part of the list. Compare the following calls:

>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']

The module-level function re.split() adds the RE as the first argument, but is otherwise the same:

>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
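
To illustrate the extra generality over the string split() method, here is a sketch that splits a record with mixed delimiters. The sample record and variable names are assumptions made for this example.

```python
import re

record = 'alpha, beta;gamma  delta'

# str.split() accepts only one fixed separator (or runs of whitespace).
by_comma = record.split(', ')

# re.split() accepts any pattern: here a run of commas, semicolons, or whitespace.
fields = re.split(r'[,;\s]+', record)

# Capturing parentheses keep the delimiters in the result.
fields_and_delims = re.split(r'([,;\s]+)', record)

# maxsplit caps the number of splits; the remainder is returned whole.
first_split = re.split(r'[,;\s]+', record, maxsplit=1)

print(by_comma)           # ['alpha', 'beta;gamma  delta']
print(fields)             # ['alpha', 'beta', 'gamma', 'delta']
print(fields_and_delims)  # ['alpha', ', ', 'beta', ';', 'gamma', '  ', 'delta']
print(first_split)        # ['alpha', 'beta;gamma  delta']
```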

Search and replace

Another common task is to find all the matches for a pattern and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.
sub(replacement, string[, count=0])
Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by replacement. If the pattern is not found, string is returned unchanged.
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.
Here is a simple example. It replaces colour names with the word colour:
>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'

The subn() method does the same work, but returns a 2-tuple containing the new string and the number of replacements that were performed:

>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colours at all')
('no colours at all', 0)

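Empty matches are replaced only when they are not adjacent to a previous match. (The output below is from Python versions before 3.7; in Python 3.7 and later, empty matches adjacent to a previous non-empty match are also replaced, and the same call returns '-a-b--d-'.)

>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'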

If replacement is a string in it any backslash escapes will be processed. That is, \ n is converted to a newline character, \ r is converted to a carriage return, and so on. Unknown escapes such as \ j is left. Back-references such as \ 6, substituted RE substring matching the corresponding group. This allows you to replace in the result string after the merger can be part of the original string.
The following example matches the word section, {} is a string containing the followers, and changes to section subsection:

>>> P = re.compile ( 'section {([^}] *)}', re.VERBOSE)
>>> P.sub (r'subsection {\ 1} ',' section {First} section {second} ')
'Subsection {First} subsection {second}'

You can also refer to named groups defined with the (?P<name>...) syntax. \g<name> uses the substring matched by the group named name, and \g<number> uses the corresponding group number. \g<2> is therefore equivalent to \2, but it is not ambiguous in a replacement string such as \g<2>0, which means group 2 followed by the literal character '0'; \20 would be interpreted as a reference to group 20. The following substitutions are all equivalent, but use the three different forms of the replacement string:

>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}', 'section{First}')
'subsection{First}'
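
The ambiguity between a group reference followed by a digit and a two-digit group number can be seen in a short sketch. The pattern and the variable ambiguous_ok are illustrative only.

```python
import re

p = re.compile(r'(\d+)-(\d+)')

# \g<1>0 means "group 1, then a literal 0".
print(p.sub(r'\g<1>0', '12-34'))

# \10 is read as a reference to group 10, which this pattern does not have,
# so the re module rejects the replacement string.
try:
    p.sub(r'\10', '12-34')
    ambiguous_ok = True
except re.error as exc:
    ambiguous_ok = False
    print('error:', exc)
```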

replacement can also be a function, which gives you even more control. If replacement is a function, it is called for every non-overlapping occurrence of the pattern. On each call, the function is passed a match object as its argument, and can use this information to compute the desired replacement string and return it.
In the following example, the replacement function converts decimal numbers to hexadecimal:

>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
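
A replacement function is also handy when the new text has to be looked up rather than computed. The sketch below fills {name} placeholders from a dictionary; the function expand and the placeholder syntax are assumptions made for this illustration.

```python
import re

def expand(template, values):
    """Fill {name} placeholders from a dict; unknown names are left untouched."""
    def repl(match):
        name = match.group(1)
        if name in values:
            return str(values[name])
        # Returning the whole match leaves the placeholder as it was.
        return match.group(0)
    return re.sub(r'\{(\w+)\}', repl, template)

print(expand('Hello {name}, call {ext}.', {'name': 'Ann', 'ext': 65490}))
```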

When using the module-level re.sub() function, the pattern is passed as the first argument. The pattern may be given as a pattern object or as a string. If you need to specify regular expression flags, you can use a pattern object as the first argument, or use embedded modifiers in the pattern string; for example, sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'. Current Python versions also accept a flags keyword argument, as in re.sub("b+", "x", "bbbb BBBB", flags=re.IGNORECASE).
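
The different ways of supplying a flag can be compared side by side. This sketch uses the same input as the sentence above; the variable names are illustrative.

```python
import re

text = 'bbbb BBBB'

# 1. Embedded modifier inside the pattern string.
inline = re.sub('(?i)b+', 'x', text)

# 2. Pattern object compiled with the flag.
compiled = re.compile('b+', re.IGNORECASE).sub('x', text)

# 3. flags keyword argument of the module-level function.
keyword = re.sub('b+', 'x', text, flags=re.IGNORECASE)

print(inline, compiled, keyword, sep=' | ')
```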

Common problems

Regular expressions are a powerful tool for some applications, but in some ways their behaviour is not intuitive, and at times they do not work the way you expect. This section describes some of the most common pitfalls.
Use string methods

Sometimes using the re module is a mistake. If you're matching a fixed string, or a single character class, and you're not using any re features such as the IGNORECASE flag, then the full power of regular expressions is not needed. Strings have several methods for performing operations with fixed strings, and they are usually much faster, because their implementation is a single small C loop that has been optimized for the purpose.
A common example is replacing one fixed string with another; for example, you might want to replace word with deed. re.sub() seems like the function to use for this, but consider the replace() method. Note that replace() will also replace word inside words, turning swordfish into sdeedfish, but the naive RE word would have done that, too. (To avoid substituting parts of words, the pattern would have to be \bword\b, which requires a word boundary on either side of word. This takes the job beyond replace()'s abilities.)
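
The difference between the three approaches can be checked directly. The sample sentence is made up for this illustration.

```python
import re

text = 'A word about swordfish.'

# str.replace() and the naive pattern both rewrite the inside of "swordfish".
print(text.replace('word', 'deed'))
print(re.sub('word', 'deed', text))

# Word boundaries restrict the substitution to the standalone word.
print(re.sub(r'\bword\b', 'deed', text))
```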
Another common task is deleting every occurrence of a single character from a string, or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but translate() can do both tasks and will be faster than any regular expression operation.
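
A minimal sketch of both tasks with translate() follows. The translation tables are built with str.maketrans(); the variable names are illustrative.

```python
import re

s = 'first line\nsecond line\n'

# Third argument of str.maketrans(): characters to delete.
delete_newlines = str.maketrans('', '', '\n')
# Two equal-length strings: map each character in the first to the second.
newline_to_space = str.maketrans('\n', ' ')

print(repr(s.translate(delete_newlines)))
print(repr(s.translate(newline_to_space)))
```

For a single character, s.replace('\n', ' ') gives the same result and is often the simplest choice; translate() becomes more useful when several characters have to be handled in one pass.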
In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
match() versus search()

The match() function only checks whether the RE matches at the beginning of the string, while search() scans forward through the string for a match. It is important to keep this distinction in mind: match() only reports a successful match that starts at position 0; if the match would not start at 0, match() does not report it.
>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None

On the other hand, search() scans forward through the string and reports the first match it finds.

>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)

Sometimes you'll be tempted to keep using re.match() and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. This analysis lets the engine quickly scan through the string looking for the starting character, and only try the full match when a 'C' is found.
Adding .* defeats this optimization: the engine has to scan to the end of the string and then backtrack to find a match for the rest of the RE. Use re.search() instead.
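
Besides the lost optimization, the .* prefix changes what the match object reports. This short sketch reuses the word from the earlier examples; the variable names are illustrative.

```python
import re

text = 'insuperable'

with_prefix = re.match('.*super', text)   # match() made to work by a leading .*
with_search = re.search('super', text)    # the recommended form

# The prefixed version includes everything before 'super' in the match.
print(with_prefix.span(), repr(with_prefix.group()))
print(with_search.span(), repr(with_search.group()))
```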

Greedy versus non-greedy

When repeating a regular expression, as in a*, the engine consumes as much of the string as possible. This often causes trouble when you try to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag does not work because of the greedy nature of .*:
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>

The RE matches the '<' in <html>, and the .* consumes the rest of the string. There is still more left in the RE, though, and the > cannot match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the >. The final match extends from the '<' in <html> to the '>' in </title>, which is not what you want.
In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or {m,n}?, which match as little text as possible. In the example above, the '>' is tried immediately after the first '<' matches; when it fails, the engine advances one character at a time, retrying the '>' at every step. This produces the right result:

>>> print(re.match('<.*?>', s).group())
<html>
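
The effect is easier to see when every match in the string is collected. The sketch below uses re.findall(); the negated character class in the last line is a common alternative that is not taken from the text above.

```python
import re

s = '<html><head><title>Title</title>'

greedy = re.findall('<.*>', s)      # one match spanning the whole string
lazy = re.findall('<.*?>', s)       # one match per tag
negated = re.findall('<[^>]*>', s)  # alternative: forbid '>' inside the tag

print(greedy)
print(lazy)
print(negated)
```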

(Note that parsing HTML or XML with regular expressions is painful, because a regular expression that handles every possible case becomes very complicated. Use an HTML or XML parser for such tasks.)
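
As a minimal sketch of the parser route, the standard library's html.parser module can collect the start tags of the same sample string. The class name TagCollector is made up for this illustration.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the name of every start tag that the parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()
print(collector.tags)
```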

Using re.VERBOSE

By now you have probably noticed that regular expressions are a very compact notation, but they are not very readable. REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand.
For such REs, specifying the re.VERBOSE flag when compiling the regular expression is helpful, because it allows you to format the regular expression more clearly.
The re.VERBOSE flag has several effects. Whitespace in the regular expression that is not inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but [a b] still matches the characters 'a', 'b', or a space. In addition, you can put comments inside a RE; comments extend from a # character to the next newline. When used with triple-quoted strings, this lets REs be formatted more neatly:
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

This is far more readable than:

pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
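
To check that the two spellings behave the same, both can be run against one header line. This is only a sketch: the names verbose_pat, compact_pat and the sample line are assumptions made for the illustration.

```python
import re

verbose_pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value, matched non-greedily
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = '  From:author@example.com   '
for pattern in (verbose_pat, compact_pat):
    m = pattern.match(line)
    print(m.group('header'), m.group('value'))
```

Note that any whitespace directly after the colon ends up at the start of the value group, so a real header parser would probably strip the value afterwards.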
     
         
         
         