Regular expression in Python

Basic usage

import re
# compile re `ab*` and return pattern obj `p`
p = re.compile('ab*')

Duplicate pattern can be used w/ .compile method
White space sensitive
Case sensitive

Pattern obj’s 4 methods

Returned pattern obj offers 4 methods

Method	Object
match()	Return match obj by searching from very beginning one by one, return None if not found
search()	Return match obj in all string, return None if not found
findall()	Return substring list that matches w/ re
finditer()	Return substring iterable that matches w/ re

Following example uses pattern object from this:

import re
p = re.compile('[a-z]+')

match()

m = p.match("python")
print(m) # <_sre.SRE_Match object at 0x01F3F9F8>
# "python" matches w/ [a-z]+

m = p.match("3 python")
print(m) # None
# "3" is not matching w/ [a-z]+, thus, return None

General template for match() method due to 2 return types:

p = re.compile(reg_exp)
m = p.match( 'string goes here' )
if m:
    print('Match found: ', m.group())
else:
    print('No match')

search()

Return same match obj w/ match()

m = p.search("python")
print(m) # <_sre.SRE_Match object at 0x01F3FA68>

Returns match obj since it searches entire string.

m = p.search("3 python")
print(m) # <_sre.SRE_Match object at 0x01F3FA30>

Notice the difference.

match() start from beginning of string, while search() entire string

findall()

Returns list grouped by word

result = p.findall("life is too short")
print(result) #['life', 'is', 'too', 'short']

finditer()

Returns iterable. Each element in iterable is match obj.

result = p.finditer("life is too short")
print(result) # <callable_iterator object at 0x01F5E390>
for r in result: print(r)
...
<_sre.SRE_Match object at 0x01F3F9F8>
<_sre.SRE_Match object at 0x01F3FAD8>
<_sre.SRE_Match object at 0x01F3FAA0>
<_sre.SRE_Match object at 0x01F3F9F8>

Match obj’s 4 methods

Returned match obj offers 4 methods

Method	Role
group()	Return matched string
start()	Return matched string’s start index
end()	Return matched string’s end index
span()	Return matched string’s (start, end) tuple

Examples from `match()` and `search()`

m = p.match("python")
m.group() # 'python'
m.start() # 0 : Always 0 for match()
m.end() # 6
m.span() # (0, 6)

m = p.search("3 python")
m.group() # 'python'
m.start() # 2
m.end() # 8
m.span() # (2, 8)

Compile

Compile by module

Use separate compilation:
1. if pattern obj. is reused a lot
2. if you want to use compile options
```
p = re.compile('[a-z]+', re.I)
m = p.match("python")
```
Otherwise, compile and match at once is easier and faster.
```
m = re.match('[a-z]+', "python")
```

Compile options

Full Option	shortcut	Role
DOTALL	S	`.` includes `\n` -> include every characters
IGNORECASE	I	Ignore case
MULTILINE	M	Match multiple lines (`^`, `$` utilizable this option )
VERBOSE	X	Allow verbose mode. (Allow easier view and commentable)

example: re.DOTALL == re.S

DOTALL

Used when searching in strings w/ lots of line changes (\n).

w/o option:

import re
m = re.match('a.b', 'a\nb')
print(m) # None since . ignores \n

w/ option:

import re
p = re.compile('a.b', re.DOTALL)
m = p.match('a\nb')
print(m) # <_sre.SRE_Match object at 0x01FCF3D8>

IGNORECASE

p = re.compile('[a-z]', re.I)
p.match('python') # <_sre.SRE_Match object at 0x01FCFA30>
p.match('Python') # <_sre.SRE_Match object at 0x01FCFA68>
p.match('PYTHON') # <_sre.SRE_Match object at 0x01FCF9F8>

MULTILINE

Related to ^, $ meta characters
Example: Must start w/ python + white space + word

import re
p = re.compile("^python\s\w+")

data = """python one
life is too short
python two
you need python
python three"""

print(p.findall(data)) # ['python one']

Use re.M if you want to use ^ for each line’s first, not first for entire string.

import re
p = re.compile("^python\s\w+", re.MULTILINE)

data = """python one
life is too short
python two
you need python
python three"""

print(p.findall(data)) # ['python one', 'python two', 'python three']

VERBOSE

Professional RE user’s RE is very complicated to understand.

charref = re.compile(r'&[#](0[0-7]+|[0-9]+|x[0-9a-fA-F]+);')

w/ re.X:

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

When it compiles, re.VERBOSE option deletes all white spaces used in string (Except white spaces used inside [ ]).
# to comment for each line

Backslash `\`

Escape for meta characters

\section: \s recognized white space == [ \t\n\r\f\v]ection –> Not intended

\\section: recognized as \section
In python, \\ becomes \ : same error due to Python’s literal rules.

Unix’s grep, vim works fine
```
p = re.compile('\\section')
```
Using 4 \ is complicated.
```
p = re.compile('\\\\section')
```
To resolve this issue, Python offers Raw String rule as shown below:
```
p = re.compile(r'\\section')
```

If RE don’t have \, it will be the same RE w/ or w/o raw string indicator r

Reference: https://wikidocs.net/4308