Regular expressions in Python

import re
# Compile the RE `ab*` and bind the resulting pattern object to `p`
p = re.compile('ab*')

Pattern object's 4 methods

Method      Purpose
match()     Returns a match object if the RE matches at the very beginning of the string; otherwise None
search()    Scans the whole string and returns a match object for the first match; otherwise None
findall()   Returns a list of all substrings that match the RE
finditer()  Returns an iterator of match objects, one per match

The following examples all use this pattern object:

import re
p = re.compile('[a-z]+')

match()

m = p.match("python")
print(m)  # <re.Match object; span=(0, 6), match='python'>
# "python" matches [a-z]+ from the start of the string
m = p.match("3 python")
print(m)  # None
# The first character "3" does not match [a-z]+, so match() returns None

Because match() returns either a match object or None, the usual template is:

import re
p = re.compile('[a-z]+')  # put your own RE here
m = p.match('string goes here')
if m:
    print('Match found:', m.group())
else:
    print('No match')

search()

For "python", search() returns the same match as match():

m = p.search("python")
print(m)  # <re.Match object; span=(0, 6), match='python'>

For "3 python", search() still returns a match object, since it scans the entire string rather than only the start:

m = p.search("3 python")
print(m)  # <re.Match object; span=(2, 8), match='python'>

Notice the difference: match() only tries to match at the beginning of the string, while search() looks for the first match anywhere in the string.
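
The difference can be checked side by side. This is a minimal sketch; `describe` is a hypothetical helper name, not part of the re module, and it reuses the `[a-z]+` pattern from above.

```python
import re

p = re.compile('[a-z]+')

def describe(text):
    """Return (match() result, search() result) as matched strings or None."""
    m = p.match(text)    # only tries at index 0
    s = p.search(text)   # scans forward for the first match
    return (m.group() if m else None, s.group() if s else None)

for text in ["python", "3 python", "123"]:
    print(repr(text), describe(text))
# 'python' ('python', 'python')
# '3 python' (None, 'python')
# '123' (None, None)
```

Both methods return None only when the string contains no lowercase letters at all; otherwise they differ exactly when the first character is not a match.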

findall()

Returns a list of every substring that matches the RE. With [a-z]+, that is one element per word:

result = p.findall("life is too short")
print(result) #['life', 'is', 'too', 'short']
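
Since findall() always returns a list, the "no match" case is an empty list rather than None. A small sketch with the same `[a-z]+` pattern (the variable names are just for illustration):

```python
import re

p = re.compile('[a-z]+')

words = p.findall("life is too short")
print(len(words))            # 4

# No lowercase letters at all: an empty list, not None
nothing = p.findall("12345")
print(nothing)               # []

# So the result can be used in a truth test directly
if not nothing:
    print("No match")
```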

finditer()

Returns an iterator. Each element it yields is a match object.

result = p.finditer("life is too short")
print(result)  # <callable_iterator object at 0x01F5E390>
for r in result:
    print(r)
# <re.Match object; span=(0, 4), match='life'>
# <re.Match object; span=(5, 7), match='is'>
# <re.Match object; span=(8, 11), match='too'>
# <re.Match object; span=(12, 17), match='short'>
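
One thing to keep in mind is that the result of finditer() is a one-shot iterator: once it has been looped over, it is exhausted. A minimal sketch, assuming the same `[a-z]+` pattern:

```python
import re

p = re.compile('[a-z]+')

result = p.finditer("life is too short")

first_pass = [m.group() for m in result]
second_pass = [m.group() for m in result]  # iterator is already exhausted

print(first_pass)   # ['life', 'is', 'too', 'short']
print(second_pass)  # []
```

If the matches are needed more than once, it is probably simplest to store them in a list first, e.g. `matches = list(p.finditer(text))`.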

Match object's 4 methods

Method   Role
group()  Returns the matched string
start()  Returns the index where the match starts
end()    Returns the index just past the last matched character
span()   Returns the (start, end) tuple of the match

m = p.match("python")
m.group() # 'python'
m.start() # 0 : always 0 for match() called without a start position
m.end() # 6
m.span() # (0, 6)
m = p.search("3 python")
m.group() # 'python'
m.start() # 2
m.end() # 8
m.span() # (2, 8)
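
Because end() is exclusive, start() and end() can be used directly as slice bounds on the original string. A short sketch using the "3 python" example from above:

```python
import re

p = re.compile('[a-z]+')
text = "3 python"

m = p.search(text)
start, end = m.span()

print(text[start:end])                 # python
print(text[start:end] == m.group())    # True
print(end - start == len(m.group()))   # True
```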

Compile options

DOTALL

w/o option:

import re
m = re.match('a.b', 'a\nb')
print(m) # None, since . does not match \n by default

w/ option:

import re
p = re.compile('a.b', re.DOTALL)
m = p.match('a\nb')
print(m) # <re.Match object; span=(0, 3), match='a\nb'>
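
A typical use of re.DOTALL (short form: re.S) is matching a block that spans several lines. The sketch below is only an illustration; the `start.*end` pattern and the sample text are made up for this example.

```python
import re

text = "start\nmiddle\nend"

# Without the flag, .* stops at the first newline, so "end" is never reached
plain = re.search(r'start.*end', text)
print(plain)                     # None

# With re.DOTALL, . also matches \n, so the whole text matches
dotall = re.search(r'start.*end', text, re.DOTALL)
print(dotall.group() == text)    # True
```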

IGNORECASE

p = re.compile('[a-z]', re.I)
p.match('python') # <re.Match object; span=(0, 1), match='p'>
p.match('Python') # <re.Match object; span=(0, 1), match='P'>
p.match('PYTHON') # <re.Match object; span=(0, 1), match='P'>
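
The effect is easier to see with findall() on mixed-case text. A minimal sketch; `p_plain` and `p_icase` are names chosen only for this example:

```python
import re

text = "Life is Too Short"

p_plain = re.compile('[a-z]+')
p_icase = re.compile('[a-z]+', re.IGNORECASE)  # same as re.I

# Without the flag, the uppercase first letters are skipped
print(p_plain.findall(text))  # ['ife', 'is', 'oo', 'hort']

# With the flag, [a-z] also matches A-Z
print(p_icase.findall(text))  # ['Life', 'is', 'Too', 'Short']
```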

MULTILINE

import re
p = re.compile(r"^python\s\w+")

data = """python one
life is too short
python two
you need python
python three"""

print(p.findall(data)) # ['python one']

Without the option, ^ matches only at the start of the entire string. Use re.M (re.MULTILINE) if you want ^ to match at the start of each line.

import re
p = re.compile(r"^python\s\w+", re.MULTILINE)

data = """python one
life is too short
python two
you need python
python three"""

print(p.findall(data)) # ['python one', 'python two', 'python three']
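
re.MULTILINE changes $ in the same way: it then matches at the end of each line, not only at the end of the string. A sketch on the same data; the `\w+$` pattern (last word of a line) is an extra example and not part of the notes above.

```python
import re

data = """python one
life is too short
python two
you need python
python three"""

# Without the flag, $ matches only at the end of the whole string
print(re.findall(r"\w+$", data))
# ['three']

# With the flag, $ matches at the end of every line
print(re.findall(r"\w+$", data, re.MULTILINE))
# ['one', 'short', 'two', 'python', 'three']
```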

VERBOSE

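w/o option, a longer RE is hard to read:

charref = re.compile(r'&[#](0[0-7]+|[0-9]+|x[0-9a-fA-F]+);')

w/ re.X (re.VERBOSE), whitespace in the RE is ignored and # starts a comment, so the same RE can be spread over several lines: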

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

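To check that the two spellings are really the same RE, both can be run against a few inputs. This is a minimal sketch; the sample strings are made up for the test.

```python
import re

compact = re.compile(r'&[#](0[0-7]+|[0-9]+|x[0-9a-fA-F]+);')

verbose = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

samples = ["&#65;", "&#x41;", "&#0101;", "&#;", "&amp;"]

compact_results = [bool(compact.match(s)) for s in samples]
verbose_results = [bool(verbose.match(s)) for s in samples]

print(compact_results)                     # [True, True, True, False, False]
print(compact_results == verbose_results)  # True
```

Note that the # in &[#] sits inside a character class, which is why re.VERBOSE does not treat it as the start of a comment.
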
Backslash \

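If an RE contains no \, it is the same RE w/ or w/o the raw string prefix r. Once it does contain \ (for example \s or \b), write it as a raw string so that Python passes the backslash to re unchanged.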
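
The following sketch shows both cases. It uses \b (word boundary in an RE, but a backspace character in a normal Python string) because that escape changes meaning silently; the sample text is reused from the findall() example.

```python
import re

# No backslash: the raw prefix makes no difference
print('ab*' == r'ab*')          # True

# With a backslash: '\b' is one backspace character, r'\b' is two characters
print(len('\b'), len(r'\b'))    # 1 2

text = "life is too short"

p_raw = re.compile(r'\bis\b')   # word boundary, "is", word boundary
p_plain = re.compile('\bis\b')  # backspace, "is", backspace

print(p_raw.findall(text))      # ['is']
print(p_plain.findall(text))    # []
```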

Reference: https://wikidocs.net/4308