Regular expression

Why use RE?

Suppose a text contains social security numbers and you want to mask the last seven digits with *. Without RE, you could write:

data = """
shaun 800905-1049118
raina 700905-1059119
"""

result = []
for line in data.split("\n"):
    word_result = []
    for word in line.split(" "):
        if len(word) == 14 and word[:6].isdigit() and word[7:].isdigit():
            word = word[:6] + "-" + "*******"
        word_result.append(word)
    result.append(" ".join(word_result))
print("\n".join(result))

With RE, the same task takes two lines:

import re

data = """
shaun 800905-1049118
raina 700905-1059119
"""

pat = re.compile(r"(\d{6})[-]\d{7}")  # raw strings keep \d and \g from being treated as string escapes
print(pat.sub(r"\g<1>-*******", data))

Voila! This is the power of RE!

RE patterns can be visualized at regexr.com


Reference: http://www.wikidocs.net/1642

Meta characters

RE has meta characters, which carry a special meaning instead of matching themselves: . ^ $ * + ? { } [ ] \ | ( )

Character class [ ]

Dot .

Repeat *

Repeat +

Repeat ({m,n}, ?)

The meta characters above consume input: the match position advances as the pattern is matched.
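
A minimal sketch of the consuming meta characters listed above, using only Python's standard re module. The sample strings are made up for illustration; the spans and matched text show how far each pattern advances through the input.

```python
import re

# Character class: [abc] matches exactly one of a, b, or c.
char_class = re.search(r"[abc]", "dude, be quiet")

# Dot: matches any single character except a newline (by default).
dot = re.match(r"a.b", "a0b")

# * repeats zero or more times, + repeats one or more times.
star = re.match(r"ca*t", "ct")
plus = re.match(r"ca+t", "ct")

# {m,n} repeats between m and n times; ? is shorthand for {0,1}.
braces = re.match(r"ca{2,3}t", "caaat")
optional = re.match(r"colou?r", "color")

print(char_class.span())   # (6, 7)
print(dot.group())         # a0b
print(star.group())        # ct
print(plus)                # None
print(braces.group())      # caaat
print(optional.group())    # color
```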


The meta characters below are zero-width: they check a condition without advancing the match position.

Zero-width assertion meta characters

Or |

m = re.match('Crow|Servo', 'CrowHello')
print(m) # <re.Match object; span=(0, 4), match='Crow'>

Start ^

print(re.search('^Life', 'Life is too short')) # <re.Match object; span=(0, 4), match='Life'>
print(re.search('^Life', 'My Life')) # None

End $

print(re.search('short$', 'Life is too short')) # <re.Match object; span=(12, 17), match='short'>
print(re.search('short$', 'Life is too short, you need python')) # None

\A

\Z
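
As a rough sketch of how \A and \Z differ from ^ and $: the difference only shows up with the re.MULTILINE flag, where ^ and $ also match at line boundaries while \A and \Z still match only at the very start and very end of the whole string. The sample text is an assumption for illustration.

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
caret = re.findall(r"^\w+", text, re.MULTILINE)
dollar = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z ignore re.MULTILINE and anchor to the whole string.
abs_start = re.findall(r"\A\w+", text, re.MULTILINE)
abs_end = re.findall(r"\w+\Z", text, re.MULTILINE)

print(caret)      # ['first', 'second']
print(dollar)     # ['line', 'line']
print(abs_start)  # ['first']
print(abs_end)    # ['line']
```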

\b

p = re.compile(r'\bclass\b')
print(p.search('no class at all')) # <re.Match object; span=(3, 8), match='class'>
print(p.search('the declassified algorithm')) # None

\B

p = re.compile(r'\Bclass\B')
print(p.search('no class at all')) # None
print(p.search('the declassified algorithm')) # <re.Match object; span=(6, 11), match='class'>
print(p.search('one subclass is')) # None

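\B matches only where there is no word boundary, so if class touches whitespace (or the start or end of the string) on either side, the result is None.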

Grouping ( )

m = re.search('(ABC)+', 'ABCABCABC OK?')
print(m) # <re.Match object; span=(0, 9), match='ABCABCABC'>
print(m.group()) # ABCABCABC

Without grouping:

p = re.compile(r"\w+\s+\d+[-]\d+[-]\d+")
m = p.search("park 010-1234-1234")

With grouping:

p = re.compile(r"(\w+)\s+(\d+[-]\d+[-]\d+)") # group 1: name, group 2: phone number
m = p.search("park 010-1234-1234")
print(m.group(1)) # park
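print(m.group(2)) # 010-1234-1234

group(index)   description
group(0)       The entire matched string
group(1)       String matched by the first group
group(2)       String matched by the second group
group(n)       String matched by the n-th group

Groups can be nested; they are numbered by the position of their opening parenthesis:

p = re.compile(r"(\w+)\s+((\d+)[-]\d+[-]\d+)")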
m = p.search("park 010-1234-1234")
print(m.group(3)) # 010

Backreference \1
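
A small sketch of a backreference, assuming the standard re module: \1 must match the same text that group 1 just captured, so the pattern below finds a word that is immediately repeated. The sample sentence is chosen only for illustration.

```python
import re

# (\w+) captures a word; \1 then requires that exact word again.
p = re.compile(r"\b(\w+)\s+\1\b")
m = p.search("Paris in the the spring")

print(m.group())   # the the
print(m.group(1))  # the
print(m.span())    # (9, 16)
```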

Group naming (?P<Groupname>RE)

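(?P<name>\w+)\s+((\d+)[-]\d+[-]\d+)

Note that (\w+) from the example above became (?P<name>\w+). The group can then be retrieved by its name instead of its index. The example below uses name123 as the group name: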

p = re.compile(r"(?P<name123>\w+)\s+((\d+)[-]\d+[-]\d+)")
m = p.search("park 010-1234-1234")
print(m.group("name123")) # park
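
Two related features of named groups, sketched here under the assumption that the phone number is also given a name (phone is a label picked for this example): groupdict() returns all named groups at once, and (?P=name) is the named form of a backreference. Named groups still keep their numeric index.

```python
import re

p = re.compile(r"(?P<name>\w+)\s+(?P<phone>\d+[-]\d+[-]\d+)")
m = p.search("park 010-1234-1234")

# All named groups as a dictionary.
info = m.groupdict()
print(info)        # {'name': 'park', 'phone': '010-1234-1234'}

# A named group is still reachable by its index.
print(m.group(1))  # park

# (?P=word) refers back to whatever the group named word matched.
repeated = re.search(r"(?P<word>\b\w+)\s+(?P=word)", "Paris in the the spring")
print(repeated.group())  # the the
```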

Lookahead Assertions

m = re.search(".+:", "http://google.com")
print(m.group()) # http:

Suppose you want only http from the example above, but the pattern is already so complex that adding another group is not an option. In that case, use a lookahead assertion, which checks what follows without including it in the match.

Positive lookahead assertions RE1(?=RE2)

ex1)

m = re.search(".+(?=:)", "http://google.com")
print(m.group()) # http

ex2) .*[.].*$ matches name + . + extension.

Given foo.bar, goo.bat, and hoo.io, suppose you want to exclude bat files. The first thing you might think of is .*[.][^b].*$. However, this also excludes foo.bar, because it rejects every extension that starts with b.

.*[.]([^b]..|.[^a].|..[^t])$ keeps foo.bar, but it only accepts three-character extensions, so hoo.io is rejected.

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$ does the job.

However, the expression becomes much more complex if you also want to exclude exe files. This is where a negative lookahead assertion helps.
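
The behaviour of the three hand-written patterns can be checked with a short sketch. The accepted helper is hypothetical, written only for this example; it returns the filenames that a pattern matches.

```python
import re

filenames = ["foo.bar", "goo.bat", "hoo.io"]

def accepted(pattern):
    """Return the filenames matched by pattern (hypothetical helper)."""
    return [name for name in filenames if re.match(pattern, name)]

# Rejects every extension starting with b, so foo.bar is lost too.
first_try = accepted(r".*[.][^b].*$")

# Accepts only three-character extensions, so hoo.io is lost.
three_chars = accepted(r".*[.]([^b]..|.[^a].|..[^t])$")

# Optional characters allow shorter extensions as well.
flexible = accepted(r".*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$")

print(first_try)    # ['hoo.io']
print(three_chars)  # ['foo.bar']
print(flexible)     # ['foo.bar', 'hoo.io']
```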

Negative lookahead assertions RE1(?!RE2)

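.*[.](?!bat$).*$ : matches when the extension is not bat. The lookahead checks the text after the dot without consuming it. If bat$ matches at that point, the whole pattern fails; otherwise the rest of the pattern is tried. The $ inside the lookahead ensures that an extension that merely starts with bat, such as batch, is still allowed.

.*[.](?!bat$|exe$).*$ : matches when the extension is neither bat nor exe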
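
A minimal sketch of both patterns applied to a made-up list of filenames. The strict variant at the end is not from the text above: it is an assumed refinement that replaces the trailing .* with [^.]* so that the check applies to the last dot only, which matters for names containing more than one dot.

```python
import re

filenames = ["foo.bar", "goo.bat", "hoo.io", "setup.exe", "sample.batch"]

no_bat = re.compile(r".*[.](?!bat$).*$")
no_bat_exe = re.compile(r".*[.](?!bat$|exe$).*$")

kept_no_bat = [n for n in filenames if no_bat.match(n)]
kept_no_bat_exe = [n for n in filenames if no_bat_exe.match(n)]

print(kept_no_bat)      # ['foo.bar', 'hoo.io', 'setup.exe', 'sample.batch']
print(kept_no_bat_exe)  # ['foo.bar', 'hoo.io', 'sample.batch']

# With several dots, the leading .* can backtrack to an earlier dot,
# so the original pattern still accepts this bat file.
multi_dot = "archive.tar.bat"
print(bool(no_bat.match(multi_dot)))  # True

# Assumed refinement: [^.]* forces the checked dot to be the last one.
strict = re.compile(r".*[.](?!bat$)[^.]*$")
print(bool(strict.match(multi_dot)))  # False
```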

VOILA!

Lookbehind Assertions

Positive lookbehind assertions (?<=RE2)RE1

Matches RE1 only if it is immediately preceded by text that matches RE2.

Negative lookbehind assertions (?<!RE2)RE1

Matches RE1 only if it is not immediately preceded by text that matches RE2.
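
A short sketch of both lookbehind forms, using a made-up sample string. Note that Python's re module requires the lookbehind pattern to have a fixed width; a single escaped dollar sign satisfies that here.

```python
import re

text = "price: $42, quantity: 7"

# Positive lookbehind: digits immediately preceded by a dollar sign.
priced = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers NOT immediately preceded by a dollar sign.
# \b keeps the pattern from matching the tail of a longer number such as the 2 in 42.
unpriced = re.findall(r"(?<!\$)\b\d+", text)

print(priced)    # ['42']
print(unpriced)  # ['7']
```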

String replacement sub() and subn()

p = re.compile('(blue|white|red)')
p.sub('colour', 'blue socks and red shoes') # 'colour socks and colour shoes'
p.sub('colour', 'blue socks and red shoes', count=1) # 'colour socks and red shoes'

subn() does the same, but returns a tuple of the new string and the number of replacements made:

p = re.compile('(blue|white|red)')
p.subn('colour', 'blue socks and red shoes') # ('colour socks and colour shoes', 2)

Using references with the sub method

In the replacement string, \g<groupName> refers to the text matched by the named group.

ex) swap the name and the phone number in the target string

p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub(r"\g<phone> \g<name>", "park 010-1234-1234")) # 010-1234-1234 park

Same result using index instead of name

p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub(r"\g<2> \g<1>", "park 010-1234-1234")) # 010-1234-1234 park

Using a function as the replacement in the sub method

def hexrepl(match):
    # Convert the matched decimal number to a hexadecimal string.
    value = int(match.group())
    return hex(value)

p = re.compile(r'\d+')
print(p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.'))
# Call 0xffd2 for printing, 0xc000 for user code.

hexrepl converts the matched number into a hexadecimal string. When a function is passed to sub, it is called once for each match with the match object as its only argument, and its return value replaces the matched text.

Greedy vs Non-Greedy

s = '<html><head><title>Title</title>'
print(len(s)) # 32
print(re.match('<.*>', s).span()) # (0, 32)
print(re.match('<.*>', s).group()) # <html><head><title>Title</title>

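* is greedy: it matches as much as possible, so <.*> does not stop at <html> but consumes the whole string <html><head><title>Title</title>.

+ is greedy in the same way. Here search returns the first match, and i+ takes every i available at that position:

re.search(r'i+', 'piigiii') # span=(1,3), match='ii'

Adding ? after a quantifier makes it non-greedy, so it matches as little as possible:

print(re.match('<.*?>', s).group()) # <html>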
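
To make the contrast concrete, here is a small sketch that runs the greedy and non-greedy patterns over the same string with findall. The variable names are chosen for this example only.

```python
import re

s = "<html><head><title>Title</title>"

# Greedy: .* runs to the last > in the string, giving a single match.
greedy = re.findall(r"<.*>", s)

# Non-greedy: .*? stops at the first > each time, giving one match per tag.
lazy = re.findall(r"<.*?>", s)

print(greedy)  # ['<html><head><title>Title</title>']
print(lazy)    # ['<html>', '<head>', '<title>', '</title>']

# The ? suffix works on the other quantifiers as well.
plus_lazy = re.match(r"a+?", "aaa").group()
braces_lazy = re.match(r"a{2,3}?", "aaa").group()
print(plus_lazy)    # a
print(braces_lazy)  # aa
```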

Reference: https://wikidocs.net/4309