Regular expression

Why do you use RE?

search, replace, remove string that matches w/ certain pattern

Suppose you need social security #. You want to hide important digits with *.

Python code w/o re:

data = """
shaun 800905-1049118
raina 700905-1059119
"""

result = []
for line in data.split("\n"):
    word_result = []
    for word in line.split(" "):
        if len(word) == 14 and word[:6].isdigit() and word[7:].isdigit():
            word = word[:6] + "-" + "*******"
        word_result.append(word)
    result.append(" ".join(word_result))
print("\n".join(result))

Python w/ re:

import re

data = """
shaun 800905-1049118
raina 700905-1059119
"""

pat = re.compile("(\d{6})[-]\d{7}")
print(pat.sub("\g<1>-*******", data))

Voila ! This is the power of RE !

RE can be visualized in regexr.com

Reference: http://www.wikidocs.net/1642

Meta characters

RE has meta characters which is used for special method. . ^ $ * + ? { } [ ] \ | ( )

Character class `[ ]`

Matches any 1 char in [ ]
- [abc.^]: a or b or c or . or ^
- [asd]: any 1 char in asd a (O) wsx (O) zxc (X) -> doesn’t include a or s or d.`j
- [ab][cd]: any 2 char w/ (a or b) + (c or d)
- used as range inside [ ]
- [f-j]: any 1 char in fghij
- [a-zA-Z]: any 1 char in every alphabet
- [0-9]: any 1 char in every numeric
- range order matters !!
  - [A-z] (O) Upper case -> lower case
  - [Z-a] (X)
  - [0-6] (O)
  - [5-2] (X)
Any 1 char except characters in [ ]: ^ used as not.
- [^0-9]: any 1 char except numbers
- [^A-D]: 1 char (including lower cases !) except A-D
Some shortcuts
- \d : every digits (== [0-9])
- \D : every char except digits (== [^0-9])
- \s : every white space (== [ \t\n\r\f\v])
- \S : every char except white space (== [^ \t\n\r\f\v])
- \w : alphanumeric (== [a-zA-Z0-9_])
- \W : every chat except alphanumeric (== [^a-zA-Z0-9_])

Dot `.`

Match any 1 char except \n

re.DOTALL can include \n exclusively
- a.b: a + single char in all char + b
  - aab (O)
  - a1b (O)
  - abe (X)
- [ab].: 2 chars starting w/ a or b
- ...: any chars w/ size of 3 (개 단위)
  - ABCDEFGHIJK: ABC DEF GHI (O) JK (X)
. inside [ ] refers to . itself, not all chararacters.
- a[.]b: a + . char + b
  - a.b (O)
  - acb (X)

Repeat `*`

Character in front of * repeats 0 ~ inf (Due to memory limit, 0.2 billion repeat)
- ca*t: c + a (0 ~ inf times) + t
  - ct (O)
  - cat (O)
  - caaaaat (O)
  - cadt (X)

Repeat `+`

Character in front of * repeats 1 ~ inf (Due to memory limit, 0.2 billion repeat)
- ca+t: c + a (1 ~ inf times) + t
  - ct (X)
  - cat (O)
  - caaaaat (O)
  - cadt (X)

Repeat `({m,n}, ?)`

Set repeat count (range inclusive)
- ca{2}t: c + aa + t
  - cat (X)
  - caat (O)
- ca{2,4}t
  - cat (X)
  - caat (O)
  - caaaat (O)
  - cadaat (X)
- [aeo]{1,3}: a e o 조합으로 만들수 있는 1개 ~ 3개단위 매치
Either m, n can be empty
- {,n}: from 0 to n
- {m,}: from m to inf
- {0,} == *
- {1,} == +
? == {0, 1}
- ab?c: a + b 있어도 되고 없어도 됨 + c
  - abc (O)
  - ac (O)
  - adc (X) b 는 있어도 되고 없어도 되지만 다른건 용납 못함!

Meta chars above update cursor as it finds pattern

Meta chars below do not change the cursor while finding all

Zerowidth assertions meta characters

Or `|`

Subpattern re1 | re2
asd|[zxc]: asd or z or x or c
(Mon|Tues)day: Monday or Tuesday

m = re.match('Crow|Servo', 'CrowHello')
print(m) # <re.Match object; span=(0, 4), match='Crow'>

Home `^`

Matched word (not a certain pattern in one word) MUST start w/ the keyword.
^Word: Word at very first

print(re.search('^Life', 'Life is too short')) # <re.Match object; span=(0, 4), match='Life'>
print(re.search('^Life', 'My Life')) # None

End `$`

Match word must end w/ the keyword
Word$: Word at very last

print(re.search('short$', 'Life is too short')) # <re.Match object; span=(12, 17), match='short'>
print(re.search('short$', 'Life is too short, you need python')) # None

`\A`

\A means it matches w/ string’s start.
Same meaning w/ ^ but iterprets differently w/ re.M option:
- ^ matches w/ start of each line.
- \A matches w/ start of entire string regardless of line.

`\Z`

\Z means it matches w/ string’s end.
Same rule applies w/ re.M option like \A

`\b`

\b : Word boundary (Usually separated by white space)
- \bclass\b: \s + class + \s

p = re.compile(r'\bclass\b')
print(p.search('no class at all')) # <re.Match object; span=(3, 8), match='class'>
print(p.search('the declassified algorithm')) # None

In Python, \b means backspace: Must use raw string indicator r

`\B`

Opposite of \b: words not separated by white space.

p = re.compile(r'\Bclass\B')
print(p.search('no class at all')) # None
print(p.search('the declassified algorithm')) # <re.Match object; span=(6, 11), match='class'>
print(p.search('one subclass is')) # None

if there is white space in one end, it gives None.

Grouping `( )`

Allow to find grouped pattern e.g., (ABC)+ -> ABCABCAB….

m = re.search('(ABC)+', 'ABCABCABC OK?')
print(m) # <re.Match object; span=(0, 9), match='ABCABCABC'>
print(m.group()) # ABCABCABC

Find specific string from matched string

p = re.compile(r"\w+\s+\d+[-]\d+[-]\d+")
m = p.search("park 010-1234-1234")

with grouping :

p = re.compile(r"(\w+)\s+(\d+[-]\d+[-]\d+)") # group1 : name, roup2: phone #
m = p.search("park 010-1234-1234")
print(m.group(1)) # park
print(m.group(2)) # 010-1234-1234

group index:

group(index)	description
group(0)	All matched string
group(1)	String in first group
group(2)	String in second group
group(n)	String in n-th group

Nested grouping
- index increases from outer to inner.

p = re.compile(r"(\w+)\s+((\d+)[-]\d+[-]\d+)")
m = p.search("park 010-1234-1234")
print(m.group(3)) # 010

Backreference `\1`

Backreferencing: re-use previously defined group
- \n to use n-th group
  
  \1 means 1st group in RE.
e.g. (\b\w+)\s+\1 : (word) + “ “ + same word in group
```
p = re.compile(r'(\b\w+)\s+\1')
p.search('Paris in the the spring').group() # 'the the'
```

Group naming `(?P<`Groupname`>`RE`)`

Set nickname for the group when there’s a lot of groups
- Thus, free from complexity and index

(?P<name>\w+)\s+((\d+)[-]\d+[-]\d+) :

notice (\w+) became (?P<name>\w+) from above example

p = re.compile(r"(?P<name123>\w+)\s+((\d+)[-]\d+[-]\d+)")
m = p.search("park 010-1234-1234")
print(m.group("name123")) # park

Backreference by group name (?P=groupname)

p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
p.search('Paris in the the spring').group() # 'the the'

Lookahead Assertions

m = re.search(".+:", "http://google.com")
print(m.group()) # http:

If you want to get only http from above example with condition that you cannot group anymore due to too much complexity.

e positive lookahead assertions: (?=RE)

Must match w/ RE and matched obj is not consumed
negative lookahead assertions: (?!RE)
- Must not match w/ RE and matched obj is not consumed

Positive lookahead assertions RE1`(?=`RE2`)`

RE2 is included in matching but does not return it.

ex1)

m = re.search(".+(?=:)", "http://google.com")
print(m.group()) # http

ex2) .*[.].*$ -> name + . + extension If there’s foo.bar, goo.bat, hoo.io and you want to declude bat file, first thing you could think of might be .*[.][^b].*$. However, this also decludes foo.bar

.*[.]([^b]..|.[^a].|..[^t])$ cannot check 2 digit extensions.

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$ This will do the job.

However, it requires much more complex expression if you want to declude exe file as well -> use negative lookahead assertion

Negative lookahead assertions RE1`(?!`RE2`)`

.*[.](?!bat$).*$ : allow when extension is not bat. While searching, the string is not consumed; therefore, it returns matched obj when it concludes that the pattern does not include bat.

.*[.](?!bat$|exe$).*$ : return if extension is not exe or bat

VOILA!

Lookbehind Assertions

Positive lookbehind assertions `(?<=`RE2`)`RE1

Return if it matches w/ RE2 before RE1

Negative lookbehind assertions `(?<!`RE2`)`RE1

Return if it does not matches w/ RE2 before RE1

String replacement `sub()` and `subn()`

patternObj .sub (replacement, targetString [, count = #])

p = re.compile('(blue|white|red)')
p.sub('colour', 'blue socks and red shoes') # 'colour socks and colour shoes'

If you want to replace only once:

p.sub('colour', 'blue socks and red shoes', count=1) # 'colour socks and red shoes'

subn() returns tuple (Replaced result, replacement count)

p = re.compile('(blue|white|red)')
p.subn( 'colour', 'blue socks and red shoes') # ('colour socks and colour shoes', 2)

Using referece w/ `sub` method

\g<groupName> can reference group name in RE.

ex) swap name and number from target

p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub("\g<phone> \g<name>", "park 010-1234-1234")) # 010-1234-1234 park

Same result using index instead of name

p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub("\g<2> \g<1>", "park 010-1234-1234")) # 010-1234-1234 park

Using function as parameter in `sub` method

Function can be used in 1st parameter

def hexrepl(match):
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'

hexrepl function converts match obj into hexadecimal value. If function is used in sub method, returned match obj from RE is used in function’s first parameter. Then, matched string becomes function’s return value.

Greedy vs Non-Greedy

Return pattern that matches as much as possible

s = '<html><head><title>Title</title>'
print(len(s)) # 32
print(re.match('<.*>', s).span()) # (0, 32)
print(re.match('<.*>', s).group()) # <html><head><title>Title</title>

* is too greedy that RE <.*> does not return just <html>, but consumes all <html><head><title>Title</title>

Return first found pattern

re.search(r'i+', 'piigiii') # span=(1,3), match='ii'

non-greedy meta char ? limits greedy *.

Limit to minimum repeat

print(re.match('<.*?>', s).group()) # <html>

? can be used like *?, +?, ??, {m,n}?

Reference: https://wikidocs.net/4309

Regular expression

Why do you use RE?

Meta characters

Character class [ ]

Dot .

Repeat *

Repeat +

Repeat ({m,n}, ?)

Zerowidth assertions meta characters

Or |

Home ^

End $

\A

\Z

\b

\B

Grouping ( )

Backreference \1

Group naming (?P<Groupname>RE)