Suppose you are handling social security numbers and want to mask the sensitive digits with *. Without RE, the code looks like this:
data = """
shaun 800905-1049118
raina 700905-1059119
"""
result = []
for line in data.split("\n"):
    word_result = []
    for word in line.split(" "):
        if len(word) == 14 and word[:6].isdigit() and word[7:].isdigit():
            word = word[:6] + "-" + "*******"
        word_result.append(word)
    result.append(" ".join(word_result))
print("\n".join(result))
With RE, the same task takes two lines:
import re
data = """
shaun 800905-1049118
raina 700905-1059119
"""
pat = re.compile(r"(\d{6})[-]\d{7}")  # raw strings keep \d and \g from being read as string escapes
print(pat.sub(r"\g<1>-*******", data))
Voila! This is the power of RE!
RE patterns can be visualized at regexr.com
Reference: http://www.wikidocs.net/1642
RE has metacharacters, which carry special meanings:
. ^ $ * + ? { } [ ] \ | ( )
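[ ]
[ ] is a character class: it matches any 1 of the characters listed inside.
[abc.^]: a or b or c or . or ^
[asd]: any 1 char in asd
  a (O)
  wsx (O)
  zxc (X) -> doesn't include a or s or d
[ab][cd]: any 2 chars: (a or b) followed by (c or d)
- is used as a range inside [ ]
[f-j]: any 1 char in fghij
[a-zA-Z]: any 1 letter of the alphabet
[0-9]: any 1 digit
[A-z]: (O) upper case -> lower case is ascending order in ASCII (the range also picks up the punctuation between Z and a)
[Z-a]: (O) also ascending in ASCII, but it only covers Z, a, and the punctuation between them
[0-6]: (O)
[5-2]: (X) a descending range is an error
^ inside [ ]: used as "not" when it is the first character
[^0-9]: any 1 char except digits
[^A-D]: any 1 char (including lower case!) except A-D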
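The character class rules above can be checked with a short sketch. The helper `has_match` and the test strings are made up for illustration; only the standard `re` module is assumed.

```python
import re

def has_match(pattern, text):
    """Hypothetical helper: True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# [asd] needs at least one of a, s, d somewhere in the text.
asd_results = [has_match("[asd]", t) for t in ("a", "wsx", "zxc")]
print(asd_results)  # [True, True, False]

# Two classes in a row match two consecutive characters.
two_chars = re.findall("[ab][cd]", "ac bd ab")
print(two_chars)  # ['ac', 'bd']

# ^ as the first character negates the class.
non_digits = re.findall("[^0-9]", "a1b2")
not_a_to_d = re.findall("[^A-D]", "ABab")
print(non_digits, not_a_to_d)  # ['a', 'b'] ['a', 'b']

# [A-z] is a valid range, but it also covers punctuation such as _.
wide_range = re.findall("[A-z]", "Z_a1")
print(wide_range)  # ['Z', '_', 'a']

# A descending range is rejected when the pattern is compiled.
try:
    re.compile("[5-2]")
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False
print(reversed_range_ok)  # False
```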
\d: any digit (== [0-9])
\D: any char except digits (== [^0-9])
\s: any whitespace char (== [ \t\n\r\f\v])
\S: any char except whitespace (== [^ \t\n\r\f\v])
\w: any alphanumeric char or underscore (== [a-zA-Z0-9_])
\W: any char except alphanumerics and underscore (== [^a-zA-Z0-9_])
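A small sketch of the shorthand classes on a made-up sample string. Note one assumption: the bracket equivalents listed above hold for ASCII text; for `str` patterns Python 3 is Unicode-aware by default, so `\d` and `\w` can also match non-ASCII digits and letters unless `re.ASCII` is passed.

```python
import re

sample = "a1 _b\t2"  # made-up string: letters, digits, underscore, space, tab

digits = re.findall(r"\d", sample)
non_digits = re.findall(r"\D", sample)
spaces = re.findall(r"\s", sample)
non_spaces = re.findall(r"\S", sample)
word_chars = re.findall(r"\w", sample)
non_word_chars = re.findall(r"\W", sample)

print(digits)          # ['1', '2']
print(non_digits)      # ['a', ' ', '_', 'b', '\t']
print(spaces)          # [' ', '\t']
print(non_spaces)      # ['a', '1', '_', 'b', '2']
print(word_chars)      # ['a', '1', '_', 'b', '2']
print(non_word_chars)  # [' ', '\t']

# For this ASCII sample the shorthand and the bracket form agree.
same_as_brackets = word_chars == re.findall(r"[a-zA-Z0-9_]", sample)
print(same_as_brackets)  # True
```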
.
. matches any single char except the newline \n. With the re.DOTALL flag it matches \n as well.
a.b: a + any single char + b
  aab (O)
  a1b (O)
  abe (X)
[ab].: any 2 chars starting w/ a or b
...: any 3 chars (matched in units of 3)
  ABCDEFGHIJK: ABC, DEF, GHI (O); JK (X)
. inside [ ] refers to the literal . itself, not to any character.
a[.]b: a + literal . + b
  a.b (O)
  acb (X)
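The dot rules can be tried out as follows. This is only a sketch: `re.fullmatch` is used so that each test string has to match from start to end, which is an assumption about how the (O)/(X) marks are meant to be read.

```python
import re

# a.b: exactly one arbitrary character between a and b.
dot_results = [bool(re.fullmatch(r"a.b", t)) for t in ("aab", "a1b", "abe")]
print(dot_results)  # [True, True, False]

# By default . does not match a newline; re.DOTALL lifts that restriction.
without_dotall = re.fullmatch(r"a.b", "a\nb")
with_dotall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)
print(without_dotall is None, with_dotall is not None)  # True True

# ... consumes the text three characters at a time; the leftover JK is too short.
chunks = re.findall(r"...", "ABCDEFGHIJK")
print(chunks)  # ['ABC', 'DEF', 'GHI']

# Inside [ ] the dot is a literal dot.
literal_dot = [bool(re.fullmatch(r"a[.]b", t)) for t in ("a.b", "acb")]
print(literal_dot)  # [True, False]
```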
*
* repeats the preceding item 0 ~ inf times (due to memory limits, the cap is reportedly around 0.2 billion repeats)
ca*t: c + a (0 ~ inf times) + t
  ct (O)
  cat (O)
  caaaaat (O)
  cadt (X)
+
+ repeats the preceding item 1 ~ inf times (due to memory limits, the cap is reportedly around 0.2 billion repeats)
ca+t: c + a (1 ~ inf times) + t
  ct (X)
  cat (O)
  caaaaat (O)
  cadt (X)
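A quick sketch comparing * and + on the same word list. `re.fullmatch` is an assumption about how the (O)/(X) marks are meant: the whole word must match the pattern.

```python
import re

words = ["ct", "cat", "caaaaat", "cadt"]

# ca*t accepts zero or more a's, so ct passes.
star_matches = [w for w in words if re.fullmatch(r"ca*t", w)]
# ca+t needs at least one a, so ct is rejected.
plus_matches = [w for w in words if re.fullmatch(r"ca+t", w)]

print(star_matches)  # ['ct', 'cat', 'caaaaat']
print(plus_matches)  # ['cat', 'caaaaat']
```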
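{m,n}, ?
{m,n} repeats the preceding item from m to n times.
ca{2}t: c + aa + t
  cat (X)
  caat (O)
ca{2,4}t: c + a (2 ~ 4 times) + t
  cat (X)
  caat (O)
  caaaat (O)
  cadaat (X)
[aeo]{1,3}: matches runs of 1 ~ 3 chars made up of a, e, and o
m, n can be empty
{,n}: from 0 to n
{m,}: from m to inf
{0,} == *
{1,} == +
? == {0,1}
ab?c: a + optional b + c
  abc (O)
  ac (O)
  adc (X)
b may be present or absent, but no other char is tolerated in its place!
The metacharacters above consume characters: the cursor advances as the pattern is matched.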
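The counted repetitions and ? can be checked the same way. This is a sketch with made-up inputs; `re.fullmatch` is assumed so that each word has to match as a whole.

```python
import re

words = ["cat", "caat", "caaaat", "cadaat"]

exactly_two = [w for w in words if re.fullmatch(r"ca{2}t", w)]
two_to_four = [w for w in words if re.fullmatch(r"ca{2,4}t", w)]
print(exactly_two)  # ['caat']
print(two_to_four)  # ['caat', 'caaaat']

# ? makes the preceding b optional, but nothing else may stand in for it.
optional_b = [w for w in ("abc", "ac", "adc") if re.fullmatch(r"ab?c", w)]
print(optional_b)  # ['abc', 'ac']

# [aeo]{1,3} grabs runs of up to three characters drawn from a, e, o.
vowel_runs = re.findall(r"[aeo]{1,3}", "aeoaeb")
print(vowel_runs)  # ['aeo', 'ae']
```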
The metacharacters below do not move the cursor while matching: they consume no characters.
|
re1|re2: re1 or re2
asd|[zxc]: asd or z or x or c
(Mon|Tues)day: Monday or Tuesday
m = re.match('Crow|Servo', 'CrowHello')
print(m) # <re.Match object; span=(0, 4), match='Crow'>
^
^Word: Word at the very start
print(re.search('^Life', 'Life is too short')) # <re.Match object; span=(0, 4), match='Life'>
print(re.search('^Life', 'My Life')) # None
$
Word$: Word at the very end
print(re.search('short$', 'Life is too short')) # <re.Match object; span=(12, 17), match='short'>
print(re.search('short$', 'Life is too short, you need python')) # None
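\A
\A matches at the start of the string. It is the same as ^, but the two are interpreted differently w/ the re.M option:
^ matches at the start of each line.
\A matches only at the start of the entire string, regardless of lines.
\Z
\Z matches at the end of the string. Like \A, it ignores re.M and matches only at the end of the entire string, while $ then matches at the end of each line.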
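The difference only shows up with multi-line input, so here is a minimal sketch on a made-up two-line string.

```python
import re

text = "first line\nsecond line"  # made-up two-line input

# Without re.M, ^ matches only at the very start of the string.
caret_plain = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after each newline.
caret_multiline = re.findall(r"^\w+", text, re.M)
# \A ignores re.M and still matches only at the start of the string.
start_only = re.findall(r"\A\w+", text, re.M)

# The same contrast holds for $ and \Z at the end.
dollar_multiline = re.findall(r"\w+$", text, re.M)
end_only = re.findall(r"\w+\Z", text, re.M)

print(caret_plain)       # ['first']
print(caret_multiline)   # ['first', 'second']
print(start_only)        # ['first']
print(dollar_multiline)  # ['line', 'line']
print(end_only)          # ['line']
```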
\b
\b: word boundary (usually where a word meets whitespace, but punctuation and the string's start or end count too)
\bclass\b: class as a whole word, w/ a word boundary on each side
p = re.compile(r'\bclass\b')
print(p.search('no class at all')) # <re.Match object; span=(3, 8), match='class'>
print(p.search('the declassified algorithm')) # None
In an ordinary Python string literal \b means backspace, so the raw string indicator r must be used.
\B
\B: the opposite of \b; it matches where there is no word boundary.
p = re.compile(r'\Bclass\B')
print(p.search('no class at all')) # None
print(p.search('the declassified algorithm')) # <re.Match object; span=(6, 11), match='class'>
print(p.search('one subclass is')) # None
If either end of class sits on a word boundary (e.g. next to white space), the result is None.
( )
( ) groups part of a pattern, so a quantifier such as + applies to the whole group.
m = re.search('(ABC)+', 'ABCABCABC OK?')
print(m) # <re.Match object; span=(0, 9), match='ABCABCABC'>
print(m.group()) # ABCABCABC
p = re.compile(r"\w+\s+\d+[-]\d+[-]\d+")
m = p.search("park 010-1234-1234")
With grouping, each part can be extracted separately:
p = re.compile(r"(\w+)\s+(\d+[-]\d+[-]\d+)") # group 1: name, group 2: phone #
m = p.search("park 010-1234-1234")
print(m.group(1)) # park
print(m.group(2)) # 010-1234-1234
group(index) | description |
---|---|
group(0) | All matched string |
group(1) | String in first group |
group(2) | String in second group |
group(n) | String in n-th group |
Groups can be nested; they are numbered by the order of their opening parentheses:
p = re.compile(r"(\w+)\s+((\d+)[-]\d+[-]\d+)")
m = p.search("park 010-1234-1234")
print(m.group(3)) # 010
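To see the numbering of nested groups at a glance, `groups()` and `span()` can be printed for the same pattern. A small sketch reusing the example above:

```python
import re

p = re.compile(r"(\w+)\s+((\d+)[-]\d+[-]\d+)")
m = p.search("park 010-1234-1234")

# groups() lists groups 1..n in order of their opening parenthesis.
all_groups = m.groups()
print(all_groups)  # ('park', '010-1234-1234', '010')

# span(i) shows where each group sits; group 3 lies inside group 2.
spans = [m.span(i) for i in range(4)]
print(spans)  # [(0, 18), (0, 4), (5, 18), (5, 8)]
```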
\1
Backreferencing: re-use the text matched by a previously defined group.
\n (n = group number) refers to the n-th group, so \1 means the 1st group in the RE.
e.g. (\b\w+)\s+\1: (word) + white space + the same word again
p = re.compile(r'(\b\w+)\s+\1')
p.search('Paris in the the spring').group() # 'the the'
(?P<groupname>RE)
Gives the group a name.
(?P<name>\w+)\s+((\d+)[-]\d+[-]\d+): notice that (\w+) from the example above became (?P<name>\w+)
p = re.compile(r"(?P<name123>\w+)\s+((\d+)[-]\d+[-]\d+)")
m = p.search("park 010-1234-1234")
print(m.group("name123")) # park
(?P=groupname)
Backreference to a named group, used like \1 for numbered groups.
p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
p.search('Paris in the the spring').group() # 'the the'
m = re.search(".+:", "http://google.com")
print(m.group()) # http:
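Suppose you want only http from the example above, but the pattern is already too complex to add another group.
Use lookahead assertions: positive (?=RE) and negative (?!RE).
(?=RE2)
Positive lookahead: the match succeeds only if RE2 matches at the current position, and RE2 itself is not consumed.
ex1)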
m = re.search(".+(?=:)", "http://google.com")
print(m.group()) # http
ex2) .*[.].*$ -> name + . + extension
Given foo.bar, goo.bat, and hoo.io, suppose you want to exclude bat files. The first thing you might think of is .*[.][^b].*$. However, this also excludes foo.bar.
.*[.]([^b]..|.[^a].|..[^t])$ keeps foo.bar, but it cannot match 2-character extensions.
.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$
This will do the job.
However, the expression gets much more complex if you want to exclude exe files as well -> use a negative lookahead assertion.
(?!RE2)
.*[.](?!bat$).*$: allow the name when the extension is not bat.
While the lookahead is checked, the string is not consumed; the match object is returned only once the engine has confirmed that bat does not follow the dot.
.*[.](?!bat$|exe$).*$: match only if the extension is neither exe nor bat.
VOILA!
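A minimal sketch of the negative lookahead filter. The file names foo.bar, goo.bat, and hoo.io come from the text; setup.exe is a made-up extra to exercise the exe branch.

```python
import re

not_bat_or_exe = re.compile(r".*[.](?!bat$|exe$).*$")

# setup.exe is a hypothetical name added for illustration.
filenames = ["foo.bar", "goo.bat", "hoo.io", "setup.exe"]

# match() returns None for names whose extension is bat or exe.
kept = [name for name in filenames if not_bat_or_exe.match(name)]
print(kept)  # ['foo.bar', 'hoo.io']
```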
(?<=RE2)RE1
Positive lookbehind: matches RE1 only if RE2 matches immediately before it.
(?<!RE2)RE1
Negative lookbehind: matches RE1 only if RE2 does not match immediately before it.
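A short sketch of both lookbehind forms on a made-up string. One known constraint of Python's `re`: the lookbehind pattern must have a fixed width, which `\$` satisfies. The `\b` in the second pattern is an added safeguard so that the tail of 100 is not picked up.

```python
import re

text = "price: $100, qty: 200"  # made-up sample

# Digits directly preceded by a dollar sign; the $ itself is not part of the match.
after_dollar = re.findall(r"(?<=\$)\d+", text)
print(after_dollar)  # ['100']

# Whole numbers that are not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)
print(not_after_dollar)  # ['200']
```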
sub() and subn()
sub() replaces every match w/ the given string; count limits the number of replacements.
p = re.compile('(blue|white|red)')
p.sub('colour', 'blue socks and red shoes') # 'colour socks and colour shoes'
p.sub('colour', 'blue socks and red shoes', count=1) # 'colour socks and red shoes'
subn() returns a tuple: (replaced result, replacement count)
p = re.compile('(blue|white|red)')
p.subn('colour', 'blue socks and red shoes') # ('colour socks and colour shoes', 2)
sub method: referencing groups
\g<groupName> in the replacement string refers to a named group in the RE.
ex) swap the name and the phone number in the target
p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub(r"\g<phone> \g<name>", "park 010-1234-1234")) # 010-1234-1234 park
Same result using index instead of name
p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub(r"\g<2> \g<1>", "park 010-1234-1234")) # 010-1234-1234 park
sub method: passing a function
>>> def hexrepl(match):
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
The hexrepl function converts the matched number into its hexadecimal form.
If a function is passed to the sub method, it is called w/ each match object as its first argument, and the matched text is replaced by the function's return value.
s = '<html><head><title>Title</title>'
print(len(s)) # 32
print(re.match('<.*>', s).span()) # (0, 32)
print(re.match('<.*>', s).group()) # <html><head><title>Title</title>
* is so greedy that the RE <.*> does not return just <html>, but consumes the whole <html><head><title>Title</title>
re.search(r'i+', 'piigiii') # span=(1,3), match='ii'
? limits the greedy *: the match is limited to the minimum number of repeats.
print(re.match('<.*?>', s).group()) # <html>
? can be used in the same way as *?, +?, ??, {m,n}?
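The four non-greedy forms can be compared side by side. A sketch on the made-up input aaaa, where each lazy variant stops at the minimum repeat count its quantifier allows:

```python
import re

s = "aaaa"  # made-up input

greedy_star = re.match(r"a*", s).group()
lazy_star = re.match(r"a*?", s).group()
greedy_plus = re.match(r"a+", s).group()
lazy_plus = re.match(r"a+?", s).group()
greedy_opt = re.match(r"a?", s).group()
lazy_opt = re.match(r"a??", s).group()
greedy_range = re.match(r"a{2,3}", s).group()
lazy_range = re.match(r"a{2,3}?", s).group()

print(repr(greedy_star), repr(lazy_star))    # 'aaaa' ''
print(repr(greedy_plus), repr(lazy_plus))    # 'aaaa' 'a'
print(repr(greedy_opt), repr(lazy_opt))      # 'a' ''
print(repr(greedy_range), repr(lazy_range))  # 'aaa' 'aa'
```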
Reference: https://wikidocs.net/4309