Regular Expressions

A regular expression (shortened as regex [...]) is a sequence of characters that specifies a search pattern in text. [...] used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

  1. Import the regex module with import re.
  2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
  3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object.
  4. Call the Match object’s group() method to return a string of the actual matched text.
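The four steps above can be sketched as one small function. This is a minimal sketch: the names PHONE_RE and find_phone are illustrative, not part of the re module. It also guards against search() returning None, which happens when nothing matches.

```python
import re

# Step 2: compile the pattern once, using a raw string.
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone(text):
    """Return the first phone-like substring in text, or None (hypothetical helper)."""
    mo = PHONE_RE.search(text)   # Step 3: a Match object, or None if nothing matches
    if mo is None:
        return None
    return mo.group()            # Step 4: the text that actually matched

print(find_phone('Call 415-555-4242 today.'))  # 415-555-4242
print(find_phone('No number here.'))           # None
```

Checking for None before calling group() avoids an AttributeError on input that does not match.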

All the regex functions in Python are in the re module:

>>> import re

Regex symbols

Symbol                 Matches
?                      zero or one of the preceding group.
*                      zero or more of the preceding group.
+                      one or more of the preceding group.
{n}                    exactly n of the preceding group.
{n,}                   n or more of the preceding group.
{,m}                   0 to m of the preceding group.
{n,m}                  at least n and at most m of the preceding group.
{n,m}? or *? or +?     performs a non-greedy match of the preceding group.
^spam                  means the string must begin with spam.
spam$                  means the string must end with spam.
.                      any character, except newline characters.
\d, \w, and \s         a digit, word, or space character, respectively.
\D, \W, and \S         anything except a digit, word, or space, respectively.
[abc]                  any character between the brackets (such as a, b, or c).
[^abc]                 any character that isn’t between the brackets.
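A few of these symbols can be tried together on a sample string. This is a small sketch; the sample text and variable names are made up for illustration.

```python
import re

text = 'Order 66 shipped on 2024-01-15 to Unit 7'

# \d+ : one or more digits in a row
numbers = re.findall(r'\d+', text)

# \d{4}-\d{2}-\d{2} : exactly 4, 2, and 2 digits separated by hyphens
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)

# \s+ : one or more whitespace characters, used here as a separator
fields = re.split(r'\s+', text)

print(numbers)  # ['66', '2024', '01', '15', '7']
print(dates)    # ['2024-01-15']
print(fields)   # ['Order', '66', 'shipped', 'on', '2024-01-15', 'to', 'Unit', '7']
```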

Matching regex objects

>>> phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

>>> mo = phone_num_regex.search('My number is 415-555-4242.')

>>> print(f'Phone number found: {mo.group()}')
# Phone number found: 415-555-4242

Grouping with parentheses

>>> phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phone_num_regex.search('My number is 415-555-4242.')

>>> mo.group(1)
# '415'

>>> mo.group(2)
# '555-4242'

>>> mo.group(0)
# '415-555-4242'

>>> mo.group()
# '415-555-4242'

To retrieve all the groups at once use the groups() method:

>>> mo.groups()
# ('415', '555-4242')

>>> area_code, main_number = mo.groups()

>>> print(area_code)
# 415

>>> print(main_number)
# 555-4242
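Because parentheses create groups, matching a literal parenthesis in the text requires escaping it with a backslash. A short sketch, assuming a phone number written with the area code in parentheses:

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) still create groups.
phone_re = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phone_re.search('My phone number is (415) 555-4242.')

area_code = mo.group(1)
main_number = mo.group(2)

print(area_code)    # (415)
print(main_number)  # 555-4242
```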

Multiple groups with Pipe

You can use the | character anywhere you want to match one of many expressions.

>>> hero_regex = re.compile(r'Batman|Tina Fey')

>>> mo1 = hero_regex.search('Batman and Tina Fey.')
>>> mo1.group()
# 'Batman'

>>> mo2 = hero_regex.search('Tina Fey and Batman.')
>>> mo2.group()
# 'Tina Fey'

You can also use the pipe to match one of several patterns as part of your regex:

>>> bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = bat_regex.search('Batmobile lost a wheel')

>>> mo.group()
# 'Batmobile'

>>> mo.group(1)
# 'mobile'

Optional matching with the Question Mark

The ? character flags the group that precedes it as an optional part of the pattern.

>>> bat_regex = re.compile(r'Bat(wo)?man')

>>> mo1 = bat_regex.search('The Adventures of Batman')
>>> mo1.group()
# 'Batman'

>>> mo2 = bat_regex.search('The Adventures of Batwoman')
>>> mo2.group()
# 'Batwoman'

Matching zero or more with the Star

The * (star or asterisk) means “match zero or more”. The group that precedes the star can occur any number of times in the text.

>>> bat_regex = re.compile(r'Bat(wo)*man')
>>> mo1 = bat_regex.search('The Adventures of Batman')
>>> mo1.group()
# 'Batman'

>>> mo2 = bat_regex.search('The Adventures of Batwoman')
>>> mo2.group()
# 'Batwoman'

>>> mo3 = bat_regex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
# 'Batwowowowoman'

Matching one or more with the Plus

The + (or plus) means match one or more. The group preceding a plus must appear at least once:

>>> bat_regex = re.compile(r'Bat(wo)+man')

>>> mo1 = bat_regex.search('The Adventures of Batwoman')
>>> mo1.group()
# 'Batwoman'

>>> mo2 = bat_regex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
# 'Batwowowowoman'

>>> mo3 = bat_regex.search('The Adventures of Batman')
>>> mo3 is None
# True

Matching specific repetitions with Curly Brackets

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets:

>>> ha_regex = re.compile(r'(Ha){3}')

>>> mo1 = ha_regex.search('HaHaHa')
>>> mo1.group()
# 'HaHaHa'

>>> mo2 = ha_regex.search('Ha')
>>> mo2 is None
# True

Instead of one number, you can specify a range by writing a minimum and a maximum between the curly brackets. For example, the regex (Ha){3,5} will match ‘HaHaHa’, ‘HaHaHaHa’, and ‘HaHaHaHaHa’.

>>> ha_regex = re.compile(r'(Ha){2,3}')
>>> mo1 = ha_regex.search('HaHaHaHa')
>>> mo1.group()
# 'HaHaHa'
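Either bound of the range can be left out, as in the {n,} and {,m} forms. The sketch below uses illustrative variable names, and anchors the second pattern with ^ and $ so that the upper bound applies to the whole string.

```python
import re

# {3,} : three or more repetitions, with no upper limit
at_least_three = re.compile(r'(Ha){3,}')

# {,2} : zero to two repetitions; ^ and $ force the whole string to fit
at_most_two = re.compile(r'^(Ha){,2}$')

long_laugh = 'Ha' * 6
mo1 = at_least_three.search(long_laugh)
print(mo1.group() == long_laugh)                  # True (greedy: takes all six)
print(at_least_three.search('HaHa') is None)      # True (only two repetitions)

print(at_most_two.search('HaHa') is not None)     # True
print(at_most_two.search('HaHaHa') is None)       # True (three is over the limit)
```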

Greedy and non-greedy matching

Python’s regular expressions are greedy by default: in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

>>> greedy_ha_regex = re.compile(r'(Ha){3,5}')

>>> mo1 = greedy_ha_regex.search('HaHaHaHaHa')
>>> mo1.group()
# 'HaHaHaHaHa'

>>> non_greedy_ha_regex = re.compile(r'(Ha){3,5}?')
>>> mo2 = non_greedy_ha_regex.search('HaHaHaHaHa')
>>> mo2.group()
# 'HaHaHa'

The findall() method

The findall() method will return the strings of every match in the searched string.

>>> phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups

>>> phone_num_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
# ['415-555-9999', '212-555-0000']
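When the regex does contain groups, findall() returns a list of tuples instead, with one string per group. A minimal sketch using the same sample text:

```python
import re

# Same phone pattern, but split into three groups.
grouped_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

matches = grouped_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(matches)  # [('415', '555', '9999'), ('212', '555', '0000')]
```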

Making your own character classes

You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase.

>>> vowel_regex = re.compile(r'[aeiouAEIOU]')
>>> vowel_regex.findall('Robocop eats baby food. BABY FOOD.')
# ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.
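As a quick sketch of the range syntax, the pattern below collects runs of letters and digits. The sample text is made up. Note that the underscore, hyphen, and colon are not in the class, so they split the runs.

```python
import re

alnum_regex = re.compile(r'[a-zA-Z0-9]+')

tokens = alnum_regex.findall('user_42 logged-in at 10:35!')
print(tokens)  # ['user', '42', 'logged', 'in', 'at', '10', '35']
```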

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class that will match all the characters that are not in the character class:

>>> consonant_regex = re.compile(r'[^aeiouAEIOU]')
>>> consonant_regex.findall('Robocop eats baby food. BABY FOOD.')
# ['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.',
# ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

The Caret and Dollar sign characters

  • You can also use the caret symbol ^ at the start of a regex to indicate that a match must occur at the beginning of the searched text.

  • Likewise, you can put a dollar sign $ at the end of the regex to indicate the string must end with this regex pattern.

  • And you can use the ^ and $ together to indicate that the entire string must match the regex.

The r'^Hello' regular expression string matches strings that begin with ‘Hello’:

>>> begins_with_hello = re.compile(r'^Hello')
>>> begins_with_hello.search('Hello world!')
# <re.Match object; span=(0, 5), match='Hello'>

>>> begins_with_hello.search('He said hello.') is None
# True

The r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9. With both anchors, r'^\d+$' matches only strings that consist entirely of digits:

>>> whole_string_is_num = re.compile(r'^\d+$')

>>> whole_string_is_num.search('1234567890')
# <re.Match object; span=(0, 10), match='1234567890'>

>>> whole_string_is_num.search('12345xyz67890') is None
# True

>>> whole_string_is_num.search('12 34567890') is None
# True
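The end-only anchor can also be tried on its own. A minimal sketch, where ends_with_number is an illustrative name: the pattern r'\d$' only requires the last character to be a digit.

```python
import re

ends_with_number = re.compile(r'\d$')

mo = ends_with_number.search('Your number is 42')
print(mo.group())  # 2  (only the final digit is matched)

no_match = ends_with_number.search('Your number is forty two.')
print(no_match is None)  # True
```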

The Wildcard character

The . (or dot) character in a regular expression will match any character except for a newline:

>>> at_regex = re.compile(r'.at')

>>> at_regex.findall('The cat in the hat sat on the flat mat.')
# ['cat', 'hat', 'sat', 'lat', 'mat']

Matching everything with Dot-Star

>>> name_regex = re.compile(r'First Name: (.*) Last Name: (.*)')

>>> mo = name_regex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
# 'Al'

>>> mo.group(2)
# 'Sweigart'

The .* uses greedy mode: It will always try to match as much text as possible. To match any and all text in a non-greedy fashion, use the dot, star, and question mark (.*?). The question mark tells Python to match in a non-greedy way:

>>> non_greedy_regex = re.compile(r'<.*?>')
>>> mo = non_greedy_regex.search('<To serve man> for dinner.>')
>>> mo.group()
# '<To serve man>'

>>> greedy_regex = re.compile(r'<.*>')
>>> mo = greedy_regex.search('<To serve man> for dinner.>')
>>> mo.group()
# '<To serve man> for dinner.>'

Matching newlines with the Dot character

The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character:

>>> no_newline_regex = re.compile('.*')
>>> no_newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
# 'Serve the public trust.'

>>> newline_regex = re.compile('.*', re.DOTALL)
>>> newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
# 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

Case-Insensitive matching

To make your regex case-insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile():

>>> robocop = re.compile(r'robocop', re.I)

>>> robocop.search('Robocop is part man, part machine, all cop.').group()
# 'Robocop'

>>> robocop.search('ROBOCOP protects the innocent.').group()
# 'ROBOCOP'

>>> robocop.search('Al, why does your programming book talk about robocop so much?').group()
# 'robocop'

Substituting strings with the sub() method

The sub() method for Regex objects is passed two arguments:

  1. The first argument is a string to replace any matches.
  2. The second is the string to search, in which the matches are replaced.

The sub() method returns a string with the substitutions applied:

>>> names_regex = re.compile(r'Agent \w+')

>>> names_regex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
# 'CENSORED gave the secret documents to CENSORED.'
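The replacement string can also reuse part of the matched text: \1, \2, and so on stand for the text of group 1, group 2, and so on. A minimal sketch that keeps only the first letter of each agent's name:

```python
import re

# Group 1 captures the first letter of the name; \w* consumes the rest.
agent_names_regex = re.compile(r'Agent (\w)\w*')

censored = agent_names_regex.sub(
    r'\1****',
    'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.',
)
print(censored)  # A**** told C**** that E**** knew B**** was a double agent.
```

The replacement is written as a raw string so that \1 is passed to the regex engine unchanged.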

Managing complex Regexes

To tell the re.compile() function to ignore whitespace and comments inside the regular expression string, “verbose mode” can be enabled by passing the variable re.VERBOSE as the second argument to re.compile().

Now instead of a hard-to-read regular expression like this:

phone_regex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

you can spread the regular expression over multiple lines with comments like this:

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)
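To check that verbose mode changes only the layout and not the behavior, a compact pattern and its verbose version can be run on the same text. This is a sketch with a deliberately smaller pattern than the one above, and the names are illustrative.

```python
import re

compact = re.compile(r'(\d{3})[-.\s](\d{4})')

verbose = re.compile(r'''
    (\d{3})     # first 3 digits
    [-.\s]      # separator: hyphen, dot, or whitespace
    (\d{4})     # last 4 digits
''', re.VERBOSE)

text = 'Call 555-4242 or 555.1234'
compact_result = compact.findall(text)
verbose_result = verbose.findall(text)

print(verbose_result)                    # [('555', '4242'), ('555', '1234')]
print(compact_result == verbose_result)  # True
```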
