Regular expressions or "regexes" will enable us to examine patterns within our code. They allow us to search, match, and manipulate strings based on specific patterns, making them highly useful for tasks like validation, parsing, and text processing. https://docs.python.org/3/library/re.html
email = input("What's your email? ").strip()
if "@" in email:
    print("Valid")
else:
    print("Invalid")
This code appears to work, but it is actually broken. One could input @@ alone and it would be regarded as valid.
email = input("What's your email? ").strip()
if "@" in email and "." in email:
    print("Valid")
else:
    print("Invalid")
This code looks for both @ and . in order to validate the email address, but it is still not robust enough: an input of @. would be accepted.
email = input("What's your email? ").strip()
username, domain = email.split("@")
if username and "." in domain:
    print("Valid")
else:
    print("Invalid")
The split() method is used to split the string at the @ and assign the first part to the username variable and the second part to the domain variable. The condition if username checks that username is not empty, and "." in domain checks for a . in domain.
email = input("What's your email? ").strip()
username, domain = email.split("@")
if username and domain.endswith(".edu"):
    print("Valid")
else:
    print("Invalid")
The endswith() method checks whether domain ends with .edu. However, an input of [email protected] would still be considered valid.
We could keep refining this code by hand. However, Python's re library has built-in functions that can validate user input against patterns.
import re
email = input("What's your email? ").strip()
if re.search("@", email):
    print("Valid")
else:
    print("Invalid")
Notice that we are only checking for the presence of @ in the input. The search function follows the signature re.search(pattern, string, flags=0).
To enhance our program's functionality, we need to introduce some validation vocabulary. In regular expressions, there are symbols that allow us to identify patterns:
. any character except a new line
* 0 or more repetitions
+ 1 or more repetitions
? 0 or 1 repetition
{m} m repetitions
{m,n} m-n repetitions
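As a quick illustration of the symbols above, the following sketch checks a few small patterns against made-up sample strings. It uses re.fullmatch (which requires the whole string to match) rather than re.search, purely so that the effect of each quantifier is easy to see; the patterns and strings are illustrative assumptions.

```python
import re

# Each entry: (pattern, sample string, whether the whole string should match)
cases = [
    (r"a.c", "abc", True),           # . matches any single character
    (r"a.c", "ac", False),           # . must consume exactly one character
    (r"ab*c", "ac", True),           # * allows zero repetitions of b
    (r"ab+c", "ac", False),          # + requires at least one b
    (r"ab?c", "abbc", False),        # ? allows at most one b
    (r"ab{2}c", "abbc", True),       # {2} requires exactly two b's
    (r"ab{1,3}c", "abbbbc", False),  # {1,3} allows one to three b's, not four
]

for pattern, text, expected in cases:
    matched = re.fullmatch(pattern, text) is not None
    print(f"{pattern!r} against {text!r}: {matched}")
    assert matched == expected
```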
import re
email = input("What's your email? ").strip()
if re.search(".+@.+", email):
    print("Valid")
else:
    print("Invalid")
In ".+@.+", the .+ requires at least one character to the left and to the right of the @.
import re
email = input("What's your email? ").strip()
if re.search(".+@.+.edu", email):
    print("Valid")
else:
    print("Invalid")
Notice here that we added a check for .edu, but the result will not be the one we expected. In this context, . means any character and not a literal dot.
We can use the escape character \ so that the . is treated as a literal character rather than as a regular expression symbol:
if re.search(".+@.+\.edu", email):
    print("Valid")
Python might misinterpret \. as an escape sequence similar to \n. To solve this, we can use raw strings.
Raw strings are strings that don't interpret escape sequences. Placing an r in front of a string tells the Python interpreter to take each character of the string at face value (r"\n" is treated as the two characters \ and n instead of a newline).
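The difference between a normal string and a raw string can be checked directly. The sketch below is only an illustration; the variable names and the sample domain strings are assumptions made for the example.

```python
import re

normal = "\n"  # one character: a newline
raw = r"\n"    # two characters: a backslash followed by the letter n

print(len(normal))  # 1
print(len(raw))     # 2

# In a raw string the backslash reaches the regex engine untouched,
# so \. is read by re as "a literal dot".
print(bool(re.search(r"\.edu", "harvard.edu")))  # True
print(bool(re.search(r"\.edu", "harvardxedu")))  # False
```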
import re
email = input("What's your email? ").strip()
if re.search(r".+@.+\.edu", email):
    print("Valid")
else:
    print("Invalid")
We still have a problem. A user could input My email address is [email protected]. and it would be considered valid.
To address this, we need to include more special symbols:
^ matches the start of the string
$ matches the end of the string or just before the newline at the end of the string
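A small sketch of what the two anchors change, using an illustrative address (malan@harvard.edu is assumed here only as sample data):

```python
import re

sentence = "My email address is malan@harvard.edu."
address = "malan@harvard.edu"

unanchored = r".+@.+\.edu"
anchored = r"^.+@.+\.edu$"

# Without anchors, re.search finds the pattern anywhere inside the string.
print(bool(re.search(unanchored, sentence)))  # True

# With ^ and $, the string must end right after "edu",
# so the trailing period in the sentence breaks the match.
print(bool(re.search(anchored, sentence)))  # False
print(bool(re.search(anchored, address)))   # True

# $ also matches just before a single newline at the end of the string.
print(bool(re.search(anchored, address + "\n")))  # True
```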
import re
email = input("What's your email? ").strip()
if re.search(r"^.+@.+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
Now, this version would treat the input My email is [email protected]. as invalid, because the regular expression expects "u" to be the last character of the input.
Users could still type as many @ symbols as they wish: malan@@@harvard.edu would be considered valid.
We can add symbols to our regular expression to address this problem:
[] set of characters
[^] complementing the set
[^]
import re
email = input("What's your email? ").strip()
if re.search(r"^[^@]+@[^@]+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
Notice that [^@]+ means one or more characters other than @. This means that before and after the @ the regular expression only accepts characters that are not @.
[]
import re
email = input("What's your email? ").strip()
if re.search(r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
Notice that the set of characters [a-zA-Z0-9_] tells the validation that each character must be between a and z, between A and Z, between 0 and 9, or an _ symbol.
To simplify this process, common patterns have been built into regular expressions by other programmers:
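import re
email = input("What's your email? ").strip()
if re.search(r"^\w+@\w+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
Notice that \w is shorthand for [a-zA-Z0-9_]. (In Python, \w also matches letters and digits from other alphabets unless the re.ASCII flag is set.)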
Additional patterns:
\d decimal digit
\D not a decimal digit
\s whitespace characters
\S not a whitespace character
\w word character, as well as numbers and the underscore
\W not a word character
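The shorthand classes above can be explored with re.findall, a function not otherwise used in these notes that returns every non-overlapping match as a list. The sample strings are made up for illustration.

```python
import re

sample = "Room 42\tis open"

print(re.findall(r"\d", sample))   # ['4', '2']
print(re.findall(r"\s", sample))   # [' ', '\t', ' ']
print(re.findall(r"\w+", sample))  # ['Room', '42', 'is', 'open']
print(re.findall(r"\W", sample))   # [' ', '\t', ' ']

# For str patterns, \w is Unicode-aware by default, so it accepts
# accented letters that the explicit set [a-zA-Z0-9_] rejects.
print(bool(re.fullmatch(r"\w+", "café")))             # True
print(bool(re.fullmatch(r"[a-zA-Z0-9_]+", "café")))   # False
print(bool(re.fullmatch(r"\w+", "café", re.ASCII)))   # False
```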
- Include more Top-Level Domains
import re
email = input("What's your email? ").strip()
if re.search(r"^\w+@\w+\.(com|edu|gov|net|org)$", email):
    print("Valid")
else:
    print("Invalid")
Notice that in (com|edu|gov|net|org), the | means or, and (...) is used to group the alternatives together.
A|B either A or B
(...) a group
(?:...) non-capturing version
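To see how these three symbols interact, here is a minimal sketch using an illustrative address. It compares a capturing group with its non-capturing version and shows why the parentheses around the alternatives matter.

```python
import re

address = "malan@harvard.edu"  # sample data, assumed for illustration

capturing = re.search(r"^\w+@\w+\.(com|edu|gov|net|org)$", address)
non_capturing = re.search(r"^\w+@\w+\.(?:com|edu|gov|net|org)$", address)

# A capturing group remembers what it matched; a non-capturing one does not.
print(capturing.groups())      # ('edu',)
print(non_capturing.groups())  # ()

# Without parentheses, | splits the WHOLE pattern into two alternatives:
# "^\w+@\w+\.com" or "edu$", so the bare string "edu" matches.
print(bool(re.search(r"^\w+@\w+\.com|edu$", "edu")))  # True
```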
Recall that within the re.search function, there is a parameter for flags: re.search(pattern, string, flags=0).
Some built-in flag variables:
re.IGNORECASE
re.MULTILINE
re.DOTALL
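A rough sketch of what each of these three flags changes, using a made-up two-line string. Flags can be combined with the | operator.

```python
import re

text = "first line\nMALAN@HARVARD.EDU"  # sample data, assumed for illustration

# re.IGNORECASE: letters match regardless of case.
print(bool(re.search(r"harvard\.edu", text)))                 # False
print(bool(re.search(r"harvard\.edu", text, re.IGNORECASE)))  # True

# re.MULTILINE: ^ and $ also match at the start and end of each line.
print(bool(re.search(r"^malan", text, re.IGNORECASE)))                 # False
print(bool(re.search(r"^malan", text, re.IGNORECASE | re.MULTILINE)))  # True

# re.DOTALL: . also matches a newline.
print(bool(re.search(r"line.MALAN", text)))             # False
print(bool(re.search(r"line.MALAN", text, re.DOTALL)))  # True
```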
re.IGNORECASE
import re
email = input("What's your email? ").strip()
if re.search(r"^\w+@\w+\.edu$", email, re.IGNORECASE):
    print("Valid")
else:
    print("Invalid")
The input [email protected] would now be valid.
?
Notice that the email address [email protected] would be considered invalid because of the additional . in the domain.
import re
email = input("What's your email? ").strip()
if re.search(r"^\w+@(\w+\.)?\w+\.edu$", email, re.IGNORECASE):
    print("Valid")
else:
    print("Invalid")
In this version we added a new grouped expression, (\w+\.)?, which accepts an alphanumeric character or underscore (\w), one or more times (+), followed by a literal dot (\.); the ? quantifier then makes the entire group optional. (Remember that the symbol ? means 0 or 1 repetitions.)
Now the inputs [email protected] and [email protected] are considered valid.
The full regular expression used by most browsers to validate email addresses is far more complicated than the one we implemented. It looks like this:
^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
Thankfully, we can, and should, take advantage of libraries built by experienced programmers that simplify the process of validating an email address.
format.py
name = input("What's your name? ").strip()
print(f"hello, {name}")
This program expects users to input their names. The user could enter their name in whatever order they decide (for example, Malan, David). This could make it difficult to standardize how names are stored and used by our program.
name = input("What's your name? ").strip()
if "," in name:
    last, first = name.split(", ")
    name = f"{first} {last}"
print(f"hello, {name}")
last, first = name.split(", ") is run if there's a , in the name. The name is then standardized as first and last.
If the user enters Malan,David with no space, Python will raise a ValueError, because split(", ") finds no separator and returns a single item that cannot be unpacked into two variables.
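The failure can be reproduced without typing anything at a prompt. In this sketch, reorder is a hypothetical helper that wraps the same logic in a function so that both inputs can be tried.

```python
def reorder(name):
    """Return 'First Last' for input shaped like 'Last, First' (hypothetical helper)."""
    if "," in name:
        last, first = name.split(", ")
        return f"{first} {last}"
    return name


print(reorder("Malan, David"))  # David Malan

try:
    reorder("Malan,David")
except ValueError as error:
    # split(", ") returned a one-item list, so unpacking into two names failed.
    print("ValueError:", error)
```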
()
-.groups()
import re
name = input("What's your name? ").strip()
matches = re.search(r"^(.+), (.+)$", name)
if matches:
    last, first = matches.groups()
    name = f"{first} {last}"
print(f"hello, {name}")
The grouping symbol (...) captures the part of the input that matches the expression inside it. re.search returns a match object holding those captured groups, which we store in the variable matches.
last, first = matches.groups() then accesses those values (matches.groups()) and assigns them to the variables last and first.
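To make the return value of re.search more concrete, this sketch inspects the match object for one matching and one non-matching sample string (both strings are assumptions chosen for illustration).

```python
import re

pattern = r"^(.+), (.+)$"

matches = re.search(pattern, "Malan, David")
print(matches.groups())  # ('Malan', 'David')
print(matches.group(0))  # Malan, David   (group 0 is the entire match)

# When the pattern does not match, re.search returns None,
# which is why the result is tested with "if matches:" before use.
print(re.search(pattern, "David Malan"))  # None
```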
.group(n)
import re
name = input("What's your name? ").strip()
matches = re.search(r"^(.+), (.+)$", name)
if matches:
    name = matches.group(2) + " " + matches.group(1)
print(f"hello, {name}")
Notice that in this version we request specific groups using the singular .group() and concatenate them with a single space " " in the order we want. group(1) is the first group to appear, to the left of the comma.
We are still expecting a space after the comma, as per (.+), (.+). An input of Malan,David will not return the expected result.
*
import re
name = input("What's your name? ").strip()
matches = re.search(r"^(.+), *(.+)$", name)
if matches:
    name = matches.group(2) + " " + matches.group(1)
print(f"hello, {name}")
Notice the addition of the * (0 or more repetitions) after the space in our regular expression. Now the code will accept no space (Malan,David) or several spaces after the comma.
- walrus operator
:=
import re
name = input("What's your name? ").strip()
if matches := re.search(r"^(.+), *(.+)$", name):
    name = matches.group(2) + " " + matches.group(1)
print(f"hello, {name}")
Notice the use of the walrus operator :=. This operator allows us to combine two lines of code by assigning a value from right to left and asking a Boolean question about it at the same time.
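The same idea can be wrapped in a function so that several inputs can be tried at once. In this sketch, reorder is a hypothetical helper name; the walrus operator itself requires Python 3.8 or later.

```python
import re


def reorder(name):
    """Return 'First Last' for 'Last, First' input (hypothetical helper)."""
    # Assign the match object and test it in the same expression.
    if matches := re.search(r"^(.+), *(.+)$", name):
        return matches.group(2) + " " + matches.group(1)
    return name


print(reorder("Malan,David"))     # David Malan
print(reorder("Malan,   David"))  # David Malan
print(reorder("David Malan"))     # David Malan
```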
Let's build a program that extracts some specific information from user input.
twitter.py
url = input("URL: ").strip()
print(url)
Notice that if we type the URL https://twitter.com/davidjmalan, the program prints exactly what the user typed.
replace()
url = input("URL: ").strip()
username = url.replace("https://twitter.com/", "")
print(f"Username: {username}")
Notice the use of the replace() method, which allows us to find part of the URL and replace it with nothing ("").
This could still be problematic if the user enters only twitter.com instead of the full expected format.
If the user enters My URL is https://twitter.com/davidjmalan, the extracted username will be My URL is davidjmalan.
removeprefix()
url = input("URL: ").strip()
username = url.removeprefix("https://twitter.com/")
print(f"Username: {username}")
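The removeprefix() method removes the given text only when the string starts with it, so it no longer strips the URL out of the middle of a sentence. It still does not resolve our problem, though, because input that does not start with exactly that prefix is returned unchanged.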
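A short sketch of how removeprefix() behaves on a few sample inputs (the inputs are illustrative; str.removeprefix requires Python 3.9 or later):

```python
prefix = "https://twitter.com/"

# The prefix is removed only when the string starts with it.
print("https://twitter.com/davidjmalan".removeprefix(prefix))
# davidjmalan

# Not at the start of the string: returned unchanged.
print("My URL is https://twitter.com/davidjmalan".removeprefix(prefix))
# My URL is https://twitter.com/davidjmalan

# A different protocol does not match the prefix either: returned unchanged.
print("http://twitter.com/davidjmalan".removeprefix(prefix))
# http://twitter.com/davidjmalan
```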
re.sub()
Within the re library, there is a function called sub that allows us to substitute a pattern with something else. Its signature is re.sub(pattern, repl, string, count=0, flags=0).
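The parameters in that signature can be tried out on small strings. This is only an illustrative sketch; the sample strings are made up, and count and flags are passed by keyword to avoid confusing one for the other.

```python
import re

url = "https://twitter.com/davidjmalan"

# Replace the matched prefix with the empty string.
print(re.sub(r"^https://twitter\.com/", "", url))  # davidjmalan

# count limits how many matches are replaced (0, the default, means all).
print(re.sub(r"t", "T", "twitter"))           # TwiTTer
print(re.sub(r"t", "T", "twitter", count=1))  # Twitter

# flags is passed by keyword, since the fourth positional argument is count.
print(re.sub(r"TWITTER", "x", "twitter.com", flags=re.IGNORECASE))  # x.com
```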
import re
url = input("URL: ").strip()
username = re.sub(r"https://twitter.com/", "", url)
print(f"Username: {username}")
This version of the code uses regular expressions to perform the substitution but still does not cover all input variations. Also, the unescaped . in the pattern matches any character, not only a literal dot.
import re
url = input("URL: ").strip()
username = re.sub(r"^(https?://)?(www\.)?twitter\.com/", "", url)
print(f"Username: {username}")
- The ^ (caret) was added to anchor the match at the beginning of the string.
- A \ was added before each dot so that it is matched literally.
- A ? was added after https, making the s optional so that http is tolerated.
- (www\.)? was added to accept an optional www. prefix.
- (https?://)? groups the protocol and makes not only the s but the whole protocol optional.
Still, we are blindly expecting that the user entered a URL that matches the pattern and contains a username.
re.search()
import re
url = input("URL: ").strip()
matches = re.search(r"^https?://(www\.)?twitter\.com/(.+)$", url, re.IGNORECASE)
if matches:
    print("Username:", matches.group(2))
Notice how we now capture the end of the URL using the (.+)$ regular expression and only return it (matches.group(2)) if the user's input matches our regular expression.
Notice the importance of respecting the order of the groups. Using matches.group(1) will return www. if it is included in the input, or None if it is not.
(?:...)
import re
url = input("URL: ").strip()
matches = re.search(r"^https?://(?:www\.)?twitter\.com/(.+)$", url, re.IGNORECASE)
if matches:
    print("Username:", matches.group(1))
Notice here that adding ?: at the beginning of the group, (?:www\.), tells the regular expression engine to group the expression without capturing it, so that we can access the first (and only) captured group (.+) with matches.group(1).
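The effect on group numbering can be seen by running both patterns against the same sample URLs. This is a small comparison sketch; the URLs are illustrative.

```python
import re

capturing = r"^https?://(www\.)?twitter\.com/(.+)$"
non_capturing = r"^https?://(?:www\.)?twitter\.com/(.+)$"

urls = ["https://www.twitter.com/davidjmalan", "https://twitter.com/davidjmalan"]

for url in urls:
    # With the capturing version, the username is always group 2,
    # and group 1 is "www." or None.
    print(re.search(capturing, url).groups())
    # With the non-capturing version, the username is the only group.
    print(re.search(non_capturing, url).groups())

# Printed, in order:
# ('www.', 'davidjmalan')
# ('davidjmalan',)
# (None, 'davidjmalan')
# ('davidjmalan',)
```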
- walrus operator
:=
import re
url = input("URL: ").strip()
if matches := re.search(r"^https?://(?:www\.)?twitter\.com/(.+)$", url, re.IGNORECASE):
    print("Username:", matches.group(1))
[]
import re
url = input("URL: ").strip()
if matches := re.search(r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)$", url, re.IGNORECASE):
    print("Username:", matches.group(1))
Notice that in this version we used ([a-z0-9_]+) to accept only Twitter's valid username format.
Program that validates international phone numbers.
import re
# Dictionary of country codes and corresponding countries
locations = {"+1": "United States and Canada", "+62": "Indonesia", "+505": "Nicaragua"}
def main():
    # Define the pattern for the expected phone number format
    # using a raw string regular expression
    pattern = r"\+\d{1,3} \d{3}-\d{3}-\d{4}"
    # Prompt user for a phone number input and store in variable "number"
    number = input("Number: ")
    # Use the re.search() function to check if input matches the pattern
    # and store result (match object) in the match variable
    match = re.search(pattern, number)
    if match:
        # If the input matches the pattern, print "Valid"
        print("Valid")
    else:
        # If the input does not match the pattern, print "Invalid"
        print("Invalid")
if __name__ == "__main__":
    main()
In the pattern r"\+\d{1,3} \d{3}-\d{3}-\d{4}":
- \+ means to treat the plus sign + literally.
- \d means a decimal digit.
- {1,3} means 1 to 3 repetitions.
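One thing worth checking is that this pattern has no ^ or $ anchors, so re.search accepts extra text around the number. The sketch below demonstrates this with made-up numbers and shows re.fullmatch as one possible way to require that the whole input match; this tightening is a suggestion, not part of the program above.

```python
import re

pattern = r"\+\d{1,3} \d{3}-\d{3}-\d{4}"

# A well-formed number matches; a four-digit country code does not.
print(bool(re.search(pattern, "+1 555-123-4567")))     # True
print(bool(re.search(pattern, "+1234 555-123-4567")))  # False

# Without anchors, surrounding text and extra digits are tolerated by re.search.
print(bool(re.search(pattern, "call +1 555-123-4567 now")))  # True
print(bool(re.search(pattern, "+1 555-123-45678")))          # True

# re.fullmatch only succeeds when the entire string matches the pattern.
print(bool(re.fullmatch(pattern, "call +1 555-123-4567 now")))  # False
print(bool(re.fullmatch(pattern, "+1 555-123-45678")))          # False
print(bool(re.fullmatch(pattern, "+1 555-123-4567")))           # True
```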
- Capture groups
()
import re
locations = {"+1": "United States and Canada", "+62": "Indonesia", "+505": "Nicaragua"}
def main():
    # Add capture group for the country codes "(\+\d{1,3})"
    pattern = r"(\+\d{1,3}) \d{3}-\d{3}-\d{4}"
    number = input("Number: ")
    match = re.search(pattern, number)
    if match:
        # Access the first capture group and store in `country_code`
        country_code = match.group(1)
        # Print country code
        print(country_code)
    else:
        print("Invalid")
if __name__ == "__main__":
    main()
- Using capture group as key to access value in dictionary
import re
locations = {"+1": "United States and Canada", "+62": "Indonesia", "+505": "Nicaragua"}
def main():
    # Add capture group for the country codes "(\+\d{1,3})"
    pattern = r"(\+\d{1,3}) \d{3}-\d{3}-\d{4}"
    number = input("Number: ")
    match = re.search(pattern, number)
    if match:
        # Access the first capture group and store in `country_code`
        country_code = match.group(1)
        # Retrieve and print country corresponding to country code
        print(locations[country_code])
    else:
        print("Invalid")
if __name__ == "__main__":
    main()
Notice the use of country_code as a key to access the corresponding value (the country).
- Naming capture group for better access
import re
locations = {"+1": "United States and Canada", "+62": "Indonesia", "+505": "Nicaragua"}
def main():
    # Naming capture group pattern for easy access
    pattern = r"(?P<country_code>\+\d{1,3}) \d{3}-\d{3}-\d{4}"
    number = input("Number: ")
    match = re.search(pattern, number)
    if match:
        # Access the capture group with its name "country_code"
        country_code = match.group("country_code")
        # Retrieve and print country corresponding to country code
        print(locations[country_code])
    else:
        print("Invalid")
if __name__ == "__main__":
    main()
Notice the syntax for naming a capture group within a regular expression: (?P<name>...)
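Named groups can be read in a few different ways. The sketch below is illustrative only: the second group name, subscriber, and the sample numbers are assumptions added for the example, and dict.get is shown as one way to avoid a KeyError when a country code is missing from the dictionary.

```python
import re

locations = {"+1": "United States and Canada", "+62": "Indonesia", "+505": "Nicaragua"}

# "subscriber" is a hypothetical second group name, added only for illustration.
pattern = r"(?P<country_code>\+\d{1,3}) (?P<subscriber>\d{3}-\d{3}-\d{4})"

match = re.search(pattern, "+62 555-123-4567")
print(match.group("country_code"))  # +62
print(match["subscriber"])          # 555-123-4567
print(match.groupdict())            # {'country_code': '+62', 'subscriber': '555-123-4567'}
print(match.group(1))               # +62  (named groups are still numbered)

# A code that is not in the dictionary: get() returns a fallback instead of raising.
other = re.search(pattern, "+44 555-123-4567")
print(locations.get(other.group("country_code"), "Unknown country code"))
# Unknown country code
```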
A Hexadecimal color code:
- Begins with #
- Is composed of 6 characters
- 0-9 and A-F (or a-f)
Examples:
Black: #000000
White: #FFFFFF
Red: #FF0000
Blue: #0000FF
Green: #00FF00
Program that validates Hexadecimal color codes
import re
def main():
    # Prompt user for color code
    code = input("Hexadecimal color code: ")
    # Define the pattern for the expected hexadecimal color code
    # using a raw string regular expression
    pattern = r"^#[a-fA-F0-9]{6}$"
    # Use the re.search() function to check if input matches the pattern
    # and store result (match object) in the match variable
    match = re.search(pattern, code)
    if match:
        # If the input matches the pattern, print Valid and what it matched with
        print(f"Valid. Matched with {match.group()}")
    else:
        # If the input does not match the pattern, print "Invalid"
        print("Invalid")
if __name__ == "__main__":
    main()
In the pattern r"^#[a-fA-F0-9]{6}$":
- ^# means that # must be the first character of the input.
- [a-fA-F0-9] determines the set of characters (ranges) accepted.
- {6} means 6 repetitions.
- $ means that the input must end right after those 6 characters from the set.
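To check the pattern against several inputs at once, the validation can be moved into a small function. In this sketch, is_hex_color and HEX_COLOR are hypothetical names, and re.compile is used only to build the pattern once; the test values are made up.

```python
import re

# Compile the pattern once so it can be reused (names here are hypothetical).
HEX_COLOR = re.compile(r"^#[a-fA-F0-9]{6}$")


def is_hex_color(code):
    """Return True if code looks like a 6-digit hexadecimal color code."""
    return HEX_COLOR.search(code) is not None


for code in ["#FF0000", "#00ff00", "#FFF", "#GGGGGG", "FF0000", "#FF00000"]:
    print(code, is_hex_color(code))

# Printed, in order:
# #FF0000 True
# #00ff00 True
# #FFF False        (too short)
# #GGGGGG False     (G is outside the accepted set)
# FF0000 False      (missing #)
# #FF00000 False    (seven characters after #)
```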