Useful modules in the Standard Library

Python comes with a built-in selection of modules which provide commonly used functionality. We have encountered some of these modules in previous chapters – for example, itertools, logging, pdb and unittest. We will look at a few more examples in this chapter. This is only a brief overview of a small subset of the available modules – you can see the full list, and find out more details about each one, by reading the Python Standard Library documentation.

Date and time: datetime

The datetime module provides us with objects which we can use to store information about dates and times:

  • datetime.date is used to create dates which are not associated with a time.

  • datetime.time is used for times which are independent of a date.

  • datetime.datetime is used for objects which have both a date and a time.

  • datetime.timedelta objects store differences between dates or datetimes – if we subtract one datetime from another, the result will be a timedelta.

  • datetime.timezone objects represent time zone adjustments as offsets from UTC. This class is a subclass of datetime.tzinfo, which is not meant to be used directly.

We can query these objects for a particular component (like the year, month, hour or minute), perform arithmetic on them, and extract printable string versions from them if we need to display them. Here are a few examples:

import datetime

# this class method creates a datetime object with the current date and time
now = datetime.datetime.today()

print(now.year)
print(now.hour)
print(now.minute)

print(now.weekday())

print(now.strftime("%a, %d %B %Y"))

long_ago = datetime.datetime(1999, 3, 14, 12, 30, 58)

print(long_ago) # remember that this calls str automatically
print(long_ago < now)

difference = now - long_ago
print(type(difference))
print(difference) # remember that this calls str automatically
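
The timezone and timedelta classes listed above are not used in the examples so far. The following is a minimal sketch of how they might fit together; the two-hour offset is a hypothetical value chosen only for illustration:

```python
import datetime

# A fixed, timezone-aware moment: 14 March 1999, 12:30:58 UTC
utc_moment = datetime.datetime(1999, 3, 14, 12, 30, 58, tzinfo=datetime.timezone.utc)

# A hypothetical fixed offset of two hours ahead of UTC
two_hours_ahead = datetime.timezone(datetime.timedelta(hours=2))
local_moment = utc_moment.astimezone(two_hours_ahead)

print(local_moment.hour)         # 14
print(local_moment.isoformat())  # 1999-03-14T14:30:58+02:00

# Both objects describe the same instant, so they compare equal
print(local_moment == utc_moment)  # True

# Adding a timedelta to a datetime gives a new datetime
later = utc_moment + datetime.timedelta(days=1, hours=12)
print(later)  # 1999-03-16 00:30:58+00:00
```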

Exercise 1

  1. Print ten dates, each one a week after the previous one, starting from today, in the form YYYY-MM-DD.

Mathematical functions: math

The math module is a collection of mathematical functions. They can be used on floats or integers, but are mostly intended to be used on floats, and usually return floats. Here are a few examples:

import math

# These are constant attributes, not functions
math.pi
math.e

# round a float up or down
math.ceil(3.3)
math.floor(3.3)

# natural logarithm
math.log(5)
# logarithm with base 10
math.log(5, 10)
math.log10(5) # this function is slightly more accurate

# square root
math.sqrt(10)

# trigonometric functions
math.sin(math.pi/2)
math.cos(0)

# convert between radians and degrees
math.degrees(math.pi/2)
math.radians(90)

If you need mathematical functions to use on complex numbers, you should use the cmath module instead.
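
As a rough illustration of the difference, the sketch below compares the square root functions of the two modules, and then checks Euler's identity with cmath. The tolerance of 1e-12 is an arbitrary choice made for this example:

```python
import cmath
import math

# math.sqrt raises a ValueError for negative input
try:
    math.sqrt(-1)
except ValueError as error:
    print("math.sqrt failed:", error)

# cmath.sqrt returns a complex number instead
root = cmath.sqrt(-1)
print(root)  # 1j

# Euler's identity: e^(i*pi) + 1 is zero, up to floating-point rounding
euler = cmath.exp(1j * cmath.pi) + 1
print(abs(euler) < 1e-12)  # True
```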

Exercise 2

  1. Write a class which represents a sphere of a given radius. Write a method which calculates the sphere’s volume, and one which calculates its surface area.

Pseudo-random numbers: random

We call a sequence of numbers pseudo-random when it appears in some sense to be random, but actually isn’t. Pseudo-random number sequences are generated by some kind of predictable algorithm, but they possess enough of the properties of truly random sequences that they can be used in many applications that call for random numbers.

It is difficult for a computer to generate numbers which are genuinely random. It is possible to gather truly random input using hardware, from sources such as the user’s keystrokes or tiny fluctuations in voltage measurements, and use that input to generate random numbers, but this process is more complicated and expensive than pseudo-random number generation, which can be done purely in software.

Because pseudo-random sequences aren’t actually random, it is also possible to reproduce the exact same sequence twice. That isn’t something we would want to do by accident, but it is a useful thing to be able to do deliberately while debugging software, or in an automated test.

In Python we can use the random module to generate pseudo-random numbers, and do a few more things which depend on randomness. The core function of the module generates a random float between 0 and 1, and most of the other functions are derived from it. Here are a few examples:

import random

# a random float from 0 to 1 (excluding 1)
random.random()

pets = ["cat", "dog", "fish"]
# a random element from a sequence
random.choice(pets)
# shuffle a list (in place)
random.shuffle(pets)

# a random integer from 1 to 10 (inclusive)
random.randint(1, 10)

Before we start generating values, we can seed the random module. We can think of this as picking a place in the pseudo-random sequence where we want to start. We normally want to start in a different place every time – by default, the module is seeded with a value taken from the operating system’s source of randomness if one is available, or from the system clock otherwise. If we want to reproduce the same random sequence multiple times – for example, inside a unit test – we need to pass the same integer or string as a parameter to seed each time:

# set a predictable seed
random.seed(3)
random.random()
random.random()
random.random()

# now try it again
random.seed(3)
random.random()
random.random()
random.random()

# and now try a different seed
random.seed("something completely different")
random.random()
random.random()
random.random()
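
In a test it can be convenient to check reproducibility directly rather than by reading printed floats. The sketch below is one possible approach: it uses a separate random.Random generator object, so that the module-level generator is left alone. The roll_dice function is a hypothetical helper written for this example:

```python
import random

def roll_dice(seed, count=5):
    # a separate generator object; the module-level generator is untouched
    generator = random.Random(seed)
    return [generator.randint(1, 6) for _ in range(count)]

first = roll_dice(42)
second = roll_dice(42)

# the same seed always produces the same sequence
print(first == second)  # True
print(first)
```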

Exercise 3

  1. Write a program which randomly picks an integer from 1 to 100. Your program should prompt the user for guesses – if the user guesses incorrectly, it should print whether the guess is too high or too low. If the user guesses correctly, the program should print how many guesses the user took to guess the right answer. You can assume that the user will enter valid input.

Matching string patterns: re

The re module allows us to write regular expressions. Regular expressions are a mini-language for matching strings, and can be used to find and possibly replace text. If you learn how to use regular expressions in Python, you will find that they are quite similar to use in other languages.

The full range of capabilities of regular expressions is quite extensive, and they are often criticised for their potential complexity, but with the knowledge of only a few basic concepts we can perform some very powerful string manipulation easily.

Note

Regular expressions are good for use on plain text, but a bad fit for parsing more structured text formats like XML – you should always use a more specialised parsing library for those.

The Python documentation for the re module not only explains how to use the module, but also contains a reference for the complete regular expression syntax which Python supports.

A regular expression primer

A regular expression is a string which describes a pattern. This pattern is compared to other strings, which may or may not match it. A regular expression can contain normal characters (which are treated literally as specific letters, numbers or other symbols) as well as special symbols which have different meanings within the expression.

Because many special symbols use the backslash (\) character, we often use raw strings to represent regular expressions in Python. This eliminates the need to use extra backslashes to escape backslashes, which would make complicated regular expressions much more difficult to read. If a regular expression doesn’t contain any backslashes, it doesn’t matter whether we use a raw string or a normal string.
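
To make the difference concrete, here is a small sketch which writes the same four-character pattern once as a normal string and once as a raw string:

```python
# In a normal string, a literal backslash must be written as two backslashes
normal = "c\\*t"
# In a raw string, backslashes are left exactly as typed
raw = r"c\*t"

print(normal == raw)  # True
print(len(raw))       # 4
print(raw)            # c\*t
```

Both spellings produce the same string, so the choice only affects how readable the source code is.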

Here are some very simple examples:

# this regular expression contains no special symbols
# it won't match anything except 'cat'
"cat"

# a . stands for any single character (except the newline, by default)
# this will match 'cat', 'cbt', 'c3t', 'c!t' ...
"c.t"

# a * repeats the previous character 0 or more times
# it can be used after a normal character, or a special symbol like .
# this will match 'ct', 'cat', 'caat', 'caaaaaaaaat' ...
"ca*t"
# this will match 'sc', 'sac', 'sic', 'supercalifragilistic' ...
"s.*c"

# + is like *, but the character must occur at least once
# there must be at least one 'a'
"ca+t"

# more generally, we can use curly brackets {} to specify any number of repeats
# or a minimum and maximum
# this will match any five-character string which starts with 'c' and ends with 't'
"c.{3}t"
# this will match any five-, six-, or seven-character string ...
"c.{3,5}t"

# One of the uses for ? is matching the previous character zero or one times
# this will match 'http' or 'https'
"https?"

# square brackets [] define a set of allowed values for a character
# they can contain normal characters, or ranges
# if ^ is the first character in the brackets, it *negates* the contents
# the character between 'c' and 't' must be a vowel
"c[aeiou]t"
# this matches any character that *isn't* a vowel, three times
"[^aeiou]{3}"
# This matches an uppercase UCT student number
"[B-DF-HJ-NP-TV-Z]{3}[A-Z]{3}[0-9]{3}"

# we use \ to escape any special regular expression character
# this would match 'c*t'
r"c\*t"
# note that we have used a raw string, so that we can write a literal backslash

# there are also some shorthand symbols for certain allowed subsets of characters:
# \d matches any digit
# \s matches any whitespace character, like space, tab or newline
# \w matches alphanumeric characters -- letters, digits or the underscore
# \D, \S and \W are the opposites of \d, \s and \w

# we can use round brackets () to *capture* portions of the pattern
# this is useful if we want to search and replace
# we can retrieve the contents of the capture in the replace step
# this will capture whatever would be matched by .*
"c(.*)t"

# ^ and $ denote the beginning or end of a string
# this will match a string which starts with 'c' and ends in 't'
"^c.*t$"

# | means "or" -- it lets us choose between multiple options.
"cat|dog"

Using the re module

Now that we have seen how to construct regular expression strings, we can start using them. The re module provides us with several functions which allow us to use regular expressions in different ways:

  • search searches for the regular expression inside a string – the regular expression will match if any substring of the string matches.

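  • match matches a regular expression against the beginning of the string – the regular expression will only match if the match starts at the first character, but it does not have to extend to the end of the string. re.match('something', some_string) is equivalent to re.search('^something', some_string) when no flags are used. To require that the whole string matches, we can use fullmatch, or end the pattern with $.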

  • sub searches for the regular expression and replaces it with the provided replacement expression.

  • findall searches for all matches of the regular expression within the string.

  • split splits a string using any regular expression as a delimiter.

  • compile allows us to convert our regular expression string to a pre-compiled regular expression object, which has methods analogous to the functions of the re module. Using this object is slightly more efficient if we use the same regular expression many times.
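
The difference between looking anywhere in a string, looking only at its beginning, and requiring the whole string to match can be sketched as follows. The sketch also uses re.fullmatch, a function of the re module which is not otherwise covered in this chapter:

```python
import re

text = "I have a cravat"

found_anywhere = re.search("c.*t", text)   # looks anywhere in the string
found_at_start = re.match("c.*t", text)    # looks only at the beginning
starts_with_i = re.match("I", text)        # succeeds: the string starts with 'I'
whole_string = re.fullmatch("I", text)     # fails: 'I' is not the whole string

print(found_anywhere.group(0))  # cravat
print(found_at_start)           # None
print(starts_with_i.group(0))   # I
print(whole_string)             # None
```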

As you can see, this module provides more powerful versions of some simple string operations: for example, we can also split a string or replace a substring using the built-in split and replace methods – but we can only use them with fixed delimiters or search patterns and replacements. With re.sub and re.split we can specify variable patterns instead of fixed strings.

All of the functions take a regular expression as the first parameter. match, search, findall and split also take the string to be searched as the second parameter – but in the sub function this is the third parameter, the second being the replacement string. All the functions also take a keyword parameter which specifies optional flags, which we will discuss shortly.

match and search both return match objects which store information such as the contents of captured groups, or None if there is no match. sub returns a modified copy of the original string. findall and split return a list of strings (if the pattern contains more than one capturing group, findall returns a list of tuples instead). compile returns a compiled regular expression object.

The methods of a regular expression object are very similar to the functions of the module, but the first parameter (the regular expression string) of each method is dropped – because it has already been compiled into the object.
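
As a small sketch of this correspondence, the calls below use the same pattern first through a module function and then through a compiled object. The pattern and the sample strings are made up for this example:

```python
import re

sample = "cat cot cxt cut"

# module-level function: the pattern is the first parameter
from_module = re.findall("c[aeiou]t", sample)

# compiled object: the same call, without the pattern parameter
vowel_word = re.compile("c[aeiou]t")
from_object = vowel_word.findall(sample)

print(from_module)  # ['cat', 'cot', 'cut']
print(from_object)  # ['cat', 'cot', 'cut']

replaced = vowel_word.sub("dog", sample)
print(replaced)     # dog dog cxt dog

pieces = vowel_word.split("a cat and a cut above")
print(pieces)       # ['a ', ' and a ', ' above']
```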

Here are some usage examples:

import re

# match and search are quite similar
print(re.match("c.*t", "cravat")) # this will match
print(re.match("c.*t", "I have a cravat")) # this won't
print(re.search("c.*t", "I have a cravat")) # this will

# We can use a static string as a replacement...
print(re.sub("lamb", "squirrel", "Mary had a little lamb."))
# Or we can capture groups, and substitute their contents back in.
print(re.sub("(.*) (BITES) (.*)", r"\3 \2 \1", "DOG BITES MAN"))
# count is a keyword parameter which we can use to limit replacements
print(re.sub("a", "b", "aaaaaaaaaa"))
print(re.sub("a", "b", "aaaaaaaaaa", count=1))

# Here's a closer look at a match object.
my_match = re.match("(.*) (BITES) (.*)", "DOG BITES MAN")
print(my_match.groups())
print(my_match.group(1))

# We can name groups.
my_match = re.match("(?P<subject>.*) (?P<verb>BITES) (?P<object>.*)", "DOG BITES MAN")
print(my_match.group("subject"))
print(my_match.groupdict())
# We can still access named groups by their positions.
print(my_match.group(1))

# Sometimes we want to find all the matches in a string.
print(re.findall("[^ ]+@[^ ]+", "Bob <bob@example.com>, Jane <jane.doe@example.com>"))

# Sometimes we want to split a string.
print(re.split(", *", "one,two,  three, four"))

# We can compile a regular expression to an object
my_regex = re.compile("(.*) (BITES) (.*)")
# now we can use it in a very similar way to the module
print(my_regex.sub(r"\3 \2 \1", "DOG BITES MAN"))

Greed

Regular expressions are greedy by default – this means that if a part of a regular expression can match a variable number of characters, it will always try to match as many characters as possible. That means that we sometimes need to take special care to make sure that a regular expression doesn’t match too much. For example:

# this is going to match everything between the first and last '"'
# but that's not what we want!
print(re.findall('".*"', '"one" "two" "three" "four"'))

# This is a common trick
print(re.findall('"[^"]*"', '"one" "two" "three" "four"'))

# We can also use ? after * or other expressions to make them *not greedy*
print(re.findall('".*?"', '"one" "two" "three" "four"'))

Functions as replacements

We can also use re.sub to apply a function to a match instead of a string replacement. The function must take a match object as a parameter, and return a string. We can use this functionality to perform modifications which may be difficult or impossible to express as a replacement string:

def swap(m):
    subject = m.group("object").title()
    verb = m.group("verb")
    object = m.group("subject").lower()
    return "%s %s %s!" % (subject, verb, object)

print(re.sub("(?P<subject>.*) (?P<verb>.*) (?P<object>.*)!", swap, "Dog bites man!"))

Flags

Regular expressions have historically tended to be applied to text line by line – newlines have usually required special handling. In Python, the text is treated as a single unit by default, but we can change this and a few other options using flags. These are the most commonly used:

  • re.IGNORECASE – make the regular expression case-insensitive. It is case-sensitive by default.

  • re.MULTILINE – make ^ and $ match the beginning and end of each line (excluding the newline at the end), as well as the beginning and end of the whole string (which is the default).

  • re.DOTALL – make . match any character (by default it does not match newlines).

Here are a few examples:

print(re.match("cat", "Cat")) # this won't match
print(re.match("cat", "Cat", re.IGNORECASE)) # this will

text = """numbers = 'one,
two,
three'
numbers = 'four,
five,
six'
not_numbers = 'cat,
dog'"""

print(re.findall("^numbers = '.*?'", text)) # this won't find anything
# we need both DOTALL and MULTILINE
print(re.findall("^numbers = '.*?'", text, re.DOTALL | re.MULTILINE))

Note

re functions only have a single keyword parameter for flags, but we can combine multiple flags into one using the | operator (bitwise or) – this is because the values of these constants are actually integer powers of two.
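
A short sketch of how combined flags behave, using a made-up three-line string. The & operator (bitwise and) is used here only to show that each flag can still be detected inside the combined value:

```python
import re

combined = re.IGNORECASE | re.MULTILINE

# each flag can still be detected in the combined value
has_ignorecase = bool(combined & re.IGNORECASE)
has_dotall = bool(combined & re.DOTALL)
print(has_ignorecase)  # True
print(has_dotall)      # False

text = "Cat\ncat\nDog"
matches = re.findall("^cat$", text, flags=combined)
print(matches)  # ['Cat', 'cat']
```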

Exercise 4

  1. Write a function which takes a string parameter and returns True if the string is a valid Python variable name or False if it isn’t. You don’t have to check whether the string is a reserved keyword in Python – just whether it is otherwise syntactically valid. Test it using all the examples of valid and invalid variable names described in the first chapter.

  2. Write a function which takes a string which contains two words separated by any amount and type of whitespace, and returns a string in which the words are swapped around and the whitespace is preserved.

Parsing CSV files: csv

CSV stands for comma-separated values – it’s a very simple file format for storing tabular data. Most spreadsheets can easily be converted to and from CSV format.

In a typical CSV file, each line represents a row of values in the table, with the columns separated by commas. Field values are often enclosed in double quotes, so that any literal commas or newlines inside them can be escaped:

"one","two","three"
"four, five","six","seven"

Python’s csv module takes care of all this in the background, and allows us to manipulate the data in a CSV file in a simple way, using the reader class:

import csv

# newline="" lets the csv module handle line endings itself
with open("numbers.csv", newline="") as f:
    r = csv.reader(f)
    for row in r:
        print(row)

There is no single CSV standard – the comma may be replaced with a different delimiter (such as a tab), and a different quote character may be used. Both of these can be specified as optional keyword parameters to reader.
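
For instance, a tab-separated file which uses single quotes might be read as in the sketch below. To keep the example self-contained, an io.StringIO object stands in for an open file, and the data is invented:

```python
import csv
import io

# stand-in for a file opened with open(..., newline="")
data = io.StringIO("name\tmotto\nFluffy\t'purr, sleep'\n")

reader = csv.reader(data, delimiter="\t", quotechar="'")
rows = list(reader)
print(rows)  # [['name', 'motto'], ['Fluffy', 'purr, sleep']]
```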

Similarly, we can write to a CSV file using the writer class:

with open('pets.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Fluffy', 'cat'])
    w.writerow(['Max', 'dog'])

We can use optional parameters to writer to specify the delimiter and quote character, and also whether to quote all fields or only fields with characters which need to be escaped.
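
The sketch below compares the default behaviour with a writer which uses a semicolon as the delimiter and quotes every field. It writes to io.StringIO objects instead of real files so that the output can be inspected directly; the pet data is invented:

```python
import csv
import io

pets = [["Fluffy", "cat"], ["Max", "dog, large"]]

# default settings: only fields which need escaping are quoted
minimal = io.StringIO()
csv.writer(minimal).writerows(pets)

# a different delimiter, with every field quoted
everything = io.StringIO()
csv.writer(everything, delimiter=";", quoting=csv.QUOTE_ALL).writerows(pets)

print(minimal.getvalue())
print(everything.getvalue())
```

Note that the writer ends each row with a carriage return and a newline by default.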

Exercise 5

  1. Open a CSV file which contains three columns of numbers. Write out the data to a new CSV file, swapping around the second and third columns and adding a fourth column which contains the sum of the first three.

Writing scripts: sys and argparse

We have already seen a few scripts. Technically speaking, any Python file can be considered a script, since it can be executed without compilation. When we call a Python program a script, however, we usually mean that it contains statements other than function and class definitions – scripts do something other than define structures to be reused.

Scripts vs libraries

We can combine class and function definitions with statements that use them in the same file, but in a large project it is considered good practice to keep them separate: to define all our classes in library files, and import them into the main program. If we do put both classes and main program in one file, we can ensure that the program is only executed when the file is run as a script and not if it is imported from another file – we saw an example of this earlier:

class MyClass:
    pass

class MyOtherClass:
    pass

if __name__ == '__main__':
    my_object = MyClass()
    # do more things

If our file is written purely for use as a script, and will never be imported, including this conditional statement is considered unnecessary.

Simple command-line parameters

When we run a program on the command line, we often want to pass in parameters, or arguments, just as we would pass parameters to a function inside our code. For example, when we use the Python interpreter to run a file, we pass the filename in as an argument. Unlike parameters passed to a function in Python, arguments passed to an application on the command line are separated by spaces and listed after the program name without any brackets.

The simplest way to access command-line arguments inside a script is through the sys module. All the arguments in order are stored in the module’s argv attribute. We must remember that the first argument is always the name of the script file, and that all the arguments will be provided in string format. Try saving this simple script and calling it with various arguments after the script name:

import sys

print(sys.argv)
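
Because the arguments arrive as strings, with the script name first, a script usually has to skip the first element and convert the rest. Here is a minimal sketch; total_of is a hypothetical helper, and the list passed to it imitates what sys.argv would contain:

```python
def total_of(arguments):
    # arguments[0] is the script name, so skip it;
    # the rest are strings and must be converted before we can add them
    return sum(int(value) for value in arguments[1:])

# imitates the command line: python add.py 3 4 5
total = total_of(["add.py", "3", "4", "5"])
print(total)  # 12
```

In a real script we would import sys and call total_of(sys.argv) instead of passing in a hand-written list.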

Complex command-line parameters

The sys module is good enough when we only have a few simple arguments – perhaps the name of a file to open, or a number which tells us how many times to execute a loop. When we want to provide a variety of complicated arguments, some of them optional, we need a better solution.

The argparse module allows us to define a wide range of compulsory and optional arguments. A commonly used type of argument is the flag, which we can think of as equivalent to a keyword argument in Python. A flag is optional, it has a name (sometimes both a long name and a short name) and it may have a value. In Linux and OSX programs, flag names often start with a dash (long names usually start with two), and this convention is sometimes followed by Windows programs too.

Here is a simple example of a program which uses argparse to define two positional arguments which must be integers, a third positional argument which specifies an operation to be performed on the two numbers, and a flag to turn on verbose output:

import argparse
import logging

parser = argparse.ArgumentParser()
# two integers
parser.add_argument("num1", help="the first number", type=int)
parser.add_argument("num2", help="the second number", type=int)
# a string, limited to a list of options
parser.add_argument("op", help="the desired arithmetic operation", choices=['add', 'sub', 'mul', 'div'])
# an optional flag, false by default, with a short and a long name
parser.add_argument("-v", "--verbose", help="turn on verbose output", action="store_true")

opts = parser.parse_args()

if opts.verbose:
    logging.basicConfig(level=logging.DEBUG)

logging.debug("First number: %d" % opts.num1)
logging.debug("Second number: %d" % opts.num2)
logging.debug("Operation: %s" % opts.op)

if opts.op == "add":
    result = opts.num1 + opts.num2
elif opts.op == "sub":
    result = opts.num1 - opts.num2
elif opts.op == "mul":
    result = opts.num1 * opts.num2
elif opts.op == "div":
    result = opts.num1 / opts.num2

print(result)

argparse automatically defines a help parameter, which causes the program’s usage instructions to be printed when we pass -h or --help to the script. These instructions are automatically generated from the descriptions we supply in all the argument definitions. We will also see informative error output if we don’t pass in the correct arguments. Try calling the script above with different arguments!
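
We can also experiment without leaving the interpreter, because parse_args accepts an explicit list of strings in place of the real command line. The sketch below rebuilds a parser like the one above inside a hypothetical build_parser function and parses two hand-written argument lists:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("num1", type=int)
    parser.add_argument("num2", type=int)
    parser.add_argument("op", choices=["add", "sub", "mul", "div"])
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser

parser = build_parser()

# equivalent to running: python script.py 6 3 div -v
opts = parser.parse_args(["6", "3", "div", "-v"])
print(opts.num1, opts.num2, opts.op, opts.verbose)  # 6 3 div True

# the flag is False when it is left out
quiet = parser.parse_args(["6", "3", "add"])
print(quiet.verbose)  # False
```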

Note

If we are using Linux or OSX, we can turn our scripts into executable files. Then we can execute them directly instead of passing them as parameters to Python. To make our script executable we must mark it as executable using a system tool (chmod). We must also add a line to the beginning of the file to let the operating system know that it should use Python to execute it. This is typically #!/usr/bin/env python (or #!/usr/bin/env python3 on systems where the python command is missing or refers to Python 2).

Exercise 6

  1. Write a script which reorders the columns in a CSV file. It should take as parameters the path of the original CSV file, a string listing the indices of the columns in the order that they should appear, and optionally a path to the destination file (by default it should have the same name as the original file, but with a suffix). The script should return an error if the list of indices cannot be parsed or if any of the indices are not valid (too low or too high). You may allow indices to be negative or repeated. You should include usage instructions.

Answers to exercises

Answer to exercise 1

  1. Here is an example program:

    import datetime
    
    today = datetime.datetime.today()
    
    for w in range(10):
        day = today + datetime.timedelta(weeks=w)
        print(day.strftime("%Y-%m-%d"))
    

Answer to exercise 2

  1. Here is an example program:

    import math
    
    class Sphere:
        def __init__(self, radius):
            self.radius = radius
    
        def volume(self):
            return (4/3) * math.pi * math.pow(self.radius, 3)
    
        def surface_area(self):
            return 4 * math.pi * self.radius ** 2
    

Answer to exercise 3

  1. Here is an example program:

    import random
    
    secret_number = random.randint(1, 100)
    guess = None
    num_guesses = 0
    
    while not guess == secret_number:
        guess = int(input("Guess a number from 1 to 100: "))
        num_guesses += 1
    
        if guess == secret_number:
            suffix = '' if num_guesses == 1 else 'es'
            print("Congratulations! You guessed the number after %d guess%s." % (num_guesses, suffix))
            break
    
        if guess < secret_number:
            print("Too low!")
        else:
            print("Too high!")
    

Answer to exercise 4

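  1. Here is an example program:

    import re
    
    VALID_VARIABLE = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')
    
    def validate_variable_name(name):
        # fullmatch only succeeds if the whole string matches the pattern
        return bool(VALID_VARIABLE.fullmatch(name))
    
  2. Here is an example program:

    import re
    
    WORDS = re.compile(r'(\S+)(\s+)(\S+)')
    
    def swap_words(s):
        return WORDS.sub(r'\3\2\1', s)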
    

Answer to exercise 5

  1. Here is an example program:

    import csv
    
    with open("numbers.csv", newline="") as f_in:
        with open("numbers_new.csv", "w", newline="") as f_out:
            r = csv.reader(f_in)
            w = csv.writer(f_out)
            for row in r:
                w.writerow([row[0], row[2], row[1], sum(float(c) for c in row)])
    

Answer to exercise 6

  1. Here is an example program:

    import sys
    import argparse
    import csv
    import re
    
    parser = argparse.ArgumentParser()
    parser.add_argument("input", help="the input CSV file")
    parser.add_argument("order", help="the desired column order; comma-separated; starting from zero")
    parser.add_argument("-o", "--output", help="the destination CSV file")
    
    opts = parser.parse_args()
    
    output_file = opts.output
    if not output_file:
        # flags must be passed by keyword: the fourth positional parameter of re.sub is count
        output_file = re.sub(r"\.csv$", "_reordered.csv", opts.input, flags=re.IGNORECASE)
    
    try:
        new_row_indices = [int(i) for i in opts.order.split(',')]
    except ValueError:
        sys.exit("Unable to parse column list.")
    
    with open(opts.input, newline="") as f_in:
        with open(output_file, "w", newline="") as f_out:
            r = csv.reader(f_in)
            w = csv.writer(f_out)
            for row in r:
                new_row = []
                for i in new_row_indices:
                    try:
                        new_row.append(row[i])
                    except IndexError:
                        sys.exit("Invalid column: %d" % i)
                w.writerow(new_row)