**************************************
Useful modules in the Standard Library
**************************************

Python comes with a built-in selection of modules which provide commonly used functionality. We have encountered some of these modules in previous chapters -- for example, ``itertools``, ``logging``, ``pdb`` and ``unittest``. We will look at a few more examples in this chapter. This is only a brief overview of a small subset of the available modules -- you can see the full list, and find out more details about each one, by reading the Python Standard Library documentation.

Date and time: ``datetime``
===========================

The ``datetime`` module provides us with objects which we can use to store information about dates and times:

* ``datetime.date`` is used to create dates which are not associated with a time.
* ``datetime.time`` is used for times which are independent of a date.
* ``datetime.datetime`` is used for objects which have both a date and a time.
* ``datetime.timedelta`` objects store *differences* between dates or datetimes -- if we subtract one datetime from another, the result will be a timedelta.
* ``datetime.timezone`` objects represent time zone adjustments as offsets from UTC. This class is a subclass of ``datetime.tzinfo``, which is not meant to be used directly.

We can query these objects for a particular component (like the year, month, hour or minute), perform arithmetic on them, and extract printable string versions from them if we need to display them.
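As a small sketch of how ``timedelta`` and ``timezone`` objects fit together, we can subtract two datetimes and then express one of them in another time zone. The dates and the two-hour offset are arbitrary values chosen for illustration:

```python
import datetime

# Two fixed moments, both marked as UTC (arbitrary example values)
start = datetime.datetime(2020, 1, 1, 12, 0, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2020, 1, 3, 18, 30, tzinfo=datetime.timezone.utc)

# Subtracting one datetime from another gives a timedelta
elapsed = end - start
print(type(elapsed))
print(elapsed.days, elapsed.seconds)

# A timezone is built from a timedelta which gives its offset from UTC
plus_two = datetime.timezone(datetime.timedelta(hours=2))
local_end = end.astimezone(plus_two)
print(local_end.isoformat())

# date and time objects each hold one half of a datetime
print(end.date(), end.time())
```

Converting to another time zone changes how the moment is displayed, but not the moment itself, so ``local_end == end`` is still true.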
Here are a few examples::

    import datetime

    # this class method creates a datetime object with the current date and time
    now = datetime.datetime.today()

    print(now.year)
    print(now.hour)
    print(now.minute)

    print(now.weekday())

    print(now.strftime("%a, %d %B %Y"))

    long_ago = datetime.datetime(1999, 3, 14, 12, 30, 58)

    print(long_ago) # remember that this calls str automatically
    print(long_ago < now)

    difference = now - long_ago
    print(type(difference))
    print(difference) # remember that this calls str automatically

Exercise 1
----------

#. Print ten dates, each a week apart, starting from today, in the form *YYYY-MM-DD*.

Mathematical functions: ``math``
================================

The ``math`` module is a collection of mathematical functions. They can be used on floats or integers, but are mostly intended to be used on floats, and usually return floats. Here are a few examples::

    import math

    # These are constant attributes, not functions
    math.pi
    math.e

    # round a float up or down
    math.ceil(3.3)
    math.floor(3.3)

    # natural logarithm
    math.log(5)

    # logarithm with base 10
    math.log(5, 10)
    math.log10(5) # this function is slightly more accurate

    # square root
    math.sqrt(10)

    # trigonometric functions
    math.sin(math.pi/2)
    math.cos(0)

    # convert between radians and degrees
    math.degrees(math.pi/2)
    math.radians(90)

If you need mathematical functions to use on complex numbers, you should use the ``cmath`` module instead.

Exercise 2
----------

#. Write a class which represents a sphere of a given radius. Write a method which calculates the sphere's volume, and one which calculates its surface area.

Pseudo-random numbers: ``random``
=================================

We call a sequence of numbers *pseudo-random* when it appears in some sense to be random, but actually isn't.
Pseudo-random number sequences are generated by some kind of predictable algorithm, but they possess enough of the properties of truly random sequences that they can be used in many applications that call for random numbers.

It is difficult for a computer to generate numbers which are genuinely random. It is possible to gather truly random input using hardware, from sources such as the user's keystrokes or tiny fluctuations in voltage measurements, and use that input to generate random numbers, but this process is more complicated and expensive than pseudo-random number generation, which can be done purely in software.

Because pseudo-random sequences aren't actually random, it is also possible to reproduce the exact same sequence twice. That isn't something we would want to do by accident, but it is a useful thing to be able to do deliberately while debugging software, or in an automated test.

In Python we can use the ``random`` module to generate pseudo-random numbers, and do a few more things which depend on randomness. The core function of the module generates a random float between 0 and 1, and most of the other functions are derived from it. Here are a few examples::

    import random

    # a random float from 0 to 1 (excluding 1)
    random.random()

    pets = ["cat", "dog", "fish"]
    # a random element from a sequence
    random.choice(pets)
    # shuffle a list (in place)
    random.shuffle(pets)

    # a random integer from 1 to 10 (inclusive)
    random.randint(1, 10)

When we load the ``random`` module we can *seed* it before we start generating values. We can think of this as picking a place in the pseudo-random sequence where we want to start. We normally want to start in a different place every time -- by default, the module is seeded with a value taken from the operating system's source of randomness if one is available, or from the system clock otherwise.
If we want to reproduce the same random sequence multiple times -- for example, inside a unit test -- we need to pass the same integer or string as a parameter to ``seed`` each time::

    # set a predictable seed
    random.seed(3)
    random.random()
    random.random()
    random.random()

    # now try it again
    random.seed(3)
    random.random()
    random.random()
    random.random()

    # and now try a different seed
    random.seed("something completely different")
    random.random()
    random.random()
    random.random()

Exercise 3
----------

#. Write a program which randomly picks an integer from 1 to 100. Your program should prompt the user for guesses -- if the user guesses incorrectly, it should print whether the guess is too high or too low. If the user guesses correctly, the program should print how many guesses the user took to guess the right answer. You can assume that the user will enter valid input.

Matching string patterns: ``re``
================================

The ``re`` module allows us to write *regular expressions*. Regular expressions are a mini-language for matching strings, and can be used to find and possibly replace text. If you learn how to use regular expressions in Python, you will find that they work in much the same way in other languages.

The full range of capabilities of regular expressions is quite extensive, and they are often criticised for their potential complexity, but with the knowledge of only a few basic concepts we can perform some very powerful string manipulation easily.

.. Note:: Regular expressions are good for use on plain text, but a bad fit for parsing more structured text formats like XML -- you should always use a more specialised parsing library for those.

The Python documentation for the ``re`` module not only explains how to use the module, but also contains a reference for the complete regular expression syntax which Python supports.

A regular expression primer
---------------------------

A regular expression is a string which describes a pattern.
This pattern is compared to other strings, which may or may not match it. A regular expression can contain normal characters (which are treated literally as specific letters, numbers or other symbols) as well as special symbols which have different meanings within the expression.

Because many special symbols use the backslash (``\``) character, we often use *raw strings* to represent regular expressions in Python. This eliminates the need to use extra backslashes to escape backslashes, which would make complicated regular expressions much more difficult to read. If a regular expression doesn't contain any backslashes, it doesn't matter whether we use a raw string or a normal string.

Here are some very simple examples::

    # this regular expression contains no special symbols
    # it won't match anything except 'cat'
    "cat"

    # a . stands for any single character (except the newline, by default)
    # this will match 'cat', 'cbt', 'c3t', 'c!t' ...
    "c.t"

    # a * repeats the previous character 0 or more times
    # it can be used after a normal character, or a special symbol like .
    # this will match 'ct', 'cat', 'caat', 'caaaaaaaaat' ...
    "ca*t"
    # this will match 'sc', 'sac', 'sic', 'supercalifragilistic' ...
    "s.*c"

    # + is like *, but the character must occur at least once
    # there must be at least one 'a'
    "ca+t"

    # more generally, we can use curly brackets {} to specify any number of repeats
    # or a minimum and maximum
    # this will match any five-letter word which starts with 'c' and ends with 't'
    "c.{3}t"
    # this will match any five-, six-, or seven-letter word ...
    "c.{3,5}t"

    # One of the uses for ? is matching the previous character zero or one times
    # this will match 'http' or 'https'
    "https?"

    # square brackets [] define a set of allowed values for a character
    # they can contain normal characters, or ranges
    # if ^ is the first character in the brackets, it *negates* the contents
    # the character between 'c' and 't' must be a vowel
    "c[aeiou]t"
    # this matches any character that *isn't* a vowel, three times
    "[^aeiou]{3}"
    # This matches an uppercase UCT student number
    "[B-DF-HJ-NP-TV-Z]{3}[A-Z]{3}[0-9]{3}"

    # we use \ to escape any special regular expression character
    # this would match 'c*t'
    r"c\*t"
    # note that we have used a raw string, so that we can write a literal backslash

    # there are also some shorthand symbols for certain allowed subsets of characters:
    # \d matches any digit
    # \s matches any whitespace character, like space, tab or newline
    # \w matches alphanumeric characters -- letters, digits or the underscore
    # \D, \S and \W are the opposites of \d, \s and \w

    # we can use round brackets () to *capture* portions of the pattern
    # this is useful if we want to search and replace
    # we can retrieve the contents of the capture in the replace step
    # this will capture whatever would be matched by .*
    "c(.*)t"

    # ^ and $ denote the beginning or end of a string
    # this will match a string which starts with 'c' and ends in 't'
    "^c.*t$"

    # | means "or" -- it lets us choose between multiple options.
    "cat|dog"

Using the ``re`` module
-----------------------

Now that we have seen how to construct regular expression strings, we can start using them. The ``re`` module provides us with several functions which allow us to use regular expressions in different ways:

* ``search`` searches for the regular expression inside a string -- the regular expression will match if any substring of the string matches.
* ``match`` matches a regular expression against the *beginning* of the string -- the regular expression will only match if the string matches from its first character onwards, although other text may follow the match. ``re.match('something', some_string)`` is equivalent to ``re.search('^something', some_string)`` when the ``MULTILINE`` flag is not set. If the *whole string* has to match, we can use ``fullmatch`` instead.
* ``sub`` searches for the regular expression and replaces it with the provided replacement expression.
* ``findall`` searches for all matches of the regular expression within the string.
* ``split`` splits a string using any regular expression as a delimiter.
* ``compile`` allows us to convert our regular expression string to a pre-compiled regular expression *object*, which has methods analogous to the functions of the ``re`` module. Using this object is slightly more efficient.

As you can see, this module provides more powerful versions of some simple string operations: for example, we can also split a string or replace a substring using the built-in ``split`` and ``replace`` methods -- but we can only use them with *fixed* delimiters or search patterns and replacements. With ``re.sub`` and ``re.split`` we can specify variable patterns instead of fixed strings.

All of the functions take a regular expression as the first parameter. ``match``, ``search``, ``findall`` and ``split`` also take the string to be searched as the second parameter -- but in the ``sub`` function this is the third parameter, the second being the replacement string. All the functions also take a keyword parameter which specifies optional *flags*, which we will discuss shortly.

``match`` and ``search`` both return match objects which store information such as the contents of captured groups, or ``None`` if nothing matches. ``sub`` returns a modified copy of the original string. ``findall`` and ``split`` return a list of strings. ``compile`` returns a compiled regular expression object.

The methods of a regular expression object are very similar to the functions of the module, but the first parameter (the regular expression string) of each method is dropped -- because it has already been compiled into the object.
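The return values described above can be checked with a short sketch. It uses a compiled expression, so the pattern is not repeated in each call. It also uses ``fullmatch``, which is part of the ``re`` module and requires the whole string to match. The sample strings are made up for illustration::

```python
import re

# compile once; the methods then take only the string to be searched
pattern = re.compile(r"c.*t")

start = pattern.match("cravats")      # match object: the pattern fits at the start
whole = pattern.fullmatch("cravats")  # None: the final 's' is left over
found = pattern.search("my cravat")   # match object: found further along

print(start.group())   # the text that was matched
print(whole)
print(found.span())    # start and end positions of the match

# sub returns a new string; split returns a list of strings
print(pattern.sub("X", "cat"))
print(re.split(r"\s+", "one two   three"))
```

Because a failed ``match`` or ``search`` returns ``None``, it is usually safest to test the result before calling methods such as ``group`` on it.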
Here are some usage examples::

    import re

    # match and search are quite similar
    print(re.match("c.*t", "cravat")) # this will match
    print(re.match("c.*t", "I have a cravat")) # this won't
    print(re.search("c.*t", "I have a cravat")) # this will

    # We can use a static string as a replacement...
    print(re.sub("lamb", "squirrel", "Mary had a little lamb."))
    # Or we can capture groups, and substitute their contents back in.
    print(re.sub("(.*) (BITES) (.*)", r"\3 \2 \1", "DOG BITES MAN"))
    # count is a keyword parameter which we can use to limit replacements
    print(re.sub("a", "b", "aaaaaaaaaa"))
    print(re.sub("a", "b", "aaaaaaaaaa", count=1))

    # Here's a closer look at a match object.
    my_match = re.match("(.*) (BITES) (.*)", "DOG BITES MAN")
    print(my_match.groups())
    print(my_match.group(1))

    # We can name groups.
    my_match = re.match("(?P<subject>.*) (?P<verb>BITES) (?P<object>.*)", "DOG BITES MAN")
    print(my_match.group("subject"))
    print(my_match.groupdict())
    # We can still access named groups by their positions.
    print(my_match.group(1))

    # Sometimes we want to find all the matches in a string.
    print(re.findall("[^ ]+@[^ ]+", "Bob , Jane "))

    # Sometimes we want to split a string.
    print(re.split(", *", "one,two, three, four"))

    # We can compile a regular expression to an object
    my_regex = re.compile("(.*) (BITES) (.*)")
    # now we can use it in a very similar way to the module
    print(my_regex.sub(r"\3 \2 \1", "DOG BITES MAN"))

Greed
-----

Regular expressions are *greedy* by default -- this means that if a part of a regular expression can match a variable number of characters, it will always try to match as many characters as possible. That means that we sometimes need to take special care to make sure that a regular expression doesn't match too much. For example::

    # this is going to match everything between the first and last '"'
    # but that's not what we want!
    print(re.findall('".*"', '"one" "two" "three" "four"'))

    # This is a common trick
    print(re.findall('"[^"]*"', '"one" "two" "three" "four"'))

    # We can also use ? after * or other expressions to make them *not greedy*
    print(re.findall('".*?"', '"one" "two" "three" "four"'))

Functions as replacements
-------------------------

We can also use ``re.sub`` to apply a *function* to a match instead of a string replacement. The function must take a match object as a parameter, and return a string. We can use this functionality to perform modifications which may be difficult or impossible to express as a replacement string::

    def swap(m):
        subject = m.group("object").title()
        verb = m.group("verb")
        object = m.group("subject").lower()
        return "%s %s %s!" % (subject, verb, object)

    print(re.sub("(?P<subject>.*) (?P<verb>.*) (?P<object>.*)!", swap, "Dog bites man!"))

Flags
-----

Regular expressions have historically tended to be applied to text line by line -- newlines have usually required special handling. In Python, the text is treated as a single unit by default, but we can change this and a few other options using *flags*. These are the most commonly used:

* ``re.IGNORECASE`` -- make the regular expression case-insensitive. It is case-sensitive by default.
* ``re.MULTILINE`` -- make ``^`` and ``$`` match the beginning and end of each line (excluding the newline at the end), as well as the beginning and end of the whole string (which is the default).
* ``re.DOTALL`` -- make ``.`` match any character (by default it does not match newlines).

Here are a few examples::

    print(re.match("cat", "Cat")) # this won't match
    print(re.match("cat", "Cat", re.IGNORECASE)) # this will

    text = """numbers = 'one,
    two,
    three'
    numbers = 'four,
    five,
    six'
    not_numbers = 'cat,
    dog'"""

    print(re.findall("^numbers = '.*?'", text)) # this won't find anything
    # we need both DOTALL and MULTILINE
    print(re.findall("^numbers = '.*?'", text, re.DOTALL | re.MULTILINE))

.. Note:: ``re`` functions only have a single keyword parameter for flags, but we can combine multiple flags into one using the ``|`` operator (bitwise *or*) -- this is because the values of these constants are actually integer powers of two.

Exercise 4
----------

#. Write a function which takes a string parameter and returns ``True`` if the string is a valid Python variable name or ``False`` if it isn't. You don't have to check whether the string is a reserved keyword in Python -- just whether it is otherwise syntactically valid. Test it using all the examples of valid and invalid variable names described in the first chapter.

#. Write a function which takes a string which contains two words separated by any amount and type of whitespace, and returns a string in which the words are swapped around and the whitespace is preserved.

Parsing CSV files: ``csv``
==========================

CSV stands for *comma-separated values* -- it's a very simple file format for storing tabular data. Most spreadsheets can easily be converted to and from CSV format. In a typical CSV file, each line represents a row of values in the table, with the columns separated by commas. Field values are often enclosed in double quotes, so that any literal commas or newlines inside them can be escaped::

    "one","two","three"
    "four, five","six","seven"

Python's ``csv`` module takes care of all this in the background, and allows us to manipulate the data in a CSV file in a simple way, using a ``reader`` object::

    import csv

    # newline="" lets the csv module handle line endings inside quoted fields itself
    with open("numbers.csv", newline="") as f:
        r = csv.reader(f)
        for row in r:
            print(row)

There is no single CSV standard -- the comma may be replaced with a different delimiter (such as a tab), and a different quote character may be used. Both of these can be specified as optional keyword parameters to ``reader``.
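As a minimal sketch of these two keyword parameters, the example below reads tab-separated data which uses single quotes around fields. To keep the sketch self-contained, an ``io.StringIO`` object stands in for an open file, and the sample data is made up for illustration::

```python
import csv
import io

# io.StringIO behaves like an open text file, so no real file is needed here
data = io.StringIO("one\ttwo\tthree\n'four\tfive'\tsix\tseven\n")

# delimiter and quotechar override the default comma and double quote
rows = list(csv.reader(data, delimiter="\t", quotechar="'"))

for row in rows:
    print(row)
```

The tab inside the quoted field ``'four\tfive'`` is kept as part of the value and is not treated as a column separator.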
Similarly, we can *write* to a CSV file using a ``writer`` object::

    # newline='' stops the line endings written by the csv module from being translated again
    with open('pets.csv', 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['Fluffy', 'cat'])
        w.writerow(['Max', 'dog'])

We can use optional parameters to ``writer`` to specify the delimiter and quote character, and also whether to quote all fields or only fields with characters which need to be escaped.

Exercise 5
----------

#. Open a CSV file which contains three columns of numbers. Write out the data to a new CSV file, swapping around the second and third columns and adding a fourth column which contains the sum of the first three.

Writing scripts: ``sys`` and ``argparse``
=========================================

We have already seen a few scripts. Technically speaking, any Python file can be considered a script, since it can be executed without compilation. When we call a Python program a script, however, we usually mean that it contains statements other than function and class definitions -- scripts *do something* other than define structures to be reused.

Scripts vs libraries
--------------------

We can combine class and function definitions with statements that use them in the same file, but in a large project it is considered good practice to keep them separate: to define all our classes in *library* files, and import them into the main program. If we do put both classes and main program in one file, we can ensure that the program is only executed when the file is run as a script and not if it is imported from another file -- we saw an example of this earlier::

    class MyClass:
        pass

    class MyOtherClass:
        pass

    if __name__ == '__main__':
        my_object = MyClass()
        # do more things

If our file is written purely for use as a script, and will never be imported, including this conditional statement is considered unnecessary.
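A common variation on this pattern is to move the script's statements into a function, so that the conditional block contains only a single call. The sketch below uses hypothetical names: ``Greeter`` and ``main`` are chosen for illustration and do not come from earlier chapters::

```python
class Greeter:
    """A reusable class which other files could import."""

    def __init__(self, name):
        self.name = name

    def greeting(self):
        return "Hello, %s!" % self.name


def main():
    # everything the script *does* lives here, not at module level
    print(Greeter("world").greeting())


if __name__ == "__main__":
    # only runs when the file is executed as a script, not when it is imported
    main()
```

Another file can then import ``Greeter`` without triggering the output, and can still call ``main`` explicitly if it needs to.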
Simple command-line parameters
------------------------------

When we run a program on the command line, we often want to pass in parameters, or *arguments*, just as we would pass parameters to a function inside our code. For example, when we use the Python interpreter to run a file, we pass the filename in as an argument. Unlike parameters passed to a function in Python, arguments passed to an application on the command line are separated by spaces and listed after the program name without any brackets.

The simplest way to access command-line arguments inside a script is through the ``sys`` module. All the arguments in order are stored in the module's ``argv`` attribute. We must remember that the first element of this list is always the name of the script file, and that all the arguments will be provided in string format. Try saving this simple script and calling it with various arguments after the script name::

    import sys

    print(sys.argv)

Complex command-line parameters
-------------------------------

The ``sys`` module is good enough when we only have a few simple arguments -- perhaps the name of a file to open, or a number which tells us how many times to execute a loop. When we want to provide a variety of complicated arguments, some of them optional, we need a better solution. The ``argparse`` module allows us to define a wide range of compulsory and optional arguments.

A commonly used type of argument is the *flag*, which we can think of as equivalent to a keyword argument in Python. A flag is optional, it has a name (sometimes both a long name and a short name) and it may have a value. In Linux and OSX programs, flag names often start with a dash (long names usually start with two), and this convention is sometimes followed by Windows programs too.
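To make the idea of a flag concrete, here is a minimal sketch with two hypothetical flags, ``--count`` and ``--quiet``, invented for illustration. ``parse_args`` normally reads the arguments from ``sys.argv``, but it also accepts a list of strings, which is convenient for experimenting in the interpreter::

```python
import argparse

parser = argparse.ArgumentParser()

# a flag with a short and a long name, which takes an integer value
parser.add_argument("-n", "--count", type=int, default=1,
                    help="how many times to repeat")
# a flag without a value: True if present, False if absent
parser.add_argument("-q", "--quiet", action="store_true",
                    help="suppress output")

# passing a list of strings instead of relying on sys.argv
defaults = parser.parse_args([])
short_form = parser.parse_args(["-n", "3"])
long_form = parser.parse_args(["--count", "5", "--quiet"])

print(defaults)
print(short_form)
print(long_form)
```

Both flags are optional, so parsing an empty list succeeds and simply yields the default values.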
Here is a simple example of a program which uses ``argparse`` to define two positional arguments which must be integers, a third positional argument which specifies an operation to be performed on the two numbers, and a flag to turn on verbose output::

    import argparse
    import logging

    parser = argparse.ArgumentParser()
    # two integers
    parser.add_argument("num1", help="the first number", type=int)
    parser.add_argument("num2", help="the second number", type=int)
    # a string, limited to a list of options
    parser.add_argument("op", help="the desired arithmetic operation", choices=['add', 'sub', 'mul', 'div'])
    # an optional flag, false by default, with a short and a long name
    parser.add_argument("-v", "--verbose", help="turn on verbose output", action="store_true")

    opts = parser.parse_args()

    if opts.verbose:
        logging.basicConfig(level=logging.DEBUG)

    logging.debug("First number: %d" % opts.num1)
    logging.debug("Second number: %d" % opts.num2)
    logging.debug("Operation: %s" % opts.op)

    if opts.op == "add":
        result = opts.num1 + opts.num2
    elif opts.op == "sub":
        result = opts.num1 - opts.num2
    elif opts.op == "mul":
        result = opts.num1 * opts.num2
    elif opts.op == "div":
        result = opts.num1 / opts.num2

    print(result)

``argparse`` automatically defines a ``help`` parameter, which causes the program's usage instructions to be printed when we pass ``-h`` or ``--help`` to the script. These instructions are automatically generated from the descriptions we supply in all the argument definitions. We will also see informative error output if we don't pass in the correct arguments. Try calling the script above with different arguments!

.. Note:: if we are using Linux or OSX, we can turn our scripts into *executable files*. Then we can execute them directly instead of passing them as parameters to Python. To make our script executable we must mark it as executable using a system tool (``chmod``). We must also add a line to the beginning of the file to let the operating system know that it should use Python to execute it.
   This is typically ``#!/usr/bin/env python``.

Exercise 6
----------

#. Write a script which reorders the columns in a CSV file. It should take as parameters the path of the original CSV file, a string listing the indices of the columns in the order that they should appear, and optionally a path to the destination file (by default it should have the same name as the original file, but with a suffix). The script should return an error if the list of indices cannot be parsed or if any of the indices are not valid (too low or too high). You may allow indices to be negative or repeated. You should include usage instructions.

Answers to exercises
====================

Answer to exercise 1
--------------------

#. Here is an example program::

       import datetime

       today = datetime.datetime.today()

       for w in range(10):
           day = today + datetime.timedelta(weeks=w)
           print(day.strftime("%Y-%m-%d"))

Answer to exercise 2
--------------------

#. Here is an example program::

       import math

       class Sphere:
           def __init__(self, radius):
               self.radius = radius

           def volume(self):
               return (4/3) * math.pi * math.pow(self.radius, 3)

           def surface_area(self):
               return 4 * math.pi * self.radius ** 2

Answer to exercise 3
--------------------

#. Here is an example program::

       import random

       secret_number = random.randint(1, 100)
       guess = None
       num_guesses = 0

       while not guess == secret_number:
           guess = int(input("Guess a number from 1 to 100: "))
           num_guesses += 1

           if guess == secret_number:
               suffix = '' if num_guesses == 1 else 'es'
               print("Congratulations! You guessed the number after %d guess%s." % (num_guesses, suffix))
               break

           if guess < secret_number:
               print("Too low!")
           else:
               print("Too high!")

Answer to exercise 4
--------------------

#. ::

       import re

       VALID_VARIABLE = re.compile('[a-zA-Z_][a-zA-Z0-9_]*')

       def validate_variable_name(name):
           # fullmatch requires the whole string to match, not just its beginning
           return bool(VALID_VARIABLE.fullmatch(name))

#. ::

       import re

       # a raw string, so that \S and \s reach the re module unchanged
       WORDS = re.compile(r'(\S+)(\s+)(\S+)')

       def swap_words(s):
           return WORDS.sub(r'\3\2\1', s)

Answer to exercise 5
--------------------

#.
   Here is an example program::

       import csv

       with open("numbers.csv", newline="") as f_in:
           with open("numbers_new.csv", "w", newline="") as f_out:
               r = csv.reader(f_in)
               w = csv.writer(f_out)
               for row in r:
                   w.writerow([row[0], row[2], row[1], sum(float(c) for c in row)])

.. Todo:: why does writerow echo a number to the console?

Answer to exercise 6
--------------------

#. Here is an example program::

       import sys
       import argparse
       import csv
       import re

       parser = argparse.ArgumentParser()
       parser.add_argument("input", help="the input CSV file")
       parser.add_argument("order", help="the desired column order; comma-separated; starting from zero")
       parser.add_argument("-o", "--output", help="the destination CSV file")

       opts = parser.parse_args()

       output_file = opts.output
       if not output_file:
           # flags must be passed by keyword: the fourth positional parameter of sub is count
           output_file = re.sub(r"\.csv$", "_reordered.csv", opts.input, flags=re.IGNORECASE)

       try:
           new_row_indices = [int(i) for i in opts.order.split(',')]
       except ValueError:
           sys.exit("Unable to parse column list.")

       with open(opts.input, newline="") as f_in:
           with open(output_file, "w", newline="") as f_out:
               r = csv.reader(f_in)
               w = csv.writer(f_out)
               for row in r:
                   new_row = []
                   for i in new_row_indices:
                       try:
                           new_row.append(row[i])
                       except IndexError:
                           sys.exit("Invalid column: %d" % i)

                   w.writerow(new_row)