Bending Python With Regular Expressions

Keeping the subject of Regular Expressions, today I will show in which case the usage of regular expression overcomes a (relatively complex) Python snippet.

Take a deep breath and let’s go…

Context

For this example I’ll use the context of a simple application that validates a list of IP addresses. If anyone is wondering, this is a real example, in one of my applications I need to check if an user entry user is an IP address or not.

Below I will show various pieces of code, if you want to test them, all snippets should be inserted in the same file. In this case I used a file called ip_tester.py.

Understanding the Code

All code is well commented and documented, but I’ll explain some lines in order to easy the understanding.

Let’s see the "header" of our program:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
This program is responsible for performance tests in the validation of IP
addresses using pure Python code and Regular Expressions.

Author: Magnun Leno
License: GPLv3
Filename: ip_tester.py
'''

# Importing the RegExp module
import re
# Importing the timeit module which will help us measuring the elapsed time
import timeit

# RegEx responsible for analyze the IP address "format"
ip_re = re.compile(r'^d{1,3}.d{1,3}.d{1,3}.d{1,3}$')

# Tuple with the data which will will be analysed (Valid or not IP addresses)
data = (
        '10.0.2.1',
        '192.168.1.1',
        'wrong.ip.address',
        '172.16.32.1',
        '10.0.0.396',
        'another.wrong.ip.address',
        '172.16.80.1',
        '172.error.80.2',
        'This surely isnt a valid IP!!!',
        '172.16.80.4',
        )

For this program I will use only the modules er (regular expression) and timeit (run time meter). The er module is already an acquaintance of ours, for those who read the last article, but the timeit module is "new" even for me (I discovered it recently). In the timer function we’ll see how it works.

After importing the modules, we have to compile a regular expression using the compile method from the re module. Since regular expressions are a little complicated I will not explain how it works here, maybe in the future I’ll write a series of articles about the RegEx in Python :D.

After the regular expression, we have the tuple data that stores all data (valid and invalid IP addresses) to be verified by our program.

Now let’s look at two functions used to decode the IP addresses:

def py_validate():
    '''Function responsible for validate the IP addesses in the data tuple
    '''
    for address in data:
        # Any Ip address always has 3 periods
        if address.count('.') != 3:
            continue
        # Split the string in the periods, creating a tuple
        fields = address.split('.')
        # Checks if every field is composed by numbers
        fields = [not field.isdigit() for field in fields]
        # If something isn't numbers, go to the next IP address
        if any(fields):
            continue
        else:
            # Valid IP
            #print 'Valid IP:',address
            pass

def re_validate():
    '''This function is responsible for validate hte IP addresses inside
the data tupple using only Regular Expressions.
    '''
    for address in data:
        # Verify if the IP "matches" the regular expression
        if ip_re.match(address):
            # Valid IP
            #print 'Valid IP:',address
            pass

The first function uses only Python code to ensure that he strings contained in data are valid IP addresses. The most complicated piece in this function (in my opinion) is the list comprehension and the any statement.

The snippet fields = [not field.isdigit() for field in fields] create a list containing the inverted boolean values (note the instruction not) returned by the field.isdigit(). This method tells if the string field is composed only by numeric values.

The passage any (fields) returns True if there is some value True in the list fields. With this we know if there is any True value in the list.

Looking quickly to the second function called re_validate, we can see how RegExp are powerful because, the code has extremely short an easy to understand.

To conclude this first test cycle, we create a function called timer that uses the class Timer to measure the elapsed time of 20,000 executions of a given function:

def timer(func):
    t = timeit.Timer(setup='from __main__ import '+ func, stmt=func+'()')
    print func.ljust(16)+':', t.timeit(number=20000)

if __name__ == '__main__':
    print 'Python timing...'
    timer('py_validate')

    print 'nRegEx timing...'
    timer('re_validate')

In addition to the function timer I also wrote the Python’s "main function", which calls the timer for the py_validate and re_validate functions.

Saving everything and running, we have the following result:

$ python ip_tester.py
Python timing...
py_validate       : 0.481071233749

RegEx timing...
re_validate       : 0.185606956482

Wow! This is humiliating! The RegEx was 2.6 times faster than the pure Python implementation! Okay I’m not being fair, the Python code can be optimized. To prove this I’ll created the function py2_validate:

def py2_validate():
    '''Function responsible for validate the IP addesses in the data tuple
    This version is much faster and a little superior then py_validate
    '''
    for address in data:
        if address.count('.') != 3:
            continue
        fields = address.split('.')
        for field in fields:
            if not field.isdigit():
                break
        else:
            # Valid IP
            #print 'Valid IP:',address
            pass

To test this new function change the "main" program to be as follows:

if __name__ == '__main__':
    print 'Python timing...'
    timer('py_validate')
    timer('py2_validate')

    print 'nRegEx timing...'
    timer('re_validate')

After the execution we have an unexpected joy:

$ python ip_tester.py
Python timing...
py_validate       : 0.495260000229
py2_validate      : 0.367525100708

RegEx timing...
re_validate       : 0.186715841293

Even improving our pure Python code the regular expression implementation still more efficient (almost 2 times faster).

Is That All

Of course not! Our "IP validator" still has flaws. It only checks if the passed string is composed of four groups of numbers separated by dots and nothing more. Now we need to make sure that they are valid, so we’ll consider that a valid IP is in the range between 0.0.0.0 and 255.255.255.255.

To meet this we need to add two snippets of code in the "header" of our program, as follows:

# Importing the RegExp module
import re
# Importing the timeit module which will help us measuring the elapsed time
import timeit

# RegEx responsible for analyze the IP address "format"
ip_re = re.compile(r'^d{1,3}.d{1,3}.d{1,3}.d{1,3}$')
# RegEx responsible for analyze then IP address contents
ip_content_re = re.compile(r'^'+
        r'(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9]).'+
        r'(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9]).'+
        r'(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9]).'+
        r'(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])$'
        )
# Lambda used to inform if a number is inside or outside the interval
is_valid = lambda n : not(255 >= int(n) >= 0)

The first inserted code is merely a new regular expression called ip_content_re, while the second code is the lambda is_valid, both responsible for analyze and report if the number is inside or outside the range between 0 and 255.

Now we create two more functions:

def py_validate_all():
    '''Function responsible for validate the contents of the IP
    address inside the tuple.
    '''
    for address in data:
        if address.count('.') != 3:
            continue
        fields = address.split('.')
        for field in fields:
            if not field.isdigit():
                break
        else:
            if any(map(is_valid, fields)):
                continue
            # Valid IP
            #print 'Valid IP:',address
            pass

def re_validate_all():
    '''Function responsible for validate the content of the IP
    address inside the tuple using Regular Expressions
    '''
    for address in data:
        # Verifies if the IP address "match" the regular expression
        if ip_content_re.match(address):
            # Valid IP
            #print 'Valid IP:',address
            pass

These functions do not differ much from the others shown here, I just added the instruction map in conjunction with the lambda - in the py_validate_all function - and the use of the new regular expression ip_content_re - in the re_validate_all function. To conclude this series of tests just add a call to the timer function for each new function. Our "main function" will look like:

if __name__ == '__main__':
    print 'Python timing...'
    timer('py_validate')
    timer('py2_validate')
    timer('py_validate_all')

    print 'nRegEx timing...'
    timer('re_validate')
    timer('re_validate_all')

Now for the final test:

$ python ip_tester.py
Python timing...
py_validate       : 0.496430158615
py2_validate      : 0.377727985382
py_validate_all  : 1.28690600395

RegEx timing...
re_validate       : 0.186795949936
re_validate_all  : 0.310018062592

Now, we totally break Python down! Regular expressions was 4.15 times faster than pure Python implementation.

Source Code

All source code used is available here. Remember this is a GPL code then you can use without fear!

Conclusion

After this brief analysis on usage of regular expressions in Python we saw that there are cases that they aren’t the best choice, but in others cases they can give you a performance boost and dramatically reduce your code. But be careful, regular expressions are extremely treacherous, test them thoroughly before you consider them fully valid, or you may end up having an unexpected behavior in your application.

Oh, and of course, regular expressions can become true of 7 headed monsters which loves to devour programmer’s leavers! If your regular expression is starting to get too complex, consider breaking it into several pieces, it can keep your sanity longer!

Until next time …