I’ll start this article the same way Alex Gaynor (the speaker) began his talk: "Who here remembers their CS data structures class?" Then he adds: "Who cares?"
For those who learned to program in high-level languages, data structures aren’t something to worry about. But for people like me who started programming in C, data structures are extremely difficult, yet valuable. The thing is, high-level languages already come with various data structures implemented, making their use simple and painless.
Python is no exception to this rule, but I don’t agree with the idea that this knowledge is disposable. The speaker quoted a phrase from Tim Peters, one of the greatest Python gurus: "We read Knuth so you don’t have to." Knuth is a renowned professor at Stanford University and the author of The Art of Computer Programming, a standard reference in the field of computer science.
Python by default implements various data structures: lists, tuples, dictionaries, sets, frozensets, arrays, bytearrays, and others that are less used or more recent (like OrderedDict). This talk attempts to show how and where to use some of these types (lists, tuples, sets and frozensets):
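As a quick refresher, these built-in types can be created directly with literals or constructors. A small illustrative sketch (my own, not from the talk):

```python
langs = ["C", "Python", "Python"]   # list: ordered, mutable, allows duplicates
point = (2, 5)                      # tuple: ordered, immutable, hashable
unique = set(langs)                 # set: unordered, no duplicates, fast membership tests
frozen = frozenset(unique)          # frozenset: immutable set, so it can be a dict key
lookup = {frozen: "immutable sets can be used as keys"}
```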
Numbers, numbers, numbers…
In my humble opinion, the lecture should have had more numbers and statistics. For example, during the comparison between lists and sets, he cites the following function:
def remove_dups(seq):
    seen = set()
    items = []
    for item in seq:
        if item not in seen:
            seen.add(item)
            items.append(item)
    return items
He also mentions that if you replace the seen set with a list, the result would be the same; however, the performance would be worse. How to prove it? Simple: just write the second function he suggests:
def bad_remove_dups(seq):
    seen = []
    items = []
    for item in seq:
        if item not in seen:
            seen.append(item)
            items.append(item)
    return items
In addition, we need two modules (random and time), a benchmark function, and a set of test values, as follows:
from random import random
from time import time

def bench(func):
    results = []
    for i in range(10):
        ini = time()
        func(data)
        results.append(time() - ini)
    print "Mean time:", sum(results)/10

data = [int(random() * 50) for i in range(900000)]
After that, simply call the benchmark function and pass the function to be tested, as below:
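Putting it all together in a self-contained Python 3 sketch (the article’s original snippets target Python 2, so the print syntax differs; I also pass data as a parameter to bench to avoid the global), the two calls look like this:

```python
from random import random
from time import time

def remove_dups(seq):
    # Set-based version: membership test is O(1) on average.
    seen = set()
    items = []
    for item in seq:
        if item not in seen:
            seen.add(item)
            items.append(item)
    return items

def bad_remove_dups(seq):
    # List-based version: membership test scans the whole list, O(n).
    seen = []
    items = []
    for item in seq:
        if item not in seen:
            seen.append(item)
            items.append(item)
    return items

def bench(func, data, runs=10):
    # Average wall-clock time over `runs` executions.
    results = []
    for _ in range(runs):
        start = time()
        func(data)
        results.append(time() - start)
    return sum(results) / runs

data = [int(random() * 50) for _ in range(900000)]
print("Mean time:", bench(remove_dups, data))
print("Mean time:", bench(bad_remove_dups, data))
```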
Although we are computing an average over ten runs, there is still a slight variation between executions, but the results are always very close. Below is a sample output:
Mean time: 0.0723000049591
Mean time: 0.580800008774
As we can see, the remove_dups function is slightly over eight times faster than the bad_remove_dups function.
Do it Yourself
For me, a very important point in this talk was the use of the collections module’s abstract classes. I’ve already made the mistake of extending some of the standard Python types. Never again… Never!
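To make the point concrete, here is an illustrative sketch of my own (the LowerDict class is hypothetical, not from the talk). In modern Python these abstract classes live in collections.abc. Subclassing MutableMapping instead of dict guarantees that every access path, including update() and the constructor, goes through our own __setitem__; subclassing dict directly does not give that guarantee, which is one reason extending built-in types bites you:

```python
from collections.abc import MutableMapping

class LowerDict(MutableMapping):
    """A mapping that normalizes all keys to lowercase."""

    def __init__(self, *args, **kwargs):
        self._data = {}
        self.update(*args, **kwargs)  # the mixin update() calls our __setitem__

    def __setitem__(self, key, value):
        self._data[key.lower()] = value

    def __getitem__(self, key):
        return self._data[key.lower()]

    def __delitem__(self, key):
        del self._data[key.lower()]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

d = LowerDict({"Name": "Alex"})  # constructor path is normalized too
d["CITY"] = "Montreal"
```

Implement only the five abstract methods and MutableMapping fills in get(), pop(), setdefault(), keys(), items() and friends for free, all routed through your implementations.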