Mind Bending

So, let’s suppose that it’s just a normal raining evening and you’re, as always, coding in Python. Then sundelly in the middle fo your code you need to accumulate some strings, what would you do? Simply accumulate using the +=, wright? Yeah, me too, until today! I’ve just found out (through PythonMeme.com) about a python’s string accumulation "flaw" that made me change my way to write strings accumulation…

Python Logo

I’m talking about this post by Xah Lee where he briefly discuss about on point in the Google’s Python Style Guide. After reading I’ve decided to make some tests to prove this point of view. By the way, I was tho only one how didn’t know about this style guide?

About The "Flaw"

First, let’s quote the Google’s Python Style Guide:

Avoid using the + and += operators to accumulate a string within a loop. Since strings are immutable, this creates unnecessary temporary objects and results in quadratic rather than linear running time. Instead, add each substring to a list and ”.join the list after the loop terminates (or, write each substring to a cStringIO.StringIO buffer).

Hey, that’s not a bug, it’s a feature! Leaving the jokes aside I agree that this looks like a flaw, but I rather to define it as a corollary. That’s just a result to Guido’s decision to make Python’s strings immutable.

The Test

In Xah Lee’s post there is a code that shows a better way to accumulate strings using lists, and I strongly recommend you to go read his post before moving along.

Now that you’re back, let’s see the code that I’ve written to test this theory:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import cStringIO

small_emps = [
        ["Mary", "Jane"],
        ["Jenny", "Doe"],
        ["Alice", "Johnson"]
        ]

big_emps = small_emps[::]
for i in xrange(6):
    big_emps += big_emps

print '#'*15
print 'Employee List Sizes:'
print 'tsmall_emps size=%i'%(len(small_emps))
print 'tbig_emps size=%in'%(len(big_emps))

def accumulate_str(employees):
    emp_table = '<table>n'
    for l_name, f_name in employees:
        emp_table += 't<tr><td>%s, %s<td><tr>n' % (l_name, f_name)
    emp_table += '<table>'
    return emp_table

def accumulate_list(employees):
    items = ['<table>']
    for l_name, f_name in employees:
        items.append('t<tr><td>%s, %s<td><tr>' % (l_name, f_name))
    items.append('<table>')
    emp_table = 'n'.join(items)
    return emp_table

def accumulate_cStringIO(employees):
    emp_table = cStringIO.StringIO()
    emp_table.write('<table>n')
    for l_name, f_name in employees:
        emp_table.write('t<tr><td>%s, %s<td><tr>n'%(l_name, f_name))
    emp_table.write('<table>')
    emp_table_s = emp_table.getvalue()
    emp_table.close()
    return emp_table_s

if __name__ == '__main__':
    import timeit
    ############################
    # SMALL EMPLOYEE LIST TEST #
    ############################
    print '#'*15
    print 'Small employee_list test...'
    print "tDefault String Accumulation"
    t = timeit.Timer("accumulate_str(small_emps)",
            "from __main__ import accumulate_str, small_emps")
    print "tElapsed time: %0.5sn"%t.timeit(number=50000)

    print "tDefault List Accumulation"
    t = timeit.Timer("accumulate_list(small_emps)",
            "from __main__ import accumulate_list, small_emps")
    print "tElapsed time: %0.5sn"%t.timeit(number=50000)

    print "tCStringIO Accumulation"
    t = timeit.Timer("accumulate_cStringIO(small_emps)",
            "from __main__ import accumulate_cStringIO, small_emps")
    print "tElapsed time: %0.5snn"%t.timeit(number=50000)

    ##########################
    # BIG EMPLOYEE LIST TEST #
    ##########################
    print '#'*15
    print 'Big employee_list test...'
    print "tDefault String Accumulation"
    t = timeit.Timer("accumulate_str(big_emps)",
            "from __main__ import accumulate_str, big_emps")
    print "tElapsed time: %0.5sn"%t.timeit(number=50000)

    print "tDefault List Accumulation"
    t = timeit.Timer("accumulate_list(big_emps)",
            "from __main__ import accumulate_list, big_emps")
    print "tElapsed time: %0.5sn"%t.timeit(number=50000)

    print "tCStringIO Accumulation"
    t = timeit.Timer("accumulate_cStringIO(big_emps)",
            "from __main__ import accumulate_cStringIO, big_emps")
    print "tElapsed time: %0.5sn"%t.timeit(number=50000)

In this simple code I’ve written a small employee list, a big employee list and three functions (accumulate_str, accumulate_list, and accumulate_cStringIO). All functions behave exactly the same, differing only by the way they accumulate strings. When I ran this code I get the following result:

$ python2 string_accumulation_test.py
###############
Employee List Sizes:
    small_emps size=3
    big_emps size=192

###############
Small employee_list test...
    Default String Accumulation
    Elapsed time: 0.082

    Default List Accumulation
    Elapsed time: 0.108

    CStringIO Accumulation
    Elapsed time: 0.153

###############
Big employee_list test...
    Default String Accumulation
    Elapsed time: 4.661

    Default List Accumulation
    Elapsed time: 4.215

    CStringIO Accumulation
    Elapsed time: 4.833

This is a very interesting result that says:

  • If your a weekend programmer that don’t need to accumulate more then a dozen of strings, use += without worries;
  • But if you need to iterate and accumulate over many strings use the list approach.
  • Last but not least, don’t use the cStringIO;

Of course I may have misunderstood what the Google’s Python Style Guide says about using cStringIO, but now matter how I use it it’s results was always the worst. Don’t try to add more items to the big_emps, I’ve already tried that out and the cStringIO execution time only grew bigger.

Magnun

Magnun

Graduated in Telecommunication Engineering, but currently working with GNU/Linux infrastructure and in the spare time I'm an Open Source programmer (Python and C), a drawer and author in the Mind Bending Blog.


Comments

comments powered by Disqus