Many times in our day-to-day as programmers, we face the need to analyze, search, and extract values from a string. There are two ways to proceed: write the string analysis manually, or use regular expressions, also known as RegEx.
Regular expressions are chains of characters with their own meaning, generally used to search for patterns in text. They have many applications, but are typically used to extract information from a text or to ensure that a group of predefined characters is present in it. My goal in this article is not to explain regular expressions, but to show their usefulness. So here we go…
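Just as a quick taste (a minimal sketch of mine, not part of the example we will build), here is a regular expression pulling every run of digits out of a sentence:

import re

# Find every run of digits in a sentence.
print re.findall(r'\d+', 'Order 66 shipped 3 items in 2 boxes')
# ['66', '3', '2']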
A Useful Example
Regular expressions are extremely powerful tools and can range from a simple sequence to an unintelligible tangle of characters, so use them sparingly. Here I'll show, with a few snippets of Python code, a simple example of how to use regular expressions.
In our example we have a string called data which contains an HTML snippet. Our task is to analyze the string and return only the links that exist in it. First, let's see how to do this in pure Python:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# File: link_search.py
# License: GPLv3

data = '''
<title>Mind Bending Blog</title>
<h1>Welcome!</h1>
Welcome visitor! Access <a href="/en/projects/tiamat/">my blog</a>.
Visit also <a href="http://news.codecommunity.org">our news portal</a>
'''.split('\n')

def links_search(data):
    ret = []
    for line in data:
        if 'href' not in line:
            continue
        # The href value is the first double-quoted string on the line.
        ret.append(line.split('"')[1])
    return ret

if __name__ == '__main__':
    print links_search(data)
After running the code above we obtain the following output:
$ python link_search.py
['/en/projects/tiamat/', 'http://news.codecommunity.org']
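The split trick works because, in our snippet, the URL is always the first double-quoted value on any line containing href. A quick illustration (just one line from the data, shown interactively):

line = 'Welcome visitor! Access <a href="/en/projects/tiamat/">my blog</a>.'
print line.split('"')
# ['Welcome visitor! Access <a href=', '/en/projects/tiamat/', '>my blog</a>.']
print line.split('"')[1]
# /en/projects/tiamat/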
How could we do this with RegEx? Simple:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# File: link_search_re.py
# License: GPLv3

import re

data = '''
<title>Mind Bending Blog</title>
<h1>Welcome!</h1>
Welcome visitor! Access <a href="/en/projects/tiamat/">my blog</a>.
Visit also <a href="http://news.codecommunity.org">our news portal</a>
'''.split('\n')

# Capture everything between the quotes that follow href=
link_re = re.compile(r'href="(.*?)"')

def links_search_re(data):
    ret = []
    for line in data:
        ret += link_re.findall(line)
    return ret

if __name__ == '__main__':
    print links_search_re(data)
After running the code above we obtain the following output:
$ python link_search_re.py
['/en/projects/tiamat/', 'http://news.codecommunity.org']
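One detail worth pointing out (an aside of mine, not spelled out above): the (.*?) in the pattern is non-greedy, so it stops at the first closing quote. A greedy (.*) would overshoot whenever a line carries more than one quoted value:

import re

line = '<a href="/en/projects/tiamat/" title="blog">'
print re.findall(r'href="(.*?)"', line)  # non-greedy: ['/en/projects/tiamat/']
print re.findall(r'href="(.*)"', line)   # greedy: ['/en/projects/tiamat/" title="blog']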
Apart from the difficulty of understanding the meaning of the regular expression, the code is much more compact. Now I wonder: is it worth replacing every use of split and find in our Python programs? The answer is simple: no.
Why Not?!
Some routines (like the one shown above) are so simple that using regular expressions carries a very high cost. To prove this, let's run a test measuring the time elapsed over 1,000,000 executions of links_search and links_search_re:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# File: link_timer.py
# License: GPLv3

from time import time

import link_search
import link_search_re

data = link_search.data
links_search = link_search.links_search
links_search_re = link_search_re.links_search_re

def timing(func, args, count):
    i = time()
    for n in range(count):
        func(args)
    print 'Elapsed time with ' + func.func_name + ':', time() - i

n = 1000000
timing(links_search, data, n)
timing(links_search_re, data, n)
When we run this code we get the following output:
$ python link_timer.py
Elapsed time with links_search: 3.89300012589
Elapsed time with links_search_re: 11.4989998341
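As a side note (my suggestion, not part of the original scripts), the standard library's timeit module does this kind of measurement with less ceremony; it accepts a callable and runs it a given number of times:

import timeit

import link_search

# Run links_search 1,000,000 times and report the total elapsed time.
elapsed = timeit.timeit(lambda: link_search.links_search(link_search.data),
                        number=1000000)
print 'Elapsed time with links_search:', elapsed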
We can see clearly that for a simple analysis like this one, split alone is enough :D. But if we feed the regular expression a data type that favors it, namely the whole text as a single string instead of a list of lines, we can improve this time. We just need to modify the links_search_re function in the link_search_re file as follows:
def links_search_re(data):
    ret = []
    # finditer scans the whole string in a single pass.
    for match in link_re.finditer(data):
        ret.append(match.group(1))
    return ret
And alter the link_timer file as follows:
# Join the lines back into a single string for the regex version.
data2 = ''.join(data)
timing(links_search_re, data2, n)
After rerunning the tests, we got the following elapsed times:
$ python link_timer.py
Elapsed time with links_search: 4.00200009346
Elapsed time with links_search_re: 6.77699995041
That's a good response time!
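Incidentally (an extra observation of mine), when a pattern contains exactly one capturing group, findall already returns the captured values, so the whole function collapses to a one-liner:

def links_search_re(data):
    # With exactly one group in the pattern, findall returns the group values.
    return link_re.findall(data)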
Just this?!
That's it for now! But stay tuned; soon I will show that it is possible to get less code and more performance using regular expressions.
Until then…