This is a "sandbox" post which I use as a brain dump for code I find interesting. You know that cool stuff that, because you might not use all the time, it's easy to forget.
For the code and the tests (always test, I'm serious), look at my sandbox repo on GitHub:
https://github.com/zom-pro/sandbox.
This post is mainly inspired by those moments when you look at code you wrote 6 months ago and want to slap yourself. Normally, you hope nobody who knows you will ever see that code either. This happens to me sometimes when I discover a cool new (maybe a bit obscure) data structure in a language.
In this post I'll show a couple of things that can be done with the deque and defaultdict classes from Python 3.4's collections library. There are other interesting objects in collections, such as OrderedDict, but they are quite straightforward. For more info, just take a look at the official collections docs.
For me, there are two main reasons to reach for slightly more advanced data structures: performance and code readability.
defaultdict
This data structure comes in handy when you have to generate missing keys in dicts. Let's look at a comparison between how I used to do it and how I do it now. These are great examples of improved readability.
sparse matrix
Let's say I want to create a sparse matrix representation using a dictionary. The key would be a (row, column) tuple, and if it doesn't have a value associated with it, it will return 0. In the past, I would've done something like this:
class MyDict():
    def __init__(self, default_value):
        # Default value to be used when a key isn't found.
        self.default_value = default_value
        self._d = dict()

    def __setitem__(self, key, value):
        return self._d.__setitem__(key, value)

    def __getitem__(self, key):
        try:
            return self._d.__getitem__(key)
        except KeyError:
            return self.default_value

    def __delitem__(self, key):
        return self._d.__delitem__(key)

my_dict = MyDict(0)
But what about using defaultdict instead?
from collections import defaultdict

def default():
    return 0

my_dict = defaultdict(default)
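By the way, int() already returns 0, so defaultdict(int) does the same job without the helper function. A quick usage sketch, with made-up values:

from collections import defaultdict

# Sparse matrix: keys are (row, column) tuples, missing cells read as 0.
matrix = defaultdict(int)
matrix[(0, 0)] = 5
matrix[(2, 1)] = 7
print(matrix[(0, 0)])  # 5
print(matrix[(1, 1)])  # 0, this key was never set

One difference from MyDict worth knowing: reading a missing key from a defaultdict actually inserts it (with the default value), while MyDict leaves the underlying dict untouched.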
class MyDict():
def __init__(self, default_value):
# default value to be used when key isn't found.
self.default_value = default_value
self._d = dict()
def __setitem__(self, key, value):
return self._d.__setitem__(key, value)
def __getitem__(self, key):
try:
return self._d.__getitem__(key)
except KeyError:
return self.default_value
def __delitem__(self, key):
return self._d.__delitem__(key)
my_dict = MyDict(0)
But what about using defaultdict
def default():
return 0
my_dict = defaultdict(default);
counting appearances
before collections...
def counting_appearance_old(sample):
    d = {}
    for key in sample:
        if key not in d.keys():
            d[key] = 0
        d[key] += 1
    return d
after collections...
def counting_appearance(sample):
    d = defaultdict(int)
    for k in sample:
        d[k] += 1
    return d
Not an enormous difference, but no doubt more straightforward.
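A quick sanity check with a made-up sample:

print(counting_appearance("abracadabra"))
# defaultdict(<class 'int'>, {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
# (key order may vary)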
a dictionary of lists, like the example in the docs?
old school...
def dict_of_lists_old(sample):
    d = {}
    for k, v in sample:
        if k not in d.keys():
            d[k] = []
        d[k].append(v)
    return d
new generation...
def dict_of_lists(sample):
    d = defaultdict(list)
    for k, v in sample:
        d[k].append(v)
    return d
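And a quick run, using the same sample data as the defaultdict example in the official docs:

pairs = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
print(dict_of_lists(pairs))
# defaultdict(<class 'list'>, {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]})
# (key order may vary)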
deque
moving average
This one is more commonly used (or at least I've seen it more often) to implement something like the Unix tail filter. However, it comes in handy when calculating moving averages and other similar "moving" calculations. Let's calculate the moving average of a list with a moving window of 3 numbers (here it's crazy important to test, take a look at my GitHub).
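For the curious, the tail idea is roughly the recipe from the official collections docs; a minimal sketch:

from collections import deque

def tail(filename, n=10):
    # A deque with maxlen=n keeps only the last n lines;
    # older lines are silently dropped from the left as new ones arrive.
    with open(filename) as f:
        return deque(f, maxlen=n)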
without collections.deque, I would have done something like this:
def old_moving_average(iterable, n=3):
    len_iterable = len(iterable)
    for e, _ in enumerate(iterable):
        # Necessary to stop before reaching the end.
        if e <= len_iterable - n:
            yield sum(iterable[e:n + e]) / n
with deque:
import itertools
from collections import deque

def moving_average(iterable, n=3):
    it = iter(iterable)
    # Pre-load the deque with the first n - 1 elements, plus a leading 0.
    d = deque(itertools.islice(it, n - 1))
    d.appendleft(0)
    s = sum(d)
    for elem in it:
        # Keep a running sum: add the newest element, drop the oldest.
        s += elem - d.popleft()
        d.append(elem)
        yield s / n
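A quick check that both versions agree (sample data borrowed from the moving_average recipe in the official docs):

data = [40, 30, 50, 46, 39, 44]
print(list(old_moving_average(data)))  # [40.0, 42.0, 45.0, 43.0]
print(list(moving_average(data)))      # [40.0, 42.0, 45.0, 43.0]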
Well, readability hasn't improved as far as I can see. It actually involves a bunch of stuff that wasn't there before! So why on earth would I use it?
Well, I've done this type of calculation with finance-related stuff, and sample sizes normally explode quite quickly. So let's look at performance and you will see why. The old version recomputes sum() over a fresh list slice at every step, while the deque version keeps a running sum and does just one addition and one subtraction per element. Running timeit with number=1000 and increasingly larger lists, the results (times in seconds) are below; a minimal sketch of the timing harness comes after the numbers:
Length deque: 100 time: 0.20786042507368704
Length normal: 100 time: 0.4498787968021777
Length deque: 1000 time: 2.0155418775699676
Length normal: 1000 time: 4.767286307571519
Length deque: 10000 time: 20.771743648427435
Length normal: 10000 time: 28.09181877427863
Length deque: 100000 time: 53.21926311771493
Length normal: 100000 time: 140.31067571029496
yup...
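The harness I actually used is in my sandbox repo; what follows is just a minimal sketch of how numbers like these could be reproduced (the list sizes and number=1000 come from above, the rest is my assumption):

import timeit

for length in (100, 1000, 10000, 100000):
    setup = ("from __main__ import moving_average, old_moving_average; "
             "data = list(range(%d))" % length)
    t_deque = timeit.timeit("list(moving_average(data))", setup=setup, number=1000)
    t_old = timeit.timeit("list(old_moving_average(data))", setup=setup, number=1000)
    print("Length deque:", length, "time:", t_deque)
    print("Length normal:", length, "time:", t_old)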
As always, I'm aware things can be done differently and results will vary. But as I said, this is just a brain dump; hopefully you saw something new here.