Luke Lee

Software Engineer

Web + Desktop + Science

Inspecting Python history

One of the great things about a project like Python being open source is that anyone can go look at the code and find a 'paper' trail detailing the reasons for almost every line of code. I personally don't take advantage of this enough.

Recently I stumbled across a great opportunity to dive into the source. The source is a great way to track down the how and why of any implementation detail or behavior.

Discovery phase

First, I spent some time thinking about how iterators and count work. I came up with the following little snippet:

https://gist.github.com/durden/4158116

This led me to think about the real implementations of itertools.count and collections.Counter.

I was quickly side-tracked by the following snippet in the collections module:

    def subtract(self, iterable=None, **kwds):
        if iterable is not None:
            self_get = self.get
            if isinstance(iterable, Mapping):
                for elem, count in iterable.items():
                    self[elem] = self_get(elem, 0) - count
            else:
                for elem in iterable:
                    self[elem] = self_get(elem, 0) - 1
        if kwds:
            self.subtract(kwds)
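
For context, here is a short usage sketch of what this method does from the caller's side. The sample counts are made up for illustration. The behavior shown (missing keys start from 0, and counts may drop to zero or below) follows from the code above.

```python
from collections import Counter

stock = Counter(apples=4, pears=2)

# Mapping argument: subtract the given count for each key.
stock.subtract({"apples": 1, "plums": 3})

# Plain iterable argument: subtract 1 for each element seen.
stock.subtract(["pears", "pears"])

# Keyword arguments are passed back through subtract() as a mapping.
stock.subtract(apples=1)

print(stock)
```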

The most interesting line is self_get = self.get. What is the point of this?

I assume this is performance related because attribute lookups can be quite expensive. So, avoiding the repeated lookup of self.get in a tight loop could be beneficial, but how can I fact-check my assumption?
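
One quick way to probe the assumption before digging into the history is to time both styles. The following is a minimal sketch, not a rigorous benchmark: the function names and sample data are hypothetical, and it uses a plain dict rather than Counter to isolate the lookup itself. Results depend on the interpreter version, and on recent CPython releases the gap may be small.

```python
import timeit


def tally_lookup_each_time(items):
    counts = {}
    for item in items:
        # counts.get is looked up again on every iteration.
        counts[item] = counts.get(item, 0) + 1
    return counts


def tally_cached_method(items):
    counts = {}
    counts_get = counts.get  # look up the bound method once, outside the loop
    for item in items:
        counts[item] = counts_get(item, 0) + 1
    return counts


if __name__ == "__main__":
    data = list(range(1000)) * 10
    for func in (tally_lookup_each_time, tally_cached_method):
        seconds = timeit.timeit(lambda: func(data), number=20)
        print(f"{func.__name__}: {seconds:.3f}s")
```

Both functions return the same tallies, so any difference in the printed timings comes from where the method lookup happens.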

Use the history

To follow along and look through the history you need to have the Hg repository locally:

  1. Get source:
    • hg clone http://hg.python.org/cpython
  2. Update source to version we want to inspect:
    • hg update 3.3

Now find what revision the suspicious lines were last touched in:

hg annotate -u -n Lib/collections/__init__.py

The above command is the Hg equivalent of the 'blame' concept from Git and Subversion. This command is essential when you need to know the author of each line and the revision in which it was last updated.

After running the hg annotate command above, you can see that the first of these lines was last updated in revision 54985. So, what information did the author leave behind when this revision was committed?

hg log -p -r 54985

This gives us the commit message and changes introduced by this revision:

summary:     Issue 6370: Performance issue with collections.Counter()

The key element here is Issue 6370. This refers to an issue in the official Python bug tracking system. From here it's easy: just search for that issue number on the site. You will find an explanation of the 'bug' and the change associated with it.
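
If you repeat this kind of digging often, the annotate and log steps above can be wrapped in a few lines of Python. This is only a sketch: the helper names are hypothetical, the flags are the ones used in the commands above, and it assumes hg is installed and that you pass the path of your local clone.

```python
import subprocess


def annotate_command(path):
    # Same flags as above: -u shows the author, -n the revision number.
    return ["hg", "annotate", "-u", "-n", path]


def log_command(revision):
    # -p includes the patch, -r selects the revision.
    return ["hg", "log", "-p", "-r", str(revision)]


def run_in_repo(command, repo_dir):
    # Assumes hg is on PATH and repo_dir is a local clone of the repository.
    result = subprocess.run(
        command, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, run_in_repo(log_command(54985), "cpython") would return the same text as the hg log command above, assuming the clone lives in a directory named cpython.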

Mystery solved

It turns out it was performance related after all: replacing self[elem] with something like self.get(elem, 0) is faster.

However, calling get() is not ideal either because calling Python functions is expensive.
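
To see the two forms side by side, here is a small sketch. It assumes the slow case in question is a key that is not yet in the Counter, where indexing goes through Counter's __missing__ hook. The key name and repetition count are arbitrary, and the timings will vary by machine and Python version.

```python
import timeit
from collections import Counter

empty = Counter()

# Both forms yield 0 for a key that is not present, and neither inserts it.
via_index = empty["missing"]
via_get = empty.get("missing", 0)
print(via_index, via_get, len(empty))

# Rough timing of the two forms on a missing key.
for label, stmt in (("index", 'c["missing"]'), ("get", 'c.get("missing", 0)')):
    seconds = timeit.timeit(stmt, globals={"c": empty}, number=200_000)
    print(f"{label}: {seconds:.3f}s")
```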

The moral of the story? When in doubt, go to the source and search the history. The majority of the time the documentation trail is very useful and accessible with just a little bit of work.

Published: 12-08-2012 16:27:15

lukelee.net