Nick Bastin

Nick Bastin header image 2

Why collections.deque is my new favorite data structure

July 15th, 2009 · No Comments

In the process of unrolling some recursive algorithms to be iterative, I’ve started using collections.deque, and I’ve decided it’s my new favorite general-purpose python data structure. This wasn’t a hard decision – lists are the general utility collection type and they suck, so really anything ordered and mutable was probably going to be better (I almost never need the rotational capabilities of deques, so a somewhat lighter structure would actually be better, like a doubly linked list).

Here’s the fundamental problem: python lists aren’t lists by any standard computer science definition of the word – they’re arrays. I think this should be a surprise to people who don’t know anything about Python. There are at least a few core python developers who have tried to convince me that making the “list” data structure be an array “is just easier for newbies” (along with some rather condescending comments about how it’s obvious that it’s an array since it supports direct indexing). This is complete horseshit. If someone is new to programming, they really don’t care if you call it an array or a vector or a list – they don’t understand what any of those things are. In fact, calling it a list calls up connotations from real life that are not generally accurate, so best not to confuse them anyhow, in my opinion.

You might just think this is a stupid arguments about semantics, but it has much larger implications towards when you would use a given data structure (and choosing the right data structure for a given job is something that most programmers, and certainly new ones, do very poorly – lets not design a language that encourages that practice, eh?). If a list were really a linked list, you would probably take away direct element access (list[35]) – although you might not, there are ways to make some common cases of direct element access reasonably efficient in linked lists – but you’d leave the API otherwise the same. I’m not really sure who this would really hurt (and of course you’d be free to implement vector() or array() to solve that specific problem), since the most common list use case is iteration over some or all of the objects (a cursory examination of the standard library indicates this is true, but I don’t have hard numbers). However, the list API already offers some truly inefficient members – ones which I think might have more common use (at least in my experience) than direct element access. Notably, that member is list.insert(idx, data) – it allows you to insert anywhere you’d like in the list, although in practicality one of the most common places to insert is at position zero. This is incredibly inefficient.

Of course, if you want to always insert on the top of the list then it can be argued that you should just use list.append() – which is very fast – and reverse() or pop off the bottom of list. The problem is when you want to be adding elements to both ends. List is a VERY bad data structure for double-ended operation, especially if you have a lot of items. list.insert() can cause some serious memmove() action when your lists get bigger than 10000 items (more data below). Of course, for that matter even append() can cause serious problems when your collections get to be more than a couple million elements – you can very easily fail to allocate a few megabytes due to memory fragmentation even when you have plenty of memory free, since arrays have to be contiguous blocks. So, I would call that something other than a list, because it very much behaves differently, and bring it in line with other programming texts that people might pick up. You can still make it your primary data structure if you like, but please, please, call it an array or a vector, and write about performance guarantees in the python documentation (for all data types).

So anyhow, because Python doesn’t actually have an efficient doubly-linked list collection, you’re stuck with deque. The implementation is efficient, though, so it’s plenty serviceable as a doubly-linked list if you need a double-ended data structure with efficient insertion at both ends or need to store millions of rows (deques are allocated in blocks, but those blocks can reside anywhere on the heap, making reallocation a relative non-issue).

I made a few tests that mimic my use case and I’ve put the results below. The test is the same code for both lists and deques – just the container type is changed (and list.insert(0, data) is replaced with deque.appendleft(data)). The test creates an empty structure and inserts 100000 elements into it, alternating every other element between appending and inserting on the top. Timings are from the test loop run 10 times, each test is then run 3 times.

Deque: 0.46s, 0.47s, 0.47s
List: 48.86s, 48.22s, 49.62s

I had originally put a million items into each collection, but the list test did not want to complete in anything resembling reasonable time, even alternating between append and insert (insert(0, data) on a 10000 item vector 10 times takes around a second, while using a 100000 item vector takes around 100 seconds). list.append() is slightly faster than deque.append(), but it’s only a couple % (probably due to the efficiency of having a LIST_APPEND opcode), so I continue to use deques anytime I have any kind of medium-to-large data set, even if I’m just appending and iterating.

Tags: Python · Rants · Software

0 responses so far ↓

  • There are no comments yet...Kick things off by filling out the form below.

You must log in to post a comment.