Google Summer of Code: June 2015

Monday, June 29, 2015

Week 5

In our last meeting, my mentors and I talked about breaking up this summer's work into multiple major pull requests, as opposed to last year's enormous pull request which was merged toward the very end of the summer. It'll be nice to do this in pieces just to make sure everything gets into the master branch of Astropy as intended, so we're planning on getting a PR in very soon (we discussed a deadline of 1-2 weeks past last Wednesday's meeting). The idea is to have working code that handles changes to Table indices when the Table itself is modified, and after this PR we can focus more on speeding up the index system and adding more functionality.

With that in mind, I mostly spent this week working on previous #TODO's, fixing bugs, and generally getting ready for a PR. Having previously ignored some of the subtleties of Table and Column copying, I found it pretty irritating to ensure that indices are preserved/copied/deep copied as appropriate when doing things like constructing one Table from another, slicing a Table by rows, etc. -- mostly because there are some intricacies involved in subclassing `numpy.ndarray` that I wasn't aware of before running across them. Also, while I managed to get this working correctly, there might end up being relevant time bottlenecks we need to take into consideration.

I also moved the relevant tests for Table indices to a new file `test_index.py` (adding some new tests), and fixed a couple issues including a bug with `Table.remove_rows` when the argument passed is a slice object. For the actual indexing engine, I found a library called bintrees which provides C-based binary tree and red-black tree classes, so for now I'm using this as the default engine (with the optional bintrees dependency, and falling back on my pure-Python classes if the dependency isn't present). I'm looking forward to figuring out the plan for a PR at this Wednesday's meeting, and from there moving on to optimization and increasing functionality.

Monday, June 22, 2015

Week 4

At this point, the infrastructure for indexes is pretty much in place, and the goal now is to work on getting as much speed as possible out of each index. This week, I've begun trying a couple things to improve the performance of the basic binary-tree index implementation, at this stage mostly just algorithmic. As a first step I implemented a red-black tree (in Python for now, though I hope to cut it down to C level); its advantage is that all major operations (add, remove, find) are in O(log n) as guaranteed by the relative balancing of the tree height. The self-balancing part involves keeping to a standard set of rules about the tree, whose nodes have a "color" of either red or black: the root must be black, every red node must have two black children, leaf (null) nodes are black, and any path from a node to its descendants must have the same number of black components. Currently, the performance of the red-black tree in an index overshadows the vanilla binary search tree, but is still not as good as I'd like.

Per my mentors' suggestion, I also implemented a sorted array implementation that can find elements in O(log n), while adding elements has a worst-case running time of O(n). While this makes sorted arrays worse when there are many modifications, the common use-case of an index used on an unmodified Table seems to justify their use. After first initializing the sorted array using a fast (probably in C) numpy sort, modifications to the array occur in-place (presumably the underlying Python list shifts elements on one side) and the find() method uses a binary search. Currently, this implementation outperforms the non-indexed version by about 10% for group_by, and I'm looking to improve on its performance for where(), which is probably the most time-critical function.

Sunday, June 14, 2015

Week 3

I spent most of this week integrating the new indexing system into the existing system of table operations, with the goal of making sure that everything works correctly before honing in on performance improvements. (I'm keeping thoughts about improvements in the back of my mind though, e.g. changing the indexing data structure to something like a red-black tree, rewriting code in Cython/C, etc.) So far the results have come along reasonably well, so that the indexing system is used in `group_by()` and a new function called `where()`, which is under construction. The goal of the `where()` function is to mimic the SQL "where" query, so that one can write something like this for a table full of names:

results = table.where('first_name = {0} and last_name = {0}', 'Michael', 'Mueller')

This will ultimately be sort of mini-language, so I'll have to worry about parsing. For now it only deals with a couple of the easiest cases.

I've also modified the Index structure to allow for composite indices, in which an index keeps track of keys from more than one column. The idea is to store the node keys in the underlying implementation (currently a BST) as a tuple containing values from each column, in order; for example, a key for an index on the columns "first_name" and "last_name" might be ("Michael", "Mueller"). Clearly the order of the columns is important, because ("A", "B") < ("B", "A"), so `where()` queries involving multiple columns will have to use a leftmost prefix of the column set for a composite index in order to be usable. For example, if an index is on ("first_name", "last_name"), then querying for "first_name" should be allowed but querying only for "last_name" should not be allowed, as otherwise the advantage of an index is no longer relevant.

Going into next week, I plan on continuing the current integration of Index into the Table system, more specifically on the `where()` functionality. Luckily the basic infrastructure works now, though figuring out a couple errors with pickling/memory views in the Table and Column classes were a bit annoying this week and the week prior. As of now, indices are stored in their container Columns and, conversely, each index contains a list of references (in order) to its column(s).

Sunday, June 7, 2015

Week 2

This week, I worked on the backbone of the indexing system I'll be working on--I have an Index class that will be compatible with a number of data structure implementations, and for now a binary search tree implementation. See https://github.com/mdmueller/astropy/blob/table-indexing/astropy/table/index.py and the implementation, https://github.com/mdmueller/astropy/blob/table-indexing/astropy/table/bst.py. I also have tests to make sure that the BST structure works as expected. (Nodes of the BST are of the form (column value, row number) for now, although this will probably change to something like ((column 1 value, ..., column n value), [row 1, ..., row m]) later to deal with composite indices and repeated column values across multiple rows.)

Next week I'll be integrating the new Index class into the existing functionality of the Table class, as well as implementing composite indices (indices on multiple columns) as well as a new `where()` function to find rows by column values.