Tuesday, August 18, 2015

Week 12

Since GSoC is finally wrapping up, I spent this week mostly reading over the code in the PR and writing documentation. I introduced a new section on table indexing in the docs, after the "Table operations" section, which should give a good introduction to the indexing functionality. It also links to an IPython notebook I wrote (http://nbviewer.ipython.org/github/mdmueller/astropy-notebooks/blob/master/table/indexing-profiling.ipynb) that presents profiling results for indexing across different scenarios, e.g. different engines and regular columns vs. mixins. I also ran the asv benchmarking tool on features relevant to indexing, and fixed an issue in which Table sorting was slowed down when sorting on a primary index.

There's not much else to describe in terms of final changes, although I do worry about areas where index copying or relabeling comes up unexpectedly and hurts performance. As an example, using the `loc` attribute on a Table is very slow with an indexing engine like FastRBT (which is slow to copy), since the returned rows are retrieved via a slice that relabels indices. This is necessary if the user wants indices in the returned slice, but I doubt that's usually a real issue. The two alternatives here are either to have `loc` return something else (like a non-indexed slice) or to simply advise in the documentation that the 'discard_on_copy' index mode is appropriate in such a scenario.
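
As a hedged sketch of that documentation advice (using the `index_mode` context manager described in the Week 9 post; the behavior noted in the comment is the intent, not a tested guarantee):
```
from astropy.table import table_helpers

t = table_helpers.simple_table(10)
t.add_index('a')

# While this mode is active, the slice that loc creates internally carries
# no indices, so there's no expensive FastRBT copy/relabel step.
with t.index_mode('discard_on_copy'):
    rows = t.loc[4:9]
```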

Monday, August 10, 2015

Week 11

This week I implemented a nice bit of functionality that Tom suggested, inspired by a similar feature in Pandas: retrieving rows through the Table attributes `loc` and `iloc`. The idea is to provide a mechanism for row retrieval that sits between a high-level `query()` method and dealing with Index objects directly. Here's an example:
```
In [2]: t = simple_table(10)

In [3]: print t
 a   b    c
--- ---- ---
  1  1.0   c
  2  2.0   d
  3  3.0   e
  4  4.0   f
  5  5.0   g
  6  6.0   h
  7  7.0   i
  8  8.0   j
  9  9.0   k
 10 10.0   l

In [4]: t.add_index('a')

In [5]: t.add_index('b')

In [6]: t.loc[4:9] # 'a' is the implicit primary key
Out[6]:
<Table length=6>
  a      b     c
int32 float32 str1
----- ------- ----
    4     4.0    f
    5     5.0    g
    6     6.0    h
    7     7.0    i
    8     8.0    j
    9     9.0    k

In [7]: t.loc['b', 1.5:7.0]
Out[7]:
<Table length=6>
  a      b     c
int32 float32 str1
----- ------- ----
    2     2.0    d
    3     3.0    e
    4     4.0    f
    5     5.0    g
    6     6.0    h
    7     7.0    i

In [8]: t.iloc[2:4]
Out[8]:
<Table length=2>
  a      b     c
int32 float32 str1
----- ------- ----
    3     3.0    e
    4     4.0    f
```
The `loc` attribute is used for retrieval by column value, while `iloc` is used for retrieval by position in the sorted order of an index. This involves the designation of a primary key, which for now is just the first index added to the table. Also, indices can now be retrieved by column name(s):
```
In [9]: t.indices['b']
Out[9]:
 b   rows
---- ----
 1.0    0
 2.0    1
 3.0    2
 4.0    3
 5.0    4
 6.0    5
 7.0    6
 8.0    7
 9.0    8
10.0    9
```
Aside from this, I've been making miscellaneous changes to the PR, such as getting `np.lexsort` to work with Time objects, reworking the `SortedArray` class to use a `Table` object instead of a list of ndarrays (for compatibility with mixins), putting `index_mode` in `Table`, etc. Tom noted some performance issues when working with indices, which I've been working on as well.
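
As a quick refresher on the `np.lexsort` semantics involved (plain numpy, nothing PR-specific), the *last* key passed is the primary sort key:
```
import numpy as np

a = np.array([2, 1, 2])
b = np.array([0, 5, -1])
order = np.lexsort((b, a))  # sort by a, breaking ties with b
print(order)                # [1 2 0]
```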

Tuesday, August 4, 2015

Week 10

This week wasn't terribly eventful; I spent time documenting code, expanding tests, etc. for the pull request. Docstrings are now in numpydoc format, and I fixed a few bugs including one that Tom noticed when taking a slice of a slice:
```
from astropy import table
from astropy.table import table_helpers

t = table_helpers.simple_table(10)
t.add_index('a')
t2 = t[1:]
t3 = t2[1:]

print(t3.indices[0])
```
The former output was "Index slice (2, 10, 2) of [[ 1 2 3 4 5 6 7 8 9 10], [0 1 2 3 4 5 6 7 8 9]]", while now the step size is 1, as it should be. The SlicedIndex system seems to be working fine otherwise, apart from a Python 3 bug I found involving the new behavior of the / operator (it always returns a float), which is fixed now.
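
For reference, the pitfall is core Python 3 behavior:
```
print(5 / 2)   # 2.5 in Python 3 (2 in Python 2): / is always true division
print(5 // 2)  # 2 -- floor division, the right tool for index arithmetic
```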

Another new change is to the `index_mode` context manager--the "copy_on_getitem" mode now properly affects only the supplied table rather than tampering with `BaseColumn` directly. Michael's workaround is to change the `__class__` attribute of each relevant column to a subclass (either `_GetitemColumn` or `_GetitemMaskedColumn`) with the correct `__getitem__` method, which should rule out the unlikely side effects of the old approach. Aside from this, I've also been looking into improving the performance of the engines other than `SortedArray`. The main issue I see is that there's a lot of Python object creation in engine initialization, which unfortunately seems unavoidable given the constraints of the bintrees library. The success of `SortedArray` really lies in the fact that it deals with numpy arrays, so I'm looking into creating an ndarray-based binary search tree.
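
A minimal sketch of that workaround (`_GetitemColumn` is named above; its body and the `get_item` call follow the PR as described here, not released astropy):
```
from astropy.table import Column

class _GetitemColumn(Column):
    def __getitem__(self, item):
        # index-aware access: copies indices when slicing
        return self.get_item(item)

col = Column([1, 2, 3], name='a')
col.__class__ = _GetitemColumn   # entering 'copy_on_getitem' mode
# ...slices of col now go through get_item()...
col.__class__ = Column           # restored when the context exits
```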

Tuesday, July 28, 2015

Week 9

This week I spent quite a bit of time on mixin column support for indices, where appropriate. After first moving the indices themselves from a Column attribute to a DataInfo attribute (accessed as col.info.indices), I moved most of the indexing code for dealing with column access/modifications to BaseColumnInfo for mixin use in methods like __getitem__ and __setitem__. Since each mixin class has to include proper calls to indexing code, mixins should set a boolean value _supports_indices to True in their info classes (e.g. QuantityInfo). As of now, Quantity and Time support indices, while SkyCoord does not since there is no natural order on coordinate values. I've updated the indexing testing suite to deal with the new mixins.
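
A hedged sketch of that opt-in convention (the info class here is hypothetical; only the `_supports_indices` flag is from the PR):
```
from astropy.utils.data_info import MixinInfo

class MyMixinInfo(MixinInfo):
    _supports_indices = True  # this mixin's values have a natural order
```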

Aside from mixins (and general PR improvements like bug fixes), I implemented my mentors' suggestion to turn the previous static_indices context manager into a context manager called index_mode, which takes an argument indicating one of three modes to set for the index engine. These modes are currently:

  • 'freeze', indicating that table indices should not be updated upon modification, such as updating values or adding rows. After 'freeze' mode is lifted, each index updates itself based on column values. This mode should come in useful if users intend to perform a large number of column updates at a time.
  • 'discard_on_copy', indicating that indices should not be copied upon creation of a new column (for example, due to calls like "table[2:5]" or "Table(table)").
  • 'copy_on_getitem', indicating that indices should be copied when columns are sliced directly. This mode is motivated by the fact that BaseColumn does not override the __getitem__ method of its parent class (numpy.ndarray) for performance reasons, so the method BaseColumn.get_item(item) must normally be used to copy indices upon slicing. When in 'copy_on_getitem' mode, BaseColumn.__getitem__ will copy indices, at the cost of a fairly large performance hit. One issue I ran into while implementing this mode is that, for special methods like __getitem__, new-style Python classes call the type's method rather than the instance's method; that is, "col[[1, 3]]" corresponds to "type(col).__getitem__(col, [1, 3])" rather than "col.__getitem__([1, 3])" (see the demonstration after this list). I got around this by adjusting the actual __getitem__ method of BaseColumn in this context (and only for the duration of the context), but this has the side effect that all columns have changed behavior, not just the columns of the table supplied to index_mode. I'll have to ask my mentors whether they see this as much of an issue, because as far as I can tell there's no other solution.
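
Here's the minimal demonstration of that lookup behavior (plain Python):
```
class A(object):
    def __getitem__(self, item):
        return 'type method'

a = A()
a.__getitem__ = lambda item: 'instance method'
print(a[0])              # 'type method' -- implicit lookup skips the instance
print(a.__getitem__(0))  # 'instance method' -- explicit lookup finds it
```
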
At this point I see the PR as pretty much done, although I'll spend more time writing documentation (and making docstrings conform to the numpy docstring standard).

Monday, July 20, 2015

Week 8

This past week I've been adding some functionality to the index system while the current PR is being reviewed: taking slices of slices, including non-copy and non-modify modes as context managers, etc. One issue my mentors and I discussed in our last meeting is the fact that `Column.__getitem__` becomes incredibly slow if overridden to check for slices and so forth, so we have to do without it (as part of a larger rebase on astropy/master). Our decision was to drop index propagation upon column slicing and only propagate indices on Table slices; though this behavior is potentially confusing, it will be documented and shouldn't be a big deal. For convenience, a separate method `get_item` in `Column` has the same functionality as the previous `Column.__getitem__` and can be used instead.
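
A hedged sketch of the resulting behavior, assuming the PR's `add_index` and `get_item`:
```
from astropy.table import table_helpers

t = table_helpers.simple_table(10)
t.add_index('a')

t2 = t[2:7]                          # Table slice: indices propagate
col = t['a'][2:7]                    # column slice: plain ndarray view, no indices
col2 = t['a'].get_item(slice(2, 7))  # index-aware slice via get_item
```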

I have a lot more to write, but I need to be up early tomorrow morning so I'll finish this post later.

Monday, July 13, 2015

Week 7

The existing PR is passing Travis tests but hasn't been reviewed yet, so in the meantime I've been working on some performance issues. One huge problem I discovered last week was that `group_by` created a number of Table slices, each of which incurred an expensive deep copy of column indices and contributed to a running time of several seconds. To circumvent the problem, I created a context manager `static_indices`, which can be used like so:
```
from astropy.table.index import static_indices  # import path assumed

with static_indices(table):
    subtable = table[[1, 3, 4]]
```
In this case `subtable` will have no indices. However, the main issue was with column slicing, which should produce a reference rather than a (deep) copy, in keeping with the behavior of `numpy.ndarray` (and thus of the internal `Column` data). It's lucky this came up, because I previously had no tests checking that modifying a slice affects both the internal data and the indices of the original Table or Column.
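
For reference, this mirrors plain `numpy.ndarray` view semantics:
```
import numpy as np

a = np.arange(5)
s = a[1:4]     # a view into a, not a copy
s[0] = 99
print(a)       # [ 0 99  2  3  4] -- the original sees the change
```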

Aside from this, I've been working on cutting down the time required to initialize an index; the pure-Python loop I previously used was woefully inefficient in both memory and time. I haven't yet figured out how to get this to work with FastRBT, but SortedArray is now much faster to initialize. Here are some time profiling results on a 100,000-row table:
```
FastRBT
**********
time_init: 1.2263660431
time_group: 0.2325990200
time_where: 0.0003449917
time_query: 0.0000329018
time_query_range: 0.0643420219
time_add_row: 2.7397549152
time_modify: 0.0001499653

SortedArray
**********
time_init: 0.0355048180
time_group: 0.0041830540
time_where: 0.0000801086
time_query: 0.0000169277
time_query_range: 0.0217781067
time_add_row: 0.0200960636
time_modify: 0.0016808510

None
**********
time_init: 0.0000019073
time_group: 0.0865180492
time_where: 0.0002820492
time_query: 0.0001931190
time_query_range: 0.2128648758
time_add_row: 0.0006089211
time_modify: 0.0000159740
```
I've focused on SortedArray quite a bit, so FastRBT is still pretty slow in areas that should be easily fixable--I'll tackle those tomorrow. I have an IPython notebook with these timing results here, and a short Markdown file here.

Monday, July 6, 2015

Week 6

I began this week by fixing up a couple of loose ends that were causing test failures, and then opened a new pull request for the work I've done so far. Travis CI and AppVeyor build the PR successfully, but for some reason Coveralls reports that overall coverage has decreased. I'll have to look into that...

From there I've been working on the SortedArray engine for indexing; that is, I've turned it into an actual array (a numpy structured array) rather than the old fill-in pure-Python list. One nice thing about the new class is that it can make good use of the `numpy.searchsorted` function, which finds the index at which a given value should be inserted into a sorted array. Unfortunately `searchsorted` doesn't support lexicographic search over data sorted by more than one column (as in the case of composite indices), although I managed a workaround by applying `searchsorted` several times.
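
Here's a hedged sketch of that multi-pass idea (plain numpy, not the PR's actual code):
```
import numpy as np

def lex_searchsorted(a, b, key_a, key_b, side='left'):
    # narrow to the contiguous run where the first column equals key_a
    lo = np.searchsorted(a, key_a, side='left')
    hi = np.searchsorted(a, key_a, side='right')
    # then binary-search the second column within that run
    return lo + np.searchsorted(b[lo:hi], key_b, side=side)

a = np.array([1, 1, 2, 2, 2, 3])     # primary column (sorted)
b = np.array([5, 9, 1, 4, 7, 0])     # secondary column (sorted within runs of a)
print(lex_searchsorted(a, b, 2, 5))  # 4
```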

I also worked on adding to the testing suite, particularly for the new SortedArray engine. I added a test fixture in `test_index.py` to run each test with every possible engine, which gave me grounds for testing SortedArray and will make it worthwhile to add more indexing tests, since each one now exercises the Index class and every engine simultaneously.

In terms of memory, SortedArray seems to do fairly well; I profiled a few engines on a 100,000-row table and found memory usages of 18.6 MB for SortedArray, 20.0 MB for FastRBT, and 48.2 MB for BST (the pure-Python binary search tree). The time profiles for these engines are more complicated, and much more problematic; querying an index takes negligible time compared to creating one, and the time it takes to create an index is totally unreasonable at this stage. (I found 7.53 seconds for BST, 2.04 seconds for FastRBT, and 2.21 seconds for SortedArray.) It's also a bit difficult to pinpoint the main issue, although iterating through a Column takes up a significant chunk of time and `numpy.argsort`, oddly enough, takes a full half-second for SortedArray -- maybe there's something more subtle than I expect going on in there. I'm interested to hear whether Michael and Tom think we should copy Column data at a lower level (i.e. Cython/C), or how otherwise to get around this unexpected time bottleneck. Hopefully the current PR will get some feedback soon, as well.
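
The fixture mentioned above looks roughly like this (a sketch; the import paths and engine names follow the PR layout and are assumptions):
```
import pytest
from astropy.table import table_helpers
from astropy.table.bst import BST, FastRBT
from astropy.table.sorted_array import SortedArray

@pytest.fixture(params=[BST, FastRBT, SortedArray])
def engine(request):
    return request.param

def test_add_index(engine):
    # runs once per engine thanks to the parametrized fixture
    t = table_helpers.simple_table(10)
    t.add_index('a', engine=engine)
    assert len(t.indices) == 1
```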

Monday, June 29, 2015

Week 5

In our last meeting, my mentors and I talked about breaking up this summer's work into multiple major pull requests, as opposed to last year's enormous pull request which was merged toward the very end of the summer. It'll be nice to do this in pieces just to make sure everything gets into the master branch of Astropy as intended, so we're planning on getting a PR in very soon (we discussed a deadline of 1-2 weeks past last Wednesday's meeting). The idea is to have working code that handles changes to Table indices when the Table itself is modified, and after this PR we can focus more on speeding up the index system and adding more functionality.

With that in mind, I mostly spent this week working on previous #TODO's, fixing bugs, and generally getting ready for a PR. Having previously ignored some of the subtleties of Table and Column copying, I found it pretty irritating to ensure that indices are preserved/copied/deep-copied as appropriate when doing things like constructing one Table from another, slicing a Table by rows, etc. -- mostly because subclassing `numpy.ndarray` involves intricacies I wasn't aware of before running across them. While I managed to get this working correctly, there may turn out to be time bottlenecks we need to take into consideration.

I also moved the relevant tests for Table indices to a new file `test_index.py` (adding some new tests), and fixed a couple issues including a bug with `Table.remove_rows` when the argument passed is a slice object. For the actual indexing engine, I found a library called bintrees which provides C-based binary tree and red-black tree classes, so for now I'm using this as the default engine (with the optional bintrees dependency, and falling back on my pure-Python classes if the dependency isn't present). I'm looking forward to figuring out the plan for a PR at this Wednesday's meeting, and from there moving on to optimization and increasing functionality.
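
The optional-dependency fallback looks roughly like this (a sketch; bintrees really does provide FastRBTree, while the surrounding engine selection is an assumption):
```
try:
    from bintrees import FastRBTree  # C-based red-black tree
    HAS_BINTREES = True
except ImportError:
    FastRBTree = None
    HAS_BINTREES = False

# The FastRBT engine wraps FastRBTree when HAS_BINTREES is True;
# otherwise the pure-Python BST serves as the default engine.
```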

Monday, June 22, 2015

Week 4

At this point, the infrastructure for indices is pretty much in place, and the goal now is to squeeze as much speed as possible out of each index. This week, I've begun trying a couple of things to improve the performance of the basic binary-tree index implementation, at this stage mostly algorithmic. As a first step I implemented a red-black tree (in Python for now, though I hope to push it down to C level); its advantage is that all major operations (add, remove, find) are O(log n), guaranteed by the approximate balancing of the tree height. The self-balancing comes from maintaining a standard set of invariants on the tree, whose nodes have a "color" of either red or black: the root is black, every red node has two black children, leaf (null) nodes are black, and every path from a node down to its descendant leaves contains the same number of black nodes. Currently, the red-black tree index outperforms the vanilla binary search tree, but is still not as fast as I'd like.

Per my mentors' suggestion, I also implemented a sorted-array engine, which can find elements in O(log n) while adding elements takes O(n) in the worst case. While this makes sorted arrays worse when there are many modifications, the common use case of an index on an unmodified Table seems to justify them. After the sorted array is first initialized with a fast numpy sort (probably done in C), modifications occur in place (the underlying container shifts elements on one side) and the find() method uses a binary search. Currently, this implementation outperforms the non-indexed version by about 10% for group_by, and I'm looking to improve its performance for where(), which is probably the most time-critical function.
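
A minimal sketch of the idea using Python's `bisect` module (not the PR's actual SortedArray class):
```
import bisect

class SortedList:
    def __init__(self, values):
        self.data = sorted(values)       # one fast initial sort

    def add(self, value):
        bisect.insort(self.data, value)  # O(n): shifts elements to the right

    def find(self, value):
        i = bisect.bisect_left(self.data, value)  # O(log n) binary search
        if i < len(self.data) and self.data[i] == value:
            return i
        return -1
```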

Sunday, June 14, 2015

Week 3

I spent most of this week integrating the new indexing system into the existing system of table operations, with the goal of making sure that everything works correctly before homing in on performance improvements. (I'm keeping thoughts about improvements in the back of my mind, though, e.g. changing the indexing data structure to something like a red-black tree, rewriting code in Cython/C, etc.) So far things have come along reasonably well: the indexing system is now used in `group_by()` and in a new function called `where()`, which is under construction. The goal of `where()` is to mimic the SQL "where" query, so that one can write something like this for a table full of names:

results = table.where('first_name = {0} and last_name = {1}', 'Michael', 'Mueller')

This will ultimately be a sort of mini-language, so I'll have to worry about parsing. For now it only handles a couple of the easiest cases.

I've also modified the Index structure to allow for composite indices, in which an index keeps track of keys from more than one column. The idea is to store the node keys in the underlying implementation (currently a BST) as a tuple containing values from each column, in order; for example, a key for an index on the columns "first_name" and "last_name" might be ("Michael", "Mueller"). Clearly the order of the columns matters, because ("A", "B") < ("B", "A"), so for a composite index to be usable, `where()` queries involving multiple columns must constrain a leftmost prefix of the indexed columns. For example, if an index is on ("first_name", "last_name"), then querying on "first_name" is allowed but querying only on "last_name" is not, since the index provides no advantage there.
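
A plain-Python illustration of the leftmost-prefix point:
```
keys = [("Michael", "Mueller"), ("Ada", "Lovelace"), ("Michael", "Abbot")]
print(sorted(keys))
# [('Ada', 'Lovelace'), ('Michael', 'Abbot'), ('Michael', 'Mueller')]
# Rows sharing a first_name are contiguous in this order, so a binary
# search on first_name alone works; rows sharing a last_name are
# scattered, so last_name alone can't use the index.
```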

Going into next week, I plan to continue integrating Index into the Table system, focusing on the `where()` functionality. Luckily the basic infrastructure works now, though tracking down a couple of errors with pickling/memory views in the Table and Column classes was a bit annoying this week and the week prior. As of now, indices are stored in their container Columns and, conversely, each index contains an ordered list of references to its column(s).

Sunday, June 7, 2015

Week 2

This week, I built the backbone of the indexing system: an Index class that will be compatible with a number of data structure implementations, with a binary search tree implementation for now. See https://github.com/mdmueller/astropy/blob/table-indexing/astropy/table/index.py and the implementation, https://github.com/mdmueller/astropy/blob/table-indexing/astropy/table/bst.py. I also have tests to make sure that the BST structure works as expected. (Nodes of the BST are of the form (column value, row number) for now, although this will probably change to something like ((column 1 value, ..., column n value), [row 1, ..., row m]) later to handle composite indices and repeated column values across multiple rows.)
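
For concreteness, a minimal sketch of a BST node in the planned form (plain Python, not the actual bst.py):
```
class Node:
    def __init__(self, key, row):
        self.key = key      # column value (later: a tuple of column values)
        self.rows = [row]   # row numbers sharing this key
        self.left = self.right = None

def insert(root, key, row):
    if root is None:
        return Node(key, row)
    if key == root.key:
        root.rows.append(row)   # repeated value: extend the row list
    elif key < root.key:
        root.left = insert(root.left, key, row)
    else:
        root.right = insert(root.right, key, row)
    return root
```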

Next week I'll be integrating the new Index class into the existing functionality of the Table class, as well as implementing composite indices (indices on multiple columns) as well as a new `where()` function to find rows by column values.


Sunday, May 31, 2015

Week 1

It's the end of the first week of coding--I spent this week writing benchmarks to test current Table functionality (see repo: https://github.com/mdmueller/astropy-benchmarks-1) and creating the groundwork for an Index class that will maintain sorted order on one or more columns. The results of the ASV benchmarks (http://mdmueller.github.io/astropy-benchmarks-1/) were essentially what I expected; for example, the cost of grouping by column scales by a factor of 10 when a table is made 10 times as large, confirming that current operations are O(n). Over time, we should see this scaling improve and become O(log n). A couple other comparisons are viewable at http://i.imgur.com/uxBcFvQ.png and http://i.imgur.com/lQ49dWZ.png.
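
For reference, an asv benchmark is just a class with a `setup` method and `time_*` methods that asv times; here's a hedged sketch in that style (the names are mine, not necessarily the repo's):
```
import numpy as np
from astropy.table import Table

class TimeGroup:
    def setup(self):
        n = 100000
        self.t = Table({'a': np.random.randint(0, 100, n),
                        'b': np.random.random(n)})

    def time_group_by(self):
        # asv reports the runtime of each time_* method
        self.t.group_by('a')
```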

Apart from this, I wrote a basic skeleton of the Index class and adjusted parts of the Table class to update Table indices upon modification of the Table. My work will be done in the branch "table-indexing" on my Astropy fork (https://github.com/mdmueller/astropy/tree/table-indexing). More to come in a bit.

Monday, May 25, 2015

GSOC 2015

This is my first blog entry for the 2015 Google Summer of Code--I'm excited to take part in the program for a second year, and to continue working with Astropy in particular! I just realized that Python students were expected to make a blog post about the Community Bonding Period by May 24th, so this is just a bit late.

My project this year will involve implementing database-style indexing for Astropy's Table class, a central Astropy data structure. I had a Google Hangout meeting with my mentors (Michael Droettboom and Tom Aldcroft) last week in which we discussed possibilities for functionality to implement over the summer. Among these are:

  • Allowing for multiple indexed columns within a Table, where each "index" is a 1-to-n mapping of keys to values
  • Using indices to speed up existing Table operations, such as "join", "group_by", and "unique"
  • Implementing other Table methods like "where", "indices", and "sort"
  • Designating a special index, the "primary key", which might yield algorithmic improvements
  • Making sure values in an index can be selected by range
We also noted some open questions to tackle as the project gets further along. My work on the project begins today; so far (during the community bonding period) I've been reading the Astropy docs on current Table operations and looking at Pandas to see how the indexed Series class works under the hood. I discovered that Pandas uses a Python wrapper around a Cython/C hash table engine to implement its Series index; in fact, the same engine is used in the Python ASCII parser I investigated last year (e.g. for a hashmap of NaN values). I haven't figured out yet which data structure is most appropriate for our purposes--candidates include a hash map, B-tree, or a bitmap for certain cases--but that's part of what I'll be looking into this week. I'll also write asv benchmarks for current Table operations and work on passing Table information into a (not yet written) indexing data structure.