Tuesday, August 18, 2015

Week 12

Since GSoC is finally wrapping up, I pretty much spent this week reading over code in the PR and writing documentation. I introduced a new section on table indexing in the docs, after the "Table operations" section, which should give a good introduction to the indexing functionality. It also links to an IPython notebook I wrote (http://nbviewer.ipython.org/github/mdmueller/astropy-notebooks/blob/master/table/indexing-profiling.ipynb) that shows profiling results for indexing under different scenarios, e.g. different engines and regular columns vs. mixins. I also ran the asv benchmarking tool on features relevant to indexing, and fixed a Table sorting issue in which sorting on a primary index was unnecessarily slow.

There's not much else to describe in terms of final changes, although I do worry about areas where index copying or relabeling comes up unexpectedly and hurts performance. For example, using the `loc` attribute on Table is very slow with an indexing engine like FastRBT (which is slow to copy), since the returned rows are retrieved via a slice that relabels indices. The relabeling is necessary if the user wants indices in the returned slice, but I doubt that's a real need in most cases. The two alternatives I see are either to have `loc` return something else (like a non-indexed slice) or to simply advise in the documentation that the 'discard_on_copy' index mode is appropriate in this scenario.
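For example, a batch of `loc` lookups under FastRBT could be wrapped in that mode so the returned tables skip index relabeling entirely. A minimal sketch, assuming the `Table.index_mode` context manager described in the Week 11 entry below (the exact call signature may differ):
```
from astropy.table import table_helpers

t = table_helpers.simple_table(10)
t.add_index('a')

# Inside this block, tables produced by loc carry no indices, so no
# engine copy (slow for an engine like FastRBT) is triggered.
with t.index_mode('discard_on_copy'):
    rows = t.loc[3:7]
```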

Monday, August 10, 2015

Week 11

This week I implemented a nice bit of functionality that Tom suggested, inspired by a similar feature in Pandas: retrieving rows through an index via the Table attributes `loc` and `iloc`. The idea is to provide a mechanism for row retrieval that sits between a high-level `query()` method and dealing with Index objects directly. Here's an example:
```
In [2]: t = simple_table(10)

In [3]: print t
 a   b    c
--- ---- ---
  1  1.0   c
  2  2.0   d
  3  3.0   e
  4  4.0   f
  5  5.0   g
  6  6.0   h
  7  7.0   i
  8  8.0   j
  9  9.0   k
 10 10.0   l

In [4]: t.add_index('a')

In [5]: t.add_index('b')

In [6]: t.loc[4:9] # 'a' is the implicit primary key
Out[6]:
<Table length=6>
  a      b     c
int32 float32 str1
----- ------- ----
    4     4.0    f
    5     5.0    g
    6     6.0    h
    7     7.0    i
    8     8.0    j
    9     9.0    k

In [7]: t.loc['b', 1.5:7.0]
Out[7]:
<Table length=6>
  a      b     c
int32 float32 str1
----- ------- ----
    2     2.0    d
    3     3.0    e
    4     4.0    f
    5     5.0    g
    6     6.0    h
    7     7.0    i

In [8]: t.iloc[2:4]
Out[8]:
<Table length=2>
  a      b     c
int32 float32 str1
----- ------- ----
    3     3.0    e
    4     4.0    f
```
The `loc` attribute is used for retrieval by column value, while `iloc` is used for retrieval by position in the sorted order of an index. This involves the designation of a primary key, which for now is just the first index added to the table. Also, indices can now be retrieved by column name(s):
```
In [9]: t.indices['b']
Out[9]:
 b   rows
---- ----
 1.0    0
 2.0    1
 3.0    2
 4.0    3
 5.0    4
 6.0    5
 7.0    6
 8.0    7
 9.0    8
10.0    9
```
Aside from this, I've been making miscellaneous changes to the PR, such as getting `np.lexsort` to work with Time objects, reworking the `SortedArray` class to use a `Table` object instead of a list of ndarrays (for mixin support), moving `index_mode` into `Table`, etc. Tom noted some performance issues when working with indices, which I've also been addressing.
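On the lexsort point, for reference: `np.lexsort` sorts by its *last* key first, so a composite index on columns ('a', 'b') passes the keys in reverse order. A plain numpy illustration (not code from the PR):
```
import numpy as np

a = np.array([2, 1, 2, 1])
b = np.array([9, 3, 4, 1])

# Keys are given least-significant first: sort primarily by a, then by b.
order = np.lexsort((b, a))
print(a[order])  # [1 1 2 2]
print(b[order])  # [1 3 4 9]
```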

Tuesday, August 4, 2015

Week 10

This week wasn't terribly eventful; I spent time documenting code, expanding tests, and so on for the pull request. Docstrings are now in numpydoc format, and I fixed a few bugs, including one that Tom noticed when taking a slice of a slice:
```
from astropy.table import table_helpers

t = table_helpers.simple_table(10)
t.add_index('a')
t2 = t[1:]   # slice of the indexed table
t3 = t2[1:]  # slice of that slice

print(t3.indices[0])
```
The former output was "Index slice (2, 10, 2) of [[ 1 2 3 4 5 6 7 8 9 10], [0 1 2 3 4 5 6 7 8 9]]", while the step size is now 1, as it should be. The SlicedIndex system otherwise seems to be working fine, apart from a Python 3 bug I found involving the changed behavior of the / operator (it returns a float there); that's fixed now as well.
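A quick way to see what the composed slice should be is to slice a Python 3 range object, which tracks start/stop/step the same way (just a sanity check, not the PR's implementation):
```
# Two successive [1:] slices of 10 rows should compose to (2, 10, 1).
r = range(10)[1:][1:]
print(r.start, r.stop, r.step)  # 2 10 1
```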

Another new change is to the `index_mode` context manager: the "copy_on_getitem" mode now properly affects only the supplied table rather than tampering with BaseColumn directly. Michael's workaround is to change the __class__ attribute of each relevant column to a subclass (either _GetitemColumn or _GetitemMaskedColumn) with the correct __getitem__ method, which should rule out the unlikely side effects of the old approach; a rough sketch of the idea is below. Aside from this, I've been looking into improving the performance of the engines other than SortedArray. The main issue I see is that engine initialization creates a lot of Python objects, which unfortunately seems unavoidable given the constraints of the bintrees library. The success of SortedArray really lies in the fact that it deals with numpy arrays, so I'm looking into creating an ndarray-based binary search tree.
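The __class__ swap from the "copy_on_getitem" fix looks roughly like this (a simplified sketch; the real subclasses in the PR also hook in the index-copying logic and handle masked columns):
```
import numpy as np

class Column(np.ndarray):
    """Stand-in for astropy's BaseColumn."""

class _GetitemColumn(Column):
    def __getitem__(self, item):
        # In the real class, index-copying logic runs here first.
        return super(_GetitemColumn, self).__getitem__(item)

col = np.arange(5).view(Column)
col.__class__ = _GetitemColumn  # affects only this instance
# ... sliced access now goes through _GetitemColumn.__getitem__ ...
col.__class__ = Column          # restored when the context exits
```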

Tuesday, July 28, 2015

Week 9

This week I spent quite a bit of time on mixin column support for indices, where appropriate. After first moving the indices themselves from a Column attribute to a DataInfo attribute (accessed as col.info.indices), I moved most of the indexing code for handling column access and modification to BaseColumnInfo, for mixin use in methods like __getitem__ and __setitem__. Since each mixin class has to make the proper calls into the indexing code, mixins should set a boolean _supports_indices to True in their info classes (e.g. QuantityInfo), as sketched below. As of now, Quantity and Time support indices, while SkyCoord does not, since there is no natural order on coordinate values. I've updated the indexing test suite to cover the new mixins.
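Opting a mixin in is then just a flag on its info class. A minimal sketch (I'm assuming `MixinInfo` from `astropy.utils.data_info` as the base class; real mixins set this on e.g. QuantityInfo):
```
from astropy.utils.data_info import MixinInfo

class MyMixinInfo(MixinInfo):
    # Declare that columns of this mixin type can maintain indices.
    _supports_indices = True
```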

Aside from mixins (and general PR improvements like bug fixes), I implemented my mentors' suggestion to turn the previous static_indices context manager into a context manager called index_mode, which takes an argument indicating which of three modes to set for the index engine (see the usage sketch after this list). These modes are currently:

  • 'freeze', indicating that table indices should not be updated upon modification, such as changing values or adding rows. Once 'freeze' mode is lifted, each index updates itself based on the current column values. This mode should come in useful when users intend to perform a large number of column updates at once.
  • 'discard_on_copy', indicating that indices should not be copied upon creation of a new column (for example, due to calls like "table[2:5]" or "Table(table)").
  • 'copy_on_getitem', indicating that indices should be copied when columns are sliced directly. This mode exists because BaseColumn does not override the __getitem__ method of its parent class (numpy.ndarray), for performance reasons, so the method BaseColumn.get_item(item) must normally be used to copy indices upon slicing. In 'copy_on_getitem' mode, BaseColumn.__getitem__ will copy indices, at the cost of a sizable performance hit. One issue I ran into while implementing this mode is that, for special methods like __getitem__, new-style Python classes look the method up on the type rather than the instance; that is, "col[[1, 3]]" corresponds to something like "type(col).__getitem__(col, [1, 3])" rather than "col.__getitem__([1, 3])". I got around this by adjusting the actual __getitem__ method of BaseColumn in this context (and only for the duration of the context), but this has the side effect that all columns change behavior, not just the columns of the table supplied to index_mode. I'll have to ask my mentors whether they see this as much of an issue, because as far as I can tell there's no other solution.
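Here's roughly how I picture these modes being used; at this stage index_mode is a standalone context manager taking the table as its first argument (the import path below is a guess):
```
from astropy.table.index import index_mode  # import path is a guess
from astropy.table import table_helpers

t = table_helpers.simple_table(10)
t.add_index('a')

with index_mode(t, 'freeze'):
    # Many cheap updates; indices are not touched inside the block.
    for i in range(len(t)):
        t['a'][i] += 1
# On exit, each index rebuilds itself from the current column values.

with index_mode(t, 'discard_on_copy'):
    subtable = t[2:5]        # subtable carries no indices

with index_mode(t, 'copy_on_getitem'):
    col_slice = t['a'][2:5]  # direct column slicing now copies indices
```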
At this point I see the PR as pretty much done, although I'll spend more time writing documentation (and making docstrings conform to the numpy docstring standard).

Monday, July 20, 2015

Week 8

This past week I've been adding some functionality to the index system while the current PR is being reviewed: taking slices of slices, including non-copy and non-modify modes as context managers, etc. One issue my mentors and I discussed in our last meeting is that `Column.__getitem__` becomes incredibly slow if overridden to check for slices and so forth, so we have to do without it (as part of a larger rebase onto astropy/master). Our decision was to drop index propagation upon column slicing and only propagate indices on Table slices; though this behavior is potentially confusing, it will be documented and shouldn't be a big deal. For convenience, a separate method `get_item` in `Column` has the same functionality as the previous `Column.__getitem__` and can be used instead.
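Concretely, the split looks like this (a sketch; `get_item` takes the same item argument the old `__getitem__` did):
```
from astropy.table import table_helpers

t = table_helpers.simple_table(10)
t.add_index('a')
col = t['a']

sub1 = col[1:5]                   # plain ndarray-style slice: fast, no index propagation
sub2 = col.get_item(slice(1, 5))  # same rows, with indices propagated
sub3 = t[1:5]['a']                # Table slices still propagate indices
```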

I have a lot more to write, but I need to be up early tomorrow morning so I'll finish this post later.

Monday, July 13, 2015

Week 7

The existing PR is passing Travis tests but hasn't been reviewed yet, so in the meantime I've been working on some performance issues. One huge problem I discovered last week was that `group_by` created a number of Table slices, each of which incurred an expensive deep copy of column indices and contributed to a running time of several seconds. To circumvent the problem, I created a context manager `static_indices`, which can be used like so:
```
with static_indices(table):
    subtable = table[[1, 3, 4]]
```
In this case `subtable` will have no indices. The main issue, however, was with column slicing, which should produce a reference rather than a (deep) copy, in keeping with the behavior of `numpy.ndarray` (and thus of a `Column`'s internal data). It's lucky this came up, because I had no previous tests checking that modifying a slice affects both the internal data and the indices of the original Table or Column.
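The behavior to match here is numpy's own view semantics:
```
import numpy as np

a = np.arange(5)
b = a[1:4]   # a view onto a, not a copy
b[0] = 99
print(a)     # [ 0 99  2  3  4]: modifying the slice modifies the original
```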

Aside from this, I've been working on cutting down the time required to initialize an index; the pure-Python loop I previously used was woefully inefficient in both memory and time. I haven't yet figured out how to get this to work with FastRBT, but SortedArray is now much faster to initialize. Here are some time profiling results on a 100,000-row table:
```
FastRBT
**********
time_init: 1.2263660431
time_group: 0.2325990200
time_where: 0.0003449917
time_query: 0.0000329018
time_query_range: 0.0643420219
time_add_row: 2.7397549152
time_modify: 0.0001499653

SortedArray
**********
time_init: 0.0355048180
time_group: 0.0041830540
time_where: 0.0000801086
time_query: 0.0000169277
time_query_range: 0.0217781067
time_add_row: 0.0200960636
time_modify: 0.0016808510

None
**********
time_init: 0.0000019073
time_group: 0.0865180492
time_where: 0.0002820492
time_query: 0.0001931190
time_query_range: 0.2128648758
time_add_row: 0.0006089211
time_modify: 0.0000159740
```
I've focused on SortedArray quite a bit, so FastRBT is still pretty slow in areas that should be easily fixable; I'll tackle those tomorrow. I have an IPython notebook with these timing results here, and a short Markdown file here.

Monday, July 6, 2015

Week 6

I began this week by fixing up a couple of loose ends that were causing test failures, and then opening a new pull request for the work I've done so far. Travis CI and AppVeyor build the PR successfully, but for some reason Coveralls reports that overall coverage has decreased. I'll have to look into that...

From there I've been working on the SortedArray engine for indexing; that is, I've turned it into an actual array (a numpy structured array) rather than the old fill-in pure-Python list. One nice thing about the new class is that it can make good use of `numpy.searchsorted`, which finds the index at which to insert a given value into a sorted array. Unfortunately `searchsorted` has no lexicographical mode for arrays sorted by more than one column (as with composite indices), but I managed a workaround by applying `searchsorted` several times, as sketched below.
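The workaround narrows the row range one column at a time, calling `searchsorted` once per key. A sketch for a two-column key (my own illustration, not the PR code):
```
import numpy as np

def composite_searchsorted(col1, col2, v1, v2, side='left'):
    # Rows are assumed sorted lexicographically by (col1, col2).
    lo = np.searchsorted(col1, v1, side='left')   # first row with col1 == v1
    hi = np.searchsorted(col1, v1, side='right')  # one past the last such row
    # Binary-search the second column within that range only.
    return lo + np.searchsorted(col2[lo:hi], v2, side=side)

col1 = np.array([1, 1, 2, 2, 2, 3])
col2 = np.array([5, 9, 1, 4, 7, 2])  # sorted within each col1 group
print(composite_searchsorted(col1, col2, 2, 5))  # 4
```
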

I also worked on adding to the testing suite, particularly for the new SortedArray engine. I added a test fixture in `test_index.py` to run each test with every possible engine, which gave me some grounds for testing SortedArray and will make it worthwhile to add more indexing tests, so as to test the Index class and each engine simultaneously. In terms of memory, SortedArray seems to do fairly well; I profiled a few engines on a 100,000 line table and found memory usages of 18.6 MB for SortedArray, 20.0 MB for FastRBT, and 48.2 MB for BST (the pure-Python binary search tree). The time profiles for these engines are more complicated, and much more problematic; querying an index takes negligible time compared to creating one, and the time it takes to create an index is totally unreasonable at this stage. (I found 7.53 seconds for BST, 2.04 seconds for FastRBT, and 2.21 seconds for SortedArray.) It's also a bit difficult to find what the main issue is, although iterating through Column takes up a significant time chunk and `numpy.argsort`, oddly enough, takes up a full half-second for SortedArray -- maybe there's something more subtle than I expect going on in there. I'm interested to hear whether Michael and Tom think we should copy Column data at a lower level (i.e. Cython/C), or how otherwise to get around this unexpected time bottleneck. Hopefully the current PR will get some feedback soon, as well.