Tuesday, July 28, 2015

Week 9

This week I spent quite a bit of time on mixin column support for indices, where appropriate. After first moving the indices themselves from a Column attribute to a DataInfo attribute (accessed as col.info.indices), I moved most of the indexing code for dealing with column access/modifications to BaseColumnInfo for mixin use in methods like __getitem__ and __setitem__. Since each mixin class has to include proper calls to indexing code, mixins should set a boolean value _supports_indices to True in their info classes (e.g. QuantityInfo). As of now, Quantity and Time support indices, while SkyCoord does not since there is no natural order on coordinate values. I've updated the indexing testing suite to deal with the new mixins.

Aside from mixins (and general PR improvements like bug fixes), I implemented my mentors' suggestion to turn the previous static_indices context manager into a context manager called index_mode, which takes an argument indicating one of three modes to set for the index engine. These modes are currently:

  • 'freeze', indicating that table indices should not be updated upon modification, such as updating values or adding rows. After 'freeze' mode is lifted, each index updates itself based on column values. This mode should come in useful if users intend to perform a large number of column updates at a time.
  • 'discard_on_copy', indicating that indices should not be copied upon creation of a new column (for example, due to calls like "table[2:5]" or "Table(table)").
  • 'copy_on_getitem', indicating that indices should be copied when columns are sliced directly. This mode is motivated by the fact that BaseColumn does not override the __getitem__ method of its parent class (numpy.ndarray) for performance reasons, and so the method BaseColumn.get_item(item) must be used to copy indices upon slicing. When in 'copy_on_getitem' mode, BaseColumn.__getitem__ will copy indices at the expense of a reasonably large performance hit. One issue I ran into while implementing this mode is that, for special methods like __getitem__, new-style Python classes call the type's method rather than the instance's method; that is, "col[[1, 3]]" corresponds to something like "type(col).__getitem__(col, [1, 3])" rather than "col.__getitem__([1, 3])". I got around this by adjusting the actual __getitem__ method of BaseColumn in this context (and only for the duration of the context), but this has the side effect that all columns have changed behavior, not just the columns of the table supplied to index_mode. I'll have to ask my mentors whether they see this as much of an issue, because as far as I can tell there's no other solution.
At this point I see the PR as pretty much done, although I'll spend more time writing documentation (and making docstrings conform to the numpy docstring standard).

Monday, July 20, 2015

Week 8

This past week I've been adding some functionality to the index system while the current PR is being reviewed: taking slices of slices, including non-copy and non-modify modes as context managers, etc. One issue my mentors and I discussed in our last meeting is the fact that `Column.__getitem__` becomes incredibly slow if overriden to check for slices and so forth, so we have to do without it (as part of a larger rebase on astropy/master). Our decision was to drop index propagation upon column slicing, and only propagate indices on Table slices; though this behavior is potentially confusing, it will be documented and shouldn't be a big deal. For convenience, a separate method `get_item` in `Column` has the same functionality as the previous `Column.__getitem__` and can be used instead.

I have a lot more to write, but I need to be up early tomorrow morning so I'll finish this post later.

Monday, July 13, 2015

Week 7

The existing PR is passing Travis tests but hasn't been reviewed yet, so in the meantime I've been working on some performance issues. One huge problem I discovered last week was that `group_by` created a number of Table slices, each of which incurred an expensive deep copy of column indices and contributed to a running time of several seconds. To circumvent the problem, I created a context manager `static_indices`, which can be used like so:
```
with static_indices(table):
    subtable = table[[1, 3, 4]]
```
In this case `subtable` will have no indices. However, the main issue was with column slicing, which should represent a reference rather than a (deep) copy in keeping with the behavior of `numpy.ndarray` (and thus `Column` internal data). Lucky that this came up, because I didn't have any previous tests checking to make sure that modifying a slice affects both the internal data and indices of the original Table or Column.

Aside from this, I've been working on cutting down the time required to initialize an index; the pure-Python loop I previously used was woefully inefficient memory- and time-wise. I haven't yet figured out how to get this to work with FastRBT, but SortedArray is now much faster to initialize. Here are some time profiling results on a 100,000 line table:
```
FastRBT
**********
time_init: 1.2263660431
time_group: 0.2325990200
time_where: 0.0003449917
time_query: 0.0000329018
time_query_range: 0.0643420219
time_add_row: 2.7397549152
time_modify: 0.0001499653

SortedArray
**********
time_init: 0.0355048180
time_group: 0.0041830540
time_where: 0.0000801086
time_query: 0.0000169277
time_query_range: 0.0217781067
time_add_row: 0.0200960636
time_modify: 0.0016808510

None
**********
time_init: 0.0000019073
time_group: 0.0865180492
time_where: 0.0002820492
time_query: 0.0001931190
time_query_range: 0.2128648758
time_add_row: 0.0006089211
time_modify: 0.0000159740
```
I've focused on SortedArray quite a bit, so FastBRT is still pretty slow in areas that should be easily fixable--I'll tackle those tomorrow. I have an IPython notebook with these timing results here, and a short Markdown file here.

Monday, July 6, 2015

Week 6

I began this week by fixing up a couple loose ends causing test failures, and then opening a new pull request for the work I've done so far. Travis CI and Appveyor run their builds of the PR successfully, but for some reason Coveralls reports that overall coverage has decreased. I'll have to look into that...

From there I've been working on the SortedArray engine for indexing; that is, I've turned it into an actual array (a numpy structured array) rather than the old fill-in pure-Python list. One nice thing about the new class is that it can make good use of the `numpy.searchsorted` function, which searches for the right index to insert a given parameter into a sorted list. Unfortunately `searchsorted` doesn't have a lexicographical feature for lists that are sorted by more than one column (as in the case of composite indices), although I was able to manage a workaround by using `searchsorted` several times.

I also worked on adding to the testing suite, particularly for the new SortedArray engine. I added a test fixture in `test_index.py` to run each test with every possible engine, which gave me some grounds for testing SortedArray and will make it worthwhile to add more indexing tests, so as to test the Index class and each engine simultaneously. In terms of memory, SortedArray seems to do fairly well; I profiled a few engines on a 100,000 line table and found memory usages of 18.6 MB for SortedArray, 20.0 MB for FastRBT, and 48.2 MB for BST (the pure-Python binary search tree). The time profiles for these engines are more complicated, and much more problematic; querying an index takes negligible time compared to creating one, and the time it takes to create an index is totally unreasonable at this stage. (I found 7.53 seconds for BST, 2.04 seconds for FastRBT, and 2.21 seconds for SortedArray.) It's also a bit difficult to find what the main issue is, although iterating through Column takes up a significant time chunk and `numpy.argsort`, oddly enough, takes up a full half-second for SortedArray -- maybe there's something more subtle than I expect going on in there. I'm interested to hear whether Michael and Tom think we should copy Column data at a lower level (i.e. Cython/C), or how otherwise to get around this unexpected time bottleneck. Hopefully the current PR will get some feedback soon, as well.