I began this week by fixing up a couple loose ends causing test failures, and then opening a new pull request for the work I've done so far. Travis CI and Appveyor run their builds of the PR successfully, but for some reason Coveralls reports that overall coverage has decreased. I'll have to look into that...
From there I've been working on the SortedArray engine for indexing; that is, I've turned it into an actual array (a numpy structured array) rather than the old fill-in pure-Python list. One nice thing about the new class is that it can make good use of the `numpy.searchsorted` function, which searches for the right index to insert a given parameter into a sorted list. Unfortunately `searchsorted` doesn't have a lexicographical feature for lists that are sorted by more than one column (as in the case of composite indices), although I was able to manage a workaround by using `searchsorted` several times.
I also worked on adding to the testing suite, particularly for the new SortedArray engine. I added a test fixture in `test_index.py` to run each test with every possible engine, which gave me some grounds for testing SortedArray and will make it worthwhile to add more indexing tests, so as to test the Index class and each engine simultaneously. In terms of memory, SortedArray seems to do fairly well; I profiled a few engines on a 100,000 line table and found memory usages of 18.6 MB for SortedArray, 20.0 MB for FastRBT, and 48.2 MB for BST (the pure-Python binary search tree). The time profiles for these engines are more complicated, and much more problematic; querying an index takes negligible time compared to creating one, and the time it takes to create an index is totally unreasonable at this stage. (I found 7.53 seconds for BST, 2.04 seconds for FastRBT, and 2.21 seconds for SortedArray.) It's also a bit difficult to find what the main issue is, although iterating through Column takes up a significant time chunk and `numpy.argsort`, oddly enough, takes up a full half-second for SortedArray -- maybe there's something more subtle than I expect going on in there. I'm interested to hear whether Michael and Tom think we should copy Column data at a lower level (i.e. Cython/C), or how otherwise to get around this unexpected time bottleneck. Hopefully the current PR will get some feedback soon, as well.
No comments:
Post a Comment