At this point, the infrastructure for indexes is pretty much in place, and the goal now is to work on getting as much speed as possible out of each index. This week, I've begun trying a couple things to improve the performance of the basic binary-tree index implementation, at this stage mostly just algorithmic. As a first step I implemented a red-black tree (in Python for now, though I hope to cut it down to C level); its advantage is that all major operations (add, remove, find) are in O(log n) as guaranteed by the relative balancing of the tree height. The self-balancing part involves keeping to a standard set of rules about the tree, whose nodes have a "color" of either red or black: the root must be black, every red node must have two black children, leaf (null) nodes are black, and any path from a node to its descendants must have the same number of black components. Currently, the performance of the red-black tree in an index overshadows the vanilla binary search tree, but is still not as good as I'd like.
Per my mentors' suggestion, I also implemented a sorted array implementation that can find elements in O(log n), while adding elements has a worst-case running time of O(n). While this makes sorted arrays worse when there are many modifications, the common use-case of an index used on an unmodified Table seems to justify their use. After first initializing the sorted array using a fast (probably in C) numpy sort, modifications to the array occur in-place (presumably the underlying Python list shifts elements on one side) and the find() method uses a binary search. Currently, this implementation outperforms the non-indexed version by about 10% for group_by, and I'm looking to improve on its performance for where(), which is probably the most time-critical function.
No comments:
Post a Comment