I spent most of this week integrating the new indexing system into the existing system of table operations, with the goal of making sure that everything works correctly before honing in on performance improvements. (I'm keeping thoughts about improvements in the back of my mind though, e.g. changing the indexing data structure to something like a red-black tree, rewriting code in Cython/C, etc.) So far the results have come along reasonably well, so that the indexing system is used in `group_by()` and a new function called `where()`, which is under construction. The goal of the `where()` function is to mimic the SQL "where" query, so that one can write something like this for a table full of names:
results = table.where('first_name = {0} and last_name = {0}', 'Michael', 'Mueller')
This will ultimately be sort of mini-language, so I'll have to worry about parsing. For now it only deals with a couple of the easiest cases.
I've also modified the Index structure to allow for composite indices, in which an index keeps track of keys from more than one column. The idea is to store the node keys in the underlying implementation (currently a BST) as a tuple containing values from each column, in order; for example, a key for an index on the columns "first_name" and "last_name" might be ("Michael", "Mueller"). Clearly the order of the columns is important, because ("A", "B") < ("B", "A"), so `where()` queries involving multiple columns will have to use a leftmost prefix of the column set for a composite index in order to be usable. For example, if an index is on ("first_name", "last_name"), then querying for "first_name" should be allowed but querying only for "last_name" should not be allowed, as otherwise the advantage of an index is no longer relevant.
Going into next week, I plan on continuing the current integration of Index into the Table system, more specifically on the `where()` functionality. Luckily the basic infrastructure works now, though figuring out a couple errors with pickling/memory views in the Table and Column classes were a bit annoying this week and the week prior. As of now, indices are stored in their container Columns and, conversely, each index contains a list of references (in order) to its column(s).
No comments:
Post a Comment