My project this year will involve the implementation of database indexing for the Table class in Astropy, which is a central Astropy data structure. I had a Google Hangout meeting with my mentors (Michael Droettboom and Tom Aldcroft) last week in which we discussed possibilities for functionality to implement over the summer. Among these are
- Allowing for multiple indexed columns within a Table, where each "index" is a 1-to-n mapping of keys to values
- Using indices to speed up existing Table operations, such as "join", "group_by", and "unique"
- Implement other Table methods like "where", "indices", "sort"
- A special index called the "primary key" which might yield algorithmic improvements
- Making sure values in an index can be selected by range
We also noted some open questions to tackle as the project gets further along. My work on the project begins today; so far (during the community bonding period) I've been reading the Astropy docs on current Table operations and looking at Pandas to see how the indexed Series class works under the hood. I discovered that Pandas uses a Python wrapper around a Cython/C hash table engine to implement its Series index; in fact, the same engine is used in the Python ASCII parser I investigated last year (e.g. for a hashmap of NaN values). I haven't figured out yet which data structure is most appropriate for our purposes--candidates include a hash map, B-tree, or a bitmap for certain cases--but that's part of what I'll be looking into this week. I'll also write asv benchmarks for current Table operations and work on passing Table information into a (not yet written) indexing data structure.
No comments:
Post a Comment