Sunday, May 31, 2015

Week 1

It's the end of the first week of coding--I spent this week writing benchmarks to test current Table functionality (see repo: https://github.com/mdmueller/astropy-benchmarks-1) and creating the groundwork for an Index class that will maintain sorted order on one or more columns. The results of the ASV benchmarks (http://mdmueller.github.io/astropy-benchmarks-1/) were essentially what I expected; for example, the cost of grouping by column scales by a factor of 10 when a table is made 10 times as large, confirming that current operations are O(n). Over time, we should see this scaling improve and become O(log n). A couple other comparisons are viewable at http://i.imgur.com/uxBcFvQ.png and http://i.imgur.com/lQ49dWZ.png.

Apart from this, I wrote a basic skeleton of the Index class and adjusted parts of the Table class to update Table indices upon modification of the Table. My work will be done in the branch "table-indexing" on my Astropy fork (https://github.com/mdmueller/astropy/tree/table-indexing). More to come in a bit.

Monday, May 25, 2015

GSOC 2015

This is my first blog entry for the 2015 Google Summer of Code--I'm excited to take part in the program for a second year, and to continue working with Astropy in particular! I just realized that Python students were expected to make a blog post about Community Bonding Period by May 24th, so this is just a bit late.

My project this year will involve the implementation of database indexing for the Table class in Astropy, which is a central Astropy data structure. I had a Google Hangout meeting with my mentors (Michael Droettboom and Tom Aldcroft) last week in which we discussed possibilities for functionality to implement over the summer. Among these are

  • Allowing for multiple indexed columns within a Table, where each "index" is a 1-to-n mapping of keys to values
  • Using indices to speed up existing Table operations, such as "join", "group_by", and "unique"
  • Implement other Table methods like "where", "indices", "sort"
  • A special index called the "primary key" which might yield algorithmic improvements
  • Making sure values in an index can be selected by range
We also noted some open questions to tackle as the project gets further along. My work on the project begins today; so far (during the community bonding period) I've been reading the Astropy docs on current Table operations and looking at Pandas to see how the indexed Series class works under the hood. I discovered that Pandas uses a Python wrapper around a Cython/C hash table engine to implement its Series index; in fact, the same engine is used in the Python ASCII parser I investigated last year (e.g. for a hashmap of NaN values). I haven't figured out yet which data structure is most appropriate for our purposes--candidates include a hash map, B-tree, or a bitmap for certain cases--but that's part of what I'll be looking into this week. I'll also write asv benchmarks for current Table operations and work on passing Table information into a (not yet written) indexing data structure.