Tuesday, April 22, 2014

First GSOC Post

This is my first blog entry--I'll be making weekly posts over the summer to explain what I'm doing as part of the program, what I've accomplished so far, whatever problems I might run into, etc.

In case you're unaware, Google Summer of Code is an international program in which Google awards monetary stipends to student developers who work on tasks for open-source organizations over the course of the summer. I applied as a student developer for two organizations under the umbrella of the Python Software Foundation: AstroPy and SunPy. These projects are both Python software libraries intended for scientific use; their main difference is that AstroPy is intended for astronomy and SunPy is specifically intended for solar physics, but there is a large degree of collaboration between the two. My proposal for AstroPy was accepted under the program, but I'd still love to continue contributing to SunPy whenever I have the chance!

My project over the summer will deal with the astropy.io.ascii package of AstroPy, which supports reading and writing astronomical data in a number of text-based file formats. This package was originally designed with simplicity and flexibility in mind, which makes development easier, but this flexibility often comes at the expense of parsing performance. Some of these formats are relatively simple (CSV, for example), yet parsing large input files currently takes way too long, so it would be helpful to have an optimized parser and writer for these formats.

Another possible strategy for increasing the performance of astropy.io.ascii is to use memory mapping, which would allow for both faster parsing and more efficient memory usage. The package astropy.io.fits, which implements reading and writing for FITS (Flexible Image Transport System) files, currently uses this technique. One of my project mentors, Tom Aldcroft, mentioned that memory mapping might turn out to be impractical for variable-length ASCII files, so I guess we'll have to figure that out over the summer after examining AstroPy's FITS memory mapping in further detail.
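To make the memory mapping idea a bit more concrete, here's a rough sketch using only Python's standard mmap module. This isn't how astropy.io.fits does it and it isn't what I'll necessarily end up writing; the helper names (row_offsets, read_row, demo) are made up for illustration. It also shows why variable-length rows are awkward: since rows don't have a fixed width, you can't compute where row N starts, so you have to scan for newlines first.

```python
import mmap
import os
import tempfile


def row_offsets(mm):
    """Return the byte offset at which each row starts.

    Rows have variable length, so the offsets cannot be computed
    arithmetically; the whole map is scanned once for newlines.
    """
    offsets = [0]
    pos = mm.find(b"\n")
    while pos != -1:
        offsets.append(pos + 1)
        pos = mm.find(b"\n", pos + 1)
    # A trailing newline produces an offset past the last row; drop it.
    if offsets[-1] == len(mm):
        offsets.pop()
    return offsets


def read_row(mm, offsets, i):
    """Decode row i and split it on commas (hypothetical helper)."""
    start = offsets[i]
    end = mm.find(b"\n", start)
    if end == -1:
        end = len(mm)
    return mm[start:end].decode("ascii").split(",")


def demo():
    """Write a tiny CSV file, map it read-only, and read every row back."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"a,b\n1,2\n10,200\n")
        with open(path, "rb") as f:
            # Length 0 maps the entire file.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                offsets = row_offsets(mm)
                return [read_row(mm, offsets, i) for i in range(len(offsets))]
    finally:
        os.remove(path)
```

The nice part is that only the pages you actually touch get read from disk, so pulling out a handful of rows from a huge file should stay cheap once the offsets are known. The upfront newline scan is exactly the cost that fixed-width formats like FITS get to skip.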

In short, my general plan is as follows:
  1. Write benchmarks to test out the current performance of astropy.io.ascii and relevant parts of AstroPy's Table class.
  2. Examine the ASCII parsing algorithms in the Pandas data analysis library to determine whether these can be adapted for use in AstroPy; if not, I plan to implement my own efficient parsing strategy based on Pandas' approach.
  3. Look for further performance bottlenecks and fix any I can find.
  4. See if I'll be able to implement memory mapping for ASCII files.
  5. If I have enough time, work on general performance enhancement for AstroPy across the board.
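
As a rough idea of what step 1 could look like, here's a minimal benchmark sketch. It uses the standard library's timeit rather than asv, and it compares two stand-in parsers (a naive str.split and the csv module) instead of astropy.io.ascii itself, so treat every function name here as hypothetical.

```python
import csv
import io
import timeit


def make_csv(n_rows, n_cols=5):
    """Build an in-memory CSV string with a header and integer data."""
    header = ",".join(f"col{i}" for i in range(n_cols))
    rows = (
        ",".join(str(r * n_cols + c) for c in range(n_cols))
        for r in range(n_rows)
    )
    return header + "\n" + "\n".join(rows) + "\n"


def parse_split(text):
    """Naive parser: split lines on commas (no quoting support)."""
    return [line.split(",") for line in text.splitlines()]


def parse_csv_module(text):
    """Parser based on the standard library's csv reader."""
    return list(csv.reader(io.StringIO(text)))


def benchmark(text, repeat=3, number=5):
    """Return the best per-call time in seconds for each parser."""
    results = {}
    for func in (parse_split, parse_csv_module):
        times = timeit.repeat(lambda: func(text), repeat=repeat, number=number)
        # The minimum is the least noisy estimate of the achievable time.
        results[func.__name__] = min(times) / number
    return results
```

Calling benchmark(make_csv(100000)) gives a dictionary of timings per parser. The real benchmarks would swap in the astropy.io.ascii readers and a range of file sizes and formats.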

I'll be using Cython, a performance-improving superset of Python that allows for C-esque features like static typing, as well as the benchmarking and profiling tools asv and line_profiler, throughout the project. The coding period begins on May 19th, so until then I'll be playing around with Cython and Pandas. I'm looking forward to the summer!