Sunday, June 8, 2014

Week 3

It's the end of the third week of coding, and I've finished the benchmarking stage of the project. After a hangout meeting with two of my mentors, Tom Aldcroft and Michael Droettboom, in which we discussed my work so far and how to proceed, I began improving the existing benchmarks, creating new ones (some covering relevant parts of astropy.table and some comparing the performance of AstroPy with numpy and pandas), doing some high-level profiling, and documenting the results. Although I already posted the link in my previous post, here is the GitHub repo containing most of my work, including randomly generated sample text files for benchmarking.
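For context, a sample file can be generated with just a few lines of numpy; this is a minimal sketch, and the function name and parameters here are hypothetical rather than taken from the repo:

```python
import numpy as np

def make_sample_csv(path, n_rows=10000, n_cols=10, seed=0):
    # Seeded RNG so benchmark inputs are reproducible across runs
    rng = np.random.RandomState(seed)
    data = rng.rand(n_rows, n_cols)
    header = ",".join("col{0}".format(i) for i in range(n_cols))
    # comments="" stops numpy from prefixing the header row with "# "
    np.savetxt(path, data, fmt="%.6f", delimiter=",",
               header=header, comments="")

make_sample_csv("sample.csv")
```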

Some of my work involved fixing or extending the current benchmarks; for example, I switched to cStringIO objects for input and output instead of reading from and writing to a file each time. I also added benchmarks for other functions that profiling showed to be significant, and ran line_profiler on those as well; the current asv graph is here and the line profiling results are here. Although I had expected the Table class itself to be a major target for benchmarking, profiling revealed that most of the work is actually done in Column and pprint.py, so the benchmarks I wrote pertain to those.
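To illustrate the in-memory approach, here's a minimal sketch of an asv-style benchmark; the class, method, and column names are illustrative, not the actual benchmarks in the repo:

```python
from cStringIO import StringIO  # io.StringIO on Python 3
from astropy.io import ascii
from astropy.table import Table

class TimeReadWrite(object):
    def setup(self):
        # Build inputs once, outside the timed methods, so only the
        # actual parsing/formatting work is measured
        rows = "\n".join("{0},{1}".format(i, i / 3.0)
                         for i in range(10000))
        self.text = "a,b\n" + rows
        self.table = Table([range(10000)], names=["a"])

    def time_read_csv(self):
        # An in-memory buffer keeps disk I/O out of the measured time
        ascii.read(StringIO(self.text), format="csv")

    def time_write_csv(self):
        ascii.write(self.table, StringIO(), format="csv")
```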

More importantly, I used cProfile to get a high-level view of how much time the current implementation spends in various functions, which should give me a good framework for investigating how to change io.ascii in the coming weeks. Per Tom's and Michael's suggestion, I used snakeviz, a nifty little profile visualization tool, to look at cProfile's output. Finally, I wrote up the results in this markdown file, which was particularly necessary because asv doesn't (currently) have a mechanism for comparing benchmarks. I wasn't aware of any way to publish snakeviz's output (e.g. on GitHub Pages), so I just took a couple of screenshots and linked them in the markdown file. The writeup is more specific, but my main findings were that pandas >> numpy > AstroPy in reading/writing speed, that different formats were fairly similar in speed, and that both reading and writing times varied considerably across data types (probably because type conversion is currently inefficient).
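The profiling workflow itself is short; roughly along these lines, with the file names here being placeholders:

```python
import cProfile
from astropy.io import ascii

# Dump stats for a single read to disk; snakeviz then renders the
# call tree in the browser via `snakeviz read.prof` at the shell
cProfile.run('ascii.read("sample.csv", format="csv")', "read.prof")
```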

Next week I'll be looking closely at how pandas handles text-based reading and writing to see whether its approach can be adapted to AstroPy, after which I'll plan out how best to change the current implementation in io.ascii. If pandas turns out to be compatible with the Table class, the speedup should be substantial, since pandas is currently about five times as fast as AstroPy.
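One conversion path I'll be looking at might resemble the sketch below; whether this round-trip is actually cheap enough is part of what I need to check:

```python
import pandas as pd
from astropy.table import Table

# Parse with pandas' fast C engine, then wrap the result as a numpy
# structured array, which Table can consume directly
df = pd.read_csv("sample.csv")
table = Table(df.to_records(index=False))
```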
