Monday, August 18, 2014

Week 13

This was the final week of Google Summer of Code, and since last Monday was the suggested "pencils down" date, I spent the week focusing on getting the main pull request ready for merging. I began by testing the new fast converter for unusual input, then handled issues Erik noted with the PR, filed an issue with Pandas, and began work on a new branch which implements a different memory scheme in the tokenizer. The PR seems to be in a final review stage, so hopefully it'll be merged by next week.
After testing out xstrtod(), I noticed a couple problems with extreme input values and fixed them; the most notable problem was an inability to handle subnormals (values with exponent less that -308). As of now, the converter seems to work pretty well for a wide range of input, and the absolute worst-case error seems to be around 3.0 ULP. Interestingly, when I reported the problems with the old xstrtod() as a bug in Pandas, the response I received was that the current code should remain, but a new parameter float_precision might be added to allow for more accurate conversion. Both Tom and I found this response a little bizarre, since the issues with xstrtod() seem quite buggy, but in any case I have an open PR to implement this in Pandas.
Aside from this, Erik pointed out some suggestions and concerns about the PR, which I dealt with in new commits. For example, he suggested that I use the mmap module in Python rather than dealing with platform-dependent memory mapping in C, which seems to make more sense for the sake of portability. He also pointed out that the method FileString.splitlines(), which returns a generator yielding lines from the memory-mapped file, was inefficient due to repeated calls to chr(). I ultimately rewrote it in C, and although its performance is really only important for commented-header files with header line deep into the file, I managed to get more than a 2x speedup on a 10,000-line integer file with a commented header line in the last row with the new approach.
Although it won't be a part of the main PR, I've also been working on a separate branch change-memory-layout which changes the storage of output in memory in the tokenizer. The main purpose of this branch is to reduce the memory footprint of parsing, as the peak memory usage is almost twice that of Pandas; the basic idea is that instead of storing output in char **output_cols, it's stored instead in a single string char *output and an array of pointers, char **line_ptrs, records the beginning of each line for conversion purposes. While I'm still working on memory improvements, I actually managed to get a bit of a speed boost with this approach. Pure floating-point data is now slightly quicker to read with io.ascii than with Pandas, even without multiprocessing enabled!
Since today is the absolute pencils down date, this marks the official end of the coding period and the end of my blog posts. I plan to continue responding to the review of the main PR and finish up the work in my new branch, but the real work of the summer is basically over. It's been a great experience, and I'm glad I was able to learn a lot and get involved in Astropy development!

No comments:

Post a Comment