Monday, August 18, 2014

Week 13

This was the final week of Google Summer of Code, and since last Monday was the suggested "pencils down" date, I spent the week focusing on getting the main pull request ready for merging. I began by testing the new fast converter for unusual input, then handled issues Erik noted with the PR, filed an issue with Pandas, and began work on a new branch which implements a different memory scheme in the tokenizer. The PR seems to be in a final review stage, so hopefully it'll be merged by next week.
After testing out xstrtod(), I noticed a couple of problems with extreme input values and fixed them; the most notable problem was an inability to handle subnormals (values with exponent less than -308). As of now, the converter seems to work pretty well for a wide range of input, and the absolute worst-case error seems to be around 3.0 ULP. Interestingly, when I reported the problems with the old xstrtod() as a bug in Pandas, the response I received was that the current code should remain, but a new parameter float_precision might be added to allow for more accurate conversion. Both Tom and I found this response a little bizarre, since the behavior of xstrtod() looks like a genuine bug, but in any case I have an open PR to implement this in Pandas.
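For readers unfamiliar with subnormals, here is a small Python sketch of where that boundary sits for IEEE doubles. It only uses the standard library (math.ulp needs Python 3.9 or later); the helper name is_subnormal is made up for illustration and is not part of the converter.

```python
import math
import sys

# Smallest positive *normal* double, roughly 2.2e-308.
smallest_normal = sys.float_info.min
# Smallest positive *subnormal* double (math.ulp(0.0) returns it).
smallest_subnormal = math.ulp(0.0)

def is_subnormal(x):
    """Hypothetical helper: True if x is nonzero but below the normal range."""
    return x != 0.0 and abs(x) < smallest_normal

# Strings a converter has to cope with near the bottom of the range.
samples = ["1e-307", "1e-310", "4.9e-324", "1e-330"]
results = {text: float(text) for text in samples}

for text, value in results.items():
    print(text, repr(value), "subnormal" if is_subnormal(value) else "normal or zero")
```

Values like 1e-310 are still representable, just with reduced precision, so a converter that gives up at an exponent of -308 loses them; values far below the smallest subnormal underflow to zero.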
Aside from this, Erik pointed out some suggestions and concerns about the PR, which I dealt with in new commits. For example, he suggested that I use the mmap module in Python rather than dealing with platform-dependent memory mapping in C, which seems to make more sense for the sake of portability. He also pointed out that the method FileString.splitlines(), which returns a generator yielding lines from the memory-mapped file, was inefficient due to repeated calls to chr(). I ultimately rewrote it in C, and although its performance is really only important for commented-header files with the header line deep in the file, I managed to get more than a 2x speedup with the new approach on a 10,000-line integer file with a commented header line in the last row.
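To give an idea of the mmap approach, here is a minimal Python sketch of a line generator over a memory-mapped file. It is not the FileString implementation (which ended up in C); the function name iter_lines is an assumption for illustration, it only splits on "\n", and it assumes a non-empty file, since mapping an empty file raises an error.

```python
import mmap
import os
import tempfile

def iter_lines(path):
    """Yield each line of a non-empty file as bytes, without the newline.

    Lines are located with mmap.find(), so no per-character Python work
    (such as calling chr() on every byte) is needed.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            size = len(mm)
            while start < size:
                end = mm.find(b"\n", start)
                if end == -1:          # last line has no trailing newline
                    end = size
                yield mm[start:end]    # slicing an mmap returns bytes
                start = end + 1

# Small demonstration: a data file whose commented header is the last row.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"1 2\n3 4\n# a b\n")
lines = list(iter_lines(path))
os.remove(path)
print(lines)
```

Because the mmap module handles the platform differences, the same code runs on Linux and Windows without any #ifdef-style branching.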
Although it won't be a part of the main PR, I've also been working on a separate branch change-memory-layout which changes the storage of output in memory in the tokenizer. The main purpose of this branch is to reduce the memory footprint of parsing, as the peak memory usage is almost twice that of Pandas; the basic idea is that instead of storing output in char **output_cols, it's stored instead in a single string char *output and an array of pointers, char **line_ptrs, records the beginning of each line for conversion purposes. While I'm still working on memory improvements, I actually managed to get a bit of a speed boost with this approach. Pure floating-point data is now slightly quicker to read with io.ascii than with Pandas, even without multiprocessing enabled!
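The layout idea can be sketched in Python, with a bytearray standing in for char *output and a list of integer offsets standing in for char **line_ptrs. The NUL byte used as a field terminator and the function names are assumptions made for this illustration, not details taken from the branch.

```python
def tokenize(text, delimiter=","):
    """Store all fields in one contiguous buffer and record where each row starts."""
    output = bytearray()      # plays the role of `char *output`
    line_starts = []          # plays the role of `char **line_ptrs` (as offsets)
    for line in text.splitlines():
        line_starts.append(len(output))
        for field in line.split(delimiter):
            # Assumed convention: every field is terminated by a NUL byte.
            output += field.strip().encode("ascii") + b"\x00"
    return bytes(output), line_starts

def fields_of_row(output, line_starts, row):
    """Recover the fields of one row from the shared buffer."""
    start = line_starts[row]
    end = line_starts[row + 1] if row + 1 < len(line_starts) else len(output)
    # The row ends with a terminator, so drop the empty trailing piece.
    return output[start:end].split(b"\x00")[:-1]

output, line_starts = tokenize("1,2.5,abc\n4,5.5,def")
print(line_starts)
print(fields_of_row(output, line_starts, 1))
```

Compared with one growable buffer per column, a single buffer needs only one allocation to grow and keeps each row's fields adjacent in memory.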
Since today is the absolute pencils down date, this marks the official end of the coding period and the end of my blog posts. I plan to continue responding to the review of the main PR and finish up the work in my new branch, but the real work of the summer is basically over. It's been a great experience, and I'm glad I was able to learn a lot and get involved in Astropy development!

Tuesday, August 12, 2014

Week 12

There's not too much to report for this week, as I basically worked on making some final changes and double-checking the writing code to make sure it works with the entire functionality of the legacy writer. After improving performance issues related to tokenization and string conversion, I created a final version of the IPython notebook for reading. Since IPython doesn't work well with multiprocessing, I wrote a separate script to test the performance of the fast reader in parallel and output the results in an HTML file; here are the results on my laptop. Parallel reading seems to work well for very large input files, and I guess the goal of beating Pandas (at least for huge input and ordinary data) is basically complete! Writing is still a little slower than the Pandas method to_csv, but I fixed an issue involving custom formatting; the results can be viewed here.
I also wrote up a separate section in the documentation for fast ASCII I/O, although there's still the question of how to incorporate IPython notebooks in the documentation. For now I have the notebooks hosted in a repo called ascii-profiling, but they may be moved to a new repo called astropy-notebooks. More importantly, Tom noticed that there must actually be something wrong with the fast converter (xstrtod()), since the potential conversion error seems to grow linearly with the number of significant figures. After looking over xstrtod() and reading more about IEEE floating-point arithmetic, I found a reasonable solution by forcing xstrtod() to stop parsing digits after the 17th digit (since 17 significant decimal digits are enough to identify any double uniquely) and by correcting an issue in the second half of xstrtod(), where the significand is scaled by a power of ten. I tested the new version of xstrtod() in the conversion notebook and found that low-precision values are now guaranteed to be within 0.5 ULP, while high-precision values are within 1.0 ULP about 90% of the time with no linear growth in error.
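The general shape of this kind of converter can be sketched in Python. This is a simplified stand-in, not the actual xstrtod() code: the name naive_strtod and the max_digits parameter are invented for illustration, it assumes well-formed input such as "-12.5e3", and it only handles exponents of modest size.

```python
def naive_strtod(text, max_digits=17):
    """Accumulate digits into a float significand, then scale by a power of ten.

    Digits beyond `max_digits` significant digits are not accumulated;
    they only affect the decimal exponent.
    """
    mantissa, _, exp_part = text.lower().partition("e")
    exponent = int(exp_part) if exp_part else 0
    sign = -1.0 if mantissa.startswith("-") else 1.0
    int_part, _, frac_part = mantissa.lstrip("+-").partition(".")

    significand = 0.0
    used = 0  # significant digits consumed so far (leading zeros excluded)
    for i, ch in enumerate(int_part + frac_part):
        in_fraction = i >= len(int_part)
        if used < max_digits:
            significand = significand * 10.0 + (ord(ch) - ord("0"))
            if significand != 0.0:
                used += 1
            if in_fraction:
                exponent -= 1      # each fractional digit shifts the scale down
        elif not in_fraction:
            exponent += 1          # a skipped integer digit still shifts the scale up

    # Scale once at the end; dividing avoids using an inexact 10**-n factor.
    if exponent >= 0:
        return sign * significand * 10.0 ** exponent
    return sign * significand / 10.0 ** (-exponent)

print(naive_strtod("1.5"), naive_strtod("-0.001"))
print(naive_strtod("123456789012345678901234567890"))
```

Without the digit cutoff, every extra digit adds another rounding step to the significand, which is one plausible way for the error to keep growing with the number of digits.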
Once I commit the new xstrtod(), my PR should be pretty close to merging--at this point I'll probably write some more tests just to make sure everything works okay. Today is the suggested "pencils down" date of Google Summer of Code, so I guess it's time to wrap up.

Monday, August 4, 2014

Week 11

Since the real goal at this point is to finish up my main PR and my multiprocessing branch in order to merge, I ended up spending this week on final changes instead of Erik's dtype idea. My code's gotten some more review and my mentors and I have been investigating some details and various test cases, which should be really useful for documentation.
One nice thing I managed to discover was how well xstrtod() (the Pandas-borrowed fast float conversion function) works for various input precisions. Unlike strtod(), which is guaranteed to be within 0.5 ULP (units in the last place, i.e. the gap between two adjacent floating-point numbers) of the correct result, xstrtod() has no general bound and in fact might be off by several ULP for input with numerous significant figures. However, it works pretty well when the number of significant figures is relatively low, so users might prefer to choose use_fast_converter=True for fairly low-precision data. I wrote up an IPython notebook showing a few results, which Tom also built on in another notebook. I plan to include the results in the final documentation, as users might find it useful to know more about rounding issues with parsing and judge whether or not the fast converter is appropriate for their purposes.
On the multiprocessing branch, I added in xstrtod() and the use_fast_converter parameter, which defaults to False. After discussion with my mentors, I changed the file reading system so that the parser employs memory mapping whenever it reads from a file; the philosophy is that, aside from speed gains, reading a 1 GB file via memory mapping will save users from having to load a full gigabyte into memory. The main challenge with memory mapping (and later with other compatibility concerns) is getting the code to run correctly on Windows, which turned out to be more frustrating than I expected.
Since Windows doesn't have the POSIX function mmap(), specific Windows memory mapping code has to be wrapped in an #ifdef _WIN32 block, and the fact that Windows has no fork() call for multiprocessing means that memory is not simply copy-on-write as it is on Linux, which leads to a host of other issues. For example, I first ran into a weird issue involving pickling that ultimately turned out to be due to a six bug, which has been noted for versions < 1.7.0. I opened a PR to update the bundled version of six in Astropy, so that issue should be fixed pretty quickly. There were some other problems, such as the fact that processes cannot be created with bound methods in Windows (which I circumvented by turning _read_chunk into a normal method and making CParser picklable via a custom __reduce__ method), but things seem to work correctly on Windows now. I actually found that there was a slowdown in switching to parallel reading, but I couldn't find a cause with cProfile; it might have been the fact that I used a very old laptop for testing, so I'll have to find some other way to see if there's actually a problem.
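The __reduce__ trick can be illustrated with a toy class in Python. ToyParser and its attributes are invented for this sketch and are not the real CParser; the module-level _read_chunk function here is just one way to avoid handing a bound method to a worker process.

```python
import pickle

class ToyParser:
    """Stand-in for a parser that holds state which cannot be pickled."""

    def __init__(self, source, delimiter=","):
        self.source = source
        self.delimiter = delimiter
        # A memoryview cannot be pickled, much like a C-level buffer.
        self._buffer = memoryview(source.encode("ascii"))

    def __reduce__(self):
        # Tell pickle to rebuild the object from its constructor arguments
        # instead of trying to copy the unpicklable internal state.
        return (self.__class__, (self.source, self.delimiter))

    def read_chunk(self, start, stop):
        rows = self.source.splitlines()[start:stop]
        return [row.split(self.delimiter) for row in rows]

def _read_chunk(parser, start, stop):
    """Module-level function, so it can be sent to a process on Windows."""
    return parser.read_chunk(start, stop)

parser = ToyParser("a,b\n1,2\n3,4")
clone = pickle.loads(pickle.dumps(parser))   # what a spawned process would receive
chunk = _read_chunk(clone, 1, 3)
print(chunk)
```

Without fork(), each worker starts from a fresh interpreter and gets its arguments by pickling, so everything passed to it has to survive this round trip.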
Tom also wrote a very informative IPython notebook detailing the performance of the new implementation compared to the old readers, genfromtxt(), and Pandas, which I'll include in the documentation for the new fast readers. It was also nice to see an interesting discussion regarding metadata parsing in issue #2810 and a new PR to remove boilerplate code, which is always good. I also made a quick fix to the HTML reading tests and opened a PR to allow for a user-specified backend parser in HTML reading, as Tom pointed out that certain files will work with one backend (e.g. html5lib) and not the default.