Tuesday, August 12, 2014

Week 12

There's not too much to report for this week, as I basically worked on making some final changes and double-checking the writing code to make sure it works with the entire functionality of the legacy writer. After improving performance issues related to tokenization and string conversion, I created a final version of the IPython notebook for reading. Since IPython doesn't work well with multiprocessing, I wrote a separate script to test the performance of the fast reader in parallel and output the results in an HTML file; here are the results on my laptop. Parallel reading seems to work well for very large input files, and I guess the goal of beating Pandas (at least for huge input and ordinary data) is basically complete! Writing is still a little slower than the Pandas method to_csv, but I fixed an issue involving custom formatting; the results can be viewed here.
I also wrote up a separate section in the documentation for fast ASCII I/O, although there's still the question of how to incorporate IPython notebooks in the documentation. For now I have the notebooks hosted in a repo called ascii-profiling, but they may be moved to a new repo called astropy-notebooks. More importantly, Tom noticed that there must actually be something wrong with the fast converter (xstrtod()), since increasing the number of significant figures seems to scale the potential conversion error linearly. After looking over xstrtod() and reading more about IEEE floating-point arithmetic, I found a reasonable solution by forcing xstrtod() to stop parsing digits after the 17th digit (since doubles can only have a maximum precision of 17 digits) and by correcting an issue in the second half of xstrtod(), where the significand is scaled by a power of ten. I tested the new version of xstrtod() in the conversion notebook and found that low-precision values are now guaranteed to be within 0.5 ULP, while high-precision values are within 1.0 ULP about 90% of the time with no linear growth in error.
Once I commit the new xstrtod(), my PR should be pretty close to merging--at this point I'll probably write some more tests just to make sure everything works okay. Today is the suggested "pencils down" date of Google Summer of Code, so I guess it's time to wrap up.

No comments:

Post a Comment