There's not too much to report for this week, as I mostly made some final changes and double-checked the writing code to make sure it supports the full functionality of the legacy writer. After fixing performance issues related to tokenization and string conversion, I created a final version of the IPython notebook for reading. Since IPython doesn't work well with multiprocessing, I wrote a separate script to test the performance of the fast reader in parallel and output the results in an HTML file; here are the results on my laptop. Parallel reading seems to work well for very large input files, and I guess the goal of beating Pandas (at least for huge input and ordinary data) is basically complete! Writing is still a little slower than the Pandas method to_csv, but I fixed an issue involving custom formatting; the results can be viewed here.
I also wrote up a separate section in the documentation for fast ASCII I/O, although there's still the question of how to incorporate IPython notebooks in the documentation. For now I have the notebooks hosted in a repo called ascii-profiling, but they may be moved to a new repo called astropy-notebooks. More importantly, Tom noticed that there must actually be something wrong with the fast converter (xstrtod()), since the potential conversion error seems to grow linearly with the number of significant figures in the input. After looking over xstrtod() and reading more about IEEE floating-point arithmetic, I found a reasonable solution by forcing xstrtod() to stop parsing digits after the 17th digit (since a double carries at most 17 significant decimal digits of precision) and by correcting an issue in the second half of xstrtod(), where the significand is scaled by a power of ten. I tested the new version of xstrtod() in the conversion notebook and found that low-precision values are now guaranteed to be within 0.5 ULP, while high-precision values are within 1.0 ULP about 90% of the time with no linear growth in error.
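To make the digit cutoff and the ULP measurement concrete, here is a minimal Python sketch. It is not the actual xstrtod() code (which is written in C); the names naive_strtod and ulp_error are hypothetical, the parser handles only plain decimal strings with moderate exponents, and math.ulp requires Python 3.9 or later.

```python
import math
from fractions import Fraction

MAX_DIGITS = 17  # digit cutoff described in the post


def naive_strtod(text, max_digits=MAX_DIGITS):
    """Toy parser (hypothetical, not xstrtod()): keep at most max_digits
    significant digits, then scale the significand by a power of ten."""
    mantissa, _, exp_part = text.strip().lower().partition("e")
    exponent = int(exp_part) if exp_part else 0
    negative = mantissa.startswith("-")
    int_part, _, frac_part = mantissa.lstrip("+-").partition(".")
    exponent -= len(frac_part)                   # digits after the point shift the exponent
    digits = (int_part + frac_part).lstrip("0")  # leading zeros are not significant
    kept = digits[:max_digits]                   # stop parsing after the cutoff...
    exponent += len(digits) - len(kept)          # ...but account for the dropped digits
    significand = float(int(kept)) if kept else 0.0
    # Assumption: the exponent is moderate; overflow and underflow are not handled.
    scale = float(10 ** abs(exponent))           # exact for |exponent| <= 22
    value = significand * scale if exponent >= 0 else significand / scale
    return -value if negative else value


def ulp_error(text, parsed):
    """Distance between the parsed double and the exact decimal value, in ULPs."""
    exact = Fraction(text)  # exact rational value of the decimal string
    return float(abs(Fraction(parsed) - exact) / Fraction(math.ulp(parsed)))


for s in ("0.001", "-2.5e3", "0.12345678901234567890123"):
    x = naive_strtod(s)
    print(s, x, ulp_error(s, x))
```

Using exact rational arithmetic for the reference value keeps the error measurement itself free of rounding, which is one way to check that the error stays bounded as more digits are added to the input.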
Once I commit the new xstrtod(), my PR should be pretty close to merging. At this point I'll probably write some more tests just to make sure everything works okay. Today is the suggested "pencils down" date of Google Summer of Code, so I guess it's time to wrap up.