Since the real goal at this point is to finish up my main PR and my multiprocessing branch in order to merge, I ended up spending this week on final changes instead of Erik's dtype idea. My code's gotten some more review and my mentors and I have been investigating some details and various test cases, which should be really useful for documentation.
One nice thing I managed to discover was how well
xstrtod()
(the Pandas-borrowed fast float conversion function) works for various input precisions. Unlike strtod()
, which is guaranteed to be within 0.5 ULP (units in the last place, or the distance between the two closest floating-point numbers) of the correct result, xstrtod()
has no general bound and in fact might be off by several ULP for input with numerous significant figures. However, it works pretty well when the number of significant figures is relatively low, so users might prefer to choose use_fast_convert=True
for fairly low-precision data. I wrote up an IPython notebook showing a few results, which Tom also built on in another notebook. I plan to include the results in the final documentation, as users might find it useful to know more about rounding issues with parsing and judge whether or not the fast converter is appropriate for their purposes.
On the multiprocessing branch, I added in
xstrtod()
and the use_fast_converter
parameter, which defaults to False
. After discussion with my mentors, I changed the file reading system so that the parser employs memory mapping whenever it reads from a file; the philosophy is that, aside from speed gains, reading a 1 GB file via memory mapping will save users from having to load a full gigabyte into memory. The main challenge with memory mapping (and later with other compatibility concerns) is getting the code to run correctly on Windows, which turned out to be more frustrating than I expected.
Since Windows doesn't have the POSIX function
memmap()
, specific Windows memory mapping code has to be wrapped in an #ifdef _WIN32
block, and the fact that Windows has no fork()
call for multiprocessing means that memory is not simply copy-on-write as it is on Linux, which leads to a host of other issues. For example, I first ran into a weird issue involving pickling that ultimately turned out to be due to a six
bug, which has been noted for versions < 1.7.0. I opened a PR to update the bundled version of six in AstroPy, so that issue should be fixed pretty quickly. There were some other problems, such as the fact that processes cannot be created with bound methods in Windows (which I circumvented by turning _read_chunk
into a normal method and making CParser
picklable via a custom __reduce__
method), but things seem to work correctly on Windows now. I actually found that there was a slowdown in switching to parallel reading, but I couldn't find a cause with cProfile; it might have been the fact that I used a very old laptop for testing, so I'll have to find some other way to see if there's actually a problem.
Tom also wrote a very informative IPython notebook detailing the performance of the new implementation compared to the old readers,
genfromtxt()
, and Pandas, which I'll include in the documentation for the new fast readers. It was also nice to see an interesting discussion regarding metadata parsing in issue #2810 and a new PR to remove boilerplate code, which is always good. I also made a quick fix to the HTML reading tests and opened a PR to allow for a user-specified backend parser in HTML reading, as Tom pointed out that certain files will work with one backend (e.g. html5lib) and not the default.
No comments:
Post a Comment