After improving the fast writer and fixing up some lingering issues last week, I spent this week looking into porting my reading implementation to numpy and reading the memory mapping code in 
io.fits to learn more about memory mapping. This felt like a bit of a slow week, since I was sick on Thursday and had trouble figuring out a problem with numpy's build system on Friday, but I think my work on adapting my implementation to genfromtxt() should definitely come in useful (at least at some point when I have time to finish the work).
I tidied up the code and made a few final PRs on Monday and Tuesday, and then at Tom's suggestion I opened a PR for review (https://github.com/astropy/astropy/pull/2716#issuecomment-48749423). Since it's a big PR, I expect it'll take some time to receive an extensive code review, but it's nice to have a roughly final implementation that passes on Travis CI. I also opened a PR in astropy-benchmarks for the 
io.ascii benchmarks I wrote at the beginning of the summer (https://github.com/astropy/astropy-benchmarks/pull/7). From there I went on to look at the memory mapping in io.fits and read about memory mapping in general, and I'm sure this will be something to discuss in the next hangout; I'm curious whether the Python mmap module will work with an underlying C layer, or how else memory mapping might be achieved.
More importantly, I spent much of the week changing my text reading code to work in 
numpy as an engine for genfromtxt(). After Tom pointed me to a thread on the numpy discussion list, I forked numpy and started writing a new version of genfromtxt() in a separate branch: https://github.com/amras1/numpy/tree/faster-genfromtxt. I also spoke to a numpy developer on IRC who participated in the mailing list thread, and he believed that my implementation might be of use. He also explained that a particularly important problem with loadtxt() and genfromtxt() is their broken handling of unicode with Python 3, which necessitates a full rewrite of the two functions. After that, I worked on altering the C tokenizing code to handle UTF-8 rather than plain ASCII and wrote related code to have the Cython/C parser work underneath genfromtxt(). As of now, genfromtxt() is working with relatively simple sample data, although there's still more to do with handling the function's parameters (e.g. missing_values and usemask).
Adapting the Cython/C reader to numpy has been a little more difficult than I thought, but I'm glad to have learned a lot of useful information about unicode and the differences between Python 2.x and 3.x. I'm not really sure what I'll be doing next week, but I'll know after the hangout meeting tomorrow.
 
Great. I just sent you a comment but it seems that blogspot ate it. Basically to summarize it sounds like your work on the C code to handle UTF-8 better might also be applicable to the Pandas project, where I recently encountered an issue with exactly that. Remind me to see if I can get you more details including a way to reproduce it.
ReplyDelete