Google Summer of Code: Week 7

After the bulk work of the last two weeks, I spent this week mostly fixing minor issues with the fast readers and adding more functionality. I think I can finally say that the C reader is more or less done. I might want to work more on getting its speed closer to that of Pandas' read_csv(), but it's currently stable and passes both the tests I've written and the old tests (I added parametrized fixtures to the generic reading tests).

As Tom said in our last hangout, there seem to be a few directions to choose from here: improving the C reader's speed, implementing a fast writer, writing more tests/documentation, etc. This week I opted to work on the fast writer after finishing up lingering issues with the functionality of the C reader, but I'd like to address the other issues as well if time permits. Tom also mentioned that numpy is looking for an overhaul of loadtxt() and genfromtxt(), so if possible I'd like to offer up the code I've written for adaptation in numpy. This is a low priority at the moment, since I'm focusing on tightening up the code for Astropy, but at any rate I may be able to donate the code even if I don't have time to work on adapting it to genfromtxt()/loadtxt(). The developers on the numpy mailing list mentioned looking into Pandas, but they might find my implementation more adaptable (albeit slower).

During the first half of the week, I fixed some issues I had previously put off, such as right-stripping whitespace from table fields in the tokenizer and adding a use_fast_reader parameter to the io.ascii infrastructure. I also added more comments and wrote up a quick design document. I then added new fast readers for commented-header and RDB files after making the _read_header() method in FastBasic more extensible. One idea that Tom had was to simply use the old header classes to read file headers, since header reading takes a negligible amount of time, but I found that the headers were too closely tied to the BaseReader infrastructure to use them without modification. For now, at least, overriding _read_header() seems like a reasonable way to allow for some flexibility in header reading.
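To illustrate the idea of overriding _read_header(), here is a minimal pure-Python sketch. It is not the Astropy code: the class names (BasicSketch, CommentedHeaderSketch, RdbSketch), the _split() helper, and the return shape of _read_header() are all made up for illustration, and the sketch ignores comment lines, quoting, and type conversion. It assumes the commented-header format puts the column names on a first line starting with "#", and that RDB is tab-delimited with a names line followed by a types line.

```python
class BasicSketch:
    """Hypothetical stand-in for a fast reader; not Astropy's FastBasic."""

    delimiter = " "

    def _split(self, line):
        # Whitespace-delimited by default; otherwise split on the delimiter
        # and strip surrounding whitespace from each field.
        if self.delimiter == " ":
            return line.split()
        return [field.strip() for field in line.split(self.delimiter)]

    def _read_header(self, lines):
        """Return (column names, metadata, remaining data lines)."""
        return self._split(lines[0]), {}, lines[1:]

    def read(self, text):
        # Shared driver: only the header step varies between formats.
        lines = [ln for ln in text.splitlines() if ln.strip()]
        names, meta, data_lines = self._read_header(lines)
        rows = [self._split(ln) for ln in data_lines]
        return names, meta, rows


class CommentedHeaderSketch(BasicSketch):
    def _read_header(self, lines):
        # Column names live on a first line that starts with '#'.
        return self._split(lines[0].lstrip("#")), {}, lines[1:]


class RdbSketch(BasicSketch):
    delimiter = "\t"

    def _read_header(self, lines):
        # First line: column names; second line: column types.
        names = self._split(lines[0])
        types = self._split(lines[1])
        return names, {"types": dict(zip(names, types))}, lines[2:]
```

With this layout, each new format only has to say how its header differs; the tokenizing and row handling in read() stay shared.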

After that, I worked on creating a fast writing system in Cython. I ran into a number of problems when I tried out my algorithm by parametrizing fixtures in the writing test suite, particularly with masking, but I got it working after a while. I also added some format-specific handling to the current fast readers in order to deal with specific writing issues (e.g. omitting the header line, writing column types in RDB, etc.). As of right now the implementation is passing all tests, but it could use a little more customization and definitely isn't fast enough: I found the fast writer took 2.8 seconds to write a file that took 3.5 seconds to write without it, only a 20% reduction in write time (a 1.25x speedup). Profiling has been a little tricky, since I can't find a line profiler for Cython and sometimes splitting the code into subroutines doesn't work well (since the overhead of function calls becomes a big factor). However, it's definitely clear that a lot of time is spent in the iter_str_vals() method called on each column for string iteration. I should be able to find a way to cut the writing time down significantly.
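As a rough illustration of function-level profiling (as opposed to line profiling), here is a small sketch using the standard library's cProfile and pstats. Everything in it is a toy: the iter_str_vals() below is a stand-in generator I wrote for the example, not Astropy's method, and write_table() and profile_write() are hypothetical helpers. Note that Cython functions only show up in cProfile output when the module is compiled with Cython's profile directive enabled.

```python
import cProfile
import io
import pstats


def iter_str_vals(column):
    """Toy stand-in for per-column string iteration (not Astropy's method)."""
    for value in column:
        yield str(value)


def write_table(columns, delimiter=" "):
    """Stringify each column lazily, then join the values row by row."""
    str_cols = [iter_str_vals(col) for col in columns]
    return "\n".join(delimiter.join(row) for row in zip(*str_cols))


def profile_write(columns):
    """Run write_table under cProfile; return (output text, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    output = write_table(columns)
    profiler.disable()

    # Collect the report in a string, sorted by cumulative time.
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(10)
    return output, report.getvalue()


if __name__ == "__main__":
    cols = [list(range(100000)), [x * 0.5 for x in range(100000)]]
    _, text = profile_write(cols)
    print(text)
```

The report lists each function with its call count and cumulative time, which is enough to see whether string iteration dominates, though it says nothing about which lines inside a function are slow.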

Although we're not having a hangout this week, after I get some work done on the writing algorithm I plan to ask Tom which direction would be best to go in from here and at what point I should plan on opening a PR for my branch. According to the schedule I should be looking into the possibility of memory mapping soon, but Tom said he's not really sure if it's a feasible idea; I guess that will be tabled for a little while, but we'll see what happens.


Monday, July 7, 2014
