Monday, July 28, 2014

Week 10

I spent this week mostly working on the multiprocessing branch, fixing up some issues from last week and adding a bit more functionality. Most importantly, I finally switched to a more efficient scheme for reading data when given a filename as input. I'd been meaning to deal with this for a while, since profiling indicated that the naïve method of simply reading the file into a single Python string took up something like 15% of processing time, so it's nice to have that overhead reduced.
One idea I had considered previously was to have each process read its own portion of the file, but my initial changes showed a speed decrease, which made sense after I came across a comment on StackOverflow explaining that, from a performance standpoint, it's best to access file data sequentially. I then switched to reading a separate chunk for each process in the main process at the beginning of parsing and passing each chunk to its respective process; this seems more promising but still runs into issues with memory management. I've been trying to find a better solution today, and I think I should be able to figure it out by tomorrow.
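Here's a minimal pure-Python sketch of that idea, with the chunks read sequentially in the main process and split on newline boundaries before being handed to worker processes. The read_chunks() and parse_chunk() names are mine (parse_chunk() is just a stand-in for the real Cython tokenizer), and data.txt is a placeholder filename:

    import multiprocessing as mp

    def read_chunks(filename, n_chunks):
        # Read the file once, sequentially, in the main process and return
        # roughly equal chunks that each end on a newline boundary.
        with open(filename, "rb") as f:
            data = f.read()
        chunk_size = len(data) // n_chunks or 1
        chunks, start = [], 0
        while start < len(data):
            end = min(start + chunk_size, len(data))
            # Extend the chunk to the next newline so no row is split in two.
            end = data.find(b"\n", end)
            end = len(data) if end == -1 else end + 1
            chunks.append(data[start:end])
            start = end
        return chunks

    def parse_chunk(chunk):
        # Stand-in for the real tokenization/conversion done in each process.
        return len(chunk.splitlines())

    if __name__ == "__main__":
        chunks = read_chunks("data.txt", 8)
        with mp.Pool(processes=len(chunks)) as pool:
            row_counts = pool.map(parse_chunk, chunks)
        print(sum(row_counts))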
Another issue I looked into was finding a faster algorithm for converting strings to doubles, motivated by Pandas' superior performance. After looking into what Pandas does, I found that it replaces strtod() from the standard library with a fairly simple conversion function, xstrtod(); from what I was told by a Pandas developer, it seems that Pandas considers the speed gains more important than retaining the relatively slow correction loop in strtod() for high-precision values and is therefore willing to have values off by more than 0.5 ULP (units in the last place). numpy's conversion routine doesn't seem to offer a clear benefit (I tried replacing strtod() with numpy's conversion and got mixed results), so I'm not sure whether further optimization is possible here.
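For illustration, here's roughly what an xstrtod()-style conversion looks like, rewritten in Python (the naive_float() name and the sample values are mine; the real routine is C). It accumulates digits into a double and then scales by a power of ten, with no correction loop, so the result can differ from a correctly rounded conversion in the last bits:

    def naive_float(s):
        # Simplified xstrtod()-style parsing; assumes a well-formed, finite number.
        s = s.strip()
        i, sign = 0, 1.0
        if s[i] in "+-":
            sign = -1.0 if s[i] == "-" else 1.0
            i += 1
        value, exponent = 0.0, 0
        while i < len(s) and s[i].isdigit():
            value = value * 10.0 + (ord(s[i]) - ord("0"))
            i += 1
        if i < len(s) and s[i] == ".":
            i += 1
            while i < len(s) and s[i].isdigit():
                value = value * 10.0 + (ord(s[i]) - ord("0"))
                exponent -= 1
                i += 1
        if i < len(s) and s[i] in "eE":
            i += 1
            exp_sign = 1
            if s[i] in "+-":
                exp_sign = -1 if s[i] == "-" else 1
                i += 1
            exponent += exp_sign * int(s[i:])
        # Scaling by a power of ten adds extra rounding, which strtod()'s
        # slower correction loop would normally fix.
        return sign * value * 10.0 ** exponent

    for s in ("1.3e2", "1.2345678901234567", "9.109383701528e-31"):
        exact, fast = float(s), naive_float(s)
        print(s, exact == fast, abs(exact - fast))  # high-precision values may disagree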
I then added a parallel parameter for reading and patched up the failing tests (they failed because of multi-line quoted values, which Tom suggested should be treated as an acceptable but documented limitation of parallelized reading). I also fixed a unicode issue with string conversion in Python 3, which arose because the 'S' dtype in numpy corresponds to bytes in Python 3 rather than str. Interestingly, I found it a fairly trivial extension of the file reading code to implement memory mapping for reading on non-Windows systems, so I tentatively added it with a memory_map keyword and will test it more thoroughly to see whether the performance boost is significant. This does not enable a memory-mapped view of the file inside the output Table; it simply memory-maps the file to a char * pointer for faster reading and then un-maps it after processing is done.
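For reference, here's a Python-level sketch of what the memory_map path does conceptually (the real code maps the file to a char * in C; read_rows_mmapped() is just an illustrative stand-in that counts rows):

    import mmap

    def read_rows_mmapped(filename):
        # Map the file read-only, "parse" it, and unmap it when done; nothing
        # memory-mapped survives into the output, as described above.
        with open(filename, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # Stand-in for handing the underlying buffer to the C tokenizer:
            # just count newline-terminated rows.
            return sum(1 for _ in iter(buf.readline, b""))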
Next week I'll be working on Erik's idea of creating custom numpy dtypes, and I'll also add documentation, make some final fixes, and otherwise get the engine ready for merging in the meantime.

Monday, July 21, 2014

Week 9

This week was a pretty straightforward one--I spent most of it working on implementing parallelization in the Cython reading algorithm (in a new branch, multiprocessing). After Tuesday's discussion about what problem to pursue next, we decided that I should spend a week or two on parallelization and then begin to work on creating custom numpy dtypes, which will be an extension of Erik's work and should be applicable to FITS files as well as ASCII to some extent.
In short, the new strategy I implemented for reading involves splitting input data into N chunks (I currently have N=8, but I'll have to find an optimal N at some point) and performing both tokenization and conversion in a different process for each chunk. There are a few subtleties involved; for example, chunks need to be separated at newlines, certain data has to be copied for each process, etc. One irritating issue I came across involves conversion when one process discovers string data and another discovers numeric data in the same column. At the end of the algorithm, the basic idea is to take the resulting data from each process and concatenate the chunks together, using widening conversions (i.e. int to float to str) when chunks have different data types. However, if a column contains integers in the first chunk, floats in the second, and strings in the third, concatenating the first two would turn 5 into 5.0, and then concatenating those chunks with the third would result in "5.0" instead of the desired "5". The obvious solution is to check explicitly for string data in any of the chunks and immediately convert all the columns to strings, but converting from the already-converted numeric values still gives improper results for numbers in scientific notation, for example (1.3e2 -> 130.0 -> "130.0" rather than the original "1.3e2"). I ended up simply using a multiprocessing.Queue to force processes to reconvert those columns as strings in this case, discarding the previously converted data altogether. Luckily this seems like a pretty unlikely corner case for large data files, so performance shouldn't really take a hit.
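Here's a small sketch of the concatenation rules described above (the function and variable names are mine, and the reconversion is done in the main process here for simplicity, whereas the actual code signals worker processes through a multiprocessing.Queue):

    import numpy as np

    def concatenate_column(chunks, raw_chunks):
        # chunks: converted numpy arrays, one per process
        # raw_chunks: the original field text for the same rows
        kinds = {c.dtype.kind for c in chunks}
        if "U" in kinds or "S" in kinds:
            # Some chunk ended up as strings: reconvert from the original text
            # so 1.3e2 stays "1.3e2" rather than becoming "130.0".
            return np.concatenate([np.asarray(r, dtype=str) for r in raw_chunks])
        # Otherwise let numpy apply the usual widening (int -> float).
        return np.concatenate(chunks)

    ints, floats, strs = np.array([5]), np.array([1.3e2]), np.array(["x"])
    raw = [["5"], ["1.3e2"], ["x"]]
    print(concatenate_column([ints, floats, strs], raw))  # ['5' '1.3e2' 'x']
    print(concatenate_column([ints, floats], raw[:2]))    # [  5. 130.]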
A more serious problem, and one I haven't quite decided how to handle, is how to deal with quoted fields spanning multiple lines. Since my approach involves splitting data into chunks by lines, an input string like 'a b c\n1 2 "3\n4"' will not parse correctly because one chunk will try parsing the line 1 2 "3 (noticing an unfinished quote sequence) and another will try parsing 4" (noticing an unexpected quote character and too few data values in the row). Unfortunately no good solution comes to mind, since explicitly searching through the input for quote characters would basically negate the performance gains of multiprocessing in the first place. I plan on asking my mentors what they think, since I'm not sure how commonly this sort of case occurs.
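The failure mode is easy to reproduce with the standard csv module standing in for the Cython tokenizer (purely for illustration; the real parser is different):

    import csv
    import io

    data = 'a b c\n1 2 "3\n4"'

    # Parsing the whole input at once handles the quoted newline correctly:
    print(list(csv.reader(io.StringIO(data), delimiter=" ")))
    # [['a', 'b', 'c'], ['1', '2', '3\n4']]

    # Chunking on newlines first splits the quoted field across two "rows",
    # so the quotes are mis-parsed and the second row comes out mangled:
    for chunk in data.split("\n"):
        print(list(csv.reader([chunk], delimiter=" ")))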
All tests are passing with the new multiprocessing algorithm except for those involving the quoting issue, and there is a speed gain over the old method. Unfortunately it's a little less than I hoped for (Tom warned that this could happen, citing Amdahl's Law), but it's still fairly nice: on a sample file that took the old algorithm about 0.11 seconds (the same file I've been using for a few weeks...I should probably start trying other ones), the new algorithm takes about 0.075 seconds, a speedup of roughly 45%. I've been working on improving the reader's performance when reading from a file, since reading the file into a single string takes up about 0.012 seconds of the total, although I'm not yet finished. Table.__init__() is also taking about 0.01 seconds, which is probably unavoidable. Pandas takes about 0.055 seconds on the same file. It's getting trickier to cut down time at this point, but hopefully I can edge out Pandas in the end!
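As a purely illustrative back-of-envelope check of the Amdahl's Law point (ignoring process start-up and data-transfer overhead entirely, so take it with a grain of salt):

    # speedup = 1 / ((1 - p) + p / N) for N processes and parallel fraction p
    N = 8
    measured_speedup = 0.11 / 0.075              # ~1.47x from the timings above
    # Inverting the formula gives the parallel fraction these timings imply.
    p = (1 - 1 / measured_speedup) / (1 - 1 / N)
    print(f"implied parallel fraction: {p:.0%}")  # roughly 35-40%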

Monday, July 14, 2014

Week 8

After improving the fast writer and fixing up some lingering issues last week, I spent this week looking into porting my reading implementation to numpy and reading the memory mapping code in io.fits to learn more about memory mapping. This felt like a bit of a slow week, since I was sick on Thursday and had trouble figuring out a problem with numpy's build system on Friday, but I think my work on adapting my implementation to genfromtxt() should definitely come in useful (at least at some point when I have time to finish the work).
I tidied up the code and made a few final PRs on Monday and Tuesday, and then at Tom's suggestion I opened a PR for review (https://github.com/astropy/astropy/pull/2716#issuecomment-48749423). Since it's a big PR, I expect it'll take some time to receive an extensive code review, but it's nice to have a roughly final implementation that passes on Travis CI. I also opened a PR in astropy-benchmarks for the io.ascii benchmarks I wrote at the beginning of the summer (https://github.com/astropy/astropy-benchmarks/pull/7). From there I went on to look at the memory mapping in io.fits and read about memory mapping in general, and I'm sure this will be something to discuss in the next hangout; I'm curious whether the Python mmap module will work with an underlying C layer, or how else memory mapping might be achieved.
More importantly, I spent much of the week changing my text reading code to work in numpy as an engine for genfromtxt(). After Tom pointed me to a thread on the numpy discussion list, I forked numpy and started writing a new version of genfromtxt() in a separate branch: https://github.com/amras1/numpy/tree/faster-genfromtxt. I also spoke to a numpy developer on IRC who participated in the mailing list thread, and he believed that my implementation might be of use. He also explained that a particularly important problem with loadtxt() and genfromtxt() is their broken handling of unicode with Python 3, which necessitates a full rewrite of the two functions. After that, I worked on altering the C tokenizing code to handle UTF-8 rather than plain ASCII and wrote related code to have the Cython/C parser work underneath genfromtxt(). As of now, genfromtxt() is working with relatively simple sample data, although there's still more to do with handling the function's parameters (e.g. missing_values and usemask).
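As a quick illustration of the unicode problem he mentioned: on the numpy of that era, string columns read by genfromtxt() under Python 3 came back as byte strings, since the data was handled as bytes internally (exact behavior varies between versions; this is a hedged example, not a claim about current numpy):

    import io
    import numpy as np

    data = io.BytesIO(b"name value\nalpha 1\nbeta 2")
    arr = np.genfromtxt(data, names=True, dtype=None)
    print(arr["name"])  # e.g. [b'alpha' b'beta'] rather than ['alpha' 'beta']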
Adapting the Cython/C reader to numpy has been a little more difficult than I thought, but I'm glad to have learned a lot of useful information about unicode and the differences between Python 2.x and 3.x. I'm not really sure what I'll be doing next week, but I'll know after the hangout meeting tomorrow.

Monday, July 7, 2014

Week 7

After the bulk of the work over the last two weeks, I spent this week mostly fixing up some minor issues with the fast readers and adding some more functionality. I think I can finally say that the C reader is more or less done; I might still want to work on getting its speed closer to that of Pandas' read_csv(), but it's currently stable and passes the tests I've written as well as the old tests (I added parametrized fixtures to the generic reading tests, as sketched below).
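The parametrized fixtures look roughly like the following hypothetical sketch; the keyword name follows the use_fast_reader parameter described below in this post, and isn't necessarily the final API:

    import pytest
    from astropy.io import ascii

    # Run every generic reading test twice: once with the ordinary Python
    # reader and once with the fast C reader.
    @pytest.fixture(params=[False, True])
    def use_fast_reader(request):
        return request.param

    def test_simple_table(use_fast_reader):
        t = ascii.read("a b c\n1 2 3\n4 5 6", use_fast_reader=use_fast_reader)
        assert t["b"][1] == 5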
As Tom said in our last hangout, there seem to be a few directions to choose from here: improving the C reader's speed, implementing a fast writer, writing more tests/documentation, etc. This week I opted to work on the fast writer after finishing up lingering issues with the functionality of the C reader, but I'd like to address the other issues as well if time permits. Tom also mentioned that numpy is looking for an overhaul of loadtxt() and genfromtxt(), so if possible I'd like to offer up the code I've written for adaptation in numpy. This is a low priority at the moment, since I'm focusing on tightening up the code for Astropy, but at any rate I may be able to donate the code even if I don't have time to work on adapting it to genfromtxt()/loadtxt()—the developers on the numpy mailing list mentioned looking into Pandas, but they might find my implementation more adaptable (albeit slower).
During the first half of the week, I fixed some issues I had previously delayed, such as right-stripping whitespace from table fields in the tokenizer and adding a use_fast_reader parameter to the io.ascii infrastructure, as well as adding more comments and writing up a quick design document. I also added new fast readers for commented-header and RDB files after making the _read_header() method in FastBasic more extensible. One idea that Tom had was to simply use the old header classes to read file headers, since header reading takes a negligible amount of time, but I found that the headers were too closely tied to the BaseReader infrastructure to use them without modification. For now, at least, overriding _read_header() seems like a reasonable way to allow for some flexibility in header reading (a rough sketch of the pattern is below).
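Here's a much-simplified sketch of the override pattern (the class and method names follow the readers described in this post, but everything else is hypothetical and not the actual astropy internals):

    class FastBasic:
        def __init__(self, lines):
            self.lines = lines
            self.names = None

        def _read_header(self):
            # Default: the first line holds space-separated column names.
            self.names = self.lines[0].split()
            return self.lines[1:]  # remaining lines are data

    class FastCommentedHeader(FastBasic):
        def _read_header(self):
            # Column names live in the first comment line instead.
            header = next(l for l in self.lines if l.startswith("#"))
            self.names = header.lstrip("# ").split()
            return [l for l in self.lines if not l.startswith("#")]

    reader = FastCommentedHeader(["# a b c", "1 2 3", "4 5 6"])
    data_lines = reader._read_header()
    print(reader.names, data_lines)  # ['a', 'b', 'c'] ['1 2 3', '4 5 6']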
After that, I worked on creating a fast writing system in Cython. I ran into a number of problems when I tried out my algorithm by parametrizing fixtures in the writing test suite, particularly with masking, but I got it working after a while. I also added some format-specific handling to the current fast readers in order to deal with specific writing issues (e.g. omitting the header line, writing column types in RDB, etc.). As of right now the implementation is passing all tests, but it could use a little more customization and definitely isn't fast enough—the fast writer took 2.8 seconds to write a file that takes about 3.5 seconds without it, only about a 20% reduction in writing time. Profiling has been a little tricky, since I can't find a line profiler for Cython and splitting code into subroutines sometimes doesn't work well (the overhead of function calls becomes a big factor). However, it's clear that a lot of time is spent in the iter_str_vals() method called on each column for string iteration, so I should be able to find a way to cut the writing time down significantly.
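The overall structure of the writer is essentially the following pure-Python sketch (write_table() is my own stand-in, not the Cython implementation), which also shows why iter_str_vals() dominates the profile: every single value has to pass through it as a Python string:

    import io
    from astropy.table import Table

    def write_table(table, fileobj, delimiter=" "):
        # Header row, then one delimiter-joined line per data row.
        fileobj.write(delimiter.join(table.colnames) + "\n")
        # One string iterator per column; producing these string values is
        # where most of the writing time goes.
        iterators = [table[name].iter_str_vals() for name in table.colnames]
        for row_vals in zip(*iterators):
            fileobj.write(delimiter.join(row_vals) + "\n")

    t = Table({"a": [1, 2], "b": [3.0, 4.5]})
    buf = io.StringIO()
    write_table(t, buf)
    print(buf.getvalue())  # "a b", "1 3.0", "2 4.5" (formatting may vary slightly)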
Although we're not having a hangout this week, after I get some work done on the writing algorithm I plan to ask Tom which direction would be best to go from here and at what point I should plan on opening a PR for my branch. According to the schedule I should be looking into the possibility of memory mapping soon, but Tom said he's not really sure whether it's a feasible idea; I guess that will be tabled for a little while, but we'll see what happens.