Google Summer of Code: Week 9

This week was a pretty straightforward one--I spent most of it working on implementing parallelization in the Cython reading algorithm (in a new branch, multiprocessing). After Tuesday's discussion about what problem to pursue next, we decided that I should spend a week or two on parallelization and then begin to work on creating custom numpy dtypes, which will be an extension of Erik's work and should be applicable to FITS files as well as ASCII to some extent.

In short, the new strategy I implemented for reading involves splitting input data into N chunks (I currently have N=8, but I'll have to find an optimal N at some point) and performing both tokenization and conversion in a different process for each chunk. There are a few subtleties involved; for example, chunks need to be separated by newlines, certain data has to be copied for each process, etc. One irritating issue I came across was one with conversion in which one process discovers string data and another discovers numeric data in the same column. The basic idea at the end of the algorithm is to take the resulting data from each process and concatenate the chunks together, using widening conversions (i.e. int to float to str) when chunks have different data types. However, if a column contains integers in the first chunk, floats in the second, and strings in the third, concatenating the first two would turn 5 to 5.0 and then concatenating the first two chunks with the third would result in "5.0" instead of the desired "5". The obvious solution is to check explicitly for string data in any of the chunks and immediately convert all the columns to strings, but this still results in improper behavior for numbers in scientific notation, for example (1.3e2 -> 13 -> "13"). I ended up simply using a multiprocessing.Queue to force processes to reconvert columns as strings in this case, discarding the original converted data altogether. Luckily this seems like a pretty unlikely corner case for large data files, so performance shouldn't really take a hit.

A more serious problem, and one I haven't quite decided how to handle, is how to deal with quoted fields spanning multiple lines. Since my approach involves splitting data into chunks by lines, an input string like 'a b c\n1 2 "3\n4"' will not parse correctly because one chunk will try parsing the line 1 2 "3 (noticing an unfinished quote sequence) and another will try parsing 4" (noticing an unexpected quote character and too few data values in the row). Unfortunately no good solution comes to mind, since explicitly searching through the input for quote characters would basically negate the performance gains of multiprocessing in the first place. I plan on asking my mentors what they think, since I'm not sure how commonly this sort of case occurs.

All tests are passing with the new multiprocessing algorithm except for those involving the quoting issue, and there is a speed gain over the old method. Unfortunately it's a little less than I hoped it would be (Tom warned that this could happen, citing Amdahl's Law), but still fairly nice; on a sample file that took the old algorithm about 0.11 seconds (the same file I've been using for a few weeks...I should probably start trying other ones), the new algorithm takes about 0.075 seconds, a speed gain of about 45%. I've been working on improving the reader's performance with reading from file, since inputting the file into a single string takes up about 0.012 seconds of the reading time, although I'm not yet finished. Table.__init__() is also taking about 0.01 seconds, which is probably unavoidable. Pandas is taking about 0.055 seconds on the same file. It's getting trickier to cut down time at this point, but hopefully I can edge out Pandas in the end!

Google Summer of Code

Monday, July 21, 2014

Week 9

No comments:

Post a Comment