Monday, August 18, 2014

Week 13

This was the final week of Google Summer of Code, and since last Monday was the suggested "pencils down" date, I spent the week focusing on getting the main pull request ready for merging. I began by testing the new fast converter for unusual input, then handled issues Erik noted with the PR, filed an issue with Pandas, and began work on a new branch which implements a different memory scheme in the tokenizer. The PR seems to be in a final review stage, so hopefully it'll be merged by next week.
After testing out xstrtod(), I noticed a couple problems with extreme input values and fixed them; the most notable problem was an inability to handle subnormals (values with exponent less that -308). As of now, the converter seems to work pretty well for a wide range of input, and the absolute worst-case error seems to be around 3.0 ULP. Interestingly, when I reported the problems with the old xstrtod() as a bug in Pandas, the response I received was that the current code should remain, but a new parameter float_precision might be added to allow for more accurate conversion. Both Tom and I found this response a little bizarre, since the issues with xstrtod() seem quite buggy, but in any case I have an open PR to implement this in Pandas.
Aside from this, Erik pointed out some suggestions and concerns about the PR, which I dealt with in new commits. For example, he suggested that I use the mmap module in Python rather than dealing with platform-dependent memory mapping in C, which seems to make more sense for the sake of portability. He also pointed out that the method FileString.splitlines(), which returns a generator yielding lines from the memory-mapped file, was inefficient due to repeated calls to chr(). I ultimately rewrote it in C, and although its performance is really only important for commented-header files with header line deep into the file, I managed to get more than a 2x speedup on a 10,000-line integer file with a commented header line in the last row with the new approach.
Although it won't be a part of the main PR, I've also been working on a separate branch change-memory-layout which changes the storage of output in memory in the tokenizer. The main purpose of this branch is to reduce the memory footprint of parsing, as the peak memory usage is almost twice that of Pandas; the basic idea is that instead of storing output in char **output_cols, it's stored instead in a single string char *output and an array of pointers, char **line_ptrs, records the beginning of each line for conversion purposes. While I'm still working on memory improvements, I actually managed to get a bit of a speed boost with this approach. Pure floating-point data is now slightly quicker to read with io.ascii than with Pandas, even without multiprocessing enabled!
Since today is the absolute pencils down date, this marks the official end of the coding period and the end of my blog posts. I plan to continue responding to the review of the main PR and finish up the work in my new branch, but the real work of the summer is basically over. It's been a great experience, and I'm glad I was able to learn a lot and get involved in Astropy development!

Tuesday, August 12, 2014

Week 12

There's not too much to report for this week, as I basically worked on making some final changes and double-checking the writing code to make sure it works with the entire functionality of the legacy writer. After improving performance issues related to tokenization and string conversion, I created a final version of the IPython notebook for reading. Since IPython doesn't work well with multiprocessing, I wrote a separate script to test the performance of the fast reader in parallel and output the results in an HTML file; here are the results on my laptop. Parallel reading seems to work well for very large input files, and I guess the goal of beating Pandas (at least for huge input and ordinary data) is basically complete! Writing is still a little slower than the Pandas method to_csv, but I fixed an issue involving custom formatting; the results can be viewed here.
I also wrote up a separate section in the documentation for fast ASCII I/O, although there's still the question of how to incorporate IPython notebooks in the documentation. For now I have the notebooks hosted in a repo called ascii-profiling, but they may be moved to a new repo called astropy-notebooks. More importantly, Tom noticed that there must actually be something wrong with the fast converter (xstrtod()), since increasing the number of significant figures seems to scale the potential conversion error linearly. After looking over xstrtod() and reading more about IEEE floating-point arithmetic, I found a reasonable solution by forcing xstrtod() to stop parsing digits after the 17th digit (since doubles can only have a maximum precision of 17 digits) and by correcting an issue in the second half of xstrtod(), where the significand is scaled by a power of ten. I tested the new version of xstrtod() in the conversion notebook and found that low-precision values are now guaranteed to be within 0.5 ULP, while high-precision values are within 1.0 ULP about 90% of the time with no linear growth in error.
Once I commit the new xstrtod(), my PR should be pretty close to merging--at this point I'll probably write some more tests just to make sure everything works okay. Today is the suggested "pencils down" date of Google Summer of Code, so I guess it's time to wrap up.

Monday, August 4, 2014

Week 11

Since the real goal at this point is to finish up my main PR and my multiprocessing branch in order to merge, I ended up spending this week on final changes instead of Erik's dtype idea. My code's gotten some more review and my mentors and I have been investigating some details and various test cases, which should be really useful for documentation.
One nice thing I managed to discover was how well xstrtod() (the Pandas-borrowed fast float conversion function) works for various input precisions. Unlike strtod(), which is guaranteed to be within 0.5 ULP (units in the last place, or the distance between the two closest floating-point numbers) of the correct result, xstrtod() has no general bound and in fact might be off by several ULP for input with numerous significant figures. However, it works pretty well when the number of significant figures is relatively low, so users might prefer to choose use_fast_convert=True for fairly low-precision data. I wrote up an IPython notebook showing a few results, which Tom also built on in another notebook. I plan to include the results in the final documentation, as users might find it useful to know more about rounding issues with parsing and judge whether or not the fast converter is appropriate for their purposes.
On the multiprocessing branch, I added in xstrtod() and the use_fast_converter parameter, which defaults to False. After discussion with my mentors, I changed the file reading system so that the parser employs memory mapping whenever it reads from a file; the philosophy is that, aside from speed gains, reading a 1 GB file via memory mapping will save users from having to load a full gigabyte into memory. The main challenge with memory mapping (and later with other compatibility concerns) is getting the code to run correctly on Windows, which turned out to be more frustrating than I expected.
Since Windows doesn't have the POSIX function memmap(), specific Windows memory mapping code has to be wrapped in an #ifdef _WIN32 block, and the fact that Windows has no fork() call for multiprocessing means that memory is not simply copy-on-write as it is on Linux, which leads to a host of other issues. For example, I first ran into a weird issue involving pickling that ultimately turned out to be due to a six bug, which has been noted for versions < 1.7.0. I opened a PR to update the bundled version of six in AstroPy, so that issue should be fixed pretty quickly. There were some other problems, such as the fact that processes cannot be created with bound methods in Windows (which I circumvented by turning _read_chunk into a normal method and making CParser picklable via a custom __reduce__ method), but things seem to work correctly on Windows now. I actually found that there was a slowdown in switching to parallel reading, but I couldn't find a cause with cProfile; it might have been the fact that I used a very old laptop for testing, so I'll have to find some other way to see if there's actually a problem.
Tom also wrote a very informative IPython notebook detailing the performance of the new implementation compared to the old readers,genfromtxt(), and Pandas, which I'll include in the documentation for the new fast readers. It was also nice to see an interesting discussion regarding metadata parsing in issue #2810 and a new PR to remove boilerplate code, which is always good. I also made a quick fix to the HTML reading tests and opened a PR to allow for a user-specified backend parser in HTML reading, as Tom pointed out that certain files will work with one backend (e.g. html5lib) and not the default.

Monday, July 28, 2014

Week 10

I spent this week mostly working on the multiprocessing branch, fixing up some issues from last week and adding a bit more functionality. Most importantly, I finally switched to a more efficient scheme for reading data when given a filename as input; I'd been meaning to deal with this previously, since profiling indicated that the naïve method of simply reading file data into a single Python string took up something like 15% of processing time, so it's nice to have a more efficient method.
One idea I had thought of previously was to split file reading into each process, but my initial changes showed a speed decrease, which made sense after I came across this comment on StackOverflow explaining that it's best from a performance standpoint to access file data sequentially. I then switched over to reading separate chunks for each process in the main process at the beginning of parsing and then passing each chunk to its respective process, which seems more promising but still runs into issues with memory management. I've been trying to find a better solution today, and I think I should be able to figure it out by tomorrow.
Another issue I looked into was finding a faster algorithm for converting strings to doubles, based on Pandas' superior performance. After looking into what Pandas does, I found that a fairly simple conversion function xstrtod() replaces strtod() from the standard library; from what I was told by a Pandas developer, it seems that Pandas considers the speed gains more important than retaining the relatively slow correction loop in strtod() for high-precision values and is therefore willing to have values off by more than 0.5 ULP (units in the last place). numpy's conversion routine doesn't seem to offer a clear benefit (I tried replacing strtod() with numpy's conversion and got mixed results), so I'm not sure if any optimization is possible.
I then added a parallel parameter for reading and patched up the failing tests (due to multi-line quoted values, which Tom suggested should be an acceptable but documented problem with parallelized reading). I also fixed a unicode issue with string conversion in Python 3, which arose because the 'S' dtype in numpy corresponds to bytes in Python3 rather than str. Interestingly I found it a fairly trivial extension of the file reading code to implement memory mapping for reading on non-Windows systems, so I tentatively added it with a memory_map keyword and will test it more thoroughly to see if the performance boost is anything very significant. This does not enable a memory-mapped view of the file inside the output Table, but simply memory-maps the file to a char * pointer for faster reading and then un-maps it after processing is done.
Next week I'll be working on Erik's idea of creating custom numpy dtypes, and I'll also add documentation, make some final fixes, and otherwise get the engine ready for merging in the meantime.

Monday, July 21, 2014

Week 9

This week was a pretty straightforward one--I spent most of it working on implementing parallelization in the Cython reading algorithm (in a new branch, multiprocessing). After Tuesday's discussion about what problem to pursue next, we decided that I should spend a week or two on parallelization and then begin to work on creating custom numpy dtypes, which will be an extension of Erik's work and should be applicable to FITS files as well as ASCII to some extent.
In short, the new strategy I implemented for reading involves splitting input data into N chunks (I currently have N=8, but I'll have to find an optimal N at some point) and performing both tokenization and conversion in a different process for each chunk. There are a few subtleties involved; for example, chunks need to be separated by newlines, certain data has to be copied for each process, etc. One irritating issue I came across was one with conversion in which one process discovers string data and another discovers numeric data in the same column. The basic idea at the end of the algorithm is to take the resulting data from each process and concatenate the chunks together, using widening conversions (i.e. int to float to str) when chunks have different data types. However, if a column contains integers in the first chunk, floats in the second, and strings in the third, concatenating the first two would turn 5 to 5.0 and then concatenating the first two chunks with the third would result in "5.0" instead of the desired "5". The obvious solution is to check explicitly for string data in any of the chunks and immediately convert all the columns to strings, but this still results in improper behavior for numbers in scientific notation, for example (1.3e2 -> 13 -> "13"). I ended up simply using a multiprocessing.Queue to force processes to reconvert columns as strings in this case, discarding the original converted data altogether. Luckily this seems like a pretty unlikely corner case for large data files, so performance shouldn't really take a hit.
A more serious problem, and one I haven't quite decided how to handle, is how to deal with quoted fields spanning multiple lines. Since my approach involves splitting data into chunks by lines, an input string like 'a b c\n1 2 "3\n4"' will not parse correctly because one chunk will try parsing the line 1 2 "3 (noticing an unfinished quote sequence) and another will try parsing 4" (noticing an unexpected quote character and too few data values in the row). Unfortunately no good solution comes to mind, since explicitly searching through the input for quote characters would basically negate the performance gains of multiprocessing in the first place. I plan on asking my mentors what they think, since I'm not sure how commonly this sort of case occurs.
All tests are passing with the new multiprocessing algorithm except for those involving the quoting issue, and there is a speed gain over the old method. Unfortunately it's a little less than I hoped it would be (Tom warned that this could happen, citing Amdahl's Law), but still fairly nice; on a sample file that took the old algorithm about 0.11 seconds (the same file I've been using for a few weeks...I should probably start trying other ones), the new algorithm takes about 0.075 seconds, a speed gain of about 45%. I've been working on improving the reader's performance with reading from file, since inputting the file into a single string takes up about 0.012 seconds of the reading time, although I'm not yet finished. Table.__init__() is also taking about 0.01 seconds, which is probably unavoidable. Pandas is taking about 0.055 seconds on the same file. It's getting trickier to cut down time at this point, but hopefully I can edge out Pandas in the end!

Monday, July 14, 2014

Week 8

After improving the fast writer and fixing up some lingering issues last week, I spent this week looking into porting my reading implementation to numpy and reading the memory mapping code in io.fits to learn more about memory mapping. This felt like a bit of a slow week, since I was sick on Thursday and had trouble figuring out a problem with numpy's build system on Friday, but I think my work on adapting my implementation to genfromtxt() should definitely come in useful (at least at some point when I have time to finish the work).
I tidied up the code and made a few final PRs on Monday and Tuesday, and then at Tom's suggestion I opened a PR for review (https://github.com/astropy/astropy/pull/2716#issuecomment-48749423). Since it's a big PR, I expect it'll take some time to receive an extensive code review, but it's nice to have a roughly final implementation that passes on Travis CI. I also opened a PR in astropy-benchmarks for the io.ascii benchmarks I wrote at the beginning of the summer (https://github.com/astropy/astropy-benchmarks/pull/7). From there I went on to look at the memory mapping in io.fits and read about memory mapping in general, and I'm sure this will be something to discuss in the next hangout; I'm curious whether the Python mmap module will work with an underlying C layer, or how else memory mapping might be achieved.
More importantly, I spent much of the week changing my text reading code to work in numpy as an engine for genfromtxt(). After Tom pointed me to a thread on the numpy discussion list, I forked numpy and started writing a new version of genfromtxt() in a separate branch: https://github.com/amras1/numpy/tree/faster-genfromtxt. I also spoke to a numpy developer on IRC who participated in the mailing list thread, and he believed that my implementation might be of use. He also explained that a particularly important problem with loadtxt() and genfromtxt() is their broken handling of unicode with Python 3, which necessitates a full rewrite of the two functions. After that, I worked on altering the C tokenizing code to handle UTF-8 rather than plain ASCII and wrote related code to have the Cython/C parser work underneath genfromtxt(). As of now, genfromtxt() is working with relatively simple sample data, although there's still more to do with handling the function's parameters (e.g. missing_values and usemask).
Adapting the Cython/C reader to numpy has been a little more difficult than I thought, but I'm glad to have learned a lot of useful information about unicode and the differences between Python 2.x and 3.x. I'm not really sure what I'll be doing next week, but I'll know after the hangout meeting tomorrow.

Monday, July 7, 2014

Week 7

After the bulk work of the last two weeks, I spent this week mostly fixing up some minor issues with the fast readers and adding some more functionality. I think I can finally say that the C reader is more or less done; I might want to work more on getting its speed closer to Pandas' read_csv(), but it's currently stable and passes the tests I've written as well as the old tests (I added parametrized fixtures to the generic reading tests).
As Tom said in our last hangout, there seem to be a few directions to choose from here: improving the C reader's speed, implementing a fast writer, writing more tests/documentation, etc. This week I opted to work on the fast writer after finishing up lingering issues with the functionality of the C reader, but I'd like to address the other issues as well if time permits. Tom also mentioned that numpy is looking for an overhaul of loadtxt() and genfromtxt(), so if possible I'd like to offer up the code I've written for adaptation in numpy. This is a low priority at the moment, since I'm focusing on tightening up the code for Astropy, but at any rate I may be able to donate the code even if I don't have time to work on adapting it to genfromtxt()/loadtxt()—the developers on the numpy mailing list mentioned looking into Pandas, but they might find my implementation more adaptable (albeit slower).
During the first half of the week, I fixed some issues I had previously delayed, such as right stripping whitespace from table fields in the tokenizer, adding a parameter use_fast_reader to the io.ascii infrastructure, etc. as well as adding more comments and writing up a quick design document. I also added new fast readers for commented-header and RDB files after making the _read_header() method in FastBasic more extensible. One idea that Tom had was to simply use the old header classes to read file headers since the time header reading takes is so negligible, but I found that the headers were too closely tied to the BaseReader infrastructure to use them without modification. For now, at least, overriding _read_header() seems like a reasonable way to allow for some flexibility in header reading.
After that, I worked on creating a fast writing system in Cython. I ran into a number of problems when I tried out my algorithm by parametrizing fixtures in the writing test suite, particularly with masking, but I got it working after a while. I also included some format-specific handling to the current fast readers in order to deal with specific writing issues (e.g. omitting the header line, writing column types in RDB, etc). As of right now the implementation is passing all tests, but it could use a little more customization and definitely isn't fast enough—I found the fast reader took 2.8 seconds to write a file that took 3.5 seconds to write without the fast reader, only a 25% speed reduction. Profiling has been little tricky, since I can't find a line profiler for Cython and sometimes splitting into subroutines doesn't work well (since the overhead of function calls becomes a big factor). However, it's definitely clear that a lot of time is spent in the iter_str_vals() method called on each column for string iteration. I should be able to find a way to cut the writing time down significantly.
Although we're not having a hangout this week, after I get some work done on the writing algorithm I plan to ask Tom what direction would be best to go to from here and at what point I should plan on opening a PR for my branch. According to the schedule I should be looking into the possibility of memory mapping soon, but Tom said he's not really sure if it's a feasible idea; I guess that will be tabled for a little while, but we'll see what happens.