Sunday, June 29, 2014

Week 6

This week was fairly straightforward, as I continued my work from last week on the new Cython/C parsing implementation and began looking for areas of improvement. As of now, my new implementation is pretty consistent with the behavior of Basic and other simple readers; there are a couple minor things tagged "//TODO:" and I'm sure there are issues I haven't yet noticed, but the bulk work is basically over. In order to make sure the implementation works as intended and to provide some examples for anyone reading through my code, I wrote some tests this week in addition to writing new functionality.
A good deal of my time this week was spent on implementing the parameters accepted by ascii.read(), outlined here. First, though, I added FastBasic, FastCsv, and FastTab to the infrastructure of ascii.read() so that they belong to the guess list and can be specified for reading (e.g. format="fast_basic"). After that, I improved the tokenizer by dealing with quoted values, fixing a memory allocation problem in C with realloc, skipping whitespace at the beginning of fields, etc. Incidentally, this last point should be much more efficient than the old method of dealing with leading/trailing whitespace, since whitespace is conditionally dealt with in the tokenizer and the overhead of calling strip() on a bunch of Python objects is gone.
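To give a rough idea of what handling quotes and leading whitespace inside the tokenizer means, here is a minimal pure-Python sketch. It is only an illustration, not the actual C tokenizer: the function name tokenize_line and its parameters are made up for this example, and it ignores trailing whitespace and escaped quote characters.

```python
def tokenize_line(line, delimiter=",", quotechar='"'):
    """Split one line into fields in a single pass.

    Hypothetical helper for illustration only: leading whitespace in a
    field is skipped as characters are read, so no strip() call is needed
    afterwards, and delimiters inside quotes are kept as data.
    """
    fields = []
    current = []
    in_quotes = False
    at_field_start = True
    for ch in line.rstrip("\n"):
        if in_quotes:
            if ch == quotechar:
                in_quotes = False      # closing quote ends the quoted run
            else:
                current.append(ch)     # delimiters inside quotes are data
        elif ch == quotechar:
            in_quotes = True
            at_field_start = False
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
            at_field_start = True
        elif at_field_start and ch in " \t":
            continue                   # skip whitespace before the field
        else:
            current.append(ch)
            at_field_start = False
    fields.append("".join(current))
    return fields


print(tokenize_line(' a,  "b, c",d'))
```

The point of the design is that whitespace is dropped while the characters are being scanned anyway, instead of in a second pass over already-created Python strings.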
I also implemented include_names and exclude_names; these will improve parsing performance if specified (unlike the ordinary parsing method, which collects all columns and later throws some out) because the tokenizer only stores the necessary columns in memory. Then I worked on other parameters, such as header_start/data_start/data_end (the last one will not improve performance if negative, though), fill_values, fill_include_names/fill_exclude_names, etc. I also made the tokenizer accept a parameter which specifies whether rows with insufficient data should be padded by adding empty values; FastBasic sets the parameter to False by default, so any such case raises an error, but FastCsv sets it to True, mirroring the way Csv's behavior differs from Basic's. Of course, either reader will raise an error upon reading too many data values in a given row.
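The column selection and row padding described above can be sketched in plain Python roughly as follows. This is a simplified stand-in for the real tokenizer: the function store_rows and the parameter name pad_short_rows are hypothetical, chosen just for this example.

```python
def store_rows(rows, names, include_names=None, exclude_names=None,
               pad_short_rows=False):
    """Collect only the wanted columns from already-split rows.

    Hypothetical sketch: `pad_short_rows` plays the role of the padding
    parameter (False in the FastBasic style, True in the FastCsv style).
    """
    keep = list(range(len(names)))
    if include_names is not None:
        keep = [i for i in keep if names[i] in include_names]
    if exclude_names is not None:
        keep = [i for i in keep if names[i] not in exclude_names]

    columns = {names[i]: [] for i in keep}
    for row_num, row in enumerate(rows):
        if len(row) > len(names):
            # Too many values is an error for either reader.
            raise ValueError("too many values in row %d" % row_num)
        if len(row) < len(names):
            if not pad_short_rows:
                raise ValueError("not enough values in row %d" % row_num)
            row = row + [""] * (len(names) - len(row))
        for i in keep:
            # Unwanted columns are never stored at all.
            columns[names[i]].append(row[i])
    return columns


rows = [["1", "2", "3"], ["4", "5"]]
print(store_rows(rows, ["a", "b", "c"], exclude_names=["b"],
                 pad_short_rows=True))
```

With pad_short_rows left at False, the second row in this example would raise a ValueError instead of being padded.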
In our brief meeting on Tuesday, Tom and I talked a little about the conversion part of the algorithm and how we might make it more efficient. I initially used the method astype of numpy.ndarray to try conversion first to int, then to float, and then to string, but I discovered that this was not very efficient; in fact, I ran into memory errors for a couple of the large files I used for benchmarking. I therefore wrote new conversion code which deals with potential masking (in case of bad/empty values, as specified by fill_values) and calls underlying C code which uses strtol() or strtod() to convert strings to ints or floats. This seems to work quite a bit better, but I'd still like to see if there's a better way to convert a chunk of data very efficiently.
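The overall idea of the conversion step (try int, then float, then fall back to strings, masking bad values along the way) looks something like the sketch below. It is a pure-Python illustration under my own assumptions: convert_column and bad_values are hypothetical names, and Python's int() and float() stand in for the C calls to strtol() and strtod().

```python
def convert_column(values, bad_values=("",)):
    """Convert a list of strings to ints, floats, or strings, with a mask.

    Hypothetical helper: returns (data, mask), where mask[i] is True for
    entries treated as bad/empty. Masked entries get a placeholder value.
    """
    mask = [v in bad_values for v in values]
    for caster, placeholder in ((int, 0), (float, 0.0)):
        try:
            data = [placeholder if m else caster(v)
                    for v, m in zip(values, mask)]
        except ValueError:
            continue        # this type does not fit, try the next one
        return data, mask
    return list(values), mask   # nothing numeric worked, keep strings


print(convert_column(["1", "2", ""]))
print(convert_column(["1", "2.5"]))
print(convert_column(["1", "x"]))
```

A column only ends up as int if every unmasked value parses as an int; one value like "2.5" pushes the whole column to float.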
Anyhow, it's nice to have an implementation down which passes some preliminary tests and seems to perform pretty well. Out of curiosity, I used timeit to check how the new code holds up to the more flexible code and to Pandas. On an integer-filled file which took the flexible code about 0.7 seconds and Pandas about 0.055 seconds, FastBasic took about 0.5 seconds. Not terrible, but I definitely want it to be a lot closer to Pandas' speed, and I think I should be able to improve it quite a bit once I do some profiling. Next week, I'm probably going to finish up the last small fixes I have planned and then focus on improving the overall performance of the implementation. I'll also probably start implementing CSV writing, which should be quite a bit simpler than reading, judging from Pandas' code.
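For anyone who wants to run a similar comparison, a timeit harness can be as small as the sketch below. The two parser functions are toy stand-ins I made up so the example runs on its own; they are not the readers benchmarked above, and the timings they give say nothing about those readers.

```python
import timeit


def best_time(func, *args, repeat=3, number=5):
    """Return the best average time per call, in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def parse_with_strip(text):
    # Toy stand-in: strips every field before converting it.
    return [[int(f.strip()) for f in line.split(",")]
            for line in text.splitlines()]


def parse_plain(text):
    # Toy stand-in: converts fields directly.
    return [[int(f) for f in line.split(",")]
            for line in text.splitlines()]


# Small integer-filled sample kept in memory instead of a file.
sample = "\n".join(",".join(str(i + j) for j in range(5))
                   for i in range(1000))

for name, func in (("with strip", parse_with_strip),
                   ("plain", parse_plain)):
    print("%-10s %.6f s" % (name, best_time(func, sample)))
```

Taking the minimum over several repeats is the usual way to reduce noise from whatever else the machine is doing.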

EDIT: After I moved some of the code in the conversion functions to C (to avoid the creation of Python strings), FastBasic is now down to about 0.12 seconds with the same file. Nice to have a reader about 6 times as fast as the old one, but there's more to do; Pandas is still about twice as fast as FastBasic!
