Sunday, June 29, 2014

Week 6

This week was fairly straightforward, as I continued my work from last week on the new Cython/C parsing implementation and began looking for areas of improvement. As of now, my new implementation is pretty consistent with the behavior of Basic and other simple readers; there are a couple of minor things tagged "//TODO:" and I'm sure there are issues I haven't yet noticed, but the bulk of the work is basically done. To make sure the implementation works as intended and to provide some examples for anyone reading through my code, I also wrote some tests this week in addition to the new functionality.
A good deal of my time this week was spent on implementing the parameters accepted by ascii.read(), outlined here. First, though, I added FastBasic, FastCsv, and FastTab to the infrastructure of ascii.read() so that they belong to the guess list and can be specified for reading (e.g. format="fast_basic"). After that, I improved the tokenizer by dealing with quoted values, fixing a memory allocation problem in C with realloc, skipping whitespace at the beginning of fields, etc. Incidentally, this last change should be much more efficient than the old method of dealing with leading/trailing whitespace, since whitespace is handled conditionally in the tokenizer and the overhead of calling strip() on a bunch of Python objects is gone.
I also implemented include_names and exclude_names; these will actually improve parsing performance if specified (unlike the ordinary parsing method, which collects all columns and later throws some out) because the tokenizer only stores the necessary columns in memory. Then I worked on other parameters, such as header_start/data_start/data_end (the last one will not improve performance if negative, though), fill_values, fill_include_names/fill_exclude_names, etc. I also made the tokenizer accept a parameter which specifies whether rows with insufficient data should be padded with empty values; FastBasic sets the parameter to False by default, so any such case raises an error, but FastCsv sets it to True, mirroring how Csv's behavior differs from Basic's. Either reader, of course, will raise an error upon reading too many data values in a given row. A rough sketch of this end-of-row logic follows.
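Here's a hedged sketch of that end-of-row behavior; all of the names here (end_of_row, fill_extra_cols, TOO_FEW_COLS) are illustrative, not the actual identifiers in my branch:

```c
/* Sketch: when a row ends with fewer fields than expected, either report
 * an error (FastBasic-style) or pad the row with empty values (FastCsv-
 * style). Names are made up for illustration. */
#include <stdio.h>

#define NO_ERROR      0
#define TOO_FEW_COLS  1

typedef struct {
    int num_cols;         /* expected number of columns                 */
    int fill_extra_cols;  /* pad short rows instead of raising an error */
} row_policy_t;

/* Called when a newline is reached after `col` fields have been read. */
static int end_of_row(const row_policy_t *p, int col)
{
    if (col < p->num_cols) {
        if (!p->fill_extra_cols)
            return TOO_FEW_COLS;                 /* FastBasic: error out */
        for (; col < p->num_cols; ++col)
            printf("padding column %d with an empty value\n", col);
    }
    return NO_ERROR;
}

int main(void)
{
    row_policy_t csv_like = {3, 1};   /* FastCsv-style: pad short rows   */
    row_policy_t basic_like = {3, 0}; /* FastBasic-style: raise instead  */

    end_of_row(&csv_like, 2);         /* pads the missing third column   */
    if (end_of_row(&basic_like, 2) == TOO_FEW_COLS)
        printf("a FastBasic-style reader would raise an error here\n");
    return 0;
}
```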
In our brief meeting on Tuesday, Tom and I talked a little about the conversion part of the algorithm and how we might make it more efficient. I initially used the astype method of numpy.ndarray to try conversion first to int, then to float, and then to string, but I discovered that this was not very efficient; in fact, I ran into memory errors for a couple of the large files I used for benchmarking. I therefore wrote new conversion code which deals with potential masking (in case of bad/empty values, as specified by fill_values) and calls underlying C code which uses strtol() or strtod() to convert strings to ints or floats. This seems to work quite a bit better, but I'd still like to see if there's a better way to convert a chunk of data very efficiently.
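To give a sense of the approach, here's a minimal, self-contained C sketch of the conversion idea; the function names and the convention of returning -1 so the caller can mask the entry are my own for illustration, not the code in my branch:

```c
/* Sketch: parse raw C strings directly with strtol()/strtod() instead of
 * building intermediate Python objects. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Try to parse one field as a long; return 0 on success, -1 on failure. */
static int field_to_long(const char *field, long *out)
{
    char *end;
    errno = 0;
    *out = strtol(field, &end, 10);
    /* Fail if nothing was consumed, trailing junk remains, or we overflowed. */
    if (end == field || *end != '\0' || errno == ERANGE)
        return -1;
    return 0;
}

/* Same idea for doubles, using strtod(). */
static int field_to_double(const char *field, double *out)
{
    char *end;
    errno = 0;
    *out = strtod(field, &end);
    if (end == field || *end != '\0' || errno == ERANGE)
        return -1;
    return 0;
}

int main(void)
{
    const char *fields[] = { "42", "5.5", "N/A" };
    for (int i = 0; i < 3; ++i) {
        long l;
        double d;
        if (field_to_long(fields[i], &l) == 0)
            printf("%s -> int %ld\n", fields[i], l);
        else if (field_to_double(fields[i], &d) == 0)
            printf("%s -> float %g\n", fields[i], d);
        else
            printf("%s -> masked (cf. fill_values)\n", fields[i]);
    }
    return 0;
}
```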
Anyhow, it's nice to have an implementation down which passes some preliminary tests and seems to perform pretty well. Out of curiosity, I used timeit to check how the new code holds up against the more flexible code and against Pandas. On an integer-filled file which took the flexible code about 0.7 seconds and Pandas about 0.055 seconds, FastBasic took about 0.5 seconds. Not terrible, but I definitely want it to be a lot closer to Pandas' speed, and I think I should be able to improve it quite a bit once I do some profiling. Next week, I'm probably going to finish up the last small fixes I have planned and then focus on improving the overall performance of the implementation. I'll also probably start implementing CSV writing, which, judging from Pandas' code, should be quite a bit simpler than reading.

EDIT: After I moved some of the code in the conversion functions to C (to avoid the creation of Python strings), FastBasic is now down to about 0.12 seconds with the same file. Nice to have a reader about 6 times as fast as the old one, but there's more to do; Pandas is still about twice as fast as FastBasic!

Sunday, June 22, 2014

Week 5

This was an interesting week, as I began the actual implementation of the new plan for fast reading/writing in io.ascii. My meeting with Tom on Tuesday concluded with the decision to write new routines from scratch instead of using the routines in the Pandas library (although I'm still using ideas from Pandas' approach, of course). Additionally, instead of replacing the existing functionality for basic readers and writers in io.ascii, the plan is to maintain compatibility by creating new, less flexible readers and writers (FastBasic, FastCsv, etc.) which will fall back on the old reading/writing classes when the C engine is too rigid. The user will also have some sort of option to choose between these new, faster readers and the old ones, perhaps through a parameter new passed to ui.read(). Since text-based parsing in Pandas is somewhat inflexible and unwieldy, it should be helpful to have a better-documented, tight-knit library for use in Astropy.
I began a new branch for development called fast-c-reader on my Astropy fork, viewable here. I'm using a three-tiered approach to parsing: pure Python classes (FastBasic, FastCsv, etc.) interface with the rest of io.ascii, the Cython class CParser acts as an intermediate engine which handles Python method calls and invokes the tokenizer, and C code centered around the struct tokenizer_t (influenced by the Pandas parser_t) performs the dirty work of reading input data quickly with state-machine logic. The Python class FastBasic, from which the other fast classes inherit, raises a ParameterError if the user passes a parameter which the C engine cannot handle -- for example, the C tokenizer can't deal with a regex comment string, so it refuses to read if the user passes anything other than a 1-character string as the comment parameter. Similarly, converters, Outputter, and other flexible parameters cannot be passed to a fast reader.
The real heart of the algorithm, the C function tokenize(), is currently fairly small, but it'll grow as I add back in more functionality like ignoring leading and trailing whitespace, replacing fill values, etc. It deals with the struct tokenizer_t (sketched after this list), which has four main components:
  • char *source, a single string containing all of the input data
  • char **output_cols, an array of strings representing the output data for each column
  • int *row_positions, an array of positions denoting where each row begins in output_cols
  • tokenizer_state state, an enum value denoting the current state of the tokenizer
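Here's a rough C sketch of the struct as just described; the real tokenizer_t in my branch carries more bookkeeping (buffer capacities, delimiter and comment characters, error codes, etc.), so anything beyond the four fields listed above should be read as an assumption:

```c
/* Sketch of the tokenizer struct and its state enum; the states shown are
 * illustrative, not the exact set in the fast-c-reader branch. */
typedef enum {
    START_LINE,     /* at the beginning of a new row       */
    START_FIELD,    /* at the beginning of a new field     */
    FIELD,          /* inside an unquoted field            */
    QUOTED_FIELD,   /* inside a quoted field               */
    COMMENT         /* skipping the rest of a comment line */
} tokenizer_state;

typedef struct {
    char *source;           /* single string holding all input data      */
    char **output_cols;     /* one output buffer per column              */
    int *row_positions;     /* where each row begins in output_cols      */
    tokenizer_state state;  /* current state of the state machine        */
    int num_cols;           /* number of columns, known after the header */
} tokenizer_t;
```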
I'm treating row_positions and each output_cols[i] as "smart arrays": when their capacity is exceeded, the functions resize_rows() or resize_cols() call realloc() to double the size of the array (see the sketch after the example below). Later I'll see if there might be a more efficient way to store output data on the fly, but for now it seems like a reasonably memory-efficient approach. Anyway, output_cols is initialized after header reading, which provides the number of columns in the table (or after reading the first line of table data if header_start=None). Some values in output_cols are simply '\x00'; these act as fill values because each row takes up the same number of characters in every output_cols[i], and row_positions denotes where these contiguous blocks begin. For example, if the input is "A,B,C\n10,5.,6\n1,2,3", the following will hold:
  • source: "A,B,C\n10,5.,6\n1,2,3"
  • output_cols: ["A101", "B5.2", "C6\x003"]
  • row_positions: [0, 1, 3]
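And here's a minimal sketch of the "smart array" growth strategy; the function below is my own illustration, and resize_rows()/resize_cols() in my branch differ in their details:

```c
/* Sketch: append one char to a growable buffer, doubling the capacity
 * with realloc() whenever it would overflow. */
#include <stdio.h>
#include <stdlib.h>

static int push_char(char **buf, size_t *len, size_t *cap, char c)
{
    if (*len + 1 > *cap) {
        size_t new_cap = (*cap == 0) ? 64 : *cap * 2;
        char *tmp = realloc(*buf, new_cap);
        if (tmp == NULL)
            return -1;   /* out of memory; caller must free and bail out */
        *buf = tmp;
        *cap = new_cap;
    }
    (*buf)[(*len)++] = c;
    return 0;
}

int main(void)
{
    char *buf = NULL;
    size_t len = 0, cap = 0;
    const char *data = "10,5.,6";
    for (const char *p = data; *p != '\0'; ++p)
        if (push_char(&buf, &len, &cap, *p) != 0)
            return 1;
    push_char(&buf, &len, &cap, '\0');
    printf("buffer: %s (capacity %zu)\n", buf, cap);
    free(buf);
    return 0;
}
```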
There's more I've been doing, such as implementing header_start/data_start, handling comments, etc. Until next week!

Sunday, June 15, 2014

Week 4

This week, I spent some time inspecting the Pandas library, particularly its algorithm for parsing data from text files. My goal was to analyze Pandas' algorithm and to determine whether the fast C reader in Pandas (written partly in Cython, partly in C) is fit for use in AstroPy or whether I'll have to implement a new approach from scratch. With regard to the differences between AstroPy and Pandas, I made several discoveries:
  • Missing/invalid values are handled differently -- Astropy’s default behavior is to change empty values to ‘0’ while Pandas changes the empty string, ‘NaN’, ‘N/A’, ‘NULL’, etc. to NaN. Both can be altered by passing a parameter (na_values, keep_default_na for Pandas and fill_values for Astropy), but Astropy always masks values while Pandas simply inserts NaN. This might cause confusion because Astropy currently allows for NaN as a non-masked numerical value.
  • Pandas accepts a parameter “comment” for reading, but it will only take a single char while Astropy allows for a regex.
  • Minor issues: Pandas doesn’t ignore empty lines and will add a NaN-filled row instead. Also, line comments are not supported and are treated as empty lines. Both issues are addressed in https://github.com/pydata/pandas/issues/4466 and https://github.com/pydata/pandas/pull/4505. I made a pull request to fix these issues at https://github.com/pydata/pandas/pull/7470, so this should be sorted out fairly soon.
  • Pandas has the parameter “header” to specify where header information begins, but it still expects the header to be a one-line list of names (which could be problematic for fixed-width files with two header lines, for example). It also doesn't have parameters to specify where data begins or ends.
At Tom's suggestion, I also checked out DefaultSplitter.__call__ more closely to see whether the function was actually spending most of its time iterating over csv.reader(). As it turns out, the function spends most of its time in the overhead of calling process_val and in process_val itself (i.e. calling strip()). In fact, I found with cProfile that DefaultSplitter.__call__() became about 70% faster and read() became about 30% faster when process_val is set to None. There doesn't seem to be a quick way to deal with this, since csv.reader has no option to strip whitespace automatically (except for left-stripping), but this is a good example of how Astropy currently wastes a lot of time on overhead and intermediate structures for holding text data.
Along those lines, I read through Wes McKinney's old blog entry describing Pandas' approach to text-based file parsing and discovered that Pandas' greatest advantage over other readers (like numpy) is that it forgoes the use of intermediate Python data structures and instead uses C to tokenize file data with state-machine logic. For my own reference, I wrote up a highly simplified/pseudocode version of the code here. The basic gist is this:
  • read_csv and read_table have the same implementation, which relies on a supplied engine class to do the real work. The C-based reader is the default engine, but there is a Python engine (and a fixed-width engine) which can be employed for greater flexibility.
  • There are two main stages to the reading algorithm: tokenization and type conversion. The latter is fairly similar to Astropy's conversion method; that is, it tries converting each column to ints, then floats, then booleans, and then strings and moves on when conversion succeeds.
  • Tokenization is done in a switch statement which deals with each input character on a case-by-case basis depending on the parser's current state (stored as an enum). Its behavior is fairly customizable, e.g. splitting by whitespace or using a custom line terminator; a toy version of this state-machine idea is sketched after this list.
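As a rough illustration of that idea (a simplification for this post, not Pandas' actual code), here's a toy tokenizer that walks the input one character at a time and branches on the current state:

```c
/* Toy state-machine tokenizer: prints each field separated by '|', with
 * one line of output per input row. */
#include <stdio.h>

typedef enum { START_FIELD, IN_FIELD } state_t;

static void tokenize(const char *s, char delim)
{
    state_t state = START_FIELD;
    for (; *s != '\0'; ++s) {
        switch (state) {
        case START_FIELD:
            if (*s == delim)
                putchar('|');            /* empty field */
            else if (*s == '\n')
                putchar('\n');           /* empty row   */
            else {
                putchar(*s);
                state = IN_FIELD;
            }
            break;
        case IN_FIELD:
            if (*s == delim) {
                putchar('|');            /* end of field */
                state = START_FIELD;
            } else if (*s == '\n') {
                putchar('\n');           /* end of row   */
                state = START_FIELD;
            } else
                putchar(*s);
            break;
        }
    }
}

int main(void)
{
    tokenize("A,B,C\n10,5.,6\n1,2,3\n", ',');
    return 0;
}
```

The real thing adds states for quoted fields, comments, escapes, and so on, but the core loop has this same shape.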
After discussing this with my mentors, I'll work on adapting this basic idea to Astropy. My guess right now (based on the issues noted above) is that I might need to alter Pandas' Cython/C code or reimplement it before integrating it into Astropy. Tom and Mike previously suggested that the plan should be to hybridize io.ascii in some sense, so that some of the fancier readers and writers can still use the current flexible framework while the simpler readers/writers default to faster C-based code. I'll probably write a class CReader from which Csv, Rdb, and other formats can inherit (rather than BaseReader) and which will act as a wrapper for my new implementation. Another thought I had, although I'm not sure whether it would turn out to be feasible and efficient, would be to replace the existing conversion algorithm with one tied into the tokenizing system, in which each value's dtype is recorded and widening conversions are performed on the fly. Anyhow, I'll be spending the next couple of weeks writing my new implementation.

Sunday, June 8, 2014

Week 3

It's the end of the third week of coding, and I've finished the benchmarking stage of the project. After a hangout meeting with two of my mentors, Tom Aldcroft and Michael Droettboom, in which we discussed my work so far and how to proceed, I began working on improving the existing benchmarks, creating new ones (some dealing with relevant parts of astropy.table and some comparing the performance of AstroPy with numpy and pandas), doing some high-level profiling, and documenting the results. Although I already posted the link in my previous post, here is the GitHub repo containing most of my work, including randomly generated sample text files for benchmarking.

Some of my work involved fixing or extending the current tests; for example, I used cStringIO as input and output for benchmarks instead of reading and writing from file each time. I also added benchmarks for some other functions that I found to be significant while profiling, and ran line_profiler on them as well; the current asv graph is here and the line profiling results are here. Although I had expected that the Table class itself would be pretty significant for benchmarking, I actually discovered via profiling that most of the work is done in Column and pprint.py, so the benchmarks I wrote pertain to those.

More importantly, I used cProfile to get a high-level view of the amount of time the current implementation spends in various functions, which provided me with what should be a good framework for investigating how to change io.ascii in the coming weeks. Per Tom's and Michael's suggestion, I used snakeviz, a nifty little profile visualization tool, to look at cProfile's output. Finally, I wrote up the results in this Markdown file, which was particularly necessary because asv doesn't (currently) have a mechanism for comparing benchmarks. I wasn't aware of any way to publish snakeviz's output (e.g. on GitHub Pages), so I just took a couple of screenshots and linked them in the writeup. The writeup is more specific, but my main findings were that pandas >> numpy > AstroPy in reading/writing speed, that different formats were fairly similar in terms of speed, and that both reading and writing times varied considerably across data types (probably because conversion is currently inefficient).

Next week I'll be looking closely at how Pandas handles text-based reading and writing in order to see if their approach can be adapted to AstroPy, after which I'll plan out how best to change the current implementation in io.ascii. If Pandas turns out to be compatible with the Table class, then I expect the gains in terms of time efficiency should be pretty huge, since Pandas is currently about five times as fast as AstroPy.

Sunday, June 1, 2014

Week 2

This week, my main focus was on writing benchmarks for the formats in `io.ascii` that I'll be working with over the next month or so. These include CSV, RDB, tab-separated, fixed-width, and more. My git repository containing asv benchmarks can be found here. In addition, here is a graph of the benchmarks so far.

I've become pretty familiar with asv, although I did experience a fairly annoying RuntimeWarning at one point ("Parent module '' not found") -- I'm not sure what that was about, but my guess is that it had something to do with virtualenv. I also used line_profiler to get a basic sense of the percentage of total time `io.ascii` spends in various functions, viewable here. Anyway, I'm certainly not done with the benchmarking process, but I've made a solid start and I should be done by the end of next week. My focus next week will be on writing benchmarks for the parts of the `Table` class relevant to `TableOutputter` (which converts actual table data into an AstroPy `Table`).

More to come...