Sunday, June 15, 2014

Week 4

This week, I spent some time inspecting the Pandas library, particularly its algorithm for parsing data from text files. My goal was to analyze that algorithm and determine whether Pandas' fast C reader (written partly in Cython, partly in C) is fit for use in Astropy, or whether I'll have to implement a new approach from scratch. Regarding the differences between Astropy and Pandas, I made several discoveries:
  • Missing/invalid values are handled differently: Astropy's default behavior is to replace empty values with ‘0’ (and mask them), while Pandas converts the empty string, ‘NaN’, ‘N/A’, ‘NULL’, etc. to NaN. Both behaviors can be altered by passing a parameter (na_values and keep_default_na for Pandas, fill_values for Astropy), but Astropy always masks the replaced values while Pandas simply inserts NaN. This might cause confusion because Astropy currently allows NaN as a non-masked numerical value. A short sketch after this list illustrates the difference.
  • Pandas accepts a “comment” parameter for reading, but it only takes a single character, while Astropy allows a regex.
  • Minor issues: Pandas doesn’t ignore empty lines and will add a NaN-filled row instead. Also, line comments are not supported and are treated as empty lines. Both issues are addressed in https://github.com/pydata/pandas/issues/4466 and https://github.com/pydata/pandas/pull/4505. I made a pull request to fix these issues at https://github.com/pydata/pandas/pull/7470, so this should be sorted out fairly soon.
  • Pandas has the parameter “header” to specify where header information begins, but it still expects the header to be a one-line list of names (which could be problematic for fixed-width files with two header lines, for example). It also doesn't have parameters to specify where data begins or ends.
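To make the first point concrete, here's a minimal sketch of the two default behaviors. The toy table is my own, and the inline results are what I'd expect from current Pandas and Astropy rather than output I've pasted in.

```python
from io import StringIO

import pandas as pd
from astropy.io import ascii

text = "a,b\n1,2\n,4\n"   # second row has an empty value in column 'a'

# Pandas: '' (along with 'NaN', 'N/A', 'NULL', ...) becomes a plain float NaN
df = pd.read_csv(StringIO(text))
print(df['a'].tolist())        # [1.0, nan] -- no masking, just NaN

# Pandas: the sentinel list can be changed via na_values / keep_default_na
df2 = pd.read_csv(StringIO(text), keep_default_na=False)
print(df2['a'].tolist())       # the empty field survives as a string

# Astropy: the empty value is replaced by '0' and the entry is masked
tbl = ascii.read(text, format='csv')
print(tbl['a'].mask)           # [False  True]

# Astropy: fill_values changes the replacement, but the entry is still masked
tbl2 = ascii.read(text, format='csv', fill_values=[('', '-99')])
print(tbl2['a'].mask)          # [False  True]
```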
At Tom's suggestion, I also examined DefaultSplitter.__call__ more closely to see whether the function actually spends most of its time iterating over csv.reader(). As it turns out, most of the time goes to process_val: both the overhead of calling it and the work inside it (i.e. calling strip()). In fact, I found with cProfile that DefaultSplitter.__call__() became about 70% faster and read() became about 30% faster when process_val was set to None (a sketch of that comparison is below). There doesn't seem to be a quick way to deal with this, since csv.reader has no option to strip whitespace automatically (aside from skipinitialspace, which only strips on the left), but this is a good example of how Astropy currently wastes a lot of time on overhead and intermediate structures for holding text data.
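For reference, here's roughly how I ran that comparison. This is a sketch rather than the exact script I used: the file name is a placeholder, and the attribute path reader.data.splitter.process_val reflects io.ascii's internal layout as I understand it.

```python
import cProfile

from astropy.io import ascii

def profile_read(filename, strip_vals=True):
    reader = ascii.get_reader(Reader=ascii.Basic)
    if not strip_vals:
        # Skip the per-value strip() performed in DefaultSplitter.__call__
        reader.header.splitter.process_val = None
        reader.data.splitter.process_val = None
    cProfile.runctx('reader.read(filename)', globals(), locals(),
                    sort='cumulative')

profile_read('big_table.txt')                    # baseline
profile_read('big_table.txt', strip_vals=False)  # read() ~30% faster in my test
```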
Along those lines, I read through Wes McKinney's old blog entry describing Pandas' approach to text-based file parsing and discovered that Pandas' greatest advantage over other readers (like numpy) is that it forgoes the use of intermediate Python data structures and instead uses C to tokenize file data with state-machine logic. For my own reference, I wrote up a highly simplified/pseudocode version of the code here. The basic gist is this:
  • read_csv and read_table have the same implementation, which relies on a supplied engine class to do the real work. The C-based reader is the default engine, but there is a Python engine (and a fixed-width engine) which can be employed for greater flexibility.
  • There are two main stages to the reading algorithm: tokenization and type conversion. The latter is fairly similar to Astropy's conversion method; that is, it tries converting each column to ints, then floats, then booleans, then strings, and stops at the first conversion that succeeds (a toy version of this cascade follows this list).
  • Tokenization is done in a switch statement which deals with each input character on a case-by-case basis depending on the parser's current state (stored as an enum). Its behavior is fairly customizable, e.g. splitting by whitespace and using a custom line terminator.
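As a reminder for myself, here is a toy Python version of that conversion cascade. It's my own simplification, not the actual Pandas (or Astropy) code, and it ignores missing values entirely.

```python
import numpy as np

def _to_bool(str_vals):
    # Only 'True'/'False' strings count as a boolean column in this toy version
    mapping = {'True': True, 'False': False}
    if not all(v in mapping for v in str_vals):
        raise ValueError('not a boolean column')
    return np.array([mapping[v] for v in str_vals])

def convert_column(str_vals):
    """Convert a list of strings with the first dtype in the cascade that fits."""
    converters = (
        lambda v: np.array(v, dtype=int),    # ints first...
        lambda v: np.array(v, dtype=float),  # ...then floats...
        _to_bool,                            # ...then booleans...
        lambda v: np.array(v, dtype=str),    # ...and finally plain strings
    )
    for converter in converters:
        try:
            return converter(str_vals)
        except ValueError:
            continue  # move on to the next, more general type

print(convert_column(['1', '2', '3']).dtype)    # integer dtype
print(convert_column(['1', '2.5']).dtype)       # float64
print(convert_column(['True', 'False']).dtype)  # bool
print(convert_column(['a', 'b']).dtype)         # <U1 (string)
```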
After discussing this with my mentors, I'll work on adapting this basic idea to Astropy. My guess right now (based on the issues noted above) is that I might need to alter Pandas' Cython/C code or reimplement it before integrating it into Astropy. Tom and Mike previously suggested hybridizing io.ascii in some sense, so that the fancier readers and writers can keep using the current flexible framework while the simpler ones default to faster C-based code. I'll probably write a class CReader, from which Csv, Rdb, and other formats can inherit (rather than BaseReader), and which will act as a wrapper for my new implementation. Another thought, though I'm not sure it would turn out to be feasible or efficient, is to replace the existing conversion algorithm with one tied into the tokenizer, recording each value's dtype as it is parsed and performing widening conversions on the fly; a rough sketch of that idea follows. Anyhow, I'll spend the next couple of weeks writing my new implementation.
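Since I'm not committed to that last idea yet, here is only a very rough pure-Python illustration of what I mean by on-the-fly widening: classify each value as it is tokenized and keep the widest kind seen per column, so that the dtype is known after a single pass (booleans and missing values are ignored here). The real version would live in the C tokenizer, and whether it actually beats the try/except cascade is exactly what I'd need to measure.

```python
import numpy as np

# Type "width" ordering for this sketch: int < float < str
INT, FLOAT, STR = 0, 1, 2

def classify(token):
    # Return the narrowest kind that can represent this single token
    try:
        int(token)
        return INT
    except ValueError:
        pass
    try:
        float(token)
        return FLOAT
    except ValueError:
        return STR

def read_rows(rows):
    ncols = len(rows[0])
    kinds = [INT] * ncols
    for row in rows:
        for i, token in enumerate(row):
            # Widen the column's kind whenever a wider value shows up
            kinds[i] = max(kinds[i], classify(token))
    dtypes = {INT: int, FLOAT: float, STR: str}
    return [np.array([row[i] for row in rows], dtype=dtypes[kinds[i]])
            for i in range(ncols)]

cols = read_rows([['1', '2.5', 'x'],
                  ['3', '4',   'y']])
print([c.dtype for c in cols])   # e.g. [int64, float64, <U1]
```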
