Sunday, June 22, 2014

Week 5

This was an interesting week, as I began the actual implementation of the new plan for fast reading/writing in io.ascii. My meeting with Tom on Tuesday concluded with the decision to write new routines from scratch instead of using the routines in the Pandas library (although I'm still using ideas from Pandas' approach, of course). Additionally, instead of replacing the existing functionality for basic readers and writers in io.ascii, the plan is to maintain compatibility by creating new readers and writers (FastBasic, FastCsv, etc.) with less flexibility, which will fall back on the old reading/writing classes when the C engine is too rigid. The user will also have some sort of option to choose between these new, faster readers and the old ones, perhaps through a new parameter passed to ui.read(). Since text-based parsing in Pandas is somewhat inflexible and unwieldy, it should be helpful to have a better-documented, more tightly knit library for use in Astropy.
I began a new branch for development called fast-c-reader on my Astropy fork, viewable here. I'm using a three-tiered approach to parsing: pure Python classes (FastBasic, FastCsv, etc.) interface with the rest of io.ascii, the Cython class CParser acts as an intermediate engine which handles Python method calls and invokes the tokenizer, and C code centered around the struct tokenizer_t (influenced by the Pandas parser_t) performs the dirty work of reading input data quickly with state-machine logic. The Python class FastBasic, from which the other fast classes inherit, raises a ParameterError if the user passes a parameter which the C engine cannot handle -- for example, the C tokenizer can't deal with a regex comment string, so it refuses to read if the user passes anything other than a one-character string as the comment parameter. Similarly, converters, Outputter, and other flexible parameters cannot be passed to a fast reader.
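To make the state-machine idea more concrete, here's a minimal standalone sketch (with made-up state names and a made-up demo_tokenize function, not the code from the fast-c-reader branch) of what character-by-character tokenizing with a state machine looks like: the loop reads one character at a time and decides what to do based on which state it's currently in.

#include <stdio.h>

/* Hypothetical states for this demo; the real tokenizer needs more
 * (quoted fields, whitespace handling, and so on). */
typedef enum { START_FIELD, IN_FIELD, IN_COMMENT } demo_state;

/* Walk the input one character at a time, switching behavior on the
 * current state: this is the basic shape of a state-machine tokenizer. */
static void demo_tokenize(const char *source, char delimiter, char comment)
{
    demo_state state = START_FIELD;
    for (const char *c = source; *c != '\0'; ++c) {
        switch (state) {
        case START_FIELD:
            if (*c == comment) {
                state = IN_COMMENT;            /* ignore the rest of the line */
            } else if (*c == '\n') {
                printf("<end of row>\n");
            } else if (*c != delimiter) {
                putchar(*c);                   /* first character of a new field */
                state = IN_FIELD;
            }
            break;
        case IN_FIELD:
            if (*c == delimiter) {
                printf("<end of field> ");
                state = START_FIELD;
            } else if (*c == '\n') {
                printf("<end of row>\n");
                state = START_FIELD;
            } else {
                putchar(*c);                   /* still inside the current field */
            }
            break;
        case IN_COMMENT:
            if (*c == '\n')
                state = START_FIELD;           /* comments run to the end of the line */
            break;
        }
    }
}

int main(void)
{
    demo_tokenize("A,B,C\n# a comment\n10,5.,6\n1,2,3\n", ',', '#');
    return 0;
}
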
The real heart of the algorithm, the C function tokenize, is currently fairly small, but it'll grow as I add back more functionality like ignoring leading and trailing whitespace, replacing fill values, etc. It deals with the struct tokenizer_t (roughly sketched after the list below), which has four main components:
  • char *source, a single string containing all of the input data
  • char **output_cols, an array of strings representing the output data for each column
  • int *row_positions, an array of positions denoting where each row begins in output_cols
  • tokenizer_state state, an enum value denoting the current state of the tokenizer
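Roughly, that means the struct looks something like the following; the particular enum values and the extra bookkeeping members (num_cols, num_rows) are shorthand for this sketch rather than the exact layout in the branch, which also has to track things like capacities and the delimiter and comment characters.

/* A rough sketch of the tokenizer struct; field names beyond the four
 * components listed above are illustrative, not the actual layout. */
typedef enum {
    START_LINE,
    START_FIELD,
    FIELD,
    COMMENT
} tokenizer_state;

typedef struct {
    char *source;          /* entire input table as a single string */
    char **output_cols;    /* one output string per column */
    int *row_positions;    /* offset at which each row begins in a column string */
    tokenizer_state state; /* current state of the state machine */
    int num_cols;          /* number of columns, known after the header is read */
    int num_rows;          /* number of rows tokenized so far */
} tokenizer_t;
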
I'm treating row_positions and each output_cols[i] as "smart arrays"; when their capacity is exceeded, the functions resize_rows() or resize_cols() call realloc() to double the size of the array (see the sketch after the example below). Later I'll see if there might be a more efficient way to store output data on the fly, but for now it seems like a reasonably memory-efficient approach. Anyway, output_cols is initialized after header reading, which provides the number of columns in the table (or after reading the first line of table data if header_start=None). Some values in output_cols are simply '\x00'; these act as fill values, because each row takes up the same number of characters in every output_cols[i], and row_positions denotes where these contiguous blocks begin. For example, if the input is "A,B,C\n10,5.,6\n1,2,3", the following will hold:
  • source: "A,B,C\n10,5.,6\n1,2,3"
  • output_cols: ["A101", "B5.2", "C6\x003"]
  • row_positions: [0, 1, 3]
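Here's a small, self-contained sketch of both ideas -- growing a column buffer by doubling with realloc(), and recovering fields using row_positions and the '\x00' fill values -- with placeholder names (col_buffer, ensure_capacity) standing in for the actual resize_cols()/resize_rows() machinery:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder per-column buffer; the real tokenizer_t tracks its
 * capacities differently, so treat these field names as illustrative. */
typedef struct {
    char *data;
    size_t used;
    size_t capacity;
} col_buffer;

/* Double the buffer (via realloc) whenever `extra` more bytes would
 * overflow it, so that repeated appends stay cheap on average. */
static int ensure_capacity(col_buffer *col, size_t extra)
{
    if (col->used + extra <= col->capacity)
        return 0;
    size_t new_capacity = col->capacity ? col->capacity : 4;
    while (col->used + extra > new_capacity)
        new_capacity *= 2;
    char *new_data = realloc(col->data, new_capacity);
    if (new_data == NULL)
        return -1;  /* out of memory */
    col->data = new_data;
    col->capacity = new_capacity;
    return 0;
}

int main(void)
{
    /* Rebuild column 3 of the example "A,B,C\n10,5.,6\n1,2,3".
     * The row widths are 1, 2 and 1 (set by the widest field in each row),
     * so "6" is padded with a '\x00' fill value, giving the column string
     * "C6\x003" with row_positions = {0, 1, 3}. */
    const char *fields[] = {"C", "6", "3"};
    size_t row_widths[] = {1, 2, 1};
    int row_positions[3];
    col_buffer col = {NULL, 0, 0};

    for (int row = 0; row < 3; ++row) {
        row_positions[row] = (int) col.used;
        if (ensure_capacity(&col, row_widths[row]) != 0)
            return 1;
        memset(col.data + col.used, '\x00', row_widths[row]);       /* fill values */
        memcpy(col.data + col.used, fields[row], strlen(fields[row]));
        col.used += row_widths[row];
    }

    /* Read the fields back by slicing between consecutive row positions
     * (the widths are reused directly here) and trimming '\x00' padding. */
    for (int row = 0; row < 3; ++row) {
        const char *start = col.data + row_positions[row];
        const char *pad = memchr(start, '\0', row_widths[row]);
        int len = pad ? (int) (pad - start) : (int) row_widths[row];
        printf("row %d starts at %d: %.*s\n", row, row_positions[row], len, start);
    }

    free(col.data);
    return 0;
}
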
There's more I've been doing, such as implementing header_start/data_start, handling comments, etc. Until next week!
