Monday, August 18, 2014

Week 13

This was the final week of Google Summer of Code, and since last Monday was the suggested "pencils down" date, I spent the week focusing on getting the main pull request ready for merging. I began by testing the new fast converter for unusual input, then handled issues Erik noted with the PR, filed an issue with Pandas, and began work on a new branch which implements a different memory scheme in the tokenizer. The PR seems to be in a final review stage, so hopefully it'll be merged by next week.
After testing out xstrtod(), I noticed a couple of problems with extreme input values and fixed them; the most notable was an inability to handle subnormals (values with exponent less than -308). As of now, the converter seems to work pretty well for a wide range of input, and the absolute worst-case error seems to be around 3.0 ULP. Interestingly, when I reported the problems with the old xstrtod() as a bug in Pandas, the response I received was that the current code should remain as-is, but that a new parameter float_precision might be added to allow for more accurate conversion. Both Tom and I found this response a little bizarre, since the behavior of xstrtod() looks like a genuine bug, but in any case I have an open PR to implement this in Pandas.
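For the curious, one common way a hand-rolled converter breaks below 1e-308 is that the power-of-ten scale factor itself underflows to zero; applying the scaling in two steps avoids that. A minimal sketch of the idea (illustrative only, not the actual xstrtod() code):

    #include <math.h>
    #include <stdio.h>

    /* Scale a parsed significand by 10^exponent without the scale factor
     * underflowing to zero for exponents below -308 (subnormal range). */
    static double scale_by_pow10(double significand, int exponent)
    {
        if (exponent < -308) {
            /* Apply part of the scaling first so the intermediate value
             * stays representable, then apply the remainder. */
            significand *= 1e-308;
            exponent += 308;
        }
        return significand * pow(10.0, exponent);
    }

    int main(void)
    {
        /* A value near the smallest positive subnormal double. */
        printf("%g\n", scale_by_pow10(4.9, -324));
        return 0;
    }

A naive single multiplication by pow(10.0, -324) would produce zero, since that scale factor is itself too small to represent.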
Aside from this, Erik pointed out some suggestions and concerns about the PR, which I dealt with in new commits. For example, he suggested that I use the mmap module in Python rather than dealing with platform-dependent memory mapping in C, which makes more sense for the sake of portability. He also pointed out that the method FileString.splitlines(), which returns a generator yielding lines from the memory-mapped file, was inefficient due to repeated calls to chr(). I ultimately rewrote it in C; although its performance really only matters for commented-header files with the header line deep in the file, the new approach gave more than a 2x speedup on a 10,000-line integer file whose commented header line is in the last row.
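The C version amounts to scanning the mapped buffer for newline characters and handing back one line at a time, instead of building the lines up character by character in Python. A rough sketch of that core loop (hypothetical names, not the code in the PR):

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Call a handler on each line of a memory-mapped buffer by scanning for
     * '\n' with memchr() rather than inspecting one character at a time. */
    static void for_each_line(const char *buf, size_t len,
                              void (*handle)(const char *line, size_t line_len))
    {
        const char *pos = buf;
        const char *end = buf + len;

        while (pos < end) {
            const char *newline = memchr(pos, '\n', (size_t)(end - pos));
            size_t line_len = newline ? (size_t)(newline - pos) : (size_t)(end - pos);
            handle(pos, line_len);
            pos += line_len + 1;  /* skip past the newline (or past the end) */
        }
    }

    static void print_line(const char *line, size_t line_len)
    {
        printf("%.*s\n", (int)line_len, line);
    }

    int main(void)
    {
        const char *data = "# commented header\n1 2 3\n4 5 6\n";
        for_each_line(data, strlen(data), print_line);
        return 0;
    }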
Although it won't be a part of the main PR, I've also been working on a separate branch, change-memory-layout, which changes how the tokenizer stores its output in memory. The main purpose of this branch is to reduce the memory footprint of parsing, since peak memory usage is currently almost twice that of Pandas; the basic idea is that instead of storing output in char **output_cols, output is stored in a single string char *output, and an array of pointers, char **line_ptrs, records the beginning of each line for conversion purposes. While I'm still working on memory improvements, I actually managed to get a bit of a speed boost with this approach. Pure floating-point data is now slightly quicker to read with io.ascii than with Pandas, even without multiprocessing enabled!
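To make the layout change concrete, here is a minimal sketch of the two schemes (the struct name and the '\0'-separated-fields detail are my shorthand, not the branch's actual declarations):

    /* Old layout: one growable string per column; the fields of a given
     * column are stored contiguously in output_cols[col].
     *
     *   char **output_cols;   // output_cols[col] holds every field of that column
     *
     * New layout on the change-memory-layout branch: all fields live in one
     * buffer, and a per-line pointer array records where each row starts so
     * the conversion step can still walk the data row by row. */
    typedef struct {
        char  *output;      /* single contiguous buffer of '\0'-separated fields */
        char **line_ptrs;   /* line_ptrs[row] points at the first field of row   */
        int    num_rows;
    } output_buffer_t;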
Since today is the absolute pencils down date, this marks the official end of the coding period and the end of my blog posts. I plan to continue responding to the review of the main PR and finish up the work in my new branch, but the real work of the summer is basically over. It's been a great experience, and I'm glad I was able to learn a lot and get involved in Astropy development!

Tuesday, August 12, 2014

Week 12

There's not too much to report for this week, as I basically worked on making some final changes and double-checking the writing code to make sure it covers the full functionality of the legacy writer. After improving performance issues related to tokenization and string conversion, I created a final version of the IPython notebook for reading. Since IPython doesn't work well with multiprocessing, I wrote a separate script to test the performance of the fast reader in parallel and output the results in an HTML file; here are the results on my laptop. Parallel reading seems to work well for very large input files, and I guess the goal of beating Pandas (at least for huge input and ordinary data) is basically achieved! Writing is still a little slower than the Pandas method to_csv, but I fixed an issue involving custom formatting; the results can be viewed here.
I also wrote up a separate section in the documentation for fast ASCII I/O, although there's still the question of how to incorporate IPython notebooks into the documentation. For now I have the notebooks hosted in a repo called ascii-profiling, but they may be moved to a new repo called astropy-notebooks. More importantly, Tom noticed that there must be something wrong with the fast converter (xstrtod()), since increasing the number of significant figures seems to scale the potential conversion error roughly linearly. After looking over xstrtod() and reading more about IEEE floating-point arithmetic, I found a reasonable solution: force xstrtod() to stop parsing digits after the 17th significant digit (since 17 significant decimal digits are enough to uniquely determine any double) and correct an issue in the second half of xstrtod(), where the significand is scaled by a power of ten. I tested the new version of xstrtod() in the conversion notebook and found that low-precision values are now guaranteed to be within 0.5 ULP, while high-precision values are within 1.0 ULP about 90% of the time, with no linear growth in error.
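Since digits beyond the 17th cannot change which double is nearest, the parser can count them toward the exponent (or drop them) instead of accumulating them. A simplified illustration of the clamping idea, not the actual xstrtod():

    #include <ctype.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_SIG_DIGITS 17  /* 17 significant decimal digits determine a double */

    /* Parse an unsigned decimal like "0.12345..." into a double, ignoring
     * digits beyond the 17th significant one; no sign, exponent field, or
     * error handling, purely to show the clamping. */
    static double parse_decimal(const char *s)
    {
        double value = 0.0;
        int exponent = 0, num_digits = 0, seen_point = 0;

        for (; *s; ++s) {
            if (*s == '.') { seen_point = 1; continue; }
            if (!isdigit((unsigned char)*s)) break;
            if (num_digits < MAX_SIG_DIGITS) {
                value = value * 10.0 + (*s - '0');
                if (value != 0.0) ++num_digits;  /* leading zeros are not significant */
                if (seen_point) --exponent;      /* fractional digit shifts the scale */
            } else if (!seen_point) {
                ++exponent;  /* past the cap: integer digits still scale the value */
            }                /* fractional digits past the cap are simply dropped  */
        }
        return value * pow(10.0, exponent);
    }

    int main(void)
    {
        const char *s = "0.12345678901234567890123";
        printf("clamped: %.17g\nstrtod:  %.17g\n", parse_decimal(s), strtod(s, NULL));
        return 0;
    }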
Once I commit the new xstrtod(), my PR should be pretty close to merging--at this point I'll probably write some more tests just to make sure everything works okay. Today is the suggested "pencils down" date of Google Summer of Code, so I guess it's time to wrap up.

Monday, August 4, 2014

Week 11

Since the real goal at this point is to finish up my main PR and my multiprocessing branch in order to merge, I ended up spending this week on final changes instead of Erik's dtype idea. My code's gotten some more review and my mentors and I have been investigating some details and various test cases, which should be really useful for documentation.
One nice thing I managed to discover was how well xstrtod() (the fast float conversion function borrowed from Pandas) works for various input precisions. Unlike strtod(), which is guaranteed to be within 0.5 ULP (units in the last place, i.e. the spacing between adjacent representable floating-point numbers) of the correct result, xstrtod() has no general bound and in fact might be off by several ULP for input with numerous significant figures. However, it works pretty well when the number of significant figures is relatively low, so users might prefer to pass use_fast_converter=True for fairly low-precision data. I wrote up an IPython notebook showing a few results, which Tom also built on in another notebook. I plan to include the results in the final documentation, as users might find it useful to know more about rounding issues with parsing and judge whether or not the fast converter is appropriate for their purposes.
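As an aside, measuring "off by N ULP" is easy because finite IEEE-754 doubles of the same sign order the same way as their bit patterns; a small helper along these lines (a sketch, not the notebook's exact code) is all it takes:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Distance in ULPs between two finite doubles of the same sign, found by
     * reinterpreting the bit patterns as integers and subtracting. */
    static int64_t ulp_distance(double a, double b)
    {
        int64_t ia, ib;
        memcpy(&ia, &a, sizeof ia);
        memcpy(&ib, &b, sizeof ib);
        return ia > ib ? ia - ib : ib - ia;
    }

    int main(void)
    {
        const char *text = "3.141592653589793238462643";
        double exact = strtod(text, NULL);   /* correctly rounded reference      */
        double approx = 3.1415926535897936;  /* the next representable double up */
        printf("error: %lld ULP\n", (long long)ulp_distance(exact, approx));
        return 0;
    }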
On the multiprocessing branch, I added in xstrtod() and the use_fast_converter parameter, which defaults to False. After discussion with my mentors, I changed the file reading system so that the parser employs memory mapping whenever it reads from a file; the philosophy is that, aside from speed gains, reading a 1 GB file via memory mapping will save users from having to load a full gigabyte into memory. The main challenge with memory mapping (and later with other compatibility concerns) is getting the code to run correctly on Windows, which turned out to be more frustrating than I expected.
Since Windows doesn't have the POSIX function mmap(), Windows-specific memory mapping code has to be wrapped in an #ifdef _WIN32 block, and the fact that Windows has no fork() call for multiprocessing means that memory is not simply copy-on-write as it is on Linux, which leads to a host of other issues. For example, I first ran into a weird issue involving pickling that ultimately turned out to be due to a six bug, which has been noted for versions < 1.7.0. I opened a PR to update the bundled version of six in AstroPy, so that issue should be fixed pretty quickly. There were some other problems, such as the fact that processes cannot be created with bound methods on Windows (which I circumvented by turning _read_chunk into a normal method and making CParser picklable via a custom __reduce__ method), but things seem to work correctly on Windows now. I actually found that there was a slowdown when switching to parallel reading, but I couldn't find a cause with cProfile; it might be due to the very old laptop I used for testing, so I'll have to find some other way to see if there's actually a problem.
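The platform split for the mapping itself ends up looking roughly like this (a generic sketch with error handling trimmed and a hypothetical map_file() name, not the exact code in the branch):

    #include <stddef.h>

    #ifdef _WIN32
    #include <windows.h>

    /* Map an already-open file into memory on Windows. */
    static char *map_file(HANDLE file, size_t length)
    {
        HANDLE mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (mapping == NULL)
            return NULL;
        char *data = (char *)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, length);
        CloseHandle(mapping);  /* the view keeps the mapping object alive */
        return data;
    }
    #else
    #include <sys/mman.h>

    /* Map an already-open file into memory on POSIX systems. */
    static char *map_file(int fd, size_t length)
    {
        void *data = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
        return data == MAP_FAILED ? NULL : (char *)data;
    }
    #endif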
Tom also wrote a very informative IPython notebook detailing the performance of the new implementation compared to the old readers, genfromtxt(), and Pandas, which I'll include in the documentation for the new fast readers. It was also nice to see an interesting discussion regarding metadata parsing in issue #2810 and a new PR to remove boilerplate code, which is always good. I also made a quick fix to the HTML reading tests and opened a PR to allow for a user-specified backend parser in HTML reading, as Tom pointed out that certain files will work with one backend (e.g. html5lib) and not the default.

Monday, July 28, 2014

Week 10

I spent this week mostly working on the multiprocessing branch, fixing up some issues from last week and adding a bit more functionality. Most importantly, I finally switched to a more efficient scheme for reading data when given a filename as input; I'd been meaning to deal with this previously, since profiling indicated that the naïve method of simply reading file data into a single Python string took up something like 15% of processing time, so it's nice to have a more efficient method.
One idea I had thought of previously was to split file reading into each process, but my initial changes showed a speed decrease, which made sense after I came across this comment on StackOverflow explaining that it's best from a performance standpoint to access file data sequentially. I then switched over to reading separate chunks for each process in the main process at the beginning of parsing and then passing each chunk to its respective process, which seems more promising but still runs into issues with memory management. I've been trying to find a better solution today, and I think I should be able to figure it out by tomorrow.
Another issue I looked into was finding a faster algorithm for converting strings to doubles, based on Pandas' superior performance. After looking into what Pandas does, I found that a fairly simple conversion function xstrtod() replaces strtod() from the standard library; from what I was told by a Pandas developer, it seems that Pandas considers the speed gains more important than retaining the relatively slow correction loop in strtod() for high-precision values and is therefore willing to have values off by more than 0.5 ULP (units in the last place). numpy's conversion routine doesn't seem to offer a clear benefit (I tried replacing strtod() with numpy's conversion and got mixed results), so I'm not sure if any optimization is possible.
I then added a parallel parameter for reading and patched up the failing tests (which failed due to multi-line quoted values; Tom suggested this should be an acceptable but documented limitation of parallelized reading). I also fixed a unicode issue with string conversion in Python 3, which arose because the 'S' dtype in numpy corresponds to bytes rather than str in Python 3. Interestingly, implementing memory mapping for reading on non-Windows systems turned out to be a fairly trivial extension of the file reading code, so I tentatively added it behind a memory_map keyword and will test it more thoroughly to see whether the performance boost is significant. This does not enable a memory-mapped view of the file inside the output Table; it simply memory-maps the file to a char * pointer for faster reading and then unmaps it once processing is done.
Next week I'll be working on Erik's idea of creating custom numpy dtypes, and I'll also add documentation, make some final fixes, and otherwise get the engine ready for merging in the meantime.

Monday, July 21, 2014

Week 9

This week was a pretty straightforward one--I spent most of it working on implementing parallelization in the Cython reading algorithm (in a new branch, multiprocessing). After Tuesday's discussion about what problem to pursue next, we decided that I should spend a week or two on parallelization and then begin to work on creating custom numpy dtypes, which will be an extension of Erik's work and should be applicable to FITS files as well as ASCII to some extent.
In short, the new strategy I implemented for reading involves splitting the input data into N chunks (currently N=8, but I'll have to find an optimal N at some point) and performing both tokenization and conversion in a separate process for each chunk. There are a few subtleties involved; for example, chunk boundaries need to fall on newlines, certain data has to be copied for each process, etc. One irritating issue I came across concerns conversion when one process discovers string data and another discovers numeric data in the same column. The basic idea at the end of the algorithm is to take the resulting data from each process and concatenate the chunks together, using widening conversions (i.e. int to float to str) when chunks have different data types. However, if a column contains integers in the first chunk, floats in the second, and strings in the third, concatenating the first two would turn 5 into 5.0, and then concatenating the result with the third chunk would yield "5.0" instead of the desired "5". The obvious solution is to check explicitly for string data in any of the chunks and immediately convert all the columns to strings, but this still results in improper behavior for numbers in scientific notation, for example (1.3e2 -> 130.0 -> "130.0" instead of the original "1.3e2"). I ended up simply using a multiprocessing.Queue to force processes to reconvert columns as strings in this case, discarding the original converted data altogether. Luckily this seems like a pretty unlikely corner case for large data files, so performance shouldn't really take a hit.
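The chunk-boundary handling mentioned above is straightforward: cut the buffer at roughly equal offsets, then push each cut forward to the next newline so no row straddles two processes. A sketch of that step (hypothetical names, not the branch's actual code):

    #include <stddef.h>
    #include <string.h>

    #define N_CHUNKS 8

    /* Given the whole input buffer, record a start offset for each of N_CHUNKS
     * chunks such that every chunk after the first begins just past a newline. */
    static void find_chunk_starts(const char *buf, size_t len, size_t starts[N_CHUNKS])
    {
        starts[0] = 0;
        for (int i = 1; i < N_CHUNKS; ++i) {
            size_t pos = len * i / N_CHUNKS;          /* rough equal split */
            const char *nl = memchr(buf + pos, '\n', len - pos);
            /* If no newline remains, this chunk (and the later ones) are empty. */
            starts[i] = nl ? (size_t)(nl - buf) + 1 : len;
        }
    }

Chunk i then spans [starts[i], starts[i+1]), with the final chunk ending at the end of the buffer.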
A more serious problem, and one I haven't quite decided how to handle, is how to deal with quoted fields spanning multiple lines. Since my approach involves splitting data into chunks by lines, an input string like 'a b c\n1 2 "3\n4"' will not parse correctly because one chunk will try parsing the line 1 2 "3 (noticing an unfinished quote sequence) and another will try parsing 4" (noticing an unexpected quote character and too few data values in the row). Unfortunately no good solution comes to mind, since explicitly searching through the input for quote characters would basically negate the performance gains of multiprocessing in the first place. I plan on asking my mentors what they think, since I'm not sure how commonly this sort of case occurs.
All tests are passing with the new multiprocessing algorithm except for those involving the quoting issue, and there is a speed gain over the old method. Unfortunately it's a little less than I hoped it would be (Tom warned that this could happen, citing Amdahl's Law), but it's still fairly nice; on a sample file that took the old algorithm about 0.11 seconds (the same file I've been using for a few weeks...I should probably start trying other ones), the new algorithm takes about 0.075 seconds, a speed gain of roughly 45%. I've been working on improving the reader's performance when reading from a file, since reading the file into a single string takes up about 0.012 seconds of that time, although I'm not yet finished. Table.__init__() is also taking about 0.01 seconds, which is probably unavoidable. Pandas takes about 0.055 seconds on the same file. It's getting trickier to cut down the time at this point, but hopefully I can edge out Pandas in the end!

Monday, July 14, 2014

Week 8

After improving the fast writer and fixing up some lingering issues last week, I spent this week looking into porting my reading implementation to numpy and reading the memory mapping code in io.fits to learn more about memory mapping. This felt like a bit of a slow week, since I was sick on Thursday and had trouble figuring out a problem with numpy's build system on Friday, but I think my work on adapting my implementation to genfromtxt() should definitely come in useful (at least at some point when I have time to finish the work).
I tidied up the code and made a few final PRs on Monday and Tuesday, and then at Tom's suggestion I opened a PR for review (https://github.com/astropy/astropy/pull/2716#issuecomment-48749423). Since it's a big PR, I expect it'll take some time to receive an extensive code review, but it's nice to have a roughly final implementation that passes on Travis CI. I also opened a PR in astropy-benchmarks for the io.ascii benchmarks I wrote at the beginning of the summer (https://github.com/astropy/astropy-benchmarks/pull/7). From there I went on to look at the memory mapping in io.fits and read about memory mapping in general, and I'm sure this will be something to discuss in the next hangout; I'm curious whether the Python mmap module will work with an underlying C layer, or how else memory mapping might be achieved.
More importantly, I spent much of the week changing my text reading code to work in numpy as an engine for genfromtxt(). After Tom pointed me to a thread on the numpy discussion list, I forked numpy and started writing a new version of genfromtxt() in a separate branch: https://github.com/amras1/numpy/tree/faster-genfromtxt. I also spoke to a numpy developer on IRC who participated in the mailing list thread, and he believed that my implementation might be of use. He also explained that a particularly important problem with loadtxt() and genfromtxt() is their broken handling of unicode with Python 3, which necessitates a full rewrite of the two functions. After that, I worked on altering the C tokenizing code to handle UTF-8 rather than plain ASCII and wrote related code to have the Cython/C parser work underneath genfromtxt(). As of now, genfromtxt() is working with relatively simple sample data, although there's still more to do with handling the function's parameters (e.g. missing_values and usemask).
Adapting the Cython/C reader to numpy has been a little more difficult than I thought, but I'm glad to have learned a lot of useful information about unicode and the differences between Python 2.x and 3.x. I'm not really sure what I'll be doing next week, but I'll know after the hangout meeting tomorrow.

Monday, July 7, 2014

Week 7

After the bulk work of the last two weeks, I spent this week mostly fixing up some minor issues with the fast readers and adding some more functionality. I think I can finally say that the C reader is more or less done; I might want to work more on getting its speed closer to Pandas' read_csv(), but it's currently stable and passes the tests I've written as well as the old tests (I added parametrized fixtures to the generic reading tests).
As Tom said in our last hangout, there seem to be a few directions to choose from here: improving the C reader's speed, implementing a fast writer, writing more tests/documentation, etc. This week I opted to work on the fast writer after finishing up lingering issues with the functionality of the C reader, but I'd like to address the other issues as well if time permits. Tom also mentioned that numpy is looking for an overhaul of loadtxt() and genfromtxt(), so if possible I'd like to offer up the code I've written for adaptation in numpy. This is a low priority at the moment, since I'm focusing on tightening up the code for Astropy, but at any rate I may be able to donate the code even if I don't have time to work on adapting it to genfromtxt()/loadtxt()—the developers on the numpy mailing list mentioned looking into Pandas, but they might find my implementation more adaptable (albeit slower).
During the first half of the week, I fixed some issues I had previously delayed, such as right-stripping whitespace from table fields in the tokenizer and adding a use_fast_reader parameter to the io.ascii infrastructure, as well as adding more comments and writing up a quick design document. I also added new fast readers for commented-header and RDB files after making the _read_header() method in FastBasic more extensible. One idea Tom had was to simply use the old header classes to read file headers, since header reading takes a negligible amount of time, but I found that the header classes were too closely tied to the BaseReader infrastructure to use without modification. For now, at least, overriding _read_header() seems like a reasonable way to allow for some flexibility in header reading.
After that, I worked on creating a fast writing system in Cython. I ran into a number of problems when I tried out my algorithm by parametrizing fixtures in the writing test suite, particularly with masking, but I got it working after a while. I also added some format-specific handling to the current fast readers in order to deal with specific writing issues (e.g. omitting the header line, writing column types in RDB, etc.). As of right now the implementation is passing all tests, but it could use a little more customization and definitely isn't fast enough: the fast writer took 2.8 seconds to write a file that takes 3.5 seconds with the ordinary writer, which is only about 25% faster. Profiling has been a little tricky, since I can't find a line profiler for Cython and splitting into subroutines sometimes doesn't work well (since the overhead of function calls becomes a big factor). However, it's definitely clear that a lot of time is spent in the iter_str_vals() method called on each column for string iteration. I should be able to find a way to cut the writing time down significantly.
Although we're not having a hangout this week, after I get some work done on the writing algorithm I plan to ask Tom what direction would be best to go to from here and at what point I should plan on opening a PR for my branch. According to the schedule I should be looking into the possibility of memory mapping soon, but Tom said he's not really sure if it's a feasible idea; I guess that will be tabled for a little while, but we'll see what happens.

Sunday, June 29, 2014

Week 6

This week was fairly straightforward, as I continued my work from last week on the new Cython/C parsing implementation and began looking for areas of improvement. As of now, my new implementation is pretty consistent with the behavior of Basic and other simple readers; there are a couple minor things tagged "//TODO:" and I'm sure there are issues I haven't yet noticed, but the bulk work is basically over. In order to make sure the implementation works as intended and to provide some examples for anyone reading through my code, I wrote some tests this week in addition to writing new functionality.
A good deal of my time this week was spent on implementing the parameters accepted by ascii.read(), outlined here. First, though, I added FastBasic, FastCsv, and FastTab to the infrastructure of ascii.read() so that they belong to the guess list and can be specified for reading (e.g. format="fast_basic"). After that, I improved the tokenizer by dealing with quoted values, fixing a memory allocation problem in C with realloc, skipping whitespace at the beginning of fields, etc. Incidentally, this last point should be much more efficient than the old method of dealing with leading/trailing whitespace, since whitespace is handled conditionally in the tokenizer and the overhead of calling strip() on a bunch of Python objects is gone.
I also implemented include_names and exclude_names; these will improve parsing performance if specified (unlike the ordinary parsing method, which collects all columns and later throws some out) because the tokenizer only stores the necessary columns in memory. Then I worked on other parameters, such as header_start/data_start/data_end (the last of which will not improve performance if negative), fill_values, fill_include_names/fill_exclude_names, etc. I also made the tokenizer accept a parameter which specifies whether rows with insufficient data should be padded with empty values; FastBasic sets the parameter to False by default, so any such case raises an error, but FastCsv sets it to True, just as Csv's behavior differs from Basic's. Of course, either reader will raise an error upon reading too many data values in a given row.
In our brief meeting on Tuesday, Tom and I talked a little about the conversion part of the algorithm and how we might make it more efficient. I initially used the astype method of numpy.ndarray to try conversion first to int, then to float, and then to string, but I discovered that this was not very efficient; in fact, I ran into memory errors for a couple of the large files I used for benchmarking. I therefore wrote new conversion code which deals with potential masking (in case of bad/empty values, as specified by fill_values) and calls underlying C code which uses strtol() or strtod() to convert strings to ints or floats. This seems to work quite a bit better, but I'd still like to see if there's a better way to convert a chunk of data very efficiently.
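Conceptually the new conversion code is just a loop over a column's fields that tries integers first and falls back to floats and then strings, masking empty fields along the way. A simplified sketch of the integer pass (not the actual Cython/C code):

    #include <errno.h>
    #include <stdlib.h>

    /* Try to convert every field of a column to long; return 0 on success,
     * -1 if any non-empty field fails (the caller would then retry the column
     * as double and finally fall back to strings). Empty fields are masked. */
    static int convert_int_column(char *const *fields, int num_fields,
                                  long *out, char *mask)
    {
        for (int i = 0; i < num_fields; ++i) {
            if (fields[i][0] == '\0') {       /* empty value: mask it */
                mask[i] = 1;
                out[i] = 0;
                continue;
            }
            char *end;
            errno = 0;
            long value = strtol(fields[i], &end, 10);
            if (errno != 0 || *end != '\0')   /* overflow or trailing junk */
                return -1;                    /* column is not all integers */
            out[i] = value;
            mask[i] = 0;
        }
        return 0;
    }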
Anyhow, it's nice to have an implementation down which passes some preliminary tests and seems to perform pretty well. Out of curiosity, I used timeit to check how the new code holds up to the more flexible code and to Pandas. On an integer-filled file which took the flexible code about 0.7 seconds and Pandas about 0.055 seconds, FastBasic took about 0.5 seconds. Not terrible, but I definitely want it to be a lot closer to Pandas' speed, and I think I should be able to improve it quite a bit once I do some profiling. Next week, I'm probably going to finish up the last small fixes I have planned and then focus on improving the overall performance of the implementation. I'll also probably start implementing CSV writing, which should be quite a bit simpler than reading judging from Pandas' code.

EDIT: After I moved some of the code in the conversion functions to C (to avoid the creation of Python strings), FastBasic is now down to about 0.12 seconds with the same file. Nice to have a reader about 6 times as fast as the old one, but there's more to do; Pandas is still about twice as fast as FastBasic!

Sunday, June 22, 2014

Week 5

This was an interesting week, as I began the actual implementation of the new plan for fast reading/writing in io.ascii. My meeting with Tom on Tuesday concluded with the decision to write new routines from scratch instead of using the routines in the Pandas library (although I'm still using ideas from Pandas' approach, of course). Additionally, instead of replacing the existing functionality for basic readers and writers in io.ascii, the plan is to maintain compatibility by creating new, less flexible readers and writers (FastBasic, FastCsv, etc.) which will fall back on the old reading/writing classes when the C engine is too rigid. The user will also have some sort of option to choose between these new, faster readers and the old readers, perhaps through a parameter new passed to ui.read(). Since text-based parsing in Pandas is somewhat inflexible and unwieldy, it should be helpful to have a better-documented and more cohesive library for use in Astropy.
I began a new branch for development called fast-c-reader on my Astropy fork, viewable here. I'm using a three-tiered approach to parsing: pure Python classes FastBasic, FastCsv, etc. interface with the rest of io.ascii; the Cython class CParser acts as an intermediate engine which handles Python method calls and invokes the tokenizer; and C code centered around the struct tokenizer_t (influenced by the Pandas parser_t) performs the dirty work of reading input data quickly with state-machine logic. The Python class FastBasic, from which the other fast classes inherit, raises a ParameterError if the user passes a parameter which the C engine cannot handle -- for example, the C tokenizer can't deal with a regex comment string, so it refuses to read if the user passes anything other than a one-character string as the comment parameter. Similarly, converters, Outputter, and other flexible parameters cannot be passed to a fast reader.
The real heart of the algorithm, the C function tokenize(), is currently fairly small, but it'll grow as I add back in more functionality like ignoring leading and trailing whitespace, replacing fill values, etc. It operates on the struct tokenizer_t, which has four main components (a rough sketch of the struct follows the list):
  • char *source, a single string containing all of the input data
  • char **output_cols, an array of strings representing the output data for each column
  • int *row_positions, an array of positions denoting where each row begins in output_cols
  • tokenizer_state state, an enum value denoting the current state of the tokenizer
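Concretely, the struct looks something like this (a simplified sketch; the real one also carries buffer sizes, the delimiter, quote and comment characters, and so on, and the state names here are just illustrative):

    typedef enum {
        START_LINE,     /* beginning of a row                    */
        START_FIELD,    /* beginning of a field                  */
        FIELD,          /* inside an unquoted field              */
        QUOTED_FIELD,   /* inside a quoted field                 */
        COMMENT         /* inside a comment, skip to end of line */
    } tokenizer_state;

    typedef struct {
        char *source;          /* single string containing all input data        */
        char **output_cols;    /* one growable output string per column          */
        int *row_positions;    /* offset at which each row begins in output_cols */
        tokenizer_state state; /* current state of the state machine             */
        int num_cols;
        int num_rows;
    } tokenizer_t;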
I'm treating row_positions and each output_cols[i] as "smart arrays": when their capacity is exceeded, the functions resize_rows() or resize_cols() call realloc() to double the size of the array (a sketch of this pattern follows the example below). Later I'll see if there might be a more efficient way to store output data on the fly, but for now it seems like a reasonably memory-efficient approach. Anyway, output_cols is initialized after header reading, which provides the number of columns in the table (or after reading the first line of table data if header_start=None). Some values in output_cols are simply '\x00'; these act as fill values because each row takes up the same number of characters in every output_cols[i], and row_positions denotes where these contiguous blocks begin. For example, if the input is "A,B,C\n10,5.,6\n1,2,3", the following will hold:
  • source: "A,B,C\n10,5.,6\n1,2,3"
  • output_cols: ["A101", "B5.2", "C6\x003"]
  • row_positions: [0, 1, 3]
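The "smart array" resizing mentioned above is the standard realloc()-and-double pattern; roughly (not the exact resize_cols() in the branch):

    #include <stdlib.h>
    #include <string.h>

    /* Double the capacity of one output column when it fills up.
     * Returns the new capacity, or 0 if realloc() fails. */
    static size_t resize_col(char **col, size_t old_capacity)
    {
        size_t new_capacity = old_capacity * 2;
        char *grown = realloc(*col, new_capacity);
        if (grown == NULL)
            return 0;  /* caller reports a memory error */
        memset(grown + old_capacity, 0, old_capacity);  /* new space starts as '\x00' fill */
        *col = grown;
        return new_capacity;
    }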
There's more I've been doing, such as implementing header_start/data_start, handling comments, etc. Until next week!

Sunday, June 15, 2014

Week 4

This week, I spent some time inspecting the Pandas library, particularly its algorithm for parsing data from text files. My goal was to analyze Pandas' algorithm and to determine whether the fast C reader in Pandas (written partly in Cython, partly in C) is fit for use in AstroPy or whether I'll have to implement a new approach from scratch. With regards to the differences between AstroPy and Pandas, I made several discoveries:
  • Missing/invalid values are handled differently -- Astropy’s default behavior is to change empty values to ‘0’ while Pandas changes the empty string, ‘NaN’, ‘N/A’, ‘NULL’, etc. to NaN. Both can be altered by passing a parameter (na_values, keep_default_na for Pandas and fill_values for Astropy), but Astropy always masks values while Pandas simply inserts NaN. This might cause confusion because Astropy currently allows for NaN as a non-masked numerical value.
  • Pandas accepts a parameter “comment” for reading, but it will only take a single char while Astropy allows for a regex.
  • Minor issues: Pandas doesn’t ignore empty lines and will add a NaN-filled row instead. Also, line comments are not supported and are treated as empty lines. Both issues are addressed in https://github.com/pydata/pandas/issues/4466 and https://github.com/pydata/pandas/pull/4505. I made a pull request to fix these issues at https://github.com/pydata/pandas/pull/7470, so this should be sorted out fairly soon.
  • Pandas has the parameter “header” to specify where header information begins, but it still expects the header to be a one-line list of names (which could be problematic for fixed-width files with two header lines, for example). It also doesn't have parameters to specify where data begins or ends.
At Tom's suggestion, I also checked out DefaultSplitter.__call__ more closely to see whether the function was actually spending most of its time iterating over csv.reader(). As it turns out, the function spends most of its time calling process_val and executing it (i.e. calling strip()). In fact, I found with cProfile that DefaultSplitter.__call__() became about 70% faster and read() became about 30% faster when process_val is set to None. There doesn't seem to be a quick way to deal with this, since csv.reader has no option to strip whitespace automatically (except for left-stripping), but this is a good example of how Astropy currently wastes a lot of time on overhead and intermediate structures for holding text data.
Along those lines, I read through Wes McKinney's old blog entry describing Pandas' approach to text-based file parsing and discovered that Pandas' greatest advantage over other readers (like numpy) is that it forgoes the use of intermediate Python data structures and instead uses C to tokenize file data with state-machine logic. For my own reference, I wrote up a highly simplified/pseudocode version of the code here. The basic gist is this:
  • read_csv and read_table have the same implementation, which relies on a supplied engine class to do the real work. The C-based reader is the default engine, but there is a Python engine (and a fixed-width engine) which can be employed for greater flexibility.
  • There are two main stages to the reading algorithm: tokenization and type conversion. The latter is fairly similar to Astropy's conversion method; that is, it tries converting each column to ints, then floats, then booleans, and then strings and moves on when conversion succeeds.
  • Tokenization is done in a switch statement which deals with each input character on a case-by-case basis depending on the parser's current state (stored as an enum). Its behavior is fairly customizable, e.g. splitting by whitespace and using a custom line terminator. (A bare-bones sketch of this kind of state machine follows the list.)
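To make that concrete, the skeleton of such a tokenizer is a single pass over the input with a switch on the current state. A bare-bones illustration (state names and callbacks are hypothetical; appending characters to the current field, escaped quotes, and the final unterminated line are all omitted):

    typedef enum { START_FIELD, IN_FIELD, IN_QUOTED_FIELD } state_t;

    /* Walk the input once, dispatching on (state, character) pairs; end_field()
     * and end_line() stand in for whatever records the token boundaries. */
    static void tokenize(const char *c, void (*end_field)(void), void (*end_line)(void))
    {
        state_t state = START_FIELD;

        for (; *c != '\0'; ++c) {
            switch (state) {
            case START_FIELD:
                if (*c == '"')       state = IN_QUOTED_FIELD;
                else if (*c == ',')  end_field();                /* empty field */
                else if (*c == '\n') { end_field(); end_line(); }
                else                 state = IN_FIELD;  /* character would be appended here */
                break;
            case IN_FIELD:
                if (*c == ',')       { end_field(); state = START_FIELD; }
                else if (*c == '\n') { end_field(); end_line(); state = START_FIELD; }
                /* otherwise stay in IN_FIELD; character would be appended here */
                break;
            case IN_QUOTED_FIELD:
                if (*c == '"')       state = IN_FIELD;  /* closing quote */
                break;
            }
        }
    }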
After discussing this with my mentors, I'll work on adapting this basic idea to Astropy. My guess right now (based on the issues noted above) is that I might need to alter Pandas' Cython/C code or reimplement it before integrating it into Astropy. Tom and Mike previously suggested that the plan should be to hybridize io.ascii in some sense, so that some of the fancier readers and writers can still use the current flexible framework while the simpler readers/writers default to faster C-based code. I'll probably write a class CReader from which Csv, Rdb, and other formats can inherit (rather than BaseReader) and which will act as a wrapper for my new implementation. Another thought I had, although I'm not really sure whether it would turn out to be feasible and efficient, would be to replace the existing conversion algorithm with a conversion system tied into the tokenizer, in which each value's dtype is recorded and widening conversions are performed on the fly. Anyhow, I'll be spending the next couple of weeks writing my new implementation.