I spent this week mostly working on the multiprocessing branch, fixing up some issues from last week and adding a bit more functionality. Most importantly, I finally switched to a more efficient scheme for reading data when given a filename as input; I'd been meaning to deal with this previously, since profiling indicated that the naïve method of simply reading file data into a single Python string took up something like 15% of processing time, so it's nice to have a more efficient method.
One idea I had thought of previously was to split file reading into each process, but my initial changes showed a speed decrease, which made sense after I came across this comment on StackOverflow explaining that it's best from a performance standpoint to access file data sequentially. I then switched over to reading separate chunks for each process in the main process at the beginning of parsing and then passing each chunk to its respective process, which seems more promising but still runs into issues with memory management. I've been trying to find a better solution today, and I think I should be able to figure it out by tomorrow.
Another issue I looked into was finding a faster algorithm for converting strings to doubles, based on Pandas' superior performance. After looking into what Pandas does, I found that a fairly simple conversion function
xstrtod()
replaces strtod()
from the standard library; from what I was told by a Pandas developer, it seems that Pandas considers the speed gains more important than retaining the relatively slow correction loop in strtod()
for high-precision values and is therefore willing to have values off by more than 0.5 ULP (units in the last place). numpy's conversion routine doesn't seem to offer a clear benefit (I tried replacing strtod()
with numpy's conversion and got mixed results), so I'm not sure if any optimization is possible.
I then added a
parallel
parameter for reading and patched up the failing tests (due to multi-line quoted values, which Tom suggested should be an acceptable but documented problem with parallelized reading). I also fixed a unicode issue with string conversion in Python 3, which arose because the 'S' dtype
in numpy corresponds to bytes
in Python3 rather than str
. Interestingly I found it a fairly trivial extension of the file reading code to implement memory mapping for reading on non-Windows systems, so I tentatively added it with a memory_map
keyword and will test it more thoroughly to see if the performance boost is anything very significant. This does not enable a memory-mapped view of the file inside the output Table
, but simply memory-maps the file to a char *
pointer for faster reading and then un-maps it after processing is done.
Next week I'll be working on Erik's idea of creating custom numpy
dtypes
, and I'll also add documentation, make some final fixes, and otherwise get the engine ready for merging in the meantime.
No comments:
Post a Comment