Performance lessons for reading ASCII files into numpy arrays
There are a few ways to read data from a CSV (or other delimited ASCII) file into a numpy array:
- pandas.read_csv
- numpy.loadtxt
- numpy.genfromtxt
I recently had to read this type of data for a project I'm working on. Since there are a few choices, the performance is worth looking into; this post is the result of that investigation.
Sample data setup
I wrote the following small function to generate a sufficient amount of 'random' data for testing:
```python
import numpy as np

def generate_test_data(column_names, row_count, filename):
    """
    Generate file of random test data of size (row_count, len(column_names))

    column_names - List of column name strings to use as header row
    row_count    - Number of rows of data to generate
    filename     - Name of file to write test data to
    """
    col_count = len(column_names)
    rand_arr = np.random.rand(row_count, col_count)
    header_line = ' '.join(column_names)
    np.savetxt(filename, rand_arr, delimiter=' ', fmt='%1.5f',
               header=header_line, comments='')
```
Hopefully this function is straightforward, so I won't discuss it further.
For this test I simply used the above function to create a relatively small file:
```python
import os
import string
import tempfile

# For testing just create a column for each lower-case letter in the English
# alphabet
columns = [char for char in string.ascii_lowercase]
row_count = 1000

# Don't keep the file open. In order to time things properly we should
# allow each method to open the file, etc. itself.
fd, filename = tempfile.mkstemp()
os.close(fd)

generate_test_data(columns, row_count, filename)
```
This creates a space-separated file of random float data that is about 208 KB, comprising 26 columns and 1000 rows (plus a header row).
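To see what that file actually looks like, here's a standalone sketch that regenerates a tiny file the same way and inspects its first lines (the variable names here are mine, not from the original code):

```python
import os
import string
import tempfile

import numpy as np

# Rebuild a tiny file the same way generate_test_data does,
# then inspect it: one header row plus the data rows.
cols = list(string.ascii_lowercase)
arr = np.random.rand(3, len(cols))

fd, fn = tempfile.mkstemp()
os.close(fd)
np.savetxt(fn, arr, delimiter=' ', fmt='%1.5f',
           header=' '.join(cols), comments='')

with open(fn) as f:
    lines = f.read().splitlines()
os.remove(fn)

print(lines[0])    # the space-separated header: 'a b c ... z'
print(len(lines))  # 4 lines: 1 header + 3 data rows
```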
The following snippet is from an IPython shell, using the %timeit functionality:

```python
>>> import numpy as np
>>> import pandas as pd
>>> %timeit -n 100 pd.read_csv('test.out', delim_whitespace=True)
100 loops, best of 3: 6.66 ms per loop
>>> %timeit -n 100 f = open('test.out', 'r'); f.readline(); np.loadtxt(f, unpack=True)
100 loops, best of 3: 28 ms per loop
```
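As a sanity check that the timing comparison is apples-to-apples, both readers should parse the file into the same numbers. A self-contained sketch (I use sep=r'\s+' here, which is equivalent to delim_whitespace=True and works on newer pandas versions too):

```python
import os
import string
import tempfile

import numpy as np
import pandas as pd

# Regenerate a small test file inline so the snippet is self-contained.
cols = list(string.ascii_lowercase)
data = np.random.rand(10, len(cols))
fd, fn = tempfile.mkstemp()
os.close(fd)
np.savetxt(fn, data, delimiter=' ', fmt='%1.5f',
           header=' '.join(cols), comments='')

df = pd.read_csv(fn, sep=r'\s+')      # equivalent to delim_whitespace=True
with open(fn) as f:
    f.readline()                      # skip the header line for loadtxt
    arr = np.loadtxt(f, unpack=True)  # unpack=True transposes to columns
os.remove(fn)

# Same numbers either way; loadtxt's unpacked result is just the transpose.
assert np.allclose(df.values, arr.T)
```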
The result: pandas is roughly four times faster here. But why? The short answer, as pandas developer Wes McKinney posted in response to my question:
> Short answer: file tokenization and type inference is being handled at the lowest level possible in C/Cython. If you look at the impl of numpy.loadtxt you'll see a lot of Python.
So there you have it, straight from the author! Interestingly enough, the massive speed increases for pandas.read_csv are relatively recent, and Wes has written a few great articles detailing them:
- Speeding up pandas' file parsers with Cython
- A new high performance, memory-efficient file parser engine for pandas
- Update on upcoming pandas v0.10, new file parser, other performance wins
pandas.read_csv will return a DataFrame object, which is essentially a 2-dimensional array with labeled rows and columns. So, the performance improvements of pandas.read_csv could come at the price of adding pandas as another dependency to your project. Also, you'll be getting back a DataFrame object instead of a more stripped-down numpy array.
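If you do accept the pandas dependency but some downstream routine needs a plain ndarray, the DataFrame exposes one directly. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])

# .values hands back the underlying 2-D numpy array (labels dropped)
arr = df.values
print(type(arr), arr.shape)
```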
Note that I used the unpack=True argument to numpy.loadtxt because I wanted to read all the data as column-based arrays, not the default row-based ones. This was just a requirement of the application I was profiling for. This isn't necessary with pandas because the data is read into a DataFrame object, which already allows slicing by column.
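In other words, what unpack=True gives you with loadtxt, a DataFrame gives you by column name. A sketch (the toy columns 'x' and 'y' are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 2), columns=['x', 'y'])

# Column access by name on the DataFrame...
by_name = df['x'].values

# ...matches the column-based arrays that unpack=True would produce.
by_unpack = df.values.T  # rows of this array are the file's columns
assert np.allclose(by_name, by_unpack[0])
```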
Published: 03-08-2013 19:49:01