
How To Make Pandas HDFStore 'put' Operation Faster

I'm trying to build an ETL toolkit with pandas and HDF5. My plan was to extract a table from MySQL into a DataFrame, then put this DataFrame into an HDFStore. But the put operation turned out to be very slow.

Solution 1:

I am pretty convinced your issue is related to the type mapping of the actual types in the DataFrame and how they are stored by PyTables.

  • Simple types (floats/ints/bools) that have a fixed representation are mapped to fixed c-types
  • Datetimes are handled if they can be properly converted (i.e. they have a dtype of 'datetime64[ns]'); notably, datetime.date is NOT handled (NaN is a different story and, depending on usage, can cause the entire column type to be mishandled)
  • Strings are mapped (a Storer maps them to the Object type; a Table maps them to String types)
  • Unicode is not handled
  • All other types are handled as Object in a Storer, or an Exception is thrown for a Table

What this means is that if you do a put to a Storer (a fixed representation), then all of the non-mappable types become Object, and PyTables pickles those columns. See the reference below for ObjectAtom:

http://pytables.github.com/usersguide/libref/declarative_classes.html#the-atom-class-and-its-descendants
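
To see up front which columns would fall into the pickled ObjectAtom path, you can inspect the frame's dtypes before storing. A minimal sketch (column names and values are illustrative), assuming the Python 2 / pandas 0.10-era API used elsewhere in this answer:

import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({'f': np.random.randn(3),               # float64 -> fixed c-type
                   'i': np.arange(3),                     # int64   -> fixed c-type
                   's': ['a', 'b', 'c'],                  # object  -> pickled by a Storer
                   'd': [datetime.date(2013, 1, 1)] * 3}) # object  -> not handled

# columns of dtype 'object' are the ones PyTables will pickle (ObjectAtom)
for col, dtype in df.dtypes.iteritems():
    print col, dtype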

A Table will raise on an invalid type (I should provide a better error message here). I think I will also provide a warning if you try to store a type that is mapped to ObjectAtom (for performance reasons).
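
For example (a hypothetical illustration; the exact exception and message will vary by version), a table put of a frame holding an unsupported type is expected to raise:

import datetime

import pandas as pd

df = pd.DataFrame({'d': [datetime.date(2013, 1, 1)]})

store = pd.HDFStore('test_raise.h5', 'w')
try:
    # datetime.date is not a mappable Table type, so this should raise
    store.put('df', df, table=True)
except Exception as e:
    print 'table put raised: %s' % e
finally:
    store.close()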

To force some types, try some of these:

import pandas as pd

# convert None to NaN (it is currently Object);
# this converts to float64 (or to the type of the other objects)
x = pd.Series([None])
x = x.where(pd.notnull(x)).convert_objects()

# convert datetime-like values with embedded NaNs to datetime64[ns]
df['foo'] = pd.Series(df['foo'].values, dtype='M8[ns]')
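
A column of datetime.date objects (which, as noted above, is NOT handled) can be coerced the same way. A sketch assuming pd.to_datetime is available in your version and 'bar' is an illustrative column name:

import datetime

import pandas as pd

df = pd.DataFrame({'bar': [datetime.date(2013, 1, 1), None]})

# datetime.date values would be stored as pickled Objects unless coerced;
# datetime64[ns] gives them a fixed, fast on-disk representation
df['bar'] = pd.to_datetime(df['bar'])
print df.dtypes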

Here's a sample on 64-bit Linux (the file is 1M rows, about 1 GB in size on disk):

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.10.1.dev'

In [3]: import tables

In [4]: tables.__version__
Out[4]: '2.3.1'

In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
   ...: 1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])

In [5]: for x in range(20):
   ...:     df['String%03d' % x] = 'string%03d' % x

In [6]: df
Out[6]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 120 entries, E000 to String019
dtypes: float64(100), object(20)

# storer put (cannot query) 
In [9]: def test_put():
   ...:     store = pd.HDFStore('test_put.h5','w')
   ...:     store['df'] = df
   ...:     store.close()

In [10]: %timeit test_put()
1 loops, best of 3: 7.65 s per loop

# table put (can query)
In [7]: def test_put():
      ....:     store = pd.HDFStore('test_put.h5','w')
      ....:     store.put('df',df,table=True)
      ....:     store.close()


In [8]: %timeit test_put()
1 loops, best of 3: 21.4 s per loop

Solution 2:

How to make this faster?

  1. Use 'io.sql.read_frame' to load the data from the SQL db into a DataFrame; 'read_frame' takes care of columns whose type is 'decimal' by turning them into float.
  2. Fill in the missing data for each column.
  3. Call 'DataFrame.convert_objects' before the put operation.
  4. If there are string-typed columns in the DataFrame, use 'table' instead of 'storer' (a combined sketch of all four steps follows the snippet below):

store.put('key', df, table=True)
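
Putting the four steps together, a minimal end-to-end sketch, assuming the pandas 0.10-era API ('io.sql.read_frame', 'convert_objects', table=True) and illustrative MySQL connection details; pick fill values appropriate to each column's dtype:

import MySQLdb
import pandas as pd
import pandas.io.sql as psql

# illustrative connection and query
conn = MySQLdb.connect(host='localhost', user='user', passwd='passwd', db='mydb')
df = psql.read_frame('SELECT * FROM my_table', conn)  # 1. decimals become float

df = df.fillna(0)           # 2. fill missing data (0 is an illustrative fill value)
df = df.convert_objects()   # 3. coerce remaining object columns to concrete dtypes

# 4. string columns present -> use a table, not a storer
store = pd.HDFStore('etl.h5', 'w')
store.put('key', df, table=True)
store.close()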

After doing these jobs, the put operation shows a big improvement on the same data set:

CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
Wall time: 98.97 s

Profile logs of the second test:

95984 function calls (95958 primitive calls) in 68.688 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
       19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
       16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
       19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
        4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
        1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
        7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
       11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
        1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
       19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
        1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
     1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
        4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
        1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
        4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
       35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
        1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
        5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
       48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
        4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
        1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
       28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
       36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
     6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
        4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
        6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
       18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
    11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
       19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
     1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
    11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
        2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
        1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
        4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)
