How To Make Pandas HDFStore 'put' Operation Faster
Solution 1:
I am pretty convinced your issue is related to how the actual dtypes in the DataFrame are mapped to the types PyTables stores.
- Simple types (floats/ints/bools) have a fixed representation and are mapped to fixed c-types.
- Datetimes are handled if they can be properly converted (i.e. they have a dtype of 'datetime64[ns]'); notably, datetime.date objects are NOT handled. (NaN values are a different story and, depending on usage, can cause the entire column type to be mishandled.)
- Strings are mapped (a Storer maps them to the Object type, while a Table maps them to String types).
- Unicode is not handled.
- All other types are stored as Object in a Storer, or raise an Exception in a Table.
What this means is that if you do a put to a Storer (the fixed representation), all of the non-mappable types become Object, and PyTables pickles those columns (illustrated below; see the PyTables documentation for ObjectAtom).
Table will raise on an invalid type (I should provide a better error message here). I think I will also provide a warning if you try to store a type that is mapped to ObjectAtom (for performance reasons).
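As a rough illustration (a hypothetical frame, not from the original answer), you can spot the columns that a Storer put would pickle by looking for object dtype:
import datetime
import numpy as np
import pandas as pd
# hypothetical mixed-type frame
df = pd.DataFrame({
    'floats': np.random.randn(3),                             # float64 -> fixed c-type
    'strings': ['a', 'b', 'c'],                               # Object in a Storer, String in a Table
    'dates': [datetime.date(2013, 1, d) for d in (1, 2, 3)],  # datetime.date stays object
})
# object-dtype columns are the ones a Storer will pickle via ObjectAtom
print(df.dtypes)
print(df.dtypes[df.dtypes == object].index.tolist())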
To force some types, try some of these:
import pandas as pd
# convert None to NaN (it is currently Object);
# the result is float64 (or the type of the other objects)
x = pd.Series([None])
x = x.where(pd.notnull(x)).convert_objects()
# convert datetime-like values with embedded NaNs to datetime64[ns]
df['foo'] = pd.Series(df['foo'].values, dtype='M8[ns]')
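Putting those two fixes together, a minimal sketch of a pre-put cleanup pass might look like this (convert_objects matches the pandas 0.10-era API used throughout this answer; later pandas versions replace it with pd.to_numeric / pd.to_datetime):
import pandas as pd
def clean_for_put(df):
    # replace None with NaN, then let pandas re-infer tighter dtypes
    # so fewer columns fall back to Object/pickling
    return df.where(pd.notnull(df)).convert_objects()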
Here's a sample on 64-bit Linux (the file is 1M rows and about 1 GB in size on disk):
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.10.1.dev'
In [3]: import tables
In [4]: tables.__version__
Out[4]: '2.3.1'
In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
...: 1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])
In [5]: for x in range(20):
...: df['String%03d' % x] = 'string%03d' % x
In [6]: df
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 120 entries, E000 to String019
dtypes: float64(100), object(20)
# storer put (cannot query)
In [9]: def test_put():
...: store = pd.HDFStore('test_put.h5','w')
...: store['df'] = df
...: store.close()
In [10]: %timeit test_put()
1 loops, best of 3: 7.65 s per loop
# table put (can query)
In [7]: def test_put():
....: store = pd.HDFStore('test_put.h5','w')
....: store.put('df',df,table=True)
....: store.close()
In [8]: %timeit test_put()
1 loops, best of 3: 21.4 s per loop
Solution 2:
How to make this faster?
- Use 'io.sql.read_frame' to load the data from a SQL database into a DataFrame, because 'read_frame' takes care of columns whose type is 'decimal' by turning them into float.
- Fill in the missing data for each column.
- Call 'DataFrame.convert_objects' before the put operation.
- If the DataFrame has string-type columns, use 'table' instead of 'storer' (an end-to-end sketch follows the line below):
store.put('key', df, table=True)
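End to end, the steps above might look like the following sketch; 'io.sql.read_frame' is the pandas 0.10-era API named in step 1, and the connection, table name, and fill value are hypothetical:
import sqlite3
import pandas as pd
import pandas.io.sql as psql
conn = sqlite3.connect('mydata.db')                    # any DB-API connection
df = psql.read_frame('SELECT * FROM mytable', conn)    # step 1: decimal columns -> float
df = df.fillna(0)            # step 2: fill missing data (0 is an arbitrary choice)
df = df.convert_objects()    # step 3: re-infer dtypes before the put
store = pd.HDFStore('mydata.h5', 'w')
store.put('key', df, table=True)                       # step 4: 'table' for string columns
store.close()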
After these steps, the performance of the put operation improves significantly on the same data set:
CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
Wall time: 98.97 s
Profile logs of the second test:
95984 function calls (95958 primitive calls) in 68.688 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
    19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
    16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
    19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
     4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
    20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
     1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
     7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
    11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
     1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
    19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
     1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
  1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
     4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
     1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
     4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
    35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
     1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
     5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
    48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
     4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
     1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
    28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
    36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
  6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
     4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
     6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
    18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
 11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
    19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
  1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
 11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
     2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
     1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
     4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)