How To Make Pandas HDFStore 'put' Operation Faster
Solution 1:
I am pretty convinced your issue is related to how the actual dtypes in the DataFrame are mapped to the types PyTables stores.
- Simple types (floats/ints/bools) have a fixed representation and are mapped to fixed c-types.
- Datetimes are handled if they can be properly converted (i.e. they have a dtype of 'datetime64[ns]'); notably, datetime.date objects are NOT handled. (NaN values are a different story and, depending on usage, can cause the entire column type to be mishandled.)
- Strings are mapped (a Storer maps them to the Object type, while a Table maps them to String types).
- Unicode is not handled.
- All other types are stored as Object in a Storer, or raise an Exception in a Table.
What this means is that if you do a put to a Storer (the fixed representation), all of the non-mappable types become Object, and PyTables pickles those columns (illustrated below; see the PyTables documentation for ObjectAtom).
Table will raise on an invalid type (I should provide a better error message here). I think I will also provide a warning if you try to store a type that is mapped to ObjectAtom (for performance reasons).
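As a rough illustration (a hypothetical frame, not from the original answer), you can spot the columns that a Storer put would pickle by looking for object dtype:
import datetime
import numpy as np
import pandas as pd
# hypothetical mixed-type frame
df = pd.DataFrame({
    'floats': np.random.randn(3),                             # float64 -> fixed c-type
    'strings': ['a', 'b', 'c'],                               # Object in a Storer, String in a Table
    'dates': [datetime.date(2013, 1, d) for d in (1, 2, 3)],  # datetime.date stays object
})
# object-dtype columns are the ones a Storer will pickle via ObjectAtom
print(df.dtypes)
print(df.dtypes[df.dtypes == object].index.tolist())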
To force some types, try some of these:
import pandas as pd
# convert None to NaN (it is currently Object);
# the result is float64 (or the type of the other objects)
x = pd.Series([None])
x = x.where(pd.notnull(x)).convert_objects()
# convert datetime-like values with embedded NaNs to datetime64[ns]
df['foo'] = pd.Series(df['foo'].values, dtype='M8[ns]')
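Putting those two fixes together, a minimal sketch of a pre-put cleanup pass might look like this (convert_objects matches the pandas 0.10-era API used throughout this answer; later pandas versions replace it with pd.to_numeric / pd.to_datetime):
import pandas as pd
def clean_for_put(df):
    # replace None with NaN, then let pandas re-infer tighter dtypes
    # so fewer columns fall back to Object/pickling
    return df.where(pd.notnull(df)).convert_objects()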
Here's a sample on 64-bit Linux (the file is 1M rows and about 1 GB in size on disk):
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.10.1.dev'
In [3]: import tables
In [4]: tables.__version__
Out[4]: '2.3.1'
In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
...: 1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])
In [5]: for x in range(20):
...: df['String%03d' % x] = 'string%03d' % x
In [6]: df
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 120 entries, E000 to String019
dtypes: float64(100), object(20)
# storer put (cannot query)
In [9]: def test_put():
...: store = pd.HDFStore('test_put.h5','w')
...: store['df'] = df
...: store.close()
In [10]: %timeit test_put()
1 loops, best of 3: 7.65 s per loop
# table put (can query)
In [7]: def test_put():
....: store = pd.HDFStore('test_put.h5','w')
....: store.put('df',df,table=True)
....: store.close()
In [8]: %timeit test_put()
1 loops, best of 3: 21.4 s per loop
Solution 2:
How to make this faster?
- Use 'io.sql.read_frame' to load the data from a SQL database into a DataFrame, because 'read_frame' takes care of columns whose type is 'decimal' by turning them into float.
- Fill in the missing data for each column.
- Call 'DataFrame.convert_objects' before the put operation.
- If the DataFrame has string-type columns, use 'table' instead of 'storer' (an end-to-end sketch follows the line below):
store.put('key', df, table=True)
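End to end, the steps above might look like the following sketch; 'io.sql.read_frame' is the pandas 0.10-era API named in step 1, and the connection, table name, and fill value are hypothetical:
import sqlite3
import pandas as pd
import pandas.io.sql as psql
conn = sqlite3.connect('mydata.db')                    # any DB-API connection
df = psql.read_frame('SELECT * FROM mytable', conn)    # step 1: decimal columns -> float
df = df.fillna(0)            # step 2: fill missing data (0 is an arbitrary choice)
df = df.convert_objects()    # step 3: re-infer dtypes before the put
store = pd.HDFStore('mydata.h5', 'w')
store.put('key', df, table=True)                       # step 4: 'table' for string columns
store.close()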
After these steps, the performance of the put operation improves significantly on the same data set:
CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
Wall time: 98.97 s
Profile logs of the second test:
95984 function calls (95958 primitive calls) in 68.688 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
    19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
    16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
    19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
     4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
    20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
     1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
     7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
    11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
     1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
    19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
     1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
  1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
     4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
     1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
     4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
    35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
     1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
     5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
    48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
     4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
     1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
    28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
    36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
  6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
     4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
     6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
    18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
 11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
    19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
  1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
 11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
     2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
     1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
     4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)