Incremental PCA on Big Data
Solution 1:
Your program is probably failing because it tries to load the entire dataset into RAM: 4 bytes per float32 × 1,000,000 × 1,000 is about 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size alone:
>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)
If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
With h5py datasets, we just need to avoid passing the entire dataset to our methods and instead pass slices of it, one at a time.
As I don't have your data, let me start by creating a random dataset of the same size:
import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000,1000), dtype=np.float32)
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()
It creates a nice 3.8 GiB file.
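If you want to double-check the result without loading anything into memory, a quick sanity check might look like this (a minimal sketch, not part of the original answer):
import h5py

# Open the file read-only and inspect the dataset's metadata;
# nothing is loaded into RAM at this point.
with h5py.File('rand-1Mx1K.h5', 'r') as h5:
    dset = h5['data']
    print(dset.shape, dset.dtype)                     # (1000000, 1000) float32
    print(dset.shape[0] * dset.shape[1] * 4 / 2**30)  # ~3.7 GiB of raw data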
Now, if we are on Linux, we can limit how much memory is available to our program:
$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152
Now if we try to run your code, we'll get the MemoryError. (press Ctrl-D to quit the new bash session and reset the limit later)
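If you prefer to stay inside Python, the standard resource module can impose a similar cap (a sketch, assuming a Unix-like OS; not part of the original answer):
import resource

# Cap the process's address space at 2 GiB; allocations beyond that
# will raise MemoryError instead of swapping.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))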
Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA
h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet
n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000  # how many rows we feed to IPCA at a time; should divide n evenly
ipca = IncrementalPCA(n_components=10, batch_size=16)
for i in range(0, n // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
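Once the model is fitted, the same chunking pattern can be used to project the data into the reduced space; a minimal sketch along the same lines (the original answer only shows the fitting step):
# The reduced output is small: 1,000,000 × 10 float32 values, about 40 MB.
reduced = np.empty((n, 10), dtype=np.float32)
for i in range(0, n // chunk_size):
    sl = slice(i * chunk_size, (i + 1) * chunk_size)
    reduced[sl] = ipca.transform(data[sl])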
Solution 2:
One can use NumPy's memmap class, which allows manipulating a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs, when it needs it. Since IncrementalPCA processes the data in batches, memory usage remains under control at any given time. Here is a sample code:
from sklearn.decomposition import IncrementalPCA
import numpy as np
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=10, batch_size=batch_size)
inc_pca.fit(X_mm)
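Note that filename, m, n and n_batches are placeholders for your own data. If the data currently lives in an HDF5 file as in Solution 1, one way to produce a raw binary file that np.memmap can read is to dump it chunk by chunk first (a hedged sketch; the file name 'data.bin' is illustrative):
import h5py
import numpy as np

with h5py.File('rand-1Mx1K.h5', 'r') as h5:
    dset = h5['data']
    m, n = dset.shape
    # mode='w+' creates the binary file on disk; only one chunk is in RAM at a time.
    mm = np.memmap('data.bin', dtype='float32', mode='w+', shape=(m, n))
    for i in range(0, m, 10000):
        mm[i:i+10000] = dset[i:i+10000]
    mm.flush()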