
Incremental PCA on Big Data

I just tried using the IncrementalPCA from sklearn.decomposition, but it threw a MemoryError just like the PCA and RandomizedPCA before. My problem is that the matrix I am trying to decompose is too big to fit into RAM; it has dimensions 1,000,000 × 1,000.

Solution 1:

Your program is probably failing when trying to load the entire dataset into RAM. 4 bytes per float32 × 1,000,000 × 1,000 elements is about 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size alone:

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
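
For reference, here is the arithmetic behind the 3.7 GiB figure (the shape simply mirrors your matrix):

import numpy as np

rows, cols = 1000000, 1000
n_bytes = rows * cols * np.dtype(np.float32).itemsize  # 4 bytes per float32 element
print(n_bytes / 2**30)  # ~3.73 GiB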

With h5py datasets, we should avoid passing the entire dataset to our methods, and instead pass slices of the dataset, one at a time.
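
In other words, assuming data is an h5py dataset handle (as created below), plain indexing reads only the requested rows from disk:

chunk = data[0:1000]  # only these 1,000 rows are read into memory, as a regular NumPy array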

As I don't have your data, let me start from creating a random dataset of the same size:

import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
# fill the dataset 1,000 rows at a time so the full array never sits in memory
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()

It creates a nice 3.8 GiB file.

Now, if we are on Linux, we can limit how much memory is available to our program:

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now if we try to run your code, we'll get the MemoryError. (press Ctrl-D to quit the new bash session and reset the limit later)
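
If you prefer to set the cap from inside the Python process instead (Linux only), the standard resource module can do something similar; a minimal sketch, noting that RLIMIT_AS caps total address space, which is stricter than ulimit -m:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))  # cap the process at ~2 GiB of address space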

Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']  # it's OK, the dataset is not fetched into memory yet

n = data.shape[0]  # how many rows we have in the dataset
chunk_size = 1000  # how many rows we feed to IPCA at a time; should divide n evenly
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
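
Once the model is fitted, the same chunked approach works for the projection step. A minimal sketch, reusing h5, data, n, chunk_size and ipca from above (the output file name is just an example):

out = h5py.File('rand-1Mx1K-reduced.h5', 'w')
reduced = out.create_dataset('data', shape=(n, 10), dtype=np.float32)
for i in range(0, n // chunk_size):
    sl = slice(i * chunk_size, (i + 1) * chunk_size)
    reduced[sl] = ipca.transform(data[sl])  # project one chunk and write it straight to disk
out.close()
h5.close()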

Solution 2:

One can use NumPy's memmap class, which lets you manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads into memory only the data it needs, when it needs it. Since IncrementalPCA processes one batch at a time, memory usage remains under control. Here is some sample code:

from sklearn.decomposition import IncrementalPCA
import numpy as np

# filename, m (rows), n (columns) and n_batches are placeholders for your own data
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=10, batch_size=batch_size)
inc_pca.fit(X_mm)
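
For completeness, here is a hedged sketch of how such a binary file might be produced in the first place; the file name, shape and random content are only examples:

import numpy as np

m, n = 1000000, 1000
filename = "data.dat"
X_out = np.memmap(filename, dtype="float32", mode="w+", shape=(m, n))  # mode="w+" creates the file on disk
for i in range(0, m, 10000):
    X_out[i:i+10000] = np.random.rand(10000, n).astype(np.float32)
X_out.flush()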
