ResourceExhaustedError: OOM When Allocating Tensor With Shape []
Solution 1:
The problem was caused by this line in the training loop:
while s + batch_size < ran:
    # ...
    batch_xs1 = tf.nn.embedding_lookup(embedding_matrix, batch_id)
Calling tf.nn.embedding_lookup() adds nodes to the TensorFlow graph. Because graph nodes are never garbage collected, calling it inside a loop causes a memory leak.
The actual source of the leak is probably the embedding_matrix NumPy array passed as an argument to tf.nn.embedding_lookup(). TensorFlow tries to be helpful and converts any NumPy array in a function's arguments into a tf.constant() node in the graph. Inside a loop, this creates a separate copy of embedding_matrix on every iteration, each of which is copied into the graph and then onto scarce GPU memory.
The simplest solution is to move the tf.nn.embedding_lookup() call outside the training loop. For example:
def while_loop(s, e, step):
    # Build the placeholder and the lookup op once, before the loop starts.
    batch_id_placeholder = tf.placeholder(tf.int32)
    batch_xs1 = tf.nn.embedding_lookup(embedding_matrix, batch_id_placeholder)

    while s + batch_size < ran:
        batch_id = file_id[s:e]
        batch_col = label_matrix[s:e]
        batch_label = csc_matrix((data, (batch_row, batch_col)),
                                 shape=(batch_size, n_classes))
        batch_label = batch_label.toarray()
        # Only this sess.run() executes per iteration; no new graph nodes are added.
        batch_xs = sess.run(batch_xs1,
                            feed_dict={batch_id_placeholder: batch_id})
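With this structure the placeholder and lookup op are created once; each iteration only runs sess.run() with a new feed_dict, so the graph stops growing and GPU memory stays bounded.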
Solution 2:
I recently had this problem with TF + Keras, and previously with Darknet with YOLO v3. My dataset contained images that were too large for the memory of my two GTX 1050s, so I had to resize them to something smaller. On average, a 1024x1024 image needs 6 GB per GPU.
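A minimal sketch of one way to do the resizing at load time, assuming TF 2.x-style tf.data and tf.image APIs; the file paths and the 512x512 target size are made-up examples, not values from the original setup:

import tensorflow as tf

image_paths = ["images/sample_0.jpg", "images/sample_1.jpg"]  # hypothetical paths

def load_and_resize(path, target_size=(512, 512)):
    # Decode, then downscale before batching so large source images
    # never reach the GPU at full resolution.
    image = tf.image.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, target_size)
    return image / 255.0

dataset = (tf.data.Dataset.from_tensor_slices(image_paths)
           .map(load_and_resize)
           .batch(8))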