
How Can I Shuffle A Very Large List Stored In A File In Python?

I need to deterministically generate a randomized list containing the numbers from 0 to 2^32-1. This would be the naive (and totally nonfunctional) way of doing it, just so it's clear what I'm going for.

Solution 1:

Computing all the values up front seems impractical, since Crypto computes a random integer in about a millisecond, so the whole job would take days.

Here is an implementation of the Knuth (Fisher-Yates) shuffle as a generator:

from Crypto.Random.random import randint
import numpy as np


def onthefly(n):
    # Fisher-Yates shuffle, yielding each element as soon as its final position is fixed
    numbers = np.arange(n, dtype=np.uint32)
    for i in range(n):
        j = randint(i, n - 1)
        numbers[i], numbers[j] = numbers[j], numbers[i]
        yield numbers[i]

For n=10:

gen = onthefly(10)
print([next(gen) for i in range(9)])
print(next(gen))
# [9, 0, 2, 6, 4, 8, 7, 3, 1]
# 5

For n=2**32, the generator takes about a minute to initialize (and the uint32 array itself needs roughly 16 GB of RAM), but each call is O(1).

Solution 2:

If you have a continuous range of numbers, you don't need to store them at all. It is easy to devise a bidirectional mapping between the value in a shuffled list and its position in that list. The idea is to use a pseudo-random permutation and this is exactly what block ciphers provide.

The trick is to find a block cipher that matches exactly your requirement of 32-bit integers. There are very few such block ciphers, but the Simon and Speck ciphers (released by the NSA) are parameterisable and support a block size of 32 bits (block sizes are usually much larger).

This library seems to provide an implementation of that. We can devise the following functions:

# Assuming a SpeckCipher class such as the one in the `simonspeckciphers`
# package; adjust the import to whichever Speck implementation you use.
from speck import SpeckCipher

def get_value_from_index(key, i):
    # Encrypting the index with a 32-bit block cipher yields the value at that position
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.encrypt(i)

def get_index_from_value(key, val):
    # Decrypting the value recovers its position in the shuffled sequence
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.decrypt(val)

The library works with Python's big integers, so you might not even need to encode them.
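For illustration, a quick round-trip check (the key here is just the example value mentioned below; any randomly chosen 64-bit key works):

key = 0x123456789ABCDEF0               # example 64-bit key; pick yours at random
i = 12345
val = get_value_from_index(key, i)     # the value at position i of the "shuffled list"
assert get_index_from_value(key, val) == i
assert 0 <= val < 2**32                # a 32-bit block cipher permutes 0 .. 2**32-1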

A 64-bit key (for example 0x123456789ABCDEF0) is not much. You could use a construction similar to the one that extended DES to Triple DES in order to increase the effective key size. Keep in mind that keys should be chosen randomly, and they have to stay constant if you want determinism.
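A minimal sketch of such an encrypt-decrypt-encrypt (EDE) construction, assuming the same SpeckCipher API as above; composing permutations still gives a permutation of the 32-bit range:

def triple_get_value_from_index(keys, i):
    # EDE in the style of Triple DES: encrypt with k1, decrypt with k2, encrypt with k3
    k1, k2, k3 = keys
    c1 = SpeckCipher(k1, mode='ECB', key_size=64, block_size=32)
    c2 = SpeckCipher(k2, mode='ECB', key_size=64, block_size=32)
    c3 = SpeckCipher(k3, mode='ECB', key_size=64, block_size=32)
    return c3.encrypt(c2.decrypt(c1.encrypt(i)))

def triple_get_index_from_value(keys, val):
    # Inverse: undo the three layers in reverse order
    k1, k2, k3 = keys
    c1 = SpeckCipher(k1, mode='ECB', key_size=64, block_size=32)
    c2 = SpeckCipher(k2, mode='ECB', key_size=64, block_size=32)
    c3 = SpeckCipher(k3, mode='ECB', key_size=64, block_size=32)
    return c1.decrypt(c2.encrypt(c3.decrypt(val)))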


If you don't want to use an algorithm by the NSA for that, I would understand. There are others, but I can't find them now. The Hasty Pudding cipher is even more flexible, but I don't know if there is an implementation of that for Python.

Solution 3:

The class I created uses a bitarray to keep track of which numbers have already been used. With the comments, I think the code is pretty self-explanatory.

import bitarray
import random


class UniqueRandom:
    def __init__(self):
        """ Init boolean array of used numbers and set all to False """
        self.used = bitarray.bitarray(2**32)
        self.used.setall(False)

    def draw(self):
        """ Draw a previously unused number.
        Return False if no free numbers are left.
        """
        # Check if there are numbers left to use; return False if none are left
        if self._free() == 0:
            return False
        # Draw a random index
        i = random.randint(0, 2**32 - 1)
        # Skip ahead from the random index to an undrawn number
        while self.used[i]:
            i = (i + 1) % 2**32
        # Update used array and return the selected number
        self.used[i] = True
        return i

    def _free(self):
        """ Check how many places are unused """
        return self.used.count(False)


def main():
    r = UniqueRandom()
    for _ in range(20):
        print(r.draw())


if __name__ == '__main__':
    main()

Design considerations

While Garrigan Stafford's answer is perfectly fine, the memory footprint of this solution is much smaller (the 2^32-bit array takes roughly 512 MB). Another difference between our answers is that Garrigan's algorithm takes more time to generate a random number as the amount of generated numbers increases (because it keeps retrying until an unused number is found). This algorithm just looks up the next unused number if a certain number is already used. This makes the time it takes to draw a number practically the same every time, regardless of how far the pool of free numbers is exhausted.

Solution 4:

Here is a permutation RNG I wrote which uses the fact that squaring a number mod a prime (plus some intricacies) gives a pseudorandom permutation.

https://github.com/pytorch/pytorch/blob/09b4f4f2ff88088306ecedf1bbe23d8aac2d3f75/torch/utils/data/_utils/index_utils.py

Short version:

from math import floor, sqrt


def _is_prime(n):
    if n == 2:
        return True
    if n == 1 or n % 2 == 0:
        return False
    for d in range(3, floor(sqrt(n)) + 1, 2):  # can use isqrt in Python 3.8
        if n % d == 0:
            return False
    return True


class Permutation:  # the full version subclasses a Range helper
    """
    Generates a random permutation of integers from 0 up to size.
    Inspired by https://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/
    """

    size: int
    prime: int
    seed: int

    def __init__(self, size: int, seed: int):
        self.size = size
        self.prime = self._get_prime(size)
        self.seed = seed % self.prime

    def __getitem__(self, index):
        x = self._map(index)

        while x >= self.size:
            # If we map to a number greater than size, then the cycle of successive
            # mappings must eventually result in a number less than size.
            # Proof: The cycle of successive mappings traces a path that either always
            # stays in the set n >= size or it enters and leaves it, else the 1:1 mapping
            # would be violated (two numbers would map to the same number).
            # Moreover, `set(range(size)) - set(map(n) for n in range(size) if map(n) < size)`
            # equals `set(map(n) for n in range(size, prime) if map(n) < size)`
            # because the total mapping is exhaustive.
            # Which means we'll arrive at a number that wasn't mapped to by any other valid index.
            # This will take at most `prime - size` steps, and `prime - size` is on the
            # order of log(size), so fast. But usually we just need to remap once.
            x = self._map(x)

        return x

    @staticmethod
    def _get_prime(size):
        """
        Returns the prime number >= size which has the form (4n-1)
        """
        n = size + (3 - size % 4)
        while not _is_prime(n):
            # We expect to find a prime after O(log(size)) iterations.
            # Using a brute-force primehood test, total complexity is
            # O(log(size)*sqrt(size)), which is pretty good.
            n = n + 4
        return n

    def _map(self, index):
        a = self._permute_qpr(index)
        b = (a + self.seed) % self.prime
        c = self._permute_qpr(b)
        return c

    def _permute_qpr(self, x):
        # For a prime p with p % 4 == 3, x -> x^2 mod p (negated for the upper half
        # of the range) is a permutation of 0..p-1 (quadratic residue trick).
        residue = pow(x, 2, self.prime)

        if x * 2 < self.prime:
            return residue
        else:
            return self.prime - residue
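A quick illustrative check with a small size, using the names defined in the snippet above:

p = Permutation(10, seed=42)
print([p[i] for i in range(10)])                           # a shuffled ordering of 0..9
print(sorted(p[i] for i in range(10)) == list(range(10)))  # True: it is a permutation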

Solution 5:

So one way is to keep track of which numbers you have already given out and keep handing out new random numbers one at a time. Consider:

import random
random.seed(0)


class RandomDeck:
    def __init__(self):
        self.usedNumbers = set()

    def draw(self):
        # Keep drawing until we hit a number we haven't handed out yet
        number = random.randint(0, 2**32 - 1)
        while number in self.usedNumbers:
            number = random.randint(0, 2**32 - 1)
        self.usedNumbers.add(number)
        return number

    def shuffle(self):
        # Forget everything handed out so far
        self.usedNumbers = set()

As you can see, we essentially have a deck of random numbers between 0 and 2^32-1, but we only store the numbers we have already given out, to ensure we don't repeat any. Then you can re-shuffle the deck by forgetting all the numbers you have already given out.
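For example (purely illustrative):

deck = RandomDeck()
print([deck.draw() for _ in range(5)])  # five distinct numbers in 0 .. 2**32-1
deck.shuffle()                          # forget the history and start drawing afresh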

This should be efficient in most use cases as long as you don't draw ~1 million numbers without a reshuffle.
