
How To Get Subkeys To Iterate Over And Eventually The Files Inside Them In Aws S3

I have an AWS S3 key path bucket-name/fo1/fo2/fo3 that has sub-paths bucket-name/fo1/fo2/fo3/fo_1, bucket-name/fo1/fo2/fo3/fo_2, bucket-name/fo1/fo2/fo3/fo_3 and so on. I want to iterate over those sub-paths and eventually the files inside them.

Solution 1:

The first thing to understand about Amazon S3 is that folders do not exist. Rather, objects are stored with their full path as their Key (filename).

For example, I could copy a file to a bucket using the AWS Command-Line Interface (CLI):

aws s3 cp foo.txt s3://my-bucket/fo1/fo2/fo3/foo.txt

This would work even though the folders do not exist.

To make things convenient for humans, there is a "pretend" set of folders that are provided via the concept of a common prefix. Thus, in the management console, the folders would appear to be there. However, if the object was then deleted with:

aws s3 rm s3://my-bucket/fo1/fo2/fo3/foo.txt

The result is that the folders would immediately disappear because they never actually existed!

Also for convenience, some Amazon S3 commands allow you to specify a Prefix and Delimiter. This can be used to, for example, only list objects in the fo3 folder. What it is really doing is merely listing the objects that have a Key that starts with fo1/fo2/fo3/. When the Key for the object is returned, it will always have the full path to the object, because the Key actually is the full path. (There is no concept of a filename separate from the complete Key.)
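As a rough illustration of what Prefix and Delimiter do on the server side, the grouping can be simulated in plain Python (the keys below are hypothetical stand-ins matching the question's layout):

```python
# Simulated flat listing -- what S3 actually stores (hypothetical keys)
keys = [
    'fo1/fo2/fo3/fo_1/a.txt',
    'fo1/fo2/fo3/fo_1/b.txt',
    'fo1/fo2/fo3/fo_2/c.txt',
    'fo1/fo2/fo3/top.txt',
]

prefix = 'fo1/fo2/fo3/'
delimiter = '/'

common_prefixes = set()
contents = []
for key in keys:
    remainder = key[len(prefix):]
    if delimiter in remainder:
        # everything up to the first delimiter after the prefix
        # is rolled up into a single CommonPrefix
        common_prefixes.add(prefix + remainder.split(delimiter, 1)[0] + delimiter)
    else:
        contents.append(key)

print(sorted(common_prefixes))  # ['fo1/fo2/fo3/fo_1/', 'fo1/fo2/fo3/fo_2/']
print(contents)                 # ['fo1/fo2/fo3/top.txt']
```

This is only a sketch of the behaviour: the real work happens inside S3, which returns the rolled-up prefixes in the CommonPrefixes element of the response.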

So, if you want a listing of all files in fo1 and fo2 and fo3, you can do a listing with a Prefix of fo1 and receive back all objects that start with fo1/, but this will include objects in sub-folders since they all have a prefix of fo1/.

Bottom line: Rather than thinking of old-fashioned directories, think of Amazon S3 as a flat storage structure. If necessary, you can do filtering of results in your own code.
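For example, to iterate over the "sub-folders" and then the files inside them (as the question asks), one client-side approach is to group the flat key listing by the first path component after the prefix. The keys here are hypothetical stand-ins for a real listing result:

```python
from collections import defaultdict

# Hypothetical keys standing in for a real list_objects_v2 result
keys = [
    'fo1/fo2/fo3/fo_1/a.txt',
    'fo1/fo2/fo3/fo_1/b.txt',
    'fo1/fo2/fo3/fo_2/c.txt',
]

prefix = 'fo1/fo2/fo3/'

files_by_subfolder = defaultdict(list)
for key in keys:
    remainder = key[len(prefix):]
    subfolder, _, filename = remainder.partition('/')
    if filename:  # the key sits inside a sub-folder
        files_by_subfolder[subfolder].append(filename)

for subfolder, files in sorted(files_by_subfolder.items()):
    print(subfolder, files)
# fo_1 ['a.txt', 'b.txt']
# fo_2 ['c.txt']
```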

Solution 2:

You should examine the value returned by the list_objects_v2() call to understand the data that is being returned.

  • If a Prefix and a Delimiter have been specified, it returns only the objects directly under that prefix; any sub-directories are returned as CommonPrefixes.
  • If no Prefix is provided, all objects in the bucket are returned. You can then filter them yourself in code, as shown below.
import boto3

s3_client = boto3.client('s3', region_name='ap-southeast-2')
s3_bucket = 'my-bucket'
prefix = 'fo1/fo2/fo3/'

# list_objects_v2 returns at most 1000 keys per call
response = s3_client.list_objects_v2(Bucket=s3_bucket)
for obj in response['Contents']:
    if obj['Key'].startswith(prefix):
        print(obj['Key'])

Solution 3:

The following piece of code may be of use to you. I expanded a bit on John's answer as I was looking for something similar. It basically recreates the os.walk() behaviour, which you may be more familiar with.

import boto3

# function to replicate os.walk behavior
def s3walk(locations, prefix):

    # recursively add location to roots, starting from prefix
    def processLocation(root, prefixLocal, location):
        # add new root location if not available
        if prefixLocal not in root:
            root[prefixLocal] = (set(), set())
        # check how many folders are available after the prefix
        remainder = location[len(prefixLocal) + 1:]
        structure = remainder.split('/')
        # if we are not yet in the folder of the file,
        # continue with a larger prefix
        if len(structure) > 1:
            # add folder dir
            root[prefixLocal][0].add(structure[0])
            # make sure the file is added along the way
            processLocation(root, prefixLocal + '/' + structure[0], location)
        else:
            # add to files
            root[prefixLocal][1].add(structure[0])

    root = {}
    for location in locations:
        processLocation(root, prefix, location)

    return root.items()


if __name__ == "__main__":
    s3_client = boto3.client('s3', region_name='eu-west-3')
    s3_bucket = 'bucket-name'
    prefix = 'fo1/fo2/fo3'
    # get list of objects with prefix
    response = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=prefix)
    # retrieve key values
    locations = [obj['Key'] for obj in response['Contents']]

    for root, (subdir, files) in s3walk(locations, prefix):
        print(root, subdir, files)
