
Avoid Recomputing Size of All Cloud Storage Files in Beam Python SDK

I'm working on a pipeline that reads ~5 million files from a Google Cloud Storage (GCS) directory, configured to run on Google Cloud Dataflow. The problem is that when I start the pipeline, Beam computes the size of all of the files up front, which takes a very long time with this many files.

Solution 1:

Thanks for reporting this. Beam has two transforms for reading text: ReadFromText and ReadAllFromText. ReadFromText will run into this issue, but ReadAllFromText shouldn't.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L438

The downside of ReadAllFromText is that it won't perform dynamic work rebalancing, but this shouldn't be an issue when reading a large number of files.
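A minimal sketch of the workaround, assuming a hypothetical bucket path: instead of handing the glob directly to ReadFromText (whose initial split estimates the size of every matched file), put the pattern into a PCollection and pass it through ReadAllFromText, which expands and reads the files as regular pipeline work.

```python
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    lines = (
        p
        # The file pattern is just an element here; "gs://my-bucket/my-dir/*"
        # is a placeholder path, not from the original question.
        | "FilePattern" >> beam.Create(["gs://my-bucket/my-dir/*"])
        # ReadAllFromText expands the pattern and reads each matched file,
        # avoiding the up-front size estimation that ReadFromText performs.
        | "ReadAll" >> ReadAllFromText()
        | "CountLines" >> beam.combiners.Count.Globally()
        | "Print" >> beam.Map(print)
    )
```

The trade-off noted above applies: because the reads happen inside the transform rather than in a splittable bounded source, the runner can't dynamically rebalance work within a single file, which matters little when the input is millions of small files.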

I created https://issues.apache.org/jira/browse/BEAM-9620 to track this issue with ReadFromText (and file-based sources in general).
