Avoid Recomputing Size Of All Cloud Storage Files In Beam Python SDK
I'm working on a pipeline that reads ~5 million files from a Google Cloud Storage (GCS) directory. I have it configured to run on Google Cloud Dataflow. The problem is that when I launch the pipeline, Beam first computes the size of every one of those files, which takes a very long time before any actual reading begins.
Solution 1:
Thanks for reporting this. Beam has two transforms for reading text: ReadFromText and ReadAllFromText. ReadFromText will run into this issue, but ReadAllFromText shouldn't.
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L438
The downside of ReadAllFromText is that it won't perform dynamic work rebalancing, but this should not be an issue when reading a large number of files.
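For reference, here is a minimal sketch of the ReadAllFromText approach (the bucket path and step labels are placeholders, not from the original question): instead of passing the glob to ReadFromText, emit the file pattern as an element of a PCollection and let ReadAllFromText expand and read it.

```python
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    lines = (
        p
        # 'gs://my-bucket/my-dir/*' is a placeholder pattern. With
        # ReadAllFromText the pattern is expanded inside the transform,
        # so there is no upfront size estimation over every matched file.
        | 'FilePattern' >> beam.Create(['gs://my-bucket/my-dir/*'])
        | 'ReadAll' >> ReadAllFromText()
    )
```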
I've created https://issues.apache.org/jira/browse/BEAM-9620 to track issues with ReadFromText (and file-based sources in general).