
Why Is The Combine Function Called Three Times?

I'm trying to understand the Combine transform in an Apache Beam pipeline. Consider the following example pipeline: def test_combine(data): logging.info('test combine')

Solution 1:

It looks like this happens because of the MapReduce-style execution. When using combiners, the output of one combiner call can be used as an input to another.

As an example, imagine summing three numbers (1, 2, 3). The combiner MAY first sum 1 and 2 (giving 3) and then use that result as an input together with 3 (3 + 3 = 6). In your case, [1, 2, 3] seems to be used as an input to the next combiner call.
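To make the idea concrete, here is a minimal pure-Python sketch of staged combining, with no Beam involved. The helper `combine_in_stages` and its `chunk_size` parameter are hypothetical and exist only for illustration; a real runner decides for itself how inputs are grouped.

```python
def combine_in_stages(combine_fn, values, chunk_size=2):
    """Hypothetical helper: combine chunks first, then combine the partial results."""
    calls = []  # every input the combine function was called with

    def traced(xs):
        xs = list(xs)
        calls.append(xs)
        return combine_fn(xs)

    # Stage 1: reduce each chunk to a partial result.
    partials = [traced(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
    # Stage 2: the partial results become the input of one more call.
    result = traced(partials)
    return result, calls


total, seen = combine_in_stages(sum, [1, 2, 3])
print(total)  # 6
print(seen)   # [[1, 2], [3], [3, 3]]
```

With this particular grouping the function runs three times for a single logical sum, and the last call only sees partial results. This works for sum because addition is associative and commutative, so the grouping does not change the answer.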

An example that really helped me understand this:

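import apache_beam as beam

p = beam.Pipeline()

def make_list(elements):
    # Print whatever this call receives: it may contain raw inputs
    # as well as values returned by earlier calls.
    print(elements)
    return elements

(p | beam.Create(range(30))
   | beam.CombineGlobally(make_list))

p.run()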

Notice in the printed output that the list [1, ..., 10] shows up as a single element in the input to the next combiner call.
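The nesting can be reproduced without Beam. The sketch below is a simplified two-stage simulation, not Beam's actual scheduling; `run_two_stages` and `flatten_to_list` are hypothetical names introduced only for this illustration.

```python
def make_list(elements):
    # Same idea as above: return the inputs unchanged, as a list.
    return list(elements)


def flatten_to_list(elements):
    """Hypothetical alternative that unpacks lists returned by earlier calls."""
    out = []
    for e in elements:
        if isinstance(e, list):
            out.extend(e)  # a partial result from an earlier call
        else:
            out.append(e)  # a raw input element
    return out


def run_two_stages(fn, values, chunk_size):
    # Stage 1: call fn on each chunk; stage 2: call fn on the partial results.
    partials = [fn(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
    return fn(partials)


nested = run_two_stages(make_list, list(range(6)), 3)
flat = run_two_stages(flatten_to_list, list(range(6)), 3)
print(nested)  # [[0, 1, 2], [3, 4, 5]]
print(flat)    # [0, 1, 2, 3, 4, 5]
```

A function passed to a combiner should therefore give the same result whether it sees raw elements or its own earlier outputs. Note that `flatten_to_list` would break if the raw elements were themselves lists; for collecting a PCollection into a list, Beam provides `beam.combiners.ToList()`.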
