Data Flow

Spark Worker

In Standalone mode, the Spark Worker is responsible for launching Drivers and Executors. When the worker initialises, it also starts the ExternalShuffleService, which serves shuffle data to consumers. After an executor is started on the worker, it registers itself with the ExternalShuffleService, so that tasks consuming data produced by tasks running on that executor know where to fetch it from.
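A minimal sketch of how an application might opt into this setup. The property names are standard Spark configuration keys; the master URL and port values are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable the external shuffle service so executors register their
// shuffle output with the service running inside the standalone Worker.
val conf = new SparkConf()
  .setAppName("shuffle-service-demo")
  .setMaster("spark://master-host:7077")        // assumed standalone master URL
  .set("spark.shuffle.service.enabled", "true") // executors register with the Worker's ExternalShuffleService
  .set("spark.shuffle.service.port", "7337")    // default service port, shown explicitly

val sc = new SparkContext(conf)
```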

There are two types of tasks. One is ResultTask, created by actions such as count or collect on an RDD, which computes a result from the data. The other is ShuffleMapTask, created by operations that redistribute data by key (for example keyBy followed by reduceByKey); it writes shuffle output that tasks on other nodes later fetch over the network.
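The small job below illustrates both task types. It is a sketch run in local mode for simplicity; the stage boundary introduced by reduceByKey produces ShuffleMapTasks, while the final collect produces ResultTasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: one job containing a shuffle stage and a result stage.
val sc = new SparkContext(new SparkConf().setAppName("task-types").setMaster("local[2]"))

val words  = sc.parallelize(Seq("spark", "shuffle", "spark", "worker"))
val keyed  = words.keyBy(identity)                   // pairs (word, word)
val counts = keyed.mapValues(_ => 1).reduceByKey(_ + _) // shuffle boundary: ShuffleMapTasks write map output
val result = counts.collect()                        // final stage: ResultTasks return data to the driver

result.foreach(println)
sc.stop()
```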

Generate Shuffle Data

During computation, the output records of a ShuffleMapTask are passed to a ShuffleWriter, created by the ShuffleManager, through its write interface. The records are fed into an ExternalSorter, which holds them in an in-memory buffer. When the buffer is full, it tries to acquire more memory from the TaskMemoryManager; if that fails, the buffered records are spilled to a temporary file. The data is sorted with TimSort.
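The following is a simplified, self-contained sketch of this buffer-then-spill pattern, not Spark's actual ExternalSorter API. The class name, the memory-request stub, and the serialization format are all invented for illustration.

```scala
import java.io.{File, FileOutputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Sketch of the spill pattern: accumulate records in memory; when the buffer
// is full and no extra memory can be granted, sort and spill to a temp file.
class NaiveSpillingSorter[K: Ordering, V](maxInMemory: Int) {
  private val buffer = ArrayBuffer.empty[(K, V)]
  private val spills = ArrayBuffer.empty[File]

  def insert(record: (K, V)): Unit = {
    buffer += record
    if (buffer.size >= maxInMemory && !tryAcquireMoreMemory()) spill()
  }

  // Stand-in for asking the TaskMemoryManager for more execution memory.
  private def tryAcquireMoreMemory(): Boolean = false

  private def spill(): Unit = {
    val sorted = buffer.sortBy(_._1)   // the real ExternalSorter sorts with TimSort
    val file   = File.createTempFile("spill", ".tmp")
    val out    = new ObjectOutputStream(new FileOutputStream(file))
    try sorted.foreach(r => out.writeObject(r)) finally out.close()
    spills += file
    buffer.clear()
  }

  def spillFiles: Seq[File] = spills.toSeq
}
```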

Read Shuffle Data

When a task computes a ShuffledRDD, the first step is to obtain a BlockStoreShuffleReader from the ShuffleManager. The reader fetches shuffle data from remote ExternalShuffleServices through the ShuffleClient. The underlying network transport is based on Netty.
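A few standard Spark properties influence how this fetch behaves over the Netty transport. The sketch below shows their default values for reference; they are examples of the available knobs, not tuning advice.

```scala
import org.apache.spark.SparkConf

// Sketch: read-side settings that affect shuffle block fetching.
val readConf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m") // max map output requested concurrently per reduce task
  .set("spark.shuffle.io.maxRetries", "3")     // retries for failed block fetches
  .set("spark.shuffle.io.retryWait", "5s")     // wait between fetch retries
  .set("spark.shuffle.compress", "true")       // compress map output blocks
```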

results matching ""

    No results matching ""