From Driver To Executor

When resources for running the tasks are available, a TaskDescription containing the serialized task (its RDD and closure) is sent to the ExecutorBackend, which launches the task in the Executor.
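
The flow looks roughly like the following. This is a minimal sketch, not Spark's actual classes: the TaskDescription fields, the thread pool, and the Runnable body are simplified assumptions; in Spark the Runnable is a TaskRunner that deserializes the task, runs it, and reports the result back.

import java.nio.ByteBuffer
import java.util.concurrent.Executors

case class TaskDescription(taskId: Long, serializedTask: ByteBuffer)

class Executor {
  private val threadPool = Executors.newCachedThreadPool()

  def launchTask(desc: TaskDescription): Unit =
    threadPool.execute(new Runnable {
      // In Spark this Runnable is a TaskRunner: it deserializes the
      // task, runs it, and reports the result back to the driver.
      def run(): Unit = println(s"running task ${desc.taskId}")
    })
}

class ExecutorBackend(executor: Executor) {
  // Invoked when a LaunchTask message arrives from the driver.
  def receiveLaunchTask(desc: TaskDescription): Unit =
    executor.launchTask(desc)
}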

Task

There are two types of tasks: ShuffleMapTask and ResultTask. A sketch of the distinction follows below.
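
The following illustrative types are simplified assumptions, not Spark's real signatures (the real Task, MapStatus, and runTask carry far more context), but they show how the two task types differ in what they do with a partition's output:

abstract class Task[T] {
  def runTask(): T
}

case class MapStatus(shuffleOutputLocation: String)

class ShuffleMapTask(compute: () => Iterator[(String, Int)],
                     writeShuffle: Iterator[(String, Int)] => MapStatus)
    extends Task[MapStatus] {
  // Writes the partition's output to the shuffle system; only a
  // MapStatus describing where the blocks live goes back to the driver.
  def runTask(): MapStatus = writeShuffle(compute())
}

class ResultTask[U](compute: () => Iterator[String],
                    func: Iterator[String] => U)
    extends Task[U] {
  // Applies the action's function (e.g. toArray for collect) to the
  // partition iterator and returns the value itself.
  def runTask(): U = func(compute())
}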

ResultTask

Normally, there is an action on the last RDD of a ResultTask, such as RDD.count(). Every function passed to RDD transformations and actions is wrapped in an anonymous function that takes extra parameters, including an iterator over the previous RDD's output. We use an example to illustrate how it works.

Here we have the following Spark job:

sourceRDD.map(x => (x, 1)).collect()

We read the data from the source, apply a map, and collect the result.

When running the job (which here has only one task), each partition is computed in the following way:

sourceIter.map(x => (x, 1)).toArray

If we read data over a JDBC connection, JdbcRDD reads the data from the database. Each call to Iterator.next() may fetch some data from the database; that data is then processed by the map function, and finally the results are converted into an Array. This is the same lazy, pull-based behaviour as a plain Scala Iterator.
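
This pull-based evaluation can be demonstrated with plain Scala iterators. Here fetchRow is a hypothetical stand-in for a JDBC fetch:

def fetchRow(i: Int): String = { println(s"fetching row $i"); s"row-$i" }

// Nothing is read yet: iterators are lazy, so fetchRow only runs when
// an element is pulled through the chain.
val sourceIter: Iterator[String] = (1 to 3).iterator.map(fetchRow)

// toArray pulls the elements one by one: each row is fetched, mapped
// to (row, 1), and appended to the array in turn.
val result: Array[(String, Int)] = sourceIter.map(x => (x, 1)).toArray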

ShuffleMapTask

Like ResultTask, a ShuffleMapTask computes and processes data based on RDDs. Unlike ResultTask, the result data of a ShuffleMapTask is written to a ShuffleWriter rather than returned. Please refer to "How does Shuffle Work".
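
A simplified sketch of this write path follows; the ShuffleWriter trait here is a stand-in for Spark's, whose real API is obtained from a shuffle handle, takes the task context, and returns the MapStatus from stop:

trait ShuffleWriter[K, V] {
  def write(records: Iterator[(K, V)]): Unit
  def stop(success: Boolean): Unit
}

def runShuffleMapTask[K, V](partition: Iterator[(K, V)],
                            writer: ShuffleWriter[K, V]): Unit = {
  // Compute the partition through the RDD chain, then hand every
  // record to the shuffle writer instead of returning it to the driver.
  writer.write(partition)
  writer.stop(success = true) // in Spark, stopping yields the MapStatus
}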

Result

The result of the task run is returned to the TaskRunner and wrapped in a TaskResult. If its size is larger than the maximum direct result size (defined by spark.task.maxDirectResultSize), the result data is saved into the BlockManager and only a reference is sent back; otherwise, the result data is sent back to the Driver directly.
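
A sketch of that size check, with hypothetical helper types standing in for Spark's DirectTaskResult/IndirectTaskResult handling:

import java.nio.ByteBuffer

sealed trait SerializedResult
case class DirectResult(bytes: ByteBuffer) extends SerializedResult   // sent inline
case class IndirectResult(blockId: String) extends SerializedResult   // reference only

def packageResult(serialized: ByteBuffer,
                  maxDirectResultSize: Long, // spark.task.maxDirectResultSize
                  storeInBlockManager: ByteBuffer => String): SerializedResult =
  if (serialized.limit() > maxDirectResultSize)
    // Too large to send inline: store the bytes in the BlockManager and
    // send the driver a block id so it can fetch them later.
    IndirectResult(storeInBlockManager(serialized))
  else
    DirectResult(serialized)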
