Common Operations in RDD
Transformations
- persist: mark the current RDD as persistent; it is registered in the map of cached RDDs kept by SparkContext
- getNarrowAncestors: get all ancestors of the current RDD that are reachable through narrow dependencies
- map/flatMap/filter: create a MapPartitionsRDD that applies the function to the elements of the current RDD
- distinct: remove duplicate elements from the current RDD and return a new RDD
- repartition: change the number of partitions of the current RDD
- coalesce: re-partition the current RDD, typically to reduce the number of partitions; can optionally shuffle
- sample: sample a fraction of the data from the current RDD, with or without replacement
- randomSplit: split the current RDD into several RDDs according to the supplied weights
- union: return the union of this RDD and another RDD (duplicates are kept)
- sortBy: sort the elements of the RDD by the given key function
- intersection: return an RDD containing only the elements present in both the current RDD and another RDD
- cartesian: return the Cartesian product of this RDD and another RDD, as an RDD of pairs
- groupBy: group the elements of the RDD by the key computed by the given function
- pipe: pipe each partition's elements through an external command line
- mapPartitions: apply the function to each partition of the current RDD
- zip: zip two RDDs together; both must have the same number of partitions and the same number of elements in each partition
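A minimal sketch of a few of these transformations, assuming an existing SparkContext `sc` (every call returns a new RDD lazily; nothing is computed until an action runs):

```scala
val nums = sc.parallelize(Seq(1, 2, 2, 3, 4), numSlices = 2)

nums.distinct()        // remove the duplicate 2
nums.map(_ * 10)       // wraps the function in a MapPartitionsRDD
nums.union(nums)       // UnionRDD; duplicates are kept
nums.groupBy(_ % 2)    // group by parity: key 0 -> evens, key 1 -> odds
nums.coalesce(1)       // CoalescedRDD with a single partition
nums.sortBy(x => -x)   // sort descending via a key function
```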
Actions
- foreach: apply the function to each element of the RDD (runs on the executors)
- collect: return all elements of the RDD to the driver as an array
- subtract: return an RDD of the elements that are in the first RDD but not in the second (note: this actually returns an RDD, so it is a transformation rather than an action)
- reduce: combine all elements of the RDD with an associative, commutative binary function
- fold: like reduce, but starts from a zero value
- aggregate: fold into a result of a different type, using one function within each partition and another to merge partition results
- count: return the number of elements in the RDD
- countApproxDistinct: count the approximate number of distinct elements
- take/first/top: return the first n elements / the first element / the largest n elements
- max/min: return the maximum/minimum element
- saveAsTextFile: save the RDD as text files
- keyBy: create a pair RDD by applying a key function to each element (also a transformation rather than an action)
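The actions above can be sketched as follows, again assuming a SparkContext `sc`; each call eagerly triggers a job and returns a result to the driver:

```scala
val nums = sc.parallelize(1 to 5)

nums.count()                     // 5
nums.reduce(_ + _)               // 15
nums.fold(0)(_ + _)              // 15, same as reduce but with an explicit zero value
nums.aggregate(0)(_ + _, _ + _)  // 15: seqOp within partitions, combOp across them
nums.take(2)                     // Array(1, 2)
nums.max()                       // 5
```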
MapPartitionsRDD
For transformations like map/flatMap, a narrow dependency on the current RDD is generated. The transformation is wrapped in a MapPartitionsRDD. E.g. invoking rdd.map(x => (x, 1)) creates a new MapPartitionsRDD, call it mapRDD; mapRDD depends on rdd through a OneToOneDependency.
When an action is triggered, the DAGScheduler computes each partition of the final RDD. If a partition is not cached, the RDD's compute method finds the corresponding partitions of its parent RDDs and computes them first. This recursion continues until it reaches an RDD whose data is directly available (e.g. an input RDD). E.g. for sc.textFile().map().filter().count(), computation starts from the RDD created by sc.textFile(); once a partition of this RDD is computed, an iterator over it is passed to the MapPartitionsRDD wrapping the map() transformation, then to the one wrapping filter(). When count() finishes, the result is returned to the driver. A chain of narrow dependencies is computed within the same task, on the same executor.
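The lineage described above can be observed directly, assuming a SparkContext `sc` (the file path is illustrative only):

```scala
val rdd    = sc.textFile("input.txt")   // input RDD, the root of the lineage
val mapRDD = rdd.map(x => (x, 1))       // new MapPartitionsRDD; OneToOneDependency on rdd
val result = mapRDD.filter(_._2 > 0)    // another MapPartitionsRDD stacked on top

println(result.toDebugString)           // prints the chain of narrow dependencies
result.count()                          // action: triggers the recursive computation
```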
CoalescedRDD
PartitionwiseSampledRDD
UnionRDD
PipedRDD
Uses ProcessBuilder to execute the command line and redirects its output to an iterator. (Each RDD must override the compute() method, which returns an iterator over all records of the resulting partition.)
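A rough sketch of the idea behind PipedRDD's compute, not Spark's actual code: launch an external command with ProcessBuilder, feed the input records to its stdin from a helper thread, and expose its stdout lines as the partition's iterator. The function name `pipePartition` is a hypothetical label for illustration.

```scala
import scala.io.Source

// Hypothetical helper mirroring what PipedRDD.compute does for one partition.
def pipePartition(command: Seq[String], input: Iterator[String]): Iterator[String] = {
  val process = new ProcessBuilder(command: _*).start()

  // Writer thread: push each input record into the process's stdin,
  // so the main thread can consume stdout without deadlocking.
  new Thread {
    override def run(): Unit = {
      val out = new java.io.PrintWriter(process.getOutputStream)
      input.foreach(out.println)
      out.close()
    }
  }.start()

  // The command's stdout lines become the records of the output partition.
  Source.fromInputStream(process.getInputStream).getLines()
}
```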
ZippedPartitionsRDD2
CheckpointRDD
CoGroupRDD
CartesianRDD