Common Operations in RDD
Transformations
- persist: mark the current RDD as persistent; it is registered in the map of cached RDDs kept by SparkContext
- getNarrowAncestors: get all ancestors of the current RDD that are reachable through narrow dependencies
- map/flatMap/filter: create a MapPartitionsRDD that applies the function to the elements of the current RDD
- distinct: remove duplicate elements from the current RDD and return a new RDD
- repartition: change the number of partitions of the current RDD
- coalesce: re-partition the current RDD, typically to reduce the number of partitions; can optionally shuffle
- sample: sample a fraction of the data from the current RDD, with or without replacement
- randomSplit: split the current RDD into several RDDs according to the supplied weights
- union: return the union of this RDD and another RDD (duplicates are kept)
- sortBy: sort the elements of the RDD by the given key function
- intersection: return an RDD containing only the elements present in both the current RDD and another RDD
- cartesian: return the Cartesian product of this RDD and another RDD, as an RDD of pairs
- groupBy: group the elements of the RDD by the key computed by the given function
- pipe: pipe each partition's elements through an external command line
- mapPartitions: apply the function to each partition of the current RDD
- zip: zip two RDDs together; both must have the same number of partitions and the same number of elements in each partition
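A minimal sketch of a few of these transformations, assuming an existing SparkContext `sc` (every call returns a new RDD lazily; nothing is computed until an action runs):

```scala
val nums = sc.parallelize(Seq(1, 2, 2, 3, 4), numSlices = 2)

nums.distinct()        // remove the duplicate 2
nums.map(_ * 10)       // wraps the function in a MapPartitionsRDD
nums.union(nums)       // UnionRDD; duplicates are kept
nums.groupBy(_ % 2)    // group by parity: key 0 -> evens, key 1 -> odds
nums.coalesce(1)       // CoalescedRDD with a single partition
nums.sortBy(x => -x)   // sort descending via a key function
```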
Actions
- foreach: apply the function to each element of the RDD (runs on the executors)
- collect: return all elements of the RDD to the driver as an array
- subtract: return an RDD of the elements that are in the first RDD but not in the second (note: this actually returns an RDD, so it is a transformation rather than an action)
- reduce: combine all elements of the RDD with an associative, commutative binary function
- fold: like reduce, but starts from a zero value
- aggregate: fold into a result of a different type, using one function within each partition and another to merge partition results
- count: return the number of elements in the RDD
- countApproxDistinct: count the approximate number of distinct elements
- take/first/top: return the first n elements / the first element / the largest n elements
- max/min: return the maximum/minimum element
- saveAsTextFile: save the RDD as text files
- keyBy: create a pair RDD by applying a key function to each element (also a transformation rather than an action)
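The actions above can be sketched as follows, again assuming a SparkContext `sc`; each call eagerly triggers a job and returns a result to the driver:

```scala
val nums = sc.parallelize(1 to 5)

nums.count()                     // 5
nums.reduce(_ + _)               // 15
nums.fold(0)(_ + _)              // 15, same as reduce but with an explicit zero value
nums.aggregate(0)(_ + _, _ + _)  // 15: seqOp within partitions, combOp across them
nums.take(2)                     // Array(1, 2)
nums.max()                       // 5
```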
MapPartitionsRDD
For transformations like map/flatMap, a narrow dependency on the current RDD is generated. The transformation is wrapped in a MapPartitionsRDD. E.g. invoking rdd.map(x => (x, 1)) creates a new MapPartitionsRDD, call it mapRDD; mapRDD depends on rdd through a OneToOneDependency.
When an action is triggered, the DAGScheduler computes each partition of the final RDD. If a partition is not cached, the RDD's compute method finds the corresponding partitions of its parent RDDs and computes them first. This recursion continues until it reaches an RDD whose data is directly available (e.g. an input RDD). E.g. for sc.textFile().map().filter().count(), computation starts from the RDD created by sc.textFile(); once a partition of this RDD is computed, an iterator over it is passed to the MapPartitionsRDD wrapping the map() transformation, then to the one wrapping filter(). When count() finishes, the result is returned to the driver. A chain of narrow dependencies is computed within the same task, on the same executor.
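The lineage described above can be observed directly, assuming a SparkContext `sc` (the file path is illustrative only):

```scala
val rdd    = sc.textFile("input.txt")   // input RDD, the root of the lineage
val mapRDD = rdd.map(x => (x, 1))       // new MapPartitionsRDD; OneToOneDependency on rdd
val result = mapRDD.filter(_._2 > 0)    // another MapPartitionsRDD stacked on top

println(result.toDebugString)           // prints the chain of narrow dependencies
result.count()                          // action: triggers the recursive computation
```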
CoalescedRDD
PartitionwiseSampledRDD
UnionRDD
PipedRDD
Uses ProcessBuilder to execute the command line and redirects its output to an iterator. (Each RDD must override the compute() method, which returns an iterator over all records of the resulting partition.)
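A rough sketch of the idea behind PipedRDD's compute, not Spark's actual code: launch an external command with ProcessBuilder, feed the input records to its stdin from a helper thread, and expose its stdout lines as the partition's iterator. The function name `pipePartition` is a hypothetical label for illustration.

```scala
import scala.io.Source

// Hypothetical helper mirroring what PipedRDD.compute does for one partition.
def pipePartition(command: Seq[String], input: Iterator[String]): Iterator[String] = {
  val process = new ProcessBuilder(command: _*).start()

  // Writer thread: push each input record into the process's stdin,
  // so the main thread can consume stdout without deadlocking.
  new Thread {
    override def run(): Unit = {
      val out = new java.io.PrintWriter(process.getOutputStream)
      input.foreach(out.println)
      out.close()
    }
  }.start()

  // The command's stdout lines become the records of the output partition.
  Source.fromInputStream(process.getInputStream).getLines()
}
```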
ZippedPartitionsRDD2
CheckpointRDD
CoGroupRDD
CartesianRDD