Common Operations on RDDs

Transformations

  • persist: register the current RDD in the map of all persisted RDDs kept by the SparkContext
  • getNarrowAncestors: get all ancestors of the current RDD that are reachable through narrow dependencies
  • map/flatMap/filter: create a MapPartitionsRDD wrapping the given function
  • distinct: remove duplicate elements from the current RDD and return a new RDD
  • repartition: change the number of partitions of the current RDD (always through a shuffle)
  • coalesce: re-partition the current RDD; an optional shuffle and a custom partition coalescer can be specified
  • sample: sample a fraction of the data from the current RDD, with or without replacement
  • randomSplit: split the current RDD into several RDDs according to the given weights
  • union: return the union of the current RDD and another RDD (duplicates are kept)
  • sortBy: sort the elements of the RDD by the given key function
  • intersection: return an RDD containing only the elements present in both the current RDD and another RDD
  • cartesian: return the Cartesian product of the current RDD and another RDD, i.e. all pairs of elements
  • groupBy: group the elements by the given key function, collecting the values of each key together
  • pipe: feed the elements of each partition through an external command-line process
  • mapPartitions: apply the function to each partition of the current RDD
  • zip: zip two RDDs together; both must have the same number of partitions and the same number of elements per partition
  • subtract: return an RDD containing the elements that are in the current RDD but not in the other one
  • keyBy: turn each element into a (key, element) tuple by applying a key function
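
A minimal sketch exercising several of these transformations, assuming a local deployment (the names rdd, doubled, etc. are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc  = new SparkContext(new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4, 5), numSlices = 3)
    rdd.persist()                                     // registered in SparkContext's map of persisted RDDs

    val doubled = rdd.map(_ * 2)                      // creates a MapPartitionsRDD
    val unique  = rdd.distinct()                      // removes the duplicate 2
    val fewer   = rdd.coalesce(2)                     // 3 partitions -> 2, no shuffle
    val sampled = rdd.sample(withReplacement = false, fraction = 0.5)
    val Array(bigger, smaller) = rdd.randomSplit(Array(0.7, 0.3))
    val all     = rdd.union(sc.parallelize(Seq(6, 7)))
    val sorted  = rdd.sortBy(x => -x)                 // descending order
    val common  = rdd.intersection(sc.parallelize(Seq(3, 4, 9)))   // 3, 4
    val grouped = rdd.groupBy(_ % 2)                  // RDD[(Int, Iterable[Int])]
    val sums    = rdd.mapPartitions(it => Iterator(it.sum))        // one sum per partition
    val keyed   = rdd.keyBy(_.toString)               // RDD[(String, Int)]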

Actions

  • foreach: apply the function to each element of the RDD
  • collect: return all elements of the RDD to the driver
  • reduce: aggregate the elements of the RDD with an associative and commutative binary function
  • fold: like reduce, but starting from a given zero value
  • aggregate: aggregate the elements with separate seqOp and combOp functions, starting from a zero value
  • count: return the number of elements in the RDD
  • countApproxDistinct: count the approximate number of distinct elements
  • take/first/top: return the first n elements / the first element / the largest n elements
  • max/min: return the maximum / minimum element
  • saveAsTextFile: write the elements of the RDD as text files
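
A matching sketch of the actions, assuming the same sc as above (/tmp/nums-out is an illustrative output path):

    val nums = sc.parallelize(1 to 10, numSlices = 2)

    nums.foreach(println)                       // runs on the executors
    val all      = nums.collect()               // Array(1, 2, ..., 10) on the driver
    val sum      = nums.reduce(_ + _)           // 55
    val folded   = nums.fold(0)(_ + _)          // 55, starting from the zero value 0
    val (s, c)   = nums.aggregate((0, 0))(      // (sum, count) in a single pass
      (acc, x) => (acc._1 + x, acc._2 + 1),     // seqOp: merge an element into the accumulator
      (l, r)   => (l._1 + r._1, l._2 + r._2))   // combOp: merge two accumulators
    val n        = nums.count()                 // 10
    val approx   = nums.countApproxDistinct()   // approximately 10
    val firstTwo = nums.take(2)                 // Array(1, 2)
    val top3     = nums.top(3)                  // Array(10, 9, 8)
    val mx       = nums.max()                   // 10
    nums.saveAsTextFile("/tmp/nums-out")        // one text file per partition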

MapPartitionsRDD

For transformations like map/flatMap, a narrow dependency is created on the current RDD, and the transformation is wrapped in a MapPartitionsRDD. E.g., invoking rdd.map(x => (x, 1)) creates a new MapPartitionsRDD, call it mapRDD. mapRDD depends on rdd, and the dependency of mapRDD on rdd is a OneToOneDependency.
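
This can be verified directly; a small sketch assuming an existing SparkContext sc:

    import org.apache.spark.OneToOneDependency

    val rdd    = sc.parallelize(1 to 4)
    val mapRDD = rdd.map(x => (x, 1))    // a new MapPartitionsRDD

    // mapRDD depends on rdd through a narrow, one-to-one dependency.
    println(mapRDD.getClass.getSimpleName)                                 // MapPartitionsRDD
    println(mapRDD.dependencies.head.isInstanceOf[OneToOneDependency[_]])  // true
    println(mapRDD.dependencies.head.rdd == rdd)                           // true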

When an action is triggered, the DAGScheduler computes each partition of the final RDD. If a partition is not cached, the RDD's compute method finds its parent RDDs and computes the corresponding partitions in them first. This procedure is recursive until it reaches the first RDD in the lineage. E.g., for sc.textFile().map().filter().count(), computation starts from the RDD created by sc.textFile(); once a partition of this RDD is computed, an iterator over it is passed to the next RDD, which wraps the map() transformation, and so on down the chain. When the count() action finishes, the result is returned to the driver. An entire chain of narrow dependencies is computed within a single task on the same executor.
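
The essence of this recursive, iterator-based evaluation can be modeled with a few toy classes (a simplified sketch of the idea, not Spark's actual implementation):

    // Each "RDD" computes a partition by lazily wrapping its parent's iterator.
    trait ToyRDD[T] { def compute(partition: Int): Iterator[T] }

    class ToySource(data: Seq[Seq[String]]) extends ToyRDD[String] {
      def compute(p: Int): Iterator[String] = data(p).iterator             // like sc.textFile()
    }

    class ToyMapped[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U] {
      def compute(p: Int): Iterator[U] = parent.compute(p).map(f)          // like map()
    }

    class ToyFiltered[T](parent: ToyRDD[T], pred: T => Boolean) extends ToyRDD[T] {
      def compute(p: Int): Iterator[T] = parent.compute(p).filter(pred)    // like filter()
    }

    // The action pulls the whole chain; nothing is materialized before this point.
    val source  = new ToySource(Seq(Seq("a", "bb", "ccc"), Seq("dddd")))
    val chained = new ToyFiltered(new ToyMapped[String, Int](source, _.length), _ > 1)
    val count   = (0 until 2).map(p => chained.compute(p).size).sum        // like count(): 3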

CoalescedRDD

The RDD produced by coalesce(): each of its partitions combines several parent partitions.

PartitionwiseSampledRDD

The RDD produced by sample(): a sampler is applied independently to each parent partition.

UnionRDD

The RDD produced by union(): its partitions are the partitions of all parent RDDs, concatenated.

PipedRDD

PipedRDD uses a ProcessBuilder to execute the command line and redirects the process's output to an iterator. (Every RDD has to override the compute() method, which returns an iterator over all records of the computed partition.)
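
A minimal standalone sketch of that pattern (simplified: the real PipedRDD.compute() also handles the environment, feeds the partition's records to the process's stdin, and uses separate threads):

    import java.io.{BufferedReader, InputStreamReader}
    import scala.collection.JavaConverters._

    // Run an external command and expose its stdout as an iterator of lines.
    def pipeLines(command: Seq[String]): Iterator[String] = {
      val process = new ProcessBuilder(command.asJava).start()
      val reader  = new BufferedReader(new InputStreamReader(process.getInputStream))
      Iterator.continually(reader.readLine()).takeWhile(_ != null)
    }

    pipeLines(Seq("echo", "hello")).foreach(println)   // prints: hello

In Spark itself this is exposed as rdd.pipe(command).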

ZippedPartitionsRDD2

The RDD produced by zip(): it combines the corresponding partitions of two parent RDDs.

CheckpointRDD

An RDD that reads its data back from checkpoint files instead of recomputing its lineage.

CoGroupedRDD

The RDD behind cogroup() and the join-style operations built on it: it groups the values from several RDDs by key.

CartesianRDD

The RDD produced by cartesian(): each of its partitions pairs one partition of the first parent with one partition of the second.
