DStream is the programming entry point for Spark Streaming, much as RDD is for Spark batch jobs. A DStream can be created directly from a StreamingContext (e.g. socketStream()) or by an external input connector (e.g. KafkaUtils.createStream() from the Kafka integration module). Once a source DStream exists, we can apply data transformations such as map and flatMap, as in the sketch below.
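
To make this concrete, here is a minimal, self-contained sketch of the classic socket word-count pattern (the host, port, and batch interval are placeholders; `nc -lk 9999` can serve as a test source):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamDemo")
    // Batch interval of 1 second: each micro-batch covers 1s of input.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Source DStream from a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations only build new DStreams; nothing executes yet.
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    counts.print() // output operation: registers the stream in the DStreamGraph

    ssc.start()            // start receivers and the micro-batch scheduler
    ssc.awaitTermination() // block until the context is stopped
  }
}
```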

Operations on DStream

  • DStream.map(func): returns a MappedDStream that wraps the map function. When the Streaming JobScheduler runs the job (micro-batch), MappedDStream.compute() calls RDD.map() on the parent's RDD for that batch, passing the wrapped function in (see the first sketch after this list).
  • DStream.flatMap(func): returns a FlatMappedDStream, which, like MappedDStream, wraps the flatMap function.
  • DStream.filter(func): returns a FilteredDStream.
  • DStream.transform(func): returns a TransformedDStream.
  • DStream.mapPartitions(func): returns a MapPartitionedDStream.
  • DStream.reduce(func): keys every element with a single dummy key via map(), applies reduceByKey(), and then maps back to the bare values (see the second sketch after this list).
  • DStream.count(): counts the elements in each batch's RDD; internally it builds on the same map()/reduceByKey() pattern as reduce(), mapping every element to a count of 1 and summing.
  • DStream.foreachRDD(func): creates a ForEachDStream and registers it as an output stream in the DStreamGraph. For each batch interval, the JobGenerator invokes DStreamGraph.generateJobs() to build a Job from every registered output stream; that Job carries the RDD lineage of all upstream transformations, and the JobScheduler then runs it (see the usage sketch below).
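
To illustrate the map() bullet, here is a lightly commented sketch of MappedDStream, paraphrased from the Spark source (getOrCompute() and the DStream constructor are package-private, so code like this compiles only inside org.apache.spark.streaming.dstream itself):

```scala
package org.apache.spark.streaming.dstream

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, Time}

// Stores the user's function; all real work is deferred to compute(),
// which the scheduler calls once per batch with that batch's time.
class MappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    mapFunc: T => U
  ) extends DStream[U](parent.ssc) {

  // Exactly one parent: the DStream that map() was called on.
  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  // Fetch (or compute) the parent's RDD for this batch, then apply RDD.map
  // to it: the DStream-level map becomes an RDD-level map at this point.
  override def compute(validTime: Time): Option[RDD[U]] =
    parent.getOrCompute(validTime).map(_.map[U](mapFunc))
}
```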
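
The reduce() bullet can likewise be written out. This is a simplified sketch of the dummy-key trick (the real implementation in DStream.scala uses a null key, but the shape is the same):

```scala
import scala.reflect.ClassTag
import org.apache.spark.streaming.dstream.DStream

// Sketch: express reduce() in terms of map() and reduceByKey().
def reduceSketch[T: ClassTag](stream: DStream[T], f: (T, T) => T): DStream[T] =
  stream
    .map(x => ((), x)) // key every element with a single dummy key
    .reduceByKey(f, 1) // one partition: combine all values in the batch
    .map(_._2)         // drop the dummy key, keeping the reduced value
```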
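
Finally, a usage sketch for foreachRDD() (logBatchSizes is a hypothetical helper): the closure runs on the driver once per micro-batch, and the RDD action inside it, count() here, is what actually triggers the Spark job for that batch.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: log how many records each micro-batch produced.
def logBatchSizes(counts: DStream[(String, Int)]): Unit =
  counts.foreachRDD { (rdd, time) =>
    // rdd.count() is an action; it submits the actual job for this batch.
    println(s"Batch $time: ${rdd.count()} records")
  }
```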
