Spark Streaming is another form of Spark Batch. It leverages mini-batches to approximate real-time processing. Unlike Apache Flink, which processes each input record one by one, Spark processes input data from the source in mini-batches. We can set the mini-batch interval down to milliseconds to achieve low latency. Like Spark Batch, Spark Streaming also has an entry point for programming: StreamingContext.

StreamingContext

One difference between constructing a StreamingContext and a SparkContext is the checkpoint path passed to the StreamingContext. Spark Streaming uses checkpointing to continue working from the latest checkpoint, so once a streaming job fails, it can recover from the last checkpoint without having to reprocess data from the start.
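As a minimal sketch of checkpoint recovery (the checkpoint directory and app name below are placeholders), StreamingContext.getOrCreate() rebuilds the context from the latest checkpoint if one exists, and otherwise calls the supplied function to build a fresh one:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoveryExample {
  // Placeholder checkpoint directory, for illustration only.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  // Builds a fresh StreamingContext when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointRecoveryExample")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval
    ssc.checkpoint(checkpointDir)                    // enable checkpointing
    // ... define the DStream graph here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the latest checkpoint if one exists,
    // otherwise create a new context with createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```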

StreamingContext contains an instance of SparkContext, since some features are only available through SparkContext. If a checkpoint is present, StreamingContext builds the SparkContext based on that checkpoint: CheckpointReader reads the serialized checkpoint data, deserializes it, and retrieves the configuration, which is then used to set up the SparkContext.

Create Stream Source

There are several ways to create a source stream in Spark Streaming.

  • receiverStream(receiver): creates a source stream from a user-defined receiver, an instance of Receiver (a sketch of a custom receiver is shown below). A ReceiverInputDStream is created, containing the Receiver instance. For each mini-batch, its compute method creates either a BlockRDD or a WriteAheadLogBackedBlockRDD, depending on the WAL setting.
    • BlockRDD
    • WriteAheadLogBackedBlockRDD

The compute() method is what actually generates the RDD for the current mini-batch. Those RDDs are then scheduled by JobScheduler.
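To make receiverStream() concrete, here is a sketch of a user-defined Receiver, modeled on the custom-receiver example in the Spark documentation; the host and port are placeholders:

```scala
import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A simple user-defined receiver that reads lines from a socket.
class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a background thread so onStart() returns immediately.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // The receive() loop exits on its own once isStopped becomes true.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val lines = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
      while (!isStopped && lines.hasNext) {
        store(lines.next()) // hand each record to Spark's block manager
      }
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage: plug the receiver into the streaming context.
// val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
```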

Run Streaming

Once the job graph (RDDs) has been defined, StreamingContext.start() launches the streaming execution. JobScheduler is a member of StreamingContext. It starts JobGenerator, which generates the RDDs for each batch at a fixed rate defined by the batch interval. Once the RDDs are generated, they are submitted and run as that mini-batch.
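A minimal end-to-end sketch (host, port, and app name are placeholders) showing where start() fits: the transformations only define the DStream graph, and nothing executes until start() is called:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1)) // batch interval = 1 second

    // Define the job graph: one RDD lineage per mini-batch.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // output action

    ssc.start()            // JobScheduler / JobGenerator start here
    ssc.awaitTermination() // block until the streaming job is stopped
  }
}
```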

JobScheduler & JobGenerator

JobScheduler is responsible for scheduling and running each Spark mini-batch job; a Spark Streaming job consists of many mini-batch jobs executed at a fixed batch interval. JobScheduler starts a JobGenerator, which is responsible for generating the RDDs for each mini-batch. In Spark Streaming, DStream plays the role that RDD plays in Spark Batch: we can process and transform data with methods like map() and flatMap(). Unlike RDD.compute(), the compute() method on DStream is used to generate the RDDs for each batch interval. This is driven by JobGenerator.
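For illustration, this is roughly what a transformed DStream's compute() looks like inside Spark; it is abridged from MappedDStream and will not compile standalone, since getOrCompute() and ssc are package-private to Spark Streaming:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, Time}
import org.apache.spark.streaming.dstream.DStream

// Abridged sketch of MappedDStream: compute() asks the parent DStream for
// the RDD of this batch time and applies the map function to it.
class MappedDStreamSketch[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    mapFunc: T => U)
  extends DStream[U](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    // getOrCompute() returns the parent's RDD for this batch time,
    // generating it or reusing a cached one.
    parent.getOrCompute(validTime).map(_.map(mapFunc))
  }
}
```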

JobGenerator is a member variable of JobScheduler and is started after JobScheduler starts. JobGenerator holds a timer, an instance of RecurringTimer, which posts a GenerateJobs event to JobGenerator at every batch interval. An EventLoop in JobGenerator handles this event by calling DStreamGraph to generate a JobSet; a JobSet contains the mini-batch jobs of one batch.

When we want to perform an output action such as print() or saveAsTextFiles() at the end of the stream, DStream.foreachRDD() is invoked: the action is wrapped into a foreach function, and this function is stored in a ForEachDStream object. For saveAsTextFiles(), for example, the foreach function calls rdd.saveAsTextFile(file) to save the data of each RDD. Once the ForEachDStream object is created, it is registered with DStreamGraph. When JobGenerator's RecurringTimer reaches the end of a batch interval, DStreamGraph.generateJobs() is invoked to get a JobSet; it delegates to ForEachDStream.generateJob(), which simply wraps the foreach function into a Job object.
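To make the wrapping concrete, here is a user-level equivalent of saveAsTextFiles() written directly with foreachRDD(); the closure is exactly the kind of foreach function that ForEachDStream stores and that generateJob() later wraps into a Job. The output path is a placeholder, and counts is the DStream from the earlier sketch:

```scala
import org.apache.spark.streaming.Time

// The (rdd, time) closure below is the "foreach function": ForEachDStream keeps
// it, and for every batch interval generateJob() wraps it into a Job object.
counts.foreachRDD { (rdd, time: Time) =>
  val file = s"hdfs:///tmp/wordcounts-${time.milliseconds}" // placeholder path
  rdd.saveAsTextFile(file) // an RDD action; running it executes the mini-batch
}
```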

Once everything is in place, the generated JobSet is submitted back to JobScheduler by JobGenerator through JobScheduler.submitJobSet(). Each Job is wrapped into a JobHandler and executed by a thread pool in JobScheduler. The rest is much like a Spark Batch job: JobHandler calls Job.run(), which invokes the foreach function. Using saveAsTextFiles() as an example, the foreach function calls RDD.saveAsTextFile(), an action that triggers running the batch job.
