One of the best features of Databricks is that it facilitates transitioning batch pipelines to streaming pipelines. Depending on your transformation logic it may be as easy as changing two lines of code. In this post we will review some batch and streaming concepts and show how to write a notebook that can be run in both batch and streaming modes.
I’ve also recorded a webinar that accompanies this post Databricks Batch and Streaming
To compose your notebooks / jobs as batch use the commands spark.read and spark.write to read and write your dataframes.
They will run one time with all the data that exists at the source, processing it all through the pipeline and into the sink.
For an example of how to do this see the provided notebook smokerPipelineBatch.py. It uses publicly available datasets and anyone can run it using the free trial of Databricks.
If a batch pipeline used a data source that was continually added to, you may have to write your own logic to keep track of the pipeline progress using your own checkpoint. One way to do this is to process partitions of the source data once they are ready for processing.
If at some point you realize you need more real-time data you can convert your notebook to be a streaming notebook using spark.readStream and spark.writeStream.
For an example of how to do this see the provided notebook smokerPipelineStreaming.py
Databricks will take care of keeping track of what data has been processed using its own checkpointing logic. All the user has to do is specify the checkpoint location in the sink. The first time the notebook runs in streaming mode all the data that exists at the source will be processed in one micro-batch.
Streaming Output Modes
- Append (default) – appends only the new rows
- Complete – entire result table written out
- Update – only updated rows are output (equivalent to append if no aggregations)
The trigger specifies how it will be processed
- Unspecified (default) – The streaming pipeline will run in micro-batch mode
- Fixed interval – manually specify the micro-batch interval
- Once – only one micro-batch will run
- Continuous (Experimental) – lower latency but at least once fault tolerance
foreach & foreachBatch
These modes are very powerful and allow the application of custom logic to each micro-batch.
There are a number of unsupported operations when streaming. These are outside the scope of this post but can and should be understood if developing complex streaming pipelines.
Batch or Streaming? Batch AND Streaming
The fact that the majority, if not all, of the logic can remain the same allows us to write notebooks that could be run in batch or streaming mode based on a parameter passed to the notebook when the job is called.
Here is an example of how this can be performed using notebook parameterization: smokerPipelineBatchOrStreaming.py