One of the best features of Databricks is that it facilitates transitioning batch pipelines to streaming pipelines. Depending on your transformation logic it may be as easy as changing two lines of code. In this post we will review some batch and streaming concepts and show how to write a notebook that can be run in both batch and streaming modes.
I’ve also recorded a webinar that accompanies this post Databricks Batch and Streaming
To compose your notebooks / jobs as batch use the commands spark.read and spark.write to read and write your dataframes.
They will run one time with all the data that exists at the source, processing it all through the pipeline and into the sink.
For an example of how to do this see the provided notebook smokerPipelineBatch.py. It uses publicly available datasets and anyone can run it using the free trial of Databricks.
If a batch pipeline used a data source that was continually added to, you may have to write your own logic to keep track of the pipeline progress using your own checkpoint. One way to do this is to process partitions of the source data once they are ready for processing.
If at some point you realize you need more real-time data you can convert your notebook to be a streaming notebook using spark.readStream and spark.writeStream.
For an example of how to do this see the provided notebook smokerPipelineStreaming.py
Databricks will take care of keeping track of what data has been processed using its own checkpointing logic. All the user has to do is specify the checkpoint location in the sink. The first time the notebook runs in streaming mode all the data that exists at the source will be processed in one micro-batch.
The trigger specifies how it will be processed
These modes are very powerful and allow the application of custom logic to each micro-batch.
There are a number of unsupported operations when streaming. These are outside the scope of this post but can and should be understood if developing complex streaming pipelines.
The fact that the majority, if not all, of the logic can remain the same allows us to write notebooks that could be run in batch or streaming mode based on a parameter passed to the notebook when the job is called.
Here is an example of how this can be performed using notebook parameterization: smokerPipelineBatchOrStreaming.py
Microsoft Azure and Amazon Web Services (AWS) are two of the most popular cloud platforms.…
Cloud management is difficult to do manually, especially if you work with multiple cloud…
Azure’s scalable infrastructure is often cited as one of the primary reasons why it's the…
https://www.youtube.com/watch?v=wDzCN0d8SeA Watch our "Unlocking the Power of AI in your Software Development Life Cycle (SDLC)"…
FinOps is a strategic approach to managing cloud costs. It combines financial management best practices…
Using Kubernetes with Azure combines the power of Kubernetes container orchestration and the cloud capabilities…