
Batch and Streaming Pipelines

One of the best features of Databricks is that it facilitates transitioning batch pipelines to streaming pipelines. Depending on your transformation logic, it may be as easy as changing two lines of code. In this post, we will review some batch and streaming concepts and show how to write a notebook that can be run in both batch and streaming modes.

I’ve also recorded a webinar that accompanies this post: Databricks Batch and Streaming.

Batch

To compose your notebooks/jobs as batch, use spark.read and spark.write to read and write your DataFrames.

A batch job runs once over all the data that exists at the source, processing it through the pipeline and into the sink.

For an example of how to do this see the provided notebook smokerPipelineBatch.py. It uses publicly available datasets and anyone can run it using the free trial of Databricks.
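Here is a minimal sketch of what that looks like, assuming a hypothetical CSV source and Delta sink path (the spark session is predefined in Databricks notebooks); the provided notebook has its own datasets and transformation logic:

from pyspark.sql import functions as F

# Read everything that currently exists at the source (hypothetical path)
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("/tmp/smoker_pipeline/source_csv"))

# Placeholder for your transformation logic
transformed = df.withColumn("ingested_at", F.current_timestamp())

# Write the full result to the sink in a single run (hypothetical path)
(transformed.write
 .format("delta")
 .mode("append")
 .save("/tmp/smoker_pipeline/batch_output"))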

If a batch pipeline uses a data source that is continually appended to, you may have to write your own logic to keep track of the pipeline's progress using your own checkpoint. One way to do this is to process partitions of the source data once they are ready, as sketched below.
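As a rough sketch of that approach (the table names, partition column, and paths below are all hypothetical):

from pyspark.sql import functions as F

# Hypothetical bookkeeping table recording the last partition already processed
last_processed = (spark.read.table("pipeline_progress")
                  .agg(F.max("partition_date").alias("p"))
                  .collect()[0]["p"])

# Only read partitions that arrived after our own "checkpoint"
new_data = (spark.read.table("source_table")
            .filter(F.col("partition_date") > F.lit(last_processed)))

new_data.write.format("delta").mode("append").save("/tmp/smoker_pipeline/output")

# Record progress so the next batch run starts where this one left off
(new_data.select(F.max("partition_date").alias("partition_date"))
 .write.mode("append").saveAsTable("pipeline_progress"))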

Streaming

If at some point you realize you need more real-time data, you can convert your notebook to a streaming notebook using spark.readStream and spark.writeStream.

For an example of how to do this, see the provided notebook smokerPipelineStreaming.py.

Databricks will take care of keeping track of what data has been processed using its own checkpointing logic; all the user has to do is specify the checkpoint location on the sink. The first time the notebook runs in streaming mode, all the data that exists at the source will be processed in one micro-batch.
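To illustrate, here is roughly what the batch sketch above looks like after the switch; the Delta source, sink, and checkpoint paths are hypothetical:

from pyspark.sql import functions as F

# read becomes readStream (hypothetical Delta source path)
stream_df = spark.readStream.format("delta").load("/tmp/smoker_pipeline/source")

# The transformation logic can stay the same as in batch mode
transformed = stream_df.withColumn("processed_at", F.current_timestamp())

# write becomes writeStream, plus a checkpoint location on the sink
query = (transformed.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/smoker_pipeline/checkpoints/stream")
         .outputMode("append")
         .start("/tmp/smoker_pipeline/streaming_output"))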

Streaming Output Modes

  • Append (default) – appends only the new rows
  • Complete – entire result table written out
  • Update – only updated rows are output (equivalent to append if no aggregations)
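For example, a streaming aggregation cannot use the default append mode (without a watermark), so it is typically written with complete or update mode; a small sketch with hypothetical path and column names:

# Count events per device and re-write the whole result table each micro-batch
counts = (spark.readStream.format("delta")
          .load("/tmp/smoker_pipeline/source")   # hypothetical path
          .groupBy("device_id")                  # hypothetical column
          .count())

(counts.writeStream
 .format("memory")            # in-memory sink, convenient for exploration
 .queryName("device_counts")
 .outputMode("complete")
 .start())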

Trigger

The trigger specifies when and how micro-batches are processed:

  • Unspecified (default) – The streaming pipeline will run in micro-batch mode
  • Fixed interval – manually specify the micro-batch interval
  • Once – only one micro-batch will run
  • Continuous (Experimental) – lower latency, but with at-least-once fault-tolerance guarantees
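Roughly, these map to the DataStreamWriter API as follows (paths are hypothetical, and you would pick exactly one trigger per query):

# A streaming source to hang the examples off (hypothetical Delta path)
stream_df = spark.readStream.format("delta").load("/tmp/smoker_pipeline/source")

writer = (stream_df.writeStream
          .format("delta")
          .option("checkpointLocation", "/tmp/smoker_pipeline/checkpoints/trig"))

# Unspecified (default): just call .start() with no .trigger()
# query = writer.start("/tmp/smoker_pipeline/out")

# Fixed interval: start a new micro-batch every minute
query = writer.trigger(processingTime="1 minute").start("/tmp/smoker_pipeline/out")

# Once: a single micro-batch over the currently available data, then stop
# query = writer.trigger(once=True).start("/tmp/smoker_pipeline/out")

# Continuous (experimental): only supported for a limited set of sources/sinks
# query = writer.trigger(continuous="1 second").start("/tmp/smoker_pipeline/out")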

Sink Types

foreach & foreachBatch

These sinks are very powerful and allow custom logic to be applied to each row (foreach) or each micro-batch (foreachBatch).
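As a sketch, foreachBatch hands your function each micro-batch as an ordinary batch DataFrame together with a batch id, so any batch-style logic (de-duplication, merges, writes to multiple sinks) can run per micro-batch; the function name, column, and paths below are hypothetical:

def process_micro_batch(batch_df, batch_id):
    # Any batch operation is allowed here; as a simple example,
    # de-duplicate and append to a Delta path
    (batch_df.dropDuplicates(["id"])
     .write.format("delta").mode("append")
     .save("/tmp/smoker_pipeline/foreach_batch_output"))

stream_df = spark.readStream.format("delta").load("/tmp/smoker_pipeline/source")

query = (stream_df.writeStream
         .foreachBatch(process_micro_batch)
         .option("checkpointLocation", "/tmp/smoker_pipeline/checkpoints/feb")
         .start())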

Streaming Limitations

There are a number of unsupported operations when streaming. They are outside the scope of this post, but they should be understood before developing complex streaming pipelines.

Batch or Streaming? Batch AND Streaming

Because most, if not all, of the logic can remain the same, we can write notebooks that run in either batch or streaming mode based on a parameter passed to the notebook when the job is called.

Here is an example of how this can be performed using notebook parameterization: smokerPipelineBatchOrStreaming.py
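The gist is a sketch like the following, where a widget parameter chooses the mode and the shared transformation logic stays identical; the widget name, paths, and placeholder logic here are hypothetical, and the provided notebook has its own details:

# dbutils and spark are predefined in Databricks notebooks
dbutils.widgets.text("mode", "batch")   # "batch" or "streaming"
mode = dbutils.widgets.get("mode")

source_path = "/tmp/smoker_pipeline/source"            # hypothetical
sink_path = "/tmp/smoker_pipeline/output"              # hypothetical
checkpoint_path = "/tmp/smoker_pipeline/checkpoints"   # hypothetical

# Only the read changes between modes
if mode == "streaming":
    df = spark.readStream.format("delta").load(source_path)
else:
    df = spark.read.format("delta").load(source_path)

# Shared transformation logic (placeholder)
transformed = df.dropDuplicates(["id"])

# Only the write changes between modes
if mode == "streaming":
    (transformed.writeStream
     .format("delta")
     .option("checkpointLocation", checkpoint_path)
     .start(sink_path))
else:
    transformed.write.format("delta").mode("append").save(sink_path)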

Matt Szafir
