Yes, DBT (Data Build Tool) can work with streaming data when using Databricks, but with certain considerations and limitations. Here’s how it works and what to keep in mind:
1. Databricks and Streaming Data
Databricks supports processing streaming data using Structured Streaming, which is built on Apache Spark. Databricks handles real-time data ingestion, transformations, and analytics through its native streaming capabilities.
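As a quick illustration (the table and column names here are made up), a Structured Streaming job can read a stream, apply a transformation such as a windowed aggregation, and continuously write the results out:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read an upstream Delta table as a stream (table name is illustrative)
events = spark.readStream.table("raw_events")

# Continuously count events per 5-minute window and event type
counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "event_type")
    .count()
)

# Append finalized windows to a downstream Delta table
(
    counts.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/event_counts")
    .toTable("event_counts")
)
```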
2. DBT’s Role in the Process
DBT is primarily designed for batch transformations in a SQL-based analytics workflow. It uses SQL to model data in a warehouse or a lakehouse, such as the Databricks Lakehouse Platform. While DBT itself does not natively support streaming transformations (since it’s oriented around batch processing), you can use it to:
- Define batch transformations that consume data from a streaming source that has been written into a Delta table.
- Work with Delta Lake, which supports both streaming and batch queries.
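As a rough sketch of that dual behavior (the table name `bronze_events` is a placeholder), the same Delta table can be queried both ways; the batch read is essentially what a compiled dbt model executes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch query: reads the latest committed version of the Delta table.
# A dbt model over this table compiles down to an equivalent batch SELECT.
batch_df = spark.read.table("bronze_events")

# Streaming query: picks up new rows of the same table as they are committed.
stream_df = spark.readStream.table("bronze_events")
```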
3. Workflow for Streaming with DBT and Databricks
A typical approach to using DBT with streaming data in Databricks involves:
- Streaming Ingestion to Delta Lake:
  - Use Databricks Structured Streaming to ingest real-time data into Delta Lake tables.
  - These Delta tables act as the sink for the stream and as the source for downstream dbt transformations.
- Batch Processing with DBT:
  - Define DBT models to perform batch transformations on the Delta tables.
  - DBT can process data that has been ingested into Delta tables in micro-batches.
  - You can define incremental models in DBT to process only the new data (e.g., `is_incremental()` logic).
- Querying Delta Tables with Stream-Aware Logic:
  - Delta tables allow mixing batch and streaming queries. DBT can read the current state of a Delta table after each micro-batch completes (see the sketch after this list).
- Materializing Output:
  - Use DBT to materialize tables/views or write back to Delta tables for downstream batch processing.
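For illustration, here is a minimal sketch of that loop using Spark's `availableNow` trigger (Spark 3.3+ / recent Databricks runtimes; on older versions `trigger(once=True)` plays a similar role). The table names and checkpoint path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Drain everything currently available from the source in one bounded run,
# then stop (table names and checkpoint path are placeholders).
query = (
    spark.readStream.table("raw_events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/streaming_table")
    .trigger(availableNow=True)
    .toTable("streaming_table")
)
query.awaitTermination()

# streaming_table now holds a consistent committed snapshot, so a dbt
# incremental model can be run against it as an ordinary batch job.
```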
4. Limitations
- No Native Streaming in DBT: DBT doesn’t have built-in support for real-time streaming processing (e.g., handling continuous data ingestion).
- Latency Considerations: DBT can work with Delta tables in near-real-time scenarios (via incremental models), but it isn’t suitable for low-latency, real-time requirements.
- Triggering Jobs: Streaming jobs are typically triggered and managed in Databricks, while dbt transformations are triggered as batch jobs.
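To make the "triggered as batch jobs" point concrete, one simple (purely illustrative) pattern is to invoke the dbt CLI from Python once ingestion has finished; the model selector and project path are placeholders:

```python
import subprocess

# Kick off the dbt project as an ordinary batch step after ingestion completes.
# The --select value and --project-dir path are placeholders.
result = subprocess.run(
    [
        "dbt", "run",
        "--select", "transform_streaming_data",
        "--project-dir", "/path/to/dbt_project",
    ],
    capture_output=True,
    text=True,
)

print(result.stdout)
if result.returncode != 0:
    raise RuntimeError(f"dbt run failed:\n{result.stderr}")
```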
5. Example Workflow
- Streaming Job in Databricks:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from a streaming source (e.g., Kafka)
streaming_df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "server:9092") \
    .option("subscribe", "topic") \
    .load()

# Write to a Delta table; the checkpoint location tracks stream progress
streaming_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/delta/checkpoints/streaming") \
    .toTable("streaming_table")
```
- DBT Model for Incremental Processing:
```sql
-- models/transform_streaming_data.sql
-- Assumes 'streaming_table' is declared as a dbt source in a sources .yml file
{{ config(materialized='incremental') }}

WITH new_data AS (

    SELECT *
    FROM {{ source('schema', 'streaming_table') }}

    {% if is_incremental() %}
    -- On incremental runs, only process rows newer than what the target already holds
    WHERE _ingest_timestamp > (SELECT MAX(_ingest_timestamp) FROM {{ this }})
    {% endif %}

)

SELECT
    *,
    CURRENT_TIMESTAMP AS processed_at
FROM new_data
```
- Schedule dbt Jobs: Use a scheduler (e.g., dbt Cloud, Airflow, or Databricks Workflows) to trigger dbt models after the streaming job has written new data to the Delta tables.
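As one hedged example of such a schedule, a bare-bones Airflow DAG (Airflow 2.4+ syntax; the DAG name, cron schedule, job ID, and paths are all placeholders, and a Databricks Workflows job with a dbt task would accomplish the same thing) could chain the ingestion and dbt steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="streaming_to_dbt",            # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule="*/15 * * * *",              # e.g., every 15 minutes
    catchup=False,
) as dag:
    # Step 1: trigger the ingestion job (placeholder command; could be a
    # Databricks job running the availableNow streaming write shown above).
    ingest = BashOperator(
        task_id="ingest_to_delta",
        bash_command="databricks jobs run-now --job-id 123",
    )

    # Step 2: run the dbt incremental models against the freshly loaded Delta tables.
    transform = BashOperator(
        task_id="dbt_run",
        bash_command=(
            "dbt run --select transform_streaming_data "
            "--project-dir /path/to/dbt_project"
        ),
    )

    ingest >> transform
```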
6. When to Use DBT with Streaming Data
- Use dbt for transformations on streaming data that has already been ingested into Delta tables.
- Use Databricks streaming jobs for continuous real-time processing and writing to Delta tables.
- Combine both tools when you need the flexibility of SQL transformations (DBT) alongside the scalability of Spark-based streaming (Databricks).
If you need low-latency real-time transformations, consider tools like Apache Flink or Spark Structured Streaming directly. DBT and Databricks are an excellent combination for batch transformations with near-real-time data.