BI - Data Warehouse/ Lake/ Lakehouse

Walkthrough of a Databricks Real-Time Data Processing Pipeline

December 18, 2024

148

Pipeline Overview:

The goal of the pipeline was to process real-time telemetry data from industrial IoT (IIoT) sensors for predictive maintenance and anomaly detection. The data included temperature, pressure, and vibration metrics from hundreds of sensors.

Steps:

Data Ingestion:
- Data was ingested using Apache Kafka topics. Each topic represented a sensor type (e.g., temperature, vibration).
- Kafka brokers distributed the high-velocity sensor data across multiple partitions for scalability.
kafka_df = ( spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "kafka_broker:9092") .option("subscribe", "sensor-data") .load() )
Schema Parsing and Transformation:
- Raw sensor data was serialized as JSON. We parsed it into a structured format for processing.
- Example fields: sensor_id, timestamp, metric_type, value.
from pyspark.sql.functions import from_json, col from pyspark.sql.types import StructType, StringType, FloatType, TimestampType schema = StructType([ StructField("sensor_id", StringType(), True), StructField("timestamp", TimestampType(), True), StructField("metric_type", StringType(), True), StructField("value", FloatType(), True) ]) parsed_df = kafka_df.select( from_json(col("value").cast("string"), schema).alias("data") ).select("data.*")
Anomaly Detection:
- Applied custom logic to detect anomalies, e.g., flagging if the temperature exceeded a threshold.
- Used Spark SQL for time-window-based aggregations.
from pyspark.sql.functions import window anomaly_df = parsed_df.filter(col("metric_type") == "temperature").filter(col("value") > 100) anomalies_with_window = ( anomaly_df.groupBy(window("timestamp", "1 minute"), "sensor_id") .count() .filter(col("count") > 5) # Example threshold )
Output Storage:
- Anomalies were written to an Azure Data Lake for long-term storage and an Azure Event Hub for alert notifications.
anomalies_with_window.writeStream \ .format("delta") \ .option("checkpointLocation", "/mnt/checkpoints/anomalies") \ .start("/mnt/datalake/anomalies")
Visualization:
- Data was fed into Power BI for anomaly trend visualization, using Azure Synapse Analytics as the integration layer.

2. Trade-Offs: Micro-Batch vs. Continuous Processing in Spark Streaming

Micro-Batch Processing:

How It Works: Data is processed in small, discrete time intervals (e.g., every second or minute).
Pros:
- Easier to implement and debug.
- Can use existing batch processing logic.
- Achieves near real-time latency with less complexity.
Cons:
- Slight latency due to batching intervals.
- Inefficient for low-latency use cases like fraud detection or industrial automation.

Continuous Processing:

How It Works: Data is processed record-by-record as it arrives.
Pros:
- Lowest latency, ideal for real-time use cases.
- Suitable for event-driven architectures.
Cons:
- Harder to debug and maintain.
- Requires careful resource tuning for scalability.
- Supported in Spark Streaming, but not widely adopted due to implementation complexity.

Trade-Off:

For most use cases, micro-batching strikes a balance between simplicity and performance. Continuous processing is chosen only for ultra-low-latency requirements, like stock trading or fraud detection.

Here is a sample coding difference between Micro-Batch and Continuous Processing in Spark Streaming, using PySpark as an example.

2.1. Micro-Batch Processing

Micro-batch processing divides the data into small, fixed-duration batches for processing.

Code Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create SparkSession
spark = SparkSession.builder \
    .appName("MicroBatchExample") \
    .getOrCreate()

# Define streaming DataFrame (e.g., reading from Kafka topic)
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic_name") \
    .load()

# Transformation: Process streaming data
transformed_df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .withColumn("value_length", col("value").cast("string").length())

# Write output in micro-batches (e.g., to console or file sink)
query = transformed_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .trigger(processingTime="5 seconds") \  # Micro-batch trigger interval
    .start()

query.awaitTermination()

Key Points:

Uses trigger(processingTime="5 seconds") to process in fixed intervals (5 seconds).
Data is collected and processed in small, discrete batches.

2.2 Continuous Processing

Continuous processing (introduced in Spark 2.3) processes data with very low latency in a near-real-time fashion, skipping micro-batch intervals.

Code Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create SparkSession
spark = SparkSession.builder \
    .appName("ContinuousProcessingExample") \
    .getOrCreate()

# Define streaming DataFrame (e.g., reading from Kafka topic)
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic_name") \
    .load()

# Transformation: Process streaming data
transformed_df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .withColumn("value_length", col("value").cast("string").length())

# Write output in continuous mode
query = transformed_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .trigger(continuous="1 second") \  # Continuous processing
    .start()

query.awaitTermination()

Key Points:

Uses trigger(continuous="1 second") for continuous streaming with a low-latency trigger.
Supports only append output mode.
Continuous mode has lower latency but fewer supported sinks and transformations compared to micro-batch.

Key Differences in Code:

Feature	Micro-Batch	Continuous Processing
Trigger Configuration	`trigger(processingTime="X seconds")`	`trigger(continuous="X seconds")`
Latency	Higher latency (batch intervals)	Lower latency (continuous flow)
Output Modes	Supports `append`, `complete`, `update`	Only supports `append`
Use Case	Suitable for batch-like processing	Best for low-latency real-time use

Choose the mode depending on your latency requirements and support for transformations/sinks.

3. Databricks Integration with Azure Services

Azure Data Lake:

Databricks seamlessly integrates with Azure Data Lake Storage Gen2 for storing structured and unstructured data.
Used Delta Lake format to ensure ACID compliance and efficient querying.
Example: Storing processed telemetry data for historical analysis. delta_df.write.format("delta").mode("append").save("abfss://<container>@<storage_account>.dfs.core.windows.net/sensor-data")

Azure Event Hub:

Used for real-time alert notifications when anomalies were detected.
Databricks wrote the alert data directly to Event Hub. alerts_df.writeStream \ .format("eventhubs") \ .option("eventhubs.connectionString", "<EventHubConnectionString>") \ .start()

Azure Synapse Analytics:

Processed data was fed into Synapse for integration with Power BI dashboards. data.write.format("com.databricks.spark.sqldw") \ .option("url", "<Synapse_SQL_Endpoint>") \ .option("tempDir", "abfss://<temp_dir>.dfs.core.windows.net/") \ .save()

4. Fault Tolerance and Scalability in a Kafka-Based System

Fault Tolerance:

Replication:
- Kafka replicates partitions across brokers, ensuring data availability during broker failures.
Consumer Offsets:
- Stored in Kafka or an external database (e.g., ZooKeeper), consumers can resume from the last processed record.
Checkpointing:
- Spark Streaming checkpoints data to persistent storage (e.g., Azure Blob) to recover state after failures. stream_df.writeStream \ .format("delta") \ .option("checkpointLocation", "/mnt/checkpoints/sensor-data") \ .start()

Scalability:

Partitioning:
- Distribute sensor data across Kafka partitions for parallel processing by Spark.
Scaling Consumers:
- Multiple Spark executors (or microservices) consume data from different partitions concurrently.
Compression:
- Use Kafka compression (e.g., Snappy) to optimize storage and reduce network bandwidth.

5. Real-World Example of Event-Driven Architecture

Example: Predictive Maintenance in Manufacturing

Event Source:
- IIoT sensors emit real-time events (e.g., temperature, vibration) to Kafka topics.
Event Processing:
- Anomaly detection logic in Databricks processes these events.
Trigger Actions:
- If an anomaly is detected, the pipeline sends alerts to Azure Event Hub, which triggers:
  - A Logic App will notify the maintenance team.
  - A Power Automate workflow to create a ticket in a maintenance management system.
Feedback Loop:
- Sensor data, anomalies, and outcomes are fed back into machine learning models for continuous improvement.

6. How IoT Edge Enhances Event-Driven Architectures

Key Benefits:

Local Processing:
- IoT Edge processes data locally on devices, reducing latency and dependency on cloud connectivity.
- Example: Perform real-time anomaly detection directly on manufacturing equipment.
Event Filtering:
- Filters data before sending to the cloud, reducing bandwidth usage.
Offline Capability:
- IoT Edge stores events locally offline and syncs with the cloud when connectivity is restored.
Custom Modules:
- Developers can deploy custom Docker containers (e.g., ML models) on IoT Edge for event-driven processing.

Example:

An IoT Edge device in a factory detects abnormal vibrations using an on-device ML model. If an anomaly is found:

An alert is sent to the Kafka pipeline for cloud processing.
Local actions (e.g., stopping the machine) are triggered instantly without waiting for cloud input.

Walkthrough of a Databricks Real-Time Data Processing Pipeline

Pipeline Overview:

Steps:

2. Trade-Offs: Micro-Batch vs. Continuous Processing in Spark Streaming

Micro-Batch Processing:

Continuous Processing:

Trade-Off:

2.1. Micro-Batch Processing

Code Example:

2.2 Continuous Processing

Code Example:

Key Differences in Code:

3. Databricks Integration with Azure Services

Azure Data Lake:

Azure Event Hub:

Azure Synapse Analytics:

4. Fault Tolerance and Scalability in a Kafka-Based System

Fault Tolerance:

Scalability:

5. Real-World Example of Event-Driven Architecture

Example: Predictive Maintenance in Manufacturing

6. How IoT Edge Enhances Event-Driven Architectures

Key Benefits:

Example:

EDITOR PICKS

Estimation for Agile Developers While Status Reporting to Waterfall Managers

5 Major Reasons Why So Many Companies Fail At Social Media

Best Practices for Distributed Or Remote Teams in the Age of...

POPULAR POSTS

How to use business objects @Prompt Variable to build flexible universes...

How to Merge Data from Multiple Data Providers in WEBIntelligence (webi)

How to Calculate Number Of Days in a Month or Month...

POPULAR CATEGORY

Increase your Performance using a Broadcast Join in Apache Spark

Transform Your Business with a Powerful Data Model