BeatBlast-RealTime-Streaming-Pipeline

BeatBlast: Real-Time Stream Processing & Data Lake Engineering

This project demonstrates a robust, end-to-end Real-Time Data Engineering pipeline for BeatBlast, a digital music streaming platform. It covers the entire lifecycle of big data, from high-velocity data simulation to stateful stream processing and optimized data lake storage.


📌 Project Objective

The goal was to build a system that processes near real-time user interaction events (playing, skipping, and liking songs) to derive actionable business insights. Using Apache Spark Structured Streaming, the pipeline identifies trending content and monitors platform engagement while persisting the data as Parquet files partitioned by date and country.


🛠️ Tech Stack

👉 View the Full Technical Report here

👉 View the Full Structured Streaming Script (PDF)


🏗️ Data Pipeline Architecture

1. Real-Time Data Simulation (Producer)

A Python-based simulator mimics a live production environment by generating a continuous stream of JSON events.
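A minimal sketch of what such a simulator might look like is shown here. Only `songId`, `country`, and the `songPlay` event name appear elsewhere in this README; the other field names, the event-type list, the song catalogue, and the file naming are assumptions made for illustration, not the project's actual simulator.

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical values; the real simulator's catalogue and fields may differ.
EVENT_TYPES = ["songPlay", "songSkip", "songLike"]
SONG_IDS = ["song_001", "song_002", "song_003"]
COUNTRIES = ["CA", "GB", "IN", "US"]


def make_event(rng: random.Random) -> dict:
    """Build one synthetic user-interaction event."""
    return {
        "eventId": str(uuid.uuid4()),
        "eventType": rng.choice(EVENT_TYPES),
        "userId": f"user_{rng.randint(1, 500):04d}",
        "songId": rng.choice(SONG_IDS),
        "country": rng.choice(COUNTRIES),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
    }


def write_batch(landing_dir: Path, batch_size: int, rng: random.Random) -> Path:
    """Write one newline-delimited JSON file into the landing zone."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    final_path = landing_dir / f"events_{uuid.uuid4().hex}.json"
    # Write under a hidden temporary name first, then rename, so a reader
    # polling the directory is less likely to pick up a half-written file.
    tmp_path = landing_dir / f".{final_path.name}.tmp"
    lines = [json.dumps(make_event(rng)) for _ in range(batch_size)]
    tmp_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    tmp_path.rename(final_path)
    return final_path


def run(landing_dir: Path, batches: int, batch_size: int = 20, pause_s: float = 1.0) -> None:
    """Emit a fixed number of batches, pausing between them."""
    rng = random.Random()
    for _ in range(batches):
        write_batch(landing_dir, batch_size, rng)
        time.sleep(pause_s)
```

Calling `run(Path("landing_zone"), batches=10)` would drop ten small JSON files into a local `landing_zone` directory, one per second, which a streaming file source can then pick up.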

2. Stream Processing Engine

The PySpark application reads the JSON stream from the landing zone with a strictly enforced schema.
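As a rough sketch, a schema-enforced read could look like the following. The field list is an assumption (only `songId` and `country` are named in this README), as is the timestamp format; `spark` is an existing SparkSession passed in by the caller.

```python
# Hypothetical event schema written as a Spark DDL string.
# Field names other than songId and country are assumptions.
EVENT_SCHEMA_DDL = (
    "eventType STRING, userId STRING, songId STRING, "
    "country STRING, timestamp TIMESTAMP"
)


def read_event_stream(spark, landing_dir: str):
    """Return a streaming DataFrame over the JSON files in landing_dir.

    Supplying the schema up front means Spark does not infer it from the
    data: fields outside the schema are dropped, and values that cannot be
    parsed into the declared type become null under the default
    permissive parsing mode.
    """
    return (
        spark.readStream
        .schema(EVENT_SCHEMA_DDL)
        # Assumed timestamp layout; adjust to match the producer.
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
        .json(landing_dir)
    )
```

Passing the schema as a DDL string keeps the example short; an explicit `StructType` would work equally well and makes nullability visible.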

3. Windowed Aggregations


💻 Core Streaming Logic

The pipeline uses Spark's readStream to ingest JSON events and writeStream to persist data in a partitioned Parquet format.

# Core logic from the Structured Streaming script
query = (
    processed_df
    .writeStream
    .format("parquet")
    .option("path", "/content/beatblast_datalake/song_plays")
    .option("checkpointLocation", "/content/checkpoints/song_plays")
    .partitionBy("year", "month", "day", "country")
    .outputMode("append")
    .start()
)
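The `partitionBy` call above needs `year`, `month`, `day`, and `country` columns to exist on `processed_df`. The README does not show how they are derived; one possible way, assuming the events carry a `timestamp` column of type TIMESTAMP, is sketched here with Spark SQL expressions.

```python
# Spark SQL expressions that derive zero-padded string partition columns.
# The source column name `timestamp` is an assumption.
PARTITION_EXPRS = [
    "date_format(`timestamp`, 'yyyy') AS `year`",
    "date_format(`timestamp`, 'MM') AS `month`",
    "date_format(`timestamp`, 'dd') AS `day`",
]


def add_partition_columns(events_df):
    """Return events_df with year/month/day columns appended.

    events_df is expected to be a Spark DataFrame that already has
    `timestamp` and `country` columns.
    """
    return events_df.selectExpr("*", *PARTITION_EXPRS)
```

Using `date_format` produces strings such as `07`, so the directory names come out zero-padded; integer columns from `year()`/`month()` would instead yield names like `month=7`.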

📂 Data Lake Design & Partitioning

To facilitate high-performance analytics, the songPlay event stream is written to a simulated Data Lake in Parquet format.

Partitioning Strategy: /year/month/day/country/
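To illustrate why this layout helps analytics, the following pure Python sketch mimics partition pruning: a query that filters on a partition column only needs to open the directories whose `key=value` segments match. This is a simplified illustration with hypothetical helper names, not how Spark implements pruning internally.

```python
def parse_partition(path: str) -> dict:
    """Extract key=value segments from a Hive-style partition path."""
    segments = path.strip("/").split("/")
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)


def prune(paths, **wanted):
    """Keep only the partition directories that match every filter."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]


partitions = [
    "year=2025/month=07/day=18/country=CA",
    "year=2025/month=07/day=18/country=GB",
    "year=2025/month=07/day=18/country=IN",
    "year=2025/month=07/day=18/country=US",
]

# A query restricted to Canada touches one directory out of four.
canada_only = prune(partitions, country="CA")
```

The order of the partition columns matters for this kind of access: with date first and country last, a date-range query narrows the search early, while a country-only query still has to visit every day directory.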


🧠 Technical Justifications


🚀 Key Results & Output

| Window Start | Window End | songId | play_count |
| :--- | :--- | :--- | :--- |
| 2025-07-18 22:50:00 | 2025-07-18 23:00:00 | song_002 | 13 |
| 2025-07-18 22:50:00 | 2025-07-18 23:00:00 | song_001 | 12 |
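Both sample rows fall in a window running from 22:50:00 to 23:00:00, which is consistent with a 10-minute tumbling window counted per `songId`. The window length is inferred from these rows rather than stated in the README. In Spark this kind of aggregation is typically written with the `window` function inside a `groupBy`; the pure Python sketch below only illustrates how an event is assigned to its bucket and counted.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # assumed window length
EPOCH = datetime(1970, 1, 1)


def window_bounds(ts: datetime, size: timedelta = WINDOW):
    """Return (start, end) of the epoch-aligned tumbling window containing ts."""
    start = ts - (ts - EPOCH) % size
    return start, start + size


def count_plays(events, size: timedelta = WINDOW) -> Counter:
    """Count plays per (window_start, window_end, song_id).

    events is an iterable of (timestamp, song_id) pairs.
    """
    counts = Counter()
    for ts, song_id in events:
        start, end = window_bounds(ts, size)
        counts[(start, end, song_id)] += 1
    return counts
```

The window is half-open: an event stamped exactly 23:00:00 belongs to the next window, not the one ending at 23:00:00.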

Storage Structure Result

/beatblast_datalake/song_plays/
 └── year=2025/
     └── month=07/
         └── day=18/
             ├── country=CA/
             ├── country=GB/
             ├── country=IN/
             └── country=US/

📂 Project Resources

To explore the technical depth of this project, please refer to the following resources:

👀 Author

Tejashwini Saravanan LinkedIn