This project demonstrates a robust, end-to-end Real-Time Data Engineering pipeline for BeatBlast, a digital music streaming platform. It covers the entire lifecycle of big dataβfrom high-velocity data simulation to stateful stream processing and optimized data lake storage.
The goal was to build a system capable of processing near real-time user interaction events (playing, skipping, and liking songs) to derive actionable business insights. Using Apache Spark Structured Streaming, the pipeline identifies trending content and monitors platform engagement while persisting data in a highly organized, partitioned format.
π View the Full Technical Report here
π View the Full Structured Streaming Script (PDF)
A Python-based simulator mimics a live production environment by generating a continuous stream of JSON events.
songPlay, songSkip, songLike, and appOpen.userId, sessionId, platform, country, and high-precision eventTimestamp.The PySpark application reads the JSON stream from the landing zone with a strictly enforced schema.
TimestampType for accurate event-time operations.The pipeline utilizes Sparkβs readStream to ingest JSON events and writeStream to persist data in a partitioned Parquet format.
# Core logic from the Structured Streaming script
query = (
processed_df
.writeStream
.format("parquet")
.option("path", "/content/beatblast_datalake/song_plays")
.option("checkpointLocation", "/content/checkpoints/song_plays")
.partitionBy("year", "month", "day", "country")
.outputMode("append")
.start()
)
To facilitate high-performance analytics, the songPlay event stream is written to a simulated Data Lake in Parquet format.
Partitioning Strategy: /year/month/day/country/
| Window Start | Window End | songId | play_count | | :β | :β | :β | :β | | 2025-07-18 22:50:00 | 2025-07-18 23:00:00 | song_002 | 13 | | 2025-07-18 22:50:00 | 2025-07-18 23:00:00 | song_001 | 12 |
/beatblast_datalake/song_plays/
βββ year=2025/
βββ month=07/
βββ day=18/
βββ country=CA/
βββ country=GB/
βββ country=IN/
βββ country=US/
To explore the technical depth of this project, please refer to the following resources:
Tejashwini Saravanan LinkedIn