Scalable-Health-Data-Solutions-GCP

🏥 Scalable Health Data Solutions & Big Data ML Pipelines 🚀

🌟 Project Overview

This project demonstrates a dual-lens approach to Big Data in Healthcare. It bridges high-level Cloud Strategy (GCP) with hands-on Distributed Machine Learning (PySpark) to solve critical industry challenges like early disease detection and treatment cost optimization.

🛠️ Technical Execution: PySpark ML Pipeline 📊

I developed a machine learning workflow with Apache Spark (PySpark) for large-scale data processing. The pipeline automates every step from raw data ingestion to multi-model evaluation.

🧩 Core Engineering Workflow

To ensure data integrity and scalability, I implemented a modular PySpark pipeline:

Feature Engineering & Transformation Workflow

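from pyspark.ml.feature import Imputer, OneHotEncoder, StringIndexer, VectorAssembler
# Fill missing numeric values (Imputer uses the column mean by default)
imputer = Imputer(inputCols=["Age", "Fare"], outputCols=["Age_imputed", "Fare_imputed"])
# Index each categorical column, then one-hot encode the resulting indices
indexers = [StringIndexer(inputCol=c, outputCol=c + "_Index") for c in ["Sex", "Embarked", "Pclass"]]
encoders = [OneHotEncoder(inputCol=c + "_Index", outputCol=c + "_Vec") for c in ["Sex", "Embarked", "Pclass"]]
# Assemble the model inputs into one vector (Fare_imputed is not among them)
assembler = VectorAssembler(inputCols=["Age_imputed", "SibSp", "Parch", "Sex_Vec", "Embarked_Vec", "Pclass_Vec"], outputCol="features")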

🖼️ Model Performance & Data Insights

📈 Model Predictions: Real-time prediction scores
📉 Confusion Matrix: Accuracy breakdown (83.12%)
📋 Data Engineering: Cleaned ETL output

🏆 Key Metrics Summary

Survival Prediction (Random Forest):

🧬 Phase 1: Strategic Healthcare Case Study

💡 Business Problem & Opportunity

🌐 The 3 Vs of Health Data

☁️ Essential Cloud Services (GCP Stack)

I proposed a Google Cloud Platform architecture to support scalable health analytics:

🧰 Tools & Technologies

📜 License

This project is licensed under the MIT License.

👤 Author

Tejashwini Saravanan (LinkedIn)