Tool: Orange Data Mining | Domain: Healthcare & Clinical Research
This project focuses on the strategic application of Data Mining to enhance Clinical Trial Patient Selection. By leveraging Orange Data Mining software, I developed a pipeline to analyze patient profiles and predict Admission Types (Elective, Urgent, Emergency). In the context of clinical research, predicting admission types is critical: it allows researchers to proactively identify patient stability, screen out highly unstable candidates, reduce trial dropout rates, and ensure overall patient safety.
The project utilizes a structured modular flow in Orange, designed to ensure data integrity and clinical relevance.

Data Preprocessing & Governance
Feature Importance (The “Rank” Node)
I utilized the Rank node to score the impact of each attribute on the target variable (Admission Type) using Information Gain and Gini Decrease. This identified which patient traits, such as medical conditions or insurance providers, served as the strongest indicators of admission urgency.
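The two scorers used by the Rank node can be reproduced in plain Python. The sketch below computes Information Gain and Gini Decrease for one categorical attribute against the Admission Type target; the column names and toy records are hypothetical, not the real dataset schema.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list (in bits)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a class-label list."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def score_attribute(rows, attr, target):
    """Return (information_gain, gini_decrease) of `attr` w.r.t. `target`."""
    labels = [r[target] for r in rows]
    base_h, base_g = entropy(labels), gini(labels)
    # Partition rows by the attribute's value, then take weighted averages
    # of the impurity inside each partition.
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    n = len(rows)
    cond_h = sum(len(g) / n * entropy(g) for g in groups.values())
    cond_g = sum(len(g) / n * gini(g) for g in groups.values())
    return base_h - cond_h, base_g - cond_g

# Toy records with made-up column names, for illustration only.
rows = [
    {"condition": "cardiac", "admission": "Emergency"},
    {"condition": "cardiac", "admission": "Emergency"},
    {"condition": "routine", "admission": "Elective"},
    {"condition": "routine", "admission": "Urgent"},
]
ig, gd = score_attribute(rows, "condition", "admission")
```

An attribute that perfectly separates the classes maximizes both scores; one that is independent of the target scores near zero on both.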
I compared high-performance algorithms including Neural Networks and Gradient Boosting to determine the best predictive fit for the healthcare dataset.
In addition to the Orange outputs, I developed a comprehensive Model Performance Analysis Matrix in Excel. This sheet includes:
In a 3-class classification problem (Urgent, Emergency, Elective), a random guess yields 33.3% accuracy. My models hovered around this baseline, which yields a vital business insight:
Feature Sufficiency: The results prove that demographic and billing data alone are not strong enough predictors for medical admission types in this synthetic dataset.
Clinical Complexity: Real-world emergency admissions are often stochastic (random) events.
Strategic Recommendation: This model serves as a baseline. For real-world implementation, I recommend integrating dynamic Electronic Health Records (EHR) data such as vital signs, lab results, and historical comorbidities to move the predictive AUC well beyond this baseline.
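The 33.3% figure can be sanity-checked with a quick simulation. This stdlib-only sketch uses synthetic, roughly balanced labels (not the project dataset) and compares uniform random guessing against an always-predict-the-majority-class baseline:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is reproducible
classes = ["Elective", "Urgent", "Emergency"]

# Hypothetical true labels, roughly uniform across the three classes.
y_true = [random.choice(classes) for _ in range(10_000)]

# Uniform random guessing: expected accuracy is 1/3 on 3 balanced classes.
y_rand = [random.choice(classes) for _ in y_true]
acc_rand = sum(t == p for t, p in zip(y_true, y_rand)) / len(y_true)

# Majority-class baseline: always predict the most common class.
majority = Counter(y_true).most_common(1)[0][0]
acc_major = sum(t == majority for t in y_true) / len(y_true)
# Both accuracies land near 0.333 when the classes are balanced.
```

Any model that cannot clearly beat both numbers adds no predictive value over guessing, which is exactly the situation the feature-sufficiency point above describes.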
I used ROC Analysis to evaluate the trade-off between “False Positives” and “True Positives.” In healthcare, missing an emergency (False Negative) is more costly than a false alarm; this analysis allows us to tune the model for patient safety.
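That tuning idea can be made concrete with a small, stdlib-only sketch: compute AUC from ranks, then pick the probability cutoff that minimises expected cost when a missed emergency is weighted, arbitrarily for illustration, five times heavier than a false alarm. The labels and scores below are toy values, not model outputs.

```python
def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney rank statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(y_true, scores, fn_cost=5.0, fp_cost=1.0):
    """Pick the score cutoff that minimises total cost when a missed
    emergency (false negative) is fn_cost times worse than a false alarm."""
    best_t, best_c = None, float("inf")
    for t in sorted(set(scores)):
        fn = sum(1 for s, y in zip(scores, y_true) if y == 1 and s < t)
        fp = sum(1 for s, y in zip(scores, y_true) if y == 0 and s >= t)
        cost = fn * fn_cost + fp * fp_cost
        if cost < best_c:
            best_t, best_c = t, cost
    return best_t

# 1 = "Emergency" (positive class), 0 = everything else; toy scores only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
cutoff = best_threshold(y_true, scores)
```

Raising `fn_cost` pushes the chosen cutoff lower, trading more false alarms for fewer missed emergencies, which is the patient-safety direction described above.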
To validate the model findings, I conducted a granular review of the dataset and algorithm performance metrics.
Using Information Gain and Gini Decrease, I identified the primary drivers in the dataset. While the predictive power was balanced across classes, “Billing Amount” and “Blood Type” emerged as the most significant features for categorization.

The ROC curves for Neural Networks and Gradient Boosting hover near the "Random Guess" baseline (~0.50 AUC). This visualization is crucial for healthcare stakeholders to understand that additional clinical data points are required for a deployable model.

The model was trained on 10,000 patient records with a mix of categorical and numeric features, ensuring a robust sample size for cross-validation.
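The cross-validation split itself is simple to sketch. Assuming Orange's default 10-fold setup with a single shuffle (the seed value here is illustrative):

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Shuffle record indices once, then split them into k near-equal
    test folds, as standard k-fold cross-validation does."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(10_000, k=10)

# Each record appears in exactly one test fold; the other nine folds
# form the training set for that round.
test_idx = set(folds[0])
train_idx = [i for i in range(10_000) if i not in test_idx]
# fit on train_idx, score on test_idx; repeat for each of the 10 folds
```

With 10,000 records, every fold holds 1,000 test cases, large enough for stable per-fold accuracy and AUC estimates.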

📂 data/: Raw Patient_Health_Records_Dataset.csv.
📂 workflow/: The Admission_Prediction_Workflow.ows (Orange file).
📂 docs/: Project_Report_and_Analysis.pdf and Model_Performance_Analysis_Matrix.xlsx.
📄 README.md: Project overview and results.
📄 requirements.txt: Software requirements.
🖼️ workflow_preview.png: Visual representation of the data pipeline.
Tejashwini Saravanan LinkedIn