wine-reviews-text-mining-rapidminer

Wine Reviews: Text Mining & Analysis

Uncovering patterns in 130,000+ wine reviews through data mining techniques


[![RapidMiner](https://img.shields.io/badge/Tool-RapidMiner-FFD700?style=for-the-badge)](https://rapidminer.com/) [![License: MIT](https://img.shields.io/badge/License-MIT-black?style=for-the-badge)](LICENSE) [![Dataset](https://img.shields.io/badge/Dataset-Kaggle-20BEFF?style=for-the-badge&logo=kaggle&logoColor=white)](https://www.kaggle.com/datasets/zynicide/wine-reviews) [![Status](https://img.shields.io/badge/Status-Completed-brightgreen?style=for-the-badge)]() [![Academic](https://img.shields.io/badge/Course-ISM6359%20Data%20Mining-8A2BE2?style=for-the-badge)]()

A graduate-level text mining project applying Correlation Analysis, Association Rule Mining, and K-Means Clustering to a dataset of 130K+ wine reviews collected via Twitter, built entirely in RapidMiner Studio.



Project Overview

This project applies text mining and data analysis techniques to a large corpus of wine reviews collected via Twitter. The goal is to uncover hidden patterns in wine data, from how price relates to quality to the most common grape blends and geographic clusters.

The analysis explores three core research questions:


Repository Structure

wine-reviews-text-mining-rapidminer/
│
├── Wine_Reviews.xlsx          # Dataset (sourced from Kaggle)
├── Wine_Review_.rmp           # RapidMiner process file
├── Wine_Reviews.pdf           # Full project presentation (slides)
│
└── visualizations/
    ├── process .png            # RapidMiner workflow diagram
    ├── Data From Process .png  # Clustered output data
    ├── statistics .png         # Dataset statistics overview
    ├── correlation Matrix .png # Correlation matrix (points vs price)
    ├── bell curve - plot .png  # Price distribution (bell curve)
    ├── Histogram .png          # Points frequency by taster
    ├── box plot .png           # Points distribution by province
    ├── scatter plot .png       # Winery distribution by region
    └── Cluster Model .png      # K-Means clustering result

Dataset

| Attribute | Details |
|---|---|
| Source | Kaggle: Wine Reviews |
| Records | ~9,000 filtered records (original: 130K+) |
| Attributes | 14 total (variety, location, winery, price, points, description, taster Twitter handle, etc.) |
| Numerical | points (0–100 scale), price (USD) |
| Text | description (free-text taster reviews from Twitter) |

Key Statistics:

| Metric | Points | Price |
|---|---|---|
| Min | 80 | $4 |
| Max | 100 | $2,013 |
| Average | 89.15 | $37.58 |
| Std Dev | 2.83 | $27.25 |

Missing values handled via the Replace Missing Values operator in RapidMiner.
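As a rough illustration of what this step does, the sketch below fills missing numeric values with the column average. The README does not state which replacement strategy was configured in the operator, so average replacement is an assumption here, and the function name and toy records are hypothetical.

```python
from statistics import mean


def replace_missing_with_mean(rows, column):
    """Return copies of `rows` where None in `column` is replaced by the column mean.

    Assumption: average replacement; the strategy used in the project is not documented.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = mean(observed)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical toy records, not taken from the dataset.
reviews = [
    {"points": 90, "price": 40.0},
    {"points": 88, "price": None},
    {"points": 92, "price": 60.0},
]
cleaned = replace_missing_with_mean(reviews, "price")
```

The input list is left untouched, so the original and imputed values can be compared side by side.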


Tools & Technologies

| Tool | Purpose |
|---|---|
| RapidMiner Studio | End-to-end data mining pipeline |
| Microsoft Excel | Dataset storage and initial review |
| Kaggle | Dataset source |

RapidMiner Process Pipeline

The full process was built in RapidMiner Studio and consists of three parallel sub-pipelines:

1. Correlation Analysis Pipeline

Retrieve Wine Reviews → Select Attributes → Filter Examples → Correlation Matrix

Analyzes the statistical relationship between points and price.

2. Association Rule Mining Pipeline

Retrieve Wine Reviews → Replace Missing Values → Select Attributes →
Process Documents → Numerical to Binominal → FP-Growth → Create Association Rules

Mines frequent itemsets from review text to discover grape variety patterns.

3. Clustering Analysis Pipeline

Retrieve Wine Reviews → Replace Missing Values → Select Attributes →
Process Documents → Clustering (K-Means)

Groups wines into natural clusters based on descriptive language in reviews.
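The idea behind this pipeline can be sketched outside RapidMiner in a few lines of plain Python. This is a minimal illustration, not the project's process: it assumes raw term counts as features (the weighting used by Process Documents in the project is not stated), uses the first k documents as starting centroids so the run is deterministic, and the review texts are hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return text.lower().split()


def vectorize(documents):
    """Turn each document into a term-count vector over a shared vocabulary."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    return [[float(tokens.count(word)) for word in vocabulary] for tokens in tokenized]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20):
    """Plain K-Means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical review snippets, not taken from the dataset.
docs = [
    "citrus lemon lime crisp",
    "berry spice cherry pepper",
    "lemon citrus zest crisp",
    "cherry berry spice oak",
]
labels = k_means(vectorize(docs), k=2)
```

On this toy input the two citrus-flavored snippets end up in one cluster and the two berry-and-spice snippets in the other.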



Full RapidMiner process pipeline across all three analyses

Analysis & Results

Correlation Analysis

A Pearson Correlation Matrix was computed between points and price.

| Attributes | Points | Price |
|---|---|---|
| Points | 1.000 | 0.430 |
| Price | 0.430 | 1.000 |

Finding: Points and price show a weak-to-moderate positive correlation (r = 0.43): higher-rated wines tend to cost more, but price alone is not a strong predictor of quality. In a linear fit, r² ≈ 0.18, so price accounts for roughly 18% of the variance in points.
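For readers who want to reproduce the coefficient outside RapidMiner, the following is a minimal sketch of the Pearson correlation computed from its definition. The function name and the sample values are illustrative assumptions; only the reported r = 0.43 comes from the project.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical sample values, not taken from the dataset.
points = [85, 88, 90, 92, 95]
prices = [15.0, 30.0, 25.0, 60.0, 55.0]
r = pearson_r(points, prices)

# Share of variance explained by a linear fit for the reported coefficient.
reported_r = 0.43
explained_variance = reported_r ** 2  # about 0.18
```

Squaring the coefficient is what turns "r = 0.43" into a statement about explained variance, which is why the relationship is weaker than the raw number may suggest.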


Correlation matrix: points vs price

Price Distribution

A bell curve of wine prices for US wines shows a near-normal distribution centered around $30–$40, confirming that most wines fall in the affordable-to-mid range.


Price distribution bell curve, US wines

Points by Province (Box Plot)

Comparing wine quality scores across US provinces:

| Province | Median Points | Notes |
|---|---|---|
| California | ~90 | Highest median |
| Oregon | ~89 | Strong performer |
| Washington | ~89 | Consistent quality |
| New York | ~84 | Lower range |

Finding: California has the highest median score (~90) among the four US provinces compared.
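The comparison behind the box plot comes down to grouping scores by province and taking the median of each group. A small sketch, assuming records shaped as dictionaries with `province` and `points` keys (the field names mirror the dataset attributes, but the function and the sample rows are hypothetical):

```python
from collections import defaultdict
from statistics import median


def median_points_by_province(rows):
    """Group scores by province and return the median score of each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["province"]].append(row["points"])
    return {province: median(scores) for province, scores in grouped.items()}


# Hypothetical sample rows, not taken from the dataset.
sample = [
    {"province": "California", "points": 88},
    {"province": "California", "points": 90},
    {"province": "California", "points": 93},
    {"province": "New York", "points": 83},
    {"province": "New York", "points": 85},
]
medians = median_points_by_province(sample)
```

The median is less sensitive to a handful of very high or very low scores than the mean, which makes it a reasonable summary for comparing provinces with different review counts.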


Points distribution by province (box plot)

Association Rule Mining

Using the FP-Growth algorithm, frequent patterns in wine descriptions were extracted. Top association rules discovered:

| Rule | Confidence |
|---|---|
| [sparkling] → [blend] | 0.955 |
| [noir] → [pinot] | 0.997 |
| [bordeaux] → [blend] | 1.000 |
| [style, bordeaux] → [blend] | 1.000 |
| [blanc] → [sauvignon] | 0.784 |

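Finding: The rules [bordeaux] → [blend] and [style, bordeaux] → [blend] both reach a confidence of 1.000, meaning every review that mentions "bordeaux" also mentions "blend". This makes Bordeaux Blend the most consistently identified grape blend among the mined rules.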
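To make the confidence values concrete, the sketch below computes support and confidence directly from sets of terms. It is not an FP-Growth implementation; it only shows how a rule's confidence is derived once term presence per review is known. The function names and sample term sets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for terms in transactions if wanted <= terms) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    joint_support = support(transactions, set(antecedent) | set(consequent))
    return joint_support / antecedent_support


# Hypothetical term sets, one per review, not taken from the dataset.
review_terms = [
    {"bordeaux", "style", "blend"},
    {"bordeaux", "blend"},
    {"red", "blend"},
    {"pinot", "noir"},
]
forward = confidence(review_terms, {"bordeaux"}, {"blend"})
reverse = confidence(review_terms, {"blend"}, {"bordeaux"})
```

In this toy data the forward rule has confidence 1.0 while the reverse rule does not, which shows that confidence is directional and says nothing by itself about how often a term appears overall; that is what support measures.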


K-Means Clustering

The wine reviews were grouped into 5 distinct clusters based on text description features:

| Cluster | Size | Flavor Profile |
|---|---|---|
| Cluster 0 | 32,235 | General / Mixed |
| Cluster 1 | 21 | Citrus notes |
| Cluster 2 | 15 | Fruity |
| Cluster 3 | 61 | Berry / Spice |
| Cluster 4 | 81 | Brand-focused |

Performance Vector:

   
Cluster model output (left) and clustered data from process (right)

Additional Visualizations

Histogram: Points Frequency by Taster

Most reviews fall between 87–92 points, with @vboone and @mattkettmann being the most prolific reviewers.


Points frequency histogram by taster Twitter handle

Scatter Plot: Winery Distribution by Region

Shows the geographic spread of wineries across regions, with Willamette Valley and California having the highest density.


Winery distribution across US wine regions

Dataset Statistics

Full attribute-level statistics including missing values, min/max, and distributions.


RapidMiner dataset statistics panel

Key Findings Summary

| Research Question | Finding |
|---|---|
| Price vs. Rating | Weak-to-moderate correlation (r = 0.43); price doesn't guarantee quality |
| Best Province | California has the highest median points (~90) among US provinces |
| Most Common Grape Blend | Bordeaux Blend dominates association rules |
| Wine Clusters | 5 clusters identified: General / Mixed, Citrus, Fruity, Berry / Spice, Brand-focused |

Insights & Business Value

This analysis provides actionable insights for wine industry stakeholders:


Limitations & Future Work

What could be improved:


Author

Tejashwini Saravanan

📧 saravanant@spu.edu  |  🎓 Master's Program, ISM6359 Data Mining [![GitHub](https://img.shields.io/badge/GitHub-TejashwiniSaravanan-181717?style=flat-square&logo=github)](https://github.com/TejashwiniSaravanan)

License

This project was completed as part of academic coursework. The dataset is publicly available on Kaggle under its original license. This repository is licensed under the MIT License.


Built with RapidMiner Studio  |  ISM6359 Data Mining  |  Master's Program