A graduate-level text mining project applying Correlation Analysis, Association Rule Mining, and K-Means Clustering to a dataset of 130K+ wine reviews collected via Twitter, built entirely in RapidMiner Studio.
This project applies text mining and data analysis techniques to a large corpus of wine reviews collected via Twitter. The goal is to uncover hidden patterns in wine data, from how price relates to quality to the most common grape blends and geographic clusters.
The analysis explores three core research questions:
wine-reviews-text-mining-rapidminer/
│
├── Wine_Reviews.xlsx               # Dataset (sourced from Kaggle)
├── Wine_Review_.rmp                # RapidMiner process file
├── Wine_Reviews.pdf                # Full project presentation (slides)
│
└── visualizations/
    ├── process .png                # RapidMiner workflow diagram
    ├── Data From Process .png      # Clustered output data
    ├── statistics .png             # Dataset statistics overview
    ├── correlation Matrix .png     # Correlation matrix (points vs price)
    ├── bell curve - plot .png      # Price distribution (bell curve)
    ├── Histogram .png              # Points frequency by taster
    ├── box plot .png               # Points distribution by province
    ├── scatter plot .png           # Winery distribution by region
    └── Cluster Model .png          # K-Means clustering result
| Attribute | Details |
|---|---|
| Source | Kaggle: Wine Reviews |
| Records | ~9,000 filtered records (original: 130K+) |
| Attributes | 14 total (variety, location, winery, price, points, description, taster Twitter handle, etc.) |
| Numerical | points (0–100 scale), price (USD) |
| Text | description (free-text taster reviews from Twitter) |
Key Statistics:
| Metric | Points | Price |
|---|---|---|
| Min | 80 | $4 |
| Max | 100 | $2,013 |
| Average | 89.15 | $37.58 |
| Std Dev | 2.83 | $27.25 |
Missing values handled via the Replace Missing Values operator in RapidMiner.
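The README does not say which replacement strategy was configured in the operator. As a rough illustration only, the sketch below assumes numeric gaps are filled with the column average; the function name and the toy rows are hypothetical and not taken from the project.

```python
from statistics import mean

def replace_missing_with_mean(rows, column):
    """Return copies of rows with None in `column` replaced by the column mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

# Hypothetical toy rows, not real records from Wine_Reviews.xlsx.
reviews = [
    {"points": 90, "price": 40.0},
    {"points": 88, "price": None},
    {"points": 86, "price": 20.0},
]
cleaned = replace_missing_with_mean(reviews, "price")
```

The input rows are left unchanged; only the returned copies carry the filled value.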
| Tool | Purpose |
|---|---|
| RapidMiner Studio | End-to-end data mining pipeline |
| Microsoft Excel | Dataset storage and initial review |
| Kaggle | Dataset source |
The full process was built in RapidMiner Studio and consists of three parallel sub-pipelines:
Retrieve Wine Reviews → Select Attributes → Filter Examples → Correlation Matrix
Analyzes the statistical relationship between points and price.
Retrieve Wine Reviews → Replace Missing Values → Select Attributes → Process Documents → Numerical to Binominal → FP-Growth → Create Association Rules
Mines frequent itemsets from review text to discover grape variety patterns.
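To make the Process Documents and Numerical to Binominal steps more concrete, here is a minimal Python sketch of the same idea: each review becomes a row of true/false flags, one per term. The tokenizer, the function names, and the two sample sentences are assumptions for illustration; the project's RapidMiner settings (stop words, pruning, and so on) are not documented here.

```python
import re

def tokenize(text):
    """Lowercase a review and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_binominal(documents):
    """Map each document to a {term: True/False} row over the shared vocabulary."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_sets))
    rows = [{term: term in tokens for term in vocabulary} for tokens in token_sets]
    return vocabulary, rows

# Hypothetical review snippets, not taken from the dataset.
docs = ["Pinot Noir with cherry notes", "Bordeaux-style blend, ripe cherry"]
vocabulary, rows = to_binominal(docs)
```

Rows in this presence/absence form are the kind of input a frequent-itemset miner such as FP-Growth works on.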
Retrieve Wine Reviews → Replace Missing Values → Select Attributes → Process Documents → Clustering (K-Means)
Groups wines into natural clusters based on descriptive language in reviews.
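For readers unfamiliar with K-Means, the following is a small, self-contained sketch of the standard algorithm (Lloyd's iteration) on toy term-count vectors. It is not the RapidMiner operator: the two-term vocabulary, the vectors, `k=2`, and the fixed seed are all hypothetical, and the project's actual vector weighting is not stated in this README.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (cluster index per vector, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical counts of the terms ["citrus", "berry"] in four reviews.
vectors = [[3, 0], [2, 0], [0, 3], [0, 2]]
assignments, centroids = k_means(vectors, k=2)
```

On this toy input the two citrus-heavy reviews end up in one cluster and the two berry-heavy reviews in the other; which cluster gets index 0 depends on the random initialization.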
A Pearson Correlation Matrix was computed between points and price.
| Attributes | Points | Price |
|---|---|---|
| Points | 1.000 | 0.430 |
| Price | 0.430 | 1.000 |
Finding: Points and price show a weak-to-moderate positive correlation (r = 0.43), meaning higher-rated wines tend to cost more, but price alone is not a strong predictor of quality.
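As a reference for how a Pearson coefficient like the one above is computed, here is a minimal sketch in plain Python. The five (points, price) pairs are hypothetical and will not reproduce the reported 0.43; only the final line uses a number from this project.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical sample, not taken from the dataset.
points = [85, 88, 90, 92, 95]
price = [15, 40, 25, 60, 50]
r = pearson(points, price)

# Squaring the reported coefficient gives the share of variance
# a straight-line fit would account for.
r_squared_reported = 0.43 ** 2
```

Squaring the reported r = 0.43 gives roughly 0.18, so a linear fit on price would account for only about 18% of the variance in points.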
A bell curve of wine prices for US wines shows a near-normal distribution centered around $30–$40, confirming that most wines fall in the affordable-to-mid range.
Comparing wine quality scores across US provinces:
| Province | Median Points | Notes |
|---|---|---|
| California | ~90 | Highest median |
| Oregon | ~89 | Strong performer |
| Washington | ~89 | Consistent quality |
| New York | ~84 | Lower range |
Finding: California has the highest median score (~90 points) among the US provinces compared.
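The per-province medians in the table were read off a box plot. A comparable grouping could be sketched in Python as below; the function name and the six sample rows are hypothetical and chosen only so the example is easy to check by hand.

```python
from collections import defaultdict
from statistics import median

def median_points_by_province(rows):
    """Group (province, points) pairs and return the median points per province."""
    grouped = defaultdict(list)
    for province, points in rows:
        grouped[province].append(points)
    return {province: median(values) for province, values in grouped.items()}

# Hypothetical rows, not real records from the dataset.
rows = [
    ("California", 91), ("California", 89), ("California", 90),
    ("New York", 84), ("New York", 85), ("New York", 83),
]
medians = median_points_by_province(rows)
```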
Using the FP-Growth algorithm, frequent patterns in wine descriptions were extracted. Top association rules discovered:
| Rule | Confidence |
|---|---|
| [sparkling] → [blend] | 0.955 |
| [noir] → [pinot] | 0.997 |
| [bordeaux] → [blend] | 1.000 |
| [style, bordeaux] → [blend] | 1.000 |
| [blanc] → [sauvignon] | 0.784 |
Finding: Bordeaux Blend is the most strongly associated grape blend in the wine descriptions: both rules with "bordeaux" in the premise predict "blend" with a confidence of 1.000.
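The confidence of a rule A → B is the support of A and B together divided by the support of A. The sketch below shows that arithmetic on four hypothetical term sets; the function names and the data are illustrative and are not the output of the RapidMiner process.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, premise, conclusion):
    """support(premise and conclusion) / support(premise)."""
    joint = support(transactions, set(premise) | set(conclusion))
    return joint / support(transactions, premise)

# Hypothetical term sets, one per review.
transactions = [
    {"bordeaux", "style", "blend"},
    {"bordeaux", "blend"},
    {"pinot", "noir"},
    {"pinot", "gris"},
]
bordeaux_rule = confidence(transactions, {"bordeaux"}, {"blend"})
pinot_rule = confidence(transactions, {"pinot"}, {"noir"})
bordeaux_support = support(transactions, {"bordeaux"})
```

A confidence of 1.000 only says that every review containing the premise also contains the conclusion. How often the premise occurs at all is measured by support, which is worth reading alongside confidence. This sketch does not guard against a premise with zero support.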
The wine reviews were grouped into 5 distinct clusters based on text description features:
| Cluster | Size | Flavor Profile |
|---|---|---|
| Cluster 0 | 32,235 | General / Mixed |
| Cluster 1 | 21 | Citrus notes |
| Cluster 2 | 15 | Fruity |
| Cluster 3 | 61 | Berry / Spice |
| Cluster 4 | 81 | Brand-focused |
Performance Vector:
Most reviews fall between 87β92 points, with @vboone and @mattkettmann being the most prolific reviewers.
Shows the geographic spread of wineries across regions, with Willamette Valley and California having the highest density.
Full attribute-level statistics including missing values, min/max, and distributions.
| Research Question | Finding |
|---|---|
| Price vs. Rating | Weak-to-moderate correlation (r = 0.43); price doesn't guarantee quality |
| Best Province | California leads in median points among US provinces |
| Most Common Grape Blend | Bordeaux Blend dominates association rules |
| Wine Clusters | 5 clusters identified: General/Mixed, Citrus, Fruity, Berry/Spice, Brand-focused |
This analysis provides actionable insights for wine industry stakeholders:
What could be improved:
This project was completed as part of academic coursework. The dataset is publicly available on Kaggle under its original license. This repository is licensed under the MIT License.