Data Stories on Datasets

We explore two Kaggle datasets: one for classification (Heart Disease) and one for regression (Avocado Prices). Each dataset is analyzed in six steps:

Problem Statement
Understanding the Data
Data Transformations & Exploration
Visualizing Patterns
Deciding on the Approach
Data Sufficiency & Next Steps

Story 1: Heart Disease Classification#

1. Problem Statement#

We aim to predict the presence of heart disease using patient demographics and medical measurements.

Dataset: Heart Disease UCI
File Name: heart.csv

2. Understanding the Data#

Columns Overview:#

age: Age in years
sex: 1 = male, 0 = female
cp: Chest pain type (4 categories)
trestbps: Resting blood pressure (mm Hg)
chol: Serum cholesterol (mg/dl)
fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: Resting ECG results (0, 1, 2)
thalach: Maximum heart rate achieved
exang: Exercise-induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: Slope of the peak exercise ST segment
ca: Number of major vessels (0–3) colored by fluoroscopy
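thal: 3 = normal, 6 = fixed defect, 7 = reversible defect (original UCI coding; the Kaggle file may use different integer codes, so check heart_data['thal'].unique())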
target: 1 = heart disease present, 0 = no heart disease (binary target)

Key Insight: target is binary → This is a classification problem.

3. Data Transformations & Exploration#

Python Code:#

import pandas as pd

# Load the dataset
heart_data = pd.read_csv('heart.csv')

# Explore dataset
print(heart_data.head())
heart_data.info()  # info() prints its summary directly and returns None
print(heart_data.describe())

# Check for missing values
print(heart_data.isnull().sum())

Observations:

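The dataset has no significant missing values (confirm with the isnull().sum() output above).
Features such as cp, thal, and restecg are categories stored as integer codes, so they should be encoded (for example, one-hot encoded) before modeling.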
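The encoding step mentioned above can be sketched with pandas. This is a minimal illustration, not the original author's pipeline: the small `sample` frame is a hypothetical stand-in for heart.csv with invented values, and `pd.get_dummies` is one of several reasonable encoders.

```python
import pandas as pd

# Hypothetical miniature of heart.csv; the values are invented for illustration.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "cp": [3, 2, 1, 1],
    "restecg": [0, 1, 0, 1],
    "thal": [1, 2, 2, 3],
    "target": [1, 1, 0, 0],
})

categorical_cols = ["cp", "restecg", "thal"]

# Replace each integer-coded category with one 0/1 indicator column per level.
encoded = pd.get_dummies(sample, columns=categorical_cols, dtype=int)

print(encoded.columns.tolist())
```

For linear models such as Logistic Regression, passing `drop_first=True` to `pd.get_dummies` may be preferable, since it avoids perfectly collinear indicator columns.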

4. Visualizing Patterns#

4.1 Target Distribution:#

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='target', data=heart_data)
plt.title('Heart Disease Distribution (0 = No, 1 = Yes)')
plt.xlabel('Heart Disease')
plt.ylabel('Count')
plt.show()

4.2 Correlation Heatmap:#

corr_matrix = heart_data.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
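A heatmap of this size can be hard to read, so it may help to list each feature's correlation with the target directly. The sketch below assumes a tiny hypothetical frame with invented values; with the real data, `heart_data` would take the place of `sample`.

```python
import pandas as pd

# Hypothetical rows; column names follow heart.csv, values are invented.
sample = pd.DataFrame({
    "thalach": [120, 130, 150, 160],
    "oldpeak": [2.0, 1.0, 0.5, 0.5],
    "fbs": [1, 0, 1, 0],
    "target": [0, 0, 1, 1],
})

# Pearson correlation of every feature with the target column.
target_corr = sample.corr()["target"].drop("target")

# Order features by the strength (absolute value) of that correlation.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
```

Pearson correlation only captures linear association, so a low value here does not rule out a feature being useful to a tree-based model.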

5. Deciding on the Approach#

This is a classification problem because the target variable (target) is binary (0 or 1).

6. Data Sufficiency & Next Steps#

Dataset Size: about 300 rows, enough for exploratory analysis and simple models, but small enough that results can vary noticeably between train/test splits.
Potential Models: Logistic Regression, Random Forest, or XGBoost.
Evaluation Metrics: Accuracy, Precision, Recall, F1, and ROC-AUC.
Validation: Perform cross-validation for robustness.
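To make the listed metrics concrete, here is a small sketch that computes accuracy, precision, recall, and F1 from raw labels in plain Python. The function name `binary_metrics` and the label lists are hypothetical; in practice a library such as scikit-learn would normally be used, and ROC-AUC is left out because it needs predicted probabilities rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic metrics for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a medical screening task like this one, recall on the positive class often matters more than raw accuracy, because a missed case is usually costlier than a false alarm.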

Story 2: Avocado Prices Regression#

1. Problem Statement#

Predict the average price of avocados in different US regions.

Dataset: Avocado Prices
File Name: avocado.csv

2. Understanding the Data#

Columns Overview:#

Date: Observation date
AveragePrice: Average price of a single avocado (target variable)
Total Volume: Total number of avocados sold
type: Avocado type (conventional or organic)
year: Year of observation
region: Region of observation
4046, 4225, 4770: Number of avocados sold under each PLU (price look-up) code; each code identifies a specific avocado size

Key Insight: AveragePrice is a continuous variable → This is a regression problem.

3. Data Transformations & Exploration#

Python Code:#

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
avocado_data = pd.read_csv('avocado.csv')

# Explore dataset
print(avocado_data.head())
avocado_data.info()  # info() prints its summary directly and returns None
print(avocado_data.describe())

# Check for missing values
print(avocado_data.isna().sum())

Observations:

Date is loaded as a string and should be converted to datetime format.
type and region are text columns and should be encoded as categorical features.
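Both observations can be sketched in a few lines of pandas. The `sample` frame below is a hypothetical stand-in for avocado.csv with invented rows, and the extra `month` column is an assumption about what a model might use, not something the dataset provides.

```python
import pandas as pd

# Hypothetical rows; column names follow avocado.csv, values are invented.
sample = pd.DataFrame({
    "Date": ["2015-12-27", "2016-07-03", "2017-01-08"],
    "AveragePrice": [1.33, 1.10, 1.75],
    "type": ["conventional", "conventional", "organic"],
    "region": ["Albany", "Boston", "Albany"],
})

# Parse the date strings so that date parts can be extracted.
sample["Date"] = pd.to_datetime(sample["Date"])
sample["month"] = sample["Date"].dt.month  # hypothetical seasonal feature

# One-hot encode the two text columns.
encoded = pd.get_dummies(sample, columns=["type", "region"], dtype=int)

print(encoded.head())
```

The real file has many regions, so one-hot encoding region adds a column for each one; whether that is acceptable depends on the model chosen.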

4. Visualizing Patterns#

4.1 Price Distribution:#

sns.histplot(data=avocado_data, x='AveragePrice', kde=True)
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()

4.2 Price by Type:#

sns.boxplot(data=avocado_data, x='type', y='AveragePrice')
plt.title('Price by Avocado Type')
plt.xlabel('Type')
plt.ylabel('Price')
plt.show()

4.3 Correlation Heatmap:#

numerical_cols = ['AveragePrice', 'Total Volume', '4046', '4225', '4770']
corr_matrix = avocado_data[numerical_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

5. Deciding on the Approach#

This is a regression problem because the target variable (AveragePrice) is continuous.

6. Data Sufficiency & Next Steps#

Dataset Size: ~18,000 rows, suitable for regression modeling.
Potential Models: Linear Regression, Random Forest, or XGBoost.
Evaluation Metrics: RMSE, MAE, and R².
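Validation: Hold out a test set and use cross-validation on the training data. Because the observations are dated, a time-based split is worth considering so that the model is not evaluated on periods that precede its training data.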
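The three regression metrics listed above can be written out directly from their definitions. This is a plain-Python sketch with a hypothetical function name, `regression_metrics`, and invented numbers; a library implementation would normally be used on real predictions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R² compares the model's squared error with that of always
    # predicting the mean of the true values.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "r2": r2}


# Invented prices and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

scores = regression_metrics(y_true, y_pred)
print(scores)
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few large misses dominate the error.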

Conclusion#

We explored two datasets:

Heart Disease (Classification):
- Target: target (0 or 1)
- Suggested models: Logistic Regression, Random Forest.
Avocado Prices (Regression):
- Target: AveragePrice (continuous)
- Suggested models: Linear Regression, Random Forest.

Common Workflow:#

Understand the dataset (features, target variable).
Clean and preprocess the data (missing values, encoding).
Visualize patterns and correlations.
Choose the correct approach (classification or regression).
Validate results using appropriate metrics.
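The "choose the correct approach" step can be approximated with a simple check on the target column. The helper below is a rough heuristic under stated assumptions, not a rule from the original text: the name `suggest_task` and the `max_classes` threshold of 10 are hypothetical choices.

```python
def suggest_task(target_values, max_classes=10):
    """Guess whether a target column suggests classification or regression.

    Heuristic (an assumption, not a rule): text labels, or a small number
    of distinct whole-number values, point to classification; anything
    else is treated as a continuous target.
    """
    unique_values = set(target_values)

    if any(isinstance(v, str) for v in unique_values):
        return "classification"

    all_whole = all(float(v).is_integer() for v in unique_values)
    if all_whole and len(unique_values) <= max_classes:
        return "classification"

    return "regression"


print(suggest_task([0, 1, 1, 0]))        # binary target, as in the heart data
print(suggest_task([1.33, 1.35, 0.93]))  # continuous target, as in AveragePrice
```

A check like this is only a starting point: a count-valued target with few distinct values, for example, could be misread as classification, so the final call still depends on what the target means.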
