Data Stories on Datasets#
We explore two Kaggle datasets: one for classification (Heart Disease) and one for regression (Avocado Prices). Each dataset is analyzed in the same six steps:
Problem Statement
Understanding the Data
Data Transformations & Exploration
Visualizing Patterns
Deciding on the Approach
Data Sufficiency & Next Steps
Story 1: Heart Disease Classification#
1. Problem Statement#
We aim to predict the presence of heart disease using patient demographics and medical measurements.
Dataset: Heart Disease UCI
File Name: heart.csv
2. Understanding the Data#
Columns Overview:#
age: Age in years
sex: 1 = male, 0 = female
cp: Chest pain type (4 categories)
trestbps: Resting blood pressure (mm Hg)
chol: Serum cholesterol (mg/dl)
fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: Resting ECG results (0, 1, 2)
thalach: Maximum heart rate achieved
exang: Exercise-induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: Slope of the peak exercise ST segment
ca: Number of major vessels (0–3)
thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
target: 1 = heart disease present, 0 = no heart disease (binary target)
Key Insight: target is binary → This is a classification problem.
3. Data Transformations & Exploration#
Python Code:#
import pandas as pd
# Load the dataset
heart_data = pd.read_csv('heart.csv')
# Explore dataset
print(heart_data.head())
print(heart_data.info())
print(heart_data.describe())
# Check for missing values
print(heart_data.isnull().sum())
Observations:
The dataset has no significant missing values.
Categorical features such as cp, thal, and restecg are stored as integer codes and need encoding before modeling.
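One common way to encode these columns is one-hot encoding. The sketch below uses pandas' get_dummies on a tiny made-up sample rather than the real heart.csv; the values are illustrative only, and other encodings may suit a given model better.

```python
import pandas as pd

# Tiny made-up sample that mimics a few heart.csv columns (not real patient data).
sample = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "cp": [3, 2, 1, 1],
    "restecg": [0, 1, 0, 1],
    "thal": [3, 6, 6, 7],
    "target": [1, 1, 0, 0],
})

# One-hot encode the categorical codes so a model does not read them as ordered numbers.
categorical_cols = ["cp", "restecg", "thal"]
encoded = pd.get_dummies(sample, columns=categorical_cols, prefix=categorical_cols)

print(encoded.columns.tolist())
```

Each original column is replaced by one indicator column per observed category (for example thal_3, thal_6, thal_7), while age and target are left untouched.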
4. Visualizing Patterns#
4.1 Target Distribution:#
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='target', data=heart_data)
plt.title('Heart Disease Distribution (0 = No, 1 = Yes)')
plt.xlabel('Heart Disease')
plt.ylabel('Count')
plt.show()
4.2 Correlation Heatmap:#
corr_matrix = heart_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
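A heatmap can be hard to read when there are many features, so it may help to rank features by the strength of their correlation with target. This is a minimal sketch on a small made-up frame (the numbers are not from heart.csv); with the real data, the same lines could be applied to heart_data.

```python
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "thalach": [150, 187, 172, 178, 120, 160],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 2.6, 0.0],
    "target": [1, 1, 1, 1, 0, 0],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = sample.corr()["target"].drop("target")

# Order features by absolute correlation, strongest first, keeping the sign.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
```

Correlation only captures linear association, so a low value here does not rule out a feature being useful to a non-linear model.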
5. Deciding on the Approach#
This is a classification problem because the target variable (target) is binary (0 or 1).
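The decision can also be checked programmatically by looking at how many distinct values the target takes. The helper below, suggest_task, is a hypothetical function written for this illustration, and its max_classes threshold of 10 is an arbitrary assumption; it is a rough heuristic, not a substitute for reading the column description.

```python
import pandas as pd


def suggest_task(target: pd.Series, max_classes: int = 10) -> str:
    """Heuristic: few distinct values suggests classification, many suggests regression."""
    n_unique = target.nunique(dropna=True)
    if n_unique <= max_classes:
        return "classification"
    return "regression"


# Made-up targets: a binary label and a continuous price-like column.
binary_target = pd.Series([0, 1, 1, 0, 1])
price_target = pd.Series(
    [1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, 1.07, 1.12, 1.45]
)

print(suggest_task(binary_target))  # classification
print(suggest_task(price_target))   # regression
```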
6. Data Sufficiency & Next Steps#
Dataset Size: ~303 rows, enough for exploratory analysis and simple models, though small enough that scores can vary noticeably from one split to another.
Potential Models: Logistic Regression, Random Forest, or XGBoost.
Evaluation Metrics: Accuracy, Precision, Recall, F1, and ROC-AUC.
Validation: Perform cross-validation for robustness.
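The models, metrics, and validation listed above can be combined in a few lines with scikit-learn. The sketch below is an assumption-laden illustration: it uses synthetic data from make_classification (303 rows, 13 features) as a stand-in for heart.csv, and a scaled logistic regression as one of several reasonable baselines.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (not the real dataset).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class balance similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

for name in scoring:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how stable each metric is on a dataset of this size.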
Story 2: Avocado Prices Regression#
1. Problem Statement#
Predict the average price of avocados in different US regions.
Dataset: Avocado Prices
File Name: avocado.csv
2. Understanding the Data#
Columns Overview:#
Date: Observation date
AveragePrice: Average price of a single avocado (target variable)
Total Volume: Total number of avocados sold
type: Avocado type (conventional or organic)
year: Year of observation
region: Region of observation
Size-specific columns: 4046, 4225, 4770 (PLU codes for specific avocado sizes)
Key Insight: AveragePrice is a continuous variable → This is a regression problem.
3. Data Transformations & Exploration#
Python Code:#
import pandas as pd
# Load the dataset
avocado_data = pd.read_csv('avocado.csv')
# Explore dataset
print(avocado_data.head())
print(avocado_data.info())
print(avocado_data.describe())
# Check for missing values
print(avocado_data.isna().sum())
Observations:
Convert Date to datetime format.
Encode type and region as categorical features.
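Both transformations can be sketched with pandas as follows. The rows are made up to resemble avocado.csv, and the derived month column is an optional extra added for illustration, not something the analysis requires.

```python
import pandas as pd

# Made-up rows that mimic a few avocado.csv columns.
sample = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2016-01-03"],
    "AveragePrice": [1.33, 1.35, 0.93],
    "type": ["conventional", "organic", "conventional"],
    "region": ["Albany", "Albany", "Boston"],
})

# Parse the date strings into datetime64 values.
sample["Date"] = pd.to_datetime(sample["Date"])

# Optional: derive a month feature that a model can use directly.
sample["month"] = sample["Date"].dt.month

# Mark the text columns as categorical, then one-hot encode them.
sample["type"] = sample["type"].astype("category")
sample["region"] = sample["region"].astype("category")
encoded = pd.get_dummies(sample, columns=["type", "region"])

print(encoded.dtypes)
```

With the full dataset, region has many distinct values, so one-hot encoding adds one column per region; whether that is acceptable depends on the model chosen.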
4. Visualizing Patterns#
4.1 Price Distribution:#
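import seaborn as sns
import matplotlib.pyplot as plt
# Histogram of the target with a kernel density estimate overlaid
sns.histplot(data=avocado_data, x='AveragePrice', kde=True)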
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()
4.2 Price by Type:#
sns.boxplot(data=avocado_data, x='type', y='AveragePrice')
plt.title('Price by Avocado Type')
plt.xlabel('Type')
plt.ylabel('Price')
plt.show()
4.3 Correlation Heatmap:#
numerical_cols = ['AveragePrice', 'Total Volume', '4046', '4225', '4770']
corr_matrix = avocado_data[numerical_cols].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
5. Deciding on the Approach#
This is a regression problem because the target variable (AveragePrice) is continuous.
6. Data Sufficiency & Next Steps#
Dataset Size: ~18,000 rows, suitable for regression modeling.
Potential Models: Linear Regression, Random Forest, or XGBoost.
Evaluation Metrics: RMSE, MAE, and R².
Validation: Perform train-test split and cross-validation for robust results.
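A minimal sketch of this evaluation loop with scikit-learn is shown below. It uses synthetic data from make_regression in place of avocado.csv and plain linear regression as a baseline, so the printed numbers say nothing about avocado prices; the point is only how the split, the three metrics, and cross-validation fit together.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the avocado features and target (not the real dataset).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))  # penalizes large errors more
mae = mean_absolute_error(y_test, pred)           # average absolute error
r2 = r2_score(y_test, pred)                       # share of variance explained
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")

# Cross-validate on the training portion; scikit-learn returns negated RMSE.
cv_rmse = -cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring="neg_root_mean_squared_error",
)
print(f"CV RMSE: {cv_rmse.mean():.2f} +/- {cv_rmse.std():.2f}")
```

Because the avocado data is a time series per region, a random split like this one can leak future information into training; a time-based split may be a better fit for the real data.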
Conclusion#
We explored two datasets:
Heart Disease (Classification):
Target: target (0 or 1)
Suggested models: Logistic Regression, Random Forest.
Avocado Prices (Regression):
Target: AveragePrice (continuous)
Suggested models: Linear Regression, Random Forest.
Common Workflow:#
Understand the dataset (features, target variable).
Clean and preprocess the data (missing values, encoding).
Visualize patterns and correlations.
Choose the correct approach (classification or regression).
Validate results using appropriate metrics.