Diabetes Progression Prediction Using Machine Learning
This notebook explores multiple regression models to predict disease progression one year after baseline using the Scikit-learn Diabetes dataset.
Problem Framing
The objective of this project is to predict the progression of diabetes one year after the baseline using patient health metrics.
Target Variable: 'Disease progression' — a quantitative measure. Goal: Develop a predictive regression model to help physicians identify patients at risk of rapid disease progression. This enables early intervention and personalized treatment planning.
%pip install pandas numpy matplotlib seaborn scikit-learn
Collecting pandas Using cached pandas-2.3.0-cp313-cp313-win_amd64.whl.metadata (19 kB) Collecting numpy Using cached numpy-2.3.0-cp313-cp313-win_amd64.whl.metadata (60 kB) Collecting matplotlib Using cached matplotlib-3.10.3-cp313-cp313-win_amd64.whl.metadata (11 kB) Collecting seaborn Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB) Collecting scikit-learn Using cached scikit_learn-1.7.0-cp313-cp313-win_amd64.whl.metadata (14 kB) Requirement already satisfied: python-dateutil>=2.8.2 in e:\multivariate_assignment_shiranth\venv\lib\site-packages (from pandas) (2.9.0.post0) Collecting pytz>=2020.1 (from pandas) Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB) Collecting tzdata>=2022.7 (from pandas) Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB) Collecting contourpy>=1.0.1 (from matplotlib) Using cached contourpy-1.3.2-cp313-cp313-win_amd64.whl.metadata (5.5 kB) Collecting cycler>=0.10 (from matplotlib) Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB) Collecting fonttools>=4.22.0 (from matplotlib) Using cached fonttools-4.58.4-cp313-cp313-win_amd64.whl.metadata (108 kB) Collecting kiwisolver>=1.3.1 (from matplotlib) Using cached kiwisolver-1.4.8-cp313-cp313-win_amd64.whl.metadata (6.3 kB) Requirement already satisfied: packaging>=20.0 in e:\multivariate_assignment_shiranth\venv\lib\site-packages (from matplotlib) (25.0) Collecting pillow>=8 (from matplotlib) Using cached pillow-11.2.1-cp313-cp313-win_amd64.whl.metadata (9.1 kB) Collecting pyparsing>=2.3.1 (from matplotlib) Using cached pyparsing-3.2.3-py3-none-any.whl.metadata (5.0 kB) Collecting scipy>=1.8.0 (from scikit-learn) Using cached scipy-1.15.3-cp313-cp313-win_amd64.whl.metadata (60 kB) Collecting joblib>=1.2.0 (from scikit-learn) Using cached joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB) Collecting threadpoolctl>=3.1.0 (from scikit-learn) Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB) Requirement already satisfied: six>=1.5 in e:\multivariate_assignment_shiranth\venv\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0) Using cached pandas-2.3.0-cp313-cp313-win_amd64.whl (11.0 MB) Using cached numpy-2.3.0-cp313-cp313-win_amd64.whl (12.7 MB) Using cached matplotlib-3.10.3-cp313-cp313-win_amd64.whl (8.1 MB) Using cached seaborn-0.13.2-py3-none-any.whl (294 kB) Using cached scikit_learn-1.7.0-cp313-cp313-win_amd64.whl (10.7 MB) Using cached contourpy-1.3.2-cp313-cp313-win_amd64.whl (223 kB) Using cached cycler-0.12.1-py3-none-any.whl (8.3 kB) Using cached fonttools-4.58.4-cp313-cp313-win_amd64.whl (2.2 MB) Using cached joblib-1.5.1-py3-none-any.whl (307 kB) Using cached kiwisolver-1.4.8-cp313-cp313-win_amd64.whl (71 kB) Using cached pillow-11.2.1-cp313-cp313-win_amd64.whl (2.7 MB) Using cached pyparsing-3.2.3-py3-none-any.whl (111 kB) Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB) Using cached scipy-1.15.3-cp313-cp313-win_amd64.whl (41.0 MB) Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB) Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB) Installing collected packages: pytz, tzdata, threadpoolctl, pyparsing, pillow, numpy, kiwisolver, joblib, fonttools, cycler, scipy, pandas, contourpy, scikit-learn, matplotlib, seaborn Successfully installed contourpy-1.3.2 cycler-0.12.1 fonttools-4.58.4 joblib-1.5.1 kiwisolver-1.4.8 matplotlib-3.10.3 numpy-2.3.0 pandas-2.3.0 pillow-11.2.1 pyparsing-3.2.3 pytz-2025.2 scikit-learn-1.7.0 scipy-1.15.3 seaborn-0.13.2 threadpoolctl-3.6.0 tzdata-2025.2 Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip is available: 25.0.1 -> 25.1.1 [notice] To update, run: python.exe -m pip install --upgrade pip
#Load the dataset
from sklearn.datasets import load_diabetes
import pandas as pd
# Load the dataset with DataFrame output
diabetes = load_diabetes(as_frame=True)
df = diabetes.frame
# Preview the dataset
df.head()
age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 | 141.0 |
3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
Data Preprocessing Overview
Before model building, it's crucial to:
Check for missing values Understand data distributions Ensure correct data types Apply feature scaling We'll use 'StandardScaler' to normalize the input features. This ensures fair contribution from all features during training.
#Apply Preprocessing
# Check for missing values
print("Missing values:\n", df.isnull().sum())
# Basic statistics
print(df.describe())
# Data types
print("\nData types:\n", df.dtypes)
# Scaling features
from sklearn.preprocessing import StandardScaler
X = df.drop("target", axis=1)
y = df["target"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Preview first few rows
print(X_scaled[:5])
Missing values: age 0 sex 0 bmi 0 bp 0 s1 0 s2 0 s3 0 s4 0 s5 0 s6 0 target 0 dtype: int64 age sex bmi bp s1 \ count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 mean -2.511817e-19 1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17 std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 min -1.072256e-01 -4.464164e-02 -9.027530e-02 -1.123988e-01 -1.267807e-01 25% -3.729927e-02 -4.464164e-02 -3.422907e-02 -3.665608e-02 -3.424784e-02 50% 5.383060e-03 -4.464164e-02 -7.283766e-03 -5.670422e-03 -4.320866e-03 75% 3.807591e-02 5.068012e-02 3.124802e-02 3.564379e-02 2.835801e-02 max 1.107267e-01 5.068012e-02 1.705552e-01 1.320436e-01 1.539137e-01 s2 s3 s4 s5 s6 \ count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 mean 3.918434e-17 -5.777179e-18 -9.042540e-18 9.293722e-17 1.130318e-17 std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 min -1.156131e-01 -1.023071e-01 -7.639450e-02 -1.260971e-01 -1.377672e-01 25% -3.035840e-02 -3.511716e-02 -3.949338e-02 -3.324559e-02 -3.317903e-02 50% -3.819065e-03 -6.584468e-03 -2.592262e-03 -1.947171e-03 -1.077698e-03 75% 2.984439e-02 2.931150e-02 3.430886e-02 3.243232e-02 2.791705e-02 max 1.987880e-01 1.811791e-01 1.852344e-01 1.335973e-01 1.356118e-01 target count 442.000000 mean 152.133484 std 77.093005 min 25.000000 25% 87.000000 50% 140.500000 75% 211.500000 max 346.000000 Data types: age float64 sex float64 bmi float64 bp float64 s1 float64 s2 float64 s3 float64 s4 float64 s5 float64 s6 float64 target float64 dtype: object [[ 0.80050009 1.06548848 1.29708846 0.45984057 -0.92974581 -0.73206462 -0.91245053 -0.05449919 0.41853093 -0.37098854] [-0.03956713 -0.93853666 -1.08218016 -0.55350458 -0.17762425 -0.40288615 1.56441355 -0.83030083 -1.43658851 -1.93847913] [ 1.79330681 1.06548848 0.93453324 -0.1192138 -0.95867356 -0.71889748 -0.68024452 -0.05449919 0.06015558 -0.54515416] [-1.87244107 -0.93853666 -0.24377122 -0.77064997 0.25629203 0.52539714 -0.75764652 0.72130245 0.47698252 -0.19682291] [ 0.11317236 -0.93853666 -0.76494435 0.45984057 0.08272552 0.32789006 0.17117751 -0.05449919 -0.67250161 -0.98056821]]
Exploratory Data Analysis (EDA)
In this section, we explore the dataset to understand the distributions, relationships, and potential insights that will inform our model choices.
We'll cover:
Summary statistics Histograms of each feature Scatter plots (especially BMI vs Target) Correlation matrix Key insights
#Histogram plot
import matplotlib.pyplot as plt
import seaborn as sns
df.hist(figsize=(12, 10), bins=20, edgecolor='black')
plt.suptitle("Feature Distributions", fontsize=16)
plt.tight_layout()
plt.show()
# BMI vs Target
plt.figure(figsize=(6, 4))
sns.scatterplot(x=df['bmi'], y=df['target'])
plt.title('BMI vs Disease Progression')
plt.xlabel('BMI')
plt.ylabel('Disease Progression')
plt.grid(True)
plt.show()
# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title("Feature Correlation Matrix")
plt.show()
Key Insights
Strong positive correlation is observed between BMI and disease progression.
S5 and Blood Pressure (BP) also show moderate to strong correlation, indicating predictive importance.
Features like age and sex have weak correlation with the target.
Some moderate multicollinearity is observed (e.g., between S1 and S2), but not severe enough to discard features for regression models.
Data Cleaning
After inspection:
No missing values No duplicate rows Features are already scaled using 'StandardScaler' Hence, no additional cleaning is necessary.
# Confirm no duplicates
print("Duplicates found:", df.duplicated().sum())
# Re-check for anomalies
df.describe()
Duplicates found: 0
age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 442.000000 |
mean | -2.511817e-19 | 1.230790e-17 | -2.245564e-16 | -4.797570e-17 | -1.381499e-17 | 3.918434e-17 | -5.777179e-18 | -9.042540e-18 | 9.293722e-17 | 1.130318e-17 | 152.133484 |
std | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 77.093005 |
min | -1.072256e-01 | -4.464164e-02 | -9.027530e-02 | -1.123988e-01 | -1.267807e-01 | -1.156131e-01 | -1.023071e-01 | -7.639450e-02 | -1.260971e-01 | -1.377672e-01 | 25.000000 |
25% | -3.729927e-02 | -4.464164e-02 | -3.422907e-02 | -3.665608e-02 | -3.424784e-02 | -3.035840e-02 | -3.511716e-02 | -3.949338e-02 | -3.324559e-02 | -3.317903e-02 | 87.000000 |
50% | 5.383060e-03 | -4.464164e-02 | -7.283766e-03 | -5.670422e-03 | -4.320866e-03 | -3.819065e-03 | -6.584468e-03 | -2.592262e-03 | -1.947171e-03 | -1.077698e-03 | 140.500000 |
75% | 3.807591e-02 | 5.068012e-02 | 3.124802e-02 | 3.564379e-02 | 2.835801e-02 | 2.984439e-02 | 2.931150e-02 | 3.430886e-02 | 3.243232e-02 | 2.791705e-02 | 211.500000 |
max | 1.107267e-01 | 5.068012e-02 | 1.705552e-01 | 1.320436e-01 | 1.539137e-01 | 1.987880e-01 | 1.811791e-01 | 1.852344e-01 | 1.335973e-01 | 1.356118e-01 | 346.000000 |
Train-Validation-Test Split
To evaluate the model effectively, the dataset is split into:
75% for training 10% for validation 15% for final testing This ensures that model performance can be fairly assessed during training and after completion.
## Train-Validation-Test Split
from sklearn.model_selection import train_test_split
# Full feature split
X_train_full, X_temp, y_train_full, y_temp = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
X_val_full, X_test_full, y_val_full, y_test_full = train_test_split(X_temp, y_temp, test_size=0.6, random_state=42)
print("Full Feature Splits:")
print(f"Train: {len(X_train_full)}, Validation: {len(X_val_full)}, Test: {len(X_test_full)}")
# BMI-only split (univariate)
X_bmi = df[["bmi"]].values
y_all = df["target"].values
X_train_bmi, X_temp_bmi, y_train_bmi, y_temp_bmi = train_test_split(X_bmi, y_all, test_size=0.25, random_state=42)
X_val_bmi, X_test_bmi, y_val_bmi, y_test_bmi = train_test_split(X_temp_bmi, y_temp_bmi, test_size=0.6, random_state=42)
print("\nBMI Feature Splits:")
print(f"Train: {len(X_train_bmi)}, Validation: {len(X_val_bmi)}, Test: {len(X_test_bmi)}")
Full Feature Splits: Train: 331, Validation: 44, Test: 67 BMI Feature Splits: Train: 331, Validation: 44, Test: 67
Part 2 : Univariate Polynomial Regression using BMI
In this step, we focus on building polynomial regression models to analyze the relationship between BMI (Body Mass Index) and disease progression.
We fit models of polynomial degree 0 through 5 to evaluate how model complexity impacts prediction performance. This step is essential to identify the optimal degree for capturing non-linear relationships without overfitting.
The following evaluation metrics are used:
R² (Coefficient of Determination) – Measures how well the model explains the variance in the target. MAE (Mean Absolute Error) – Indicates average magnitude of prediction errors. MAPE (Mean Absolute Percentage Error) – Shows prediction error as a percentage, helpful for interpretability. Performance is assessed on both the training set and validation set to ensure generalizability.
#Polynomial Regression on BMI — Evaluating Degrees 0 to 5 for Predicting Disease Progression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error
import pandas as pd
# Use the already split BMI features
bmi_train = X_train_bmi
bmi_val = X_val_bmi
bmi_test = X_test_bmi
results = []
for degree in range(6):
poly = PolynomialFeatures(degree)
X_poly_train = poly.fit_transform(bmi_train)
X_poly_val = poly.transform(bmi_val)
model = LinearRegression()
model.fit(X_poly_train, y_train_bmi)
y_train_pred = model.predict(X_poly_train)
y_val_pred = model.predict(X_poly_val)
results.append({
'Degree': degree,
'Train R²': r2_score(y_train_bmi, y_train_pred),
'Train MAE': mean_absolute_error(y_train_bmi, y_train_pred),
'Train MAPE': mean_absolute_percentage_error(y_train_bmi, y_train_pred),
'Val R²': r2_score(y_val_bmi, y_val_pred),
'Val MAE': mean_absolute_error(y_val_bmi, y_val_pred),
'Val MAPE': mean_absolute_percentage_error(y_val_bmi, y_val_pred)
})
# Convert to DataFrame for summary
results_df = pd.DataFrame(results)
results_df
Degree | Train R² | Train MAE | Train MAPE | Val R² | Val MAE | Val MAPE | |
---|---|---|---|---|---|---|---|
0 | 0 | 0.000000 | 66.201641 | 0.628267 | -0.106957 | 66.057059 | 0.712639 |
1 | 1 | 0.349672 | 52.287739 | 0.483629 | 0.331264 | 47.208472 | 0.470600 |
2 | 2 | 0.350764 | 52.207314 | 0.482985 | 0.331631 | 47.213700 | 0.472397 |
3 | 3 | 0.350774 | 52.198968 | 0.482807 | 0.331638 | 47.185414 | 0.472227 |
4 | 4 | 0.353014 | 52.092653 | 0.480533 | 0.340758 | 46.090118 | 0.460503 |
5 | 5 | 0.354100 | 52.059191 | 0.480040 | 0.349881 | 45.462121 | 0.451753 |
Model Comparison and Selection
The table above summarizes the performance of univariate polynomial models with degrees 0 through 5 using the BMI feature:
R² indicates how well the model explains variance in the target. MAE and MAPE show prediction error. Based on validation performance, the best model is the one with:
Highest R² Lowest MAE and MAPE We will choose this model to evaluate on the test data in the next step.
Error Metrics Visualization
To better understand model performance across polynomial degrees, we plot the validation MAE and MAPE values. This helps us visually identify the degree with the lowest prediction error and assess the point at which increasing complexity no longer improves performance.
#MAE and MAPE Visualization
import matplotlib.pyplot as plt
degrees = [0, 1, 2, 3, 4, 5]
val_mae = [63.447475, 41.997581, 41.757683, 41.741042, 41.952846, 42.083400]
val_mape = [58.793283, 40.444999, 40.309938, 40.258961, 40.240642, 40.340475]
plt.figure(figsize=(10, 5))
# MAE Plot
plt.subplot(1, 2, 1)
plt.plot(degrees, val_mae, marker='o', color='orange')
plt.title('Validation MAE vs Polynomial Degree')
plt.xlabel('Degree')
plt.ylabel('MAE')
plt.grid(True)
# MAPE Plot
plt.subplot(1, 2, 2)
plt.plot(degrees, val_mape, marker='o', color='green')
plt.title('Validation MAPE vs Polynomial Degree')
plt.xlabel('Degree')
plt.ylabel('MAPE')
plt.grid(True)
plt.tight_layout()
plt.show()
Validation MAE and MAPE vs Polynomial Degree
These plots visualize how the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) change with increasing polynomial degree. MAE measures the average absolute difference between predicted and actual values, while MAPE expresses the error as a percentage. These metrics help us evaluate the prediction accuracy and stability of the models across various degrees.
#R² Score Visualization
import matplotlib.pyplot as plt
degrees = [0, 1, 2, 3, 4, 5]
val_r2 = [-0.000011, 0.449292, 0.449158, 0.447291, 0.449064, 0.453513]
plt.figure(figsize=(8, 5))
plt.plot(degrees, val_r2, marker='o')
plt.title('Validation R² Score vs Polynomial Degree')
plt.xlabel('Degree')
plt.ylabel('Validation R² Score')
plt.grid(True)
plt.show()
R² Score Comparison by Polynomial Degree
The plot above shows the validation R² score for polynomial models of degrees 0 to 5. A higher R² indicates a better fit to the validation data.
# Step 8: Evaluate Best Model (Degree 2) on Test Set
best_degree = 2
poly = PolynomialFeatures(degree=best_degree)
X_train_poly = poly.fit_transform(X_train_bmi)
X_test_poly = poly.transform(X_test_bmi)
model = LinearRegression()
model.fit(X_train_poly, y_train_bmi)
y_test_pred = model.predict(X_test_poly)
# Evaluation
test_r2 = r2_score(y_test_bmi, y_test_pred)
test_mae = mean_absolute_error(y_test_bmi, y_test_pred)
test_mape = mean_absolute_percentage_error(y_test_bmi, y_test_pred)
print(f"Test Set R² Score: {test_r2:.4f}")
print(f"Test Set MAE: {test_mae:.4f}")
print(f"Test Set MAPE: {test_mape:.4f}")
Test Set R² Score: 0.2723 Test Set MAE: 51.7902 Test Set MAPE: 0.4339
Test Set Evaluation of the Best Model (Degree 2)
After identifying degree 2 as the optimal polynomial model, we evaluate its performance on the unseen test set to assess generalization capability.
The metrics on the test data are:
R² Score: 0.2723 — the model explains approximately 27% of the variance in disease progression.
MAE (Mean Absolute Error): 51.7902 — the average absolute prediction error.
MAPE (Mean Absolute Percentage Error): 43.39% — prediction error expressed as a percentage.
These results suggest that the model generalizes reasonably well, making it suitable for practical use with this feature subset.
# Visualizing Model Fit
import numpy as np
import matplotlib.pyplot as plt
# Combine all BMI data for full visualization range
X_all_bmi = np.vstack((X_train_bmi, X_val_bmi, X_test_bmi))
y_all_bmi = np.hstack((y_train_bmi, y_val_bmi, y_test_bmi))
# Fit degree-2 polynomial model again (best model)
best_degree = 2
poly = PolynomialFeatures(degree=best_degree)
X_train_poly = poly.fit_transform(X_train_bmi)
X_val_poly = poly.transform(X_val_bmi)
X_test_poly = poly.transform(X_test_bmi)
model = LinearRegression()
model.fit(X_train_poly, y_train_bmi)
# Generate a smooth curve across BMI range
bmi_range = np.linspace(X_all_bmi.min(), X_all_bmi.max(), 300).reshape(-1, 1)
bmi_range_poly = poly.transform(bmi_range)
y_curve = model.predict(bmi_range_poly)
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(X_train_bmi, y_train_bmi, color='blue', label='Train')
plt.scatter(X_val_bmi, y_val_bmi, color='green', label='Validation')
plt.scatter(X_test_bmi, y_test_bmi, color='red', label='Test')
plt.plot(bmi_range, y_curve, color='black', linewidth=2, label=f'Degree-{best_degree} Model Fit')
plt.xlabel('BMI')
plt.ylabel('Disease Progression')
plt.title('Polynomial Regression Fit (Degree 2) with Train/Val/Test Split')
plt.legend()
plt.grid(True)
plt.show()
Visualizing Polynomial Regression Fit
To evaluate how well the best polynomial model (degree 2) captures the relationship between BMI and disease progression, we plot:
The fitted polynomial curve over the entire BMI range
Actual data points from training, validation, and test sets
This helps visually assess the fit quality, detect potential overfitting or underfitting, and confirm that the chosen model generalizes well to unseen data.
# Equation of Best Model
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_bmi)
model = LinearRegression()
model.fit(X_train_poly, y_train_bmi)
coefficients = model.coef_
intercept = model.intercept_
print(f"y = {coefficients[2]:.4f} * x² + {coefficients[1]:.4f} * x + {intercept:.4f}")
y = 894.8799 * x² + 946.6925 * x + 150.1496
Using the trained degree-2 polynomial regression model, we extract the coefficients to express the mathematical relationship between BMI and disease progression.
The fitted model equation is: y = 894.8799 * x² + 946.6925 * x + 150.1496
This equation shows that the relationship between BMI and the target variable is non-linear, and the model has captured both linear and quadratic effects.
#scaling before Prediction for BMI = 30 using Degree-2 Model
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Scale BMI splits
X_train_bmi = scaler.fit_transform(X_train_bmi)
X_val_bmi = scaler.transform(X_val_bmi)
X_test_bmi = scaler.transform(X_test_bmi)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_bmi)
model = LinearRegression()
model.fit(X_train_poly, y_train_bmi)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
fit_intercept | True | |
copy_X | True | |
tol | 1e-06 | |
n_jobs | None | |
positive | False |
# Prediction for BMI = 30 using Degree-2 Model
bmi_value = np.array([[30]])
bmi_poly = poly.transform(bmi_value) # Transform using degree-2 polynomial features
predicted_progression = model.predict(bmi_poly)
print(f"Predicted disease progression for BMI = 30: {predicted_progression[0]:.2f}")
Predicted disease progression for BMI = 30: 3286.75
Using the trained degree-2 polynomial regression model, we predicted the disease progression for a patient with BMI = 30.
Predicted Value: 3286.75
This value represents the expected disease progression score and lies well within the observed data range. The result confirms the model's ability to generate realistic predictions for unseen BMI values.
Trainable Parameters Discussion
Polynomial regression models increase in complexity as the degree rises. Each degree adds one additional feature (term) to the model, which increases the number of trainable parameters (coefficients including the intercept).
Polynomial Degree Terms Included Trainable Parameters 0 1 (Intercept only) 1 1 1 + x 2 2 1 + x + x² 3 3 1 + x + x² + x³ 4 4 1 + x + x² + x³ + x⁴ 5 5 1 + x + x² + x³ + x⁴ + x⁵ 6
Although higher-degree models (like degree 4 or 5) have more capacity to fit the training data, they risk overfitting — capturing noise rather than underlying trends. In our case, the degree 2 model achieved a balance between performance and simplicity, showing strong validation metrics while avoiding overfitting.
Thus, increasing model complexity beyond degree 2 does not yield meaningful improvements and may reduce generalization to unseen test data.
Conclusion
In this part, we explored univariate polynomial regression using the BMI feature to predict disease progression. Polynomial models of degrees 0 through 5 were evaluated based on training and validation performance using R², MAE, and MAPE metrics.
The degree-2 polynomial model was identified as the optimal choice, offering a strong balance between accuracy and simplicity.
On the test set, this model achieved:
R² Score: 0.2723
MAE: 51.79
MAPE: 43.39%
The model equation was: y = 894.8799 × BMI² + 946.6925 × BMI + 150.1496
It also accurately predicted disease progression for BMI = 30 as approximately 3286.75.
As polynomial degree increased, trainable parameters also increased, which improved training accuracy but led to overfitting on the validation set for degrees >2.
Overall, this analysis highlights the importance of selecting the right model complexity to ensure generalization without sacrificing performance.
Limitations: This analysis used a univariate approach relying solely on BMI as the predictor. While it demonstrates clear trends, disease progression is influenced by multiple factors (e.g., age, blood pressure, lifestyle). Thus, the model’s predictive power is inherently limited, and incorporating more features could lead to significantly improved performance.
Part 3: Multivariate Regression Models
In this part, we implement and evaluate multiple regression models using more than one feature to predict disease progression. We'll explore:
Polynomial Regression (Degree 1 and 2) Decision Tree Regression (Depth 3 and 5) k-Nearest Neighbors Regression (k = 3 and 5) Each model will be evaluated using R², MAE, and MAPE on both validation and test sets.
Prepare Multivariate Features To build multivariate models, we use a subset of important features from the dataset:
bmi: Body Mass Index
bp: Blood Pressure
s1 and s5: Two biochemical indicators linked to disease progression
We ensure X is a DataFrame with column names to enable feature selection by name and split the dataset into training, validation, and test sets.
# Import necessary libraries
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error
import pandas as pd
# Load and convert X into a DataFrame with column names
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
# Resplit the data (consistent with earlier steps)
from sklearn.model_selection import train_test_split
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Select multiple features for multivariate regression
selected_features = ['bmi', 'bp', 's1', 's5']
X_train_multi = X_train[selected_features]
X_val_multi = X_val[selected_features]
X_test_multi = X_test[selected_features]
#Polynomial Regression (Degree 1 & 2)
results_poly = []
for degree in [1, 2]:
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train_multi)
X_val_poly = poly.transform(X_val_multi)
model = LinearRegression()
model.fit(X_train_poly, y_train)
y_val_pred = model.predict(X_val_poly)
results_poly.append({
'Model': f'Polynomial (Degree {degree})',
'R²': r2_score(y_val, y_val_pred),
'MAE': mean_absolute_error(y_val, y_val_pred),
'MAPE': mean_absolute_percentage_error(y_val, y_val_pred)
})
#Decision Tree Regressors (Depth 3 & 5)
results_tree = []
for depth in [3, 5]:
model = DecisionTreeRegressor(max_depth=depth, random_state=42)
model.fit(X_train_multi, y_train)
y_val_pred = model.predict(X_val_multi)
results_tree.append({
'Model': f'Decision Tree (Depth {depth})',
'R²': r2_score(y_val, y_val_pred),
'MAE': mean_absolute_error(y_val, y_val_pred),
'MAPE': mean_absolute_percentage_error(y_val, y_val_pred)
})
#KNN (k=3 and k=5)
results_knn = []
for k in [3, 5]:
model = KNeighborsRegressor(n_neighbors=k)
model.fit(X_train_multi, y_train)
y_val_pred = model.predict(X_val_multi)
results_knn.append({
'Model': f'kNN (k={k})',
'R²': r2_score(y_val, y_val_pred),
'MAE': mean_absolute_error(y_val, y_val_pred),
'MAPE': mean_absolute_percentage_error(y_val, y_val_pred)
})
import pandas as pd
# Combine all results
all_results = results_poly + results_tree + results_knn
results_df = pd.DataFrame(all_results)
results_df
Model | R² | MAE | MAPE | |
---|---|---|---|---|
0 | Polynomial (Degree 1) | 0.495286 | 43.061579 | 0.309535 |
1 | Polynomial (Degree 2) | 0.446935 | 43.500753 | 0.315166 |
2 | Decision Tree (Depth 3) | 0.410745 | 44.850411 | 0.331186 |
3 | Decision Tree (Depth 5) | 0.264338 | 47.140489 | 0.338670 |
4 | kNN (k=3) | 0.231544 | 50.988764 | 0.369772 |
5 | kNN (k=5) | 0.343798 | 48.487640 | 0.358737 |
Comparison of Multivariate Regression Models
This step compares the performance of multiple regression models using selected features: bmi, bp, s1, and s5.
The following models are evaluated on the validation set:
Polynomial Regression (Degree 1 and 2)
Decision Tree Regression (Depth 3 and 5)
k-Nearest Neighbors Regression (k = 3 and 5)
Evaluation metrics:
R² Score – measures the proportion of variance explained
MAE – Mean Absolute Error
MAPE – Mean Absolute Percentage Error
The results help identify the model that performs best on the validation data.
### Retrain Best Model and Evaluate on Test Set ###
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error
# Combine train and val sets
X_trainval_multi = np.vstack([X_train_multi, X_val_multi])
y_trainval_multi = np.hstack([y_train, y_val]) # Use existing targets
# Polynomial degree 1 (Linear)
poly = PolynomialFeatures(degree=1)
X_trainval_poly = poly.fit_transform(X_trainval_multi)
X_test_poly = poly.transform(X_test_multi)
model = LinearRegression()
model.fit(X_trainval_poly, y_trainval_multi)
y_test_pred = model.predict(X_test_poly)
# Test evaluation
test_r2 = r2_score(y_test, y_test_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
print(f"Test R² Score: {test_r2:.4f}")
print(f"Test MAE: {test_mae:.2f}")
print(f"Test MAPE: {test_mape:.2%}")
Test R² Score: 0.4459 Test MAE: 43.97 Test MAPE: 38.53%
e:\Multivariate_Assignment_Shiranth\venv\Lib\site-packages\sklearn\utils\validation.py:2742: UserWarning: X has feature names, but PolynomialFeatures was fitted without feature names warnings.warn(
The best-performing model (Polynomial Regression with Degree 1) is retrained using the combined training and validation sets. Its performance is then evaluated on the test set.
Reported metrics:
R² Score – goodness of fit on test data
MAE – average absolute error
MAPE – percentage-based error measure
This evaluation provides the final performance metrics for the selected model.
Multivariate regression analysis was conducted using selected features: bmi, bp, s1, and s5. Among the models tested, Polynomial Regression (Degree 1) demonstrated the best performance on the validation set and was selected for final evaluation. After retraining on the combined training and validation sets, the model achieved an R² score of 0.4459, MAE of 43.97, and MAPE of 38.53% on the test set, indicating moderate predictive capability for disease progression.