Multiple Linear Regression from Scratch (with Diagnostics)

This is a ground-up implementation of multiple linear regression using gradient descent. The model is trained on the Kaggle House Prices dataset and includes full diagnostics to validate assumptions.

If you have not read Simple Linear Regression on Housing Data (Notes), hop over there first.

Dataset and Feature Prep

prep.py
import pandas as pd

df = pd.read_csv("train.csv")
df = df[[
    "MSSubClass", "LotFrontage", "LotArea", "GrLivArea",
    "OverallQual", "OverallCond", "1stFlrSF", "2ndFlrSF",
    "LowQualFinSF", "FullBath", "HalfBath", "SalePrice"
]]

Missing values are filled with 0. Features are standardized using StandardScaler. The target variable, SalePrice, is left unscaled.
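
A minimal sketch of that preprocessing, assuming it is applied in place on the selected columns; the name feature_cols is introduced here for convenience and is not from the original prep.py:

from sklearn.preprocessing import StandardScaler

# All selected columns except the target are treated as features
feature_cols = [c for c in df.columns if c != "SalePrice"]

# Naive imputation: fill missing values (e.g. LotFrontage) with 0
df[feature_cols] = df[feature_cols].fillna(0)

# Standardize features to zero mean and unit variance; SalePrice stays unscaled
df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])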

Training: Linear Regression + Gradient Descent

Prediction function:

Predicts the output for a single sample using multiple linear regression.

$$\hat{y}^{(i)} = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$$

Args:

  • X (np.ndarray): Feature vector for a single sample (shape: [num features,])
  • w (np.ndarray): Weight vector (shape: [num features,])
  • b (float): Bias term (scalar).

Returns:

  • float: Predicted value for the input sample.

This function computes the dot product of the feature vector and the weights, then adds the bias term to produce the prediction for one sample.

predict.py
def predict(X: np.ndarray, w: np.ndarray, b: float) -> float:
    # Dot product of the weights and one sample's features, plus the bias
    return np.dot(w, X) + b
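
As a quick sanity check, a tiny hypothetical example (the numbers are made up, not from the dataset):

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
print(predict(x, w, b=10.0))   # 0.5*1.0 - 0.2*2.0 + 0.1*3.0 + 10.0 = 10.4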

Cost (Mean Squared Error):

Computes the cost (mean squared error) for multiple linear regression.

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

Where:

  • $m$: Number of training examples
  • $\hat{y}^{(i)}$: Predicted value for example $i$
  • $y^{(i)}$: Actual target value

Args:

  • X (np.ndarray): Feature matrix of shape (m, n), where m is the number of samples and n is the number of features.
  • y (np.ndarray): Target vector of shape (m,).
  • w (np.ndarray): Weight vector of shape (n,).
  • b (float): Bias term (scalar).

Returns:

  • float: The cost value computed as the mean squared error divided by 2.

This function iterates over each training example, computes the prediction using the current weights and bias, accumulates the squared errors, and returns their sum divided by 2m (half the mean squared error).

compute_cost.py
def compute_cost(X: np.ndarray, y: np.ndarray, w: np.ndarray, b: float) -> float:
    m = y.shape[0]
    cost = 0.0
    for i in range(m):
        cost += (predict(X[i], w, b) - y[i]) ** 2
    return cost / (2 * m)
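
For comparison, the same cost can be computed without the Python loop. The vectorized version below is an equivalent sketch, not part of the original code:

def compute_cost_vectorized(X: np.ndarray, y: np.ndarray, w: np.ndarray, b: float) -> float:
    # X @ w + b produces all m predictions at once; square the errors and average
    errors = X @ w + b - y
    return float(np.sum(errors ** 2) / (2 * y.shape[0]))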

Gradients:

Computes the gradients of the cost function with respect to weights and bias.

Gradient of the Cost Function with respect to weights $\mathbf{w}$:

$$\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}$$

Gradient with respect to the bias $b$:

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$

Args:

  • X (np.ndarray): Feature matrix of shape (m, n), where m is the number of samples and n is the number of features.
  • y (np.ndarray): Target vector of shape (m,).
  • w (np.ndarray): Weight vector of shape (n,).
  • b (float): Bias term (scalar).

Returns:

  • tuple[np.ndarray, float]: A tuple containing the gradient with respect to weights (dw) and the gradient with respect to bias (db).

compute_gradients.py
def compute_gradients(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
) -> tuple[np.ndarray, float]:
    m = y.shape[0]
    dw = np.zeros(w.shape)
    db = 0.0
    for i in range(m):
        error = predict(X[i], w, b) - y[i]
        dw += error * X[i]
        db += error
    return dw / m, db / m
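
The gradients can be vectorized the same way. The sketch below is equivalent to the loop above and, because it keeps the same signature, could be passed as gradient_function to the training loop; it is not part of the original code:

def compute_gradients_vectorized(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
) -> tuple[np.ndarray, float]:
    m = y.shape[0]
    errors = X @ w + b - y           # (m,) prediction error for every sample
    dw = X.T @ errors / m            # (n,) average of error-weighted feature vectors
    db = float(np.sum(errors) / m)   # scalar average error
    return dw, db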

Gradient Descent:

Gradient Descent Update Rule

$$\theta := \theta - \alpha \nabla_\theta J(\theta)$$

Where:

  • $\theta$: Model parameters (e.g., weights $\mathbf{w}$ and bias $b$)
  • $\alpha$: Learning rate
  • $\nabla_\theta J(\theta)$: Gradient of the cost function with respect to $\theta$

We move $\theta$ in the direction that reduces the cost, scaled by the learning rate $\alpha$.

Args:

  • X (np.ndarray): Feature matrix of shape (m, n), where m is the number of samples and n is the number of features.
  • y (np.ndarray): Target vector of shape (m,).
  • w (np.ndarray): Initial weight vector of shape (n,).
  • b (float): Initial bias term (scalar).
  • alpha (float): Learning rate.
  • epochs (int): Number of iterations for gradient descent.
  • gradient_function (callable): Function to compute gradients.
  • cost_values (list[float], optional): List to store cost values at each epoch.

Returns:

  • tuple[np.ndarray, float]: The optimized parameters w and b after training.

gradient_descent.py
import copy

def gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    w_in: np.ndarray,
    b: float,
    alpha: float,
    epochs: int,
    gradient_function: callable,
    cost_values: list[float] | None = None,
) -> tuple[np.ndarray, float]:
    # Work on a copy so the caller's initial weights are not mutated
    w = copy.deepcopy(w_in)
    for i in range(epochs):
        dw, db = gradient_function(X, y, w, b)
        w -= alpha * dw
        b -= alpha * db
        if cost_values is not None:
            cost_values.append(compute_cost(X, y, w, b))
    return w, b

Training Run

  • Features: 11
  • Samples: 1460
  • Epochs: 1000
  • Learning rate: 0.01

train.py
# feature_cols holds the 11 feature columns from prep.py (everything except SalePrice)
X = df[feature_cols].to_numpy()
y = df["SalePrice"].to_numpy()
w = np.zeros(X.shape[1])
b = 0.0
cost_values = []  # collects the cost after every epoch
w, b = gradient_descent(X, y, w, b, alpha=1e-2, epochs=1000,
                        gradient_function=compute_gradients, cost_values=cost_values)
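
The diagnostic plots below are derived from the trained parameters. A rough sketch of the quantities involved (the residuals, plus an R² summary that follows the standard definition rather than code from the post):

y_pred = X @ w + b                  # vectorized predictions for every training sample
residuals = y - y_pred              # what the residual histogram and scatter plots show

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot     # share of SalePrice variance the model explains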

Model Diagnostics

Training Loss Curve

[figure: training loss vs. epoch]

Predicted vs Actual Sale Price

[figure: predicted vs. actual sale price]

Histogram of Residuals

[figure: distribution of residuals]

Residual vs Predicted Sale Price

[figure: residuals vs. predicted sale price]

Takeaways

  1. Preprocessing Matters
    • Standardizing features was critical for gradient descent to converge.
    • Filling missing values with 0 is naive but acceptable for this baseline.
  2. Training from Scratch
    • Implementing the loop manually clarified how updates occur per epoch.
    • Cost monitoring is a must to ensure the model is learning.
  3. Diagnostics Are Essential
    • Residuals should be centered and normally distributed.
    • Residual patterns (or lack thereof) reveal model bias and fit.
    • Feature importances offer insight into model behavior.
  4. Export and Inspect
    • Always save predictions with inputs for inspection.
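
A minimal sketch of that export, assuming the DataFrame, feature_cols, and trained parameters from above; the output filename is illustrative:

results = df[feature_cols].copy()                  # standardized inputs used for training
results["SalePrice"] = y                           # actual target
results["PredictedSalePrice"] = X @ w + b          # model prediction
results["Residual"] = results["SalePrice"] - results["PredictedSalePrice"]
results.to_csv("predictions.csv", index=False)     # inputs, target, prediction, residual side by side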