Simple Linear Regression on Housing Data (Notes)

Trying out a basic linear regression model using NumPy and pandas. Goal is to estimate house prices using just one feature: the ground living area (GrLivArea). Using the Kaggle House Prices dataset (train.csv).

Step 1: Load the data

load.py
import pandas as pd

df = pd.read_csv("train.csv")
df = df[["GrLivArea", "SalePrice"]]

Only keeping the columns we need.
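
A quick sanity check before modeling (a minimal sketch; the filename is illustrative, and the expected shape assumes the standard Kaggle train.csv):

check.py
print(df.shape)         # (1460, 2) for the standard Kaggle training set
print(df.isna().sum())  # neither column should have missing values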

Step 2: Define the model

Compute the predicted output of a simple linear regression model.

This function uses the linear equation:

\hat{y} = w \cdot x + b

Parameters:

  • X: np.ndarray — The input feature (e.g., square footage of a house).
  • w: float — The weight (slope), which determines how much x influences y.
  • b: float — The bias (intercept), which shifts the prediction up or down.

Returns:

  • np.ndarray — The predicted target values (e.g., sale prices).
predict.py
import numpy as np


def predict(X: np.ndarray, w: float, b: float) -> np.ndarray:
    return w * X + b
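
For example, with made-up parameter values (not fitted; just to show the shape of the call):

predict_example.py
sizes = np.array([1000, 2000, 3000])
print(predict(sizes, w=100.0, b=50_000.0))  # [150000. 250000. 350000.]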

Step 3: Cost function

Compute the Mean Squared Error cost function.

This cost function tells us how far off our predictions are from actual values.
It is defined as:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2

Parameters:

  • X: np.ndarray — Input feature values (independent variable).
  • y: np.ndarray — Actual target values (dependent variable).
  • w: float — Current weight parameter.
  • b: float — Current bias parameter.

Returns:

  • float — Half the mean squared error between predicted and actual values (the 1/2 simplifies the gradient).
compute_cost.py
def compute_cost(X: np.ndarray, y: np.ndarray, w: float, b: float) -> float:
    m = y.shape[0]
    predictions = predict(X, w, b)
    return (1 / (2 * m)) * ((predictions - y) ** 2).sum()
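
A quick sanity check on toy numbers (illustrative values only): a perfect fit should give a cost of exactly zero.

cost_check.py
X_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])
print(compute_cost(X_toy, y_toy, w=2.0, b=0.0))  # 0.0 — predictions match exactly
print(compute_cost(X_toy, y_toy, w=0.0, b=0.0))  # (4 + 16 + 36) / 6 ≈ 9.33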

Step 4: Gradients

Compute the gradients of the cost function with respect to w and b.

These gradients tell us how to change w and b to reduce the error.
They are partial derivatives of the cost function:

\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right) x_i
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)

The 1/2 in J cancels the factor of 2 produced by differentiating the square, so no extra constant appears in either gradient.

Parameters:

  • X: np.ndarray — Input feature values.
  • y: np.ndarray — Actual target values.
  • w: float — Current weight parameter.
  • b: float — Current bias parameter.
  • cost_values: list[float] | None — Optional list; when given, the current cost is appended on each call so training can be plotted later.

Returns:

  • tuple[float, float] — The gradients dw and db to be used in parameter updates.
compute_gradients.py
def compute_gradients(
    X: np.ndarray,
    y: np.ndarray,
    w: float,
    b: float,
    cost_values: list[float] | None = None,
) -> tuple[float, float]:
    predictions = predict(X, w, b)
    dw = ((predictions - y) * X).mean()
    db = (predictions - y).mean()

    if cost_values is not None:
        cost_values.append(compute_cost(X, y, w, b))

    return dw, db
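
A finite-difference check is a cheap way to confirm the analytic gradients match the math (a sketch; the toy inputs and epsilon are arbitrary):

gradient_check.py
eps = 1e-6
X_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])
w0, b0 = 0.5, 0.1

dw, db = compute_gradients(X_toy, y_toy, w0, b0)

# Central differences approximate dJ/dw and dJ/db numerically.
dw_num = (compute_cost(X_toy, y_toy, w0 + eps, b0)
          - compute_cost(X_toy, y_toy, w0 - eps, b0)) / (2 * eps)
db_num = (compute_cost(X_toy, y_toy, w0, b0 + eps)
          - compute_cost(X_toy, y_toy, w0, b0 - eps)) / (2 * eps)

print(abs(dw - dw_num), abs(db - db_num))  # both should be close to 0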

Step 5: Gradient descent

Perform gradient descent to optimize the parameters w and b.

This function iteratively updates w and b to minimize the cost function.
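
Each iteration applies the standard update rule, with learning rate \alpha:

w := w - \alpha \frac{\partial J}{\partial w}
b := b - \alpha \frac{\partial J}{\partial b}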

Parameters:

  • X: np.ndarray — Input feature values.
  • y: np.ndarray — Actual target values.
  • w: float — Initial weight parameter.
  • b: float — Initial bias parameter.
  • alpha: float — Learning rate, which controls how much we adjust w and b in each step.
  • epochs: int — Number of iterations to perform.
  • gradient_function: callable — Function to compute the gradients.
  • cost_values: list[float] | None — Optional list passed through to gradient_function for cost tracking.

Returns:

  • tuple[float, float] — The optimized parameters w and b after training.
gradient_descent.py
def gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    w: float,
    b: float,
    alpha: float,
    epochs: int,
    gradient_function: callable,
    cost_values: list[float] | None = None,
) -> tuple[float, float]:
    for _ in range(epochs):
        dw, db = gradient_function(X, y, w, b, cost_values)
        w -= alpha * dw
        b -= alpha * db
    return w, b
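
Since the notes below observe that the cost stabilizes after roughly 15 epochs, a small variation (a sketch, not used in the runs here) can stop early once the relative cost improvement becomes negligible:

early_stop.py
def gradient_descent_early_stop(
    X: np.ndarray,
    y: np.ndarray,
    w: float,
    b: float,
    alpha: float,
    max_epochs: int,
    tol: float = 1e-6,
) -> tuple[float, float]:
    prev_cost = compute_cost(X, y, w, b)
    for _ in range(max_epochs):
        dw, db = compute_gradients(X, y, w, b)
        w -= alpha * dw
        b -= alpha * db
        cost = compute_cost(X, y, w, b)
        # Stop when the relative improvement drops below the tolerance.
        if abs(prev_cost - cost) < tol * prev_cost:
            break
        prev_cost = cost
    return w, b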

Step 6: Training

training.py
X = df["GrLivArea"].to_numpy()
y = df["SalePrice"].to_numpy()
w, b = 0, 0
alpha = 1e-7
epochs = 20
cost_values = []

w, b = gradient_descent(X, y, w, b, alpha, epochs, compute_gradients, cost_values)
print(f"w = {w}, b = {b}")

Step 7: Visualization

Model fit:

viz.py
import matplotlib.pyplot as plt

prediction = predict(X, w, b)
plt.plot(X, prediction, color="blue", label="Model")
plt.scatter(X, y, marker="x", color="red", label="Actual")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.title("Linear Regression Fit")
plt.legend()
plt.show()
[figure: model fit — regression line over the GrLivArea vs. SalePrice scatter]

Cost over epochs:

cost_over_epoch.py
plt.plot(range(epochs), cost_values, color="purple")
plt.title("Cost over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()
[figure: cost over epochs]

Notes

  • Learning rate is very small due to large feature scale.
  • After ~15 epochs, cost stabilizes.
  • Not doing any normalization or feature scaling for now (see the sketch below).
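
If scaling were added later, z-score standardization of the feature (a sketch, not part of the runs above; the alpha and epochs values here are guesses, not tuned) should allow a much larger learning rate:

scaling.py
# Rescale GrLivArea to zero mean and unit variance.
X_mean, X_std = X.mean(), X.std()
X_scaled = (X - X_mean) / X_std

# With the feature on a unit scale, a conventional learning rate works.
w_s, b_s = gradient_descent(
    X_scaled, y, 0.0, 0.0, alpha=0.1, epochs=100,
    gradient_function=compute_gradients,
)

# Any new input must be transformed the same way before calling predict.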