Simple Linear Regression on Housing Data (Notes)

Trying out a basic linear regression model using NumPy and pandas. The goal is to estimate house prices from a single feature: the above-ground living area (GrLivArea). Using the Kaggle House Prices dataset (train.csv).
Step 1: Load the data
load.py
import pandas as pd

df = pd.read_csv("train.csv")
df = df[["GrLivArea", "SalePrice"]]
Only keeping the columns we need.
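A quick sanity check before modeling (just a sketch, not part of the original pipeline; the column names are the real Kaggle ones):

# Quick look at the two columns we keep.
print(df.shape)         # should be (1460, 2) for the standard Kaggle train.csv
print(df.isna().sum())  # both columns should report zero missing values
print(df.describe())    # gives a feel for the very different scales of the two columns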
Step 2: Define the model
Compute the predicted output of a simple linear regression model.
This function uses the linear equation:

ŷ = w · X + b

Parameters:
- X: np.ndarray — The input feature (e.g., square footage of a house).
- w: float — The weight (slope), which determines how much X influences the prediction.
- b: float — The bias (intercept), which shifts the prediction up or down.
Returns:
- np.ndarray — The predicted target values (e.g., sale prices).
predict.py
import numpy as np


def predict(X: np.ndarray, w: float, b: float) -> np.ndarray:
    return w * X + b
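A tiny usage check with made-up numbers (w = 100 and b = 50 are arbitrary, not fitted values), mainly to see the broadcasting over an array of areas:

areas = np.array([1000.0, 1500.0, 2000.0])
print(predict(areas, w=100.0, b=50.0))  # [100050. 150050. 200050.]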
Step 3: Cost function
Compute the Mean Squared Error cost function.
This cost function tells us how far off our predictions are from actual values.
It is defined as:

J(w, b) = (1 / (2m)) · Σᵢ (ŷᵢ − yᵢ)²

where m is the number of training examples and ŷᵢ = w · Xᵢ + b.
Parameters:
- X: np.ndarray — Input feature values (independent variable).
- y: np.ndarray — Actual target values (dependent variable).
- w: float — Current weight parameter.
- b: float — Current bias parameter.
Returns:
- float — Half the mean squared error between predicted and actual values (the 1/2 keeps the gradient tidy).
compute_cost.py
def compute_cost(X: np.ndarray, y: np.ndarray, w: float, b: float) -> float:
    m = y.shape[0]
    predictions = predict(X, w, b)
    return (1 / (2 * m)) * ((predictions - y) ** 2).sum()
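Two hand-checkable cases with made-up numbers: a perfect fit should cost exactly 0, and predictions that are each off by 1 over three points should cost 3 / (2 · 3) = 0.5:

# Toy data generated from w=2, b=1, so those parameters fit perfectly.
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = 2.0 * x_toy + 1.0
print(compute_cost(x_toy, y_toy, w=2.0, b=1.0))  # 0.0
print(compute_cost(x_toy, y_toy, w=2.0, b=2.0))  # 0.5 (every prediction is off by exactly 1)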
Step 4: Gradients
Compute the gradients of the cost function with respect to w and b.
These gradients tell us how to change w and b to reduce the error.
They are the partial derivatives of the cost function:

∂J/∂w = (1/m) · Σᵢ (ŷᵢ − yᵢ) · Xᵢ
∂J/∂b = (1/m) · Σᵢ (ŷᵢ − yᵢ)

Parameters:
- X: np.ndarray — Input feature values.
- y: np.ndarray — Actual target values.
- w: float — Current weight parameter.
- b: float — Current bias parameter.
- cost_values: list[float] | None — Optional list; if provided, the cost at the current parameters is appended to it for later plotting.
Returns:
- tuple[float, float] — The gradients dw and db to be used in parameter updates.
compute_gradients.py
def compute_gradients(
    X: np.ndarray, y: np.ndarray, w: float, b: float, cost_values: list[float] | None = None
) -> tuple[float, float]:
    predictions = predict(X, w, b)
    dw = ((predictions - y) * X).mean()
    db = (predictions - y).mean()
    if cost_values is not None:
        cost_values.append(compute_cost(X, y, w, b))
    return dw, db
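One way to gain confidence in these formulas is a finite-difference check on dw (a sketch with made-up data; the epsilon value is arbitrary):

# Compare the analytic dw against a centered finite difference of the cost.
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 7.0])
w0, b0, eps = 0.5, 0.1, 1e-6
dw, db = compute_gradients(x_toy, y_toy, w0, b0)
dw_num = (compute_cost(x_toy, y_toy, w0 + eps, b0)
          - compute_cost(x_toy, y_toy, w0 - eps, b0)) / (2 * eps)
print(dw, dw_num)  # the two numbers should match to several decimal places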
Step 5: Gradient descent
Perform gradient descent to optimize the parameters w and b.
This function iteratively updates w and b to minimize the cost function.
Each epoch applies the updates w ← w − α · dw and b ← b − α · db.
Parameters:
- X: np.ndarray — Input feature values.
- y: np.ndarray — Actual target values.
- w: float — Initial weight parameter.
- b: float — Initial bias parameter.
- alpha: float — Learning rate, which controls how much we adjust w and b in each step.
- epochs: int — Number of iterations to perform.
- gradient_function: callable — Function to compute the gradients.
- cost_values: list[float] | None — Optional list passed through to the gradient function to record the cost per epoch.
Returns:
- tuple[float, float] — The optimized parameters w and b after training.
gradient_descent.py
def gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    w: float,
    b: float,
    alpha: float,
    epochs: int,
    gradient_function: callable,
    cost_values: list[float] | None = None,
) -> tuple[float, float]:
    for _ in range(epochs):
        dw, db = gradient_function(X, y, w, b, cost_values)
        w -= alpha * dw
        b -= alpha * db
    return w, b
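On a tiny synthetic problem where the generating parameters are known (made-up data; alpha and epochs chosen just for this scale), the loop should land close to them:

# Data generated from w=3, b=2 with no noise; gradient descent should recover both.
x_syn = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_syn = 3.0 * x_syn + 2.0
w_syn, b_syn = gradient_descent(x_syn, y_syn, 0.0, 0.0, 0.1, 5000, compute_gradients)
print(w_syn, b_syn)  # approximately 3.0 and 2.0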
Step 6: Training
training.py
X = df["GrLivArea"].to_numpy()
y = df["SalePrice"].to_numpy()
w, b = 0, 0
alpha = 1e-7
epochs = 20
cost_values = []
w, b = gradient_descent(X, y, w, b, alpha, epochs, compute_gradients, cost_values)
print(f"w = {w}, b = {b}")
Step 7: Visualization
Model fit:
viz.py
import matplotlib.pyplot as plt

prediction = predict(X, w, b)
plt.plot(X, prediction, color="blue", label="Model")
plt.scatter(X, y, marker="x", color="red", label="Actual")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.title("Linear Regression Fit")
plt.legend()
plt.show()

Cost over epochs:
cost_over_epoch.py
plt.plot(range(epochs), cost_values, color="purple")
plt.title("Cost over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

Notes
- Learning rate is very small due to large feature scale.
- After ~15 epochs, cost stabilizes.
- Not doing any normalization or feature scaling for now; a rough scaling sketch follows below.
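For reference, a minimal sketch of what z-score scaling of the feature could look like (an assumption, not something the pipeline above does); with the feature standardized, a much larger learning rate works:

# Hypothetical: standardize GrLivArea so gradient descent can take bigger steps.
X_scaled = (X - X.mean()) / X.std()
w_s, b_s = gradient_descent(X_scaled, y, 0.0, 0.0, 0.1, 200, compute_gradients)
print(w_s, b_s)  # note: these parameters are in scaled units, not raw square feet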