Logistic Regression from Scratch with Python (Full Implementation)

Here I use the Bank Marketing Dataset, which contains customer attributes and a binary label: did the customer subscribe to a term deposit?
If you have not read Multiple Linear Regression from Scratch (with Diagnostics), hop on over there first.
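Before the math, here is a minimal sketch of how the data gets loaded. The file name bank.csv is an assumption about your local copy of the dataset; the target column deposit is the one used throughout this post.
import numpy as np
import pandas as pd

# Assumed local copy of the Bank Marketing Dataset (the file name is hypothetical)
df = pd.read_csv("bank.csv")
print(df.shape)                        # Rows and raw columns
print(df["deposit"].value_counts())    # Binary target: did the customer subscribe?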
Step 1: Prediction + Sigmoid
The sigmoid function maps any real-valued number to a value between 0 and 1, making it perfect for binary classification.
prediction.py
import numpy as np

def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    # Linear combination of features and weights
    return np.dot(X, w) + b

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Clip the input to avoid overflow in np.exp for large negative values
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))
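As a quick sanity check (a sketch of my own, with made-up numbers): sigmoid should return 0.5 at zero and squash any linear prediction into a probability.
w_demo = np.array([0.5, -0.25])   # Hypothetical weights, for illustration only
x_demo = np.array([2.0, 4.0])     # Hypothetical single example
print(sigmoid(0))                                # 0.5
print(sigmoid(predict(x_demo, w_demo, b=0.1)))   # A probability between 0 and 1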
Step 2: Logistic Regression Cost Function
Logistic regression uses the binary cross-entropy (or log loss) as its cost function:

$$J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$

- $m$: number of training examples
- $\hat{y}^{(i)}$: predicted probability for example $i$
- $y^{(i)}$: true label (0 or 1)
compute_cost.py
def compute_cost(X, y, w, b):
    m = y.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = sigmoid(predict(X[i, :], w, b))
        f_wb_i = np.clip(f_wb_i, 1e-15, 1 - 1e-15)  # Prevent log(0)
        loss = -(y[i] * np.log(f_wb_i) + (1 - y[i]) * np.log(1 - f_wb_i))
        cost += loss
    return cost / m

def compute_cost_vectorized(X: np.ndarray, y: np.ndarray, w: np.ndarray, b: float) -> float:
    # Vectorized implementation
    z = np.dot(X, w) + b  # Linear combination for all samples
    y_hat = sigmoid(z)
    y_hat = np.clip(y_hat, 1e-15, 1 - 1e-15)  # Avoid log(0)
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return cost
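A quick way to convince yourself the loop and vectorized versions agree is to run both on some toy data (this check is my own addition; the arrays are made up):
rng = np.random.default_rng(0)                    # Toy data purely for a sanity check
X_toy = rng.normal(size=(5, 3))
y_toy = rng.integers(0, 2, size=5).astype(float)
w_toy = rng.normal(size=3)
print(compute_cost(X_toy, y_toy, w_toy, b=0.0))
print(compute_cost_vectorized(X_toy, y_toy, w_toy, b=0.0))  # Should match the loop version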
Step 3: Gradient Computation
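For reference, differentiating the cost above gives the gradients the code below implements (same notation as Step 2):

$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$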
compute_gradients.py
def compute_gradients(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
) -> tuple[np.ndarray, float]:
    m = y.shape[0]  # Number of training examples
    dw = np.zeros(w.shape)  # Initialize gradient for weights
    db = 0.0  # Initialize gradient for bias
    # Loop through each training example
    for i in range(m):
        y_hat = sigmoid(predict(X[i, :], w, b))
        error = y_hat - y[i]
        # Update the gradients
        dw = dw + error * X[i, :]
        db = db + error  # Update the bias gradient
    return dw / m, db / m  # Average the gradients over all examples

def compute_gradients_vectorized(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
) -> tuple[np.ndarray, float]:
    m = y.shape[0]
    z = np.dot(X, w) + b
    y_hat = sigmoid(z)
    error = y_hat - y
    dw = np.dot(X.T, error) / m
    db = np.sum(error) / m
    return dw, db
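As with the cost, the two gradient implementations should agree; here is a self-contained toy-data check (my own sketch, not part of the original post):
rng = np.random.default_rng(1)                    # Toy data purely for a sanity check
X_toy = rng.normal(size=(6, 4))
y_toy = rng.integers(0, 2, size=6).astype(float)
w_toy = np.zeros(4)
dw_loop, db_loop = compute_gradients(X_toy, y_toy, w_toy, b=0.0)
dw_vec, db_vec = compute_gradients_vectorized(X_toy, y_toy, w_toy, b=0.0)
print(np.allclose(dw_loop, dw_vec), np.isclose(db_loop, db_vec))  # Both should print True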
Step 4: Gradient Descent
gradient_descent.py
def gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
    alpha: float = 0.01,
    epochs: int = 1000,
    gradient_function=compute_gradients_vectorized,
    cost_values: list[float] | None = None,
) -> tuple[np.ndarray, float]:
    for epoch in range(epochs):
        dw, db = gradient_function(X, y, w, b)
        w -= alpha * dw
        b -= alpha * db
        cost = compute_cost_vectorized(X, y, w, b)  # Track the cost after each update
        if cost_values is not None:
            cost_values.append(cost)
        print(f"Epoch {epoch + 1}/{epochs}, Cost: {cost:.4f}")
    return w, b
Step 5: Data Cleaning & Feature Engineering
data_cleaning.py
# Age → Binned age_group
df["age_group"] = pd.cut(df["age"], bins=[18,25,35,45,55,65,100],
labels=["18–24", "25–34", "35–44", "45–54", "55–64", "65+"])
df.drop(columns=["age", "day"], inplace=True)
# Binary flags
df["default_flag"] = df["default"].map({"yes": 1, "no": 0})
df["housing_flag"] = df["housing"].map({"yes": 1, "no": 0})
df["loan_flag"] = df["loan"].map({"yes": 1, "no": 0})
df["deposit"] = df["deposit"].map({"yes": 1, "no": 0})
df.drop(columns=["default", "housing", "loan"], inplace=True)
# One-hot encode categorical variables
df = pd.get_dummies(df, columns=[
    "age_group", "marital", "education", "poutcome",
    "job", "contact", "month"
], drop_first=True)
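At this point everything should be numeric (or boolean dummies). A quick check before standardizing (my own addition, not from the post):
print(df.shape)                   # Rows and engineered feature columns
print(df.dtypes.value_counts())   # Only numeric/boolean dtypes should remain after encoding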
Step 6: Standardization
standardization.py
from sklearn.preprocessing import StandardScaler

feature_cols = [col for col in df.columns if col != "deposit"]
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])
X = df[feature_cols].astype(float).to_numpy()
y = df["deposit"].to_numpy()
Step 7: Train the Model
train.py
w = np.zeros(X.shape[1])
b = 0.0
epochs = 1000
cost_values = []
w, b = gradient_descent(X, y, w, b, alpha=0.1, epochs=epochs,
                        gradient_function=compute_gradients,
                        cost_values=cost_values)
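Since cost_values now holds the loss after every epoch, plotting it is an easy way to watch training converge. The matplotlib snippet is my own addition; the post itself only talks about watching the loss curve.
import matplotlib.pyplot as plt

plt.plot(range(1, epochs + 1), cost_values)
plt.xlabel("Epoch")
plt.ylabel("Binary cross-entropy")
plt.title("Training loss curve")
plt.show()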
Step 8: Model Evaluation
I'm gonna post about model evaluation later! That deserves its own article.
Learnings
Working through this project, I really came to appreciate how much of machine learning is about the data, not just the math. I started by wrangling the raw bank dataset, turning categorical variables into numbers, binning ages into groups, and dropping columns that didn’t add value. One-hot encoding for multi-class features was a must, and I saw how even a single irrelevant column could throw off the whole model. It’s clear now that careful feature engineering and cleaning are the foundation for any successful model.
One of the biggest lessons was the importance of feature scaling. Initially, my model's weights would explode, or the loss would get stuck, all because my features were on wildly different scales (I forgot that lesson from the housing dataset in the last post). Once I standardized everything with sklearn's StandardScaler, training settled down. I also learned that the learning rate is a delicate balance: too high and the model diverges, too low and it barely learns. Watching the loss curve in real time was a great way to tune this and see whether my model was improving.
I encountered several numerical issues, including overflow in the sigmoid function and the problem of taking the log of zero. These bugs taught me always to clip my predicted probabilities and inputs to the sigmoid. Interpreting the loss curve became second nature: if it’s flat or bouncing around, something’s wrong; if it’s steadily dropping, I’m on the right track.
Most of all, this project was a reminder that machine learning is an iterative, hands-on process, much like regular, run-of-the-mill product development. I had to debug, experiment, and check every step. It's not just about getting a model to run, but about understanding every part and being able to explain why it works.