Logistic Regression from Scratch with Python (Full Implementation)


Here I use the Bank Marketing Dataset, which contains customer attributes and a binary label: did the customer subscribe to a term deposit?

If you have not read Multiple Linear Regression from Scratch (with Diagnostics), hop on over there first.

Step 1: Prediction + Sigmoid

The sigmoid function maps any real-valued number to a value between 0 and 1, making it perfect for binary classification.

\sigma(z) = \frac{1}{1 + e^{-z}}

prediction.py
import numpy as np

def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    # Linear combination z = w · x + b
    return np.dot(w, X) + b

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Map any real value into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))
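
A quick toy example (the numbers are made up purely to exercise the two functions) shows how they fit together:

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1
sigmoid(predict(x, w, b))  # z = 0.5*1 - 0.25*2 + 0.1 = 0.1, so the output is ≈ 0.525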

Step 2: Logistic Regression Cost Function

Logistic regression uses the binary cross-entropy (or log loss) as its cost function:

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]

  • m: number of training examples
  • \hat{y}^{(i)}: predicted probability for example i
  • y^{(i)}: true label (0 or 1)
compute_cost.py
def compute_cost(X, y, w, b):
    m = y.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = sigmoid(predict(X[i, :], w, b))
        f_wb_i = np.clip(f_wb_i, 1e-15, 1 - 1e-15)  # Prevent log(0)
        loss = -(y[i] * np.log(f_wb_i) + (1 - y[i]) * np.log(1 - f_wb_i))
        cost += loss
    return cost / m

def compute_cost_vectorized(X: np.ndarray, y: np.ndarray, w: np.ndarray, b: float) -> float:
    # Vectorized implementation
    z = np.dot(X, w) + b  # Linear combination for all samples
    y_hat = sigmoid(z)
    y_hat = np.clip(y_hat, 1e-15, 1 - 1e-15)  # Avoid log(0)
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return cost
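
To convince yourself that the loop and vectorized versions agree, a throwaway check on random data works well (the shapes and seed here are arbitrary):

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(100, 5))
y_chk = rng.integers(0, 2, size=100)
w_chk = rng.normal(size=5)

assert np.isclose(
    compute_cost(X_chk, y_chk, w_chk, 0.3),
    compute_cost_vectorized(X_chk, y_chk, w_chk, 0.3),
)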

Step 3: Gradient Computation
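
The gradients of the cross-entropy cost have a pleasantly simple form, the same shape as in linear regression, except that the prediction now passes through the sigmoid:

\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)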

compute_gradients.py
def compute_gradients(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
) -> tuple[np.ndarray, float]:
    m = y.shape[0]  # Number of training examples

    # Loop through each training example
    dw = np.zeros(w.shape)  # Initialize gradient for weights
    db = 0.0  # Initialize gradient for bias
    for i in range(m):
        y_hat = sigmoid(predict(X[i, :], w, b))
        error = y_hat - y[i]
        # Update the gradients
        dw = dw + error * X[i, :]
        db = db + error  # Update the bias gradient

    return dw / m, db / m  # Average the gradients over all examples


def compute_gradients_vectorized(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
) -> tuple[np.ndarray, float]:
    m = y.shape[0]
    z = np.dot(X, w) + b
    y_hat = sigmoid(z)
    error = y_hat - y
    dw = np.dot(X.T, error) / m
    db = np.sum(error) / m
    return dw, db
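
A finite-difference check is a cheap way to verify the analytic gradients. This is only a debugging sketch (the epsilon and tolerances are arbitrary), not part of the training pipeline:

def gradient_check(X, y, w, b, eps=1e-6):
    # Compare analytic gradients against central finite differences of the cost
    dw, db = compute_gradients_vectorized(X, y, w, b)
    dw_num = np.zeros_like(w)
    for j in range(w.shape[0]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        dw_num[j] = (compute_cost_vectorized(X, y, w_plus, b)
                     - compute_cost_vectorized(X, y, w_minus, b)) / (2 * eps)
    db_num = (compute_cost_vectorized(X, y, w, b + eps)
              - compute_cost_vectorized(X, y, w, b - eps)) / (2 * eps)
    return np.allclose(dw, dw_num, atol=1e-5) and np.isclose(db, db_num, atol=1e-5)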

Step 4: Gradient Descent

gradient_descent.py
def gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
    alpha: float = 0.01,
    epochs: int = 1000,
    gradient_function=compute_gradients_vectorized,
    cost_values: list[float] | None = None,
) -> tuple[np.ndarray, float]:
    for epoch in range(epochs):
        dw, db = gradient_function(X, y, w, b)
        w -= alpha * dw
        b -= alpha * db

        cost = compute_cost_vectorized(X, y, w, b)  # Compute the cost
        if cost_values is not None:
            cost_values.append(cost)

        print(f"Epoch {epoch + 1}/{epochs}, Cost: {cost:.4f}")

    return w, b
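
Before pointing this at the real dataset, a tiny smoke test on synthetic data (completely made up; the only point is to watch the cost fall) is reassuring:

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)  # Easy, linearly separable labels

toy_costs: list[float] = []
w_toy, b_toy = gradient_descent(
    X_toy, y_toy, np.zeros(2), 0.0,
    alpha=0.5, epochs=200,
    gradient_function=compute_gradients_vectorized,
    cost_values=toy_costs,
)
print(toy_costs[0], "->", toy_costs[-1])  # The cost should drop sharply on this easy problem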

Step 5: Data Cleaning & Feature Engineering
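
The cleaning below assumes the CSV has already been loaded into a pandas DataFrame, roughly like this (the file name bank.csv is just a placeholder for wherever your copy of the dataset lives):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("bank.csv")  # Placeholder path to the Bank Marketing CSV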

data_cleaning.py
# Age → Binned age_group
df["age_group"] = pd.cut(df["age"], bins=[18,25,35,45,55,65,100], 
                         labels=["18–24", "25–34", "35–44", "45–54", "55–64", "65+"])
df.drop(columns=["age", "day"], inplace=True)

# Binary flags
df["default_flag"] = df["default"].map({"yes": 1, "no": 0})
df["housing_flag"] = df["housing"].map({"yes": 1, "no": 0})
df["loan_flag"] = df["loan"].map({"yes": 1, "no": 0})
df["deposit"] = df["deposit"].map({"yes": 1, "no": 0})
df.drop(columns=["default", "housing", "loan"], inplace=True)

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=[
    "age_group", "marital", "education", "poutcome", 
    "job", "contact", "month"
], drop_first=True)

Step 6: Standardization

standardization.py
feature_cols = [col for col in df.columns if col != "deposit"]
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])
X = df[feature_cols].astype(float).to_numpy()
y = df["deposit"].to_numpy()
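
A quick sanity check that the scaling did what it should, before handing the arrays to gradient descent:

print(X.shape, y.shape)
print(X.mean(axis=0).round(3))  # Should all be ~0
print(X.std(axis=0).round(3))   # Should all be ~1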

Step 7: Train the Model

train.py
w = np.zeros(X.shape[1])  # One weight per feature
b = 0.0
epochs = 1000
cost_values = []

# compute_gradients works, but compute_gradients_vectorized is much faster over many epochs
w, b = gradient_descent(X, y, w, b, alpha=0.1, epochs=epochs,
                        gradient_function=compute_gradients,
                        cost_values=cost_values)
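
The cost_values list makes it easy to plot a training-loss curve like the one shown below; a minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

plt.plot(range(1, epochs + 1), cost_values)
plt.xlabel("Epoch")
plt.ylabel("Binary cross-entropy cost")
plt.title("Training loss")
plt.show()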

Step 8: Model Evaluation

... gonna post about model evaluation in detail later! That deserves its own article. For now, here are the evaluation plots:

Figure: Predicted probability distribution

Figure: ROC curve

Figure: Training loss

Figure: Feature importance

Figure: Precision-recall curve

Figure: Confusion matrix

Learnings

Working through this project, I really came to appreciate how much of machine learning is about the data, not just the math. I started by wrangling the raw bank dataset, turning categorical variables into numbers, binning ages into groups, and dropping columns that didn’t add value. One-hot encoding for multi-class features was a must, and I saw how even a single irrelevant column could throw off the whole model. It’s clear now that careful feature engineering and cleaning are the foundation for any successful model.

One of the biggest lessons was the importance of feature scaling. Initially, my model's weights would explode, or the loss would get stuck, all because my features were on wildly different scales (I forgot that lesson from the housing dataset in the last post). Once I standardized everything with sklearn's StandardScaler, those problems went away. I also learned that the learning rate is a delicate balance: too high and the model diverges, too low and it barely learns. Watching the loss curve in real time was a great way to tune this and see whether my model was improving.

I encountered several numerical issues, including overflow in the sigmoid function and the problem of taking the log of zero. These bugs taught me always to clip my predicted probabilities and inputs to the sigmoid. Interpreting the loss curve became second nature: if it’s flat or bouncing around, something’s wrong; if it’s steadily dropping, I’m on the right track.
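
For reference, a guard along these lines keeps np.exp from overflowing inside the sigmoid (the ±500 bound is an arbitrary but comfortable margin, since float64 only overflows around e^709):

def sigmoid_stable(z: np.ndarray) -> np.ndarray:
    # Clip the input so np.exp(-z) can never overflow float64
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))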

Most of all, this project was a reminder that machine learning is an iterative, hands-on process, much like regular, run-of-the-mill product development. I had to debug, experiment, and check every step. It's not just about getting a model to run, but about understanding every part and being able to explain why it works.