Logistic Regression: Faceoff and Conceptual Understanding

Who ordered this?
As of this writing, a Google search for "logistic regression tutorial" returns about 11.2M results. Why add another one to the pile?
After reading a good number of articles, books and guides, I realized that most lack a clear and intuitive explanation of how logistic regression works. They usually strive to be either practical, by showing how to run models, or as mathematically complete as possible. As a consequence, the basic concepts get buried underneath a forest of matrix algebra.
We will start by clearing up what seem to be common misconceptions. Logistic regression is not:
- linear regression with a sigmoid curve in place of a straight line
- a classification algorithm (although it can be used for classification)
- a sigmoid curve "fit" to a decision boundary separating two classes of points in the x-y plane
What is logistic regression?
Logistic regression is a regression model that returns the probability of a binary outcome (0 or 1), assuming that the log of the odds is a linear combination of one or more inputs. The odds are the ratio between the probability of the outcome happening (p) and the probability of the outcome not happening (1-p). When we have a single input, or predictor, x, this starting assumption is mathematically expressed as:
log[p / (1-p)] = β₀ + β₁x
The goal of logistic regression is to model cases where the inputs shift the outcome probability progressively from 0 to 1. The probability of the outcome being 1, p, can be derived from the previous equation and expressed as a function of the input:
p(x) = 1 / (1 + exp[-(β₀ + β₁x)]) = 1 / (1 + exp[-k(x - x₀)]), where k = β₁ and x₀ = -β₀/β₁
In the last part we swapped from parameters β₁ and β₀ to k and x₀. Using k and x₀ will give us a clearer picture of the model as we go along. We will also stick to a single predictor variable x, as opposed to marching in with an army of matrices, so we can easily visualize logistic fits.
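To make the swap from (β₀, β₁) to (k, x₀) concrete, here is a minimal sketch. Matching β₀ + β₁x with k(x - x₀) gives k = β₁ and x₀ = -β₀/β₁. The values β₀ = -7.5 and β₁ = 3 are illustrative assumptions, picked only because they reproduce the k = 3 and x₀ = 2.5 used in the plots. The check at the end confirms that the log of the odds of p(x) is a straight line in x.

```python
import numpy as np

def logistic(x, k, x0):
    return 1 / (1 + np.exp(-k * (x - x0)))

# Illustrative (assumed) intercept and slope of the log-odds line
beta0, beta1 = -7.5, 3.0

# Matching beta0 + beta1 * x with k * (x - x0)
k = beta1
x0 = -beta0 / beta1

x = np.linspace(0, 5, 11)
p = logistic(x, k, x0)
log_odds = np.log(p / (1 - p))

print(k, x0)
# True if the log-odds of p(x) equals the straight line beta0 + beta1 * x
print(np.allclose(log_odds, beta0 + beta1 * x))
```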
Logistic curve
We will begin by plotting the logistic curve, with parameters x₀ = 2.5 and k = 3, on an interval x between 0 and 5:
import numpy as np
import pandas as pd
import plotnine as p9
from scipy.stats import uniform, bernoulli
# From https://github.com/igor-sb/blog/blob/main/posts/logistic/plots.py
from logistic.plots import plot_naive_logistic_fit
def logistic(x, k, x0):
    return 1 / (1 + np.exp(-k*(x - x0)))
def create_smooth_logistic_curve_data(k, x0, n_points=100):
    df = pd.DataFrame({'x': np.linspace(0, 5, n_points)})
    df['p_x'] = logistic(df['x'], k, x0)
    return df
def create_sample_data(k, x0, n_points, seed=1):
    np.random.seed(seed)
    df = pd.DataFrame({
        'x': uniform.rvs(loc=0, scale=5, size=n_points)
    }).sort_values('x', ignore_index=True)
    p_x = logistic(df['x'], k, x0)
    df['y'] = bernoulli.rvs(p_x)
    return df
sample_df = create_sample_data(k=3, x0=2.5, n_points=30)
smooth_px_df = create_smooth_logistic_curve_data(k=3, x0=2.5)
plot_naive_logistic_fit(sample_df, smooth_px_df)
This logistic curve p(x) is described by two parameters:
- x₀ is the value of the predictor x at which the probability is 0.5 (the mid-point): p(x = x₀) = 0.5. It tells us where the mid-point is located.
- k sets the slope of the probability at the mid-point: dp/dx evaluated at x = x₀ equals k/4. It tells us how steep the curve is at the mid-point: the larger the k, the steeper the curve in the middle.
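Both properties are easy to verify numerically. The sketch below redefines logistic so that it runs on its own and estimates the slope with a central finite difference; the step size h is an arbitrary choice.

```python
import numpy as np

def logistic(x, k, x0):
    return 1 / (1 + np.exp(-k * (x - x0)))

k, x0 = 3.0, 2.5
h = 1e-6  # finite-difference step (arbitrary small value)

# Probability at the mid-point
p_mid = logistic(x0, k, x0)

# Central-difference estimate of dp/dx at the mid-point
slope_mid = (logistic(x0 + h, k, x0) - logistic(x0 - h, k, x0)) / (2 * h)

print(p_mid)
print(slope_mid, k / 4)
```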
If we naively employed ordinary least squares to fit the curve p(x) to these points, all residuals would be smaller than 1 in absolute value, and most points on the "wrong side" of the mid-point would have residuals close to 1, however far out they sit. It would make more sense to assign a much larger cost to points that are large outliers.
Log-loss fit
Instead of trying to make ordinary least squares work to fit p(x) to the points, logistic regression proceeds differently:
- For the teal points at y = 1, we will fit -log p(x) instead of p(x). The negative logarithm makes -log p(x) progressively larger as p(x) approaches zero.
- For the red points at y = 0, we do the same using the probability that the outcome is zero: -log[1-p(x)].
We call these "log-losses". If we collapse all the points to y = 0, then for each point the log-loss curve of its color gives the cost (loss) of that point: the height of the curve at the point's x. In order to utilize numpy vectorization, we will code these two together as a single log-loss function (this combined log-loss also goes by the name "cross-entropy"):
def log_loss(p_x, y):
    return -y * np.log(p_x) - (1 - y) * np.log(1 - p_x)
One way to think about logistic regression is as a method that simultaneously fits -log p(x) for y = 1 and -log[1-p(x)] for y = 0.
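To see that the combined function really is just the two log-losses glued together, we can evaluate it on a few hand-picked probabilities (the values below are arbitrary). With y = 1 the second term vanishes and with y = 0 the first one does.

```python
import numpy as np

def log_loss(p_x, y):
    return -y * np.log(p_x) - (1 - y) * np.log(1 - p_x)

# Arbitrary example probabilities
p_x = np.array([0.9, 0.5, 0.1])

loss_y1 = log_loss(p_x, np.ones(3))   # reduces to -log p(x)
loss_y0 = log_loss(p_x, np.zeros(3))  # reduces to -log[1 - p(x)]

# Mixed labels pick the matching branch point by point
loss_mixed = log_loss(p_x, np.array([1, 0, 1]))

print(loss_y1)
print(loss_y0)
print(loss_mixed)
```

Note that this form returns nan when p_x is exactly 0 or 1, because numpy evaluates 0 * log(0). That does not happen for the parameter values used in this post.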
How do these two log-loss curves look?
To visualize them, we will plot the same data as in the previous plot, but now with log-losses instead of probabilities:
def create_smooth_logloss_data(k, x0, n_points=100):
    x = np.linspace(0, 5, n_points)
    p_x = logistic(x, k, x0)
    return pd.DataFrame({
        'x': np.concatenate((x, x)),
        'y': np.concatenate(([0] * len(x), [1] * len(x))),
        'log_loss': np.concatenate((log_loss(p_x, 0), log_loss(p_x, 1))),
    })
def add_logloss(df, k, x0):
    p_x = logistic(df['x'], k, x0)
    return df.assign(log_loss = log_loss(p_x, df['y']))
def fit_data_to_logloss(sample_df, k, x0):
    sample_fit_df = add_logloss(sample_df, k, x0)
    logloss_df = create_smooth_logloss_data(k, x0)
    return (sample_fit_df, logloss_df)
from logistic.plots import plot_logistic_fit
sample_fit_df, logloss_df = fit_data_to_logloss(sample_df, k=3, x0=2.5)
plot_logistic_fit(sample_fit_df, logloss_df)
Here we collapsed all points to y = 0 but kept the colors as y labels, since the values of the log-losses on their own represent the cost. Red points (y = 0) are fit to the red hockey stick curve: -log[1-p(x)]. Teal points (y = 1) are fit to the teal hockey stick curve: -log p(x). The sum of the lengths of the vertical dashed lines is the total log-loss, which is what needs to be minimized over k and x₀.
Unlike the probability curve, the log-loss curves penalize big outliers proportionally more, and their residuals do not cap out at 1.
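A small comparison makes the difference tangible. Take a single point with y = 1 and let the model assign it smaller and smaller probabilities (the values below are arbitrary). The squared residual, which is what least squares would use, saturates just below 1, while the log-loss keeps growing.

```python
import numpy as np

# Probabilities assigned to a point whose true label is y = 1 (arbitrary values)
p = np.array([0.5, 0.1, 0.01, 0.001])

squared_error = (1 - p) ** 2   # least-squares cost, bounded by 1
log_loss_y1 = -np.log(p)       # log-loss for y = 1, unbounded

for p_i, se_i, ll_i in zip(p, squared_error, log_loss_y1):
    print(f"p = {p_i:<6} squared error = {se_i:.3f}   log-loss = {ll_i:.3f}")
```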
Finding the minimal log-loss
How does changing k and x₀ affect this fit? To answer this, we can run fits with various combinations of k and x₀.
def fit_parameter_combinations(sample_df, combinations):
    sample_df_list = []
    logloss_df_list = []
    for k, x0 in combinations:
        sample_fit_df, logloss_df = fit_data_to_logloss(sample_df, k, x0)
        sample_fit_df['k'] = logloss_df['k'] = k
        sample_fit_df['x0'] = logloss_df['x0'] = x0
        sample_df_list.append(sample_fit_df)
        logloss_df_list.append(logloss_df)
    return (
        pd.concat(sample_df_list, ignore_index=True),
        pd.concat(logloss_df_list, ignore_index=True)
    )
Changing x₀ moves the intersection point sideways:
# From https://github.com/igor-sb/blog/blob/main/posts/logistic/plots.py
from logistic.plots import plot_logistic_fit_panel
x0_dfs, x0_logloss_dfs = fit_parameter_combinations(
    sample_df,
    [(3, 1.5), (3, 2.5), (3, 3.5)]
)
plot_logistic_fit_panel(x0_dfs, x0_logloss_dfs, '~x0')
If x₀ is chosen away from the optimal point, the log-loss increases because an increasing number of points land on the rising parts of the hockey sticks.
Changing k affects the steepness of the log-loss curves (note the different y axes):
k_dfs, k_logloss_dfs = fit_parameter_combinations(
    sample_df,
    [(0.5, 2.5), (3, 2.5), (7, 2.5)]
)
plot_logistic_fit_panel(k_dfs, k_logloss_dfs, '~k')
If k is too low (0.5), most points add small but significant amounts to the total log-loss. If k is too high (7.0), only the points on the "wrong side" contribute a significant amount to the total log-loss. In this case, these are the two teal points to the left of the mid-point at x₀ = 2.5.
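Rather than eyeballing three values at a time, we can scan a whole grid of (k, x₀) pairs and pick the one with the smallest total log-loss. The sketch below is a brute-force illustration, not how statistical packages fit the model. It is self-contained, so it draws its own sample with numpy's default_rng (an assumption: this is not the same sample as in the plots), and the grid ranges are arbitrary choices.

```python
import numpy as np

def logistic(x, k, x0):
    return 1 / (1 + np.exp(-k * (x - x0)))

def total_log_loss(x, y, k, x0):
    p = logistic(x, k, x0)
    return np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Simulated sample with true k = 3 and x0 = 2.5 (not the sample from the plots)
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = rng.binomial(1, logistic(x, 3, 2.5))

# Arbitrary grid of candidate parameters
k_grid = np.linspace(0.5, 7, 14)
x0_grid = np.linspace(1.5, 3.5, 21)

# Total log-loss for every (k, x0) combination
losses = np.array([
    [total_log_loss(x, y, k, x0) for x0 in x0_grid]
    for k in k_grid
])

# Grid point with the smallest total log-loss
i, j = np.unravel_index(np.argmin(losses), losses.shape)
best_k, best_x0 = k_grid[i], x0_grid[j]
print(best_k, best_x0, losses[i, j])
```

Because the sample is random, the best grid point will generally sit near, but not exactly at, the parameters used to generate the data.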
This brings up a question: what if there are no points on the "wrong side" of the mid-point, such as when the data is perfectly separated?
Perfectly separated data
It turns out, the logistic model cannot fit data that is perfectly separated!
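A minimal sketch shows what goes wrong. We build a made-up, perfectly separated data set (every point left of 2.5 has y = 0, every point to the right has y = 1), keep x₀ = 2.5 and evaluate the total log-loss for increasing k. To avoid taking log(0) at large k, the sketch uses the algebraically equivalent form log(1 + exp(∓k(x - x₀))) through np.logaddexp instead of the log_loss function from earlier.

```python
import numpy as np

def total_log_loss_stable(x, y, k, x0):
    z = k * (x - x0)
    sign = 2 * y - 1  # +1 for y = 1, -1 for y = 0
    # -log p(x) = log(1 + exp(-z)) and -log[1 - p(x)] = log(1 + exp(z))
    return np.sum(np.logaddexp(0, -sign * z))

# Made-up, perfectly separated data: no point is on the "wrong side" of 2.5
x = np.linspace(0, 5, 30)
y = (x > 2.5).astype(float)

ks = [1, 3, 10, 30, 100]
losses = [total_log_loss_stable(x, y, k, 2.5) for k in ks]

for k, loss in zip(ks, losses):
    print(f"k = {k:<4} total log-loss = {loss:.6f}")
```

The total log-loss keeps shrinking toward zero as k grows, so there is no finite k at which it reaches a minimum.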

