Linear Regression from Scratch vs. Scikit-Learn vs. Ordinary Least Squares (OLS)
In this notebook, we will implement linear regression from scratch using numpy and compare it with the linear regression model from scikit-learn and the ordinary least squares method.
Our Model
Data Preparation
Because we optimize the weights with gradient descent, we standardize the data so that every feature is on a comparable scale; this keeps the gradients well-conditioned and lets a single learning rate work for all features.
def prep_data(X, y):
    # Standardize X to have mean=0, std=1 per feature
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Standardize y to have mean=0, std=1
    y = (y - y.mean()) / y.std()
    return X, y
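As a quick illustration, prep_data can be applied to a small made-up array (the values below are purely illustrative):

import numpy as np

X_demo = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy features, illustrative only
y_demo = np.array([10.0, 20.0, 30.0])                          # toy targets, illustrative only
X_norm, y_norm = prep_data(X_demo, y_demo)
print(X_norm.mean(axis=0), X_norm.std(axis=0))  # approximately [0. 0.] and [1. 1.]
print(y_norm.mean(), y_norm.std())              # approximately 0.0 and 1.0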
Fitting the Model
We use gradient descent to optimize the weights of our model.
Loss Function
We use mean squared error as the loss function.
\begin{equation}
L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\end{equation}
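In numpy this loss is a one-liner; the helper name mse below is introduced here purely for illustration:

import numpy as np

def mse(y, y_pred):
    # Mean squared error: average squared residual over all N samples
    return np.mean((y - y_pred) ** 2)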
Optimization
To optimize the model, we need the gradients of the loss function with respect to the weights and the bias.
\begin{equation}
\frac{\partial L}{\partial W} = \frac{2}{N} X^T(XW + b - y)
\end{equation}
\begin{equation}
\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \left( (XW + b)_i - y_i \right)
\end{equation}
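As a quick sanity check (not part of the original notebook), the weight gradient can be compared against a central finite-difference approximation on a tiny made-up problem:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # tiny made-up dataset, illustrative only
y = rng.normal(size=5)
W, b = rng.normal(size=2), 0.5

def loss(W, b):
    return np.mean((X @ W + b - y) ** 2)

# Analytic gradient from the equation above
analytic = (2 / len(y)) * X.T @ (X @ W + b - y)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.array([
    (loss(W + eps * np.eye(2)[j], b) - loss(W - eps * np.eye(2)[j], b)) / (2 * eps)
    for j in range(2)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True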
def fit(self, X, y):
    n_samples, n_features = X.shape
    self.weights = np.zeros(n_features)
    self.bias = 0
    for _ in range(self.n_iters):
        # Current predictions
        z = np.dot(X, self.weights) + self.bias
        # Gradients of the MSE loss with respect to weights and bias
        dw = (2 / n_samples) * np.dot(X.T, (z - y))
        db = (2 / n_samples) * np.sum(z - y)
        # Gradient descent update
        self.weights -= self.learning_rate * dw
        self.bias -= self.learning_rate * db
We update the weights and biases using the following formulas:
\begin{equation}
W = W - \alpha \frac{\partial L}{\partial W}
\end{equation}
where the hyperparameter $\alpha$ is the learning rate.
\begin{equation}
b = b - \alpha \frac{\partial L}{\partial b}
\end{equation}
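The fit snippet above references a few attributes (learning_rate, n_iters) and a predict method that live on the model class. A minimal sketch of how the full class might be assembled is shown below; the class name LinearRegressionScratch and the default hyperparameter values are illustrative assumptions rather than the notebook's exact choices:

import numpy as np

class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iters=1000):
        # Hyperparameters: step size and number of gradient descent iterations
        self.learning_rate = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.n_iters):
            z = np.dot(X, self.weights) + self.bias
            dw = (2 / n_samples) * np.dot(X.T, (z - y))
            db = (2 / n_samples) * np.sum(z - y)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        # Linear model: X @ W + b
        return np.dot(X, self.weights) + self.bias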
Scoring the Model
We use the coefficient of determination, $R^2$, to score our model.
\begin{equation}
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
\end{equation}
def score(self, X, y):
    # R^2 score
    y_pred = self.predict(X)
    u = ((y - y_pred) ** 2).sum()
    v = ((y - y.mean()) ** 2).sum()
    return 1 - u / (v + 1e-10)  # small constant avoids division by zero
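Putting the pieces together, an end-to-end run might look like the sketch below. The synthetic data is made up purely for illustration, and the score method above is assumed to be attached to the LinearRegressionScratch class sketched earlier:

import numpy as np

# Made-up synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=200)

X_norm, y_norm = prep_data(X, y)

model = LinearRegressionScratch(learning_rate=0.01, n_iters=1000)
model.fit(X_norm, y_norm)
print(model.score(X_norm, y_norm))  # high R^2 on this easy synthetic problem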
Ordinary Least Squares (OLS)
We prepare the data exactly as we did for our model. For the fit function, however, we use the normal equation to compute the weights in closed form:
\begin{equation}
\theta = (X^T X)^{-1} X^T y
\end{equation}
def fit(self, X, y):
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Add a column of ones to X for the bias term
    X_with_bias = np.column_stack([np.ones(X.shape[0]), X])
    # Compute the coefficients using the normal equation
    coeffs = np.linalg.inv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ y
    # Extract bias and weights
    self.bias = coeffs[0]
    self.weights = coeffs[1:]
We have to add a column of ones to X because our feature matrix contains only the features and no intercept column; after solving the normal equation, the first coefficient is the bias and the remaining coefficients are the weights.
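To turn the fitted coefficients into predictions, a predict method along the following lines would work (a minimal sketch; such a method is implied by the scoring step below but not shown in the excerpt above):

def predict(self, X):
    # Linear prediction using the coefficients from the normal equation
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    return X @ self.weights + self.bias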
Scoring the Model
We score this model with an $R^2$ score as well.
Comparing Models
After implementing our linear regression model from scratch, its final $R^2$ score is effectively equivalent to the $R^2$ score from scikit-learn's linear regression model:
Scikit-learn's R^2 score: 0.49656835105076846
Our Model: 0.4965683510508899
OLS Model: 0.49656835105076835
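For reference, the scikit-learn score in the comparison above could be obtained along these lines (a sketch only: X and y stand for whatever dataset the notebook actually loads, prepped with prep_data as before):

from sklearn.linear_model import LinearRegression

X_norm, y_norm = prep_data(X, y)   # same preprocessing as for our model

sk_model = LinearRegression()
sk_model.fit(X_norm, y_norm)
print(sk_model.score(X_norm, y_norm))  # scikit-learn's built-in R^2 score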