Multi-Layer Perceptron from Scratch
A multi-layer perceptron trained on the Census Income dataset using mini-batch gradient descent.
Dataset
The dataset used is the Census Income dataset from the UCI Machine Learning Repository. It contains 14 features and 1 binary target variable indicating whether an individual's income exceeds $50,000.
Data Preparation
Before training, we shuffle, split, normalize, and one-hot encode the data. Shuffling prevents any ordering bias in the training process. Splitting into training and testing sets lets us evaluate the model's performance on unseen data. Normalization helps the model converge faster. One-hot encoding turns the categorical features into numeric columns, and the target variable is mapped to binary 0/1 labels.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def prep_data(X_train, X_test, y_train, y_test):
    # Combine train and test sets so both get the same dummy variables
    X_combined = pd.concat([X_train, X_test])
    # Apply get_dummies to the combined dataset
    X_combined = pd.get_dummies(X_combined, drop_first=True)
    # Split the combined dataset back into train and test sets
    X_train = X_combined.iloc[:len(X_train)]
    X_test = X_combined.iloc[len(X_train):]
    # Z-score normalization; the test set uses the training statistics to avoid leakage
    train_mean, train_std = X_train.mean(), X_train.std()
    X_train = (X_train - train_mean) / train_std
    X_test = (X_test - train_mean) / train_std
    # Zero-variance columns produce NaN after normalization; zero them out
    X_train = X_train.fillna(0)
    X_test = X_test.fillna(0)
    # Convert the categorical target variable y to binary (0/1)
    y_train = (y_train == '>50K').astype(int)
    y_test = (y_test == '>50K').astype(int)
    # Convert data to np arrays
    X_train = X_train.values
    X_test = X_test.values
    y_train = y_train.values
    y_test = y_test.values
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = prep_data(X_train, X_test, y_train, y_test)
Model Specification
Input (100, 64) → ReLU → Hidden Layer (64, 32) → ReLU → Output (32, 1) → Sigmoid

Each pair is a layer's (input, output) dimension; the input size of 100 corresponds roughly to the number of feature columns left after one-hot encoding.
Training
Loss Function
We use binary cross-entropy as the loss function.
\begin{equation}
L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
\end{equation}
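As a reference, a minimal NumPy version of this loss (the clipping epsilon is an assumption added to avoid log(0)):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Keep predictions strictly inside (0, 1) so the logs stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    # Mean of the per-sample losses over the batch
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))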
Optimization
We use mini-batch gradient descent to optimize the weights of the model. The model is trained for 100 epochs with a batch size of 32.
Gradient Descent Calculus
Forward Pass
Our forward pass looks as follows, where X is the input, W is the weight matrix, b is the bias, Z is the linear transformation, A is the activation output, and L is the loss. X has shape (batch_size, input_size), W has shape (input_size, output_size), b has shape (1, output_size), Z and A each have shape (batch_size, output_size), and L is a scalar.
Stochastic Gradient Descent vs Mini-Batch Gradient Descent
In stochastic gradient descent, we compute the loss for a single sample and update the weights and biases after every sample. In standard (batch) gradient descent, we compute the loss over the entire dataset and update the weights and biases once per epoch. Mini-batch gradient descent sits between the two: we compute the loss for a small batch of samples and update the weights and biases after each batch, as sketched below.
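A minimal sketch of the mini-batch loop, assuming a model object that exposes forward, backward, and update steps (these names are illustrative, not this repository's exact API):

batch_size = 32
num_samples = X_train.shape[0]

for epoch in range(num_epochs):
    # Reshuffle every epoch so each epoch sees different batches
    perm = np.random.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        idx = perm[start:start + batch_size]
        X_batch, y_batch = X_train[idx], y_train[idx]
        y_hat = model.forward(X_batch)   # forward pass on the batch only
        model.backward(y_batch, y_hat)   # gradients computed from this batch
        model.update(learning_rate)      # one parameter update per batch

Returning to the forward pass, written out for our three-layer network: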
\begin{equation}
Z_1 = XW_1 + b_1
\end{equation}
\begin{equation}
A_1 = ReLU(Z_1)
\end{equation}
\begin{equation}
Z_2 = A_1W_2 + b_2
\end{equation}
\begin{equation}
A_2 = ReLU(Z_2)
\end{equation}
\begin{equation}
Z_3 = A_2W_3 + b_3
\end{equation}
\begin{equation}
A_3 = Sigmoid(Z_3)
\end{equation}
\begin{equation}
L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
\end{equation}
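Here \hat{y} = A_3, the sigmoid output. As a sanity check, the whole forward pass translates directly into NumPy (the random weight initialization, seed, and batch size of 32 below are illustrative assumptions):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 100))                       # (batch_size, input_size)
W1, b1 = rng.normal(size=(100, 64)) * 0.01, np.zeros((1, 64))
W2, b2 = rng.normal(size=(64, 32)) * 0.01, np.zeros((1, 32))
W3, b3 = rng.normal(size=(32, 1)) * 0.01, np.zeros((1, 1))

Z1 = X @ W1 + b1        # (32, 64)
A1 = relu(Z1)
Z2 = A1 @ W2 + b2       # (32, 32)
A2 = relu(Z2)
Z3 = A2 @ W3 + b3       # (32, 1)
A3 = sigmoid(Z3)        # y_hat: one probability per sample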
Backward Pass
We will now calculate the gradients of the loss function with respect to the weights and biases of the model in order to update them with gradient descent.
\begin{equation}
\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial A_3} \cdot \frac{\partial A_3}{\partial Z_3} \cdot \frac{\partial Z_3}{\partial W_3}
\end{equation}
\begin{equation}
\frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial A_3} \cdot \frac{\partial A_3}{\partial Z_3} \cdot \frac{\partial Z_3}{\partial b_3}
\end{equation}
...
\begin{equation}
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial A_3} \cdot \frac{\partial A_3}{\partial Z_3} \cdot \frac{\partial Z_3}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial W_1}
\end{equation}
As you can see, the chain rule produces a "chain" of gradients, and the expressions grow increasingly long as the number of layers in the network increases. So, as we propagate backwards through the network, each layer computes \frac{\partial L}{\partial A_{i-1}} and passes it back to the previous layer; this carries the running "chain" of gradients backwards without ever recomputing the full product.
Code for Backward Pass
def backward(self, dL_dA):
    """
    Calculates the gradient of the loss with respect to the weights, biases,
    and the previous layer's activations.
    :param dL_dA: The gradient of the loss with respect to this layer's output.
    :return dL_dA_prev: The gradient of the loss with respect to the previous layer's activations.
    """
    if self.activation_function is None:
        self.dL_dz = dL_dA
    else:
        self.dL_dz = dL_dA * self.activation_prime(self.z)
    # Remember, z = w * a + b, so dz/dw = a
    self.dL_dW = np.dot(self.inputs.T, self.dL_dz)          # dL/dW = dL/dA * dA/dz * dz/dW
    self.dL_dB = np.sum(self.dL_dz, axis=0, keepdims=True)  # dL/db = dL/dA * dA/dz * dz/db
    self.dL_dA_prev = np.dot(self.dL_dz, self.weights.T)    # dL/dz * dz/dA[L-1]
    return self.dL_dA_prev
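To start this chain, we need the initial gradient \frac{\partial L}{\partial A_3}, which comes from differentiating the binary cross-entropy loss with respect to the network's output. A hedged sketch of the full backward sweep, assuming a layers list of objects exposing the backward method above (y_batch must be shaped (N, 1) to match A3):

# Derivative of binary cross-entropy with respect to the sigmoid output A3
N = y_batch.shape[0]
dL_dA = -(y_batch / A3 - (1 - y_batch) / (1 - A3)) / N

# Walk the layers in reverse; each layer returns dL_dA_prev for the next call
for layer in reversed(layers):
    dL_dA = layer.backward(dL_dA)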
Update Weights and Biases
After calculating the gradients of the loss function with respect to the weights and biases, we update the weights and biases using the following equations:
\begin{equation}
W = W - \alpha \cdot \frac{\partial L}{\partial W}
\end{equation}
\begin{equation}
b = b - \alpha \cdot \frac{\partial L}{\partial b}
\end{equation}
where \alpha is the learning rate.
Note: We also decay the learning rate in this example to prevent the model from overshooting the minimum.
learning_rate = initial_learning_rate * (1 / (1 + 0.01 * epoch))
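Putting the update rule and the decay schedule together, a per-layer update method might look like this minimal sketch (the dL_dW and dL_dB attributes come from the backward code above; self.weights and self.biases are assumed attribute names):

def update(self, initial_learning_rate, epoch):
    # Inverse-time decay of the learning rate, as quoted above
    learning_rate = initial_learning_rate * (1 / (1 + 0.01 * epoch))
    # W = W - alpha * dL/dW  and  b = b - alpha * dL/db
    self.weights -= learning_rate * self.dL_dW
    self.biases -= learning_rate * self.dL_dB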
Gradient Clipping
After backpropagation, we also clip the gradients of each layer to prevent exploding gradients. I added this after early training runs ran into exactly that problem.
def clip_gradients(self, max_norm):
    """
    Clips the gradients to prevent exploding gradients using L2 norm clipping.
    :param max_norm: The maximum allowable norm for the gradients.
    """
    # L2 (Frobenius) norm of the weight-gradient matrix
    total_norm = np.linalg.norm(self.dL_dW)
    if total_norm > max_norm:
        # Rescale so the gradient's norm equals max_norm
        self.dL_dW = self.dL_dW * (max_norm / total_norm)
We compute the total norm of the gradients and, if it exceeds the threshold, rescale them so they cannot explode. The norm used is the L2 norm: the square root of the sum of the squares of the elements (applied to a weight matrix, this is the Frobenius norm).
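In the training loop, clipping sits between the backward sweep and the update step; note that as written it rescales only the weight gradients, so the bias gradients would need the same treatment if they also grow large. A sketch, with max_norm=1.0 as an illustrative value:

for layer in layers:
    layer.clip_gradients(max_norm=1.0)          # cap each layer's weight-gradient norm
    layer.update(initial_learning_rate, epoch)  # then take the (clipped) gradient step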
Results
After training this model for 10 epochs, with a batch size of 64 and an initial learning rate of 0.01, we achieve around 84% accuracy on the test set.