Naive Bayes Classifier from Scratch
A Naive Bayes classifier implemented from scratch and trained on the Mushroom dataset from the UCI Machine Learning Repository.
Dataset
This classifier is designed for datasets with categorical features and a discrete target variable. Each feature should have a limited set of unique values, allowing the model to compute conditional probabilities. This implementation is best suited for datasets such as the UCI Mushroom dataset or any dataset where the target variable is categorical and there are clear feature categories.
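As a quick illustration, the snippet below loads the dataset with pandas. This is only a sketch: the local file name agaricus-lepiota.data, the generated column names, and the placement of the class label ('e' for edible, 'p' for poisonous) in the first column follow the standard UCI distribution and are assumptions, not part of this repository's code.

```python
# A minimal loading sketch (assumes the standard UCI file layout:
# comma-separated, no header row, class label in the first column).
import pandas as pd

columns = ["class", *[f"feature_{i}" for i in range(22)]]  # 22 categorical attributes in the UCI file
df = pd.read_csv("agaricus-lepiota.data", header=None, names=columns)
print(df.head())
```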
Data Preparation
Before using the classifier, ensure that the data is cleaned, all features are categorical, and each sample is labeled correctly. No scaling or normalization is needed, as Naive Bayes relies on categorical feature counts rather than distances.
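Continuing from the loading sketch above, one way to prepare the data is to separate the label column from the features and hold out a test split. The 80/20 split, the random seed, and the df variable are illustrative choices, not part of the original code.

```python
# A minimal preparation sketch: split features from labels and hold out a test set.
import numpy as np

X = df.drop(columns=["class"]).to_numpy()  # categorical feature values (strings)
y = df["class"].to_numpy()                 # target labels, e.g. 'e' or 'p'

rng = np.random.default_rng(42)
indices = rng.permutation(len(X))
split = int(0.8 * len(X))                  # 80/20 train/test split
X_train, X_test = X[indices[:split]], X[indices[split:]]
y_train, y_test = y[indices[:split]], y[indices[split:]]
```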
Model Specification
The Naive Bayes classifier works by applying Bayes' Theorem:
P(y \mid X) = \frac{P(X \mid y) \cdot P(y)}{P(X)}
Where:

- P(y \mid X) is the posterior probability of class y given the input features X.
- P(X \mid y) is the likelihood of the features given class y.
- P(y) is the prior probability of class y.
- P(X) is the probability of the features X, which we omit because it is constant across classes.
Since P(X) is the same for every class, we score each class using only the numerator, computed in log space:
\hat{y} = \arg\max_{y \in Y} \left( \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right)
This means we compute a log-probability score for each class, summing the log prior and the log likelihoods of the feature values, and select the class with the highest score; working with logarithms avoids numerical underflow when multiplying many small probabilities.
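The code snippets in the following sections reference attributes such as self.classes, self.class_priors, self.likelihoods, and self.smoothing. A class skeleton consistent with those snippets might look like the sketch below; the class name NaiveBayes and the exact method signatures are assumptions, not necessarily the original implementation.

```python
import numpy as np

class NaiveBayes:
    """Categorical Naive Bayes with additive (Laplace) smoothing -- a sketch."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # the alpha parameter from the formulas above
        self.classes = None         # unique class labels seen during fit
        self.class_priors = {}      # class label -> P(y)
        self.likelihoods = {}       # class label -> feature index -> value -> P(x_i = v | y)

    def fit(self, X, y):
        ...  # compute priors and likelihoods (see Training)

    def predict_sample(self, x):
        ...  # score each class and return the best one (see Predicting)
```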
Training
Calculating Class Priors
The class prior probabilities P(y) represent the relative frequency of each class in the dataset:
P(y) = \frac{\text{Number of samples in class } y}{\text{Total number of samples}}
```python
for c in self.classes:
    self.class_priors[c] = np.sum(y == c) / n_samples
```
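This loop assumes that self.classes, n_samples, and the empty likelihood tables have already been set up earlier in fit. One plausible setup (an assumption, not shown in the original snippets) is:

```python
# Plausible setup inside fit(), before the counting loops (an assumption):
self.classes = np.unique(y)         # distinct class labels, e.g. array(['e', 'p'])
n_samples, n_features = X.shape     # dimensions used by the counting loops
self.class_priors = {}
self.likelihoods = {c: {f: {} for f in range(n_features)} for c in self.classes}
```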
Calculating Likelihoods
For each feature i, the likelihood P(x_i \mid y) is the probability of feature x_i taking a specific value, given that the sample belongs to class y. We use Laplace smoothing to handle cases where a feature value might not be present in the class:
P(x_i = v \mid y) = \frac{\text{Count of } x_i = v \text{ in class } y + \alpha}{\text{Number of samples in class } y + \alpha \cdot \text{Number of unique values of } x_i}
Where:

- \alpha is the smoothing parameter. If \alpha = 1, it is called Laplace smoothing; if \alpha = 0, no smoothing is applied.
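As a purely illustrative example (numbers chosen arbitrarily): if class y contains 100 samples, a feature has 3 unique values, and the value v never occurs in that class, then with \alpha = 1 the smoothed likelihood is

P(x_i = v \mid y) = \frac{0 + 1}{100 + 1 \cdot 3} \approx 0.0097

rather than zero, which would otherwise send the log score for that class to negative infinity.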
```python
for c in self.classes:
    X_with_c = X[y == c]                     # samples belonging to class c
    num_of_X_with_c = X_with_c.shape[0]
    for f in range(n_features):
        feature_values = np.unique(X[:, f])  # all values this feature takes in the dataset
        for val in feature_values:
            # Count occurrences of this value within class c and apply additive smoothing
            count = np.sum(X_with_c[:, f] == val)
            self.likelihoods[c][f][val] = (count + self.smoothing) / (num_of_X_with_c + len(feature_values) * self.smoothing)
```
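After training, self.likelihoods is a nested mapping from class label to feature index to feature value. A lookup like the one below retrieves a single smoothed conditional probability; the variable name model and the keys 'p' and 'x' are illustrative assumptions.

```python
# Illustrative lookup: P(feature_0 = 'x' | class = 'p'), assuming those keys exist.
p = model.likelihoods['p'][0]['x']
```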
Predicting
To predict the class of a new sample, we calculate the score for each class and select the class with the highest score. The score is the sum of the log prior probability and the log likelihood of each feature value in the sample:
```python
def predict_sample(self, x):
    best_class, best_score = None, -np.inf
    for c in self.classes:
        score = np.log(self.class_priors[c])  # log prior
        for f_index, val in enumerate(x):     # for each feature value in the sample
            # Fall back to a small smoothed probability for values never seen in this class
            score += np.log(self.likelihoods[c][f_index].get(val, self.smoothing / (len(self.likelihoods[c][f_index]) + self.smoothing)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```
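The snippet above classifies a single sample. A thin wrapper method (sketched below, not part of the shown code) applies it row by row to a whole feature matrix:

```python
def predict(self, X):
    # Classify each row of X independently with predict_sample (a sketch).
    return np.array([self.predict_sample(x) for x in X])
```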
Evaluation
Lastly, we use a simple accuracy metric to evaluate the model's performance. In this demo, the classifier reaches near-perfect accuracy, since the Mushroom dataset's clean categorical features are an ideal fit for Naive Bayes.
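A sketch of the evaluation step, assuming the NaiveBayes skeleton and the train/test variables from the earlier preparation sketch:

```python
# Fit on the training split and measure accuracy on the held-out test split (a sketch).
model = NaiveBayes(smoothing=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:.4f}")
```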