Naive Bayes

Naive Bayes is a simple probabilistic classifier based on Bayes' theorem and a strong feature-independence assumption. It is often applied to text classification tasks with word frequencies as features (e.g., spam classification).

Underlying Assumption

The primary assumption of the model (making it “naive”) is that each feature is independent of all other features, given the target label.

Model Formulation

The overarching goal of classification is to estimate the probability of a particular class label given the data. That is,

$$P(c \mid \mathbf{x}) = P(c \mid x_1,\dots,x_n), \quad \forall c \in C$$

Naively, we could estimate this probability directly from a training set: for each data point x, count up the class labels of the training samples that match it exactly:

$$P(c \mid \mathbf{x}) = \frac{P(\mathbf{x} \cap c)}{P(\mathbf{x})} = \frac{\sum_{i=1}^{N} [\![\, \mathbf{x}_i = \mathbf{x} \land y_i = c \,]\!]}{\sum_{i=1}^{N} [\![\, \mathbf{x}_i = \mathbf{x} \,]\!]}$$

where $N$ is the number of training samples and $[\![\cdot]\!]$ is the indicator (Iverson) bracket.
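To make the limitation concrete, here is a minimal Python sketch of this exact-match counting estimator; the toy dataset and function name are made up for illustration. It only returns a useful estimate when the query point has been seen (many times) in the training data:

```python
from collections import Counter

def empirical_posterior(X_train, y_train, x_query):
    """Estimate P(c | x) by counting exact matches of x_query in the training set."""
    # Keep only the labels of training samples identical to the query point.
    matches = [y for x, y in zip(X_train, y_train) if x == x_query]
    if not matches:
        return {}  # the query point was never observed: the estimate is undefined
    counts = Counter(matches)
    return {c: k / len(matches) for c, k in counts.items()}

# Tiny categorical toy data (hypothetical): each sample is a tuple of feature values.
X_train = [("sunny", "hot"), ("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
y_train = ["no", "yes", "yes", "yes"]

print(empirical_posterior(X_train, y_train, ("sunny", "hot")))  # {'no': 0.5, 'yes': 0.5}
print(empirical_posterior(X_train, y_train, ("rainy", "hot")))  # {} -- never observed
```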

However, this method will not work in almost any real-world scenario, because we rarely have many samples sharing the exact same x value (imagine a continuous setting, for example). The counting estimator targets the correct probabilities, but it is extremely noisy without a very large amount of data. Instead, we can estimate these probabilities using Bayes' rule:

$$P(c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c)\,P(c)}{P(\mathbf{x})}$$

The overall goal is to find the class label c that is most likely given the data. So for a given test point x, we want to use the above equation to find the value of c that maximizes P(c|x). The denominator P(x) does not depend on c, so it is the same for every candidate class and can be ignored: the value of c that maximizes P(c|x) also maximizes P(x|c)P(c). Focusing on the numerator, note that it is simply the joint probability of x and c:

$$P(\mathbf{x} \cap c) = P(\mathbf{x} \mid c)\,P(c)$$

which can be expanded using the chain rule of probability:

$$P(\mathbf{x} \cap c) = P(x_1,\dots,x_n,c) = P(c)\,P(x_1 \mid c)\cdots P(x_n \mid x_1,\dots,x_{n-1},c)$$
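For instance, with just three features the last factor already conditions on everything that came before it:

$$P(x_1, x_2, x_3, c) = P(c)\,P(x_1 \mid c)\,P(x_2 \mid x_1, c)\,P(x_3 \mid x_1, x_2, c)$$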

As the number of features grows, this expanded joint probability becomes very difficult to compute and estimate, since the later factors condition on more and more features. So we employ a “naive” assumption of conditional independence: each feature is independent of all other features given the class label. Under this assumption, every factor P(x_i | x_1, …, x_{i-1}, c) collapses to P(x_i | c), simplifying the expansion:

$$P(\mathbf{x} \cap c) = P(x_1,\dots,x_n,c) = P(c)\,P(x_1 \mid c)\cdots P(x_n \mid c) = P(c)\prod_{i=1}^{n} P(x_i \mid c)$$
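To make the factorization concrete, here is a minimal Python sketch of a Naive Bayes classifier for categorical features, estimating P(c) and each P(x_i | c) by counting. The class name, the add-one (Laplace) smoothing, and the use of log-probabilities are implementation choices not discussed above:

```python
import math
from collections import Counter

class CategoricalNaiveBayes:
    """Naive Bayes with the factorization P(x, c) = P(c) * prod_i P(x_i | c)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-one (Laplace) smoothing strength

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.n_features_ = len(X[0])
        self.class_counts_ = Counter(y)
        # feature_counts_[c][i][v] = how often feature i takes value v among samples labeled c
        self.feature_counts_ = {c: [Counter() for _ in range(self.n_features_)]
                                for c in self.classes_}
        self.feature_values_ = [set() for _ in range(self.n_features_)]
        for x, c in zip(X, y):
            for i, v in enumerate(x):
                self.feature_counts_[c][i][v] += 1
                self.feature_values_[i].add(v)
        return self

    def joint_log_prob(self, x):
        """Return log( P(c) * prod_i P(x_i | c) ) for every class c."""
        n_samples = sum(self.class_counts_.values())
        scores = {}
        for c in self.classes_:
            log_p = math.log(self.class_counts_[c] / n_samples)      # log P(c)
            for i, v in enumerate(x):
                num = self.feature_counts_[c][i][v] + self.alpha      # smoothed count
                den = self.class_counts_[c] + self.alpha * len(self.feature_values_[i])
                log_p += math.log(num / den)                          # log P(x_i | c)
            scores[c] = log_p
        return scores

    def predict(self, x):
        # The class with the largest joint (equivalently, posterior) probability wins.
        scores = self.joint_log_prob(x)
        return max(scores, key=scores.get)

# Same toy data as before: the factorized model now generalizes to unseen combinations.
X_train = [("sunny", "hot"), ("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
y_train = ["no", "yes", "yes", "yes"]
model = CategoricalNaiveBayes().fit(X_train, y_train)
print(model.predict(("rainy", "hot")))  # produces a prediction even for an unseen pair
```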

This gives us the final (unnormalized) conditional probability, $P(c \mid \mathbf{x}) \propto P(c)\prod_{i=1}^{n} P(x_i \mid c)$, and we have a classifier: predict the class that maximizes this quantity,

$$\hat{c} = \underset{c \in C}{\arg\max}\; P(c)\prod_{i=1}^{n} P(x_i \mid c)$$
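For reference, the sketch above can be cross-checked against an off-the-shelf implementation. A minimal scikit-learn example on made-up word-count features, in the spirit of the spam-classification use case mentioned at the top, might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: word frequencies as features, labels are spam/ham.
docs = ["win money now", "limited offer win prize", "meeting at noon", "lunch with the team"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer().fit(docs)            # learn the vocabulary
X_counts = vec.transform(docs)               # document-term count matrix
clf = MultinomialNB().fit(X_counts, labels)  # Naive Bayes over word counts

print(clf.predict(vec.transform(["win a prize now"])))  # most likely ['spam'] on this toy data
```

MultinomialNB models each P(x_i | c) with a multinomial distribution over word counts, which is the standard choice for the word-frequency features described at the top.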