For reprint, please keep the author's information:

Author: phinecos (Dongting Sanren)

Blog：http://phinecos.cnblogs.com/

Email：[email protected]

Preface

This paper is based on a recently read book, Tom M. Mitchell's machine learning. Chapter 6 of the book explains the theoretical knowledge of Bayesian learning in detail. In order to apply it to practice, it refers to many materials on the Internet, so as to get this article. This paper will be divided into two parts, the first part will introduce the theory of Bayesian learning (if you are not interested in the theory, please skip to the second part < < text classification algorithm based on Naive Bayesian classifier (next) >). The second part is about how to apply Bayesian classifier to Chinese text classification, with sample code attached.

Introduction

In the first chapter of probability theory and mathematical statistics, we have learned Bayes formula and total probability formula. Let's review them briefly

conditional probability

It is defined that a and B are two events, and P (a) > 0 is p (B ∣ a) = P (AB) / P (a) is the conditional probability of the occurrence of conditional event B under condition a.

If P (a) > 0, the multiplication formula has p (AB) = P (B ∣ a) P (a)

Total probability formula and Bayes formula

Define the sample space set s as test e, B1, B2 BN is a group of events of E, if bibj = Ф, I ≠ J, I, j = 1, 2 ,n; B1∪B2∪… ∪ BN = s is called B1, B2 , BN is a partition of sample space.

Theorem let the sample space of test e be the event with a as e, B1, B2 , BN is a partition of, and P (BI) > 0 (I = 1, 2 n) Then p (a) = P (a ∣ B1) P (B1) + P (a ∣ B2) + +P (a ∣ BN) P (BN) is called the total probability formula.

Theorem suppose that the sample space of test e is s, a is the event of E, B1, B2 , BN is a division of, then

P(Bi∣A)=P(A∣Bi)P(Bi)/∑P(B｜Aj)P(Aj)=P(B｜Ai)P(Ai)/P(B)

It is called Bayesian formula. Note: I, J are subscripts, and the sum is 1 to n

Let me give you a simple example.

Example 1

Considering a medical diagnosis problem, there are two possible hypotheses: (1) the patient has cancer. (2) The patient had no cancer. The sample data comes from a test, which also has two possible results: positive and negative. Let's say we already have a priori knowledge: only 0.008 of the population is ill. In addition, 98% of the patients with the disease are likely to return positive results in the laboratory test, and 97% of the patients without the disease are likely to return negative results.

The above data can be represented by the following probability formula:

P (cancer) = 0.008, P (no cancer) = 0.992

P = 0.98, P = 0.02

P (positive | no cancer) = 0.03, P (negative | no cancer) = 0.97

Suppose that there is a new patient who has returned to the positive laboratory test, will the patient be judged to have cancer? We can calculate the maximum posterior hypothesis:

P (positive) P (cancer) = 0.98 * 0.008 = 0.0078

P (positive | no cancer) * P (no cancer) = 0.03 * 0.992 = 0.0298

Therefore, it should be judged that there is no cancer.

Bayesian learning theory

Bayes is a learning algorithm based on probability, which can be used to calculate the explicit hypothesis probability. It is based on the prior probability of hypothesis, the probability of observing different data under given hypothesis and the observed data itself (we can see later, in fact, there are three things, ha ha).

We use p (H) to express the initial probability of h without training sample data, which is called the prior probability of H. it reflects the background knowledge that we have about the chance of H being a correct hypothesis. Of course, without this prior knowledge, we can simply assign each hypothesis to the same probability in the actual processing. Similarly, P (d) represents the prior probability of the training sample data d to be observed (that is, the probability of D when no certain assumption is established). Then there is p (D / h), which represents the probability of observing data d when h is assumed to hold. In machine learning, we are interested in P (H / D), that is, given a training sample data D, to judge the probability of hypothesis h, which is also known as a posterior probability, which reflects the confidence of hypothesis h after seeing the training sample data D. (Note: the posterior probability p (H / D) reflects the influence of training data D, while the prior probability p (H) is independent of D).

P (h|d) = P (d|h) P (H) / P (d). From the Bayesian formula, we can see that the posterior probability p (H / D) depends on the product of P (d|h) P (H). Hehe, this is the core idea of Bayesian classification algorithm. All we have to do is to consider the candidate hypothesis set h, and look for the most likely hypothesis H (H belongs to h) when given training data D.

Simply put, given a training sample data (the sample data has been manually classified), how should we learn from this sample data set, so that when we encounter new data, we can classify the new data into a certain category. As you can see, the above Bayesian theory is consistent with this task.

naive bayesian classification

Maybe you don't understand this theory very well. Let me give you another simple example to give you a quick understanding of the principle of this algorithm. (Note: this example is taken from table 3-2 in Chapter 3 of machine learning.)

Given the following training sample data, our learning goal is to judge whether your answer to the request of playtennis is yes or no according to the given weather conditions.

Day

Outlook

Temperature

Humidity

Wind

PlayTennis

D1

Sunny

Hot

High

Weak

No

D2

Sunny

Hot

High

Strong

No

D3

Overcast

Hot

High

Weak

Yes

D4

Rain

Mild

High

Weak

Yes

D5

Rain

Cool

Normal

Weak

Yes

D6

Rain

Cool

Normal

Strong

No

D7

Overcast

Cool

Normal

Strong

Yes

D8

Sunny

Mild

High

Weak

No

D9

Sunny

Cool

Normal

Weak

Yes

D10

Rain

Mild

Normal

Weak

Yes

D11

Sunny

Mild

Normal

Strong

Yes

D12

Overcast

Mild

High

Strong

Yes

D13

Overcast

Hot

Normal

Weak

Yes

D14

Rain

Mild

High

Strong

No

We can see that the sample data set here provides 14 training samples. We will use the data in this table and combine the naive Bayesian classifier to classify the following new examples:

(Outlook = sunny,Temprature = cool,Humidity = high,Wind = strong)

Our task is to predict the target value (yes or no) of playtennis

From the above formula, we can get:

You can get:

P(PlayTennis =yes) = 9/14 = 0.64,P(PlayTennis=no)=5/14 = 0.36

P(Wind=Stong| PlayTennis =yes)=3/9=0.33,p(Wind=Stong| PlayTennis =no)=3/5 = 0.6

Other data are similar and can be obtained by substituting:

P(yes)P(Sunny|yes)P(Cool|yes)P(high|yes)P(Strong|yes) = 0.0053

P(no)P(Sunny|no)P(Cool|no)P(high|no)P(Strong|no)=0.0206

Therefore, it should be classified into no category.

Bayesian text classification algorithm

Now, let's go to the main part of this article: how to apply Bayesian classifier to Chinese text classification?

According to the joint probability formula (full probability formula)

M -- the number of keywords in the training text set after the elimination of useless words and text preprocessing.