For reprint, please keep the author's information:
Author: phinecos (Dongting Sanren)
Blog: http://phinecos.cnblogs.com/
Email: [email protected]
Preface
This article is based on a book I read recently, Tom M. Mitchell's Machine Learning. Chapter 6 of the book explains the theory of Bayesian learning in detail. In order to put it into practice, I consulted many materials on the Internet, and the result is this article. The article is divided into two parts: the first part introduces the theory of Bayesian learning (if you are not interested in the theory, please skip ahead to the second part, "Text classification algorithm based on a Naive Bayesian classifier (Part 2)"). The second part describes how to apply a Bayesian classifier to Chinese text classification, with sample code attached.
Introduction
In the first chapter of Probability Theory and Mathematical Statistics, we learned the Bayes formula and the total probability formula. Let us review them briefly.
Conditional probability
Definition: Let A and B be two events with P(A) > 0. Then P(B|A) = P(AB) / P(A) is called the conditional probability of event B occurring given that event A has occurred.
If P(A) > 0, the multiplication formula gives P(AB) = P(B|A) P(A).
Total probability formula and Bayes formula
Definition: Let S be the sample space of experiment E, and let B1, B2, ..., Bn be a group of events of E. If BiBj = ∅ for i ≠ j (i, j = 1, 2, ..., n) and B1 ∪ B2 ∪ ... ∪ Bn = S, then B1, B2, ..., Bn is called a partition of the sample space S.
Theorem: Let S be the sample space of experiment E, let A be an event of E, let B1, B2, ..., Bn be a partition of S, and suppose P(Bi) > 0 (i = 1, 2, ..., n). Then
P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ... + P(A|Bn)P(Bn)
This is called the total probability formula.
Theorem: Let S be the sample space of experiment E, let A be an event of E with P(A) > 0, and let B1, B2, ..., Bn be a partition of S with P(Bi) > 0 (i = 1, 2, ..., n). Then
P(Bi|A) = P(A|Bi)P(Bi) / ∑j P(A|Bj)P(Bj) = P(A|Bi)P(Bi) / P(A)
This is called the Bayes formula. Note: i and j are subscripts, and the sum over j runs from 1 to n.
Let me give you a simple example.
Example 1
Consider a medical diagnosis problem with two possible hypotheses: (1) the patient has cancer; (2) the patient does not have cancer. The sample data comes from a laboratory test, which also has two possible results: positive and negative. Suppose we already have the prior knowledge that only 0.008 of the population has the disease. In addition, the test returns a correct positive result for 98% of the people who have the disease, and a correct negative result for 97% of the people who do not.
The above data can be written as the following probabilities:
P(cancer) = 0.008, P(no cancer) = 0.992
P(positive | cancer) = 0.98, P(negative | cancer) = 0.02
P(positive | no cancer) = 0.03, P(negative | no cancer) = 0.97
Suppose a new patient's laboratory test comes back positive. Should the patient be diagnosed with cancer? We can compute the maximum a posteriori (MAP) hypothesis:
P(positive | cancer) P(cancer) = 0.98 * 0.008 = 0.0078
P(positive | no cancer) P(no cancer) = 0.03 * 0.992 = 0.0298
Since the second value is larger, the patient should be judged not to have cancer.
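As a quick check, here is a minimal Python sketch (my own illustration, not code from the book) that compares the two unnormalized posteriors above:

# Prior and conditional probabilities from the example.
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

# Unnormalized posterior P(positive | h) * P(h) for each hypothesis h.
score_cancer = p_pos_given_cancer * p_cancer            # 0.98 * 0.008 = 0.00784
score_no_cancer = p_pos_given_no_cancer * p_no_cancer   # 0.03 * 0.992 = 0.02976

# The MAP hypothesis is the one with the larger score.
print("cancer" if score_cancer > score_no_cancer else "no cancer")  # -> no cancer

Note that even with a positive test, the prior P(cancer) is so small that "no cancer" still wins.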
Bayesian learning theory
Bayesian learning is a probability-based learning approach that can compute explicit probabilities for hypotheses. It is based on the prior probability of each hypothesis, the probability of observing different data given each hypothesis, and the observed data itself (as we will see, these really are the three ingredients).
We use P(h) to denote the initial probability of hypothesis h before any training data is seen; this is called the prior probability of h. It reflects our background knowledge about the chance that h is a correct hypothesis. Of course, without such prior knowledge, in practice we can simply assign the same probability to every hypothesis. Similarly, P(D) denotes the prior probability of observing the training data D (that is, the probability of D when no particular hypothesis is assumed to hold). Next, P(D|h) denotes the probability of observing data D given that hypothesis h holds. In machine learning we are interested in P(h|D), that is, the probability that hypothesis h holds given the training data D. This is called the posterior probability, and it reflects our confidence in h after seeing the training data D. (Note: the posterior probability P(h|D) reflects the influence of the training data D, whereas the prior probability P(h) is independent of D.)
By the Bayes formula, P(h|D) = P(D|h)P(h) / P(D). Since P(D) is the same for every hypothesis, the posterior probability P(h|D) is determined by the product P(D|h)P(h). This is the core idea of the Bayesian classification algorithm: given training data D, search the candidate hypothesis space H for the most probable hypothesis h (h ∈ H).
Put simply, given a set of training samples that have already been classified by hand, we want to learn from this sample set so that when new data arrives, we can assign it to one of the categories. As you can see, the Bayesian theory above matches this task exactly.
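Written out as a formula (this is the standard maximum a posteriori, or MAP, hypothesis from Chapter 6 of Mitchell's book):

hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)P(h) / P(D) = argmax_{h ∈ H} P(D|h)P(h)

The last step drops P(D) because it is the same constant for every candidate hypothesis, so it does not change which h wins.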
Naive Bayesian classification
If the theory still feels abstract, here is another simple example that should give you a quick feel for how the algorithm works. (Note: this example is taken from Table 3-2 in Chapter 3 of Machine Learning.)
Given the following training samples, the learning goal is to predict, from the given weather conditions, whether the answer to PlayTennis is yes or no.
Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
D11   Sunny      Mild         Normal    Strong  Yes
D12   Overcast   Mild         High      Strong  Yes
D13   Overcast   Hot          Normal    Weak    Yes
D14   Rain       Mild         High      Strong  No
The data set here provides 14 training samples. We will use the data in this table, together with the naive Bayesian classifier, to classify the following new instance:
(Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
Our task is to predict the target value (yes or no) of PlayTennis.
The naive Bayesian classifier assumes the attribute values are conditionally independent given the target value, so it chooses whichever value v of PlayTennis maximizes P(v) P(Sunny|v) P(Cool|v) P(High|v) P(Strong|v). From the table we can estimate:
P(PlayTennis = yes) = 9/14 = 0.64, P(PlayTennis = no) = 5/14 = 0.36
P(Wind = Strong | PlayTennis = yes) = 3/9 = 0.33, P(Wind = Strong | PlayTennis = no) = 3/5 = 0.6
The other conditional probabilities are computed in the same way. Substituting them all in:
P(yes) P(Sunny|yes) P(Cool|yes) P(High|yes) P(Strong|yes) = 0.0053
P(no) P(Sunny|no) P(Cool|no) P(High|no) P(Strong|no) = 0.0206
Since 0.0206 > 0.0053, the new instance should be classified as no.
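To make the calculation concrete, here is a minimal Python sketch (my own illustration, not the sample code promised for the second part) that estimates these probabilities from the table and classifies the new instance:

# PlayTennis training data (Outlook, Temperature, Humidity, Wind, PlayTennis).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def naive_bayes_classify(instance):
    """Return the target value maximizing P(v) * prod_i P(a_i | v)."""
    scores = {}
    for v in ("Yes", "No"):
        rows = [r for r in data if r[-1] == v]
        score = len(rows) / len(data)                                # prior P(v)
        for i, a in enumerate(instance):
            score *= sum(1 for r in rows if r[i] == a) / len(rows)   # P(a_i | v)
        scores[v] = score
    return max(scores, key=scores.get), scores

label, scores = naive_bayes_classify(("Sunny", "Cool", "High", "Strong"))
print(label, scores)  # -> "No", with scores close to 0.0053 (Yes) and 0.0206 (No)

The scores match the hand calculation above, and "No" wins.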
Bayesian text classification algorithm
Now let us turn to the main topic of this article: how do we apply a Bayesian classifier to Chinese text classification?
According to the joint probability formula (total probability formula):
M -- the number of keywords in the training text set after stop-word removal and other text preprocessing.