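Pros and cons of the algorithm
Pros: remains effective when little data is available; can handle multi-class problems
Cons: sensitive to how the input data is prepared
Works with: nominal data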
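Algorithm idea:
Naive Bayes
Suppose we want to decide whether an email is spam. What we observe is the distribution of words in that email. If we also know how often particular words appear in spam and in normal mail, along with the overall proportion of spam, Bayes' theorem gives us the probability that the email is spam.
The "naive" in naive Bayes refers to the assumption that the features are conditionally independent given the class. The classifier additionally assumes that every feature is equally important.
Bayesian classification is the general name for a family of classification algorithms that are all built on Bayes' theorem.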
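To make the spam example concrete, here is a minimal sketch of Bayes' theorem for a single observed word. The function name posterior_spam and all of the numbers are made up for illustration; they are not part of the code in this post.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) via Bayes' theorem for one observed word."""
    p_ham = 1.0 - p_spam
    # Evidence term: total probability of seeing the word in any email.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Hypothetical numbers: the word appears in 40% of spam and 5% of normal
# mail, and 20% of all mail is spam.
print(posterior_spam(0.40, 0.05, 0.20))  # about 0.667
```

Even though only 20% of mail is spam in this toy setup, seeing a word that is much more common in spam pushes the posterior to roughly two thirds.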
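Functions
loaddataset()
Creates the data set: sentences that have already been split into words, representing user comments from a forum. A label of 1 marks a comment as abusive.
createvocablist(dataset)
Collects every distinct word that appears in the sentences; the number of distinct words determines the length of the word vectors.
setofwords2vec(vocablist, inputset)
Converts a sentence into a vector based on the words it contains. This uses the Bernoulli model (set of words), which records only whether each word is present.
bagofwords2vecmn(vocablist, inputset)
An alternative way of converting a sentence into a vector: the multinomial model (bag of words), which records how many times each word occurs.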
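The difference between the two models is easiest to see on a sentence that repeats a word. The sketch below is a standalone illustration: the helper names set_of_words and bag_of_words and the three-word vocabulary are hypothetical, chosen so that the vector order is fixed.

```python
def set_of_words(vocab, words):
    """Bernoulli model: 1 if the vocabulary word occurs at all, else 0."""
    present = set(words)
    return [1 if v in present else 0 for v in vocab]


def bag_of_words(vocab, words):
    """Multinomial model: how many times each vocabulary word occurs."""
    return [words.count(v) for v in vocab]


vocab = ['dog', 'stupid', 'cute']        # hypothetical fixed-order vocabulary
sentence = ['stupid', 'dog', 'stupid']   # 'stupid' occurs twice

print(set_of_words(vocab, sentence))  # [1, 1, 0]
print(bag_of_words(vocab, sentence))  # [1, 2, 0]
```

The two vectors differ only in the entry for the repeated word, which is exactly the information the Bernoulli model throws away.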
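trainnb0(trainmatrix,traincatergory)
Computes the class prior p(c[1]) and the conditional probabilities p(w[i]|c[1]) and p(w[i]|c[0]). Two tricks are used here. First, the numerators and denominators are not initialized to 0 (the counts start at 1 and the denominators at 2), so that a single zero probability cannot drive the whole product to 0. Second, the probabilities are stored as logarithms, so the later multiplication becomes a sum and the result does not underflow to 0 because of limited floating-point precision.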
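Both tricks can be checked with a few lines of standalone Python. This is only a sketch: the helper smoothed and the numbers are hypothetical, and the 1 / 2 starting values simply mirror the initialization described for trainnb0.

```python
import math


def smoothed(count, total):
    """Estimate with the count starting at 1 and the denominator at 2."""
    return (count + 1.0) / (total + 2.0)


# Trick 1: a word never seen in a class no longer gets probability 0.
unsmoothed = 0 / 10.0
print(unsmoothed, smoothed(0, 10))

# Trick 2: multiplying many small probabilities underflows to 0.0 ...
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p

# ... while the sum of their logarithms stays a finite, comparable number.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)
```

Because the logarithm is monotonic, comparing the log sums of two classes picks the same winner as comparing the original products would, without the underflow.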
classifynb(vec2classify, p0vec, p1vec, pclass1)
Uses Bayes' formula to work out which of the two classes the vector more probably belongs to.
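Written out, the comparison made in log space looks like the sketch below. Here w_i is the i-th entry of the input vector and the symbol \hat{c} for the predicted class is introduced only for this illustration. The denominator p(w) is the same for both classes, so it can be dropped from the comparison.

```latex
p(c \mid w) \propto p(c) \prod_i p(w_i \mid c)^{w_i}
\quad\Longrightarrow\quad
\hat{c} = \arg\max_{c \in \{c_0, c_1\}}
  \Big[ \log p(c) + \sum_i w_i \log p(w_i \mid c) \Big]
```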
#coding=utf-8
from numpy import array, log, ones

def loaddataset():
    postinglist = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classvec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postinglist, classvec

# Build a list containing every distinct word in the data set
def createvocablist(dataset):
    vocabset = set()
    for document in dataset:
        vocabset = vocabset | set(document)  # union with this document's words
    return list(vocabset)

# Set-of-words (Bernoulli) model: mark whether each vocabulary word is present
def setofwords2vec(vocablist, inputset):
    retvocablist = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            retvocablist[vocablist.index(word)] = 1
        else:
            print('word', word, 'not in dict')
    return retvocablist

# The other model: bag of words (multinomial), counting occurrences
def bagofwords2vecmn(vocablist, inputset):
    returnvec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] += 1
    return returnvec

def trainnb0(trainmatrix, traincatergory):
    numtraindoc = len(trainmatrix)
    numwords = len(trainmatrix[0])
    # Prior probability of the abusive class (label 1)
    pabusive = sum(traincatergory) / float(numtraindoc)
    # Start counts at 1 and denominators at 2 so that no single zero
    # probability can wipe out the whole product
    p0num = ones(numwords)
    p1num = ones(numwords)
    p0denom = 2.0
    p1denom = 2.0
    for i in range(numtraindoc):
        if traincatergory[i] == 1:
            p1num += trainmatrix[i]
            p1denom += sum(trainmatrix[i])
        else:
            p0num += trainmatrix[i]
            p0denom += sum(trainmatrix[i])
    # Take logs for numerical precision; otherwise the product of many
    # small probabilities is likely to underflow to zero
    p1vect = log(p1num / p1denom)
    p0vect = log(p0num / p0denom)
    return p0vect, p1vect, pabusive

def classifynb(vec2classify, p0vec, p1vec, pclass1):
    p1 = sum(vec2classify * p1vec) + log(pclass1)  # element-wise mult
    p0 = sum(vec2classify * p0vec) + log(1.0 - pclass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingnb():
    listoposts, listclasses = loaddataset()
    myvocablist = createvocablist(listoposts)
    trainmat = []
    for postindoc in listoposts:
        trainmat.append(setofwords2vec(myvocablist, postindoc))
    p0v, p1v, pab = trainnb0(array(trainmat), array(listclasses))
    testentry = ['love', 'my', 'dalmation']
    thisdoc = array(setofwords2vec(myvocablist, testentry))
    print(testentry, 'classified as: ', classifynb(thisdoc, p0v, p1v, pab))
    testentry = ['stupid', 'garbage']
    thisdoc = array(setofwords2vec(myvocablist, testentry))
    print(testentry, 'classified as: ', classifynb(thisdoc, p0v, p1v, pab))

def main():
    testingnb()


if __name__ == '__main__':
    main()