Algorithms in C#: Applying the Naive Bayes Algorithm to Text Classification

xiaoxiao2021-02-28 47

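I. Naive Bayes classification

Formula: P(C|X) = P(X|C)P(C)/P(X)

where:
  P(C|X): posterior probability
  P(X|C): likelihood (class-conditional probability)
  P(C): prior probability
  P(X): evidence, i.e. the marginal probability of X. It is the same for every class, so it can be dropped when classes are only being compared.

II. Naive Bayes text classification

Text classification means computing, for the features of the text to be classified, the posterior probability under each class of the training sample, and then choosing the class with the largest value.

III. Two models for turning Naive Bayes into a text classifier

1. Multinomial model (term-frequency model)

In the multinomial model, let a document be d = (t1, t2, …, tk), where each tk is a word that occurs in the document and repeats are allowed. Then:

  Prior: P(c) = (total number of words in class c) / (total number of words in the whole training sample)
  Conditional probability: P(tk|c) = (number of occurrences of word tk summed over the documents of class c + 1) / (total number of words in class c + |V|)

V is the vocabulary of the training sample (each word is counted once, however often it occurs), and |V| is the number of distinct words in the training sample. In terms of the m-estimate (n_c + m·p)/(n + m), this corresponds to m = |V| and p = 1/|V|. P(tk|c) can be read as how much evidence the word tk contributes to d belonging to class c, while P(c) is the overall share (likelihood) of class c. Note that many textbook presentations of the multinomial model estimate P(c) as the fraction of training documents in class c instead; this article uses the word-count ratio throughout, and the test code below follows it.

Training set:

  yes: [Chinese, Beijing, Chinese]
  yes: [Chinese, Chinese, Shanghai]
  yes: [Chinese, Macao]
  no:  [Tokyo, Japan, Chinese]

Given the new sample "Chinese Chinese Chinese Tokyo Japan", classify it. As an attribute vector the text is d = (Chinese, Chinese, Chinese, Tokyo, Japan), and the set of classes is Y = {yes, no}. Class yes contains 8 words in total, class no contains 3, and the training sample contains 11, so P(yes) = 8/11 and P(no) = 3/11. The class-conditional probabilities are:

  P(Chinese|yes) = (5+1)/(8+6) = 6/14 = 3/7
  P(Japan|yes) = P(Tokyo|yes) = (0+1)/(8+6) = 1/14
  P(Chinese|no) = (1+1)/(3+6) = 2/9
  P(Japan|no) = P(Tokyo|no) = (1+1)/(3+6) = 2/9

In the denominators, 8 is the number of word tokens in class yes, 6 is the number of distinct words in the training sample (Chinese, Beijing, Shanghai, Macao, Tokyo, Japan), and 3 is the number of word tokens in class no.

With these class-conditional probabilities, the posterior of each class is proportional to:

  P(yes|d) ∝ (3/7)^3 × (1/14) × (1/14) × (8/11) = 54/184877 ≈ 0.00029209
  P(no|d) ∝ (2/9)^3 × (2/9) × (2/9) × (3/11) = 32/216513 ≈ 0.00014780

The score of class yes is larger, so under the multinomial model this document belongs to class yes.

2. Bernoulli model (document model)

  P(c) = (number of documents in class c) / (number of documents in the whole training sample)
  P(tk|c) = (number of documents in class c that contain word tk + 1) / (number of documents in class c + 2)

In m-estimate terms, m = 2 and p = 1/2 here. Using the same data as before: class yes has 3 documents, class no has 1, and the training sample has 4, so:

  P(yes) = 3/4, P(no) = 1/4
  P(Chinese|yes) = (3+1)/(3+2) = 4/5
  P(Japan|yes) = P(Tokyo|yes) = (0+1)/(3+2) = 1/5
  P(Beijing|yes) = P(Macao|yes) = P(Shanghai|yes) = (1+1)/(3+2) = 2/5
  P(Chinese|no) = (1+1)/(1+2) = 2/3
  P(Japan|no) = P(Tokyo|no) = (1+1)/(1+2) = 2/3
  P(Beijing|no) = P(Macao|no) = P(Shanghai|no) = (0+1)/(1+2) = 1/3

With these class-conditional probabilities, the posterior of each class is proportional to:

  P(yes|d) ∝ P(yes) × P(Chinese|yes) × P(Japan|yes) × P(Tokyo|yes) × (1-P(Beijing|yes)) × (1-P(Shanghai|yes)) × (1-P(Macao|yes))
           = 3/4 × 4/5 × 1/5 × 1/5 × (1-2/5) × (1-2/5) × (1-2/5) = 81/15625 ≈ 0.005
  P(no|d) ∝ 1/4 × 2/3 × 2/3 × 2/3 × (1-1/3) × (1-1/3) × (1-1/3) = 16/729 ≈ 0.022

The score of class no is larger, so under the Bernoulli model this document belongs to class no, not to class yes.

3. Differences between the two models

The two models work at different granularities: the multinomial model counts words, the Bernoulli model counts documents, so both the prior and the class-conditional probabilities are computed differently. When the posterior of a document d is computed, the multinomial model only uses the words that occur in d. The Bernoulli model also uses the words that are in the global vocabulary but absent from d; they take part as evidence against the class, through the factors 1-P(tk|c).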
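The two worked examples above can be checked with a few lines of C#. The following is only a minimal sketch under stated assumptions: `NaiveBayesSketch`, `MultinomialScore` and `BernoulliScore` are hypothetical names introduced for illustration, the training set is hard-coded, and the multinomial prior uses the word-count ratio described in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch; NaiveBayesSketch and its members are hypothetical names,
// not part of the article's NaiveBayes class.
public static class NaiveBayesSketch
{
    // Training set of the worked example: (class label, word tokens).
    private static readonly (string Label, string[] Words)[] Train =
    {
        ("yes", new[] { "Chinese", "Beijing", "Chinese" }),
        ("yes", new[] { "Chinese", "Chinese", "Shanghai" }),
        ("yes", new[] { "Chinese", "Macao" }),
        ("no",  new[] { "Tokyo", "Japan", "Chinese" }),
    };

    // Vocabulary V: the distinct words of the whole training set (|V| = 6 here).
    private static readonly string[] Vocabulary =
        Train.SelectMany(t => t.Words).Distinct().ToArray();

    // Multinomial model: P(c) times the product, over the tokens of d,
    // of the add-one smoothed P(t|c). Only words that occur in d take part.
    public static double MultinomialScore(string label, string[] doc)
    {
        List<string> classWords = Train.Where(t => t.Label == label)
                                       .SelectMany(t => t.Words)
                                       .ToList();
        int totalWords = Train.Sum(t => t.Words.Length);

        double score = (double)classWords.Count / totalWords; // word-count prior
        foreach (string term in doc)
        {
            int occurrences = classWords.Count(w => w == term);
            score *= (occurrences + 1.0) / (classWords.Count + Vocabulary.Length);
        }
        return score;
    }

    // Bernoulli model: P(c) times one factor per vocabulary word,
    // P(t|c) if the word occurs in d and 1 - P(t|c) if it does not.
    public static double BernoulliScore(string label, string[] doc)
    {
        var classDocs = Train.Where(t => t.Label == label).ToList();
        var present = new HashSet<string>(doc);

        double score = (double)classDocs.Count / Train.Length; // document-count prior
        foreach (string term in Vocabulary)
        {
            int docsWithTerm = classDocs.Count(t => Array.IndexOf(t.Words, term) >= 0);
            double p = (docsWithTerm + 1.0) / (classDocs.Count + 2);
            score *= present.Contains(term) ? p : 1.0 - p;
        }
        return score;
    }
}
```

For d = (Chinese, Chinese, Chinese, Tokyo, Japan), `MultinomialScore` gives about 0.000292 for yes and 0.000148 for no, while `BernoulliScore` gives about 0.0052 for yes and 0.0219 for no, which matches the decimal values of the worked examples and shows the two models disagreeing on the same document.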

IV. Test code (Naive Bayes implementation and calling code)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace DiscoverTest.Arithmetic
{
    [TestClass]
    public class NbTest
    {
        [TestMethod]
        public void NbType()
        {
            // Class set: class code -> display name
            var types = new Dictionary<string, string>
            {
                {"yes", "good weather"},
                {"no", "bad weather"},
            };
            var trainSet = new List<NaiveBayes.TrainItem>
            {
                new NaiveBayes.TrainItem("yes", "sunny,mild,comfortable,breeze,cloudy"),
                new NaiveBayes.TrainItem("no", "humid,cold,hot,gale,rainy"),
            };
            var contents = new[]
            {
                "sunny,hot,humid,breeze",
                "rainy,cold,comfortable,gale",
                "rainy,cold,humid,gale",
                "cloudy,hot,comfortable,breeze",
            };

            foreach (var content in contents)
            {
                Console.WriteLine(new string('~', 60));
                var nb = new NaiveBayes(types, trainSet);
                Console.WriteLine("terms: {0}", string.Join(",", nb.GetTermSegment(content)));
                var type = nb.GetClassify(content);
                Console.WriteLine("result={0}, {2}; features to classify={1};", type, content, types[type]);
            }
        }
    }

    /// <summary>
    /// Naive Bayes classifier (multinomial model)
    /// </summary>
    public class NaiveBayes
    {
        /// <param name="types">Class dictionary (class code -> display name)</param>
        /// <param name="trainSet">Training text set</param>
        public NaiveBayes(Dictionary<string, string> types, List<TrainItem> trainSet)
        {
            _typesScore = new Dictionary<string, double>();
            _types = types;
            // Initialize the training set
            Trains = new TrainSetInfo();
            trainSet.ForEach(x => Trains.Add(x));
        }

        #region Fields

        private readonly Dictionary<string, string> _types;

        /// <summary>
        /// Training set
        /// </summary>
        public TrainSetInfo Trains { get; set; }

        /// <summary>
        /// Score of each class
        /// </summary>
        private readonly Dictionary<string, double> _typesScore;

        /// <summary>
        /// Terms of the text being classified (repeats are kept; repeated terms help classification)
        /// </summary>
        private List<string> _tempTermList = new List<string>();

        #endregion

        #region Methods

        /// <summary>
        /// Word segmentation
        /// </summary>
        public List<string> GetTermSegment(string content)
        {
            var lst = new List<string>();
            // Walk the training vocabulary and collect every occurrence of each term in the text
            foreach (string term in Trains.TrainTermSet)
            {
                // Escape the term so that it is matched literally, not as a regex pattern
                foreach (Match m in Regex.Matches(content, Regex.Escape(term)))
                {
                    lst.Add(m.Value);
                }
            }
            lst.Sort();
            return lst;
        }

        /// <summary>
        /// Get the final classification result
        /// </summary>
        public string GetClassify(string content)
        {
            _tempTermList = GetTermSegment(content); // segmentation

            /*
             * P(C|X) = P(X|C)P(C)/P(X): posterior = likelihood * prior / evidence
             * P(X) is the same for every class, so only P(X|C)P(C) has to be computed.
             *
             * P(X|C) = P(x1|c1)P(x2|c1)...P(xn|c1)
             * P(x1|c1) = (occurrences of x1 in the documents of c1 + 1)
             *            / (words in class c1, repeats counted + distinct words in the training set)
             * P(c1) = (words in class c1, repeats counted) / (words in the training set, repeats counted)
             */

            // 1. Compute the score of each class
            foreach (var type in _types.Keys)
            {
                double likelihood = GetLikelihood(type); // P(X|c1) = P(x1|c1)P(x2|c1)...P(xn|c1)
                double prior = GetPrior(type);           // P(c1)
                double probability = likelihood * prior; // final value: P(X|C)P(C)
                NoteTypeScore(type, probability);
            }

            // 2. Return the code of the class with the highest score
            string typeCode = GetMaxScoreType();
            if (string.Equals(typeCode, string.Empty)) return "-1";
            return typeCode;
        }

        private string GetMaxScoreType()
        {
            if (_typesScore.Count == 0) return string.Empty;
            // Sort by score, highest first. The ordered list is used directly because
            // Dictionary<TKey, TValue> does not guarantee any enumeration order.
            var sorted = _typesScore.OrderByDescending(x => x.Value).ToList();
            Console.WriteLine("sorted: {0}", string.Join(", ", sorted.Select(x => x.Key + "=" + x.Value)));
            return sorted.First().Key;
        }

        /// <summary>
        /// Record the score of a class
        /// </summary>
        private void NoteTypeScore(string type, double score)
        {
            _typesScore[type] = score;
        }

        /// <summary>
        /// Compute the prior probability
        /// </summary>
        private double GetPrior(string type)
        {
            // P(c) = (words in class c) / (words in the whole training sample)
            int typeCount = Trains.GetTrainTermCount(type);
            int allCount = Trains.GetTrainTermCount();
            return (double)typeCount / allCount;
        }

        /// <summary>
        /// Compute the likelihood
        /// </summary>
        private double GetLikelihood(string type)
        {
            /*
             * P(X|c1) = P(x1|c1)P(x2|c1)...P(xn|c1)
             * Laplace smoothing adds 1 to every count of every class,
             * which avoids the case P(x1|c1) = 0.
             */
            int typeTermCount = Trains.GetTrainTermCount(type);
            int vocabularySize = Trains.TrainTermSet.Count;
            int sum = typeTermCount + vocabularySize;
            double result = 1.0;
            // Walk the terms of the text being classified
            foreach (var term in _tempTermList)
            {
                int count = Trains.GetTrainTermCount(type, term) + 1;
                result *= (double)count / sum; // P(x1|c1)
            }
            return result;
        }

        #endregion

        #region Types

        /// <summary>
        /// Training set
        /// </summary>
        public class TrainSetInfo
        {
            public TrainSetInfo()
            {
                TrainTypeGroup = new Dictionary<string, List<TrainItem>>();
                TrainTermSet = new SortedSet<string>();
            }

            /// <summary>
            /// Training items grouped by class (terms may repeat)
            /// </summary>
            public Dictionary<string, List<TrainItem>> TrainTypeGroup { get; set; }

            /// <summary>
            /// Vocabulary (distinct terms)
            /// </summary>
            public SortedSet<string> TrainTermSet { get; set; }

            /// <summary>
            /// Number of terms in the training set (repeats counted)
            /// </summary>
            /// <param name="type">Class code; when empty, all classes are counted</param>
            public int GetTrainTermCount(string type = "")
            {
                // Count terms, not training items, so that the result is the
                // word total that the prior and likelihood formulas expect.
                if (!string.IsNullOrEmpty(type))
                    return TrainTypeGroup[type].Sum(x => x.TermList.Count);
                return TrainTypeGroup.Values.Sum(items => items.Sum(x => x.TermList.Count));
            }

            /// <summary>
            /// Number of occurrences of a term within one class
            /// </summary>
            /// <param name="type">Class code</param>
            /// <param name="term">Term to look for under the class</param>
            public int GetTrainTermCount(string type, string term)
            {
                // Search the lists that keep repeated terms
                return TrainTypeGroup[type].Sum(x => x.TermList.Count(o => o.Equals(term)));
            }

            /// <summary>
            /// Add a training text
            /// </summary>
            public void Add(TrainItem item)
            {
                // Extend the vocabulary of the whole training set
                foreach (var term in item.TermSet)
                {
                    TrainTermSet.Add(term);
                }
                // Group the training items by class
                string type = item.Type;
                if (TrainTypeGroup.ContainsKey(type))
                    TrainTypeGroup[type].Add(item);
                else
                    TrainTypeGroup.Add(type, new List<TrainItem> { item });
            }
        }

        /// <summary>
        /// Training item
        /// </summary>
        public class TrainItem
        {
            /// <param name="type">Class code</param>
            /// <param name="words">Term sequence (comma separated)</param>
            public TrainItem(string type, string words)
            {
                Type = type;
                TermSet = new SortedSet<string>();
                TermList = new List<string>();
                foreach (var word in words.Split(','))
                {
                    TermSet.Add(word);
                    TermList.Add(word);
                }
                TermList.Sort();
            }

            /// <summary>
            /// Class code
            /// </summary>
            public string Type { get; set; }

            /// <summary>
            /// Terms (distinct)
            /// </summary>
            public SortedSet<string> TermSet { get; set; }

            /// <summary>
            /// Terms (as given, repeats kept)
            /// </summary>
            public List<string> TermList { get; set; }
        }

        #endregion
    }
}
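One practical remark on the likelihood computation: `GetLikelihood` multiplies one probability per term, and for texts with many terms such a product can underflow to 0.0 in double precision, which makes all classes look equally unlikely. A common remedy, which the article's code does not use, is to compare log P(c) + Σ log P(tk|c) instead; the logarithm is monotonic, so the winning class stays the same. A minimal sketch, where `LogSpaceSketch` and `LogScore` are hypothetical names and the per-term probabilities are assumed to be already smoothed (strictly positive):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch; LogSpaceSketch is a hypothetical helper,
// not part of the article's NaiveBayes class.
public static class LogSpaceSketch
{
    // Returns log P(c) + sum over k of log P(t_k|c).
    // Assumes prior > 0 and every term probability > 0 (e.g. after Laplace smoothing).
    public static double LogScore(double prior, IEnumerable<double> termProbabilities)
    {
        return Math.Log(prior) + termProbabilities.Sum(p => Math.Log(p));
    }
}
```

With 2000 terms of probability 0.001 each, the plain product is smaller than any positive double and evaluates to 0.0, while `LogScore` returns a finite negative number that can still be compared across classes.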

When reposting, please cite the original URL: https://www.6miu.com/read-2622722.html
