Contents
1. Basic Concepts
   1.1 What Is a Decision Tree  1.2 Definition of Information  1.3 Entropy (Shannon Entropy)  1.4 Information Gain
2. Characteristics of Decision Trees (advantages, disadvantages, applicable data types)
3. Machine Learning in Action Code
4. lenses.txt Data
1. Basic Concepts
1.1 What Is a Decision Tree
A decision tree (Decision Tree) is a decision-analysis method that, given the known probabilities of the various possible outcomes, builds a tree of decisions to find the probability that the expected net present value is greater than or equal to zero, evaluate project risk, and judge feasibility; it is a graphical technique that applies probability analysis directly. Because the branches of such a decision diagram look like the limbs of a tree, the method is called a decision tree. In machine learning, a decision tree is a predictive model: it represents a mapping between object attributes and object values. Entropy measures the disorder of a system, and the tree-growing algorithms ID3, C4.5, and C5.0 all use entropy, a quantity taken from information theory.
A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents one outcome of that test, and each leaf node represents a class. Classification trees (decision trees) are a very widely used classification method. They are a form of supervised learning: given a set of samples, each with a group of attributes and a class label drawn from a predetermined set of classes, learning produces a classifier that can assign the correct class to newly seen objects.
The algorithm presented here is ID3.
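To make the representation concrete, here is a minimal sketch of how the code in Section 3 stores such a tree as nested Python dicts: the key of an inner dict is a feature name, the keys one level down are that feature's values, and a leaf is simply a class label.

# The tree tests 'no surfacing' first; on branch 1 it then tests
# 'flippers'; the strings 'yes'/'no' at the leaves are class labels.
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}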
1.2 Definition of Information
If the item being classified can fall into one of several classes, then the information of the symbol $x_i$ is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the probability of that class.
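For example, an outcome with probability $p(x_i) = 0.5$ carries $-\log_2 0.5 = 1$ bit of information, while a rarer outcome with $p(x_i) = 0.25$ carries $-\log_2 0.25 = 2$ bits: the less probable an event, the more information it conveys.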
1.3 Entropy (Shannon Entropy)
Entropy is defined as the expected value of the information:

$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
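As a worked example, the toy dataset built by createDataSet() in Section 3 has five samples, two labeled 'yes' and three labeled 'no', so its entropy is

$$H = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971 \text{ bits.}$$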
1.4 Information Gain
The change in information before and after splitting the data set is called the information gain. Compute the information gain obtained by splitting the data set on each feature; the feature that yields the highest information gain is the best one to split on.
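Continuing the worked example: splitting the toy dataset on the first feature ('no surfacing') sends labels {yes, yes, no} down the value-1 branch (entropy $\approx 0.918$) and {no, no} down the value-0 branch (entropy $0$), so

$$\text{gain} = 0.971 - \tfrac{3}{5}(0.918) - \tfrac{2}{5}(0) \approx 0.420.$$

Splitting on 'flippers' instead gives branches {yes, yes, no, no} (entropy $1.0$) and {no} (entropy $0$), for a gain of only $0.971 - \tfrac{4}{5}(1.0) \approx 0.171$, so the first feature is the better split.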
2. Characteristics of Decision Trees
Advantages:
Low computational complexity; results that are easy to interpret; insensitive to missing intermediate values; able to handle irrelevant features.
Disadvantages:
Prone to overfitting.
Applicable data types:
Numeric and nominal.
3. Machine Learning in Action Code
from math import log
import operator

import treePlotter  # companion plotting module from Machine Learning in Action
def createDataSet():
    # Toy data: columns are [no surfacing, flippers, class label].
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
def calcShannonEnt(dataSet):
    """Compute the Shannon entropy of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        curLabel = featVec[-1]              # class label is the last column
        if curLabel not in labelCounts:
            labelCounts[curLabel] = 0
        labelCounts[curLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
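On the toy dataset this reproduces the 0.971 bits computed by hand in Section 1.3:

>>> myDat, labels = createDataSet()
>>> calcShannonEnt(myDat)
0.9709505944546686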
def splitDataSet(dataSet, axis, value):
    """Return the rows whose feature at index `axis` equals `value`, with that feature removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]            # everything before the split feature...
            reducedFeatVec.extend(featVec[axis + 1:])  # ...plus everything after it
            retDataSet.append(reducedFeatVec)
    return retDataSet
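For instance, splitting the toy data on feature 0:

>>> splitDataSet(myDat, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> splitDataSet(myDat, 0, 0)
[[1, 'no'], [1, 'no']]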
def chooseBestFeatureToSplit(dataSet):
    """Return the index of the feature whose split gives the highest information gain."""
    numFeatures = len(dataSet[0]) - 1       # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestFeature = -1
    bestInfoGain = 0.0
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            # Each subset's entropy is computed as if the subset were a whole
            # dataset, so weight it by the fraction of samples in the subset;
            # newEntropy is then the expected entropy after the split.
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
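As derived in Section 1.4, the first feature wins on the toy data:

>>> chooseBestFeatureToSplit(myDat)
0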
def majorCnt(classList):
    """Return the most frequent class label in classList (majority vote)."""
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sc = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sc[0][0]
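majorCnt is the fallback used when the features are exhausted but the labels still disagree, e.g.:

>>> majorCnt(['yes', 'no', 'no'])
'no'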
def createTree(dataSet, labels):
    """Recursively grow the ID3 tree as nested dicts."""
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # all labels identical: leaf
    if len(dataSet[0]) == 1:
        return majorCnt(classList)           # no features left: majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestLabel = labels[bestFeat]
    myTree = {bestLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]                # copy so recursive calls don't clobber the list
        myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
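Building the tree for the toy data reproduces the first stored tree in retrieveTree below:

>>> myDat, labels = createDataSet()
>>> createTree(myDat, labels)
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}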
def classify(inputTree, featLabels, testVec):
    """Walk the tree using the feature values in testVec and return the predicted label."""
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)   # map the feature name back to its column index
    for key in secondDict:
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)  # descend
            else:
                classLabel = secondDict[key]                                 # leaf reached
    return classLabel
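Classifying two test vectors against the stored tree (retrieveTree is defined next):

>>> myTree = retrieveTree(0)
>>> classify(myTree, ['no surfacing', 'flippers'], [1, 0])
'no'
>>> classify(myTree, ['no surfacing', 'flippers'], [1, 1])
'yes'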
def retrieveTree(i):
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}},
                                                               1: 'no'}}}}]
    return listOfTrees[i]
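The stored trees pair naturally with the treePlotter module used below, for example:

>>> treePlotter.createPlot(retrieveTree(1))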
if __name__ == '__main__':
    # Grow a tree for the contact-lens data and plot it.
    fr = open('lenses.txt')
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    lensesLabel = ['age', 'prescipt', 'astigmatic', 'tearRate']
    lensesTree = createTree(lenses, lensesLabel)
    print(lensesTree)
    treePlotter.createPlot(lensesTree)
4. lenses.txt Data
young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses