Contents
1. Basic Concepts
   1.1 What Is a Decision Tree  1.2 Definition of Information  1.3 Entropy (Shannon Entropy)  1.4 Information Gain
2. Characteristics of Decision Trees (advantages, disadvantages, applicable data types)
3. Machine Learning in Action Code
4. lenses.txt Data
1. Basic Concepts
1.1 What Is a Decision Tree
A decision tree (Decision Tree) is a decision-analysis method that, given the known probabilities of the various possible outcomes, builds a tree of decisions to find the probability that the expected net present value is greater than or equal to zero, evaluate project risk, and judge feasibility; it is a graphical technique that applies probability analysis directly. Because the branches of such a decision diagram look like the limbs of a tree, the method is called a decision tree. In machine learning, a decision tree is a predictive model: it represents a mapping between object attributes and object values. Entropy measures the disorder of a system, and the tree-growing algorithms ID3, C4.5, and C5.0 all use entropy, a quantity taken from information theory.
A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents one outcome of that test, and each leaf node represents a class. Classification trees (decision trees) are a very widely used classification method. They are a form of supervised learning: given a set of samples, each with a group of attributes and a class label drawn from a predetermined set of classes, learning produces a classifier that can assign the correct class to newly seen objects.
The algorithm presented here is ID3.
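To make the representation concrete, here is a minimal sketch of how the code in Section 3 stores such a tree as nested Python dicts: the key of an inner dict is a feature name, the keys one level down are that feature's values, and a leaf is simply a class label.

# The tree tests 'no surfacing' first; on branch 1 it then tests
# 'flippers'; the strings 'yes'/'no' at the leaves are class labels.
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}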
1.2 Definition of Information
If the item being classified can fall into one of several classes, then the information of the symbol $x_i$ is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the probability of that class.
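For example, an outcome with probability $p(x_i) = 0.5$ carries $-\log_2 0.5 = 1$ bit of information, while a rarer outcome with $p(x_i) = 0.25$ carries $-\log_2 0.25 = 2$ bits: the less probable an event, the more information it conveys.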
1.3 Entropy (Shannon Entropy)
Entropy is defined as the expected value of the information:

$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
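As a worked example, the toy dataset built by createDataSet() in Section 3 has five samples, two labeled 'yes' and three labeled 'no', so its entropy is

$$H = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971 \text{ bits.}$$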
1.4 Information Gain
The change in information before and after splitting the data set is called the information gain. Compute the information gain obtained by splitting the data set on each feature; the feature that yields the highest information gain is the best one to split on.
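Continuing the worked example: splitting the toy dataset on the first feature ('no surfacing') sends labels {yes, yes, no} down the value-1 branch (entropy $\approx 0.918$) and {no, no} down the value-0 branch (entropy $0$), so

$$\text{gain} = 0.971 - \tfrac{3}{5}(0.918) - \tfrac{2}{5}(0) \approx 0.420.$$

Splitting on 'flippers' instead gives branches {yes, yes, no, no} (entropy $1.0$) and {no} (entropy $0$), for a gain of only $0.971 - \tfrac{4}{5}(1.0) \approx 0.171$, so the first feature is the better split.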
2. Characteristics of Decision Trees
Advantages:
Low computational complexity; results that are easy to interpret; insensitive to missing intermediate values; able to handle irrelevant features.
Disadvantages:
Prone to overfitting.
Applicable data types:
Numeric and nominal.
3. Machine Learning in Action Code
from math import log
import operator

import treePlotter  # companion plotting module from Machine Learning in Action
def createDataSet():
    # Toy data: columns are [no surfacing, flippers, class label].
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
def calcShannonEnt(dataSet):
    """Compute the Shannon entropy of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        curLabel = featVec[-1]              # class label is the last column
        if curLabel not in labelCounts:
            labelCounts[curLabel] = 0
        labelCounts[curLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
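On the toy dataset this reproduces the 0.971 bits computed by hand in Section 1.3:

>>> myDat, labels = createDataSet()
>>> calcShannonEnt(myDat)
0.9709505944546686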
def splitDataSet(dataSet, axis, value):
    """Return the rows whose feature at index `axis` equals `value`, with that feature removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]            # everything before the split feature...
            reducedFeatVec.extend(featVec[axis + 1:])  # ...plus everything after it
            retDataSet.append(reducedFeatVec)
    return retDataSet
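For instance, splitting the toy data on feature 0:

>>> splitDataSet(myDat, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> splitDataSet(myDat, 0, 0)
[[1, 'no'], [1, 'no']]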
def chooseBestFeatureToSplit(dataSet):
    """Return the index of the feature whose split gives the highest information gain."""
    numFeatures = len(dataSet[0]) - 1       # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestFeature = -1
    bestInfoGain = 0.0
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            # Each subset's entropy is computed as if the subset were a whole
            # dataset, so weight it by the fraction of samples in the subset;
            # newEntropy is then the expected entropy after the split.
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
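As derived in Section 1.4, the first feature wins on the toy data:

>>> chooseBestFeatureToSplit(myDat)
0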
def majorCnt(classList):
    """Return the most frequent class label in classList (majority vote)."""
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sc = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sc[0][0]
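majorCnt is the fallback used when the features are exhausted but the labels still disagree, e.g.:

>>> majorCnt(['yes', 'no', 'no'])
'no'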
def createTree(dataSet, labels):
    """Recursively grow the ID3 tree as nested dicts."""
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # all labels identical: leaf
    if len(dataSet[0]) == 1:
        return majorCnt(classList)           # no features left: majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestLabel = labels[bestFeat]
    myTree = {bestLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]                # copy so recursive calls don't clobber the list
        myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
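Building the tree for the toy data reproduces the first stored tree in retrieveTree below:

>>> myDat, labels = createDataSet()
>>> createTree(myDat, labels)
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}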
def classify(inputTree, featLabels, testVec):
    """Walk the tree using the feature values in testVec and return the predicted label."""
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)   # map the feature name back to its column index
    for key in secondDict:
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)  # descend
            else:
                classLabel = secondDict[key]                                 # leaf reached
    return classLabel
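Classifying two test vectors against the stored tree (retrieveTree is defined next):

>>> myTree = retrieveTree(0)
>>> classify(myTree, ['no surfacing', 'flippers'], [1, 0])
'no'
>>> classify(myTree, ['no surfacing', 'flippers'], [1, 1])
'yes'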
def retrieveTree(i):
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}},
                                                               1: 'no'}}}}]
    return listOfTrees[i]
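The stored trees pair naturally with the treePlotter module used below, for example:

>>> treePlotter.createPlot(retrieveTree(1))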
if __name__ == '__main__':
    # Grow a tree for the contact-lens data and plot it.
    fr = open('lenses.txt')
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    lensesLabel = ['age', 'prescipt', 'astigmatic', 'tearRate']
    lensesTree = createTree(lenses, lensesLabel)
    print(lensesTree)
    treePlotter.createPlot(lensesTree)
4. lenses.txt Data
young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses