Machine Learning Notes 1
Main algorithms to study
k-nearest neighbors, linear regression, naive Bayes, locally weighted linear regression, support vector machines, ridge regression, decision trees, Lasso (minimum regression coefficient estimation), k-means, expectation maximization (EM), DBSCAN, Parzen window estimation
Main steps of machine learning:
1. Collect data
2. Prepare the input data
3. Analyze the input data
4. Train the algorithm
5. Test the algorithm
6. Use the algorithm
k-Nearest Neighbors (kNN)
Algorithm idea:
The algorithm is very simple: for an input sample, compare it against the existing labeled samples and, according to the matching (distance) function, take out the k samples that match best; the class that appears most frequently among those k samples becomes the predicted class of the input.
In short: find the ones that look most alike.
Algorithm steps:
1. Compute the distance between the current point and every point in the dataset with known labels
2. Sort by distance in ascending order
3. Take the k points with the smallest distance to the current point
4. Count how often each class appears among those k points
5. Return the most frequent class as the predicted class
Example:
Data value     Class
(1, 1)         A
(1.1, 1)       A
(2, 2)         B
(1.9, 2.1)     B
(1.1, 1.1)     ?
Here the distance metric is the distance between points in the 2-D plane (Euclidean distance), and k is set to 2 (k is usually no larger than 20).
Distance from the query point (1.1, 1.1) to the first point (1, 1):
$l_1 = \sqrt{0.1^2 + 0.1^2} \approx 0.14$
Computing the remaining distances the same way and sorting by the size of $l$, the two nearest neighbors are both of class A, so this prediction method labels the query point as class A.
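To make the procedure concrete, here is a minimal sketch in Python/NumPy of the steps above, applied to the example data; the function name knn_classify and the NumPy-based implementation are my own illustration, not code from the book.

import numpy as np

def knn_classify(query, data, labels, k):
    # Euclidean distance from the query point to every known sample
    dists = np.sqrt(((data - query) ** 2).sum(axis=1))
    # indices of the k closest samples
    nearest = dists.argsort()[:k]
    # majority vote over the labels of those k samples
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# the example above: four labeled points and one query point
data = np.array([[1.0, 1.0], [1.1, 1.0], [2.0, 2.0], [1.9, 2.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([1.1, 1.1]), data, labels, k=2))  # prints 'A'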
Algorithm characteristics
Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity, high space complexity; it cannot 'understand' the nature of the data, gives no information about the underlying structure, and does not tell you what characteristics a typical sample of each class has.
Applicable data types: numeric and nominal.
Decision Trees
Algorithm idea:
Split the data into classes according to its features, using entropy (a measure of how inconsistent/disordered a set is) to decide the order in which the splits are made. The result is a tree, similar to a flowchart with terminating blocks, which is traversed from top to bottom.
The guiding principle for splitting a dataset: turn disordered data into ordered data.
Information gain: the change in information before and after splitting the dataset.
Entropy: this measure of the information in a set is called Shannon entropy, or entropy for short. It is the expected value of the information.
The information of a class $x_i$ is defined as $l(x_i) = -\log_2 p(x_i)$, where $p(x_i)$ is the probability of choosing that class.
The entropy is then $H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$, summed over all $n$ classes. As for why the $\log_2 p(x_i)$ term is there, just remember that this is the formula Shannon came up with for measuring information entropy: the higher the entropy, the more disordered the data.
Gini impurity: measures the probability that an item would be misclassified into another group.
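As a quick worked example (my own arithmetic, using the small dataset from the code below, which has 2 'yes' and 3 'no' labels): the entropy of the whole set is
$H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971$
Splitting on the first feature gives subsets with entropies of about 0.918 (3 samples) and 0 (2 samples), so the entropy after the split is about $\tfrac{3}{5}\times 0.918 \approx 0.551$ and the information gain is $0.971 - 0.551 \approx 0.420$. Splitting on the second feature gives a gain of only about 0.171, so the first feature is chosen for the first split.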
Algorithm steps:
Split the dataset and use the entropy values to build the decision tree: at each step, prefer the split that leaves the lowest entropy (i.e., the highest information gain). Once the tree is built, new data can be classified by walking down the tree and comparing feature values.
Example:
from math import log
import operator
import matplotlib.pyplot as plt


def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numberEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        curLabel = featVec[-1]
        if curLabel not in labelCounts:
            labelCounts[curLabel] = 0
        labelCounts[curLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numberEntries
        shannonEnt -= prob * log(prob, 2)   # minus sign: log(prob) <= 0
    return shannonEnt


def createDataSet():
    # toy dataset: [no surfacing, flippers, is it a fish?]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels


def splitDataSet(dataSet, axis, value):
    # keep the samples whose feature `axis` equals `value`, with that feature removed
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    # pick the feature whose split yields the largest information gain
    numberFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy   # gain = entropy before - entropy after
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


def majorityCnt(classList):
    # most frequent class label, used when no features are left to split on
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    # recursively build the decision tree as nested dictionaries
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # all samples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)        # no features left: majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


def getNumLeafs(myTree):
    # number of leaf nodes in the tree
    numLeafs = 0
    firstStr = list(myTree)[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


def getTreeDepth(myTree):
    # depth of the tree
    maxDepth = 0
    firstStr = list(myTree)[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # draw one annotated node with an arrow coming from its parent
    createPlot.axl.annotate(nodeTxt, xy=parentPt, xycoords="axes fraction",
                            xytext=centerPt, textcoords="axes fraction",
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)


def plotMidText(cntrPt, parentPt, txtString):
    # write the edge label halfway between parent and child
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.axl.text(xMid, yMid, txtString)


def plotTree(myTree, parentPt, nodeTxt):
    # recursively lay out and draw the tree
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict:
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff),
                     cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD


def createPlot(inTree):
    # set up the figure and start the recursive drawing
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.axl = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()


# node and arrow styles used by the plotting functions
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

dataSet, labels = createDataSet()
myTree = createTree(dataSet, labels)
print(myTree)
createPlot(myTree)
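With the dataset from createDataSet, running this should print a nested dictionary along the lines of {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}} (this is my own trace of the logic, not output copied from the book) and then draw the corresponding tree with matplotlib.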
Algorithm characteristics:
Pros: the data representation is especially easy to understand; computational complexity is not high; the output is easy to interpret; it is insensitive to missing intermediate values; it can handle irrelevant features.
Cons: it can overfit the training data and may produce a large number of nodes, making the classification unwieldy.
Applicable data types: numeric and nominal.
Essence of the algorithm:
In my understanding, the essence of this algorithm is using entropy to split the set so that the data can be separated according to some feature; it is simple and easy to understand.
The main task of machine learning is classification.
This article is my personal notes and excerpts, based on my understanding after studying Machine Learning in Action (《机器学习实战》, Posts & Telecom Press); it serves as my own notes and may also be a reference for others.