Machine Learning Notes 1
Main algorithms to study
k-nearest neighbors, linear regression, naive Bayes, locally weighted linear regression, support vector machines, ridge regression, decision trees, Lasso (minimum regression coefficient estimation), k-means, expectation maximization (EM), DBSCAN, Parzen window estimation
Main steps of machine learning:
1. Collect data
2. Prepare the input data
3. Analyze the input data
4. Train the algorithm
5. Test the algorithm
6. Use the algorithm
k-Nearest Neighbors (kNN)
Algorithm idea:
The algorithm is very simple: for an input sample, compare it against the existing labeled samples and, according to the matching (distance) function, take out the k samples that match best; the class that appears most frequently among those k samples becomes the predicted class of the input.
In short: find the ones that look most alike.
Algorithm steps:
1. Compute the distance between the current point and every point in the dataset with known labels
2. Sort by distance in ascending order
3. Take the k points with the smallest distance to the current point
4. Count how often each class appears among those k points
5. Return the most frequent class as the predicted class
Example:
Data value     Class
(1, 1)         A
(1.1, 1)       A
(2, 2)         B
(1.9, 2.1)     B
(1.1, 1.1)     ?
Here the distance metric is the distance between points in the 2-D plane (Euclidean distance), and k is set to 2 (k is usually no larger than 20).
Distance from the query point (1.1, 1.1) to the first point (1, 1):
$l_1 = \sqrt{0.1^2 + 0.1^2} \approx 0.14$
Computing the remaining distances the same way and sorting by the size of $l$, the two nearest neighbors are both of class A, so this prediction method labels the query point as class A.
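To make the procedure concrete, here is a minimal sketch in Python/NumPy of the steps above, applied to the example data; the function name knn_classify and the NumPy-based implementation are my own illustration, not code from the book.

import numpy as np

def knn_classify(query, data, labels, k):
    # Euclidean distance from the query point to every known sample
    dists = np.sqrt(((data - query) ** 2).sum(axis=1))
    # indices of the k closest samples
    nearest = dists.argsort()[:k]
    # majority vote over the labels of those k samples
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# the example above: four labeled points and one query point
data = np.array([[1.0, 1.0], [1.1, 1.0], [2.0, 2.0], [1.9, 2.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([1.1, 1.1]), data, labels, k=2))  # prints 'A'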
Algorithm characteristics
Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity, high space complexity; it cannot 'understand' the nature of the data, gives no information about the underlying structure, and does not tell you what characteristics a typical sample of each class has.
Applicable data types: numeric and nominal.
Decision Trees
Algorithm idea:
Split the data into classes according to its features, using entropy (a measure of how inconsistent/disordered a set is) to decide the order in which the splits are made. The result is a tree, similar to a flowchart with terminating blocks, which is traversed from top to bottom.
The guiding principle for splitting a dataset: turn disordered data into ordered data.
Information gain: the change in information before and after splitting the dataset.
Entropy: this measure of the information in a set is called Shannon entropy, or entropy for short. It is the expected value of the information.
The information of a class $x_i$ is defined as $l(x_i) = -\log_2 p(x_i)$, where $p(x_i)$ is the probability of choosing that class.
The entropy is then $H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$, summed over all $n$ classes. As for why the $\log_2 p(x_i)$ term is there, just remember that this is the formula Shannon came up with for measuring information entropy: the higher the entropy, the more disordered the data.
Gini impurity: measures the probability that an item would be misclassified into another group.
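As a quick worked example (my own arithmetic, using the small dataset from the code below, which has 2 'yes' and 3 'no' labels): the entropy of the whole set is
$H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971$
Splitting on the first feature gives subsets with entropies of about 0.918 (3 samples) and 0 (2 samples), so the entropy after the split is about $\tfrac{3}{5}\times 0.918 \approx 0.551$ and the information gain is $0.971 - 0.551 \approx 0.420$. Splitting on the second feature gives a gain of only about 0.171, so the first feature is chosen for the first split.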
Algorithm steps:
Split the dataset and use the entropy values to build the decision tree: at each step, prefer the split that leaves the lowest entropy (i.e., the highest information gain). Once the tree is built, new data can be classified by walking down the tree and comparing feature values.
Example:
from math import log
import operator
import matplotlib.pyplot as plt


def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numberEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        curLabel = featVec[-1]
        if curLabel not in labelCounts:
            labelCounts[curLabel] = 0
        labelCounts[curLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numberEntries
        shannonEnt -= prob * log(prob, 2)   # minus sign: log(prob) <= 0
    return shannonEnt


def createDataSet():
    # toy dataset: [no surfacing, flippers, is it a fish?]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels


def splitDataSet(dataSet, axis, value):
    # keep the samples whose feature `axis` equals `value`, with that feature removed
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    # pick the feature whose split yields the largest information gain
    numberFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy   # gain = entropy before - entropy after
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


def majorityCnt(classList):
    # most frequent class label, used when no features are left to split on
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    # recursively build the decision tree as nested dictionaries
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # all samples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)        # no features left: majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


def getNumLeafs(myTree):
    # number of leaf nodes in the tree
    numLeafs = 0
    firstStr = list(myTree)[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


def getTreeDepth(myTree):
    # depth of the tree
    maxDepth = 0
    firstStr = list(myTree)[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # draw one annotated node with an arrow coming from its parent
    createPlot.axl.annotate(nodeTxt, xy=parentPt, xycoords="axes fraction",
                            xytext=centerPt, textcoords="axes fraction",
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)


def plotMidText(cntrPt, parentPt, txtString):
    # write the edge label halfway between parent and child
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.axl.text(xMid, yMid, txtString)


def plotTree(myTree, parentPt, nodeTxt):
    # recursively lay out and draw the tree
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict:
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff),
                     cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD


def createPlot(inTree):
    # set up the figure and start the recursive drawing
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.axl = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()


# node and arrow styles used by the plotting functions
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

dataSet, labels = createDataSet()
myTree = createTree(dataSet, labels)
print(myTree)
createPlot(myTree)
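With the dataset from createDataSet, running this should print a nested dictionary along the lines of {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}} (this is my own trace of the logic, not output copied from the book) and then draw the corresponding tree with matplotlib.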
Algorithm characteristics:
Pros: the data representation is especially easy to understand; computational complexity is not high; the output is easy to interpret; it is insensitive to missing intermediate values; it can handle irrelevant features.
Cons: it can overfit the training data and may produce a large number of nodes, making the classification unwieldy.
Applicable data types: numeric and nominal.
Essence of the algorithm:
In my understanding, the essence of this algorithm is using entropy to split the set so that the data can be separated according to some feature; it is simple and easy to understand.
The main task of machine learning is classification.
This article is my personal notes and excerpts, based on my understanding after studying Machine Learning in Action (《机器学习实战》, Posts & Telecom Press); it serves as my own notes and may also be a reference for others.