Machine Learning In Action - Chapter 10 k-means clustering

xiaoxiao2021-02-28  111

Chapter10 - k-means clustering

k-均值算法流程:

创建k个点作为起始质心(经常是随机选择) 当任意一个点的簇分配结果发生改变时 对数据集中的每个数据点 对每个质心 计算质心与数据点之间的距离 将数据点分配到距其最近的簇 对每一个簇,计算簇中所有点的均值并将均值作为质心

python实现

def randCent(dataSet, k): n = shape(dataSet)[1] # n features centroids = mat(zeros((k,n))) for j in range(n): minJ = min(dataSet[:,j]) rangeJ = float(max(dataSet[:,j]) - minJ) centroids[:,j] = minJ + rangeJ * random.rand(k,1) return centroids def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent): m = shape(dataSet)[0] clusterAssment = mat(zeros((m,2))) # 属于哪个族,距离的平方 centroids = createCent(dataSet, k) # k个族,k行n列 clusterChanged = True while clusterChanged: clusterChanged = False # 对每个实例找到该属于哪个族 for i in range(m): minDist = inf; minIndex = -1 for j in range(k): # 两个行向量的距离 distJI = distMeas(centroids[j,:],dataSet[i,:]) if distJI < minDist: minDist = distJI; minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 print centroids for cent in range(k): ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]] centroids[cent,:] = mean(ptsInClust, axis=0) # 对每个特征列计算均值 return centroids, clusterAssment

以上的算法可能会收敛于局部最小值,有一种改进的算法是二分K-均值算法

将所有点看成一个襄 当簇数目小于k时 对于每一个簇 计算总误差 在给定的簇上面进行K-均值聚类(k=2) 计算将该簇一分为二之后的总误差 选择使得误差最小的那个族进行划分操作

python实现

def biKmeans(dataSet, k, distMeas=distEclud): m = shape(dataSet)[0] clusterAssment = mat(zeros((m,2))) centroid0 = mean(dataSet, axis=0).tolist()[0] centList =[centroid0] # 质心列表 for j in range(m): clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2 while (len(centList) < k): lowestSSE = inf for i in range(len(centList)): ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:] centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2 , distMeas) # 当前族划分为两个族后,当前族中数据的误差 sseSplit = sum(splitClustAss[:,1]) # 其他族的误差 sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1]) print "sseSplit, and notSplit: ",sseSplit,sseNotSplit if (sseSplit + sseNotSplit) < lowestSSE: bestCentToSplit = i bestNewCents = centroidMat # 两个质心 bestClustAss = splitClustAss.copy() lowestSSE = sseSplit + sseNotSplit # 两个质心中的后一个的index是list的长度 bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) # 前一个是被分裂的那个index bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit print 'the bestCentToSplit is: ',bestCentToSplit print 'the len of bestClustAss is: ', len(bestClustAss) # 原来的质心改成2个中的第一个 centList[bestCentToSplit] = bestNewCents[0,:] # 在列表最后加上2个中的第二个 centList.append(bestNewCents[1,:]) # 将被分裂的那个族(属于哪个族,距离的平方)重新赋值,一一对应,没有问题 clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss for i in range(k): centList[i] = centList[i].tolist()[0] return mat(centList), clusterAssment
转载请注明原文地址: https://www.6miu.com/read-69223.html

最新回复(0)