sklearn Study Notes (2): Cross-Validation

xiaoxiao2021-02-27

Several CV strategy generators. The `cv` parameter of `cross_val_score` accepts a CV strategy generator, which is how different CV algorithms are plugged in. Besides KFold and StratifiedKFold, which split the raw data as just mentioned, sklearn provides many other splitting strategies; a few of its CV strategy generators are introduced below.

K-fold

The most basic CV algorithm, and the default strategy. Its main parameter is `n_splits`, the number of folds the data is split into.

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)
print(kf)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

Here `kf.split(X)` yields the train and test index arrays for each split of X. With sample indices 0, 1, 2, 3: in the first split, indices 0 and 1 form the test set and the remaining indices 2 and 3 form the train set; in the second split, indices 2 and 3 form the test set and indices 0 and 1 form the train set.

Stratified k-fold

Similar to k-fold, the data set is split into k folds; the difference is that each fold preserves the class proportions of the original data set.

```python
class sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=False, random_state=None)
```

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)  # returns the number of folds, here 2
print(skf)
```

Output: StratifiedKFold(n_splits=2, random_state=None, shuffle=False)

```python
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

Output: TRAIN: [1 3] TEST: [0 2]

TRAIN: [0 2] TEST: [1 3]
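To make the stratification concrete, one can compare the test-fold label composition under KFold and StratifiedKFold; the deliberately imbalanced toy labels below are an illustrative assumption, not from the original post:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)  # imbalanced: 2/3 class 0, 1/3 class 1

for name, cv in [("KFold", KFold(n_splits=4)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=4))]:
    print(name)
    for _, test in cv.split(X, y):
        # StratifiedKFold keeps the 2:1 class ratio in every test fold;
        # plain KFold on unshuffled data does not
        print("  test labels:", y[test])
```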

Leave-one-out

Each sample in turn serves by itself as the validation set, with the remaining N-1 samples as the training set, so LOO-CV fits N models; the average of the N validation accuracies is the performance measure of the LOO-CV classifier. In `sklearn.model_selection`, `LeaveOneOut` takes no parameters: the number of samples is inferred from the data passed to `split`.
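The per-sample averaging described above can be sketched end-to-end with `cross_val_score`; the iris data set and the LogisticRegression estimator below are illustrative assumptions, not part of the original note:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 samples -> 150 single-sample test sets
# one fitted model per held-out sample; each score is 0 or 1
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```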

```python
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
```

Output: [1 2 3] [0]

[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

Leave-P-out

Each split removes p samples from the whole set to serve as the test set; with n samples in total, this generates C(n, p) train/test pairs. Unlike LOO and KFold, the test sets produced by this strategy overlap.

```python
from sklearn.model_selection import LeavePOut
import numpy as np

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))
```

Output: [2 3] [0 1]

[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

Leave-one-label-out

This strategy splits the samples according to externally supplied integer labels (groups). Each split takes the samples belonging to one label as the test set and the rest as the training set. In `sklearn.model_selection` this strategy is named `LeaveOneGroupOut`, and the group labels are passed to `split`.

```python
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np

X = np.ones(4)
groups = [1, 1, 2, 2]
logo = LeaveOneGroupOut()
for train, test in logo.split(X, groups=groups):
    print("%s %s" % (train, test))
```

Output: [2 3] [0 1]

[0 1] [2 3]

Leave-P-Label-Out

Similar to Leave-One-Label-Out, but each split takes the samples of p labels as the test set and the rest as the training set. In `sklearn.model_selection` this strategy is named `LeavePGroupsOut`.

```python
from sklearn.model_selection import LeavePGroupsOut
import numpy as np

X = np.ones(6)
groups = [1, 1, 2, 2, 3, 3]
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X, groups=groups):
    print("%s %s" % (train, test))
```

Output: [4 5] [0 1 2 3]

[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
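As mentioned at the start, any of the generators above can be handed to `cross_val_score` through its `cv` parameter. A minimal sketch, assuming the iris data set and a LogisticRegression estimator purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
# any CV strategy generator works here; KFold is used as an example
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())
```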

Please cite the original URL when reposting: https://www.6miu.com/read-16469.html
