R Lab: Decision Tree Classification

xiaoxiao, 2025-07-06

 

1. Data preprocessing

Data cleaning

Missing-value handling: deletion (rows with missing values are dropped).

setwd("G:/!!aaclassnew/R/20181025")
data=read.csv(file = "bank-data.csv",header = TRUE)
View(data)
n=sum(is.na(data))  # number of missing values
print(n)
sub=which(is.na(data$income))  # rows with a missing income (456-461)
print(sub)
new_data=data[-sub,]  # drop the rows with missing values
print(new_data)
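The deletion step above can be tried on a small made-up data frame. This is only a sketch: `toy` and its values are invented for illustration and are not taken from bank-data.csv. It also shows one edge case worth knowing: indexing with `-which(...)` returns zero rows when nothing is missing, while logical indexing does not have that problem.

```r
# Toy data frame standing in for bank-data.csv (values are made up).
toy <- data.frame(
  id     = c("ID1", "ID2", "ID3", "ID4"),
  age    = c(48, 40, 51, 23),
  income = c(17546.0, NA, 16575.4, NA)
)

n_missing    <- sum(is.na(toy))          # total number of missing cells
missing_rows <- which(is.na(toy$income)) # row indices with a missing income

# Logical indexing keeps every complete row, even if no row is missing.
clean <- toy[!is.na(toy$income), ]

# By contrast, negative indexing with an empty index drops everything.
empty_case <- toy[-integer(0), ]

print(n_missing)
print(missing_rows)
print(nrow(clean))
print(nrow(empty_case))
```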

Data integration

Remove unused attributes: drop the "ID" attribute.

new_data=new_data[,-1]  # drop the first column (the ID); [,-2] would drop the second column, [-2,] the second row
print(new_data)
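Dropping a column by position depends on the column order in the file. A minimal sketch of dropping it by name instead, on a made-up frame; the column name `id` is an assumption here, since the text only calls the attribute "ID".

```r
# Toy frame; the column name "id" is assumed for illustration.
toy <- data.frame(
  id  = c("ID1", "ID2"),
  age = c(48, 40),
  sex = c("FEMALE", "MALE")
)

# Keep every column except the one named "id".
no_id <- toy[, setdiff(names(toy), "id"), drop = FALSE]

print(names(no_id))
```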

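Data transformation

Discretization: convert the "children" attribute into the two categorical values "YES" and "NO"; discretize the income attribute at the cut points 12640.3, 17390.1, 29622, and 43228.2.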

# 0 children becomes "NO", anything else "YES" (vectorised, no hard-coded row count)
new_data$children=ifelse(new_data$children==0,"NO","YES")
print(new_data)

# Map income to bins 1-5 using the cut points above
for(i in seq_len(nrow(new_data))){
  if(new_data$income[i]<12640.3){
    new_data$income[i]=1
  }else if(new_data$income[i]<17390.1){
    new_data$income[i]=2
  }else if(new_data$income[i]<29622){
    new_data$income[i]=3
  }else if(new_data$income[i]<=43228.2){
    new_data$income[i]=4
  }else{
    new_data$income[i]=5
  }
}
print(new_data)
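The income binning can also be written without a loop. The sketch below uses base R's `findInterval()` with the cut points from the text; the sample incomes are made up. One caveat: `findInterval()` treats every interval as closed on the left, so an income of exactly 43228.2 would land in bin 5 here, whereas the loop above puts it in bin 4.

```r
# Cut points from the text; the sample incomes are invented.
breaks <- c(12640.3, 17390.1, 29622, 43228.2)
income <- c(10000, 15000, 20000, 40000, 50000)

# findInterval() counts how many breaks are <= each value,
# so adding 1 yields bin labels 1..5.
income_bin <- findInterval(income, breaks) + 1

print(income_bin)
```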

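2. Decision tree: of the 600 records in "bank-data.csv", use the first 500 as the training set and save them to a file; use the last 100 as the test set. (Once the six rows with a missing income are dropped, 594 rows remain, so the test set in the code below actually holds 94 records.)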

practice_data=new_data[c(1:500),]  # first 500 rows: training set
print(practice_data)
write.csv(practice_data,file = "practice_datafile.csv",row.names = FALSE)
test_data=new_data[501:nrow(new_data),]  # remaining rows (501-594): test set
print(test_data)
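A positional split like the one above can be written so that the test rows follow from `nrow()`. A small sketch on a made-up 10-row frame; `toy` and `n_train` are illustrative names.

```r
# Ten made-up rows; the first n_train go to training, the rest to testing.
toy <- data.frame(x = 1:10, y = letters[1:10])
n_train <- 7

train <- toy[seq_len(n_train), ]
test  <- toy[seq(n_train + 1, nrow(toy)), ]

print(nrow(train))
print(nrow(test))
```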

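Implement the node selection at each step in code, then draw the resulting decision tree model with software (Visio, Photoshop, or a photo of a hand drawing are all acceptable; Visio is recommended).

# R >= 4.0 no longer converts strings to factors by default, so request it explicitly
data2=read.csv(file = "practice_datafile.csv",header = TRUE,stringsAsFactors = TRUE)
View(data2)
library(rpart)
# method = "class" builds a classification tree; split = "information" picks splits by information gain
fit=rpart(income ~ age+sex+region+married+children+car+save_act+current_act+mortgage+pep,
          method = "class", data=data2,
          control = rpart.control(minsplit = 1),
          parms = list(split="information"))
print(fit)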
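As a quick cross-check before drawing the tree by hand, rpart's own `plot()` and `text()` methods can render a fitted tree. The sketch below uses the built-in `iris` data as a stand-in, since bank-data.csv is not available here; `fit_demo` is an illustrative name, and this is not a replacement for the drawing the exercise asks for.

```r
library(rpart)

# Built-in iris data stands in for the bank training set.
fit_demo <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "information"))

# Draw the tree skeleton, then label the splits and leaves.
plot(fit_demo, margin = 0.1)
text(fit_demo, use.n = TRUE)
```

With the experiment's own model, the same two calls on `fit` should give a rough picture to compare against the hand-drawn version.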

 

> summary(fit)
Call:
rpart(formula = income ~ age + sex + region + married + children +
    car + save_act + current_act + mortgage + pep, data = data2,
    method = "class", parms = list(split = "information"),
    control = rpart.control(minsplit = 1))
  n= 500

          CP nsplit rel error    xerror       xstd
1 0.04394427      0 1.0000000 1.0000000 0.03486308
2 0.02572347      3 0.8681672 0.8971061 0.03570696
3 0.02411576      4 0.8424437 0.8810289 0.03578361
4 0.01768489      6 0.7942122 0.8745981 0.03581018
5 0.01286174      8 0.7588424 0.9003215 0.03568987
6 0.01125402     12 0.7073955 0.9035370 0.03567219
7 0.01000000     14 0.6848875 0.9067524 0.03565393

Variable importance
     age save_act children      pep   region  married      car      sex mortgage
      71        9        7        5        3        2        1        1        1

Node number 1: 500 observations,    complexity param=0.04394427
  predicted class=3  expected loss=0.622  P(node) =1
    class counts:    45    77   189   119    70
   probabilities: 0.090 0.154 0.378 0.238 0.140
  left son=2 (194 obs) right son=3 (306 obs)
  Primary splits:
      age      < 37.5 to the left,  improve=121.585200, (0 missing)
      save_act splits as  LR,       improve= 32.714180, (0 missing)
      pep      splits as  LR,       improve= 16.722620, (0 missing)
      car      splits as  LR,       improve=  7.492084, (0 missing)
      region   splits as  LRLL,     improve=  4.601014, (0 missing)
  Surrogate splits:
      save_act splits as  LR, agree=0.622, adj=0.026, (0 split)

Node number 2: 194 observations,    complexity param=0.02411576
  predicted class=3  expected loss=0.5721649  P(node) =0.388
    class counts:    45    55    83    11     0
   probabilities: 0.232 0.284 0.428 0.057 0.000
  left son=4 (127 obs) right son=5 (67 obs)
  Primary splits:
      age         < 30.5 to the left,  improve=25.331970, (0 missing)
      pep         splits as  LR,       improve= 6.907268, (0 missing)
      car         splits as  LR,       improve= 4.373355, (0 missing)
      region      splits as  LRLR,     improve= 3.342221, (0 missing)
      current_act splits as  LR,       improve= 1.781451, (0 missing)
  Surrogate splits:
      region splits as  LLLR, agree=0.66, adj=0.015, (0 split)

Node number 3: 306 observations,    complexity param=0.04394427
  predicted class=4  expected loss=0.6470588  P(node) =0.612
    class counts:     0    22   106   108    70
   probabilities: 0.000 0.072 0.346 0.353 0.229
  left son=6 (139 obs) right son=7 (167 obs)
  Primary splits:
      age      < 50.5 to the left,  improve=34.858180, (0 missing)
      save_act splits as  LR,       improve=24.427330, (0 missing)
      pep      splits as  LR,       improve= 8.277378, (0 missing)
      region   splits as  LRLL,     improve= 6.204070, (0 missing)
      car      splits as  LR,       improve= 2.411510, (0 missing)
  Surrogate splits:
      region   splits as  RRLL, agree=0.585, adj=0.086, (0 split)
      sex      splits as  RL,   agree=0.578, adj=0.072, (0 split)
      save_act splits as  LR,   agree=0.562, adj=0.036, (0 split)
      pep      splits as  LR,   agree=0.559, adj=0.029, (0 split)
      mortgage splits as  RL,   agree=0.549, adj=0.007, (0 split)

Node number 4: 127 observations,    complexity param=0.02411576
  predicted class=2  expected loss=0.6377953  P(node) =0.254
    class counts:    39    46    42     0     0
   probabilities: 0.307 0.362 0.331 0.000 0.000
  left son=8 (80 obs) right son=9 (47 obs)
  Primary splits:
      age         < 25.5 to the left,  improve=7.949958, (0 missing)
      pep         splits as  LR,       improve=4.822180, (0 missing)
      car         splits as  RL,       improve=2.153993, (0 missing)
      region      splits as  RLLL,     improve=1.223215, (0 missing)
      current_act splits as  LR,       improve=1.106185, (0 missing)

Node number 5: 67 observations
  predicted class=3  expected loss=0.3880597  P(node) =0.134
    class counts:     6     9    41    11     0
   probabilities: 0.090 0.134 0.612 0.164 0.000

Node number 6: 139 observations,    complexity param=0.02572347
  predicted class=3  expected loss=0.5539568  P(node) =0.278
    class counts:     0    19    62    52     6
   probabilities: 0.000 0.137 0.446 0.374 0.043
  left son=12 (54 obs) right son=13 (85 obs)
  Primary splits:
      age      < 42.5 to the left,  improve=6.925978, (0 missing)
      region   splits as  LRLL,     improve=5.217350, (0 missing)
      mortgage splits as  LR,       improve=5.208289, (0 missing)
      save_act splits as  LR,       improve=5.032890, (0 missing)
      pep      splits as  LR,       improve=3.223694, (0 missing)
  Surrogate splits:
      save_act splits as  LR, agree=0.662, adj=0.13, (0 split)

Node number 7: 167 observations,    complexity param=0.04394427
  predicted class=5  expected loss=0.6167665  P(node) =0.334
    class counts:     0     3    44    56    64
   probabilities: 0.000 0.018 0.263 0.335 0.383
  left son=14 (36 obs) right son=15 (131 obs)
  Primary splits:
      save_act    splits as  LR,       improve=20.568730, (0 missing)
      age         < 62.5 to the left,  improve=10.528110, (0 missing)
      pep         splits as  LR,       improve= 6.160460, (0 missing)
      children    splits as  LR,       improve= 3.415047, (0 missing)
      current_act splits as  LR,       improve= 2.966695, (0 missing)

Node number 8: 80 observations,    complexity param=0.01286174
  predicted class=1  expected loss=0.6  P(node) =0.16
    class counts:    32    31    17     0     0
   probabilities: 0.400 0.388 0.213 0.000 0.000
  left son=16 (53 obs) right son=17 (27 obs)
  Primary splits:
      pep         splits as  LR,       improve=4.625912, (0 missing)
      region      splits as  RLLL,     improve=3.142163, (0 missing)
      married     splits as  LR,       improve=2.899320, (0 missing)
      age         < 19.5 to the left,  improve=2.470894, (0 missing)
      current_act splits as  LR,       improve=1.816017, (0 missing)

Node number 9: 47 observations
  predicted class=3  expected loss=0.4680851  P(node) =0.094
    class counts:     7    15    25     0     0
   probabilities: 0.149 0.319 0.532 0.000 0.000

Node number 12: 54 observations
  predicted class=3  expected loss=0.3888889  P(node) =0.108
    class counts:     0     6    33    15     0
   probabilities: 0.000 0.111 0.611 0.278 0.000

Node number 13: 85 observations
  predicted class=4  expected loss=0.5647059  P(node) =0.17
    class counts:     0    13    29    37     6
   probabilities: 0.000 0.153 0.341 0.435 0.071

Node number 14: 36 observations,    complexity param=0.01286174
  predicted class=4  expected loss=0.4166667  P(node) =0.072
    class counts:     0     1    14    21     0
   probabilities: 0.000 0.028 0.389 0.583 0.000
  left son=28 (17 obs) right son=29 (19 obs)
  Primary splits:
      car      splits as  LR,       improve=3.958279, (0 missing)
      children splits as  LR,       improve=3.042364, (0 missing)
      age      < 52.5 to the left,  improve=2.290985, (0 missing)
      sex      splits as  RL,       improve=1.218503, (0 missing)
      pep      splits as  LR,       improve=1.171938, (0 missing)
  Surrogate splits:
      married     splits as  RL,       agree=0.639, adj=0.235, (0 split)
      age         < 56.5 to the left,  agree=0.611, adj=0.176, (0 split)
      sex         splits as  RL,       agree=0.611, adj=0.176, (0 split)
      current_act splits as  RL,       agree=0.611, adj=0.176, (0 split)
      region      splits as  RLRL,     agree=0.583, adj=0.118, (0 split)

Node number 15: 131 observations,    complexity param=0.01768489
  predicted class=5  expected loss=0.5114504  P(node) =0.262
    class counts:     0     2    30    35    64
   probabilities: 0.000 0.015 0.229 0.267 0.489
  left son=30 (62 obs) right son=31 (69 obs)
  Primary splits:
      pep         splits as  LR,       improve=8.469909, (0 missing)
      age         < 61.5 to the left,  improve=8.012590, (0 missing)
      current_act splits as  LR,       improve=3.684888, (0 missing)
      region      splits as  LLRL,     improve=2.561434, (0 missing)
      mortgage    splits as  RL,       improve=2.195836, (0 missing)
  Surrogate splits:
      children splits as  LR,       agree=0.733, adj=0.435, (0 split)
      mortgage splits as  RL,       agree=0.618, adj=0.194, (0 split)
      married  splits as  RL,       agree=0.580, adj=0.113, (0 split)
      age      < 61.5 to the left,  agree=0.557, adj=0.065, (0 split)
      region   splits as  LRRR,     agree=0.550, adj=0.048, (0 split)

Node number 16: 53 observations,    complexity param=0.01125402
  predicted class=1  expected loss=0.5283019  P(node) =0.106
    class counts:    25    22     6     0     0
   probabilities: 0.472 0.415 0.113 0.000 0.000
  left son=32 (19 obs) right son=33 (34 obs)
  Primary splits:
      children    splits as  LR,       improve=3.6638990, (0 missing)
      current_act splits as  LR,       improve=2.6320070, (0 missing)
      age         < 21.5 to the left,  improve=2.5513420, (0 missing)
      region      splits as  RRLL,     improve=2.1774120, (0 missing)
      mortgage    splits as  RL,       improve=0.9724508, (0 missing)
  Surrogate splits:
      age < 24.5 to the right, agree=0.66, adj=0.053, (0 split)

Node number 17: 27 observations,    complexity param=0.01286174
  predicted class=3  expected loss=0.5925926  P(node) =0.054
    class counts:     7     9    11     0     0
   probabilities: 0.259 0.333 0.407 0.000 0.000
  left son=34 (10 obs) right son=35 (17 obs)
  Primary splits:
      married     splits as  LR,       improve=4.075579, (0 missing)
      region      splits as  RLRR,     improve=3.389759, (0 missing)
      current_act splits as  RL,       improve=2.009758, (0 missing)
      mortgage    splits as  LR,       improve=1.968611, (0 missing)
      age         < 24.5 to the left,  improve=1.913873, (0 missing)
  Surrogate splits:
      region   splits as  RRRL, agree=0.667, adj=0.1, (0 split)
      children splits as  LR,   agree=0.667, adj=0.1, (0 split)

Node number 28: 17 observations
  predicted class=3  expected loss=0.4117647  P(node) =0.034
    class counts:     0     1    10     6     0
   probabilities: 0.000 0.059 0.588 0.353 0.000

Node number 29: 19 observations
  predicted class=4  expected loss=0.2105263  P(node) =0.038
    class counts:     0     0     4    15     0
   probabilities: 0.000 0.000 0.211 0.789 0.000

Node number 30: 62 observations,    complexity param=0.01768489
  predicted class=3  expected loss=0.6451613  P(node) =0.124
    class counts:     0     2    22    17    21
   probabilities: 0.000 0.032 0.355 0.274 0.339
  left son=60 (21 obs) right son=61 (41 obs)
  Primary splits:
      children    splits as  RL,     improve=11.301300, (0 missing)
      age         < 53 to the left,  improve= 4.503433, (0 missing)
      current_act splits as  LR,     improve= 4.345984, (0 missing)
      region      splits as  RLRL,   improve= 2.877549, (0 missing)
      car         splits as  RL,     improve= 2.330261, (0 missing)

Node number 31: 69 observations
  predicted class=5  expected loss=0.3768116  P(node) =0.138
    class counts:     0     0     8    18    43
   probabilities: 0.000 0.000 0.116 0.261 0.623

Node number 32: 19 observations,    complexity param=0.01125402
  predicted class=2  expected loss=0.4210526  P(node) =0.038
    class counts:     8    11     0     0     0
   probabilities: 0.421 0.579 0.000 0.000 0.000
  left son=64 (4 obs) right son=65 (15 obs)
  Primary splits:
      region      splits as  RRRL,     improve=4.2332330, (0 missing)
      age         < 21.5 to the left,  improve=1.9960510, (0 missing)
      sex         splits as  RL,       improve=0.8541775, (0 missing)
      current_act splits as  LR,       improve=0.4374060, (0 missing)
      car         splits as  LR,       improve=0.1404615, (0 missing)

Node number 33: 34 observations
  predicted class=1  expected loss=0.5  P(node) =0.068
    class counts:    17    11     6     0     0
   probabilities: 0.500 0.324 0.176 0.000 0.000

Node number 34: 10 observations
  predicted class=1  expected loss=0.5  P(node) =0.02
    class counts:     5     4     1     0     0
   probabilities: 0.500 0.400 0.100 0.000 0.000

Node number 35: 17 observations
  predicted class=3  expected loss=0.4117647  P(node) =0.034
    class counts:     2     5    10     0     0
   probabilities: 0.118 0.294 0.588 0.000 0.000

Node number 60: 21 observations,    complexity param=0.01286174
  predicted class=3  expected loss=0.4761905  P(node) =0.042
    class counts:     0     1    11     9     0
   probabilities: 0.000 0.048 0.524 0.429 0.000
  left son=120 (17 obs) right son=121 (4 obs)
  Primary splits:
      age      < 62 to the left,  improve=4.042513, (0 missing)
      region   splits as  RRLR,   improve=4.020326, (0 missing)
      sex      splits as  LR,     improve=1.593345, (0 missing)
      married  splits as  LR,     improve=1.350827, (0 missing)
      mortgage splits as  RL,     improve=1.033341, (0 missing)

Node number 61: 41 observations
  predicted class=5  expected loss=0.4878049  P(node) =0.082
    class counts:     0     1    11     8    21
   probabilities: 0.000 0.024 0.268 0.195 0.512

Node number 64: 4 observations
  predicted class=1  expected loss=0  P(node) =0.008
    class counts:     4     0     0     0     0
   probabilities: 1.000 0.000 0.000 0.000 0.000

Node number 65: 15 observations
  predicted class=2  expected loss=0.2666667  P(node) =0.03
    class counts:     4    11     0     0     0
   probabilities: 0.267 0.733 0.000 0.000 0.000

Node number 120: 17 observations
  predicted class=3  expected loss=0.3529412  P(node) =0.034
    class counts:     0     1    11     5     0
   probabilities: 0.000 0.059 0.647 0.294 0.000

Node number 121: 4 observations
  predicted class=4  expected loss=0  P(node) =0.008
    class counts:     0     0     0     4     0
   probabilities: 0.000 0.000 0.000 1.000 0.000
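The root split in the output above can be checked by hand from the class counts of nodes 1, 2 and 3. The sketch below assumes that, for `split = "information"`, the reported `improve` is the node size times the entropy reduction measured in nats; that reading is an inference from the numbers, not something the output states. Under that assumption the result should land close to the `improve=121.585200` reported for `age < 37.5`. The helper `entropy` is an illustrative name.

```r
# Class counts copied from nodes 1, 2 and 3 of the summary output.
parent <- c(45, 77, 189, 119, 70)
left   <- c(45, 55, 83, 11, 0)
right  <- c(0, 22, 106, 108, 70)

# Shannon entropy in nats; empty classes contribute zero.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Information gain: parent entropy minus the size-weighted child entropies.
n <- sum(parent)
gain <- entropy(parent) -
  (sum(left) / n) * entropy(left) -
  (sum(right) / n) * entropy(right)

# Assumed relation to rpart's "improve": node size times the gain.
improve <- n * gain
print(round(improve, 2))
```

Repeating the calculation for the other candidate splits at a node is what "selecting the node at each step" amounts to: the split with the largest gain is the one the tree takes.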

 

 

Then use this model to classify the test data programmatically.
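The original post stops before showing this step. A minimal sketch of how classification with a fitted rpart model usually looks, using the built-in `iris` data as a stand-in because bank-data.csv is not available here; `shuffled`, `fit_demo`, `pred`, and `confusion` are illustrative names, and the 120/30 split is arbitrary.

```r
library(rpart)

# iris stands in for the bank data: the first part trains, the rest tests.
set.seed(1)
shuffled <- iris[sample(nrow(iris)), ]
train <- shuffled[1:120, ]
test  <- shuffled[121:150, ]

fit_demo <- rpart(Species ~ ., data = train, method = "class",
                  parms = list(split = "information"))

# type = "class" returns the predicted label for each test row.
pred <- predict(fit_demo, newdata = test, type = "class")

# Confusion matrix (actual vs. predicted) and overall accuracy.
confusion <- table(actual = test$Species, predicted = pred)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

For the experiment itself, the analogous call would presumably be `predict(fit, newdata = test_data, type = "class")`, with the result compared against `test_data$income`. The test columns need the same types and levels as the training columns for that to work.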

 

 

 

 

 

When reposting, please credit the original URL: https://www.6miu.com/read-5032625.html
