A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.
The data set is Churn. The fields are as follows:
state: discrete.
account length: continuous.
area code: continuous.
phone number: discrete.
international plan: discrete.
voice mail plan: discrete.
number vmail messages: continuous.
total day minutes: continuous.
total day calls: continuous.
total day charge: continuous.
total eve minutes: continuous.
total eve calls: continuous.
total eve charge: continuous.
total night minutes: continuous.
total night calls: continuous.
total night charge: continuous.
total intl minutes: continuous.
total intl calls: continuous.
total intl charge: continuous.
number customer service calls: continuous.
churn: discrete.
Data overview:
##      state      account.length    area.code       phone.number
##  WV     : 158   Min.   :  1.0   Min.   :408.0   327-1058:   1
##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0   327-1319:   1
##  AL     : 124   Median :100.0   Median :415.0   327-2040:   1
##  ID     : 119   Mean   :100.3   Mean   :436.9   327-2475:   1
##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0   327-3053:   1
##  OH     : 116   Max.   :243.0   Max.   :510.0   327-3587:   1
##  (Other):4240                                   (Other) :4994
##  international.plan voice.mail.plan number.vmail.messages
##  no :4527           no :3677        Min.   : 0.000
##  yes: 473           yes:1323        1st Qu.: 0.000
##                                     Median : 0.000
##                                     Mean   : 7.755
##                                     3rd Qu.:17.000
##                                     Max.   :52.000
##
##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4
##  Median :180.1     Median :100     Median :30.62    Median :201.0
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7
##
##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00
##
##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400
##
##  number.customer.service.calls    churn
##  Min.   :0.00                  False.:4293
##  1st Qu.:1.00                  True. : 707
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00
##
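The data overview shows that there are no missing values. It also shows that phone number and area code carry no useful information for prediction, so both can be dropped.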
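The check and the column removal described above can be sketched in base R as follows. This is a minimal illustration, not the author's code: the data frame `churn_df` is a three-row toy stand-in whose values are invented, and only the column names follow the summary output.

```r
# Toy stand-in for the Churn data; the rows are invented for illustration.
churn_df <- data.frame(
  state = c("WV", "MN", "AL"),
  area.code = c(408, 415, 510),
  phone.number = c("327-1058", "327-1319", "327-2040"),
  total.day.minutes = c(180.1, 143.7, 216.2),
  churn = c("False.", "False.", "True."),
  stringsAsFactors = FALSE
)

# Count missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(churn_df))
print(na_counts)

# Drop the identifier-like columns that carry no predictive information.
drop_cols <- c("phone.number", "area.code")
churn_clean <- churn_df[, !(names(churn_df) %in% drop_cols)]
print(names(churn_clean))
```

On the real data the same two steps would apply unchanged, provided the data frame uses the column names shown in the overview.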
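From the results above, the number of samples with churn = False (4293) is far larger than the number with churn = True (707), so non-churners make up the large majority of the sample (about 86%).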
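The class shares follow directly from the two counts in the summary output. A small sketch, using only those two counts (the variable names are illustrative):

```r
# Class counts copied from the churn column of the summary output.
churn_counts <- c("False." = 4293, "True." = 707)

# Convert counts to proportions of the 5000 customers.
churn_share <- prop.table(churn_counts)
print(round(churn_share, 3))  # False. 0.859, True. 0.141

# Share of the larger class: the accuracy of always predicting "no churn".
majority_baseline <- max(churn_share)
print(round(majority_baseline, 3))
```

The majority share is a useful reference point when judging the accuracy of any classifier fitted to data this imbalanced.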
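From the results above, apart from emailcode (presumably number.vmail.messages) and area.code, the other numeric variables are approximately normally distributed.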
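One rough way to see this from the summary statistics alone is to compare the mean with the median, scaled by the interquartile range: for a roughly symmetric variable the two nearly coincide. The sketch below uses quartiles copied from the summary output for four of the variables. The 0.1 threshold is an arbitrary assumption for illustration, and the check is only a heuristic for symmetry, not a normality test.

```r
# Quartiles, medians and means copied from the summary output.
stats <- data.frame(
  variable = c("account.length", "area.code",
               "number.vmail.messages", "total.day.minutes"),
  q1     = c(73.0, 408.0, 0.000, 143.7),
  median = c(100.0, 415.0, 0.000, 180.1),
  mean   = c(100.3, 436.9, 7.755, 180.3),
  q3     = c(127.0, 415.0, 17.000, 216.2),
  stringsAsFactors = FALSE
)

# Distance between mean and median, in units of the interquartile range.
stats$gap <- abs(stats$mean - stats$median) / (stats$q3 - stats$q1)

# Flag variables whose mean sits far from the median (threshold is arbitrary).
flagged <- stats$variable[stats$gap > 0.1]
print(stats[, c("variable", "gap")])
print(flagged)
```

With these inputs the flagged variables are area.code and number.vmail.messages, which matches the two exceptions noted in the text.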
##  account.length    area.code     number.vmail.messages total.day.minutes
##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0
##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7
##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1
##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3
##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2
##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5
##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0
##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0
##  Median :100     Median :30.62    Median :201.0     Median :100.0
##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2
##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0
##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0
##  total.eve.charge total.night.minutes total.night.calls total.night.charge
##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000
##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510
##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020
##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018
##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560
##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770
##  total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median :10.30      Median : 4.000   Median :2.780
##  Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :20.00      Max.   :20.000   Max.   :5.400
##  number.customer.service.calls
##  Min.   :0.00
##  1st Qu.:1.00
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00
The results show a significant positive linear relationship between the two variables.
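A positive linear relationship of this kind can be checked with a Pearson correlation test. The text does not name the pair of variables, so the sketch below uses simulated stand-ins called `minutes` and `charge`; the 0.17 rate, the noise level and the sample size are all assumptions made for illustration, not values taken from the author's analysis.

```r
set.seed(1)

# Simulated stand-ins; not the real Churn records.
minutes <- rnorm(200, mean = 180, sd = 54)
charge  <- 0.17 * minutes + rnorm(200, sd = 0.5)  # assumed linear link plus noise

# Pearson correlation with a test of the null hypothesis of zero correlation.
ct <- cor.test(minutes, charge)
print(ct$estimate)  # sample correlation coefficient
print(ct$p.value)   # small p-value indicates a significant linear association
```

On the real data, the two simulated vectors would be replaced by the two columns being compared.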
Construct a distribution of the variable with a churn overlay.
Construct a histogram of the variable with a churn overlay.
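A histogram with a churn overlay can be sketched in base R by binning the variable and stacking the churn classes within each bin. The data below are simulated (the means, spread and churn rate are assumptions), and `minutes` merely stands in for whichever variable is being examined.

```r
set.seed(42)
n <- 500

# Simulated stand-in data; not the real Churn records.
churn   <- factor(sample(c("False.", "True."), n, replace = TRUE,
                         prob = c(0.86, 0.14)))
minutes <- pmax(rnorm(n, mean = ifelse(churn == "True.", 205, 175), sd = 50), 0)

# Bin the variable on "pretty" break points that cover its whole range.
bins <- cut(minutes, breaks = pretty(minutes, n = 12), include.lowest = TRUE)

# Rows are churn classes, columns are bins.
counts <- table(churn, bins)

# Stacked bars: bar height is the bin count, split by churn class.
barplot(counts, legend.text = TRUE, las = 2, cex.names = 0.7,
        ylab = "customers", main = "Histogram with churn overlay (simulated)")
```

Stacking keeps the overall shape of the distribution visible while showing how the churn share changes from bin to bin.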
Find a pair of numeric variables which are interesting with respect to churn.
The results show some correlation between total.day.calls and total.day.charge, particularly among the observations where churn is False.
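One way to examine a pair of numeric variables with respect to churn is to compute the correlation separately within each churn class and to colour a scatterplot by class. The sketch below does this on simulated data; the column names follow the text, but every value and the strength of the simulated relationship are assumptions.

```r
set.seed(7)
n <- 300

# Simulated stand-in data; not the real Churn records.
churn  <- factor(sample(c("False.", "True."), n, replace = TRUE,
                        prob = c(0.86, 0.14)))
calls  <- round(rnorm(n, mean = 100, sd = 20))
charge <- 30 + 0.05 * (calls - 100) + rnorm(n, sd = 9)
df <- data.frame(total.day.calls = calls, total.day.charge = charge,
                 churn = churn)

# Correlation of the pair within each churn class.
cor_by_churn <- sapply(split(df, df$churn), function(g) {
  cor(g$total.day.calls, g$total.day.charge)
})
print(round(cor_by_churn, 3))

# Scatterplot coloured by churn class.
plot(df$total.day.calls, df$total.day.charge, pch = 19,
     col = ifelse(df$churn == "True.", "red", "grey40"),
     xlab = "total.day.calls", ylab = "total.day.charge")
legend("topleft", legend = levels(df$churn), col = c("grey40", "red"), pch = 19)
```

Comparing the two per-class coefficients shows whether the relationship holds equally for churners and non-churners.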
##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
## stateAL                        0.0151188  0.0462343   0.327 0.743680
## stateAR                        0.0894792  0.0490897   1.823 0.068399 .
## stateAZ                        0.0329566  0.0494195   0.667 0.504883
## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402
## total.day.minutes              0.3796323  0.2629027   1.444 0.148802
## total.day.calls                0.0002191  0.0002235   0.981 0.326781
## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056
## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533
## total.eve.calls               -0.0001585  0.0002238  -0.708 0.478915
## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329
## total.night.minutes            0.0083224  0.0695916   0.120 0.904814
## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290
## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355
## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080
## total.intl.calls              -0.0063448  0.0018062  -3.513 0.000447 ***
## total.intl.charge              0.0676460  1.5528267   0.044 0.965254
## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 **
## total.day.minutes1short        0.2404020  0.0322293   7.459 1.02e-13 ***
The results show that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. In the table, international.plan and voice.mail.plan are also significant at the 0.001 level.
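The coefficient names total.day.minutes1medium and total.day.minutes1short suggest that total.day.minutes was binned into a factor named total.day.minutes1 with "long" as the reference level, and that churn was regressed on it with a linear model. The sketch below reproduces that structure on simulated data. The cut points (150 and 210), the simulated churn mechanism and the reduced set of predictors are all assumptions; the author's actual binning is not given in the text.

```r
set.seed(123)
n <- 400

# Simulated stand-in data; not the real Churn records.
df <- data.frame(
  total.day.minutes = pmax(rnorm(n, mean = 180, sd = 54), 0),
  number.customer.service.calls = rpois(n, lambda = 1.5)
)
p <- plogis(-3 + 0.01 * (df$total.day.minutes - 180) +
              0.5 * df$number.customer.service.calls)
df$churn_num <- rbinom(n, size = 1, prob = p)  # 1 = churn, 0 = no churn

# Bin minutes into three groups; the cut points are hypothetical.
bins <- cut(df$total.day.minutes, breaks = c(-Inf, 150, 210, Inf),
            labels = c("short", "medium", "long"))
# Re-create the factor so levels sort alphabetically and "long" is the baseline.
df$total.day.minutes1 <- factor(as.character(bins))

# Linear model on the 0/1 churn indicator (a linear probability model).
fit <- lm(churn_num ~ number.customer.service.calls + total.day.minutes1,
          data = df)
coefs <- coef(summary(fit))
print(round(coefs, 4))
```

Each factor coefficient is read relative to the "long" group, which is why only the medium and short levels appear in the coefficient table.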
On the test set, the accuracy reaches 86%.
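A train/test evaluation of a KNN classifier can be sketched with `knn()` from the `class` package, which ships with standard R installations. Everything specific here is an assumption: the data are simulated, and the two predictors, the 70/30 split and k = 5 are illustrative choices, since the text does not state the author's settings.

```r
library(class)  # provides knn()

set.seed(2024)
n <- 400

# Simulated stand-in data; not the real Churn records.
minutes <- rnorm(n, mean = 180, sd = 54)
calls   <- rpois(n, lambda = 1.5)
churn   <- factor(ifelse(minutes + 40 * calls + rnorm(n, sd = 30) > 290,
                         "True.", "False."))
X <- cbind(minutes, calls)

# 70/30 split into training and test rows.
train_idx <- sample(n, size = round(0.7 * n))

# Standardise with training-set statistics, then apply them to the test set.
train_X <- scale(X[train_idx, ])
test_X  <- scale(X[-train_idx, ],
                 center = attr(train_X, "scaled:center"),
                 scale  = attr(train_X, "scaled:scale"))

# Classify each test row by majority vote among its 5 nearest training rows.
pred <- knn(train = train_X, test = test_X, cl = churn[train_idx], k = 5)

# Confusion matrix and accuracy on the held-out rows.
actual   <- churn[-train_idx]
conf_mat <- table(predicted = pred, actual = actual)
accuracy <- mean(as.character(pred) == as.character(actual))
print(conf_mat)
print(accuracy)
```

Scaling matters for KNN because distances would otherwise be dominated by the variable with the largest numeric range.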
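In summary, there is some correlation between total.day.calls and total.day.charge, particularly among the observations where churn is False. The variables state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. Finally, the KNN model reaches an accuracy of about 80% on the training set and 86% on the test set, suggesting good predictive performance, although with about 86% non-churners in the data (4293 of 5000) the accuracy should be read against that baseline.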