包载入
library(vcd) library(car)逻辑回归是讲一件事的发生概率和一个预测变量的指数方程联系的模型,能针对验证性研究给出有明确有效的回应。
#此处引入的数据集是关于不同销售策略下,客户转化率的问题 amu_log <- read.csv("http://r-marketing.r-forge.r-project.org/data/rintro-chapter9.csv") names(amu_log) <- tolower(names(amu_log)) head(amu_log) #以非捆绑销售的为基准,对因子层级进行设定 amu_log$promo <- factor(amu_log$promo, levels = c("NoBundle","Bundle")) #可以看出,pass和promo两个经典逻辑回归响应变量和解释变量基础上,还有其他变量channel的存在。 with(amu_log,table(pass,promo))结果:
promo pass NoBundle Bundle NoPass 812 755 YesPass 670 919拟合模型:
amu_glm <- glm(pass ~ promo + channel,data = amu_log, family = binomial) summary(amu_glm)结果:
Call: glm(formula = pass ~ promo + channel, family = binomial, data = amu_log) Deviance Residuals: Min 1Q Median 3Q Max -1.9079 -0.9883 0.5946 0.7637 2.3272 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.07860 0.13167 -15.787 < 2e-16 *** promoBundle -0.56022 0.09031 -6.203 5.54e-10 *** channelMail 2.17617 0.14651 14.854 < 2e-16 *** channelPark 3.72176 0.15964 23.313 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 4375.0 on 3155 degrees of freedom Residual deviance: 3490.2 on 3152 degrees of freedom AIC: 3498.2 Number of Fisher Scoring iterations: 4 #计算相关系数发生比 exp(coef(amu_glm)) #计算系数的置信区间发生比 exp(confint(amu_glm))结果:
(Intercept) promoBundle channelMail channelPark 0.1251054 0.5710846 8.8125066 41.3371206 Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 0.09577568 0.1606189 promoBundle 0.47793969 0.6810148 channelMail 6.65770550 11.8328173 channelPark 30.42959274 56.9295369逻辑回归模型的系数解读,可使用其系数判断方向,使用exp(系数)判断大小。以本案为例,在公园内直接售票能比不这么做提高30~56倍的售出票的概率。
#对于他们之间的关系,除了table()函数外,可以通过更好可视化的马赛克图来展示。 library(vcd) doubledecker(table(amu_log))不同渠道绑定和非绑定的总体对比趋势是不同的,说明不同变量间存在交互效应。为此,我们在以上模型的基础上,加入promo和channel的交互效应,看是否有更好地解释力。
amu_glm2 <- glm(pass ~ promo + channel + promo:channel,data = amu_log, family = binomial) summary(amu_glm2) #coef()计算模型系数,exp()则获得发生比 exp(coef(amu_glm2)) #计算系数的置信区间发生比 exp(confint(amu_glm2))结果:
Call: glm(formula = pass ~ promo + channel + promo:channel, family = binomial, data = amu_log) Deviance Residuals: Min 1Q Median 3Q Max -1.9577 -0.9286 0.5642 0.7738 2.4259 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.8883 0.1977 -14.608 < 2e-16 *** promoBundle 2.1071 0.2783 7.571 3.71e-14 *** channelMail 3.1440 0.2133 14.743 < 2e-16 *** channelPark 4.6455 0.2510 18.504 < 2e-16 *** promoBundle:channelMail -2.9808 0.3003 -9.925 < 2e-16 *** promoBundle:channelPark -2.8115 0.3278 -8.577 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 4375.0 on 3155 degrees of freedom Residual deviance: 3393.5 on 3150 degrees of freedom AIC: 3405.5 Number of Fisher Scoring iterations: 5 (Intercept) promoBundle channelMail 0.05567010 8.22400712 23.19677586 channelPark promoBundle:channelMail promoBundle:channelPark 104.11186672 0.05074998 0.06011603 Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 0.03688720 0.08032263 promoBundle 4.78970184 14.31465957 channelMail 15.54800270 35.97860059 channelPark 64.74364028 173.57861021 promoBundle:channelMail 0.02795867 0.09102369 promoBundle:channelPark 0.03135437 0.11360965统计模型确认了channel和promo存在有显著性的交互效应,但promo的有效程度依然取决于不同的channel。
可以看出,本案例数据集的变量存在明显的偏态和成对高度相关。
尽管对于不同变量有针对性的变换方式,但这些方式失效,或者变量数量众多时,可以使用通用性的变换函数Box-Cox变换。
library(forecast) #进行Box-Cox变换,需要首先计算变量对应lambda值 BoxCox.lambda(x) #使用该lambda值进行变换 BoxCox(x,BoxCox.lambda(x)) #针对变换值进行标准化处理 scale(BoxCox(x,BoxCox.lambda(x))) dstrTrsf <- function(x){ library(forecast) return(scale(BoxCox(x,BoxCox.lambda(x)))) } #选出可以进行变换的数据集 spend_trsf <- spendOnline[spendOnline$online.spend >0,-1] numcol <- which(colnames(spend_trsf) != "email") #lapply进行逐列变换 spend_trsf[,numcol] <- lapply(spend_trsf[,numcol],dstrTrsf) gpairs(spend_trsf)对比变换前后的分布状态,可以看出,偏态现象明显减少。
共线性会产生什么后果呢?
spend_m <- lm(online.spend ~ . , data = spend_trsf) summary(spend_m)结果:
Call: lm(formula = online.spend ~ ., data = spend_trsf) Residuals: Min 1Q Median 3Q Max -0.39286 -0.05252 -0.00098 0.05428 0.26479 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.245e-03 1.080e-02 -0.300 0.764 age 9.942e-05 4.549e-03 0.022 0.983 credit.score -2.556e-03 4.498e-03 -0.568 0.570 emailyes 7.457e-03 1.182e-02 0.631 0.528 distance.to.store -2.051e-03 4.998e-03 -0.410 0.682 online.visits -1.002e-03 1.269e-02 -0.079 0.937 online.trans 9.918e-01 1.260e-02 78.712 <2e-16 *** store.trans -2.427e-02 4.580e-02 -0.530 0.596 store.spend 2.521e-02 4.547e-02 0.554 0.580 sat.service 5.756e-03 5.239e-03 1.099 0.273 sat.selection 3.152e-03 5.230e-03 0.603 0.547 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.0866 on 407 degrees of freedom (214 observations deleted due to missingness) Multiple R-squared: 0.9924, Adjusted R-squared: 0.9922 F-statistic: 5316 on 10 and 407 DF, p-value: < 2.2e-16虽然获得了很高的r方,但却只有一个变量和响应变量有显著性关系,即online.trans(在线交易次数)这个和响应变量可从主观就可判断高度相关的变量。很明显这种模型虽然统计学上很完美了,但却无法使用。
对于模型变量共线性有3种方法: * 忽略高度相关变量; * 通过提取主成分(或计算高度相关的预测变量组因子得分)消除变量间的相关性; * 使用对共线性有抗性的模型建构方法,如随机森林等建模方式。
虽然可通过变量分布矩阵大致看出存在共线性的变量,但仍需要更为精确方法car包的vif()函数。
library(car) vif(spend_m)结果:
age credit.score email distance.to.store 1.095210 1.112131 1.047490 1.293468 online.visits online.trans store.trans store.spend 8.513212 8.586219 114.637694 112.395696 sat.service sat.selection 1.525943 1.520874一般而言,vif值大于5.0说明需要对共线性采取措施了。
结果:
Call: lm(formula = online.spend ~ . - online.trans - store.trans, data = spend_trsf) Residuals: Min 1Q Median 3Q Max -1.35579 -0.12595 0.05827 0.18121 1.03482 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.0871472 0.0431647 -2.019 0.0441 * age -0.0340390 0.0181661 -1.874 0.0617 . credit.score -0.0092356 0.0179997 -0.513 0.6082 emailyes 0.1100993 0.0471908 2.333 0.0201 * distance.to.store -0.0001036 0.0194782 -0.005 0.9958 online.visits 0.9340015 0.0176964 52.779 <2e-16 *** store.spend 0.0093869 0.0190377 0.493 0.6222 sat.service -0.0117930 0.0209974 -0.562 0.5747 sat.selection 0.0045268 0.0210155 0.215 0.8296 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.348 on 409 degrees of freedom (214 observations deleted due to missingness) Multiple R-squared: 0.8767, Adjusted R-squared: 0.8743 F-statistic: 363.5 on 8 and 409 DF, p-value: < 2.2e-16虽然vif值都满足了,但R方也随之下降了。
结果:
Call: lm(formula = online.spend ~ email + age + credit.score + distance.to.store + sat.service + sat.selection + online + store, data = spend_trsf) Residuals: Min 1Q Median 3Q Max -0.83525 -0.08244 0.01280 0.09384 0.72979 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.0361934 0.0238145 -1.520 0.1293 emailyes 0.0467442 0.0260512 1.794 0.0735 . age -0.0171626 0.0100018 -1.716 0.0869 . credit.score -0.0040981 0.0099080 -0.414 0.6794 distance.to.store -0.0001683 0.0107783 -0.016 0.9875 sat.service -0.0026340 0.0115625 -0.228 0.8199 sat.selection 0.0029578 0.0115701 0.256 0.7984 online -0.7024761 0.0069807 -100.632 <2e-16 *** store -0.0027882 0.0074487 -0.374 0.7084 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1916 on 409 degrees of freedom (214 observations deleted due to missingness) Multiple R-squared: 0.9626, Adjusted R-squared: 0.9619 F-statistic: 1317 on 8 and 409 DF, p-value: < 2.2e-16 email age credit.score distance.to.store 1.040119 1.081569 1.102583 1.228938 sat.service sat.selection online store 1.518655 1.520658 1.032112 1.232759vif值满足需求。