@Macux
2015-12-01T06:49:26.000000Z
字数 3917
阅读 1362
R语言_学习笔记
1、适用环境
当想通过连续型 或 类别型变量来预测二值型结果变量(例如,是/否、通过/失败)。
2、R的实现
> library(car)
> library(lmtest)
> library(zoo)
> library(sandwich)
> library(survival)
> library(splines)
> library(AER)
> attach(Affairs)
> Affairs$ynaffair[Affairs$affairs > 0] <- 1
> Affairs$ynaffair[Affairs$affairs ==0] <- 0
> Affairs$ynaffair <- factor(Affairs$ynaffair,levels=c(0,1),labels=c("No","Yes"))
> fit.all<- glm(ynaffair~.-affairs,data=Affairs,family=binomial())
> summary(fit.all)
#从结果看出,存在一些不必要的变量,现去除不必要的自变量
> fit.reduce <- glm(ynaffair~.-affairs-gender-children-education-occupation,data=Affairs,family=binomial())
3、比较两个模型
#使用卡方检验比较两个模型
> anova(fit.all,fit.reduce,test="Chisq")
Analysis of Deviance Table
Model 1: ynaffair ~ (affairs + gender + age + yearsmarried + children +
religiousness + education + occupation + rating) - affairs
Model 2: ynaffair ~ (affairs + gender + age + yearsmarried + children +
religiousness + education + occupation + rating) - affairs -
gender - children - education - occupation
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 592 609.51
2 596 615.36 -4 -5.8474 0.2108
P=0.2108》0.05,表明两个模型的拟合程度一样好。因此我们选择更简单的模型进行解释。
4、解释模型参数
> summary(fit.reduce)
Call:
glm(formula = ynaffair ~ . - affairs - gender - children - education -
occupation, family = binomial(), data = Affairs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6278 -0.7550 -0.5701 -0.2624 2.3998
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93083 0.61032 3.164 0.001558 **
age -0.03527 0.01736 -2.032 0.042127 *
yearsmarried 0.10062 0.02921 3.445 0.000571 ***
religiousness -0.32902 0.08945 -3.678 0.000235 ***
rating -0.46136 0.08884 -5.193 2.06e-07 ***
#指数化参数结果
exp(coef(fit.reduce))
(Intercept) age yearsmarried religiousness rating
6.8952321 0.9653437 1.1058594 0.7196258 0.6304248
指数化后,我们可以做如下解释:
(1)、随着婚龄的增加(值大于1)和年龄、宗教信仰与婚姻评分的降低(值都小于1),婚外情的优势比将上升。
(2)、参数值大于1,正相关;反之,负相关。具体大小,表明对响应变量的影响程度(正or负增长的快慢)
5、计算优势比尺度上得到系数95%的置信区间
#在优势比尺度上得到系数95%的置信区间
> exp(confint(fit.reduce))
2.5 % 97.5 %
(Intercept) 2.1255764 23.3506030
age 0.9323342 0.9981470
yearsmarried 1.0448584 1.1718250
religiousness 0.6026782 0.8562807
rating 0.5286586 0.7493370
6、评价某个预测变量对结果概率的影响
例如,我现在想知道“婚姻评分”对婚外情概率的影响!
> testdata <- data.frame(rating=c(1,2,3,4,5),age=mean(Affairs$age),yearsmarried=mean(Affairs$yearsmarried),religiousness=mean(Affairs$religiousness))
> testdata$prob <- predict(fit.reduce,newdata=testdata,type="response")
7、过度离势
#检测是否存在过度离势。若p>0.05,则不存在过度离势。
> fit.od <- glm(ynaffair~.-affairs-gender-children-education-occupation,data=Affairs,
family=quasibinomial())
> pchisq(summary(fit.od)$dispersion*fit.reduce$df.residual,fit.reduce$df.residual,lower=F)
[1] 0.340122
p>0.05,表明不存在过度离势。
8、扩展
1、适用环境
当通过一系列连续型和/或类别型预测变量来预测计数型响应变量时,泊松回归将是非常有用的工具。
2、R的实现
> library(fit.models)
> library(lattice)
> library(MASS)
> library(robustbase)
> library(rrcov)
> library(pcaPP)
> library(mvtnorm)
> library(robust)
> data(breslow.dat)
> fit.po <- glm(sumY~Base+Age+Trt,data=breslow.dat,family=poisson())
3、解释模型参数
> summary(fit.po)
Call:
glm(formula = sumY ~ Base + Age + Trt, family = poisson(), data = breslow.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.0569 -2.0433 -0.9397 0.7929 11.0061
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9488259 0.1356191 14.370 < 2e-16 ***
Base 0.0226517 0.0005093 44.476 < 2e-16 ***
Age 0.0227401 0.0040240 5.651 1.59e-08 ***
Trtprogabide -0.1527009 0.0478051 -3.194 0.0014 **
同样,为了方便解释,我们还是对系数指数化:
> exp(coef(fit.po))
(Intercept) Base Age Trtprogabide
7.0204403 1.0229102 1.0230007 0.8583864
可以看到:
(1)、保持其他变量不变,年龄增加一岁,期望的癫痫发病数将乘以1.023。这意味着年龄的增加与较高的癫痫发病数相关联。
(2)、一单位Trt的变化,期望的癫痫发病数将乘以0.86。这意味着保持基础癫痫发病数和年龄不变,服药组相对于安慰剂组癫痫发病数降低14%。
4、过度离势
> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY,type="poisson")
Overdispersion test Obs.Var/Theor.Var Statistic p-value
poisson data 62.87013 3646.468 0
p<0.05,表示存在过度离势。
5、解决过度离势问题
> fit.op <- glm(sumY~Base+Age+Trt,data=breslow.dat,family=quasipoisson())