The following code snippet investigates the consequences of not using a logarithmic transformation for the nihills
dataset analysis.
The second model differs from the first in having a \(\mathsf{dist} \times \mathsf{climb}\) interaction term, additional to linear terms in \(\mathsf{dist}\) and \(\mathsf{climb}\).
Fit the two models:
nihills.lm <- lm(time ~ dist+climb, data=nihills)
nihills2.lm <- lm(time ~ dist+climb+dist:climb, data=nihills)
anova(nihills.lm, nihills2.lm)
Using the F-test result, make a tentative choice of model, and proceed to examine diagnostic plots.
Are there any problematic observations? What happens if these points are removed?
Refit both of the above models, and check the diagnostics again.
# LIBRARIES:
library(lattice)
library(DAAG)
# DATA:
data(nihills)
### ORIGINAL DATASET ###
# Fit models
lm(time ~ dist+climb, data=nihills)
nihills_lm1 <- lm(time ~ dist+climb+dist:climb, data=nihills)
nihills_lm2 <-
# ANOVA
anova(nihills_lm1, nihills_lm2)
## Analysis of Variance Table
##
## Model 1: time ~ dist + climb
## Model 2: time ~ dist + climb + dist:climb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 20 0.189361
## 2 19 0.039361 1 0.15 72.406 6.623e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Diagnostic plots
par(mfrow=c(2,2))
plot(nihills_lm1)
par(mfrow=c(2,2))
plot(nihills_lm2)
### ONE PROBLEMATIC DATAPOINT REMOVED ###
# Removal
c("Seven Sevens")
rows_to_remove <- nihills[!(row.names(nihills) %in% rows_to_remove),]
nihills_pruned <-
# Fit models
lm(time ~ dist+climb, data=nihills_pruned)
nihills_lm1_pruned <- lm(time ~ dist+climb+dist:climb, data=nihills_pruned)
nihills_lm2_pruned <-
# ANOVA
anova(nihills_lm1_pruned, nihills_lm2_pruned)
## Analysis of Variance Table
##
## Model 1: time ~ dist + climb
## Model 2: time ~ dist + climb + dist:climb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 0.040533
## 2 18 0.035509 1 0.0050248 2.5472 0.1279
# Diagnostic plots
par(mfrow=c(2,2))
plot(nihills_lm1_pruned)
par(mfrow=c(2,2))
plot(nihills_lm2_pruned)
### TWO PROBLEMATIC DATAPOINTS REMOVED ###
# Removal
c("Seven Sevens", "Annalong Horseshoe")
rows_to_remove <- nihills[!(row.names(nihills) %in% rows_to_remove),]
nihills_pruned_more <-
# Fit models
lm(time ~ dist+climb, data=nihills_pruned_more)
nihills_lm1_pruned_more <- lm(time ~ dist+climb+dist:climb, data=nihills_pruned_more)
nihills_lm2_pruned_more <-
# ANOVA
anova(nihills_lm1_pruned_more, nihills_lm2_pruned_more)
## Analysis of Variance Table
##
## Model 1: time ~ dist + climb
## Model 2: time ~ dist + climb + dist:climb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 18 0.036592
## 2 17 0.035506 1 0.0010858 0.5199 0.4807
# Diagnostic plots
par(mfrow=c(2,2))
plot(nihills_lm1_pruned_more)
Apply the lm.ridge()
function to the litters
data, using the generalized cross-validation (GCV) criterion to choose the tuning parameter. (GCV is an approximation to cross-validation.)
brainwt
to bodywt
and lsize
and compare with the results obtained using lm()
.predict.lm()
.Let’s begin by taking a quick look at the data
library(DAAG)
library(MASS) # for lm.ridge()
head(DAAG::litters)
## lsize bodywt brainwt
## 1 3 9.447 0.444
## 2 3 9.780 0.436
## 3 4 9.155 0.417
## 4 4 9.613 0.429
## 5 5 8.850 0.425
## 6 5 9.610 0.434
plot(DAAG::litters) # Some linters may complain: that's fine.
From this basic plot we can immediately tell that:
brainwt
is positively correlated to bodywt
and negatively correlated to lsize
, although the correlation appears to be very noisy;bodywt
and lsize
;we may expect some problems arising from the collinearity evidenced in the latter, when fitting brainwt
against bodywt
and lsize
.
Let’s see how the linear model and ridge regression do:
# fit the vanilla linear model
lm(brainwt~bodywt+lsize, data=litters)
litters.lm =summary(litters.lm)
##
## Call:
## lm(formula = brainwt ~ bodywt + lsize, data = litters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0230005 -0.0098821 0.0004512 0.0092036 0.0180760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.178247 0.075323 2.366 0.03010 *
## bodywt 0.024306 0.006779 3.586 0.00228 **
## lsize 0.006690 0.003132 2.136 0.04751 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01195 on 17 degrees of freedom
## Multiple R-squared: 0.6505, Adjusted R-squared: 0.6094
## F-statistic: 15.82 on 2 and 17 DF, p-value: 0.0001315
# fit the ridge linear regression (selecting lambda by GCV)
::select(MASS::lm.ridge(brainwt~bodywt+lsize, data=DAAG::litters,
MASSlambda = seq(0,1,0.001)))
## modified HKB estimator is 0
## modified L-W estimator is 0
## smallest value of GCV at 0.118
.118
best.lambda = MASS::lm.ridge(brainwt~bodywt+lsize, data=DAAG::litters,
litters.ridge =lambda=best.lambda)
litters.ridge
## bodywt lsize
## 0.203442601 0.022050278 0.005661579
# estimate training MSE for comparing the models
# function that computes the predictions of a lm.ridge() model
function(model = litters.ridge, littersdf = DAAG::litters){
predridge <- coef(model)
coeffs =return(coeffs[1] + coeffs[2] * littersdf$bodywt + coeffs[3] * littersdf$lsize)
}
DAAG::litters$brainwt - predridge()
litters.ridge.residuals <-
print(paste("lm MSE: ",
sum(litters.lm$residuals**2) / length(litters.lm$residuals)))
## [1] "lm MSE: 0.000121435029328059"
print(paste("ridge MSE: ",
sum(litters.ridge.residuals**2) / length(litters.ridge.residuals)))
## [1] "ridge MSE: 0.000122235480911714"
The coefficients estimated by ridge regression for both predictors are slightly smaller than those fitted by the linear model without regularization, which is something that we expect to see in ridge regression, although the intercept’s value slightly increases.
We can compare the model looking at the training-set MSE, which actually shows a very small difference in favour of the model without regularization.
Let’s now check our models’ prediction on a the novel datapoint \((7,10)\). We’ll also estimate the percentile bootstrap 95% confidence interval for both methods around \((7,10)\) and compare them with the 95% CI reported by lm()
:
# estimate models on the new datapoint
data.frame(bodywt=7, lsize=10)
newpoint =
predict(litters.lm, newdata = newpoint)
yhat.lm <- predridge(litters.ridge,newpoint)
yhat.ridge <-
# estimate percentile bootstrap CI for lm and lm.ridge at newpoint
nrow(DAAG::litters)
n <- 1000
B =
1:B
lm.bs = 1:B
ridge.bs =
for(b in 1:B){
sample (1:n,n, replace = TRUE)
idxs = DAAG::litters[idxs,]
bootdf =
lm(brainwt~ bodywt + lsize, data = bootdf)
b_lm = predict(b_lm, newdata = newpoint)
lm.bs[b] <-
MASS::lm.ridge(brainwt~ bodywt + lsize, data = bootdf, lambda = best.lambda)
b_ridge = predridge(b_ridge, newpoint)
ridge.bs[b] <-
}
quantile(lm.bs,probs=c(0.025,0.975))
lm.ci =
quantile(ridge.bs,probs=c(0.025,0.975))
ridge.ci =
print("lm estimate + bootstrap 95% CI:")
## [1] "lm estimate + bootstrap 95% CI:"
print(paste(yhat.lm, "(", lm.ci[1],",",lm.ci[2],")"))
## [1] "0.415294682299934 ( 0.405605381833641 , 0.423419098631072 )"
print("ridge estimate + bootstrap 95% CI:")
## [1] "ridge estimate + bootstrap 95% CI:"
print(paste(yhat.ridge, "(", ridge.ci[1],",",ridge.ci[2],")"))
## [1] "0.414410339014346 ( 0.404817646253022 , 0.422267752173342 )"
print("lm() estimate and CI")
## [1] "lm() estimate and CI"
predict(litters.lm, newdata = newpoint,interval="confidence", level=.95)
## fit lwr upr
## 1 0.4152947 0.4062582 0.4243312
predict(litters.lm, newdata = newpoint,interval="confidence", level=.95)[2:3]
lm.cibase =
#plot the three intervals
plot(as.factor(c("lm+bs","lm+wald","ridge+bs")),c(yhat.lm,yhat.lm,yhat.ridge),
ylim=c(.404,.425)) # Some linters may complain: not an error.
lines(c(1,1),lm.ci, col="red")
lines(c(2,2),lm.cibase, col="red")
lines(c(3,3),ridge.ci, col="red")
We can see from the graph that the bootstrap CI is slightly asymmetric, but once again the difference in the estimates is very small.
The dataframe \(\mathsf{table.b3}\) in the \(\mathsf{MPV}\) package contains data on gas mileage and \(11\) other variables for a sample of \(32\) automobiles.
Construct a scatterplot of \(y\) (\(\mathsf{mpg}\)) versus x1
(\(\mathsf{displacement}\)). Is the relationship between these variables non-linear?
library(MPV)
library(lattice)
function(fitted_model, x, y, xlab, subtitle){
plot_fit <-plot(x, y, main = "fitted model", xlab = xlab, sub = subtitle) # Some linters may complain: not an error.
lines(x, as.vector(fitted_model$fitted.values), col ="red")
text(13,3, "y = b0+b1*x", col="red")
}
MPV::table.b3
dataset <-
# exploring data
str(dataset)
## 'data.frame': 32 obs. of 12 variables:
## $ y : num 18.9 17 20 18.2 20.1 ...
## $ x1 : num 350 350 250 351 225 440 231 262 89.7 96.9 ...
## $ x2 : num 165 170 105 143 95 215 110 110 70 75 ...
## $ x3 : num 260 275 185 255 170 330 175 200 81 83 ...
## $ x4 : num 8 8.5 8.25 8 8.4 8.2 8 8.5 8.2 9 ...
## $ x5 : num 2.56 2.56 2.73 3 2.76 2.88 2.56 2.56 3.9 4.3 ...
## $ x6 : num 4 4 1 2 1 4 2 2 2 2 ...
## $ x7 : num 3 3 3 3 3 3 3 3 4 5 ...
## $ x8 : num 200 200 197 200 194 ...
## $ x9 : num 69.9 72.9 72.2 74 71.8 69 65.4 65.4 64 65 ...
## $ x10: num 3910 3860 3510 3890 3365 ...
## $ x11: num 1 1 1 1 0 1 1 1 0 0 ...
par(mfrow=c(1,1))
# scatter plot
plot(dataset$x1, dataset$y, main="Scatterplot x1 versus y",
xlab="x_1 (displacement)", ylab="y (mpg)", pch=19)
Just by looking at the scatterplot, we can notice a non-linear relationship between the response \(y\) and the predictor x1
: these variables seem to be negatively correlated (i.e., as x1
increases, \(y\) decreases), laying approximately on a convex smooth curve. We may consider some possible transformation to the data in order to re-linearize their relationship.
No strange points are (yet) detected.
Use the xyplot()
function, and x1
(type of transmission) as a group variable. Is a linear model reasonable for these data?
# xy plot
par(mfrow=c(1,1))
::xyplot(y~x1,data=dataset,
latticegroups = x11,
pch = 19,
xlab="x1",
ylab="y",
main="Splitted Scatter Plot",
col = c("green", "blue"),
key=list(space="right",
lines=list(col=c("green","blue"), lty=c(3,2), lwd=6),
text=list(c(" group x11 = 0"," group x11 = 1"))
))
Just by looking at the plot, we can notice that the data might be adequately partitioned according to the trasmission method (predictor x1
) in order to subsequently fit two separate linear models. In fact, the observations belonging to x_11 = 0
are located in the x1
range \([100, 200]\), whereas the others in \([200, 500]\). Additionally, while still being noticeable a probably not-perfectly-linear relationship among the response and the x1
predictor across both the partitions, the average slope of a hypotetical linear fit is greatly different among the two.
To conclude, after having performed the split, it is reasonable to consider the data explained by two independent LMs.
Still, no strange points are (yet) detected.
Fit the model relating y to x1
and x1
which gives two lines having possibly different slopes and intercepts. Check the diagnostics. Are there any influential observations? Are there any influential outliers?
# fit models according the group x11: y = beta_0 + beta_1 * x1 + eps
1 <- lm(y ~ x1, data = dataset[dataset$x11 == 0, ])
fitted_model_x1_2 <- lm(y ~ x1, data = dataset[dataset$x11 == 1, ])
fitted_model_x1_
# diagnostics
summary(fitted_model_x1_1) # group x11 = 0
##
## Call:
## lm(formula = y ~ x1, data = dataset[dataset$x11 == 0, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2712 -1.4512 0.2958 2.8288 3.5412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.91963 3.86454 11.106 1.07e-05 ***
## x1 -0.11677 0.02804 -4.165 0.00421 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.626 on 7 degrees of freedom
## Multiple R-squared: 0.7125, Adjusted R-squared: 0.6715
## F-statistic: 17.35 on 1 and 7 DF, p-value: 0.004214
summary(fitted_model_x1_2) # group x11 = 1
##
## Call:
## lm(formula = y ~ x1, data = dataset[dataset$x11 == 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8562 -0.8703 -0.0274 1.2012 4.6599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.455918 2.207520 13.343 1.00e-11 ***
## x1 -0.035127 0.006266 -5.606 1.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.097 on 21 degrees of freedom
## Multiple R-squared: 0.5995, Adjusted R-squared: 0.5804
## F-statistic: 31.43 on 1 and 21 DF, p-value: 1.454e-05
# residuals plots group x11 = 0
par(mfrow=c(2, 2))
plot(fitted_model_x1_1, main = "Analisys for model y ~ x1 (x11 = 0)")
# residuals plots group x11 = 1
par(mfrow=c(2, 2))
plot(fitted_model_x1_2, main = "Analisys for model y ~ x1 (x11 = 1)")
# prepare the modified model without point 5
dataset[-5, ]
dataset_b <-1_b <- lm(y ~ x1, data = dataset_b[dataset_b$x11 == 0, ])
fitted_model_x1_
# fitted model plots
par(mfrow=c(1, 3))
plot_fit(fitted_model_x1_1,
$x11 == 0, ]$x1,
dataset[dataset$x11 == 0, ]$y,
dataset[datasetxlab = "x1",
subtitle = "group x11 = 0")
# the plot of the fitted model shows us that the linear regressor could be a
# reasonable regressor, moreover the observation with large x1 could be the
# critical point of the residual vs laverage analysis. Check if it is an
# outlier and/or an influencer by looking the next plot
plot_fit(fitted_model_x1_1_b,
$x11 == 0, ]$x1,
dataset_b[dataset_b$x11 == 0, ]$y,
dataset_b[dataset_bxlab = "x1",
subtitle = "group x11 = 0 (without point 5)")
# by looking the plot it's clear that point 5 is almost an influencer point.
plot_fit(fitted_model_x1_2,
$x11 == 1, ]$x1,
dataset[dataset$x11 == 1, ]$y,
dataset[datasetxlab = "x1",
subtitle = "group x11 = 1")
# the fitted model looks quite good!
According to the summary function for the model fitted on group x11 = 0
, we may say that:
Residuals are spread in quite a large range and seem to be not totally symmetrical (mostly clumped on the left side of the median).
W.r.t. x1
, the estimate (i.e. the slope of the linear regressor) has a negative value, but quite close to zero. Moreover, the t statistic is, in absolute value, quite small but still statistically significant.
W.r.t. the adjusted \(R^2\), the variability of the model is explained at 67% just by considering the x1
predictor. As expected, additional predictors further reduce the unexplained variance.
W.r.t. the F-statistic, its value is not far from \(1\). In any case, its statistical significance suggests us to reject the null hypothesis against the statistical irrelevance of all estimated coefficients, intercept excluded.
In conclusion, we may accept this simple linear model, but taking into account that it may not be able to explain completely the data.
For what concerns the data partition x11 = 1
we can make the following observations:
Residuals are spread in a reasonably compact range, and seem to be symmetrical around the median.
W.r.t. x1
, the estimate (i.e. the slope of the linear regressor) has a negative value, but quite close to zero. Moreover, the t statistic is, in absolute value, quite small but still statistically significant.
W.r.t. the adjusted \(R^2\), the variability of the model is explained at 58% just by considering the x1
predictor. As expected, additional predictors further reduce the unexplained variance.
W.r.t. the F-statistic, its value is not far from \(1\). In any case, its statistical significance suggests us to reject the null hypothesis against the statistical irrelevance of all estimated coefficients, intercept excluded.
In conclusion, we may accept this simple linear model, but taking into account that it may not be able to explain completely the data.
For what concerns the residuals for the fitted models we may say something more by looking at the proper diagnostic plots, especially for group x11 = 0
:
Residuals vs fitted: the plot suggests us a weak pattern among the residuals, but they look quite spread. For that reason we cannot intuitively infer the hypotetical non-linear relationship among them to correct the model via data-transformation.
QQ-plot: the residuals seem to be normally-distributed, even if point 12 is far from the straight theoretical line: it could be an outlier or a point carrying a large amount of information still unexplained by the model.
Scale-location: the trend-line suggests us the lack of significant heteroscedasticity. However, inasmuch the number of observations is small, we blindly accept such result; it is unlikely, however, that this may corrupt the efficacy of our model.
Residuals vs leverage: point 5 shows a relevant Cook distance from zero, meaning that it has a strong influence on the model. It could be an outlier or an influential point (or both).
Upon further inspection, we can say that point 5 seems to be both an outlier and an influential observation because – additionally – its residual is significantly large compared to the others, and it is the only point located in that portion of the x1
domain.
Similarly, for the group x11 = 1
:
Residuals vs fitted: the plot suggests us no clear relationship among the residuals. For that reason we cannot intuitively infer a hypotetical non-linear relationship among them to correct the model via data-transformation.
QQ-plot: the residuals seem to be not-clearly-normally-distributed, especially for what concerns the point outside the \([-1, 1]\) range. Normality assumption is quite weakly verified.
Scale-location: the trend-line suggests us the lack of significant heteroscedasticity.
Residuals vs leverage: no point has a Cook distance greater than \(1\) from \(0\), even if the point \(17\) is close to the boundary.
Plot the residuals against the variable x7
(number of transmission speeds), again using x1
as a group variable. Is there anything striking about this plot?
# fit models according the group x11: y = beta_0 + beta_1 * x7 + eps
1 <- lm(y ~ x7, data = dataset[dataset$x11 == 0, ])
fitted_model_x7_2 <- lm(y ~ x7, data = dataset[dataset$x11 == 1, ])
fitted_model_x7_
# residuals plots group x11 = 0
par(mfrow=c(2, 2))
plot(fitted_model_x7_1, main = "Analisys for model y ~ x7 (x11 = 0)")
# the residuals are spreaded in just three integer values that are all
# multiple of three, moreover the qq-plot shows how the normality
# assumption is quite weak.
# residuals plots group x11 = 1
par(mfrow=c(2, 2))
plot(fitted_model_x7_2, main = "Analisys for model y ~ x7 (x11 = 1)")
# x7 is the same for all points in group x11 = 1, then they have the
# same fitted value, the scal-location plot is not reasonable and the
# qqplot is quite good but just because assumed the same value.
# fitted model plots
par(mfrow=c(1, 2))
plot_fit(fitted_model_x7_1,
$x11 == 0, ]$x7,
dataset[dataset$x11 == 0, ]$y,
dataset[datasetxlab = "x7",
subtitle = "group x11 = 0")
plot_fit(fitted_model_x7_2,
$x11 == 1, ]$x7,
dataset[dataset$x11 == 1, ]$y,
dataset[datasetxlab = "x7",
subtitle = "group x11 = 1")
# x7 versus residuals
par(mfrow=c(1,2))
plot(dataset[dataset$x11 == 0, ]$x7,
1$residuals, main="Scatterplot x7 versus Residuals",
fitted_model_x7_xlab="x7", ylab="Residuals", pch=19)
plot(dataset[dataset$x11 == 1, ]$x7,
2$residuals, main="Scatterplot x7 versus Residuals",
fitted_model_x7_xlab="x7", ylab="Residuals", pch=19)
The residual plot for the group x11 = 0
shows us that the points are spread just over three different integer values of x7
and doesn’t seem to exist a clear relationsip among their relative abundance.
Considering the residual plot for the group x11 = 1
, all points are located at a single value for x7
, suggesting that x7
is not a suitable predictor at all for our model and making pratically inapplicable any statistical tool in such regard. In the building of a model for such data, it can be safely discarded as useless.
The following code is designed to explore effects that can result from the omission of explanatory variables among the predictors for a model:
library(lattice)
library(DAAG)
set.seed(42)
runif(10) # predictor which will be missing
x1 <- rbinom(10, 1, 1-x1) # observed predictor which depends on missing predictor
x2 <-
5*x1 + x2 + rnorm(10,sd=.1) # simulated model; coef of x2 is positive
y <-
lm(y ~ factor(x2)) # model fitted to observed data
y.lm <-coef(y.lm)
## (Intercept) factor(x2)1
## 3.5611964 -0.3207703
# effect of missing variable:
lm(y ~ x1 + factor(x2)) # coefficient of x2 has wrong sign
y.lm2 <-coef(y.lm2) # correct model
## (Intercept) x1 factor(x2)1
## -0.330766 5.479133 1.031886
What happens if x2
is generated according to x2 <- rbinom(10, 1, x1)
? What happens if x2
is generated according to x2 <- rbinom(10, 1, .5)
?
First of all, theoretically speaking, if we consider only the \(x_{2}\) predictor for the explanation of the response variable \(y\), we get that \(e=y-y_{estimated}=5\times x_{1} + v\) where \(x_{1} \thicksim U(0,1)\) and \(v \thicksim \mathcal{N}(0, 0.01)\).
We can deduce that the expected value \(E(e) \neq 0\) and, as matter of fact, the invalidation of the zero-mean hypothesis for the stochastic component may cause the model to be biased.
1 <- rbinom(10, 1, x1)
x2_1 <- 5*x1 + x2_1 + rnorm(10,sd=.1)
y_1.lm <- lm(y_1 ~ factor(x2_1)) # model fitted to observed data y_
1.lm2 <- lm(y_1 ~ x1 + factor(x2_1)) y_
2 <- rbinom(10, 1, .5)
x2_2 <- 5*x1 + x2_2 + rnorm(10,sd=.1)
y_2.lm <- lm(y_2 ~ factor(x2_2)) # model fitted to observed data y_
2.lm2 <- lm(y_2~ x1 + factor(x2_2)) y_
anova(y.lm,y.lm2)
## Analysis of Variance Table
##
## Model 1: y ~ factor(x2)
## Model 2: y ~ x1 + factor(x2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 14.5522
## 2 7 0.1132 1 14.439 892.86 1.212e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(y_1.lm,y_1.lm2)
## Analysis of Variance Table
##
## Model 1: y_1 ~ factor(x2_1)
## Model 2: y_1 ~ x1 + factor(x2_1)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 11.0268
## 2 7 0.0612 1 10.966 1255.2 3.705e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(y_2.lm,y_2.lm2)
## Analysis of Variance Table
##
## Model 1: y_2 ~ factor(x2_2)
## Model 2: y_2 ~ x1 + factor(x2_2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 13.8624
## 2 7 0.0396 1 13.823 2443.2 3.631e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By ANOVA-testing, we can notice a difference in term of unexplained RSS between the one-variable model (x2
-only) and the two-variables model, as theoretically expected. However, the \(F\)-score significance of the x1
predictor in the two-variables model – much higher than that of the one-variable one shows us that the addition of x1
to x2
among the predictors reduced overall unexplained variance much better than the addition of x2
alone over the intercept-only model. This is further confirmed by looking at the spread of partial residuals around their mean for the one-variable model vs the two-variables model w.r.t. x2
.
cor(x1,x2) # negative correlation
## [1] -0.4584511
coef(y.lm)
## (Intercept) factor(x2)1
## 3.5611964 -0.3207703
coef(y.lm2)
## (Intercept) x1 factor(x2)1
## -0.330766 5.479133 1.031886
cor(x1,x2_1) # positive correlation
## [1] 0.5406829
coef(y_1.lm)
## (Intercept) factor(x2_1)1
## 2.641332 2.336463
coef(y_1.lm2)
## (Intercept) x1 factor(x2_1)1
## -0.0187044 5.0444277 0.9626073
cor(x1,x2_2) # almost zero correlation
## [1] -0.09769845
coef(y_2.lm)
## (Intercept) factor(x2_2)1
## 3.2390396 0.7012496
coef(y_2.lm2)
## (Intercept) x1 factor(x2_2)1
## 0.1353746 4.7872666 0.9897907
After a deeper analysis of the slope coefficient for the x2
predictor w.r.t. its correlation with x1
, we get that the sign of the correlation dominates that of the original generative process in such cases in which correlation among predictors is significant (1st, 2nd) and the most explanatory variable is omitted (x2
-only models).
In cases with the same generative process, but both variables present among the chosen predictors, this is generally no longer always true and signs are determined in good agreement with the original generative model.
In the third case, lack of significant correlation among the two predictors, and the peculiar generative process, makes the specific determination of the sign of the slope for x2
dominated by random noise. Additionally, such coefficients do not differ much among the one- and the two- variable models. This last observation is further put in evidence by the last (3rd) row of partial residual plots. Indeed, we can see that there is no significant difference in the mean of the partial residuals around x2
between the full and the censored model.
par(mfrow=c(1,3))
termplot(y.lm, partial.resid=TRUE, smooth=panel.smooth,
col.res="gray30",main = "y ~ factor(x2)", ylim = c(-3,2))
termplot(y.lm2, partial.resid=TRUE, smooth=panel.smooth,
col.res="gray30",main = "y ~ x1 + factor(x2)",ylim = c(-3,2))
par(mfrow=c(1,3))
termplot(y_1.lm, partial.resid=TRUE, smooth=panel.smooth,
col.res="blue",main = "y_1 ~ factor(x2_1)",ylim = c(-3,2))
termplot(y_1.lm2, partial.resid=TRUE, smooth=panel.smooth,
col.res="blue",main = "y_1 ~ x1 + factor(x2_1)",ylim = c(-3,2))
par(mfrow=c(1,3))
termplot(y_2.lm, partial.resid=TRUE, smooth=panel.smooth,
col.res="green",main = "y_2 ~ factor(x2_2)",ylim = c(-3,2))
termplot(y_2.lm2, partial.resid=TRUE, smooth=panel.smooth,
col.res="green",main = "y ~ x1_2 + factor(x2_2)",ylim = c(-3,2))
It is given as a dataset the numbers of occasions when inhibition (i.e. no flow of current across a membrane) occurred within \(120\)s, for different concentrations of the protein peptide-C (data are used with the permission of Claudia Haarmann, who obtained these data in the course of her PhD research). The outcome yes
implies that inhibition has occurred.
Use logistic regression to model the probability of inhibition as a function of protein concentration.
# LIBRARIES
library(ggplot2)
# DATA:
c(0.1, 0.5, 1.0, 10.0, 20.0, 30.0, 50.0, 70.0, 80.0, 100.0, 150.0)
conc <- c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3 )
no <- c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1, 7 )
yes <-
# COMPOSITES
yes/(yes+no)
yes_probab <- no/(yes+no)
no_probab <-
# Exploratory plot:
plot(conc, yes_probab) # No obvious pattern...
plot(conc, no_probab) # ... emerges in the data! :(
# Blind-fitting with total-count weighting
glm(yes_probab ~ conc, # At least yes_probab it's increasing on average
fit_out <-family=binomial(link="logit"),
weights = (yes+no) # and let the magic happen!
)
data.frame(conc = seq(min(conc), max(conc), len=100))
synthdata <-$yes_probab = predict(fit_out, newdata = synthdata, type="response")
synthdata
plot(yes_probab~conc)
lines(yes_probab~conc, synthdata)
Following an orthodox interpretation of the exercise text, we are required to fit a logistic model that tries to explain or predict the probability of inhibition as a function of concentration – assuming that the counts shown in the datatable refer to different, independent \(120\text{s}\)-long observations.
Upon visual analysis of the scatterplot of the data, it appears that a global trend in the data is hard to be noticed – or anyway extremely difficult to model in functional form –, and both up- and down-swinging trends are present as the concentration increases (apart from the last point, which is the highest in probability of the whole dataset).
The best we can do in this case is to fit a logistic regression model on untransformed data, directly for the continuous variable that computes the probability of inhibition, using the total number of yes/no observations per concentration as logistic regression weights.
The result is acceptable, but still corrupted by noise.
In the dataset (an artificial one of \(3121\) patients, that is similar to a subset of the data analyzed in Stiell et al., 2001) minor.head.injury
, obtain a logistic regression model relating clinically.important.brain.injury
to the other variables. Patients whose risk is sufficiently high will be sent for CT. Using a risk threshold of \(0.025\) (\(2.5\%\)), turn the result into a decision rule for use of CT.
library(DAAG)
# load dataset
DAAG::head.injury
data <-
# exploring data
str(data) # they are binary data...
## 'data.frame': 3121 obs. of 11 variables:
## $ age.65 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ amnesia.before : num 1 0 0 0 0 1 0 0 0 1 ...
## $ basal.skull.fracture : num 0 0 0 0 0 0 0 0 0 0 ...
## $ GCS.decrease : num 0 0 0 0 0 0 0 0 0 0 ...
## $ GCS.13 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ GCS.15.2hours : num 0 0 0 1 1 0 0 0 0 0 ...
## $ high.risk : num 0 0 0 0 0 0 0 0 1 0 ...
## $ loss.of.consciousness : num 0 0 0 0 0 0 0 0 0 0 ...
## $ open.skull.fracture : num 0 0 0 0 0 0 0 0 0 0 ...
## $ vomiting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ clinically.important.brain.injury: num 0 0 0 0 0 0 0 0 0 0 ...
# perform glm
glm(clinically.important.brain.injury ~ .,
model_glm <-family = binomial(link = "logit"),
data = data)
#summary
summary(model_glm)
##
## Call:
## glm(formula = clinically.important.brain.injury ~ ., family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2774 -0.3511 -0.2095 -0.1489 3.0028
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4972 0.1629 -27.611 < 2e-16 ***
## age.65 1.3734 0.1827 7.518 5.56e-14 ***
## amnesia.before 0.6893 0.1725 3.996 6.45e-05 ***
## basal.skull.fracture 1.9620 0.2064 9.504 < 2e-16 ***
## GCS.decrease -0.2688 0.3680 -0.730 0.465152
## GCS.13 1.0613 0.2820 3.764 0.000168 ***
## GCS.15.2hours 1.9408 0.1663 11.669 < 2e-16 ***
## high.risk 1.1115 0.1591 6.984 2.86e-12 ***
## loss.of.consciousness 0.9554 0.1959 4.877 1.08e-06 ***
## open.skull.fracture 0.6304 0.3151 2.001 0.045424 *
## vomiting 1.2334 0.1961 6.290 3.17e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1741.6 on 3120 degrees of freedom
## Residual deviance: 1201.3 on 3110 degrees of freedom
## AIC: 1223.3
##
## Number of Fisher Scoring iterations: 6
# turn the result into a decision rule for use of CT
1 <- function(patients, threshold){
judge_function_ length(model_glm$coefficients) # number of estimated coefficients
n_coeff <-
model_glm$coefficients[1] + patients %*% model_glm$coefficients[2:n_coeff]
p <- exp(p) / (1 + exp(p))
logistic_prob <-
return(logistic_prob < threshold) # return the if condition on the threshold
}
2 <- function(patients, threshold){
judge_function_log(threshold/(1 - threshold))
logit <- model_glm$coefficients[1]
intercept <-
data.frame("coeff" = model_glm$coefficients[-1], "patient" = t(patients)); print(CT_data)
CT_data <- data.frame("Patient_risk" = patients %*% model_glm$coefficients[-1],
CT_rule <-"CT_threshold" = logit - intercept,
"CT" = patients %*% model_glm$coefficients[-1] > logit - intercept)
rownames(CT_rule) <- colnames(CT_data[, -1]); print(CT_rule)
}
# number of columns without response column
length(model_glm$coefficients[-1])
n <-
# check correctness of the judge function
sapply(data,'[[', 1)[-11]; # a patient took from the dataset
x <-
# first patient
set.seed(69)
c(x, sample(c(0,1), replace = TRUE, size = n))
x <-
# second patient
set.seed(50)
c(x, sample(c(0,1), replace = TRUE, size = n))
x <-
# patients table
matrix(x, nrow = 3, ncol = n, byrow = TRUE)
x <-
# judge_function_1(patients = x, threshold = 0.025)
judge_function_2(patients = x, threshold = 0.025)
## coeff patient.1 patient.2 patient.3
## age.65 1.3734497 0 1 1
## amnesia.before 0.6892694 1 0 0
## basal.skull.fracture 1.9619970 0 0 1
## GCS.decrease -0.2687866 0 1 0
## GCS.13 1.0612978 0 1 0
## GCS.15.2hours 1.9407526 0 1 1
## high.risk 1.1114875 0 0 1
## loss.of.consciousness 0.9554175 0 0 0
## open.skull.fracture 0.6304400 0 0 0
## vomiting 1.2334196 0 0 1
## Patient_risk CT_threshold CT
## patient.1 0.6892694 0.8336726 FALSE
## patient.2 4.1067135 0.8336726 TRUE
## patient.3 7.6211065 0.8336726 TRUE
Almost all of the variables used as predictors are highly significant according to their coefficient p-values, with – in any case – some differences w.r.t. their overall relevance for the final result.
The difference between the null model and the full one is large, positively suggesting its effective use.
Finally, a table is provided to summarize the estimated coefficients and the patient information; additionally, another table is provided to compare symptoms, risk level for each patient, and the model-predicted decision whether to sent them for a CT check.
In order to operationalize such model and provide a simple decision rule for emergency unit responders, it is possible to transform the logistic model in a linearly-predicted logit model operating at fixed probability cutoff.
Since all the predictors are binary variables carrying information about the presence (\(1\)) or not (\(0\)) of a given symptom, it is possible to assign to each symptom a score corresponding to their model multiplicative coefficient. Finally, the ER unit workers just need to sum all the symptom scores according to their presence in a given patient showing the hospital.
If the overall sum exceeds a precomputed threshold, the patient needs to be sent to CT.
Such threshold \(T_p\) depends on the risk cutoff \(p\) in that: \[ T_p = \text{logit}(p) \ - \ \text{model intercept} \ = \ \text{log}\left(\frac{p}{1-p}\right) - \ \text{model intercept} \]
For our model, at \(p = 0.025\), \(T_p = 4.4972 - 3.6635 = 0.8337\). Coefficients are shown in the tables adove.
Consider again the \(\mathsf{moths}\) dataset of Section 8.4.
# LIBRARIES
library(lattice)
library(DAAG)
# TABLE
rbind(Number = table(moths[,4]), sapply(split(DAAG::moths[,-4], DAAG::moths$habitat),
2, sum)) apply,
## Bank Disturbed Lowerside NEsoak NWsoak SEsoak SWsoak Upperside
## Number 1 7 9 6 3 7 3 5
## meters 21 49 191 254 65 193 116 952
## A 0 8 41 14 71 37 20 28
## P 4 33 17 14 19 6 48 8
# IND. TRANSECTS / MEANS
::dotplot(habitat~P, data=DAAG::moths, xlab="Number of moths (species P)",
lattice
panel = function(x, y, ...)
{::panel.dotplot(x, y, pch = 1, col = "black", ...)
lattice::panel.average(x, y, pch = 3, cex = 1.25, type = "p", col = "gray45")
lattice
},
key = list(text = list(c("Individual transects", "Mean")),
points = list(pch = c(1,3), cex = c(1,1.25), col = c("black", "gray45")),
columns = 2))
# QUASIPOISSON GLM, BANK REFLEVEL
glm(P ~ habitat + log(meters), family = quasipoisson, data = moths)
pmoths_qpois <-summary(pmoths_qpois)
##
## Call:
## glm(formula = P ~ habitat + log(meters), family = quasipoisson,
## data = moths)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9479 -1.3155 -0.4167 0.5660 2.9719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3039 0.9584 -0.317 0.75326
## habitatDisturbed 0.8794 0.8623 1.020 0.31547
## habitatLowerside -0.5569 0.8837 -0.630 0.53301
## habitatNEsoak -0.8780 0.9105 -0.964 0.34212
## habitatNWsoak 0.4537 0.8746 0.519 0.60753
## habitatSEsoak -1.6266 1.0278 -1.583 0.12336
## habitatSWsoak 1.0814 0.8348 1.295 0.20444
## habitatUpperside -2.1342 1.0481 -2.036 0.05007 .
## log(meters) 0.5552 0.1759 3.157 0.00347 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.527554)
##
## Null deviance: 217.819 on 40 degrees of freedom
## Residual deviance: 76.924 on 32 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
# Analysis with tighter convergence criterion: omitted as no 0-count observed!
# QUASIPOISSON GLM, UPPERSIDE REFLEVEL (Pareto-symplex optimal significance)
$habitat_rel <- relevel(DAAG::moths$habitat, ref="Upperside")
moths glm(P ~ habitat_rel + log(meters), family = quasipoisson, data = moths)
pmoths_qpois_rel <-summary(pmoths_qpois_rel)
##
## Call:
## glm(formula = P ~ habitat_rel + log(meters), family = quasipoisson,
## data = moths)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9479 -1.3155 -0.4167 0.5660 2.9719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4381 1.0813 -2.255 0.03112 *
## habitat_relBank 2.1342 1.0481 2.036 0.05007 .
## habitat_relDisturbed 3.0137 0.8509 3.542 0.00124 **
## habitat_relLowerside 1.5773 0.7757 2.033 0.05038 .
## habitat_relNEsoak 1.2562 0.7508 1.673 0.10406
## habitat_relNWsoak 2.5879 0.7711 3.356 0.00205 **
## habitat_relSEsoak 0.5076 0.9200 0.552 0.58497
## habitat_relSWsoak 3.2156 0.6675 4.817 3.38e-05 ***
## log(meters) 0.5552 0.1759 3.157 0.00347 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.527554)
##
## Null deviance: 217.819 on 40 degrees of freedom
## Residual deviance: 76.924 on 32 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
# THE ACTUAL EXERCISE:
# (a): POISSON GLM, UPPERSIDE REFLEVEL
$habitat_rel <- relevel(DAAG::moths$habitat, ref="Upperside")
moths glm(P ~ habitat_rel + log(meters), family = poisson, data = moths)
pmoths_pois_rel <-summary(pmoths_pois_rel) # Shrinkage of C.I.s as expected!
##
## Call:
## glm(formula = P ~ habitat_rel + log(meters), family = poisson,
## data = moths)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9479 -1.3155 -0.4167 0.5660 2.9719
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4381 0.6801 -3.585 0.000337 ***
## habitat_relBank 2.1342 0.6593 3.237 0.001207 **
## habitat_relDisturbed 3.0137 0.5352 5.631 1.79e-08 ***
## habitat_relLowerside 1.5773 0.4879 3.233 0.001226 **
## habitat_relNEsoak 1.2562 0.4723 2.660 0.007816 **
## habitat_relNWsoak 2.5879 0.4850 5.336 9.52e-08 ***
## habitat_relSEsoak 0.5076 0.5787 0.877 0.380404
## habitat_relSWsoak 3.2156 0.4199 7.659 1.88e-14 ***
## log(meters) 0.5552 0.1106 5.018 5.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 217.819 on 40 degrees of freedom
## Residual deviance: 76.924 on 32 degrees of freedom
## AIC: 187.15
##
## Number of Fisher Scoring iterations: 6
# (b): QUASIPOISSON GLM, UPPERSIDE REFLEVEL
$habitat_rel <- relevel(DAAG::moths$habitat, ref="Upperside")
moths glm(P ~ habitat_rel + log(meters), family = quasipoisson, data = moths)
pmoths_qpois_rel <-summary(pmoths_qpois_rel) # Significant transect length!
##
## Call:
## glm(formula = P ~ habitat_rel + log(meters), family = quasipoisson,
## data = moths)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9479 -1.3155 -0.4167 0.5660 2.9719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4381 1.0813 -2.255 0.03112 *
## habitat_relBank 2.1342 1.0481 2.036 0.05007 .
## habitat_relDisturbed 3.0137 0.8509 3.542 0.00124 **
## habitat_relLowerside 1.5773 0.7757 2.033 0.05038 .
## habitat_relNEsoak 1.2562 0.7508 1.673 0.10406
## habitat_relNWsoak 2.5879 0.7711 3.356 0.00205 **
## habitat_relSEsoak 0.5076 0.9200 0.552 0.58497
## habitat_relSWsoak 3.2156 0.6675 4.817 3.38e-05 ***
## log(meters) 0.5552 0.1759 3.157 0.00347 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.527554)
##
## Null deviance: 217.819 on 40 degrees of freedom
## Residual deviance: 76.924 on 32 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
# Distinguishability analysis with deviances: omitted as no "problematic" habitat!
As we can see from the printed results – and as also noted in Exercise DAAG 8.6 – standard error intervals estimated from a Poisson regression model are narrower w.r.t. Quasi-Poisson ones. This can be explained as a positive dispersion coefficient (like the one estimated in this case for the quasi-Poisson model) inflates error bounds.
[paragraphs that follow]
By omitting the portions of data analysis directly determined by the observed \(0\)-count for the Bank habitat (in the case of A moths), the analysis is compact and contained in the code snipped above.
With a difference w.r.t. the analysis shown in the book, the reference level among different habitats has been chosen in order to maximize overall significance for the coefficients in the linear model contained in the (quasi-)Poisson one. Since no clear dominance exists in such choice, the closest to Pareto-front has been chosen (symplex-area method). Regardless of such criterion, all observations contained in the following are invariant to such choice.
From the summary of the fitted GLM we can observe that the \(log(\text{length})\) term has a significant coefficient (\(p < 0.004\)). The effect of transect length on moth count is therefore relevant (definitely more relevant in comparison to the A moths). This can show that – w.r.t. A moths, P moths enjoy longer flights and/or pass significantly many (more) times across transects during their aerial wanderings.
The function \(\mathsf{poissonsim()}\) allows for experimentation with Poisson regression.
In particular, \(\mathsf{poissonsim()}\) can be used to simulate Poisson responses with log-rates equal to \(a + bx\), where \(a\) and \(b\) are fixed values by default.
Simulate \(100\) Poisson responses using the model \(log( \lambda) = 2 − 4x\) for \(x = 0, 0.01, 0.02,\dots, 1.0\).
Fit a Poisson regression model to these data, and compare the estimated coefficients with the true coefficients. How well does the estimated model predict future observations?
Simulate \(100\) Poisson responses using the model \(log( \lambda) = 2 − bx\) where \(b\) is normally distributed with mean \(4\) and standard deviation \(5\). [Use the argument \(\mathsf{slope.sd=5}\) in the \(\mathsf{poissonsim()}\) function.] How do the results using the poisson and quasipoisson families differ?
# LIBRARIES:
library(MASS) # confint for GLMs
library(lattice)
library(DAAG)
seq(from=0.00, to=1.00, by=0.01)
x_grid <-
### POISSON DATA ###
# Generation
DAAG::poissonsim(x_grid, a = 2, b = -4)
poiss_a <-
# Fit (Poisson)
glm(poiss_a$y~poiss_a$x, family = poisson)
poiss_a_fit <-
# Summary
summary(poiss_a_fit)
##
## Call:
## glm(formula = poiss_a$y ~ poiss_a$x, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9684 -0.8463 -0.2382 0.3697 2.7572
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8495 0.1130 16.37 <2e-16 ***
## poiss_a$x -3.6008 0.3423 -10.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 232.249 on 100 degrees of freedom
## Residual deviance: 86.612 on 99 degrees of freedom
## AIC: 264.6
##
## Number of Fisher Scoring iterations: 5
# C.I.s
confint(poiss_a_fit)
## 2.5 % 97.5 %
## (Intercept) 1.622811 2.066056
## poiss_a$x -4.291993 -2.948529
### OVERDISPERSED POISSON DATA ###
# Generation
DAAG::poissonsim(x_grid, a = 2, b = -4, slope.sd = 5)
poiss_b <-
# Fit (Poisson)
glm(poiss_b$y~poiss_b$x, family = poisson)
poiss_b_fit <-
# Summary
summary(poiss_b_fit)
##
## Call:
## glm(formula = poiss_b$y ~ poiss_b$x, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -14.025 -7.044 -4.260 -0.634 57.982
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.11704 0.06578 16.98 <2e-16 ***
## poiss_b$x 3.47148 0.08421 41.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12740 on 100 degrees of freedom
## Residual deviance: 10540 on 99 degrees of freedom
## AIC: 10789
##
## Number of Fisher Scoring iterations: 7
# C.I.s
confint(poiss_b_fit)
## 2.5 % 97.5 %
## (Intercept) 0.9866542 1.244533
## poiss_b$x 3.3075740 3.637705
# Fit (Quasi-Poisson)
glm(poiss_b$y~poiss_b$x, family = quasipoisson)
qpoiss_b_fit <-
# Summary
summary(qpoiss_b_fit)
##
## Call:
## glm(formula = poiss_b$y ~ poiss_b$x, family = quasipoisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -14.025 -7.044 -4.260 -0.634 57.982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.117 1.048 1.066 0.2892
## poiss_b$x 3.471 1.342 2.587 0.0111 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 253.9653)
##
## Null deviance: 12740 on 100 degrees of freedom
## Residual deviance: 10540 on 99 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 7
# C.I.s
confint(qpoiss_b_fit)
## 2.5 % 97.5 %
## (Intercept) -1.349649 2.847922
## poiss_b$x 1.080532 6.451156
Preliminarily to a more in-depth analysis, it is worth noting that – w.r.t. to the point we are currently answering to – we are performing a Poisson regression fit to data generated according to a Poisson regression generative model.
Far from being that obvious, this allows us to preliminarily state that what we are trying to accomplish is a posteriori (information-theoretically) the most efficient fitting procedure to predict the expected count numbers from the only available predictor (\(x\)) of the synthetic phenomenon we are dealing with.
A more precise consistency check involves the analysis of the standard error estimates (obtained by the means of the integrated fitting routine \(\mathsf{glm}\)) and approximated-normal confidence intervals obtained via profile likelihood estimation thanks to the MASS::confint()
function.
As we can see from the results shown above, the estimated coefficients are always included in the \(95\%\) symmetric C.I.s around the estimate. This is also true for the \(\text{estimate}\ \pm\ SE\) interval for the slope coefficient and sometimes also for the intercept. In cases the latter is not, anyway, the difference is always of minor entity (and concern).
To analyze the robustness of the fitted model for \(x \rightarrow +\infty\) (i.e. future values) it is possible to study the (estimated) rate \(\hat{\lambda}\) of the (estimated) Poisson distribution from which we assume our counts to be sampled from.
In fact, \(\hat{\lambda} = e^{\hat{a} + x\hat{b}}\), and it is sufficient to consider that \(\hat{a} > 0\) and \(\hat{b} < 0\) for now. Furthermore, as \(x \rightarrow +\infty, \ \hat{\lambda} \rightarrow 0\) and \(E_{pois}[\text{counts}] \propto \lambda\).
From that – or also via interval-bound propagation – it is possible to show that, as long as the previous inequalities hold true, the difference in predicted counts converges to zero as \(x \rightarrow +\infty\) regardless of the specific value of the estimated parameters \(\hat{a} > 0\) and \(\hat{b} < 0\). This makes future estimations robust to stochastic noise.
As far as the second phenomenon and model are concerned, the following are the similarities and differences among the Poisson-family and Quasi-Poisson family regressions.
As expected, point-estimates for the (quasi-)Poisson regression coefficients are the same in both cases, since Quasi-Poisson regression model is the quasi-likelihood model associated to the Poisson one, with non-locked dispersion;
Both the confidence intervals and the standard errors shown for the Quasi-Poisson model are broader w.r.t. the Poisson ones;
The AIC is not shown for the Quasi-Poisson model. Settling a seemingly popular debate, such behaviour is a design choice of R Core Team for any quasi-likelihood model fitted using the \(\mathsf{glm}\) function.
Overall, it can be said that Quasi-Poisson regression accounts for (in this case) increased dispersion of the data via broadening of C.I.s.
Comments
After having fit both Model 1 (i.e. the linear model that tries to explain record
time
for male athletes from thedist
ance andclimb
variables, without considering the interaction term) and Model 2 (the one considering also interaction betweendist
andclimb
) on the entire, un-transformed,nihills
dataset, we can look at Fisher-test ANOVA for model selection, being Model 1 nested inside Model 2.With a value of \(F = 72.406\) and a p-value \(<10^{-7}\) Model 2 is evidently (according to the F-test) the model of choice in such setting.
Upon inspection, however, it appears clear that the relatively poor behaviour of Model 1 as opposed to Model 2 is strongly determined by the influence of few data-points (one, mainly!).
In fact, considering for the moment Model 1:
Now, looking at Model 2 diagnostic plots, we can see that:
We can therefore conclude that the additional interaction term (and associated degree of freedom for the model) mainly serves the purpose of correcting the model behaviour for taking into account the appearantly deviant behaviour of Seven Sevens datapoint in Model 1.
For that reason, we try to re-fit the model on the (still, untransformed)
nihills
dataset, after removal of the Seven Sevens datapoint.This time, Fisher-test ANOVA suggests to avoid the additional interaction term and to accept the simpler model, proving our previous assumption as probably correct.
Now, at the price of some additional residuals-non-normality even at central quantiles, and without solving the still present problem of heteroscedasticity, the model is less over-determined by the influence of single-datapoints.
Still, the behaviour previously observed for the Seven Sevens datapoint can now be found – though with a smaller effect overall – in the behaviour of the Annalong Horseshoe datapoint. We still observe high (and unique) fitted value, a residual among the highest and a high Cook distance from zero.
Repeating the fit on the dataset additionally devoided of the Annalong Horseshoe datapoint, previously-evidenced conclusions are even more amplified and confirmed: the final model (still, the simpler one) obtained this way is no longer over-determined by single datapoints and more balanced overall w.r.t. high-residual, high-leverage points.
The interesting result to be noted with such regard is that – being the Seven Sevens and Annalong Horseshoe datapoints those whose outlying associated \(\mathsf{time}\) value is the farthest from the linear trends time/dist and time/climb – the lack of proper logarithmic transformation of the dataset both produces generally inaccurate models and also may trick the analyst into preferring an over-parametrized model whose effect is just that to amplify the importance of such outlying value(s).