Exercises from DAAG, Chapter 6

Exercise 6

Text

The following code snippet investigates the consequences of not using a logarithmic transformation for the nihills dataset analysis.

The second model differs from the first in having a \(\mathsf{dist} \times \mathsf{climb}\) interaction term, additional to linear terms in \(\mathsf{dist}\) and \(\mathsf{climb}\).

  1. Fit the two models:
    nihills.lm <- lm(time ~ dist+climb, data=nihills)
    nihills2.lm <- lm(time ~ dist+climb+dist:climb, data=nihills)
    anova(nihills.lm, nihills2.lm)

  2. Using the F-test result, make a tentative choice of model, and proceed to examine diagnostic plots.
    Are there any problematic observations? What happens if these points are removed?

Refit both of the above models, and check the diagnostics again.

Solution

# LIBRARIES:
library(lattice)
library(DAAG)

# DATA:
data(nihills)


### ORIGINAL DATASET ###


# Fit models
nihills_lm1 <- lm(time ~ dist+climb,            data=nihills)
nihills_lm2 <- lm(time ~ dist+climb+dist:climb, data=nihills)

# ANOVA
anova(nihills_lm1, nihills_lm2)
## Analysis of Variance Table
## 
## Model 1: time ~ dist + climb
## Model 2: time ~ dist + climb + dist:climb
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1     20 0.189361                                  
## 2     19 0.039361  1      0.15 72.406 6.623e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Diagnostic plots
par(mfrow=c(2,2))
plot(nihills_lm1)

par(mfrow=c(2,2))
plot(nihills_lm2)

### ONE PROBLEMATIC DATAPOINT REMOVED ###


# Removal
rows_to_remove <- c("Seven Sevens")
nihills_pruned <- nihills[!(row.names(nihills) %in% rows_to_remove),]


# Fit models
nihills_lm1_pruned <- lm(time ~ dist+climb,            data=nihills_pruned)
nihills_lm2_pruned <- lm(time ~ dist+climb+dist:climb, data=nihills_pruned)

# ANOVA
anova(nihills_lm1_pruned, nihills_lm2_pruned)
## Analysis of Variance Table
## 
## Model 1: time ~ dist + climb
## Model 2: time ~ dist + climb + dist:climb
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1     19 0.040533                           
## 2     18 0.035509  1 0.0050248 2.5472 0.1279
# Diagnostic plots
par(mfrow=c(2,2))
plot(nihills_lm1_pruned)

par(mfrow=c(2,2))
plot(nihills_lm2_pruned)

### TWO PROBLEMATIC DATAPOINTS REMOVED ###


# Removal
rows_to_remove      <- c("Seven Sevens", "Annalong Horseshoe")
nihills_pruned_more <- nihills[!(row.names(nihills) %in% rows_to_remove),]


# Fit models
nihills_lm1_pruned_more <- lm(time ~ dist+climb,            data=nihills_pruned_more)
nihills_lm2_pruned_more <- lm(time ~ dist+climb+dist:climb, data=nihills_pruned_more)

# ANOVA
anova(nihills_lm1_pruned_more, nihills_lm2_pruned_more)
## Analysis of Variance Table
## 
## Model 1: time ~ dist + climb
## Model 2: time ~ dist + climb + dist:climb
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1     18 0.036592                           
## 2     17 0.035506  1 0.0010858 0.5199 0.4807
# Diagnostic plots
par(mfrow=c(2,2))
plot(nihills_lm1_pruned_more)

Comments

After having fit both Model 1 (i.e. the linear model that tries to explain record time for male athletes from the distance and climb variables, without considering the interaction term) and Model 2 (the one considering also interaction between dist and climb) on the entire, un-transformed, nihills dataset, we can look at Fisher-test ANOVA for model selection, being Model 1 nested inside Model 2.

With a value of \(F = 72.406\) and a p-value \(<10^{-7}\) Model 2 is evidently (according to the F-test) the model of choice in such setting.

Upon inspection, however, it appears clear that the relatively poor behaviour of Model 1 as opposed to Model 2 is strongly determined by the influence of few data-points (one, mainly!).

In fact, considering for the moment Model 1:

  • From the Residuals vs Fitted graph, it appears clearly that tracks Seven Sevens and Annalong Horseshoe have both highest residuals (though of opposite sign) and highest fitted values. This evidence of heteroscedasticity is further confirmed by the Scale-Location plot.
  • From the Residuals vs Leverage plot we know that the Seven Sevens track is also a potential influential point. Such findings, together with the observation that Seven Sevens is the only point in the dataset with such comparably high fitted value from the model, make it a highly likely over-influential point for the fitted model.

Now, looking at Model 2 diagnostic plots, we can see that:

  • The Seven Sevens track has become one of the datapoints with the smallest residuals, but with a fitted value comparable with that previously obtained.
  • Such datapoint still exhibits high leverage and uniqueness among the datapoints with such high fitted value: it is still a candidate to be an over-influential point.

We can therefore conclude that the additional interaction term (and associated degree of freedom for the model) mainly serves the purpose of correcting the model behaviour for taking into account the appearantly deviant behaviour of Seven Sevens datapoint in Model 1.

For that reason, we try to re-fit the model on the (still, untransformed) nihills dataset, after removal of the Seven Sevens datapoint.

This time, Fisher-test ANOVA suggests to avoid the additional interaction term and to accept the simpler model, proving our previous assumption as probably correct.

Now, at the price of some additional residuals-non-normality even at central quantiles, and without solving the still present problem of heteroscedasticity, the model is less over-determined by the influence of single-datapoints.

Still, the behaviour previously observed for the Seven Sevens datapoint can now be found – though with a smaller effect overall – in the behaviour of the Annalong Horseshoe datapoint. We still observe high (and unique) fitted value, a residual among the highest and a high Cook distance from zero.

Repeating the fit on the dataset additionally devoided of the Annalong Horseshoe datapoint, previously-evidenced conclusions are even more amplified and confirmed: the final model (still, the simpler one) obtained this way is no longer over-determined by single datapoints and more balanced overall w.r.t. high-residual, high-leverage points.

The interesting result to be noted with such regard is that – being the Seven Sevens and Annalong Horseshoe datapoints those whose outlying associated \(\mathsf{time}\) value is the farthest from the linear trends time/dist and time/climb – the lack of proper logarithmic transformation of the dataset both produces generally inaccurate models and also may trick the analyst into preferring an over-parametrized model whose effect is just that to amplify the importance of such outlying value(s).

Exercise 8

Text

Apply the lm.ridge() function to the litters data, using the generalized cross-validation (GCV) criterion to choose the tuning parameter. (GCV is an approximation to cross-validation.)

  1. In particular, estimate the coefficients of the model relating brainwt to bodywt and lsize and compare with the results obtained using lm().
  2. Using both ridge and ordinary regression, estimate the mean brain weight when litter size is 10 and body weight is 7. Use the bootstrap, with case-resampling, to compute approximate 95% percentile confidence intervals using each method. Compare with the interval obtained using predict.lm().

Solution

Let’s begin by taking a quick look at the data

library(DAAG)
library(MASS) # for lm.ridge()

head(DAAG::litters)
##   lsize bodywt brainwt
## 1     3  9.447   0.444
## 2     3  9.780   0.436
## 3     4  9.155   0.417
## 4     4  9.613   0.429
## 5     5  8.850   0.425
## 6     5  9.610   0.434
plot(DAAG::litters)   # Some linters may complain: that's fine.

From this basic plot we can immediately tell that:

  • brainwt is positively correlated to bodywt and negatively correlated to lsize, although the correlation appears to be very noisy;
  • There’s a significant negative correlation between bodywt and lsize;

we may expect some problems arising from the collinearity evidenced in the latter, when fitting brainwt against bodywt and lsize.

Let’s see how the linear model and ridge regression do:

# fit the vanilla linear model
litters.lm = lm(brainwt~bodywt+lsize, data=litters)
summary(litters.lm)
## 
## Call:
## lm(formula = brainwt ~ bodywt + lsize, data = litters)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0230005 -0.0098821  0.0004512  0.0092036  0.0180760 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 0.178247   0.075323   2.366  0.03010 * 
## bodywt      0.024306   0.006779   3.586  0.00228 **
## lsize       0.006690   0.003132   2.136  0.04751 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01195 on 17 degrees of freedom
## Multiple R-squared:  0.6505, Adjusted R-squared:  0.6094 
## F-statistic: 15.82 on 2 and 17 DF,  p-value: 0.0001315
# fit the ridge linear regression (selecting lambda by GCV)
MASS::select(MASS::lm.ridge(brainwt~bodywt+lsize, data=DAAG::litters,
                lambda = seq(0,1,0.001)))
## modified HKB estimator is 0 
## modified L-W estimator is 0 
## smallest value of GCV  at 0.118
best.lambda = .118
litters.ridge = MASS::lm.ridge(brainwt~bodywt+lsize, data=DAAG::litters,
                         lambda=best.lambda)
litters.ridge
##                  bodywt       lsize 
## 0.203442601 0.022050278 0.005661579
# estimate training MSE for comparing the models

# function that computes the predictions of a lm.ridge() model
predridge <- function(model = litters.ridge, littersdf = DAAG::litters){
  coeffs = coef(model)
  return(coeffs[1] + coeffs[2] * littersdf$bodywt + coeffs[3] * littersdf$lsize)
}

litters.ridge.residuals <- DAAG::litters$brainwt - predridge()

print(paste("lm MSE: ",
            sum(litters.lm$residuals**2) / length(litters.lm$residuals)))
## [1] "lm MSE:  0.000121435029328059"
print(paste("ridge MSE: ",
            sum(litters.ridge.residuals**2) / length(litters.ridge.residuals)))
## [1] "ridge MSE:  0.000122235480911714"

The coefficients estimated by ridge regression for both predictors are slightly smaller than those fitted by the linear model without regularization, which is something that we expect to see in ridge regression, although the intercept’s value slightly increases.

We can compare the model looking at the training-set MSE, which actually shows a very small difference in favour of the model without regularization.

Let’s now check our models’ prediction on a the novel datapoint \((7,10)\). We’ll also estimate the percentile bootstrap 95% confidence interval for both methods around \((7,10)\) and compare them with the 95% CI reported by lm():

# estimate models on the new datapoint
newpoint = data.frame(bodywt=7, lsize=10)

yhat.lm <- predict(litters.lm, newdata = newpoint)
yhat.ridge <- predridge(litters.ridge,newpoint)

# estimate percentile bootstrap CI for lm and lm.ridge at newpoint

n <- nrow(DAAG::litters)
B = 1000

lm.bs = 1:B
ridge.bs = 1:B

for(b in 1:B){
  idxs = sample (1:n,n, replace = TRUE)
  bootdf = DAAG::litters[idxs,]
  
  b_lm = lm(brainwt~ bodywt + lsize, data = bootdf)
  lm.bs[b] <- predict(b_lm, newdata = newpoint)
  
  b_ridge = MASS::lm.ridge(brainwt~ bodywt + lsize, data = bootdf, lambda = best.lambda)
  ridge.bs[b] <- predridge(b_ridge, newpoint)
}


lm.ci = quantile(lm.bs,probs=c(0.025,0.975))

ridge.ci = quantile(ridge.bs,probs=c(0.025,0.975))

print("lm estimate + bootstrap 95% CI:")
## [1] "lm estimate + bootstrap 95% CI:"
print(paste(yhat.lm, "(", lm.ci[1],",",lm.ci[2],")"))
## [1] "0.415294682299934 ( 0.405605381833641 , 0.423419098631072 )"
print("ridge estimate + bootstrap 95% CI:")
## [1] "ridge estimate + bootstrap 95% CI:"
print(paste(yhat.ridge, "(", ridge.ci[1],",",ridge.ci[2],")"))
## [1] "0.414410339014346 ( 0.404817646253022 , 0.422267752173342 )"
print("lm() estimate and CI")
## [1] "lm() estimate and CI"
predict(litters.lm, newdata = newpoint,interval="confidence", level=.95)
##         fit       lwr       upr
## 1 0.4152947 0.4062582 0.4243312
lm.cibase = predict(litters.lm, newdata = newpoint,interval="confidence", level=.95)[2:3]

#plot the three intervals 
plot(as.factor(c("lm+bs","lm+wald","ridge+bs")),c(yhat.lm,yhat.lm,yhat.ridge),       
     ylim=c(.404,.425))   # Some linters may complain: not an error.


lines(c(1,1),lm.ci, col="red")
lines(c(2,2),lm.cibase, col="red")
lines(c(3,3),ridge.ci, col="red")

We can see from the graph that the bootstrap CI is slightly asymmetric, but once again the difference in the estimates is very small.

Exercise 10

Text

The dataframe \(\mathsf{table.b3}\) in the \(\mathsf{MPV}\) package contains data on gas mileage and \(11\) other variables for a sample of \(32\) automobiles.

Question: a

Construct a scatterplot of \(y\) (\(\mathsf{mpg}\)) versus x1 (\(\mathsf{displacement}\)). Is the relationship between these variables non-linear?

Solution

library(MPV)
library(lattice)

plot_fit <- function(fitted_model, x, y, xlab, subtitle){
  plot(x, y, main = "fitted model", xlab = xlab, sub = subtitle) # Some linters may complain: not an error.
  lines(x, as.vector(fitted_model$fitted.values), col ="red")
  text(13,3, "y = b0+b1*x", col="red")
}
dataset <- MPV::table.b3

# exploring data
str(dataset)
## 'data.frame':    32 obs. of  12 variables:
##  $ y  : num  18.9 17 20 18.2 20.1 ...
##  $ x1 : num  350 350 250 351 225 440 231 262 89.7 96.9 ...
##  $ x2 : num  165 170 105 143 95 215 110 110 70 75 ...
##  $ x3 : num  260 275 185 255 170 330 175 200 81 83 ...
##  $ x4 : num  8 8.5 8.25 8 8.4 8.2 8 8.5 8.2 9 ...
##  $ x5 : num  2.56 2.56 2.73 3 2.76 2.88 2.56 2.56 3.9 4.3 ...
##  $ x6 : num  4 4 1 2 1 4 2 2 2 2 ...
##  $ x7 : num  3 3 3 3 3 3 3 3 4 5 ...
##  $ x8 : num  200 200 197 200 194 ...
##  $ x9 : num  69.9 72.9 72.2 74 71.8 69 65.4 65.4 64 65 ...
##  $ x10: num  3910 3860 3510 3890 3365 ...
##  $ x11: num  1 1 1 1 0 1 1 1 0 0 ...
par(mfrow=c(1,1))

# scatter plot
plot(dataset$x1, dataset$y, main="Scatterplot x1 versus y",
     xlab="x_1 (displacement)", ylab="y (mpg)", pch=19)

Just by looking at the scatterplot, we can notice a non-linear relationship between the response \(y\) and the predictor x1: these variables seem to be negatively correlated (i.e., as x1 increases, \(y\) decreases), laying approximately on a convex smooth curve. We may consider some possible transformation to the data in order to re-linearize their relationship.

No strange points are (yet) detected.

Question: b

Use the xyplot() function, and x1 (type of transmission) as a group variable. Is a linear model reasonable for these data?

Solution

# xy plot
par(mfrow=c(1,1))
lattice::xyplot(y~x1,data=dataset,
       groups = x11,
       pch = 19,
       xlab="x1",
       ylab="y",
       main="Splitted Scatter Plot",
       col = c("green", "blue"),
       key=list(space="right",
                lines=list(col=c("green","blue"), lty=c(3,2), lwd=6),
                text=list(c(" group x11 = 0"," group x11 = 1"))
       ))

Just by looking at the plot, we can notice that the data might be adequately partitioned according to the trasmission method (predictor x1) in order to subsequently fit two separate linear models. In fact, the observations belonging to x_11 = 0 are located in the x1 range \([100, 200]\), whereas the others in \([200, 500]\). Additionally, while still being noticeable a probably not-perfectly-linear relationship among the response and the x1 predictor across both the partitions, the average slope of a hypotetical linear fit is greatly different among the two.

To conclude, after having performed the split, it is reasonable to consider the data explained by two independent LMs.

Still, no strange points are (yet) detected.

Question: c

Fit the model relating y to x1 and x1 which gives two lines having possibly different slopes and intercepts. Check the diagnostics. Are there any influential observations? Are there any influential outliers?

Solution

# fit models according the group x11: y = beta_0 + beta_1 * x1 + eps
fitted_model_x1_1 <- lm(y ~ x1, data = dataset[dataset$x11 == 0, ])
fitted_model_x1_2 <- lm(y ~ x1, data = dataset[dataset$x11 == 1, ])

# diagnostics
summary(fitted_model_x1_1) # group x11 = 0
## 
## Call:
## lm(formula = y ~ x1, data = dataset[dataset$x11 == 0, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2712 -1.4512  0.2958  2.8288  3.5412 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.91963    3.86454  11.106 1.07e-05 ***
## x1          -0.11677    0.02804  -4.165  0.00421 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.626 on 7 degrees of freedom
## Multiple R-squared:  0.7125, Adjusted R-squared:  0.6715 
## F-statistic: 17.35 on 1 and 7 DF,  p-value: 0.004214
summary(fitted_model_x1_2) # group x11 = 1
## 
## Call:
## lm(formula = y ~ x1, data = dataset[dataset$x11 == 1, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8562 -0.8703 -0.0274  1.2012  4.6599 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.455918   2.207520  13.343 1.00e-11 ***
## x1          -0.035127   0.006266  -5.606 1.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.097 on 21 degrees of freedom
## Multiple R-squared:  0.5995, Adjusted R-squared:  0.5804 
## F-statistic: 31.43 on 1 and 21 DF,  p-value: 1.454e-05
# residuals plots group x11 = 0
par(mfrow=c(2, 2))
plot(fitted_model_x1_1, main = "Analisys for model y ~ x1 (x11 = 0)")

# residuals plots group x11 = 1
par(mfrow=c(2, 2))
plot(fitted_model_x1_2, main = "Analisys for model y ~ x1 (x11 = 1)")

# prepare the modified model without point 5
dataset_b <- dataset[-5, ]
fitted_model_x1_1_b <- lm(y ~ x1, data = dataset_b[dataset_b$x11 == 0, ])

# fitted model plots
par(mfrow=c(1, 3))
plot_fit(fitted_model_x1_1, 
         dataset[dataset$x11 == 0, ]$x1, 
         dataset[dataset$x11 == 0, ]$y, 
         xlab = "x1", 
         subtitle = "group x11 = 0")

# the plot of the fitted model shows us that the linear regressor could be a 
# reasonable regressor, moreover the observation with large x1 could be the 
# critical point of the residual vs laverage analysis. Check if it is an
# outlier and/or an influencer by looking the next plot

plot_fit(fitted_model_x1_1_b, 
         dataset_b[dataset_b$x11 == 0, ]$x1, 
         dataset_b[dataset_b$x11 == 0, ]$y, 
         xlab = "x1", 
         subtitle = "group x11 = 0 (without point 5)")

# by looking the plot it's clear that point 5 is almost an influencer point.

plot_fit(fitted_model_x1_2, 
         dataset[dataset$x11 == 1, ]$x1, 
         dataset[dataset$x11 == 1, ]$y,
         xlab = "x1", 
         subtitle = "group x11 = 1")

# the fitted model looks quite good!

According to the summary function for the model fitted on group x11 = 0, we may say that:

  • Residuals are spread in quite a large range and seem to be not totally symmetrical (mostly clumped on the left side of the median).

  • W.r.t. x1, the estimate (i.e. the slope of the linear regressor) has a negative value, but quite close to zero. Moreover, the t statistic is, in absolute value, quite small but still statistically significant.

  • W.r.t. the adjusted \(R^2\), the variability of the model is explained at 67% just by considering the x1 predictor. As expected, additional predictors further reduce the unexplained variance.

  • W.r.t. the F-statistic, its value is not far from \(1\). In any case, its statistical significance suggests us to reject the null hypothesis against the statistical irrelevance of all estimated coefficients, intercept excluded.

In conclusion, we may accept this simple linear model, but taking into account that it may not be able to explain completely the data.

For what concerns the data partition x11 = 1 we can make the following observations:

  • Residuals are spread in a reasonably compact range, and seem to be symmetrical around the median.

  • W.r.t. x1, the estimate (i.e. the slope of the linear regressor) has a negative value, but quite close to zero. Moreover, the t statistic is, in absolute value, quite small but still statistically significant.

  • W.r.t. the adjusted \(R^2\), the variability of the model is explained at 58% just by considering the x1 predictor. As expected, additional predictors further reduce the unexplained variance.

  • W.r.t. the F-statistic, its value is not far from \(1\). In any case, its statistical significance suggests us to reject the null hypothesis against the statistical irrelevance of all estimated coefficients, intercept excluded.

In conclusion, we may accept this simple linear model, but taking into account that it may not be able to explain completely the data.

For what concerns the residuals for the fitted models we may say something more by looking at the proper diagnostic plots, especially for group x11 = 0:

  • Residuals vs fitted: the plot suggests us a weak pattern among the residuals, but they look quite spread. For that reason we cannot intuitively infer the hypotetical non-linear relationship among them to correct the model via data-transformation.

  • QQ-plot: the residuals seem to be normally-distributed, even if point 12 is far from the straight theoretical line: it could be an outlier or a point carrying a large amount of information still unexplained by the model.

  • Scale-location: the trend-line suggests us the lack of significant heteroscedasticity. However, inasmuch the number of observations is small, we blindly accept such result; it is unlikely, however, that this may corrupt the efficacy of our model.

  • Residuals vs leverage: point 5 shows a relevant Cook distance from zero, meaning that it has a strong influence on the model. It could be an outlier or an influential point (or both).

Upon further inspection, we can say that point 5 seems to be both an outlier and an influential observation because – additionally – its residual is significantly large compared to the others, and it is the only point located in that portion of the x1 domain.

Similarly, for the group x11 = 1:

  • Residuals vs fitted: the plot suggests us no clear relationship among the residuals. For that reason we cannot intuitively infer a hypotetical non-linear relationship among them to correct the model via data-transformation.

  • QQ-plot: the residuals seem to be not-clearly-normally-distributed, especially for what concerns the point outside the \([-1, 1]\) range. Normality assumption is quite weakly verified.

  • Scale-location: the trend-line suggests us the lack of significant heteroscedasticity.

  • Residuals vs leverage: no point has a Cook distance greater than \(1\) from \(0\), even if the point \(17\) is close to the boundary.

Question: d

Plot the residuals against the variable x7 (number of transmission speeds), again using x1 as a group variable. Is there anything striking about this plot?

Solution

# fit models according the group x11: y = beta_0 + beta_1 * x7 + eps
fitted_model_x7_1 <- lm(y ~ x7, data = dataset[dataset$x11 == 0, ])
fitted_model_x7_2 <- lm(y ~ x7, data = dataset[dataset$x11 == 1, ])

# residuals plots group x11 = 0
par(mfrow=c(2, 2))
plot(fitted_model_x7_1, main = "Analisys for model y ~ x7 (x11 = 0)")

# the residuals are spreaded in just three integer values that are all
# multiple of three, moreover the qq-plot shows how the normality 
# assumption is quite weak.

# residuals plots group x11 = 1
par(mfrow=c(2, 2))
plot(fitted_model_x7_2, main = "Analisys for model y ~ x7 (x11 = 1)")

# x7 is the same for all points in group x11 = 1, then they have the
# same fitted value, the scal-location plot is not reasonable and the
# qqplot is quite good but just because assumed the same value.

# fitted model plots
par(mfrow=c(1, 2))
plot_fit(fitted_model_x7_1, 
         dataset[dataset$x11 == 0, ]$x7, 
         dataset[dataset$x11 == 0, ]$y,
         xlab = "x7", 
         subtitle = "group x11 = 0")

plot_fit(fitted_model_x7_2, 
         dataset[dataset$x11 == 1, ]$x7, 
         dataset[dataset$x11 == 1, ]$y,
         xlab = "x7", 
         subtitle = "group x11 = 1")

# x7 versus residuals
par(mfrow=c(1,2))
plot(dataset[dataset$x11 == 0, ]$x7, 
     fitted_model_x7_1$residuals, main="Scatterplot x7 versus Residuals",
     xlab="x7", ylab="Residuals", pch=19) 

plot(dataset[dataset$x11 == 1, ]$x7, 
     fitted_model_x7_2$residuals, main="Scatterplot x7 versus Residuals",
     xlab="x7", ylab="Residuals", pch=19)

The residual plot for the group x11 = 0 shows us that the points are spread just over three different integer values of x7 and doesn’t seem to exist a clear relationsip among their relative abundance.

Considering the residual plot for the group x11 = 1, all points are located at a single value for x7, suggesting that x7 is not a suitable predictor at all for our model and making pratically inapplicable any statistical tool in such regard. In the building of a model for such data, it can be safely discarded as useless.

Exercise 11

Text

The following code is designed to explore effects that can result from the omission of explanatory variables among the predictors for a model:

library(lattice)
library(DAAG)

set.seed(42)

x1 <- runif(10)           # predictor which will be missing
x2 <- rbinom(10, 1, 1-x1) # observed predictor which depends on missing predictor

y <- 5*x1 + x2 + rnorm(10,sd=.1) # simulated model; coef of x2 is positive

y.lm <- lm(y ~ factor(x2)) # model fitted to observed data
coef(y.lm)
## (Intercept) factor(x2)1 
##   3.5611964  -0.3207703
# effect of missing variable:

y.lm2 <- lm(y ~ x1 + factor(x2)) # coefficient of x2 has wrong sign
coef(y.lm2)                      # correct model
## (Intercept)          x1 factor(x2)1 
##   -0.330766    5.479133    1.031886

What happens if x2 is generated according to x2 <- rbinom(10, 1, x1)? What happens if x2 is generated according to x2 <- rbinom(10, 1, .5)?

Solution

First of all, theoretically speaking, if we consider only the \(x_{2}\) predictor for the explanation of the response variable \(y\), we get that \(e=y-y_{estimated}=5\times x_{1} + v\) where \(x_{1} \thicksim U(0,1)\) and \(v \thicksim \mathcal{N}(0, 0.01)\).
We can deduce that the expected value \(E(e) \neq 0\) and, as matter of fact, the invalidation of the zero-mean hypothesis for the stochastic component may cause the model to be biased.

x2_1 <- rbinom(10, 1, x1)
y_1 <- 5*x1 + x2_1 + rnorm(10,sd=.1)
y_1.lm <- lm(y_1 ~ factor(x2_1)) # model fitted to observed data
y_1.lm2 <- lm(y_1 ~ x1 + factor(x2_1))
x2_2 <- rbinom(10, 1, .5)
y_2 <- 5*x1 + x2_2 + rnorm(10,sd=.1)
y_2.lm <- lm(y_2 ~ factor(x2_2)) # model fitted to observed data
y_2.lm2 <- lm(y_2~ x1 + factor(x2_2))
anova(y.lm,y.lm2)
## Analysis of Variance Table
## 
## Model 1: y ~ factor(x2)
## Model 2: y ~ x1 + factor(x2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1      8 14.5522                                  
## 2      7  0.1132  1    14.439 892.86 1.212e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(y_1.lm,y_1.lm2)
## Analysis of Variance Table
## 
## Model 1: y_1 ~ factor(x2_1)
## Model 2: y_1 ~ x1 + factor(x2_1)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1      8 11.0268                                  
## 2      7  0.0612  1    10.966 1255.2 3.705e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(y_2.lm,y_2.lm2)
## Analysis of Variance Table
## 
## Model 1: y_2 ~ factor(x2_2)
## Model 2: y_2 ~ x1 + factor(x2_2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1      8 13.8624                                  
## 2      7  0.0396  1    13.823 2443.2 3.631e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By ANOVA-testing, we can notice a difference in term of unexplained RSS between the one-variable model (x2-only) and the two-variables model, as theoretically expected. However, the \(F\)-score significance of the x1 predictor in the two-variables model – much higher than that of the one-variable one shows us that the addition of x1 to x2 among the predictors reduced overall unexplained variance much better than the addition of x2 alone over the intercept-only model. This is further confirmed by looking at the spread of partial residuals around their mean for the one-variable model vs the two-variables model w.r.t. x2.

cor(x1,x2)    # negative correlation
## [1] -0.4584511
coef(y.lm)
## (Intercept) factor(x2)1 
##   3.5611964  -0.3207703
coef(y.lm2)
## (Intercept)          x1 factor(x2)1 
##   -0.330766    5.479133    1.031886
cor(x1,x2_1)  # positive correlation
## [1] 0.5406829
coef(y_1.lm)
##   (Intercept) factor(x2_1)1 
##      2.641332      2.336463
coef(y_1.lm2)
##   (Intercept)            x1 factor(x2_1)1 
##    -0.0187044     5.0444277     0.9626073
cor(x1,x2_2)  # almost zero correlation
## [1] -0.09769845
coef(y_2.lm)
##   (Intercept) factor(x2_2)1 
##     3.2390396     0.7012496
coef(y_2.lm2)
##   (Intercept)            x1 factor(x2_2)1 
##     0.1353746     4.7872666     0.9897907

After a deeper analysis of the slope coefficient for the x2 predictor w.r.t. its correlation with x1, we get that the sign of the correlation dominates that of the original generative process in such cases in which correlation among predictors is significant (1st, 2nd) and the most explanatory variable is omitted (x2-only models).

In cases with the same generative process, but both variables present among the chosen predictors, this is generally no longer always true and signs are determined in good agreement with the original generative model.

In the third case, lack of significant correlation among the two predictors, and the peculiar generative process, makes the specific determination of the sign of the slope for x2 dominated by random noise. Additionally, such coefficients do not differ much among the one- and the two- variable models. This last observation is further put in evidence by the last (3rd) row of partial residual plots. Indeed, we can see that there is no significant difference in the mean of the partial residuals around x2 between the full and the censored model.

par(mfrow=c(1,3))
termplot(y.lm, partial.resid=TRUE, smooth=panel.smooth,
         col.res="gray30",main = "y ~ factor(x2)", ylim = c(-3,2))
termplot(y.lm2, partial.resid=TRUE, smooth=panel.smooth,
         col.res="gray30",main = "y ~ x1 + factor(x2)",ylim = c(-3,2))

par(mfrow=c(1,3))
termplot(y_1.lm, partial.resid=TRUE, smooth=panel.smooth,
         col.res="blue",main = "y_1 ~ factor(x2_1)",ylim = c(-3,2))
termplot(y_1.lm2, partial.resid=TRUE, smooth=panel.smooth,
         col.res="blue",main = "y_1 ~ x1 + factor(x2_1)",ylim = c(-3,2))

par(mfrow=c(1,3))
termplot(y_2.lm, partial.resid=TRUE, smooth=panel.smooth,
         col.res="green",main = "y_2 ~ factor(x2_2)",ylim = c(-3,2))

termplot(y_2.lm2, partial.resid=TRUE, smooth=panel.smooth,
         col.res="green",main = "y ~ x1_2 + factor(x2_2)",ylim = c(-3,2))

Exercises from DAAG, Chapter 8

Exercise 1

Text

It is given as a dataset the numbers of occasions when inhibition (i.e. no flow of current across a membrane) occurred within \(120\)s, for different concentrations of the protein peptide-C (data are used with the permission of Claudia Haarmann, who obtained these data in the course of her PhD research). The outcome yes implies that inhibition has occurred.
Use logistic regression to model the probability of inhibition as a function of protein concentration.

Solution

# LIBRARIES
library(ggplot2)

# DATA:
conc <- c(0.1, 0.5, 1.0, 10.0, 20.0, 30.0, 50.0, 70.0, 80.0, 100.0, 150.0)
no   <- c(7,   1,  10,    9,    2,    9,   13,    1,    1,     4,     3  )
yes  <- c(0,   0,   3,    4,    0,    6,    7,    0,    0,     1,     7  )


# COMPOSITES
yes_probab <- yes/(yes+no)
no_probab  <- no/(yes+no)


# Exploratory plot:

plot(conc, yes_probab)      # No obvious pattern...

plot(conc, no_probab)       # ... emerges in the data! :(

# Blind-fitting with total-count weighting

fit_out <- glm(yes_probab ~ conc,                # At least yes_probab it's increasing on average
               family=binomial(link="logit"),
               weights = (yes+no)                # and let the magic happen!
               )


synthdata <- data.frame(conc = seq(min(conc), max(conc), len=100))
synthdata$yes_probab = predict(fit_out, newdata = synthdata, type="response")

plot(yes_probab~conc)
lines(yes_probab~conc, synthdata)

Comments

Following an orthodox interpretation of the exercise text, we are required to fit a logistic model that tries to explain or predict the probability of inhibition as a function of concentration – assuming that the counts shown in the datatable refer to different, independent \(120\text{s}\)-long observations.

Upon visual analysis of the scatterplot of the data, it appears that a global trend in the data is hard to be noticed – or anyway extremely difficult to model in functional form –, and both up- and down-swinging trends are present as the concentration increases (apart from the last point, which is the highest in probability of the whole dataset).

The best we can do in this case is to fit a logistic regression model on untransformed data, directly for the continuous variable that computes the probability of inhibition, using the total number of yes/no observations per concentration as logistic regression weights.

The result is acceptable, but still corrupted by noise.

Exercise 2

Text

In the dataset (an artificial one of \(3121\) patients, that is similar to a subset of the data analyzed in Stiell et al., 2001) minor.head.injury, obtain a logistic regression model relating clinically.important.brain.injury to the other variables. Patients whose risk is sufficiently high will be sent for CT. Using a risk threshold of \(0.025\) (\(2.5\%\)), turn the result into a decision rule for use of CT.

Solution

library(DAAG)

# load dataset
data <- DAAG::head.injury

# exploring data
str(data) # they are binary data...
## 'data.frame':    3121 obs. of  11 variables:
##  $ age.65                           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ amnesia.before                   : num  1 0 0 0 0 1 0 0 0 1 ...
##  $ basal.skull.fracture             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GCS.decrease                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GCS.13                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GCS.15.2hours                    : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ high.risk                        : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ loss.of.consciousness            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ open.skull.fracture              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ vomiting                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ clinically.important.brain.injury: num  0 0 0 0 0 0 0 0 0 0 ...
# perform glm
model_glm <- glm(clinically.important.brain.injury ~ .,
                 family = binomial(link = "logit"),
                 data = data)

#summary
summary(model_glm)
## 
## Call:
## glm(formula = clinically.important.brain.injury ~ ., family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2774  -0.3511  -0.2095  -0.1489   3.0028  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.4972     0.1629 -27.611  < 2e-16 ***
## age.65                  1.3734     0.1827   7.518 5.56e-14 ***
## amnesia.before          0.6893     0.1725   3.996 6.45e-05 ***
## basal.skull.fracture    1.9620     0.2064   9.504  < 2e-16 ***
## GCS.decrease           -0.2688     0.3680  -0.730 0.465152    
## GCS.13                  1.0613     0.2820   3.764 0.000168 ***
## GCS.15.2hours           1.9408     0.1663  11.669  < 2e-16 ***
## high.risk               1.1115     0.1591   6.984 2.86e-12 ***
## loss.of.consciousness   0.9554     0.1959   4.877 1.08e-06 ***
## open.skull.fracture     0.6304     0.3151   2.001 0.045424 *  
## vomiting                1.2334     0.1961   6.290 3.17e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1741.6  on 3120  degrees of freedom
## Residual deviance: 1201.3  on 3110  degrees of freedom
## AIC: 1223.3
## 
## Number of Fisher Scoring iterations: 6
# turn the result into a decision rule for use of CT
judge_function_1 <- function(patients, threshold){
  n_coeff <- length(model_glm$coefficients) # number of estimated coefficients

  p <- model_glm$coefficients[1] + patients %*% model_glm$coefficients[2:n_coeff]
  logistic_prob <- exp(p) / (1 + exp(p))

  return(logistic_prob < threshold) # return the if condition on the threshold
}

judge_function_2 <- function(patients, threshold){
  logit <-log(threshold/(1 - threshold))
  intercept <- model_glm$coefficients[1]

  CT_data <- data.frame("coeff" = model_glm$coefficients[-1], "patient" = t(patients)); print(CT_data)
  CT_rule <- data.frame("Patient_risk" = patients %*% model_glm$coefficients[-1],
                        "CT_threshold" = logit - intercept,
                        "CT" = patients %*% model_glm$coefficients[-1] > logit - intercept)
  rownames(CT_rule) <- colnames(CT_data[, -1]); print(CT_rule)
}

# number of columns without response column
n <- length(model_glm$coefficients[-1])

# check correctness of the judge function
 x <-  sapply(data,'[[', 1)[-11];  # a patient took from the dataset

# first patient
set.seed(69)
x <- c(x, sample(c(0,1), replace = TRUE, size = n))

# second patient
set.seed(50)
x <- c(x, sample(c(0,1), replace = TRUE, size = n))

# patients table
x <- matrix(x, nrow = 3, ncol = n, byrow = TRUE)

# judge_function_1(patients = x, threshold = 0.025)
judge_function_2(patients = x, threshold = 0.025)
##                            coeff patient.1 patient.2 patient.3
## age.65                 1.3734497         0         1         1
## amnesia.before         0.6892694         1         0         0
## basal.skull.fracture   1.9619970         0         0         1
## GCS.decrease          -0.2687866         0         1         0
## GCS.13                 1.0612978         0         1         0
## GCS.15.2hours          1.9407526         0         1         1
## high.risk              1.1114875         0         0         1
## loss.of.consciousness  0.9554175         0         0         0
## open.skull.fracture    0.6304400         0         0         0
## vomiting               1.2334196         0         0         1
##           Patient_risk CT_threshold    CT
## patient.1    0.6892694    0.8336726 FALSE
## patient.2    4.1067135    0.8336726  TRUE
## patient.3    7.6211065    0.8336726  TRUE

Comments

Almost all of the variables used as predictors are highly significant according to their coefficient p-values, with – in any case – some differences w.r.t. their overall relevance for the final result.

The difference between the null model and the full one is large, positively suggesting its effective use.

Finally, a table is provided to summarize the estimated coefficients and the patient information; additionally, another table is provided to compare symptoms, risk level for each patient, and the model-predicted decision whether to sent them for a CT check.

In order to operationalize such model and provide a simple decision rule for emergency unit responders, it is possible to transform the logistic model in a linearly-predicted logit model operating at fixed probability cutoff.

Since all the predictors are binary variables carrying information about the presence (\(1\)) or not (\(0\)) of a given symptom, it is possible to assign to each symptom a score corresponding to their model multiplicative coefficient. Finally, the ER unit workers just need to sum all the symptom scores according to their presence in a given patient showing the hospital.

If the overall sum exceeds a precomputed threshold, the patient needs to be sent to CT.

Such threshold \(T_p\) depends on the risk cutoff \(p\) in that: \[ T_p = \text{logit}(p) \ - \ \text{model intercept} \ = \ \text{log}\left(\frac{p}{1-p}\right) - \ \text{model intercept} \]

For our model, at \(p = 0.025\), \(T_p = 4.4972 - 3.6635 = 0.8337\). Coefficients are shown in the tables adove.

Exercise 3

Text

Consider again the \(\mathsf{moths}\) dataset of Section 8.4.

  1. What happens to the standard error estimates when the Poisson family is used in \(\mathsf{glm()}\) instead of the Quasi-Poisson family?
  2. Analyze the P moths, in the same way as the A moths were analyzed. Comment on the effect of transect length.

Solution

# LIBRARIES
library(lattice)
library(DAAG)


# TABLE
rbind(Number = table(moths[,4]), sapply(split(DAAG::moths[,-4], DAAG::moths$habitat),
                                        apply, 2, sum))
##        Bank Disturbed Lowerside NEsoak NWsoak SEsoak SWsoak Upperside
## Number    1         7         9      6      3      7      3         5
## meters   21        49       191    254     65    193    116       952
## A         0         8        41     14     71     37     20        28
## P         4        33        17     14     19      6     48         8
# IND. TRANSECTS / MEANS
lattice::dotplot(habitat~P, data=DAAG::moths, xlab="Number of moths (species P)",

    panel = function(x, y, ...)
    {
        lattice::panel.dotplot(x, y, pch = 1, col = "black", ...)
        lattice::panel.average(x, y, pch = 3, cex = 1.25, type = "p", col = "gray45")
    },

    key = list(text = list(c("Individual transects", "Mean")),
               points  = list(pch = c(1,3), cex = c(1,1.25), col = c("black", "gray45")),
               columns = 2))

# QUASIPOISSON GLM, BANK REFLEVEL
pmoths_qpois <- glm(P ~ habitat + log(meters), family = quasipoisson, data = moths)
summary(pmoths_qpois)
## 
## Call:
## glm(formula = P ~ habitat + log(meters), family = quasipoisson, 
##     data = moths)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9479  -1.3155  -0.4167   0.5660   2.9719  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -0.3039     0.9584  -0.317  0.75326   
## habitatDisturbed   0.8794     0.8623   1.020  0.31547   
## habitatLowerside  -0.5569     0.8837  -0.630  0.53301   
## habitatNEsoak     -0.8780     0.9105  -0.964  0.34212   
## habitatNWsoak      0.4537     0.8746   0.519  0.60753   
## habitatSEsoak     -1.6266     1.0278  -1.583  0.12336   
## habitatSWsoak      1.0814     0.8348   1.295  0.20444   
## habitatUpperside  -2.1342     1.0481  -2.036  0.05007 . 
## log(meters)        0.5552     0.1759   3.157  0.00347 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.527554)
## 
##     Null deviance: 217.819  on 40  degrees of freedom
## Residual deviance:  76.924  on 32  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6
# Analysis with tighter convergence criterion: omitted as no 0-count observed!


# QUASIPOISSON GLM, UPPERSIDE REFLEVEL (Pareto-symplex optimal significance)
moths$habitat_rel <- relevel(DAAG::moths$habitat, ref="Upperside")
pmoths_qpois_rel  <- glm(P ~ habitat_rel + log(meters), family = quasipoisson, data = moths)
summary(pmoths_qpois_rel)
## 
## Call:
## glm(formula = P ~ habitat_rel + log(meters), family = quasipoisson, 
##     data = moths)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9479  -1.3155  -0.4167   0.5660   2.9719  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.4381     1.0813  -2.255  0.03112 *  
## habitat_relBank        2.1342     1.0481   2.036  0.05007 .  
## habitat_relDisturbed   3.0137     0.8509   3.542  0.00124 ** 
## habitat_relLowerside   1.5773     0.7757   2.033  0.05038 .  
## habitat_relNEsoak      1.2562     0.7508   1.673  0.10406    
## habitat_relNWsoak      2.5879     0.7711   3.356  0.00205 ** 
## habitat_relSEsoak      0.5076     0.9200   0.552  0.58497    
## habitat_relSWsoak      3.2156     0.6675   4.817 3.38e-05 ***
## log(meters)            0.5552     0.1759   3.157  0.00347 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.527554)
## 
##     Null deviance: 217.819  on 40  degrees of freedom
## Residual deviance:  76.924  on 32  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6
# THE ACTUAL EXERCISE:

# (a): POISSON GLM, UPPERSIDE REFLEVEL
moths$habitat_rel <- relevel(DAAG::moths$habitat, ref="Upperside")
pmoths_pois_rel  <- glm(P ~ habitat_rel + log(meters), family = poisson, data = moths)
summary(pmoths_pois_rel)    # Shrinkage of C.I.s as expected!
## 
## Call:
## glm(formula = P ~ habitat_rel + log(meters), family = poisson, 
##     data = moths)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9479  -1.3155  -0.4167   0.5660   2.9719  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.4381     0.6801  -3.585 0.000337 ***
## habitat_relBank        2.1342     0.6593   3.237 0.001207 ** 
## habitat_relDisturbed   3.0137     0.5352   5.631 1.79e-08 ***
## habitat_relLowerside   1.5773     0.4879   3.233 0.001226 ** 
## habitat_relNEsoak      1.2562     0.4723   2.660 0.007816 ** 
## habitat_relNWsoak      2.5879     0.4850   5.336 9.52e-08 ***
## habitat_relSEsoak      0.5076     0.5787   0.877 0.380404    
## habitat_relSWsoak      3.2156     0.4199   7.659 1.88e-14 ***
## log(meters)            0.5552     0.1106   5.018 5.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 217.819  on 40  degrees of freedom
## Residual deviance:  76.924  on 32  degrees of freedom
## AIC: 187.15
## 
## Number of Fisher Scoring iterations: 6
# (b): QUASIPOISSON GLM, UPPERSIDE REFLEVEL
moths$habitat_rel <- relevel(DAAG::moths$habitat, ref="Upperside")
pmoths_qpois_rel  <- glm(P ~ habitat_rel + log(meters), family = quasipoisson, data = moths)
summary(pmoths_qpois_rel)   # Significant transect length!
## 
## Call:
## glm(formula = P ~ habitat_rel + log(meters), family = quasipoisson, 
##     data = moths)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9479  -1.3155  -0.4167   0.5660   2.9719  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.4381     1.0813  -2.255  0.03112 *  
## habitat_relBank        2.1342     1.0481   2.036  0.05007 .  
## habitat_relDisturbed   3.0137     0.8509   3.542  0.00124 ** 
## habitat_relLowerside   1.5773     0.7757   2.033  0.05038 .  
## habitat_relNEsoak      1.2562     0.7508   1.673  0.10406    
## habitat_relNWsoak      2.5879     0.7711   3.356  0.00205 ** 
## habitat_relSEsoak      0.5076     0.9200   0.552  0.58497    
## habitat_relSWsoak      3.2156     0.6675   4.817 3.38e-05 ***
## log(meters)            0.5552     0.1759   3.157  0.00347 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.527554)
## 
##     Null deviance: 217.819  on 40  degrees of freedom
## Residual deviance:  76.924  on 32  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6
# Distinguishability analysis with deviances: omitted as no "problematic" habitat!

Comments

  1. As we can see from the printed results – and as also noted in Exercise DAAG 8.6 – standard error intervals estimated from a Poisson regression model are narrower w.r.t. Quasi-Poisson ones. This can be explained as a positive dispersion coefficient (like the one estimated in this case for the quasi-Poisson model) inflates error bounds.

  2. [paragraphs that follow]

By omitting the portions of data analysis directly determined by the observed \(0\)-count for the Bank habitat (in the case of A moths), the analysis is compact and contained in the code snipped above.

With a difference w.r.t. the analysis shown in the book, the reference level among different habitats has been chosen in order to maximize overall significance for the coefficients in the linear model contained in the (quasi-)Poisson one. Since no clear dominance exists in such choice, the closest to Pareto-front has been chosen (symplex-area method). Regardless of such criterion, all observations contained in the following are invariant to such choice.

From the summary of the fitted GLM we can observe that the \(log(\text{length})\) term has a significant coefficient (\(p < 0.004\)). The effect of transect length on moth count is therefore relevant (definitely more relevant in comparison to the A moths). This can show that – w.r.t. A moths, P moths enjoy longer flights and/or pass significantly many (more) times across transects during their aerial wanderings.

Exercise 6

Text

The function \(\mathsf{poissonsim()}\) allows for experimentation with Poisson regression.
In particular, \(\mathsf{poissonsim()}\) can be used to simulate Poisson responses with log-rates equal to \(a + bx\), where \(a\) and \(b\) are fixed values by default.

  1. Simulate \(100\) Poisson responses using the model \(log( \lambda) = 2 − 4x\) for \(x = 0, 0.01, 0.02,\dots, 1.0\).
    Fit a Poisson regression model to these data, and compare the estimated coefficients with the true coefficients. How well does the estimated model predict future observations?

  2. Simulate \(100\) Poisson responses using the model \(log( \lambda) = 2 − bx\) where \(b\) is normally distributed with mean \(4\) and standard deviation \(5\). [Use the argument \(\mathsf{slope.sd=5}\) in the \(\mathsf{poissonsim()}\) function.] How do the results using the poisson and quasipoisson families differ?

Solution

# LIBRARIES:
library(MASS)       # confint for GLMs
library(lattice)
library(DAAG)


x_grid <- seq(from=0.00, to=1.00, by=0.01)


### POISSON DATA ###


# Generation
poiss_a <- DAAG::poissonsim(x_grid, a = 2, b = -4)

# Fit (Poisson)
poiss_a_fit <- glm(poiss_a$y~poiss_a$x, family = poisson)

# Summary
summary(poiss_a_fit)
## 
## Call:
## glm(formula = poiss_a$y ~ poiss_a$x, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9684  -0.8463  -0.2382   0.3697   2.7572  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.8495     0.1130   16.37   <2e-16 ***
## poiss_a$x    -3.6008     0.3423  -10.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 232.249  on 100  degrees of freedom
## Residual deviance:  86.612  on  99  degrees of freedom
## AIC: 264.6
## 
## Number of Fisher Scoring iterations: 5
# C.I.s
confint(poiss_a_fit)
##                 2.5 %    97.5 %
## (Intercept)  1.622811  2.066056
## poiss_a$x   -4.291993 -2.948529
### OVERDISPERSED POISSON DATA ###


# Generation
poiss_b <- DAAG::poissonsim(x_grid, a = 2, b = -4, slope.sd = 5)

# Fit (Poisson)
poiss_b_fit  <- glm(poiss_b$y~poiss_b$x, family = poisson)

# Summary
summary(poiss_b_fit)
## 
## Call:
## glm(formula = poiss_b$y ~ poiss_b$x, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -14.025   -7.044   -4.260   -0.634   57.982  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.11704    0.06578   16.98   <2e-16 ***
## poiss_b$x    3.47148    0.08421   41.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12740  on 100  degrees of freedom
## Residual deviance: 10540  on  99  degrees of freedom
## AIC: 10789
## 
## Number of Fisher Scoring iterations: 7
# C.I.s
confint(poiss_b_fit)
##                 2.5 %   97.5 %
## (Intercept) 0.9866542 1.244533
## poiss_b$x   3.3075740 3.637705
# Fit (Quasi-Poisson)
qpoiss_b_fit <- glm(poiss_b$y~poiss_b$x, family = quasipoisson)

# Summary
summary(qpoiss_b_fit)
## 
## Call:
## glm(formula = poiss_b$y ~ poiss_b$x, family = quasipoisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -14.025   -7.044   -4.260   -0.634   57.982  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1.117      1.048   1.066   0.2892  
## poiss_b$x      3.471      1.342   2.587   0.0111 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 253.9653)
## 
##     Null deviance: 12740  on 100  degrees of freedom
## Residual deviance: 10540  on  99  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 7
# C.I.s
confint(qpoiss_b_fit)
##                 2.5 %   97.5 %
## (Intercept) -1.349649 2.847922
## poiss_b$x    1.080532 6.451156

Comments

  1. [paragraphs that follow]

Preliminarily to a more in-depth analysis, it is worth noting that – w.r.t. to the point we are currently answering to – we are performing a Poisson regression fit to data generated according to a Poisson regression generative model.

Far from being that obvious, this allows us to preliminarily state that what we are trying to accomplish is a posteriori (information-theoretically) the most efficient fitting procedure to predict the expected count numbers from the only available predictor (\(x\)) of the synthetic phenomenon we are dealing with.

A more precise consistency check involves the analysis of the standard error estimates (obtained by the means of the integrated fitting routine \(\mathsf{glm}\)) and approximated-normal confidence intervals obtained via profile likelihood estimation thanks to the MASS::confint() function.

As we can see from the results shown above, the estimated coefficients are always included in the \(95\%\) symmetric C.I.s around the estimate. This is also true for the \(\text{estimate}\ \pm\ SE\) interval for the slope coefficient and sometimes also for the intercept. In cases the latter is not, anyway, the difference is always of minor entity (and concern).

To analyze the robustness of the fitted model for \(x \rightarrow +\infty\) (i.e. future values) it is possible to study the (estimated) rate \(\hat{\lambda}\) of the (estimated) Poisson distribution from which we assume our counts to be sampled from.

In fact, \(\hat{\lambda} = e^{\hat{a} + x\hat{b}}\), and it is sufficient to consider that \(\hat{a} > 0\) and \(\hat{b} < 0\) for now. Furthermore, as \(x \rightarrow +\infty, \ \hat{\lambda} \rightarrow 0\) and \(E_{pois}[\text{counts}] \propto \lambda\).

From that – or also via interval-bound propagation – it is possible to show that, as long as the previous inequalities hold true, the difference in predicted counts converges to zero as \(x \rightarrow +\infty\) regardless of the specific value of the estimated parameters \(\hat{a} > 0\) and \(\hat{b} < 0\). This makes future estimations robust to stochastic noise.

  1. [paragraphs that follow]

As far as the second phenomenon and model are concerned, the following are the similarities and differences among the Poisson-family and Quasi-Poisson family regressions.

  • As expected, point-estimates for the (quasi-)Poisson regression coefficients are the same in both cases, since Quasi-Poisson regression model is the quasi-likelihood model associated to the Poisson one, with non-locked dispersion;

  • Both the confidence intervals and the standard errors shown for the Quasi-Poisson model are broader w.r.t. the Poisson ones;

  • The AIC is not shown for the Quasi-Poisson model. Settling a seemingly popular debate, such behaviour is a design choice of R Core Team for any quasi-likelihood model fitted using the \(\mathsf{glm}\) function.

Overall, it can be said that Quasi-Poisson regression accounts for (in this case) increased dispersion of the data via broadening of C.I.s.