Exercise 1
Write a function binomial(x, n, p) for the binomial distribution, depending on the parameters x, n, p, and test it with some prespecified values. Use the function choose() for the binomial coefficient.
Plot two binomials with \(n=20\), and \(p=0.3, 0.6\) respectively.
Solution
# Function
binomial <- function(x, n, p)
{
return(choose(n, x) * p ^ x * (1 - p) ^ (n - x))
}
# Test the function
binomial(3, 6, 0.5)
## [1] 0.3125
dbinom(3, 6, 0.5)
## [1] 0.3125
binomial(5, 100, 0.1)
## [1] 0.0338658
dbinom(5, 100, 0.1)
## [1] 0.0338658
binomial(5, 10, 0.5)
## [1] 0.2460938
dbinom(5, 10, 0.5)
## [1] 0.2460938
binomial(1, 2, 0.5)
## [1] 0.5
dbinom(1, 2, 0.5)
## [1] 0.5
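As an additional check (a minimal sketch), the custom function can be compared with the built-in dbinom() over a whole grid of values at once:
# Vectorized comparison with dbinom(): should return TRUE
all.equal(binomial(0:20, 20, 0.3), dbinom(0:20, 20, 0.3))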
# Plot
x <- 1:20
y1 <- binomial(x, 20, 0.3)
y2 <- binomial(x, 20, 0.6)
plot(x, y1)
points(x, y2, col = "red")
Exercise 2
Generate in R the same output shown in the Lab lecture for the Negative Binomial distribution, but using rgeom() to generate the random variables. Hint: generate \(n\) samples of three geometric random variables \(X_1, \ldots, X_3\) with \(p=0.08\), store them in a matrix and then compute the sum \(Y\).
Solution
We can obtain a Negative Binomial random variable \(Y \sim \mbox{NB}(k, p)\) (the number of failures before the \(k\)-th success) as the sum of \(k\) i.i.d. Geometric random variables \(X_i \sim \mbox{Geom}(p)\) with the same \(p\).
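Before reproducing the lecture output with histogram averaging, a quick sanity check (a minimal sketch) compares the sample mean of such sums with the theoretical Negative Binomial mean \(k(1-p)/p\):
# Sketch: sums of k = 3 iid Geom(p) variables should have mean k*(1-p)/p = 34.5
y_check <- colSums(matrix(rgeom(3 * 10000, prob = 0.08), nrow = 3))
c(empirical = mean(y_check), theoretical = 3 * (1 - 0.08) / 0.08)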
# Hyperparameters
k <- 3
p <- 0.08
upbound <- 1000
# Exact Negative Binomial pmf
plot(0:upbound,
dnbinom(0:upbound, size = k, prob = p),
xlab="Y=number of failures before k successes", ylab="f(x), exact",pch=21, bg=1)
# As sum of k Geometric rvs, via histogram estimation with count averaging
avg_factor <- 1500 # number of histogram replications averaged over
bin_fact <- 1      # histogram bin width
seq_stencil <- seq(from = 0, to = upbound, by = bin_fact)
mycounts <- vector(mode = "numeric", upbound/bin_fact)
# Averaging
for (cnt in 1:avg_factor)
{
mycounts = mycounts +
hist(
# k*(upbound + 1) geometric variates are generated at once via the vectorized `rgeom`
# function, arranged into an (upbound + 1) x k matrix, and summed row-wise to obtain
# (upbound + 1) realizations of Y (the number of failures before k successes).
rowSums(matrix(rgeom(k * (upbound + 1), prob = p), nrow = (upbound + 1), ncol = k)),
breaks = seq_stencil,
plot = FALSE)$density
}
# Normalization
myedpf = mycounts/avg_factor
plot(seq_stencil[1:(length(seq_stencil) - 1)], myedpf, # the number of bins is one less than the number of breaks
xlab="Y=number of failures before k successes", ylab="f(x), approx.",pch=21, bg=1)
Exercise 3
Show in R, also graphically, that \(\mbox{Gamma}(n/2, 1/2)\) coincides with a \(\chi^{2}_{n}\).
Find the 5% and the 95% quantiles of a \(\mbox{Gamma}(3,3)\).
Solution
# Plot
df <- 5
n <- 1000
y <- rchisq(n, df)
hist(y, breaks = 40, probability = TRUE)
curve(
dgamma(x, rate = 1 / 2, shape = df / 2),
col = "red",
lwd = 2,
add = TRUE
)
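Beyond the graphical check, the identity can also be verified numerically (a small sketch): the two exact densities agree on a grid of points.
# dchisq(x, n) should equal dgamma(x, shape = n/2, rate = 1/2)
grid <- seq(0, 25, by = 0.5)
all.equal(dchisq(grid, df = df), dgamma(grid, shape = df / 2, rate = 1 / 2))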
# Quantiles
qgamma(0.05, 3, 3)
## [1] 0.2725638
qgamma(0.95, 3, 3)
## [1] 2.098598
Exercise 4
Generate \(n=1000\) values from a \(\mbox{Beta}(5,2)\) and compute the sample mean and the sample variance.
Solution
n <- 1000
x <- rbeta(n, 5, 2)
mean(x)
## [1] 0.7221582
var(x)
## [1] 0.02367058
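For comparison (a small sketch), the theoretical moments of a \(\mbox{Beta}(\alpha, \beta)\) are \(E[X] = \alpha/(\alpha+\beta)\) and \(var(X) = \alpha\beta/\big((\alpha+\beta)^2(\alpha+\beta+1)\big)\):
# Theoretical mean and variance of Beta(5, 2)
5 / (5 + 2)                          # 0.7142857...
(5 * 2) / ((5 + 2)^2 * (5 + 2 + 1))  # 0.0255102...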
Exercise 5
Show with a simple R function that a negative binomial distribution may be seen as a mixture between a Poisson and a Gamma. In symbols: \(X|Y \sim \mathcal{P}(Y)\), \(Y \sim \mbox{Gamma}(\alpha, \beta)\), then \(X \sim \ldots\).
Solution
We can see that if \(X|Y \sim \mathcal{P}(Y)\) and \(Y \sim \mbox{Gamma}(\alpha, \beta)\) (shape \(\alpha\), rate \(\beta\)), then \(X \sim \mbox{NB}(\alpha, p)\), a Negative Binomial with size \(\alpha\) and success probability \(p = \frac{\beta}{\beta+1}\) (equivalently, failure probability \(\frac{1}{\beta+1}\)).
n <- 200000
r <- 0.5
p <- 0.5
nb_mixture <- function(my_n, my_r, my_p)
{
# Note: here we use the shape/scale parametrization!
Z = rgamma(my_n, shape = my_r, scale = my_p/(1-my_p))
X = rpois(my_n, lambda = Z)
return(X)
}
plot(hist(nb_mixture(n, r, p), breaks=15, plot = FALSE)$counts/n, xlab = "x", ylab = "NegBin(x)")
points(dnbinom((0:15), r,p), col='red')
Exercise 6
Instead of using the built-in function ecdf(), write your own R function for the empirical cumulative distribution function and reproduce the two plots shown during the Lab lecture.
Solution
# Function to compute eCDF
my_ecdf <- function(x)
{
z <- c()
y <- sort(x)
for (i in 1:length(y))
{
z[i] <- i / length(y)
}
z <- cbind(z, y)
return(z)
}
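Although the exercise asks not to use it for the main computation, the built-in ecdf() is still handy for a quick consistency check (a minimal sketch):
# my_ecdf() should agree with stats::ecdf() at the sorted sample points
z_test <- rnorm(20)
my_vals <- my_ecdf(z_test)
all.equal(as.numeric(my_vals[, "z"]), ecdf(z_test)(sort(z_test)))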
# Function to plot eCDF
plot_ecdf <- function(x)
{
plot(
x[, 2],
x[, 1],
type = "s",
ylim = c(-0.01, 1.01),
xlab = "x",
ylab = "Fn(x)",
main = "ECDF and CDF: n=50"
)
abline(0, 0, col = "gray", lty = 2)
abline(1, 0, col = "gray", lty = 2)
}
set.seed(2)
par(mfrow = c(1, 2))
# n = 50
n <- 50
y <- rbeta(n, 3, 4)
tt <- seq(from = 0, to = 1, by = 0.01)
my_edf_beta <- my_ecdf(y)
plot_ecdf(my_edf_beta)
lines(tt,
pbeta(tt, 3, 4),
col = 2,
lty = 2,
lwd = 2)
# n = 500
n2 <- 500
y2 <- rbeta(n2, 3, 4)
my_edf_beta2 <- my_ecdf(y2)
plot_ecdf(my_edf_beta2)
lines(tt,
pbeta(tt, 3, 4),
col = 2,
lty = 2,
lwd = 2)
Exercise 7
Compare in R the assumption of normality for these samples:
\(y_1, \ldots, y_{100} \sim t_{\nu}\), with \(\nu = 5, 20, 100\). What happens when the number of degrees of freedom \(\nu\) increases?
\(y_1, \ldots, y_{100} \sim \mbox{Cauchy}(0,1)\). Do you notice anything odd about the extreme quantiles?
Solution
To compare a sample against the assumption of normality, one can use a Q-Q plot.
From the Q-Q plots that follow, for Student's \(t\) samples with an increasing number of degrees of freedom, we can see that the \(t\) distribution is better and better approximated by the Normal (up to asymptotic convergence, as the theory shows) as the number of degrees of freedom grows. In particular, the deviations of the Q-Q points from the theoretical Q-Q line decrease as the number of d.o.f. increases.
par(mfrow = c(1, 3))
n <- 100
# t-Student df = 5
y <- rt(n, df = 5)
qqplot(
qt(ppoints(n), df = 5),
y,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for t-student(5): n=100",
ylim = c(-5, 5),
xlim = c(-5, 5)
)
qqline(
y,
distribution = function(p)
qt(p, df = 5),
col = "red"
)
# t-Student df = 20
y2 <- rt(100, 20)
qqplot(
qt(ppoints(n), 20),
y2,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for t-student(20): n=100",
ylim = c(-5, 5),
xlim = c(-5, 5)
)
qqline(
y2,
distribution = function(p)
qt(p, df = 20),
col = "red"
)
# t-Student df = 100
y3 <- rt(100, 100)
qqplot(
qt(ppoints(n), 100),
y3,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for t-student(100): n=100",
ylim = c(-5, 5),
xlim = c(-5, 5)
)
qqline(
y3,
distribution = function(p)
qt(p, df = 100),
col = "red"
)
Looking at the quantiles of \(y_1, \ldots, y_{100} \sim \mbox{Cauchy}(0,1)\), we can see that the majority of the points lie on the theoretical Q-Q line, with values close to one another. Nevertheless, the extreme Q-Q points deviate more and more from the theoretical line. This is a consequence of the slow (\(\sim 1/x^2\)) decay of the Cauchy density in its tails, which is also responsible for the distribution having no finite mean or variance.
# Cauchy
y4 <- rcauchy(100, 0, 1)
qqplot(
qcauchy(ppoints(n), 0, 1),
y4,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for Cauchy(0,1): n=100"
)
qqline(
y4,
distribution = function(p)
qcauchy(p, 0, 1),
col = "red"
)
Exercise 8
Write a general R function for checking the validity of the central limit theorem. Hint: the function will take two parameters, clt_function <- function(n, distr), where the first one is the sample size and the second one is the kind of distribution from which you generate. Use plots to visualize the results.
Solution
clt_function <- function(n, dist) {
x <- seq(1, n)
sample <- switch(
dist,
"beta" = rbeta(x, 2, 5),
"binomial" = rbinom(x, n, 0.5),
"chisquared" = rchisq(x, 5),
"gamma" = rgamma(x, 3),
"negativebinomial" = rnbinom(x, 3, 0.5),
"tstudent" = rt(x, 5)
)
# Plot the empirical distribution of the sample
hist(
sample,
freq = FALSE,
main = "CLT",
border = "red",
nclass = 50
)
# Normal
mu <- mean(sample)
var <- var(sample)
curve(dnorm(x, mu, sqrt(var)),
add = TRUE,
col = "black",
lwd = 2)
}
# Plots
par(mfrow = c(2, 2))
clt_function(1000, "beta")
clt_function(1000, "binomial")
clt_function(1000, "chisquared")
clt_function(1000, "tstudent")
Exercise 4
For the data frame ais (DAAG package):
Use the function str() to get information on each of the columns. Determine whether any of the columns hold missing values.
Make a table that shows the numbers of males and females for each different sport. In which sports is there a large imbalance (e.g., by a factor of more than 2:1) in the numbers of the two sexes?
Solution
library(DAAG)
library(data.table) # setDT
data <- ais
str(data)
## 'data.frame': 202 obs. of 13 variables:
## $ rcc : num 3.96 4.41 4.14 4.11 4.45 4.1 4.31 4.42 4.3 4.51 ...
## $ wcc : num 7.5 8.3 5 5.3 6.8 4.4 5.3 5.7 8.9 4.4 ...
## $ hc : num 37.5 38.2 36.4 37.3 41.5 37.4 39.6 39.9 41.1 41.6 ...
## $ hg : num 12.3 12.7 11.6 12.6 14 12.5 12.8 13.2 13.5 12.7 ...
## $ ferr : num 60 68 21 69 29 42 73 44 41 44 ...
## $ bmi : num 20.6 20.7 21.9 21.9 19 ...
## $ ssf : num 109.1 102.8 104.6 126.4 80.3 ...
## $ pcBfat: num 19.8 21.3 19.9 23.7 17.6 ...
## $ lbm : num 63.3 58.5 55.4 57.2 53.2 ...
## $ ht : num 196 190 178 185 185 ...
## $ wt : num 78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
## $ sex : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
## $ sport : Factor w/ 10 levels "B_Ball","Field",..: 1 1 1 1 1 1 1 1 1 1 ...
# For each element of the data frame, is.na() returns TRUE if the
# element is missing (NA) and FALSE otherwise.
# sum() then counts the number of TRUE values, i.e. how many
# elements are NA.
cat("Number of elements that are NaN = ", sum(is.na(data)))
## Number of elements that are NaN = 0
# Create a table that summarizes the numbers of males and females for
# each sport, and define a new column, called "ratio", showing the
# ratio between the sexes.
# summarizing table
sex_frequency <- table(data$sport, data$sex)
# ratio vector: f / m
ratio <- sex_frequency[, "f"] / sex_frequency[, "m"]
# final table with ratio column
sex_frequency <- cbind(sex_frequency, ratio)
# convert table into dataframe
sex_frequency <- as.data.frame(sex_frequency)
# convert rows name in first column of the data frame
# and print dataframe
setDT(sex_frequency, keep.rownames = "sport")[]
## sport f m ratio
## 1: B_Ball 13 12 1.0833333
## 2: Field 7 12 0.5833333
## 3: Gym 4 0 Inf
## 4: Netball 23 0 Inf
## 5: Row 22 15 1.4666667
## 6: Swim 9 13 0.6923077
## 7: T_400m 11 18 0.6111111
## 8: T_Sprnt 4 11 0.3636364
## 9: Tennis 7 4 1.7500000
## 10: W_Polo 0 17 0.0000000
cat("Unbalanced Sport (ratio > 2):", sex_frequency$sport[sex_frequency$ratio > 2])
## Unbalanced Sport (ratio > 2): Gym Netball
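The check above only flags sports where females outnumber males by more than 2:1; a symmetric check (a small sketch) also covers the opposite imbalance which, from the table above, would additionally flag T_Sprnt and W_Polo:
# Imbalance in either direction: ratio > 2 (female-heavy) or ratio < 1/2 (male-heavy)
cat("Unbalanced sports (ratio > 2 or < 1/2):",
sex_frequency$sport[sex_frequency$ratio > 2 | sex_frequency$ratio < 0.5])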
Exercise 6
Create a data frame called Manitoba.lakes that contains the lakes' elevation (in meters above sea level) and area (in square kilometers) as listed below. Assign the names of the lakes using the row.names() function.
attach(Manitoba.lakes)
plot(log2(area) ~ elevation, pch=16, xlim=c(170,280)) # NB: Doubling the area increases log2(area) by 1.0
text(log2(area) ~ elevation, labels=row.names(Manitoba.lakes), pos=4)
text(log2(area) ~ elevation, labels=area, pos=2)
title("Manitoba's Largest Lakes")
detach(Manitoba.lakes)
Devise captions that explain the labeling on the points and on the y-axis. It will be necessary to explain how distances on the scale relate to changes in area.
Repeat the plot and associated labeling, but now plot area versus elevation, specifying log="y" in order to obtain a logarithmic y-scale. Note: the log="y" setting carries across to the subsequent text() commands. See Subsection 2.1.5 for an example.
Solution
library(DAAG)
# Part: "preliminary"
Manitoba.lakes <- data.frame(
row.names = c(
"Winnipeg",
"Winnipegosis",
"Manitoba",
"SouthernIndian",
"Cedar",
"Island",
"Gods",
"Cross",
"Playgreen"
),
elevation = c(217., 254., 248., 254., 253., 227., 178., 207., 217),
area = c(24387., 5374., 4624., 2247., 1353., 1223., 1151., 755., 657.)
)
# Part: "(a)"
attach(Manitoba.lakes)
plot(log2(area) ~ elevation,
pch = 16,
xlim = c(170, 280))
# NB: Doubling the area increases log2(area) by 1.0
text(log2(area) ~ elevation,
labels = row.names(Manitoba.lakes),
pos = 4)
text(log2(area) ~ elevation, labels = Manitoba.lakes$area, pos = 2)
title("Manitoba’s Largest Lakes")
detach(Manitoba.lakes)
# Part: "(b)"
attach(Manitoba.lakes)
plot(
area ~ elevation,
pch = 16,
xlim = c(170, 280),
log = "y"
)
text(area ~ elevation,
labels = row.names(Manitoba.lakes),
pos = 4)
text(area ~ elevation, labels = Manitoba.lakes$area, pos = 2)
title("Manitoba’s Largest Lakes")
detach(Manitoba.lakes)
Exercise 11
Run the following code:
gender <- factor(c(rep("female", 91), rep("male", 92)))
table(gender)
gender <- factor(gender, levels=c("male", "female"))
table(gender)
gender <- factor(gender, levels=c("Male", "female"))# Note the mistake: "Male" should be "male"
table(gender)
table(gender, exclude=NULL)
rm(gender) # Remove gender
Explain the output from the successive uses of table().
Solution
We show below: the provided code fragments, their output, and a possible explanation.
A factor vector composed of 91 female and 92 male elements is created. The table() command shows this result, with the levels in their order of definition.
gender <- factor(c(rep("female", 91), rep("male", 92)))
table(gender)
## gender
## female male
## 91 92
As the levels grouping is specified, with male first and female second, the table() command adapts its output accordingly.
gender <- factor(gender, levels=c("male", "female"))
table(gender)
## gender
## male female
## 92 91
Because of a case mistake, we define a level ("Male" instead of "male") which does not match any value present in the vector. The "Male" column is therefore empty, and the 92 "male" entries become NA.
# Notice the mistake! "Male" should be "male".
gender <- factor(gender, levels=c("Male", "female"))
table(gender)
## gender
## Male female
## 0 91
In the following, the table() command is instructed to explicitly report NA (not available) entries (here, the values that did not match the supplied levels) as a separate, visible category.
table(gender, exclude=NULL)
## gender
## Male female <NA>
## 0 91 92
rm(gender)
Exercise 12
Write a function that calculates the proportion of values in a vector x that exceed some value cutoff.
a. Use the sequence of numbers \(1, 2, \ldots, 100\) to check that this function gives the result that is expected.
b. Obtain the vector ex01.36 from the Devore6 (or Devore7) package. These data give the times required for individuals to escape from an oil platform during a drill. Use dotplot() to show the distribution of times. Calculate the proportion of escape times that exceed 7 minutes.
Solution
library(DAAG)
library(Devore7)
# Part "(a)"
cutoff_prop <- function(x, cutoff)
{
return(sum(x > cutoff) / length(x))
}
my_linspace <- c(1:100)
my_cutoff <- 35
cutoff_prop(my_linspace, my_cutoff)
## [1] 0.65
# Part "(b)"
devore_times <- Devore7::ex01.36
dotplot(devore_times)
seven_min_in_sec <- 7*60
toolate_prop <- cutoff_prop(devore_times$C1, seven_min_in_sec)
toolate_prop
## [1] 0.03846154
Exercise 13
The following contains plots of four different transformations of the Animals data from the MASS package. What different aspects of the data do these different graphs emphasize? Consider the effect on low values of the variables, as contrasted with the effect on high values.
Solution
require(MASS)
data(Animals)
plot(brain~body, data=Animals) # Some linters may complain: that's fine.
In the previous scatterplot, both the brain and the body variables are represented on a linear scale, within the same plot area, with axis ranges slightly wider than the min-max range of the data.
The resulting plot compresses the small-brain/small-body data points near \((0,0)\), because there is also at least one very large-brain and one very large-body data point (which could even be the same animal, though here they are not). As a consequence, the plot emphasizes extremely large values in at least one coordinate: only six such data points are easily distinguishable by their coordinates. Proportions are preserved.
plot(sqrt(brain)~ sqrt(body), data=Animals) # Some linters may complain: that's fine.
In the same setting as before, the data are now plotted after a square-root transform of both coordinates.
This shrinks the axis ranges and alters proportions (according to the chosen transform). The presence of extremely large values in at least one coordinate is still visible. More small-brain/small-body data points are now distinguishable, provided they are not too close to \((0,0)\); some crowding of points remains for the smallest values.
plot(I(brain^0.1)~ I(body^0.1), data=Animals) # Some linters may complain: that's fine.
By further decreasing the exponent of the power-law transform of the axes (now the \(10^{\text{th}}\) root), the effect is amplified further. All data points are easily distinguishable, but proportionality is completely lost: there is no longer any direct visual evidence of extreme values in either coordinate of any data point. Some intuitive evidence emerges of a possible descriptive law in the power or exponential domain.
plot(log(brain)~ log(body), data=Animals) # Some linters may complain: that's fine.
On the bi-logarithmic scale, all the observations made about the previous plot still hold. Proportionality is still not easily visible, but the axes are now transformed in a (usually) more familiar manner. At this exploratory stage, the roughly linear pattern on the log-log scale qualitatively suggests a power-law relationship between the two variables, worth further quantitative analysis. Again, there is no easily noticeable evidence of extreme values.
Exercise 15
The data frame socsupport (from the DAAG package) has data from a survey on social and other kinds of support, for a group of university students. It includes Beck Depression Inventory (BDI) scores.
The following are two alternative plots of BDI against age:
plot(BDI ~ age, data=socsupport)
plot(BDI ~ unclass(age), data=socsupport)
For examination of cases where the score seems very high, which plot is more useful? Explain.
Why is it necessary to be cautious in making anything of the plots for students in the three oldest age categories (25-30, 31-40, 40+)?
Solution
For examination of cases where the score seems very high, the second plot is the more useful one. Both plots show how the observations, grouped by age, are distributed with respect to the BDI score; however, thanks to the ungrouping produced by unclass(), the second plot also shows the individual observations and hence the number of cases in each age group. This makes it possible to judge how reliable the data are for each group.
library(DAAG)
library(lattice)
data(socsupport)
par(mfrow=c(1,2))
plot(BDI ~ age, data = socsupport) # Some linters may complain: that's fine.
plot(BDI ~ unclass(age), data = socsupport) # Some linters may complain: that's fine.
It is necessary to be cautious in making anything of the plots for students in the three oldest age categories because, as we can see from the summary, there are very few observations in those classes. Hence the data may not be representative of the students of those ages.
summary(socsupport$age)
## 18-20 21-24 25-30 31-40 40+
## 44 35 6 6 4
Exercise 17
Given a vector x, the following demonstrates alternative ways to create a vector of numbers from \(1\) through \(n\), where \(n\) is the length of the vector:
x <- c(8, 54, 534, 1630, 6611)
seq(1, length(x))
seq(along=x)
Now set x <- NULL and repeat each of the calculations seq(1, length(x)) and seq(along=x). Which version of the calculation should be used in order to return a vector of length 0 in the event that the supplied argument is NULL?
Solution
In order to return a vector of length 0 when the supplied argument is NULL, you have to use the syntax seq(along=x).
In writing seq(1, length(x)), 1 is interpreted as the starting value of the sequence and length(x), which is equal to 0, as the end value. Since the end value is smaller than the start, the sequence counts down by \(-1\) (it is equivalent to 1:0), so a vector of two elements, 1 and 0, is created.
x <- c(8, 54, 534, 1630, 6611)
seq(1, length(x))
## [1] 1 2 3 4 5
seq(along = x)
## [1] 1 2 3 4 5
x <- NULL
seq(1, length(x))
## [1] 1 0
length(seq(1, length(x)))
## [1] 2
seq(along = x)
## integer(0)
length(seq(along = x))
## [1] 0
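A side note (not required by the exercise): base R also provides seq_along(), which behaves like seq(along = x) and is the idiom usually recommended for building loop indices.
x <- NULL
seq_along(x)          # integer(0), just like seq(along = x)
length(seq_along(x))  # 0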
Exercise 20
The help page for iris (type help(iris)) gives code that converts the data in iris3 (datasets package) to case-by-variable format, with column names Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. Look up the help pages for the functions that are used, and make sure that you understand them. Then add annotation to this code that explains each step in the computation.
Solution
library(datasets)
#help(iris)
dni3 <- dimnames(iris3) # save the dimnames of the object
# iris3
# given that iris3 is a 3-dimensional array 50 x 4 x 3
# the function dimnames returns the names of each one
# of the 3 dimensions of it.
# the output is a list whose first element is NULL
# because the rows of iris3 have no names;
# the second and third elements are character vectors that
# contain the names of the second and third dimensions
# of iris3
test <- aperm(iris3, c(1,3,2)) # stand-alone illustration of the permutation used below (not otherwise used)
ii <- data.frame( # create a dataframe from a matrix
matrix( # create a matrix from the transposition of
# the iris3 array
aperm(iris3, c(1,3,2)), # permute the 3-d array iris3:
# result: 50 x 3 x 4 array
ncol = 4, # having 4 columns
# set the names of the rows and columns of the matrix;
dimnames = list(NULL, # create a list, starting from a
# NULL object
sub(" L.",".Length",
#substitute "L." with ".Length"
# in the output of the following line
sub(" W.",".Width", dni3[[2]])))),
# substitute "W." with ".Width" in
# the strings contained in the list
# of the names of the second
# dimension of iris3: ("Sepal L."
# "Sepal W." "Petal L." "Petal W.")
# create a new column of the dataframe called Species
Species = gl( # generate factors given:
3, 50, # the number of levels (=3), the number of
# replications (=50)
labels = # and the labels of the levels
sub("S", "s", sub("V", "v", dni3[[3]]))))
# substitute "S" with "s" and "V" with "v" in in the
# strings contained in the list of the names of the
# third dimension of iris3
# ("Setosa" "Versicolor" "Virginica")
all.equal(ii, iris) # compare the created object and iris
## [1] TRUE
Exercise 1.1
An exponential random variable \(X \geq 0\) has pdf \(f(x) = \lambda e^{-\lambda x}\). Find its cdf and quantile function, and compute its mean and variance.
Solution
We know that its cdf is \(F(x)=Pr(X\leq x)\), so for \(x \geq 0\) we can compute: \[ F(x)=\int_{-\infty}^{0} f(t)\,dt+\int_{0}^{x}f(t)\,dt= \int_{0}^{x}\lambda e^{-\lambda t}\,dt=\left[-e^{-\lambda t}\right]_{0}^{x}= e^{-\lambda \cdot 0}-e^{-\lambda x}= 1-e^{-\lambda x} \]
The quantile function is its inverse, so we have:
\[ F=1-e^{-\lambda Q} \ \Rightarrow \ 1-F=e^{-\lambda Q} \ \Rightarrow \ Q=-\frac{\ln(1-F)}{\lambda} \]
The mean is: \[ E[X]=\int_{-\infty}^{+\infty}x f(x)\,dx= \int_{0}^{+\infty}x \lambda e^{-\lambda x}\,dx \] By using integration by parts, with
\[ g'(x)=\lambda e^{-\lambda x}, \ \ g(x)=-e^{-\lambda x}, \qquad f(x)= x, \ \ f'(x)=1, \\ \int_{0}^{+\infty} g'(x)f(x)\,dx = \left[g(x)f(x)\right]_{0}^{+\infty} - \int_{0}^{+\infty} g(x) f'(x)\,dx, \] we obtain: \[ E[X]=\left[-xe^{-\lambda x}\right]_{0}^{+\infty}+\int_{0}^{+\infty}e^{-\lambda x}\,dx=\left[\frac{-e^{-\lambda x}}{\lambda}\right]_{0}^{+\infty}=\frac{1}{\lambda} \] As for the variance, if the distribution admits a finite second moment then \(E[X]\) exists and \(var(X)=E[X^2]-E[X]^2\), where \[ E[X^2]=\int_{0}^{+\infty}x^2 \lambda e^{-\lambda x}\,dx \] By using integration by parts, we let
\[ g'(x)=\lambda e^{-\lambda x} \ \ g(x)=-e^{-\lambda x}\\ f'(x)=2x \ \ f(x)= x^2 \\ \]
And then, \[ E[X^2]=\left[-x^2e^{-\lambda x}\right]_{0}^{+\infty}+\int_{0}^{+\infty} 2x e^{-\lambda x}\,dx=\frac{2}{\lambda^2}, \] because we know that \(\int_{0}^{+\infty} x e^{-\lambda x}\,dx=\frac{1}{\lambda^2}\).
Thus, it directly follows that \(var(X)=E[X^2]-E[X]^2 = \frac{1}{\lambda^2}\).
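A quick numerical check in R (a minimal sketch, using an arbitrary rate \(\lambda = 2\)) agrees with these closed forms:
lambda <- 2
# cdf and quantile function match the expressions derived above
all.equal(pexp(1.3, rate = lambda), 1 - exp(-lambda * 1.3))
all.equal(qexp(0.75, rate = lambda), -log(1 - 0.75) / lambda)
# mean 1/lambda and variance 1/lambda^2 via numerical integration
integrate(function(x) x * dexp(x, rate = lambda), 0, Inf)$value                      # ~ 0.5
integrate(function(x) x^2 * dexp(x, rate = lambda), 0, Inf)$value - (1 / lambda)^2   # ~ 0.25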
Exercise 1.2
Evaluate \(Pr(X < 0.5, Y < 0.5)\) if \(X\) and \(Y\) have the joint pdf given below.
Solution
Let \[ f (x, y) =\begin{cases} x + \frac{3y^2} {2} &0<x<1 \wedge 0<y<1 \\ 0 &\text{otherwise} \end{cases} \]
\[ Pr(X < 0.5, Y < 0.5)=\int_{0}^{0.5}\int_{0}^{0.5}x + \frac{3y^2} {2}dxdy= \int_{0}^{0.5}\int_{0}^{0.5}x\ dxdy + \int_{0}^{0.5}\int_{0}^{0.5}\frac{3y^2}{2}dxdy = \frac{1}{2} \left({ \int_{0}^{0.5}x\ dx + \int_{0}^{0.5}\frac{3y^2}{2}\ dy }\right) \] At this point, we can compute: \(\int_{0}^{0.5}x\ dx = \frac{1}{8}\) and \(\int_{0}^{0.5}\frac{3y^2}{2}\ dy = \frac{1}{16}\).
It directly follows that \(Pr(X < 0.5, Y < 0.5)= \frac{1+2}{32} = \frac{3}{32} \approx0.094\)
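A simple Monte Carlo check in R (a sketch; the sample size is an arbitrary choice) agrees with the analytic value \(3/32 \approx 0.094\):
set.seed(1)
u <- runif(1e6)
v <- runif(1e6)
# P(X < 0.5, Y < 0.5) = E[ f(U,V) * 1{U < 0.5, V < 0.5} ] for (U,V) uniform on the unit square
mean((u + 3 * v^2 / 2) * (u < 0.5 & v < 0.5))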
Exercise 1.6
Let \(X\) and \(Y\) be non-independent random variables, such that \(var(X) =\sigma_{X}^{2}\), \(var(Y)=\sigma_{Y}^{2}\) and \(cov(X, Y) =\sigma_{X, Y}^{2}\). Using the result from Section 1.6.2, find \(var(X+Y)\) and \(var(X-Y)\).
Solution
If we define \(Z = X + Y\), then we can compute \(Var(Z)\) as follows:
\(Var(Z) = E[(Z - E[Z])^{2}] = E[(X + Y - E[X + Y])^{2}] = E[((X - E[X]) + (Y - E[Y]))^{2}] =\)
\(E[(X - E[X])^{2}] + E[(Y - E[Y])^{2}] + 2 \cdot E[(X - E[X]) \cdot (Y - E[Y])] = \sigma_{X}^{2} + \sigma_{Y}^{2} + 2 \cdot \sigma_{X, Y}^{2}\)
If we define \(Z' = X - Y\), then we can compute \(Var(Z')\) as follows:
\(Var(Z') = E[(Z' - E[Z'])^{2}] = E[(X - Y - E[X - Y])^{2}] = E[((X - E[X]) - (Y - E[Y]))^{2}] =\)
\(E[(X - E[X])^{2}] + E[(Y - E[Y])^{2}] - 2 \cdot E[(X - E[X]) \cdot (Y - E[Y])] = \sigma_{X}^{2} + \sigma_{Y}^{2} - 2 \cdot \sigma_{X, Y}^{2}\).
However, if we want to follow the section referenced in the hint, we may suppose that \(X\) and \(Y\) are the coordinates of a multivariate random vector \(W = \begin{bmatrix} X \\ Y \end{bmatrix}\) that has a generic distribution with covariance matrix \(\Sigma\).
So, let us define \(V_{1} = \begin{bmatrix} 1 & 1 \end{bmatrix}\) and \(V_{2} = \begin{bmatrix} 1 & -1 \end{bmatrix}\), two row vectors used to obtain the random variables \(Z = X + Y\) and \(Z' = X - Y\), respectively, via a linear transformation (a matrix multiplication in this case). Thus, we can obtain the variances of \(Z\) and \(Z'\) by applying the corresponding transformation rule to \(\Sigma\):
Case \(Z = X + Y\): \(Var(X + Y) = V_1 \cdot \Sigma \cdot V_1^{T} = \begin{bmatrix} 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} \sigma_{X}^{2} & \sigma_{X, Y}^{2} \\ \sigma_{X, Y}^{2} & \sigma_{Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} \sigma_{X}^{2} + \sigma_{X, Y}^{2} & \sigma_{Y}^{2} + \sigma_{X, Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \sigma_{X}^{2} + \sigma_{Y}^{2} + 2 \cdot \sigma_{X, Y}^{2}\)
Case \(Z' = X - Y\): \(Var(X - Y) = V_2 \cdot \Sigma \cdot V_2^{T} = \begin{bmatrix} 1 & -1 \end{bmatrix} \cdot \begin{bmatrix} \sigma_{X}^{2} & \sigma_{X, Y}^{2} \\ \sigma_{X, Y}^{2} & \sigma_{Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} \sigma_{X}^{2} - \sigma_{X, Y}^{2} & -\sigma_{Y}^{2} + \sigma_{X, Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \sigma_{X}^{2} + \sigma_{Y}^{2} - 2 \cdot \sigma_{X, Y}^{2}\)
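A quick simulation in R (a sketch; the bivariate Normal sample and its parameters are arbitrary choices) illustrates the two identities:
library(MASS)  # for mvrnorm
set.seed(123)
Sigma <- matrix(c(2, 0.8, 0.8, 1), nrow = 2)  # var(X) = 2, var(Y) = 1, cov(X, Y) = 0.8
xy <- mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)
var(xy[, 1] + xy[, 2])  # close to 2 + 1 + 2*0.8 = 4.6
var(xy[, 1] - xy[, 2])  # close to 2 + 1 - 2*0.8 = 1.4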
Exercise 1.7
Let \(Y_1\), \(Y_2\) and \(Y_3\) be independent \(N(\mu, \sigma^2)\) r.v.s. Somehow using the matrix \[\textbf{W} = \begin{bmatrix} 1/3 & 1/3 & 1/3\\ 2/3 & -1/3 & -1/3\\ -1/3 & 2/3 & -1/3 \end{bmatrix}\] show that \(\bar{Y} = \sum_{i=1}^3{Y_i/3}\) and \(\sum_{i=1}^{3}{(Y_i - \bar{Y})^2}\) are independent random variables.
Solution (Part 1: rephrasing the exercise)
We can rephrase and complement the exercise above with a more expressive notation and by adding a row to the matrix provided. This manipulation of the original text allows for a more direct and compact solution, while still embedding \(\textbf{W}\) (though not explicitly), and can be seen as a generalization of the suggestion provided by the authors.
Let \(Y_1\), \(Y_2\) and \(Y_3\) be independent \(N(\mu, \sigma^2)\) r.v.s. Somehow using the matrix \(\textbf{M} = \begin{bmatrix} & \textbf{W} \\ -1/3 & -1/3 & 2/3 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3\\ 2/3 & -1/3 & -1/3\\ -1/3 & 2/3 & -1/3 \\ -1/3 & -1/3 & 2/3 \end{bmatrix}\) show that \(\bar{Y} = \sum_{i=1}^3{Y_i/3}\) and \(\sum_{i=1}^{3}{(Y_i - \bar{Y})^2} = \sum_{i=1}^{3}{(V_i)^2}\) are independent random variables.
As such, the exercise can be solved as follows.
Solution (Part 2: actual solution)
First, consider the r.v.s \(Y_1\), \(Y_2\) and \(Y_3\) as the coordinates of a random vector \(\textbf{Y} = \begin{bmatrix} Y_1\\ Y_2\\ Y_3 \end{bmatrix}\) whose distribution is the Multivariate Normal of i.i.d. variables, with mean (vector) \(\boldsymbol{\mu} = \begin{bmatrix} \mu\\ \mu\\ \mu \end{bmatrix}\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & 0 & 0\\ 0 & \sigma^2 & 0\\ 0 & 0 &\sigma^2 \end{bmatrix}\).
Then, by computing symbolically a matrix-vector product, we can show that \(\textbf{S} = \textbf{M}\times\textbf{Y} = \begin{bmatrix} {Y_1 + Y_2 + Y_3}\over{3}\\ {2Y_1 - Y_2 - Y_3}\over{3}\\ {-Y_1 + 2Y_2 - Y_3}\over{3}\\ {-Y_1 - Y_2 + 2Y_3}\over{3} \end{bmatrix} = \begin{bmatrix} \bar{Y}\\ Y_1 - \bar{Y}\\ Y_2 - \bar{Y}\\ Y_3 - \bar{Y} \end{bmatrix} = \begin{bmatrix} \bar{Y}\\ V_1\\ V_2\\ V_3 \end{bmatrix}\).
By the properties of the Multivariate Normal distribution, the random vector \(\textbf{S}\) is itself Normally distributed, with covariance matrix \(\boldsymbol{\Sigma'} = \textbf{M} \times \boldsymbol{\Sigma} \times \textbf{M}^{T} = \begin{bmatrix} \times & 0 & 0 & 0\\ 0 & \times & \times & \times\\ 0 & \times & \times & \times\\ 0 & \times & \times & \times \end{bmatrix}\), where the \(\times\) symbol is used as a generic placeholder: the so-marked elements, as well as the mean vector of the distribution, are irrelevant to the problem at hand.
From the \(\boldsymbol{\Sigma'}\) matrix above, and from joint normality, we see that \(\bar{Y}\), \(V_1\), \(V_2\) and \(V_3\) are jointly normally distributed and that \(\bar{Y}\) is uncorrelated with each of \(V_1\), \(V_2\) and \(V_3\). For a multivariate Normal vector (and only in that case can this be concluded from zero correlation alone), this implies that \(\bar{Y}\) is independent of \((V_1, V_2, V_3)\).
Since any function of \((V_1, V_2, V_3)\) remains independent of \(\bar{Y}\), we conclude that \(\sum_{i=1}^{3}{V_i^2} = \sum_{i=1}^{3}{(Y_i - \bar{Y})^2}\) is also independent of \(\bar{Y}\).
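As an informal empirical check (a sketch; independence cannot be proven by simulation, but the sample correlation should be close to zero):
set.seed(42)
Y <- matrix(rnorm(3 * 1e5, mean = 1, sd = 2), ncol = 3)  # arbitrary mu = 1, sigma = 2
ybar <- rowMeans(Y)
ss <- rowSums((Y - ybar)^2)  # sum of squared deviations in each replicate
cor(ybar, ss)                # should be close to 0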
Exercise 1.8
If \(\log(X) \sim N(\mu, \sigma^{2})\), find the pdf of \(X\).
Solution
Let \(Y = \log(X) \sim N(\mu, \sigma^{2})\). Working with the cdf of \(X\), for \(x > 0\):
\(F_{X}(x) = P(X \leq x) = P(\log(X) \leq \log(x)) = F_{Y}(\log(x)) = \int_{-\infty}^{\log(x)} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \, e^{- \frac{(t - \mu)^{2}}{2\sigma^{2}}} \, dt\).
Differentiating with respect to \(x\) (chain rule, with \(\frac{d}{dx}\log(x) = \frac{1}{x}\)) gives the pdf:
\(f_{X}(x) = \frac{d\, F_{Y}(\log(x))}{dx} = \frac{1}{x \sqrt{2 \pi \sigma^{2}}} \, e^{- \frac{(\log(x) - \mu)^{2}}{2\sigma^{2}}}, \quad x > 0,\) which is the log-normal density.
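A quick check in R (a sketch, for arbitrary \(\mu = 0.5\) and \(\sigma = 1.2\)) compares this formula with the built-in log-normal density dlnorm():
mu <- 0.5
sigma <- 1.2
x <- seq(0.01, 10, by = 0.01)
f_x <- 1 / (x * sqrt(2 * pi * sigma^2)) * exp(-(log(x) - mu)^2 / (2 * sigma^2))
all.equal(f_x, dlnorm(x, meanlog = mu, sdlog = sigma))  # should be TRUE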
Exercise 1.9
Discrete random variable \(Y\) has a Poisson distribution with parameter \(\lambda\) if its pdf is \(f(y) = \frac{\lambda^{y} e^{-\lambda}}{y!}\), for \(y = 0, 1, \dots\)
Find the moment generating function for \(Y\) (hint: the power series representation of the exponential function is useful).
If \(Y_{1} \sim Poi(\lambda_{1})\) and, independently, \(Y_{2} \sim Poi(\lambda_{2})\), deduce the distribution of \(Y_{1}+Y_{2}\) by employing a general property of m.g.f.s.
Making use of the previous result and the central limit theorem, deduce the normal approximation to the Poisson distribution.
Confirm the previous result graphically, using the R functions dpois, dnorm, plot or barplot, and lines. Confirm that the approximation improves with increasing \(\lambda\).
Solution
a) For a discrete random variable, the moment generating function (mgf) is defined as \(G_{Y}(t) = E[e^{t \cdot Y}] = \sum_{y} P(Y = y) \cdot e^{t \cdot y}\).
For the Poisson distribution this gives: \(G_{Y}(t) = \sum_{i = 0}^{\infty} e^{t \cdot i} \cdot \lambda^{i} \cdot \frac{e^{-\lambda}}{i!} = e^{-\lambda} \cdot \sum_{i = 0}^{\infty} \frac{(\lambda \cdot e^{t})^{i}}{i!} = e^{-\lambda} \cdot e^{\lambda \cdot e^{t}} = e^{\lambda \cdot (e^{t} - 1)}\)
b) Here we can apply the mgf property for independent random variables: \(G_{Y_{1} + Y_{2}}(t) = G_{Y_{1}}(t) \cdot G_{Y_{2}}(t) = e^{\lambda_{1} \cdot (e^{t} - 1)} \cdot e^{\lambda_{2} \cdot (e^{t} - 1)} = e^{(\lambda_{1} + \lambda_{2}) \cdot (e^{t}-1)}\)
Note that this is the mgf of a Poisson distribution with \(\lambda = \lambda_{1} + \lambda_{2}\).
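A quick simulation in R (a sketch; the two rates are arbitrary choices) is consistent with this result:
set.seed(7)
lambda1 <- 2
lambda2 <- 3.5
s <- rpois(1e5, lambda1) + rpois(1e5, lambda2)
# Empirical frequencies of the sum vs. Poisson(lambda1 + lambda2) probabilities
rbind(empirical = as.numeric(table(factor(s, levels = 0:10))) / 1e5,
theoretical = dpois(0:10, lambda1 + lambda2))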
c) Note that \(X_{n} \sim Poi(n)\) and consider the standardized Poisson \(\frac{X_{n}-n}{\sqrt{n}}\); then:
\(\lim_{n \rightarrow\infty} E[e^{t \cdot \frac{X_{n}-n}{\sqrt{n}}}] = \lim_{n \rightarrow\infty} e^{- t\cdot \sqrt{n}} \cdot E[e^{t \cdot \frac{X_{n}}{\sqrt{n}}}] =\)
\(\lim_{n \rightarrow\infty} e^{- t\cdot \sqrt{n}} \cdot e^{n \cdot (e^{\frac{t}{\sqrt{n}}} - 1)} = \lim_{n \rightarrow\infty} e^{-t \cdot \sqrt{n} + n \cdot (1 + t \cdot n^{-\frac{1}{2}} + \frac{t^{2} \cdot n^{-1}}{2} + \cdots -1)} =\)
\(\lim_{n \rightarrow\infty} e^{-t \cdot \sqrt{n} + t \cdot \sqrt{n} + \frac{t^{2}}{2} + \frac{t^{3}}{6 \cdot \sqrt{n}} + \cdots} = \lim_{n \rightarrow\infty} e^{\frac{t^{2}}{2}} \cdot e^{(\frac{t^{3}}{6 \cdot \sqrt{n}} + \cdots)} = e^{\frac{t^{2}}{2}}\)
Where \(e^{\frac{t^{2}}{2}}\) is the mgf of the Standard Normal distribution.
d)
x <- seq(from = 0, to = 20, by = 1)
lambda <- c(0.7, 1.2, 5, 15)
par(mfcol = c(2, 2))
# Place the legend programmatically ;)
legendloc <- function(iter)
{
if (iter < 4) {return("topright")}
else {return("topleft")}
}
for (i in 1:4)
{
poisson_distribution <- dpois(x, lambda[i])
normal_distribution <-
dnorm(x, mean = lambda[i], sd = sqrt(lambda[i]))
plot(
x,
poisson_distribution,
type = "l",
ylab = "F(x)",
main = paste("lambda = ", lambda[i]),
col = "blue"
)
lines(x, normal_distribution, type = "l", col = "red")
legend(
legendloc(i),
legend = c("poisson", "normal"),
col = c("blue", "red"),
lty = c(1, 1)
)
}