Exercise 1
Write a function binomial(x, n, p) for the binomial distribution, depending on the parameters x, n, p, and test it with some prespecified values. Use the function choose() for the binomial coefficient.
Plot two binomials with \(n=20\), and \(p=0.3, 0.6\) respectively.
Solution
# Function
binomial <- function(x, n, p)
{
return(choose(n, x) * p ^ x * (1 - p) ^ (n - x))
}
# Test the function
binomial(3, 6, 0.5)
## [1] 0.3125
dbinom(3, 6, 0.5)
## [1] 0.3125
binomial(5, 100, 0.1)
## [1] 0.0338658
dbinom(5, 100, 0.1)
## [1] 0.0338658
binomial(5, 10, 0.5)
## [1] 0.2460938
dbinom(5, 10, 0.5)
## [1] 0.2460938
binomial(1, 2, 0.5)
## [1] 0.5
dbinom(1, 2, 0.5)
## [1] 0.5
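As an additional check (a minimal sketch), the custom function can be compared with the built-in dbinom() over a whole grid of values at once:
# Vectorized comparison with dbinom(): should return TRUE
all.equal(binomial(0:20, 20, 0.3), dbinom(0:20, 20, 0.3))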
# Plot
x <- 1:20
y1 <- binomial(x, 20, 0.3)
y2 <- binomial(x, 20, 0.6)
plot(x, y1)
points(x, y2, col = "red")
Exercise 2
Generate in R the same output shown in the Lab lecture for the Negative Binomial distribution, but using rgeom() to generate the random variables. Hint: generate \(n\) samples of three geometric random variables \(X_1, \ldots, X_3\) with \(p=0.08\), store them in a matrix and then compute the sum \(Y\).
Solution
We can obtain a Negative Binomial random variable \(Y \sim \mbox{NB}(k, p)\) (the number of failures before the \(k\)-th success) as the sum of \(k\) i.i.d. Geometric random variables \(X_i \sim \mbox{Geom}(p)\) with the same \(p\).
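Before reproducing the lecture output with histogram averaging, a quick sanity check (a minimal sketch) compares the sample mean of such sums with the theoretical Negative Binomial mean \(k(1-p)/p\):
# Sketch: sums of k = 3 iid Geom(p) variables should have mean k*(1-p)/p = 34.5
y_check <- colSums(matrix(rgeom(3 * 10000, prob = 0.08), nrow = 3))
c(empirical = mean(y_check), theoretical = 3 * (1 - 0.08) / 0.08)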
# Hyperparameters
k <- 3
p <- 0.08
upbound <- 1000
# Exact Negative Binomial pmf
plot(0:upbound,
dnbinom(0:upbound, size = k, prob = p),
xlab="Y=number of failures before k successes", ylab="f(x), exact",pch=21, bg=1)
# As sum of k Geometric rvs, via histogram estimation with count averaging
avg_factor <- 1500 # number of histogram replications averaged over
bin_fact <- 1      # histogram bin width
seq_stencil <- seq(from = 0, to = upbound, by = bin_fact)
mycounts <- vector(mode = "numeric", upbound/bin_fact)
# Averaging
for (cnt in 1:avg_factor)
{
mycounts = mycounts +
hist(
# k*(upbound + 1) geometric variates are generated at once via the vectorized `rgeom`
# function, arranged into an (upbound + 1) x k matrix, and summed row-wise to obtain
# (upbound + 1) realizations of Y (the number of failures before k successes).
rowSums(matrix(rgeom(k * (upbound + 1), prob = p), nrow = (upbound + 1), ncol = k)),
breaks = seq_stencil,
plot = FALSE)$density
}
# Normalization
myedpf = mycounts/avg_factor
plot(seq_stencil[1:(length(seq_stencil) - 1)], myedpf, # the number of bins is one less than the number of breaks
xlab="Y=number of failures before k successes", ylab="f(x), approx.",pch=21, bg=1)
Exercise 3
Show in R, also graphically, that \(\mbox{Gamma}(n/2, 1/2)\) coincides with a \(\chi^{2}_{n}\).
Find the 5% and the 95% quantiles of a \(\mbox{Gamma}(3,3)\).
Solution
# Plot
df <- 5
n <- 1000
y <- rchisq(n, df)
hist(y, breaks = 40, probability = TRUE)
curve(
dgamma(x, rate = 1 / 2, shape = df / 2),
col = "red",
lwd = 2,
add = TRUE
)
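Beyond the graphical check, the identity can also be verified numerically (a small sketch): the two exact densities agree on a grid of points.
# dchisq(x, n) should equal dgamma(x, shape = n/2, rate = 1/2)
grid <- seq(0, 25, by = 0.5)
all.equal(dchisq(grid, df = df), dgamma(grid, shape = df / 2, rate = 1 / 2))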
# Quantiles
qgamma(0.05, 3, 3)
## [1] 0.2725638
qgamma(0.95, 3, 3)
## [1] 2.098598
Exercise 4
Generate \(n=1000\) values from a \(\mbox{Beta}(5,2)\) and compute the sample mean and the sample variance.
Solution
n <- 1000
x <- rbeta(n, 5, 2)
mean(x)
## [1] 0.7221582
var(x)
## [1] 0.02367058
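For comparison (a small sketch), the theoretical moments of a \(\mbox{Beta}(\alpha, \beta)\) are \(E[X] = \alpha/(\alpha+\beta)\) and \(var(X) = \alpha\beta/\big((\alpha+\beta)^2(\alpha+\beta+1)\big)\):
# Theoretical mean and variance of Beta(5, 2)
5 / (5 + 2)                          # 0.7142857...
(5 * 2) / ((5 + 2)^2 * (5 + 2 + 1))  # 0.0255102...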
Exercise 5
Show with a simple R function that a negative binomial distribution may be seen as a mixture between a Poisson and a Gamma. In symbols: \(X|Y \sim \mathcal{P}(Y)\), \(Y \sim \mbox{Gamma}(\alpha, \beta)\), then \(X \sim \ldots\).
Solution
We can see that if \(X|Y \sim \mathcal{P}(Y)\) and \(Y \sim \mbox{Gamma}(\alpha, \beta)\) (shape \(\alpha\), rate \(\beta\)), then \(X \sim \mbox{NB}(\alpha, p)\), a Negative Binomial with size \(\alpha\) and success probability \(p = \frac{\beta}{\beta+1}\) (equivalently, failure probability \(\frac{1}{\beta+1}\)).
n <- 200000
r <- 0.5
p <- 0.5
nb_mixture <- function(my_n, my_r, my_p)
{
# Note: here we use the shape/scale parametrization!
Z = rgamma(my_n, shape = my_r, scale = my_p/(1-my_p))
X = rpois(my_n, lambda = Z)
return(X)
}
plot(hist(nb_mixture(n, r, p), breaks=15, plot = FALSE)$counts/n, xlab = "x", ylab = "NegBin(x)")
points(dnbinom((0:15), r,p), col='red')
Exercise 6
Instead of using the built-in function ecdf(), write your own R function for the empirical cumulative distribution function and reproduce the two plots shown during the Lab lecture.
Solution
# Function to compute eCDF
my_ecdf <- function(x)
{
z <- c()
y <- sort(x)
for (i in 1:length(y))
{
z[i] <- i / length(y)
}
z <- cbind(z, y)
return(z)
}
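Although the exercise asks not to use it for the main computation, the built-in ecdf() is still handy for a quick consistency check (a minimal sketch):
# my_ecdf() should agree with stats::ecdf() at the sorted sample points
z_test <- rnorm(20)
my_vals <- my_ecdf(z_test)
all.equal(as.numeric(my_vals[, "z"]), ecdf(z_test)(sort(z_test)))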
# Function to plot eCDF
plot_ecdf <- function(x)
{
plot(
x[, 2],
x[, 1],
type = "s",
ylim = c(-0.01, 1.01),
xlab = "x",
ylab = "Fn(x)",
main = "ECDF and CDF: n=50"
)
abline(0, 0, col = "gray", lty = 2)
abline(1, 0, col = "gray", lty = 2)
}
set.seed(2)
par(mfrow = c(1, 2))
# n = 50
n <- 50
y <- rbeta(n, 3, 4)
tt <- seq(from = 0, to = 1, by = 0.01)
my_edf_beta <- my_ecdf(y)
plot_ecdf(my_edf_beta)
lines(tt,
pbeta(tt, 3, 4),
col = 2,
lty = 2,
lwd = 2)
# n = 500
n2 <- 500
y2 <- rbeta(n2, 3, 4)
my_edf_beta2 <- my_ecdf(y2)
plot_ecdf(my_edf_beta2)
lines(tt,
pbeta(tt, 3, 4),
col = 2,
lty = 2,
lwd = 2)
Exercise 7
Compare in R the assumption of normality for these samples:
\(y_1, \ldots, y_{100} \sim t_{\nu}\), with \(\nu = 5, 20, 100\). What happens when the number of degrees of freedom \(\nu\) increases?
\(y_1, \ldots, y_{100} \sim \mbox{Cauchy}(0,1)\). Do you notice anything odd about the extreme quantiles?
Solution
To compare a sample against the assumption of normality, one can use a Q-Q plot.
From the Q-Q plots that follow, for Student's \(t\) samples with an increasing number of degrees of freedom, we can see that the \(t\) distribution is better and better approximated by the Normal (up to asymptotic convergence, as the theory shows) as the number of degrees of freedom grows. In particular, the deviations of the Q-Q points from the theoretical Q-Q line decrease as the number of d.o.f. increases.
par(mfrow = c(1, 3))
n <- 100
# t-Student df = 5
y <- rt(n, df = 5)
qqplot(
qt(ppoints(n), df = 5),
y,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for t-student(5): n=100",
ylim = c(-5, 5),
xlim = c(-5, 5)
)
qqline(
y,
distribution = function(p)
qt(p, df = 5),
col = "red"
)
# t-Student df = 20
y2 <- rt(100, 20)
qqplot(
qt(ppoints(n), 20),
y2,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for t-student(20): n=100",
ylim = c(-5, 5),
xlim = c(-5, 5)
)
qqline(
y2,
distribution = function(p)
qt(p, df = 20),
col = "red"
)
# t-Student df = 100
y3 <- rt(100, 100)
qqplot(
qt(ppoints(n), 100),
y3,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for t-student(100): n=100",
ylim = c(-5, 5),
xlim = c(-5, 5)
)
qqline(
y3,
distribution = function(p)
qt(p, df = 100),
col = "red"
)
Looking at the quantiles of \(y_1, \ldots, y_{100} \sim \mbox{Cauchy}(0,1)\), we can see that the majority of the points lie on the theoretical Q-Q line, with values close to one another. Nevertheless, the extreme Q-Q points deviate more and more from the theoretical line. This is a consequence of the slow (\(\sim 1/x^2\)) decay of the Cauchy density in its tails, which is also responsible for the distribution having no finite mean or variance.
# Cauchy
y4 <- rcauchy(100, 0, 1)
qqplot(
qcauchy(ppoints(n), 0, 1),
y4,
xlab = "True quantiles",
ylab = "Sample quantiles",
main = "Q-Q plot for Cauchy(0,1): n=100"
)
qqline(
y4,
distribution = function(p)
qcauchy(p, 0, 1),
col = "red"
)
Exercise 8
Write a general R function for checking the validity of the central limit theorem. Hint: the function will take two parameters, clt_function <- function(n, distr), where the first one is the sample size and the second one is the kind of distribution from which you generate. Use plots to visualize the results.
Solution
clt_function <- function(n, dist) {
x <- seq(1, n)
sample <- switch(
dist,
"beta" = rbeta(x, 2, 5),
"binomial" = rbinom(x, n, 0.5),
"chisquared" = rchisq(x, 5),
"gamma" = rgamma(x, 3),
"negativebinomial" = rnbinom(x, 3, 0.5),
"tstudent" = rt(x, 5)
)
# Plot the empirical distribution of the sample
hist(
sample,
freq = FALSE,
main = "CLT",
border = "red",
nclass = 50
)
# Normal
mu <- mean(sample)
var <- var(sample)
curve(dnorm(x, mu, sqrt(var)),
add = TRUE,
col = "black",
lwd = 2)
}
# Plots
par(mfrow = c(2, 2))
clt_function(1000, "beta")
clt_function(1000, "binomial")
clt_function(1000, "chisquared")
clt_function(1000, "tstudent")
Exercise 4
For the data frame ais (DAAG package):
Use the function str() to get information on each of the columns. Determine whether any of the columns hold missing values.
Make a table that shows the numbers of males and females for each different sport. In which sports is there a large imbalance (e.g., by a factor of more than 2:1) in the numbers of the two sexes?
Solution
library(DAAG)
library(data.table) # setDT
data <- ais
str(data)
## 'data.frame': 202 obs. of 13 variables:
## $ rcc : num 3.96 4.41 4.14 4.11 4.45 4.1 4.31 4.42 4.3 4.51 ...
## $ wcc : num 7.5 8.3 5 5.3 6.8 4.4 5.3 5.7 8.9 4.4 ...
## $ hc : num 37.5 38.2 36.4 37.3 41.5 37.4 39.6 39.9 41.1 41.6 ...
## $ hg : num 12.3 12.7 11.6 12.6 14 12.5 12.8 13.2 13.5 12.7 ...
## $ ferr : num 60 68 21 69 29 42 73 44 41 44 ...
## $ bmi : num 20.6 20.7 21.9 21.9 19 ...
## $ ssf : num 109.1 102.8 104.6 126.4 80.3 ...
## $ pcBfat: num 19.8 21.3 19.9 23.7 17.6 ...
## $ lbm : num 63.3 58.5 55.4 57.2 53.2 ...
## $ ht : num 196 190 178 185 185 ...
## $ wt : num 78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
## $ sex : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
## $ sport : Factor w/ 10 levels "B_Ball","Field",..: 1 1 1 1 1 1 1 1 1 1 ...
# For each element of the data frame, is.na() returns TRUE if the
# element is missing (NA) and FALSE otherwise.
# sum() then counts the number of TRUE values, i.e. how many
# elements are NA.
cat("Number of elements that are NaN = ", sum(is.na(data)))
## Number of elements that are NaN = 0
# Create a table that summarizes the numbers of males and females for
# each sport, and define a new column, called "ratio", showing the
# ratio between the sexes.
# summarizing table
sex_frequency <- table(data$sport, data$sex)
# ratio vector: f / m
ratio <- sex_frequency[, "f"] / sex_frequency[, "m"]
# final table with ratio column
sex_frequency <- cbind(sex_frequency, ratio)
# convert table into dataframe
sex_frequency <- as.data.frame(sex_frequency)
# convert rows name in first column of the data frame
# and print dataframe
setDT(sex_frequency, keep.rownames = "sport")[]
## sport f m ratio
## 1: B_Ball 13 12 1.0833333
## 2: Field 7 12 0.5833333
## 3: Gym 4 0 Inf
## 4: Netball 23 0 Inf
## 5: Row 22 15 1.4666667
## 6: Swim 9 13 0.6923077
## 7: T_400m 11 18 0.6111111
## 8: T_Sprnt 4 11 0.3636364
## 9: Tennis 7 4 1.7500000
## 10: W_Polo 0 17 0.0000000
cat("Unbalanced Sport (ratio > 2):", sex_frequency$sport[sex_frequency$ratio > 2])
## Unbalanced Sport (ratio > 2): Gym Netball
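The check above only flags sports where females outnumber males by more than 2:1; a symmetric check (a small sketch) also covers the opposite imbalance which, from the table above, would additionally flag T_Sprnt and W_Polo:
# Imbalance in either direction: ratio > 2 (female-heavy) or ratio < 1/2 (male-heavy)
cat("Unbalanced sports (ratio > 2 or < 1/2):",
sex_frequency$sport[sex_frequency$ratio > 2 | sex_frequency$ratio < 0.5])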
Exercise 6
Create a data frame called Manitoba.lakes that contains the lakes' elevation (in meters above sea level) and area (in square kilometers) as listed below. Assign the names of the lakes using the row.names() function.
attach(Manitoba.lakes)
plot(log2(area) ~ elevation, pch=16, xlim=c(170,280)) # NB: Doubling the area increases log2(area) by 1.0
text(log2(area) ~ elevation, labels=row.names(Manitoba.lakes), pos=4)
text(log2(area) ~ elevation, labels=area, pos=2)
title("Manitoba's Largest Lakes")
detach(Manitoba.lakes)
Devise captions that explain the labeling on the points and on the y-axis. It will be necessary to explain how distances on the scale relate to changes in area.
Repeat the plot and associated labeling, but now plot area versus elevation, specifying log="y" in order to obtain a logarithmic y-scale. Note: the log="y" setting carries across to the subsequent text() commands. See Subsection 2.1.5 for an example.
Solution
library(DAAG)
# Part: "preliminary"
Manitoba.lakes <- data.frame(
row.names = c(
"Winnipeg",
"Winnipegosis",
"Manitoba",
"SouthernIndian",
"Cedar",
"Island",
"Gods",
"Cross",
"Playgreen"
),
elevation = c(217., 254., 248., 254., 253., 227., 178., 207., 217),
area = c(24387., 5374., 4624., 2247., 1353., 1223., 1151., 755., 657.)
)
# Part: "(a)"
attach(Manitoba.lakes)
plot(log2(area) ~ elevation,
pch = 16,
xlim = c(170, 280))
# NB: Doubling the area increases log2(area) by 1.0
text(log2(area) ~ elevation,
labels = row.names(Manitoba.lakes),
pos = 4)
text(log2(area) ~ elevation, labels = Manitoba.lakes$area, pos = 2)
title("Manitoba’s Largest Lakes")
detach(Manitoba.lakes)
# Part: "(b)"
attach(Manitoba.lakes)
plot(
area ~ elevation,
pch = 16,
xlim = c(170, 280),
log = "y"
)
text(area ~ elevation,
labels = row.names(Manitoba.lakes),
pos = 4)
text(area ~ elevation, labels = Manitoba.lakes$area, pos = 2)
title("Manitoba’s Largest Lakes")
detach(Manitoba.lakes)
Exercise 11
Run the following code:
gender <- factor(c(rep("female", 91), rep("male", 92)))
table(gender)
gender <- factor(gender, levels=c("male", "female"))
table(gender)
gender <- factor(gender, levels=c("Male", "female"))# Note the mistake: "Male" should be "male"
table(gender)
table(gender, exclude=NULL)
rm(gender) # Remove gender
Explain the output from the successive uses of table().
Solution
We show below: the provided code fragments, their output, and a possible explanation.
A factor vector composed of 91 female and 92 male elements is created. The table() command shows this result, with the levels in their order of definition.
gender <- factor(c(rep("female", 91), rep("male", 92)))
table(gender)
## gender
## female male
## 91 92
As the levels grouping is specified, with male first and female second, the table() command adapts its output accordingly.
gender <- factor(gender, levels=c("male", "female"))
table(gender)
## gender
## male female
## 92 91
Because of a case mistake, we define a level ("Male" instead of "male") which does not match any value present in the vector. The "Male" column is therefore empty, and the 92 "male" entries become NA.
# Notice the mistake! "Male" should be "male".
gender <- factor(gender, levels=c("Male", "female"))
table(gender)
## gender
## Male female
## 0 91
In the following, the table() command is instructed to explicitly report NA (not available) entries (here, the values that did not match the supplied levels) as a separate, visible category.
table(gender, exclude=NULL)
## gender
## Male female <NA>
## 0 91 92
rm(gender)
Exercise 12
Write a function that calculates the proportion of values in a vector x that exceed some value cutoff.
a. Use the sequence of numbers \(1, 2, \ldots, 100\) to check that this function gives the result that is expected.
b. Obtain the vector ex01.36 from the Devore6 (or Devore7) package. These data give the times required for individuals to escape from an oil platform during a drill. Use dotplot() to show the distribution of times. Calculate the proportion of escape times that exceed 7 minutes.
Solution
library(DAAG)
library(Devore7)
# Part "(a)"
cutoff_prop <- function(x, cutoff)
{
return(sum(x > cutoff) / length(x))
}
my_linspace <- c(1:100)
my_cutoff <- 35
cutoff_prop(my_linspace, my_cutoff)
## [1] 0.65
# Part "(b)"
devore_times <- Devore7::ex01.36
dotplot(devore_times)
seven_min_in_sec <- 7*60
toolate_prop <- cutoff_prop(devore_times$C1, seven_min_in_sec)
toolate_prop
## [1] 0.03846154
Exercise 13
The following contains plots of four different transformations of the Animals data from the MASS package. What different aspects of the data do these different graphs emphasize? Consider the effect on low values of the variables, as contrasted with the effect on high values.
Solution
require(MASS)
data(Animals)
plot(brain~body, data=Animals) # Some linters may complain: that's fine.
In the previous scatterplot, both the brain and the body variables are represented on a linear scale, within the same plot area, with axis ranges slightly wider than the min-max range of the data.
The resulting plot compresses the small-brain/small-body data points near \((0,0)\), because there is also at least one very large-brain and one very large-body data point (which could even be the same animal, though here they are not). As a consequence, the plot emphasizes extremely large values in at least one coordinate: only six such data points are easily distinguishable by their coordinates. Proportions are preserved.
plot(sqrt(brain)~ sqrt(body), data=Animals) # Some linters may complain: that's fine.
In the same setting as before, the data are now plotted after a square-root transform of both coordinates.
This shrinks the axis ranges and alters proportions (according to the chosen transform). The presence of extremely large values in at least one coordinate is still visible. More small-brain/small-body data points are now distinguishable, provided they are not too close to \((0,0)\); some crowding of points remains for the smallest values.
plot(I(brain^0.1)~ I(body^0.1), data=Animals) # Some linters may complain: that's fine.
By further decreasing the exponent of the power-law transform of the axes (now the \(10^{\text{th}}\) root), the effect is amplified further. All data points are easily distinguishable, but proportionality is completely lost: there is no longer any direct visual evidence of extreme values in either coordinate of any data point. Some intuitive evidence emerges of a possible descriptive law in the power or exponential domain.
plot(log(brain)~ log(body), data=Animals) # Some linters may complain: that's fine.
On the bi-logarithmic scale, all the observations made about the previous plot still hold. Proportionality is still not easily visible, but the axes are now transformed in a (usually) more familiar manner. At this exploratory stage, the roughly linear pattern on the log-log scale qualitatively suggests a power-law relationship between the two variables, worth further quantitative analysis. Again, there is no easily noticeable evidence of extreme values.
Exercise 15
The data frame socsupport (from the DAAG package) has data from a survey on social and other kinds of support, for a group of university students. It includes Beck Depression Inventory (BDI) scores.
The following are two alternative plots of BDI against age:
plot(BDI ~ age, data=socsupport)
plot(BDI ~ unclass(age), data=socsupport)
For examination of cases where the score seems very high, which plot is more useful? Explain.
Why is it necessary to be cautious in making anything of the plots for students in the three oldest age categories (25-30, 31-40, 40+)?
Solution
For examination of cases where the score seems very high, the second plot is the more useful one. Both plots show how the observations, grouped by age, are distributed with respect to the BDI score; however, thanks to the ungrouping produced by unclass(), the second plot also shows the individual observations and hence the number of cases in each age group. This makes it possible to judge how reliable the data are for each group.
library(DAAG)
library(lattice)
data(socsupport)
par(mfrow=c(1,2))
plot(BDI ~ age, data = socsupport) # Some linters may complain: that's fine.
plot(BDI ~ unclass(age), data = socsupport) # Some linters may complain: that's fine.
It is necessary to be cautious in making anything of the plots for students in the three oldest age categories because, as we can see from the summary, there are very few observations in those classes. Hence the data may not be representative of the students of those ages.
summary(socsupport$age)
## 18-20 21-24 25-30 31-40 40+
## 44 35 6 6 4
Exercise 17
Given a vector x, the following demonstrates alternative ways to create a vector of numbers from \(1\) through \(n\), where \(n\) is the length of the vector:
x <- c(8, 54, 534, 1630, 6611)
seq(1, length(x))
seq(along=x)
Now set x <- NULL and repeat each of the calculations seq(1, length(x)) and seq(along=x). Which version of the calculation should be used in order to return a vector of length 0 in the event that the supplied argument is NULL?
Solution
In order to return a vector of length 0 when the supplied argument is NULL, you have to use the syntax seq(along=x).
In writing seq(1, length(x)), 1 is interpreted as the starting value of the sequence and length(x), which is equal to 0, as the end value. Since the end value is smaller than the start, the sequence counts down by \(-1\) (it is equivalent to 1:0), so a vector of two elements, 1 and 0, is created.
x <- c(8, 54, 534, 1630, 6611)
seq(1, length(x))
## [1] 1 2 3 4 5
seq(along = x)
## [1] 1 2 3 4 5
x <- NULL
seq(1, length(x))
## [1] 1 0
length(seq(1, length(x)))
## [1] 2
seq(along = x)
## integer(0)
length(seq(along = x))
## [1] 0
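A side note (not required by the exercise): base R also provides seq_along(), which behaves like seq(along = x) and is the idiom usually recommended for building loop indices.
x <- NULL
seq_along(x)          # integer(0), just like seq(along = x)
length(seq_along(x))  # 0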
Exercise 20
The help page for iris (type help(iris)) gives code that converts the data in iris3 (datasets package) to case-by-variable format, with column names Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. Look up the help pages for the functions that are used, and make sure that you understand them. Then add annotation to this code that explains each step in the computation.
Solution
library(datasets)
#help(iris)
dni3 <- dimnames(iris3) # save the dimnames of the object
# iris3
# given that iris3 is a 3-dimensional array 50 x 4 x 3
# the function dimnames returns the names of each one
# of the 3 dimensions of it.
# the output is a list whose first element is NULL
# because the rows of iris3 have no names;
# the second and third elements are character vectors that
# contain the names of the second and third dimensions
# of iris3
test <- aperm(iris3, c(1,3,2)) # stand-alone illustration of the permutation used below (not otherwise used)
ii <- data.frame( # create a dataframe from a matrix
matrix( # create a matrix from the transposition of
# the iris3 array
aperm(iris3, c(1,3,2)), # permute the 3-d array iris3:
# result: 50 x 3 x 4 array
ncol = 4, # having 4 columns
# set the names of the rows and columns of the matrix;
dimnames = list(NULL, # create a list, starting from a
# NULL object
sub(" L.",".Length",
#substitute "L." with ".Length"
# in the output of the following line
sub(" W.",".Width", dni3[[2]])))),
# substitute "W." with ".Width" in
# the strings contained in the list
# of the names of the second
# dimension of iris3: ("Sepal L."
# "Sepal W." "Petal L." "Petal W.")
# create a new column of the dataframe called Species
Species = gl( # generate factors given:
3, 50, # the number of levels (=3), the number of
# replications (=50)
labels = # and the labels of the levels
sub("S", "s", sub("V", "v", dni3[[3]]))))
# substitute "S" with "s" and "V" with "v" in in the
# strings contained in the list of the names of the
# third dimension of iris3
# ("Setosa" "Versicolor" "Virginica")
all.equal(ii, iris) # compare the created object and iris
## [1] TRUE
Exercise 1.1
An exponential random variable \(X \geq 0\) has pdf \(f(x) = \lambda e^{-\lambda x}\). Find its cdf and quantile function, and compute its mean and variance.
Solution
We know that its cdf is \(F(x)=Pr(X\leq x)\), so for \(x \geq 0\) we can compute: \[ F(x)=\int_{-\infty}^{0} f(t)\,dt+\int_{0}^{x}f(t)\,dt= \int_{0}^{x}\lambda e^{-\lambda t}\,dt=\left[-e^{-\lambda t}\right]_{0}^{x}= e^{-\lambda \cdot 0}-e^{-\lambda x}= 1-e^{-\lambda x} \]
The quantile function is its inverse, so we have:
\[ F=1-e^{-\lambda Q} \ \Rightarrow \ 1-F=e^{-\lambda Q} \ \Rightarrow \ Q=-\frac{\ln(1-F)}{\lambda} \]
The mean is: \[ E[X]=\int_{-\infty}^{+\infty}x f(x)\,dx= \int_{0}^{+\infty}x \lambda e^{-\lambda x}\,dx \] By using integration by parts, with
\[ g'(x)=\lambda e^{-\lambda x}, \ \ g(x)=-e^{-\lambda x}, \qquad f(x)= x, \ \ f'(x)=1, \\ \int_{0}^{+\infty} g'(x)f(x)\,dx = \left[g(x)f(x)\right]_{0}^{+\infty} - \int_{0}^{+\infty} g(x) f'(x)\,dx, \] we obtain: \[ E[X]=\left[-xe^{-\lambda x}\right]_{0}^{+\infty}+\int_{0}^{+\infty}e^{-\lambda x}\,dx=\left[\frac{-e^{-\lambda x}}{\lambda}\right]_{0}^{+\infty}=\frac{1}{\lambda} \] As for the variance, if the distribution admits a finite second moment then \(E[X]\) exists and \(var(X)=E[X^2]-E[X]^2\), where \[ E[X^2]=\int_{0}^{+\infty}x^2 \lambda e^{-\lambda x}\,dx \] By using integration by parts, we let
\[ g'(x)=\lambda e^{-\lambda x} \ \ g(x)=-e^{-\lambda x}\\ f'(x)=2x \ \ f(x)= x^2 \\ \]
And then, \[ E[X^2]=\left[-x^2e^{-\lambda x}\right]_{0}^{+\infty}+\int_{0}^{+\infty} 2x e^{-\lambda x}\,dx=\frac{2}{\lambda^2}, \] because we know that \(\int_{0}^{+\infty} x e^{-\lambda x}\,dx=\frac{1}{\lambda^2}\).
Thus, it directly follows that \(var(X)=E[X^2]-E[X]^2 = \frac{1}{\lambda^2}\).
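A quick numerical check in R (a minimal sketch, using an arbitrary rate \(\lambda = 2\)) agrees with these closed forms:
lambda <- 2
# cdf and quantile function match the expressions derived above
all.equal(pexp(1.3, rate = lambda), 1 - exp(-lambda * 1.3))
all.equal(qexp(0.75, rate = lambda), -log(1 - 0.75) / lambda)
# mean 1/lambda and variance 1/lambda^2 via numerical integration
integrate(function(x) x * dexp(x, rate = lambda), 0, Inf)$value                      # ~ 0.5
integrate(function(x) x^2 * dexp(x, rate = lambda), 0, Inf)$value - (1 / lambda)^2   # ~ 0.25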
Exercise 1.2
Evaluate \(Pr(X < 0.5, Y < 0.5)\) if \(X\) and \(Y\) have the joint pdf given below.
Solution
Let \[ f (x, y) =\begin{cases} x + \frac{3y^2} {2} &0<x<1 \wedge 0<y<1 \\ 0 &\text{otherwise} \end{cases} \]
\[ Pr(X < 0.5, Y < 0.5)=\int_{0}^{0.5}\int_{0}^{0.5}x + \frac{3y^2} {2}dxdy= \int_{0}^{0.5}\int_{0}^{0.5}x\ dxdy + \int_{0}^{0.5}\int_{0}^{0.5}\frac{3y^2}{2}dxdy = \frac{1}{2} \left({ \int_{0}^{0.5}x\ dx + \int_{0}^{0.5}\frac{3y^2}{2}\ dy }\right) \] At this point, we can compute: \(\int_{0}^{0.5}x\ dx = \frac{1}{8}\) and \(\int_{0}^{0.5}\frac{3y^2}{2}\ dy = \frac{1}{16}\).
It directly follows that \(Pr(X < 0.5, Y < 0.5)= \frac{1+2}{32} = \frac{3}{32} \approx0.094\)
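A simple Monte Carlo check in R (a sketch; the sample size is an arbitrary choice) agrees with the analytic value \(3/32 \approx 0.094\):
set.seed(1)
u <- runif(1e6)
v <- runif(1e6)
# P(X < 0.5, Y < 0.5) = E[ f(U,V) * 1{U < 0.5, V < 0.5} ] for (U,V) uniform on the unit square
mean((u + 3 * v^2 / 2) * (u < 0.5 & v < 0.5))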
Exercise 1.6
Let \(X\) and \(Y\) be non-independent random variables, such that \(var(X) =\sigma_{X}^{2}\), \(var(Y)=\sigma_{Y}^{2}\) and \(cov(X, Y) =\sigma_{X, Y}^{2}\). Using the result from Section 1.6.2, find \(var(X+Y)\) and \(var(X-Y)\).
Solution
If we define \(Z = X + Y\), then we can compute \(Var(Z)\) as follows:
\(Var(Z) = E[(Z - E[Z])^{2}] = E[(X + Y - E[X + Y])^{2}] = E[((X - E[X]) + (Y - E[Y]))^{2}] =\)
\(E[(X - E[X])^{2}] + E[(Y - E[Y])^{2}] + 2 \cdot E[(X - E[X]) \cdot (Y - E[Y])] = \sigma_{X}^{2} + \sigma_{Y}^{2} + 2 \cdot \sigma_{X, Y}^{2}\)
If we define \(Z' = X - Y\), then we can compute \(Var(Z')\) as follows:
\(Var(Z') = E[(Z' - E[Z'])^{2}] = E[(X - Y - E[X - Y])^{2}] = E[((X - E[X]) - (Y - E[Y]))^{2}] =\)
\(E[(X - E[X])^{2}] + E[(Y - E[Y])^{2}] - 2 \cdot E[(X - E[X]) \cdot (Y - E[Y])] = \sigma_{X}^{2} + \sigma_{Y}^{2} - 2 \cdot \sigma_{X, Y}^{2}\).
However, if we want to follow the section referenced in the hint, we may suppose that \(X\) and \(Y\) are the coordinates of a multivariate random vector \(W = \begin{bmatrix} X \\ Y \end{bmatrix}\) that has a generic distribution with covariance matrix \(\Sigma\).
So, let us define \(V_{1} = \begin{bmatrix} 1 & 1 \end{bmatrix}\) and \(V_{2} = \begin{bmatrix} 1 & -1 \end{bmatrix}\), two row vectors used to obtain the random variables \(Z = X + Y\) and \(Z' = X - Y\), respectively, via a linear transformation (a matrix multiplication in this case). Thus, we can obtain the variances of \(Z\) and \(Z'\) by applying the corresponding transformation rule to \(\Sigma\):
Case \(Z = X + Y\): \(Var(X + Y) = V_1 \cdot \Sigma \cdot V_1^{T} = \begin{bmatrix} 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} \sigma_{X}^{2} & \sigma_{X, Y}^{2} \\ \sigma_{X, Y}^{2} & \sigma_{Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} \sigma_{X}^{2} + \sigma_{X, Y}^{2} & \sigma_{Y}^{2} + \sigma_{X, Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \sigma_{X}^{2} + \sigma_{Y}^{2} + 2 \cdot \sigma_{X, Y}^{2}\)
Case \(Z' = X - Y\): \(Var(X - Y) = V_2 \cdot \Sigma \cdot V_2^{T} = \begin{bmatrix} 1 & -1 \end{bmatrix} \cdot \begin{bmatrix} \sigma_{X}^{2} & \sigma_{X, Y}^{2} \\ \sigma_{X, Y}^{2} & \sigma_{Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} \sigma_{X}^{2} - \sigma_{X, Y}^{2} & -\sigma_{Y}^{2} + \sigma_{X, Y}^{2} \end{bmatrix} \cdot \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \sigma_{X}^{2} + \sigma_{Y}^{2} - 2 \cdot \sigma_{X, Y}^{2}\)
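A quick simulation in R (a sketch; the bivariate Normal sample and its parameters are arbitrary choices) illustrates the two identities:
library(MASS)  # for mvrnorm
set.seed(123)
Sigma <- matrix(c(2, 0.8, 0.8, 1), nrow = 2)  # var(X) = 2, var(Y) = 1, cov(X, Y) = 0.8
xy <- mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)
var(xy[, 1] + xy[, 2])  # close to 2 + 1 + 2*0.8 = 4.6
var(xy[, 1] - xy[, 2])  # close to 2 + 1 - 2*0.8 = 1.4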
Exercise 1.7
Let \(Y_1\), \(Y_2\) and \(Y_3\) be independent \(N(\mu, \sigma^2)\) r.v.s. Somehow using the matrix \[\textbf{W} = \begin{bmatrix} 1/3 & 1/3 & 1/3\\ 2/3 & -1/3 & -1/3\\ -1/3 & 2/3 & -1/3 \end{bmatrix}\] show that \(\bar{Y} = \sum_{i=1}^3{Y_i/3}\) and \(\sum_{i=1}^{3}{(Y_i - \bar{Y})^2}\) are independent random variables.
Solution (Part 1: rephrasing the exercise)
We can rephrase and complement the exercise above with a more expressive notation and by adding a row to the matrix provided. This manipulation of the original text allows for a more direct and compact solution, while still embedding \(\textbf{W}\) (though not explicitly), and can be seen as a generalization of the suggestion provided by the authors.
Let \(Y_1\), \(Y_2\) and \(Y_3\) be independent \(N(\mu, \sigma^2)\) r.v.s. Somehow using the matrix \(\textbf{M} = \begin{bmatrix} & \textbf{W} \\ -1/3 & -1/3 & 2/3 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3\\ 2/3 & -1/3 & -1/3\\ -1/3 & 2/3 & -1/3 \\ -1/3 & -1/3 & 2/3 \end{bmatrix}\) show that \(\bar{Y} = \sum_{i=1}^3{Y_i/3}\) and \(\sum_{i=1}^{3}{(Y_i - \bar{Y})^2} = \sum_{i=1}^{3}{(V_i)^2}\) are independent random variables.
As such, the exercise can be solved as follows.
Solution (Part 2: actual solution)
First, consider the r.v.s \(Y_1\), \(Y_2\) and \(Y_3\) as the coordinates of a random vector \(\textbf{Y} = \begin{bmatrix} Y_1\\ Y_2\\ Y_3 \end{bmatrix}\) whose distribution is the Multivariate Normal of i.i.d. variables, with mean (vector) \(\boldsymbol{\mu} = \begin{bmatrix} \mu\\ \mu\\ \mu \end{bmatrix}\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & 0 & 0\\ 0 & \sigma^2 & 0\\ 0 & 0 &\sigma^2 \end{bmatrix}\).
Then, by computing symbolically a matrix-vector product, we can show that \(\textbf{S} = \textbf{M}\times\textbf{Y} = \begin{bmatrix} {Y_1 + Y_2 + Y_3}\over{3}\\ {2Y_1 - Y_2 - Y_3}\over{3}\\ {-Y_1 + 2Y_2 - Y_3}\over{3}\\ {-Y_1 - Y_2 + 2Y_3}\over{3} \end{bmatrix} = \begin{bmatrix} \bar{Y}\\ Y_1 - \bar{Y}\\ Y_2 - \bar{Y}\\ Y_3 - \bar{Y} \end{bmatrix} = \begin{bmatrix} \bar{Y}\\ V_1\\ V_2\\ V_3 \end{bmatrix}\).
By the properties of the Multivariate Normal distribution, the random vector \(\textbf{S}\) is itself Normally distributed, with covariance matrix \(\boldsymbol{\Sigma'} = \textbf{M} \times \boldsymbol{\Sigma} \times \textbf{M}^{T} = \begin{bmatrix} \times & 0 & 0 & 0\\ 0 & \times & \times & \times\\ 0 & \times & \times & \times\\ 0 & \times & \times & \times \end{bmatrix}\), where the \(\times\) symbol is used as a generic placeholder: the so-marked elements, as well as the mean vector of the distribution, are irrelevant to the problem at hand.
From the \(\boldsymbol{\Sigma'}\) matrix above, and from joint normality, we see that \(\bar{Y}\), \(V_1\), \(V_2\) and \(V_3\) are jointly normally distributed and that \(\bar{Y}\) is uncorrelated with each of \(V_1\), \(V_2\) and \(V_3\). For a multivariate Normal vector (and only in that case can this be concluded from zero correlation alone), this implies that \(\bar{Y}\) is independent of \((V_1, V_2, V_3)\).
Since any function of \((V_1, V_2, V_3)\) remains independent of \(\bar{Y}\), we conclude that \(\sum_{i=1}^{3}{V_i^2} = \sum_{i=1}^{3}{(Y_i - \bar{Y})^2}\) is also independent of \(\bar{Y}\).
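As an informal empirical check (a sketch; independence cannot be proven by simulation, but the sample correlation should be close to zero):
set.seed(42)
Y <- matrix(rnorm(3 * 1e5, mean = 1, sd = 2), ncol = 3)  # arbitrary mu = 1, sigma = 2
ybar <- rowMeans(Y)
ss <- rowSums((Y - ybar)^2)  # sum of squared deviations in each replicate
cor(ybar, ss)                # should be close to 0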
Exercise 1.8
If \(\log(X) \sim N(\mu, \sigma^{2})\), find the pdf of \(X\).
Solution
Let \(Y = \log(X) \sim N(\mu, \sigma^{2})\). Working with the cdf of \(X\), for \(x > 0\):
\(F_{X}(x) = P(X \leq x) = P(\log(X) \leq \log(x)) = F_{Y}(\log(x)) = \int_{-\infty}^{\log(x)} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \, e^{- \frac{(t - \mu)^{2}}{2\sigma^{2}}} \, dt\).
Differentiating with respect to \(x\) (chain rule, with \(\frac{d}{dx}\log(x) = \frac{1}{x}\)) gives the pdf:
\(f_{X}(x) = \frac{d\, F_{Y}(\log(x))}{dx} = \frac{1}{x \sqrt{2 \pi \sigma^{2}}} \, e^{- \frac{(\log(x) - \mu)^{2}}{2\sigma^{2}}}, \quad x > 0,\) which is the log-normal density.
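A quick check in R (a sketch, for arbitrary \(\mu = 0.5\) and \(\sigma = 1.2\)) compares this formula with the built-in log-normal density dlnorm():
mu <- 0.5
sigma <- 1.2
x <- seq(0.01, 10, by = 0.01)
f_x <- 1 / (x * sqrt(2 * pi * sigma^2)) * exp(-(log(x) - mu)^2 / (2 * sigma^2))
all.equal(f_x, dlnorm(x, meanlog = mu, sdlog = sigma))  # should be TRUE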
Exercise 1.9
Discrete random variable \(Y\) has a Poisson distribution with parameter \(\lambda\) if its pdf is \(f(y) = \frac{\lambda^{y} e^{-\lambda}}{y!}\), for \(y = 0, 1, \dots\)
Find the moment generating function for \(Y\) (hint: the power series representation of the exponential function is useful).
If \(Y_{1} \sim Poi(\lambda_{1})\) and, independently, \(Y_{2} \sim Poi(\lambda_{2})\), deduce the distribution of \(Y_{1}+Y_{2}\) by employing a general property of m.g.f.s.
Making use of the previous result and the central limit theorem, deduce the normal approximation to the Poisson distribution.
Confirm the previous result graphically, using the R functions dpois, dnorm, plot or barplot, and lines. Confirm that the approximation improves with increasing \(\lambda\).
Solution
a) For a discrete random variable, the moment generating function (mgf) is defined as \(G_{Y}(t) = E[e^{t \cdot Y}] = \sum_{y} P(Y = y) \cdot e^{t \cdot y}\).
For the Poisson distribution this gives: \(G_{Y}(t) = \sum_{i = 0}^{\infty} e^{t \cdot i} \cdot \lambda^{i} \cdot \frac{e^{-\lambda}}{i!} = e^{-\lambda} \cdot \sum_{i = 0}^{\infty} \frac{(\lambda \cdot e^{t})^{i}}{i!} = e^{-\lambda} \cdot e^{\lambda \cdot e^{t}} = e^{\lambda \cdot (e^{t} - 1)}\)
b) Here we can apply the mgf property for independent random variables: \(G_{Y_{1} + Y_{2}}(t) = G_{Y_{1}}(t) \cdot G_{Y_{2}}(t) = e^{\lambda_{1} \cdot (e^{t} - 1)} \cdot e^{\lambda_{2} \cdot (e^{t} - 1)} = e^{(\lambda_{1} + \lambda_{2}) \cdot (e^{t}-1)}\)
Note that this is the mgf of a Poisson distribution with \(\lambda = \lambda_{1} + \lambda_{2}\).
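A quick simulation in R (a sketch; the two rates are arbitrary choices) is consistent with this result:
set.seed(7)
lambda1 <- 2
lambda2 <- 3.5
s <- rpois(1e5, lambda1) + rpois(1e5, lambda2)
# Empirical frequencies of the sum vs. Poisson(lambda1 + lambda2) probabilities
rbind(empirical = as.numeric(table(factor(s, levels = 0:10))) / 1e5,
theoretical = dpois(0:10, lambda1 + lambda2))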
c) Note that \(X_{n} \sim Poi(n)\) and consider the standardized Poisson \(\frac{X_{n}-n}{\sqrt{n}}\); then:
\(\lim_{n \rightarrow\infty} E[e^{t \cdot \frac{X_{n}-n}{\sqrt{n}}}] = \lim_{n \rightarrow\infty} e^{- t\cdot \sqrt{n}} \cdot E[e^{t \cdot \frac{X_{n}}{\sqrt{n}}}] =\)
\(\lim_{n \rightarrow\infty} e^{- t\cdot \sqrt{n}} \cdot e^{n \cdot (e^{\frac{t}{\sqrt{n}}} - 1)} = \lim_{n \rightarrow\infty} e^{-t \cdot \sqrt{n} + n \cdot (1 + t \cdot n^{-\frac{1}{2}} + \frac{t^{2} \cdot n^{-1}}{2} + \cdots -1)} =\)
\(\lim_{n \rightarrow\infty} e^{-t \cdot \sqrt{n} + t \cdot \sqrt{n} + \frac{t^{2}}{2} + \frac{t^{3}}{6 \cdot \sqrt{n}} + \cdots} = \lim_{n \rightarrow\infty} e^{\frac{t^{2}}{2}} \cdot e^{(\frac{t^{3}}{6 \cdot \sqrt{n}} + \cdots)} = e^{\frac{t^{2}}{2}}\)
Where \(e^{\frac{t^{2}}{2}}\) is the mgf of the Standard Normal distribution.
d)
x <- seq(from = 0, to = 20, by = 1)
lambda <- c(0.7, 1.2, 5, 15)
par(mfcol = c(2, 2))
# Place the legend programmatically ;)
legendloc <- function(iter)
{
if (iter < 4) {return("topright")}
else {return("topleft")}
}
for (i in 1:4)
{
poisson_distribution <- dpois(x, lambda[i])
normal_distribution <-
dnorm(x, mean = lambda[i], sd = sqrt(lambda[i]))
plot(
x,
poisson_distribution,
type = "l",
ylab = "F(x)",
main = paste("lambda = ", lambda[i]),
col = "blue"
)
lines(x, normal_distribution, type = "l", col = "red")
legend(
legendloc(i),
legend = c("poisson", "normal"),
col = c("blue", "red"),
lty = c(1, 1)
)
}