This is a vignette for the R package ipsRdbs. This package contains data sets, programmes and illustrations discussed in the book, “Introduction to Probability, Statistics & R: Foundations for Data-Based Sciences” by Sahu (2024).
ipsRdbs 1.0.0
beanie: Age and value of beanie baby toys
bill: Wealth and age of world billionaires
bodyfat: Body fat percentage and skinfold thickness of athletes
bombhits: Number and frequency of bomb hits in London
cement: Breaking strength of cement
cfail: Number of weekly computer failures
cheese: Taste of cheese
emissions: Exhaust emissions of cars
err_age: Error in guessing ages from photographs
ffood: Service times in a fast food restaurant
gasmileage: Gas mileage of cars
possum: Body weight and length of possums in Australian regions
puffin: Nesting habits of puffins in Newfoundland
rice: Data set on rice yield
wgain: Weight gain of students starting college
This package complements the book “Introduction to Probability, Statistics & R: Foundations for Data-Based Sciences” by Sahu (2024). The package distributes the data sets used in the book and provides code illustrating the statistical modelling of those data sets. In addition, the package provides code illustrating various results in probability and statistics. For example, it provides code to simulate the Monty Hall game, illustrating conditional probability, and gives simulation-based examples illustrating the central limit theorem and the weak law of large numbers. Thus the package helps a beginner reader enhance their understanding of a few elementary concepts in probability and statistics, and introduces them to linear statistical modelling, i.e., regression and ANOVA, which are among the key foundational concepts of data science, machine learning and, more generally, data-based sciences.
The reader is first instructed to install the R software package by searching for CRAN on the internet. The reader should then go to the web page https://cran.r-project.org/ and install the correct and latest version of the package on their own computer. Please note that R cannot be installed on a mobile phone.
Once R has been installed, the next task is to install the frontend software package RStudio, which provides an easier interface for working with R.
After installing R and RStudio, the reader should launch the RStudio programme on their computer. This will open up a four-pane window, one of which is named ‘Console’ or ‘Terminal’. This window accepts commands (our instructions) and prints results. The reader is encouraged to search the internet for gentler introductions and videos. For example, the reader may type 2+2 at the console prompt and hit the Enter key to examine the result:
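2 + 2
#> [1] 4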
To get started here, the reader is asked to install the add-on R package ipsRdbs simply by issuing the R command
install.packages("ipsRdbs", dependencies=TRUE)
without committing any typing mistakes. If this installation is successful, the reader can issue the following two commands to list all R
objects (data sets and programmes) included in the package.
library(ipsRdbs)
ls("package:ipsRdbs")
Note that this command will only produce the intended results if the package has been successfully installed in the first place.
All the objects listed by the ls command in the previous section have associated help files. The reader can obtain information about each of these objects by typing a question mark immediately followed by the object name, e.g. ?butterfly, or by issuing the command help(butterfly), as shown below:
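?butterfly
# or equivalently
help(butterfly)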
The help files provide details about the objects, and the user is able to run all the code included as illustrations at the end of each help file. This can be done either by clicking the Run Examples link or simply by copy-pasting the commands onto the command console in RStudio. This is a great advantage of R, as it allows users to reproduce the results without first having to learn all the commands and syntax. After gaining this confidence, a beginner user can examine and experiment with the commands further. More details regarding the objects are provided in the book Sahu (2024).
The remainder of this vignette elaborates the help files for all the main objects and programmes included in the package. The main intention is to enable the reader to reproduce all the results by actually running the commands and code already included in the help files.
Section 2 discusses all the data sets.
All the R functions are discussed in Section 3.
Some summary remarks are provided in Section 4.
beanie: Age and value of beanie baby toys
This data set contains the age and the value of 50 beanie baby toys. Source: Beanie World magazine. This data set has been used as an example of simple linear regression modelling, where the exercise is to predict the value of a beanie baby toy from its age.
head(beanie)
#> name age value
#> 1 Ally 52 55
#> 2 Batty 12 12
#> 3 Bongo 28 40
#> 4 Blackie 52 10
#> 5 Bucky 40 45
#> 6 Bumble 28 600
summary(beanie)
#> name age value
#> Length:50 Min. : 5.00 Min. : 10.0
#> Class :character 1st Qu.:12.00 1st Qu.: 15.0
#> Mode :character Median :28.00 Median : 26.5
#> Mean :26.52 Mean : 128.9
#> 3rd Qu.:40.00 3rd Qu.: 62.5
#> Max. :64.00 Max. :1900.0
plot(beanie$age, beanie$value, xlab="Age", ylab="Value", pch="*", col="red")
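Since these data are intended for simple linear regression, a minimal sketch of such a fit is:
beanie.lm <- lm(value ~ age, data=beanie)
summary(beanie.lm)
abline(beanie.lm, col="blue") # add the fitted line to the scatter plot drawn above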
bill: Wealth and age of world billionaires
This data set contains the wealth, age and region of 225 billionaires in 1992, as reported in Fortune magazine. It can be used to illustrate exploratory data analysis by producing side-by-side box plots of wealth for billionaires from different continents of the world. It can also be used for multiple linear regression models, although such tasks have not been undertaken here.
head(bill)
#> wealth age region
#> 1 37.0 50 M
#> 2 24.0 88 U
#> 3 14.0 64 A
#> 4 13.0 63 U
#> 5 13.0 66 U
#> 6 11.7 72 E
summary(bill)
#> wealth age region
#> Min. : 1.000 Min. : 7.00 A:37
#> 1st Qu.: 1.300 1st Qu.: 56.00 E:76
#> Median : 1.800 Median : 65.00 M:22
#> Mean : 2.726 Mean : 64.03 O:28
#> 3rd Qu.: 3.000 3rd Qu.: 72.00 U:62
#> Max. :37.000 Max. :102.00
table(bill$region)
#>
#> A E M O U
#> 37 76 22 28 62
levels(bill$region)
#> [1] "A" "E" "M" "O" "U"
levels(bill$region) <- c("Asia", "Europe", "Mid-East", "Other", "USA")
bill.wealth.ge5 <- bill[bill$wealth>5, ]
bill.wealth.ge5
#> wealth age region
#> 1 37.0 50 Mid-East
#> 2 24.0 88 USA
#> 3 14.0 64 Asia
#> 4 13.0 63 USA
#> 5 13.0 66 USA
#> 6 11.7 72 Europe
#> 7 10.0 71 Mid-East
#> 8 8.2 77 USA
#> 9 8.1 68 USA
#> 10 7.2 66 Europe
#> 11 7.0 69 Mid-East
#> 12 6.2 36 Other
#> 13 5.9 49 USA
#> 14 5.3 73 USA
#> 15 5.2 52 Europe
# Note: the region levels were renamed above, so subset using a new level name
bill.region.Asia <- bill[bill$region == "Asia", ]
nrow(bill.region.Asia) # 37 Asian billionaires, as in the frequency table above
a <- seq(1, 10, by=2) # the odd numbers 1, 3, 5, 7, 9
oddrows <- bill[a, ] # selects the odd-numbered rows among the first ten
barplot(table(bill$region), col=2:6)
hist(bill$wealth) # produces a dull looking plot
hist(bill$wealth, nclass=20) # produces a more detailed plot.
hist(bill$wealth, nclass=20, xlab="Wealth",
main="Histogram of wealth of billionaires")
plot(bill$age, bill$wealth) # A very dull plot.
plot(bill$age, bill$wealth, xlab="Age", ylab="Wealth", pch="*") # better
plot(bill$age, bill$wealth, xlab="Age", ylab="Wealth", type="n")
# Lays the plot area but does not plot.
text(bill$age, bill$wealth, labels=bill$region, cex=0.7, col=2:6)
# Adds the points to the empty plot.
# Provides a better looking plot with more information.
boxplot(data=bill, wealth ~ region, col=2:6)
tapply(X=bill$wealth, INDEX=bill$region, FUN=mean)
#> Asia Europe Mid-East Other USA
#> 2.651351 2.257895 4.263636 2.278571 3.000000
tapply(X=bill$wealth, INDEX=bill$region, FUN=sd)
#> Asia Europe Mid-East Other USA
#> 2.192617 1.623187 7.657150 1.265308 3.659974
round(tapply(X=bill$wealth, INDEX=bill$region, FUN=mean), 2)
#> Asia Europe Mid-East Other USA
#> 2.65 2.26 4.26 2.28 3.00
library(ggplot2)
gg <- ggplot2::ggplot(data=bill, aes(x=age, y=wealth)) +
geom_point(aes(col=region, size=wealth)) +
geom_smooth(method="loess", se=FALSE) +
xlim(c(7, 102)) +
ylim(c(1, 37)) +
labs(subtitle="Wealth vs Age of Billionaires",
y="Wealth (Billion US $)", x="Age",
title="Scatterplot", caption = "Source: Fortune Magazine, 1992.")
plot(gg)
#> `geom_smooth()` using formula = 'y ~ x'
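Although multiple regression modelling of these data has not been undertaken here, a minimal sketch of such a model, with age and region as covariates, is:
bill.lm <- lm(wealth ~ age + region, data=bill)
summary(bill.lm)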
bodyfat: Body fat percentage and skinfold thickness of athletes
This data set contains body fat percentage data for 102 elite male athletes training at the Australian Institute of Sport. It has been used to illustrate simple linear regression in Chapter 17 of the book by Sahu (2024).
summary(bodyfat)
#> Skinfold Bodyfat
#> Min. : 28.00 Min. : 5.630
#> 1st Qu.: 37.52 1st Qu.: 6.968
#> Median : 47.70 Median : 8.625
#> Mean : 51.42 Mean : 9.251
#> 3rd Qu.: 58.15 3rd Qu.:10.010
#> Max. :113.50 Max. :19.940
plot(bodyfat$Skinfold, bodyfat$Bodyfat, xlab="Skin", ylab="Fat")
plot(bodyfat$Skinfold, log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")
plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")
# Keep the transformed variables in the data set
bodyfat$logskin <- log(bodyfat$Skinfold)
bodyfat$logbfat <- log(bodyfat$Bodyfat)
# Create a grouped variable
bodyfat$cutskin <- cut(log(bodyfat$Skinfold), breaks=6)
boxplot(data=bodyfat, Bodyfat~cutskin, col=2:7)
require(ggplot2)
p2 <- ggplot(data=bodyfat, aes(x=cutskin, y=logbfat)) +
geom_boxplot(col=2:7) +
stat_summary(fun=mean, geom="line", aes(group=1), col="blue", linewidth=1) +
labs(x="Skinfold", y="Percentage of log bodyfat",
title="Boxplot of log-bodyfat percentage vs grouped log-skinfold")
plot(p2)
n <- nrow(bodyfat)
x <- bodyfat$logskin
y <- bodyfat$logbfat
xbar <- mean(x)
ybar <- mean(y)
sx2 <- var(x)
sy2 <- var(y)
sxy <- cov(x, y)
r <- cor(x, y)
print(list(n=n, xbar=xbar, ybar=ybar, sx2=sx2, sy2=sy2, sxy=sxy, r=r))
#> $n
#> [1] 102
#>
#> $xbar
#> [1] 3.883108
#>
#> $ybar
#> [1] 2.175978
#>
#> $sx2
#> [1] 0.1084285
#>
#> $sy2
#> [1] 0.09106112
#>
#> $sxy
#> [1] 0.09566069
#>
#> $r
#> [1] 0.9627095
hatbeta1 <- r * sqrt(sy2/sx2) # calculates estimate of the slope
hatbeta0 <- ybar - hatbeta1 * xbar # calculates estimate of the intercept
rs <- y - hatbeta0 - hatbeta1 * x # calculates residuals
s2 <- sum(rs^2)/(n-2) # calculates estimate of sigma2
s2
#> [1] 0.006731453
bfat.lm <- lm(logbfat ~ logskin, data=bodyfat)
### Check the diagnostics
plot(bfat.lm$fit, bfat.lm$res, xlab="Fitted values", ylab = "Residuals", pch="*")
abline(h=0)
### Should be a random scatter
qqnorm(bfat.lm$res, col=2)
qqline(bfat.lm$res, col="blue")
# All Points should be on the straight line
summary(bfat.lm)
#>
#> Call:
#> lm(formula = logbfat ~ logskin, data = bodyfat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.223408 -0.055661 0.000052 0.056390 0.152984
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.24988 0.09661 -12.94 <2e-16 ***
#> logskin 0.88225 0.02479 35.59 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.08205 on 100 degrees of freedom
#> Multiple R-squared: 0.9268, Adjusted R-squared: 0.9261
#> F-statistic: 1266 on 1 and 100 DF, p-value: < 2.2e-16
anova(bfat.lm)
#> Analysis of Variance Table
#>
#> Response: logbfat
#> Df Sum Sq Mean Sq F value Pr(>F)
#> logskin 1 8.5240 8.5240 1266.3 < 2.2e-16 ***
#> Residuals 100 0.6731 0.0067
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(bodyfat$logskin, bodyfat$logbfat, xlab="log Skin", ylab="log Fat")
abline(bfat.lm, col=7)
title("Scatter plot with the fitted Linear Regression line")
# 95% CI for beta(1)
# 0.88225 + c(-1, 1) * qt(0.975, df=100) * 0.02479
# round(0.88225 + c(-1, 1) * qt(0.975, df=100) * 0.02479, 2)
# To test H0: beta1 = 1.
tstat <- (0.88225 -1)/0.02479
pval <- 2 * (1- pt(abs(tstat), df=100))
x <- seq(from=-5, to=5, length=500)
y <- dt(x, df=100)
plot(x, y, xlab="", ylab="", type="l")
title("T-density with df=100")
abline(v=abs(tstat))
abline(h=0)
x1 <- seq(from=abs(tstat), to=10, length=100)
y1 <- rep(0, length=100)
x2 <- x1
y2 <- dt(x1, df=100)
segments(x1, y1, x2, y2)
abline(h=0)
# Predict at a new value of Skinfold=70
# Create a new data frame called newx
newx <- data.frame(logskin=log(70))
a <- predict(bfat.lm, newdata=newx, se.fit=TRUE)
# Confidence interval for the mean of log bodyfat at skinfold=70
a <- predict(bfat.lm, newdata=newx, interval="confidence")
# a
# fit lwr upr
# [1,] 2.498339 2.474198 2.52248
# Prediction interval for a future log bodyfat at skinfold=70
a <- predict(bfat.lm, newdata=newx, interval="prediction")
a
#> fit lwr upr
#> 1 2.498339 2.333783 2.662895
# Confidence intervals for the mean response at the observed data points
pred.bfat.clim <- predict(bfat.lm, newdata=bodyfat, interval="confidence")
# Prediction intervals for future observations at the observed data points
pred.bfat.plim <- suppressWarnings(predict(bfat.lm, newdata=bodyfat, interval="prediction"))
plot(bodyfat$logskin, bodyfat$logbfat, xlab="log Skin", ylab="log Fat")
abline(bfat.lm, col=5)
lines(log(bodyfat$Skinfold), pred.bfat.clim[,2], lty=2, col=2)
lines(log(bodyfat$Skinfold), pred.bfat.clim[,3], lty=2, col=2)
lines(log(bodyfat$Skinfold), pred.bfat.plim[,2], lty=4, col=3)
lines(log(bodyfat$Skinfold), pred.bfat.plim[,3], lty=4, col=3)
title("Scatter plot with the fitted line and prediction intervals")
symb <- c("Fitted line", "95% CI for mean", "95% CI for observation")
## legend(locator(1), legend = symb, lty = c(1, 2, 4), col = c(5, 2, 3))
# Shows where we predicted earlier
abline(v=log(70))
bombhits: Number and frequency of bomb hits in London
This data set contains the number of bomb hits in 576 areas in London during World War II. Data sourced from Shaw and Shaw (2019); see also Hand et al. (1993). This data set has been used to illustrate elementary concepts of statistical inference in Chapter 9 of the book Sahu (2024).
summary(bombhits)
#> numberhit freq
#> Min. :0.00 Min. : 1.0
#> 1st Qu.:1.25 1st Qu.: 14.0
#> Median :2.50 Median : 64.0
#> Mean :2.50 Mean : 96.0
#> 3rd Qu.:3.75 3rd Qu.:181.5
#> Max. :5.00 Max. :229.0
# Create a vector of data
x <- c(rep(0, 229), rep(1, 211), rep(2, 93), rep(3, 35), rep(4, 7), 5)
y <- c(229, 211, 93, 35, 7, 1) # Frequencies
rel_freq <- y/576
xbar <- mean(x) # 0.9288; the sample mean estimates the Poisson parameter lambda
pois_prob <- dpois(x=0:5, lambda=xbar)
fit_freq <- pois_prob * 576
#Check
cbind(x=0:5, obs_freq=y, rel_freq =round(rel_freq, 4),
Poisson_prob=round(pois_prob, 4), fit_freq=round(fit_freq, 1))
#> x obs_freq rel_freq Poisson_prob fit_freq
#> [1,] 0 229 0.3976 0.3950 227.5
#> [2,] 1 211 0.3663 0.3669 211.3
#> [3,] 2 93 0.1615 0.1704 98.1
#> [4,] 3 35 0.0608 0.0528 30.4
#> [5,] 4 7 0.0122 0.0122 7.1
#> [6,] 5 1 0.0017 0.0023 1.3
obs_freq <- y
xuniques <- 0:5
a <- data.frame(xuniques=0:5, obs_freq =y, fit_freq=fit_freq)
barplot(rbind(obs_freq, fit_freq),
args.legend = list(x = "topright"),
xlab="No of bomb hits",
names.arg = xuniques, beside=TRUE,
col=c("darkblue","red"),
legend =c("Observed", "Fitted"),
main="Observed and Poisson distribution fitted frequencies
for the number of bomb hits in London")
cement: Breaking strength of cement
This data set contains the breaking strength of cement, together with codes for the gauger and the breaker involved in each test. It has been used to illustrate analysis of variance in Chapter 19 of the book Sahu (2024).
summary(cement)
#> strength gauger breaker
#> Min. :4160 Min. :1 Min. :1
#> 1st Qu.:4660 1st Qu.:1 1st Qu.:1
#> Median :5100 Median :2 Median :2
#> Mean :5118 Mean :2 Mean :2
#> 3rd Qu.:5565 3rd Qu.:3 3rd Qu.:3
#> Max. :6200 Max. :3 Max. :3
boxplot(data=cement, strength~gauger, col=1:3)
boxplot(data=cement, strength~breaker, col=1:3)
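To illustrate the analysis of variance mentioned above, a minimal sketch is given below; it treats the numerically coded gauger and breaker columns as factors:
cem <- cement
cem$gauger <- factor(cem$gauger)
cem$breaker <- factor(cem$breaker)
anova(lm(strength ~ gauger + breaker, data=cem))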
cfail: Number of weekly computer failures
This data set contains the weekly number of failures of a university computer system over a period of two years. Source: Hand et al. (1993). Like the bombhits example, this data set has been used to illustrate elementary concepts of statistical inference in Chapter 9 of the book Sahu (2024).
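No code illustration is included above; assuming cfail is the numeric vector of weekly failure counts, a minimal first look is:
summary(cfail)
table(cfail) # frequency distribution of the weekly failure counts
# For a Poisson model the sample mean and variance should be comparable
mean(cfail)
var(cfail)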
cheese: Taste of cheese
This data set comes from a cheese-tasting experiment. It has been used to illustrate multiple linear regression modelling in Chapter 18 of the book Sahu (2024).
summary(cheese)
#> Taste AceticAcid H2S LacticAcid logH2S
#> Min. : 0.70 Min. : 87.97 Min. : 20.01 Min. :0.860 Min. : 2.996
#> 1st Qu.:13.55 1st Qu.:188.20 1st Qu.: 53.74 1st Qu.:1.250 1st Qu.: 3.977
#> Median :20.95 Median :227.03 Median : 207.46 Median :1.450 Median : 5.329
#> Mean :24.53 Mean :284.18 Mean : 2702.98 Mean :1.442 Mean : 5.942
#> 3rd Qu.:36.70 3rd Qu.:358.84 3rd Qu.: 1950.35 3rd Qu.:1.667 3rd Qu.: 7.575
#> Max. :57.20 Max. :637.78 Max. :26876.31 Max. :2.010 Max. :10.199
pairs(cheese)
GGally::ggpairs(data=cheese)
cheese.lm <- lm(Taste ~ AceticAcid + LacticAcid + logH2S, data=cheese, subset=2:30)
# Check the diagnostics
plot(cheese.lm$fit, cheese.lm$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
# Should be a random scatter
qqnorm(cheese.lm$res, col=2)
qqline(cheese.lm$res, col="blue")
summary(cheese.lm)
#>
#> Call:
#> lm(formula = Taste ~ AceticAcid + LacticAcid + logH2S, data = cheese,
#> subset = 2:30)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -18.0718 -5.9432 0.3657 5.3919 25.0876
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -31.41789 9.95702 -3.155 0.00414 **
#> AceticAcid 0.00478 0.01484 0.322 0.74997
#> LacticAcid 21.70560 8.69079 2.498 0.01945 *
#> logH2S 3.84994 1.21284 3.174 0.00396 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 10.06 on 25 degrees of freedom
#> Multiple R-squared: 0.6631, Adjusted R-squared: 0.6227
#> F-statistic: 16.4 on 3 and 25 DF, p-value: 4.222e-06
cheese.lm2 <- lm(Taste ~ LacticAcid + logH2S, data=cheese)
# Check the diagnostics
plot(cheese.lm2$fit, cheese.lm2$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
qqnorm(cheese.lm2$res, col=2)
qqline(cheese.lm2$res, col="blue")
summary(cheese.lm2)
#>
#> Call:
#> lm(formula = Taste ~ LacticAcid + logH2S, data = cheese)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.343 -6.529 -1.164 4.844 25.618
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -27.592 8.982 -3.072 0.00481 **
#> LacticAcid 19.888 7.959 2.499 0.01885 *
#> logH2S 3.946 1.136 3.475 0.00174 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9.942 on 27 degrees of freedom
#> Multiple R-squared: 0.6517, Adjusted R-squared: 0.6259
#> F-statistic: 25.26 on 2 and 27 DF, p-value: 6.552e-07
# How can we predict?
newcheese <- data.frame(AceticAcid = 300, LacticAcid = 1.5, logH2S=4)
cheese.pred <- predict(cheese.lm2, newdata=newcheese, se.fit=TRUE)
cheese.pred
#> $fit
#> 1
#> 18.02409
#>
#> $se.fit
#> [1] 3.111924
#>
#> $df
#> [1] 27
#>
#> $residual.scale
#> [1] 9.942372
# Obtain confidence interval
cheese.pred$fit + c(-1, 1) * qt(0.975, df=27) * cheese.pred$se.fit
#> [1] 11.63895 24.40923
# Using R to predict
cheese.pred.conf.limits <- predict(cheese.lm2, newdata=newcheese, interval="confidence")
cheese.pred.conf.limits
#> fit lwr upr
#> 1 18.02409 11.63895 24.40923
# How to find prediction interval
cheese.pred.pred.limits <- predict(cheese.lm2, newdata=newcheese, interval="prediction")
cheese.pred.pred.limits
#> fit lwr upr
#> 1 18.02409 -3.351891 39.40007
emissions: Exhaust emissions of cars
Data on the nitrous oxide content of exhaust emissions from a set of cars were collected by the Australian Traffic Accident Research Bureau to explore the relationship between several measures of nitrous oxide emissions. Like the cheese data set, this one has been used to illustrate multiple linear regression modelling in Chapter 18 of the book Sahu (2024).
summary(emissions)
#> Make Odometer Capacity CS505 T867
#> Length:54 Min. : 7285 Min. :1.300 Min. : 3.675 Min. : 1.158
#> Class :character 1st Qu.: 44208 1st Qu.:1.600 1st Qu.: 8.364 1st Qu.: 4.188
#> Mode :character Median : 80164 Median :2.000 Median :10.598 Median : 6.769
#> Mean : 84444 Mean :2.346 Mean :10.949 Mean : 7.217
#> 3rd Qu.:117248 3rd Qu.:2.900 3rd Qu.:13.125 3rd Qu.:10.001
#> Max. :232574 Max. :5.000 Max. :21.475 Max. :15.982
#> H505 ADR27 ADR37 logCS505 logT867
#> Min. : 1.340 Min. :0.480 Min. :0.340 Min. :1.302 Min. :0.1466
#> 1st Qu.: 4.884 1st Qu.:1.053 1st Qu.:1.005 1st Qu.:2.124 1st Qu.:1.4322
#> Median : 8.795 Median :1.475 Median :1.423 Median :2.361 Median :1.9104
#> Mean : 9.376 Mean :1.525 Mean :1.478 Mean :2.318 Mean :1.8185
#> 3rd Qu.:12.906 3rd Qu.:1.981 3rd Qu.:2.029 3rd Qu.:2.575 3rd Qu.:2.3026
#> Max. :21.465 Max. :2.851 Max. :2.862 Max. :3.067 Max. :2.7715
#> logH505 logADR27 logADR37
#> Min. :0.2927 Min. :-0.73397 Min. :-1.078810
#> 1st Qu.:1.5850 1st Qu.: 0.05191 1st Qu.: 0.004947
#> Median :2.1742 Median : 0.38857 Median : 0.352585
#> Mean :2.0440 Mean : 0.33895 Mean : 0.290615
#> 3rd Qu.:2.5577 3rd Qu.: 0.68382 3rd Qu.: 0.707568
#> Max. :3.0664 Max. : 1.04765 Max. : 1.051611
rawdata <- emissions[, c(8, 4:7)]
pairs(rawdata)
# Fit the model on the raw scale
raw.lm <- lm(ADR37 ~ ADR27 + CS505 + T867 + H505, data=rawdata)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2,1))
plot(raw.lm$fit, raw.lm$res,xlab="Fitted values",ylab="Residuals", main="Anscombe plot")
abline(h=0)
qqnorm(raw.lm$res,main="Normal probability plot", col=2)
qqline(raw.lm$res, col="blue")
# summary(raw.lm)
logdata <- log(rawdata)
# This only logs the values but not the column names!
# We can use the following command to change the column names or you can use
# fix(logdata) to do it.
dimnames(logdata)[[2]] <- c("logADR37", "logCS505", "logT867", "logH505", "logADR27")
pairs(logdata)
log.lm <- lm(logADR37 ~ logADR27 + logCS505 + logT867 + logH505, data=logdata)
plot(log.lm$fit, log.lm$res,xlab="Fitted values",ylab="Residuals", main="Anscombe plot")
abline(h=0)
qqnorm(log.lm$res,main="Normal probability plot", col=2)
qqline(log.lm$res, col="blue")
summary(log.lm)
#>
#> Call:
#> lm(formula = logADR37 ~ logADR27 + logCS505 + logT867 + logH505,
#> data = logdata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.140390 -0.038389 -0.003712 0.040204 0.153275
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.12997 0.43073 -0.302 0.764119
#> logADR27 0.94173 0.24140 3.901 0.000292 ***
#> logCS505 -0.16790 0.16086 -1.044 0.301712
#> logT867 0.15736 0.08986 1.751 0.086177 .
#> logH505 0.09997 0.01857 5.385 2.04e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06038 on 49 degrees of freedom
#> Multiple R-squared: 0.9852, Adjusted R-squared: 0.984
#> F-statistic: 815.9 on 4 and 49 DF, p-value: < 2.2e-16
log.lm2 <- lm(logADR37 ~ logADR27 + logT867 + logH505, data=logdata)
summary(log.lm2)
#>
#> Call:
#> lm(formula = logADR37 ~ logADR27 + logT867 + logH505, data = logdata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.157932 -0.039221 -0.004817 0.038672 0.157626
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.57484 0.06231 -9.226 2.25e-12 ***
#> logADR27 0.69588 0.05289 13.157 < 2e-16 ***
#> logT867 0.24471 0.03275 7.472 1.10e-09 ***
#> logH505 0.09030 0.01610 5.608 8.85e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06043 on 50 degrees of freedom
#> Multiple R-squared: 0.9849, Adjusted R-squared: 0.984
#> F-statistic: 1086 on 3 and 50 DF, p-value: < 2.2e-16
plot(log.lm2$fit, log.lm2$res,xlab="Fitted values",ylab="Residuals", main="Anscombe plot")
abline(h=0)
qqnorm(log.lm2$res,main="Normal probability plot", col=2)
qqline(log.lm2$res, col="blue")
par(old.par)
#####################################
# Multicollinearity Analysis
######################################
mod.adr27 <- lm(logADR27 ~ logT867 + logCS505 + logH505, data=logdata)
summary(mod.adr27) # Multiple R^2 = 0.9936,
#>
#> Call:
#> lm(formula = logADR27 ~ logT867 + logCS505 + logH505, data = logdata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.060363 -0.020065 -0.005929 0.015471 0.153280
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.77188 0.02975 -59.56 < 2e-16 ***
#> logT867 0.36474 0.01052 34.66 < 2e-16 ***
#> logCS505 0.65020 0.02063 31.52 < 2e-16 ***
#> logH505 -0.02901 0.01007 -2.88 0.00584 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.03537 on 50 degrees of freedom
#> Multiple R-squared: 0.9936, Adjusted R-squared: 0.9932
#> F-statistic: 2589 on 3 and 50 DF, p-value: < 2.2e-16
mod.t867 <- lm(logT867 ~ logADR27 + logH505 + logCS505, data=logdata)
summary(mod.t867) # Multiple R^2 = 0.977,
#>
#> Call:
#> lm(formula = logT867 ~ logADR27 + logH505 + logCS505, data = logdata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.46512 -0.04505 0.01989 0.05619 0.16509
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.64307 0.16840 27.572 <2e-16 ***
#> logADR27 2.63215 0.07594 34.663 <2e-16 ***
#> logH505 0.07192 0.02739 2.626 0.0114 *
#> logCS505 -1.66719 0.09218 -18.086 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.09503 on 50 degrees of freedom
#> Multiple R-squared: 0.977, Adjusted R-squared: 0.9757
#> F-statistic: 709.1 on 3 and 50 DF, p-value: < 2.2e-16
mod.cs505 <- lm(logCS505 ~ logADR27 + logH505 + logT867, data=logdata)
summary(mod.cs505) # Multiple R^2 = 0.9837,
#>
#> Call:
#> lm(formula = logCS505 ~ logADR27 + logH505 + logT867, data = logdata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.197444 -0.025126 0.008913 0.028843 0.104478
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.64960 0.05473 48.411 < 2e-16 ***
#> logADR27 1.46430 0.04646 31.519 < 2e-16 ***
#> logH505 0.05760 0.01415 4.072 0.000166 ***
#> logT867 -0.52029 0.02877 -18.086 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.05309 on 50 degrees of freedom
#> Multiple R-squared: 0.9837, Adjusted R-squared: 0.9828
#> F-statistic: 1008 on 3 and 50 DF, p-value: < 2.2e-16
mod.h505 <- lm(logH505 ~ logADR27 + logCS505 + logT867, data=logdata)
summary(mod.h505) # Multiple R^2 = 0.5784,
#>
#> Call:
#> lm(formula = logH505 ~ logADR27 + logCS505 + logT867, data = logdata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.50570 -0.11784 0.07984 0.24783 0.85079
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -9.3778 3.0009 -3.125 0.002959 **
#> logADR27 -4.9041 1.7029 -2.880 0.005844 **
#> logCS505 4.3236 1.0618 4.072 0.000166 ***
#> logT867 1.6848 0.6417 2.626 0.011444 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4599 on 50 degrees of freedom
#> Multiple R-squared: 0.5784, Adjusted R-squared: 0.5531
#> F-statistic: 22.87 on 3 and 50 DF, p-value: 1.849e-09
# Variance inflation factors
vifs <- c(0.9936, 0.977, 0.9837, 0.5784) # R^2 values from the four regressions above
vifs <- 1/(1-vifs) # gives 156.25, 43.48, 61.35, 2.37
#Condition numbers
X <- logdata
# X is a copy of logdata
X[,1] <- 1
# the first column of X is 1
# this is for the intercept
X <- as.matrix(X)
# Coerces X to be a matrix
xtx <- t(X) %*% X # Gives X^T X
eigenvalues <- eigen(xtx)$values
kappa <- max(eigenvalues)/min(eigenvalues)
kappa <- sqrt(kappa)
# kappa = 244 is much LARGER than 30!
### Validation statistic
# Fit the log.lm2 model with the first 45 observations
# use the fitted model to predict the remaining 9 observations
# Calculate the mean square error validation statistic
log.lmsub <- lm(logADR37 ~ logADR27 + logT867 + logH505, data=logdata, subset=1:45)
# Now predict all 54 observations using the fitted model
mod.pred <- predict(log.lmsub, logdata, se.fit=TRUE)
mod.pred$fit # provides all the 54 predicted values
#> 1 2 3 4 5 6 7
#> 0.869733716 0.751945766 0.464377364 0.065106368 -0.750908661 0.676749888 0.863161161
#> 8 9 10 11 12 13 14
#> 0.533123773 0.724024242 0.101489732 0.758165626 -0.009235144 0.262467812 0.499359749
#> 15 16 17 18 19 20 21
#> 0.896844370 -0.512513262 0.860758626 0.288778088 0.638085289 0.691680440 0.047967101
#> 22 23 24 25 26 27 28
#> -0.125585969 0.542661299 -0.298423547 -0.607263492 0.529236907 0.296110240 0.476668407
#> 29 30 31 32 33 34 35
#> -0.343488224 0.286358929 0.732443474 0.740078857 -0.002182982 0.421013296 -0.201423341
#> 36 37 38 39 40 41 42
#> 0.002081194 0.782644479 -0.305494632 -0.978948479 0.130633957 0.245486987 0.071403922
#> 43 44 45 46 47 48 49
#> 0.781436122 0.341921707 0.195907253 0.425131377 -0.094842350 0.754611045 0.444664924
#> 50 51 52 53 54
#> 0.353051544 0.805738645 -0.303182437 -0.356706826 1.052803842
logdata$pred <- mod.pred$fit
# Get only last 9
a <- logdata[46:54, ]
validation.residuals <- a$logADR37 - a$pred
validation.stat <- mean(validation.residuals^2)
validation.stat
#> [1] 0.005814136
err_age: Error in guessing ages from photographs
This data set contains the errors made in guessing the ages of 10 Southampton mathematicians from their photographs. It has been used as an exercise in obtaining summary statistics and performing exploratory data analysis in Chapter 2 of the book Sahu (2024).
summary(err_age)
#> group size females photo sex
#> Min. : 1 Min. :2.000 Min. :0 Min. : 1.0 Length:550
#> 1st Qu.:14 1st Qu.:2.000 1st Qu.:0 1st Qu.: 3.0 Class :character
#> Median :28 Median :3.000 Median :1 Median : 5.5 Mode :character
#> Mean :28 Mean :2.727 Mean :1 Mean : 5.5
#> 3rd Qu.:42 3rd Qu.:3.000 3rd Qu.:2 3rd Qu.: 8.0
#> Max. :55 Max. :3.000 Max. :3 Max. :10.0
#> race est_age tru_age error abs_error
#> Length:550 Min. :15.00 Min. :22.0 Min. :-23.0000 Min. : 0.000
#> Class :character 1st Qu.:23.00 1st Qu.:22.0 1st Qu.: -4.0000 1st Qu.: 2.000
#> Mode :character Median :33.00 Median :35.5 Median : 0.0000 Median : 3.000
#> Mean :38.56 Mean :39.5 Mean : -0.9436 Mean : 5.336
#> 3rd Qu.:55.00 3rd Qu.:55.0 3rd Qu.: 3.0000 3rd Qu.: 7.000
#> Max. :80.00 Max. :73.0 Max. : 18.0000 Max. :23.000
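As an exploratory sketch, using the columns shown in the summary above, the reader may try:
hist(err_age$error, xlab="Error in guessing age", main="Histogram of the errors")
boxplot(error ~ sex, data=err_age, col=2:3, xlab="Sex", ylab="Error")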
ffood: Service times in a fast food restaurant
This data set contains service (waiting) times (in seconds) of customers at a fast-food restaurant. Source: unknown.
summary(ffood)
#> AM PM
#> Min. : 38.00 Min. :45.00
#> 1st Qu.: 53.75 1st Qu.:59.75
#> Median : 63.50 Median :67.00
#> Mean : 68.90 Mean :66.80
#> 3rd Qu.: 83.75 3rd Qu.:74.25
#> Max. :107.00 Max. :88.00
# 95% Confidence interval for the mean waiting time using the t-distribution
a <- c(ffood$AM, ffood$PM)
mean(a) + c(-1, 1) * qt(0.975, df=19) * sqrt(var(a))/sqrt(20)
#> [1] 59.25467 76.44533
# Two sample t-test for the difference between morning and afternoon times
t.test(ffood$AM, ffood$PM)
#>
#> Welch Two Sample t-test
#>
#> data: ffood$AM and ffood$PM
#> t = 0.24929, df = 14.201, p-value = 0.8067
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -15.9434 20.1434
#> sample estimates:
#> mean of x mean of y
#> 68.9 66.8
gasmileage: Gas mileage of cars
This data set contains gas mileage obtained from four models of a car. It has been used to illustrate the concepts of analysis of variance in Chapter 19 of the book Sahu (2024).
summary(gasmileage)
#> mileage model
#> Min. :22.00 A:2
#> 1st Qu.:24.00 B:4
#> Median :27.00 C:3
#> Mean :26.55 D:2
#> 3rd Qu.:28.50
#> Max. :32.00
y <- c(22, 26, 28, 24, 29, 29, 32, 28, 23, 24)
xx <- c(1,1,2,2,2,3,3,3,4,4)
# Plot the observations
plot(xx, y, col="red", pch="*", xlab="Model", ylab="Mileage")
# Method 1: Hand calculation
ni <- c(2, 3, 3, 2)
means <- tapply(y, xx, mean)
vars <- tapply(y, xx, var)
round(rbind(means, vars), 2)
#> 1 2 3 4
#> means 24 27 29.67 23.5
#> vars 8 7 4.33 0.5
sum(y^2) # gives 7115
#> [1] 7115
totalSS <- sum(y^2) - 10 * (mean(y))^2 # gives 92.5
RSSf <- sum(vars*(ni-1)) # gives 31.17
groupSS <- totalSS - RSSf # gives 61.33
meangroupSS <- groupSS/3 # gives 20.44
meanErrorSS <- RSSf/6 # gives 5.194
Fvalue <- meangroupSS/meanErrorSS # gives 3.936
pvalue <- 1-pf(Fvalue, df1=3, df2=6) # gives 0.0723
#### Method 2: Illustrate using dummy variables
#################################
#Create the design matrix X for the full regression model
g <- 4
n1 <- 2
n2 <- 3
n3 <- 3
n4 <- 2
n <- n1+n2+n3+n4
X <- matrix(0, ncol=g, nrow=n) #Set X as a zero matrix initially
X[1:n1,1] <- 1 #Determine the first column of X
X[(n1+1):(n1+n2),2] <- 1 #the 2nd column
X[(n1+n2+1):(n1+n2+n3),3] <- 1 #the 3rd
X[(n1+n2+n3+1):(n1+n2+n3+n4),4] <- 1 #the 4th
#################################
####Fitting the full model####
#################################
#Estimation
XtXinv <- solve(t(X)%*%X)
betahat <- XtXinv %*%t(X)%*%y #Estimation of the coefficients
Yhat <- X%*%betahat #Fitted Y values
Resids <- y - Yhat #Residuals
SSE <- sum(Resids^2) #Error sum of squares
S2hat <- SSE/(n-g) #Estimation of sigma-square; mean square for error
Sigmahat <- sqrt(S2hat)
##############################################################
####Fitting the reduced model -- the 4 means are equal #####
##############################################################
Xr <- matrix(1, ncol=1, nrow=n)
kr <- dim(Xr)[2]
#Estimation
Varr <- solve(t(Xr)%*%Xr)
hbetar <- solve(t(Xr)%*%Xr)%*%t(Xr)%*% y #Estimation of the coefficients
hYr = Xr%*%hbetar #Fitted Y values
Resir <- y - hYr #Residuals
SSEr <- sum(Resir^2) #Total sum of squares
###################################################################
####F-test for comparing the reduced model with the full model ####
###################################################################
FStat <- ((SSEr-SSE)/(g-kr))/(SSE/(n-g)) #The test statistic of the F-test
alpha <- 0.05
Critical_value_F <- qf(1-alpha, g-kr,n-g) #The critical constant of F-test
pvalue_F <- 1-pf(FStat,g-kr, n-g) #p-value of F-test
modelA <- c(22, 26)
modelB <- c(28, 24, 29)
modelC <- c(29, 32, 28)
modelD <- c(23, 24)
SSerror <- sum( (modelA-mean(modelA))^2 ) + sum( (modelB-mean(modelB))^2 ) +
  sum( (modelC-mean(modelC))^2 ) + sum( (modelD-mean(modelD))^2 )
SSerror
#> [1] 31.16667
SStotal <- sum( (y-mean(y))^2 )
SSgroup <- SStotal-SSerror
####
#### Method 3: Use the built-in function lm directly
#####################################
aa <- "modelA"
bb <- "modelB"
cc <- "modelC"
dd <- "modelD"
Expl <- c(aa,aa,bb,bb,bb,cc,cc,cc,dd,dd)
is.factor(Expl)
#> [1] FALSE
Expl <- factor(Expl)
model1 <- lm(y~Expl)
summary(model1)
#>
#> Call:
#> lm(formula = y ~ Expl)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.000 -1.417 0.000 1.750 2.333
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24.000 1.612 14.892 5.77e-06 ***
#> ExplmodelB 3.000 2.081 1.442 0.1994
#> ExplmodelC 5.667 2.081 2.724 0.0345 *
#> ExplmodelD -0.500 2.279 -0.219 0.8336
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.279 on 6 degrees of freedom
#> Multiple R-squared: 0.6631, Adjusted R-squared: 0.4946
#> F-statistic: 3.936 on 3 and 6 DF, p-value: 0.07227
anova(model1)
#> Analysis of Variance Table
#>
#> Response: y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Expl 3 61.333 20.4444 3.9358 0.07227 .
#> Residuals 6 31.167 5.1944
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
### Alternatively ###
xxf <- factor(xx)
is.factor(xxf)
#> [1] TRUE
model2 <- lm(y~xxf)
summary(model2)
#>
#> Call:
#> lm(formula = y ~ xxf)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.000 -1.417 0.000 1.750 2.333
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24.000 1.612 14.892 5.77e-06 ***
#> xxf2 3.000 2.081 1.442 0.1994
#> xxf3 5.667 2.081 2.724 0.0345 *
#> xxf4 -0.500 2.279 -0.219 0.8336
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.279 on 6 degrees of freedom
#> Multiple R-squared: 0.6631, Adjusted R-squared: 0.4946
#> F-statistic: 3.936 on 3 and 6 DF, p-value: 0.07227
anova(model2)
#> Analysis of Variance Table
#>
#> Response: y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> xxf 3 61.333 20.4444 3.9358 0.07227 .
#> Residuals 6 31.167 5.1944
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
possum: Body weight and length of possums in Australian regions
This data set contains the body weight and length of possums (tree-living, mostly nocturnal marsupial animals) caught in 7 different regions of Australia. Source: Lindenmayer and Donnelly (1995). It has been used to illustrate multiple linear regression modelling and analysis of variance in Chapter 19 of the book Sahu (2024).
head(possum)
#> Body_Weight Length Location
#> 1 3.5 89.0 1
#> 2 3.3 91.5 1
#> 3 3.5 95.5 1
#> 4 3.1 92.0 1
#> 5 2.8 85.5 1
#> 6 3.0 90.5 1
dim(possum)
#> [1] 101 3
summary(possum)
#> Body_Weight Length Location
#> Min. :1.80 Min. :75.00 Min. :1.000
#> 1st Qu.:2.60 1st Qu.:84.00 1st Qu.:1.000
#> Median :2.80 Median :88.00 Median :3.000
#> Mean :2.88 Mean :87.16 Mean :3.604
#> 3rd Qu.:3.20 3rd Qu.:90.00 3rd Qu.:6.000
#> Max. :4.20 Max. :96.50 Max. :7.000
## Code lines used in the book
## Create a new data set
poss <- possum
poss$region <- factor(poss$Location)
levels(poss$region) <- c("W.A", "S.A", "N.T", "QuL", "NSW", "Vic", "Tas")
colnames(poss)<-c("y","z","Location", "x")
head(poss)
#> y z Location x
#> 1 3.5 89.0 1 W.A
#> 2 3.3 91.5 1 W.A
#> 3 3.5 95.5 1 W.A
#> 4 3.1 92.0 1 W.A
#> 5 2.8 85.5 1 W.A
#> 6 3.0 90.5 1 W.A
# Draw side by side boxplots
boxplot(y~x, data=poss, col=2:8, xlab="region", ylab="Weight")
# Obtain scatter plot
# Start with a skeleton plot
plot(poss$z, poss$y, type="n", xlab="Length", ylab="Weight")
# Add points for the seven regions
for (i in 1:7) {
points(poss$z[poss$Location==i],poss$y[poss$Location==i],type="p", pch=as.character(i), col=i)
}
## Add legends
legend(x=76, y=4.2, legend=paste(as.character(1:7), levels(poss$x)), lty=1:7, col=1:7)
# Start modelling
#Fit the model with interaction.
poss.lm1<-lm(y~z+x+z:x,data=poss)
summary(poss.lm1)
#>
#> Call:
#> lm(formula = y ~ z + x + z:x, data = poss)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.67953 -0.15793 0.01999 0.14591 0.78255
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3.5842639 1.5008154 -2.388 0.0191 *
#> z 0.0748795 0.0167202 4.478 2.27e-05 ***
#> xS.A 0.3283443 2.3839136 0.138 0.8908
#> xN.T 6.0527549 3.3898871 1.786 0.0777 .
#> xQuL -5.9670098 4.5482536 -1.312 0.1930
#> xNSW -2.4035332 2.7983463 -0.859 0.3927
#> xVic 0.8984502 2.4135144 0.372 0.7106
#> xTas -0.1263706 2.4142962 -0.052 0.9584
#> z:xS.A -0.0031036 0.0280484 -0.111 0.9121
#> z:xN.T -0.0675469 0.0383301 -1.762 0.0815 .
#> z:xQuL 0.0663129 0.0494622 1.341 0.1835
#> z:xNSW 0.0256881 0.0318912 0.805 0.4227
#> z:xVic -0.0141738 0.0279034 -0.508 0.6128
#> z:xTas -0.0003129 0.0276672 -0.011 0.9910
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2828 on 87 degrees of freedom
#> Multiple R-squared: 0.6731, Adjusted R-squared: 0.6243
#> F-statistic: 13.78 on 13 and 87 DF, p-value: 4.8e-16
plot(poss$z, poss$y,type="n", xlab="Length", ylab="Weight")
for (i in 1:7) {
lines(poss$z[poss$Location==i],poss.lm1$fit[poss$Location==i],type="l",
lty=i, col=i, lwd=1.8)
points(poss$z[poss$Location==i],poss$y[poss$Location==i],type="p",
pch=as.character(i), col=i)
}
poss.lm0 <- lm(y~z, data=poss)
abline(poss.lm0, lwd=3, col=9)
# Adds the single overall regression line fitted without any regional effects
legend(x=76, y=4.2, legend=paste(as.character(1:7), levels(poss$x)),
lty=1:7, col=1:7)
n <- length(possum$Body_Weight)
# Wrong model since Location is not a numeric covariate
wrong.lm <- lm(Body_Weight~Location, data=possum)
summary(wrong.lm)
#>
#> Call:
#> lm(formula = Body_Weight ~ Location, data = possum)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.20753 -0.21060 0.01309 0.21309 1.35124
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.16630 0.07740 40.910 < 2e-16 ***
#> Location -0.07939 0.01801 -4.409 2.64e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4239 on 99 degrees of freedom
#> Multiple R-squared: 0.1641, Adjusted R-squared: 0.1557
#> F-statistic: 19.44 on 1 and 99 DF, p-value: 2.643e-05
nis <- table(possum$Location)
meanwts <- tapply(possum$Body_Weight, possum$Location, mean)
varwts <- tapply(possum$Body_Weight, possum$Location, var)
datasums <- data.frame(nis=nis, mean=meanwts, var=varwts)
modelss <- sum(datasums[,2] * (meanwts - mean(meanwts))^2)
residss <- sum( (datasums[,2] - 1) * varwts)
fvalue <- (modelss/6) / (residss/94)
fcritical <- qf(0.95, df1= 6, df2=94)
x <- seq(from=0, to=12, length=200)
y <- df(x, df1=6, df2=94)
plot(x, y, type="l", xlab="", ylab="Density of F(6, 94)", col=4)
abline(v=fcritical, lty=3, col=3)
abline(v=fvalue, lty=2, col=2)
pvalue <- 1-pf(fvalue, df1=6, df2=94)
### Doing the above in R
# Convert the Location column to a factor
localpossum <- possum
localpossum$Location <- as.factor(localpossum$Location)
summary(localpossum) # Now Location is a factor
#> Body_Weight Length Location
#> Min. :1.80 Min. :75.00 1:33
#> 1st Qu.:2.60 1st Qu.:84.00 2:12
#> Median :2.80 Median :88.00 3: 7
#> Mean :2.88 Mean :87.16 4: 6
#> 3rd Qu.:3.20 3rd Qu.:90.00 5:13
#> Max. :4.20 Max. :96.50 6:13
#> 7:17
# Put the identifiability constraint:
options(contrasts=c("contr.treatment", "contr.poly"))
# Change to have easier column names
colnames(localpossum) <- c("y", "z", "x")
# Fit model M1
possum.lm1 <- lm(y~x, data=localpossum)
summary(possum.lm1)
#>
#> Call:
#> lm(formula = y ~ x, data = localpossum)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.84167 -0.23333 0.01765 0.21765 0.96667
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.13333 0.06402 48.943 < 2e-16 ***
#> x2 -0.49167 0.12397 -3.966 0.000143 ***
#> x3 -0.01905 0.15304 -0.124 0.901214
#> x4 0.33333 0.16322 2.042 0.043929 *
#> x5 -0.37949 0.12043 -3.151 0.002182 **
#> x6 -0.68718 0.12043 -5.706 1.34e-07 ***
#> x7 -0.45098 0.10979 -4.108 8.53e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3678 on 94 degrees of freedom
#> Multiple R-squared: 0.4026, Adjusted R-squared: 0.3644
#> F-statistic: 10.56 on 6 and 94 DF, p-value: 6.204e-09
anova(possum.lm1)
#> Analysis of Variance Table
#>
#> Response: y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> x 6 8.5667 1.42778 10.556 6.204e-09 ***
#> Residuals 94 12.7137 0.13525
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
possum.lm2 <- lm(y~z, data=localpossum)
summary(possum.lm2)
#>
#> Call:
#> lm(formula = y ~ z, data = localpossum)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.77843 -0.19072 -0.02632 0.20928 0.82708
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -4.284148 0.633108 -6.767 9.36e-10 ***
#> z 0.082202 0.007256 11.329 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3059 on 99 degrees of freedom
#> Multiple R-squared: 0.5646, Adjusted R-squared: 0.5602
#> F-statistic: 128.4 on 1 and 99 DF, p-value: < 2.2e-16
anova(possum.lm2)
#> Analysis of Variance Table
#>
#> Response: y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> z 1 12.0139 12.0139 128.35 < 2.2e-16 ***
#> Residuals 99 9.2665 0.0936
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Include both location and length but no interaction
possum.lm3 <- lm(y~x+z, data=localpossum)
summary(possum.lm3)
#>
#> Call:
#> lm(formula = y ~ x + z, data = localpossum)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.67242 -0.19970 0.01845 0.17516 0.78209
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3.390403 0.816320 -4.153 7.27e-05 ***
#> x2 0.057028 0.117868 0.484 0.62964
#> x3 0.100261 0.119312 0.840 0.40288
#> x4 0.152418 0.128260 1.188 0.23772
#> x5 -0.176672 0.096536 -1.830 0.07044 .
#> x6 -0.310958 0.104334 -2.980 0.00367 **
#> x7 -0.161791 0.092289 -1.753 0.08288 .
#> z 0.072719 0.009083 8.006 3.29e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2845 on 93 degrees of freedom
#> Multiple R-squared: 0.6463, Adjusted R-squared: 0.6197
#> F-statistic: 24.28 on 7 and 93 DF, p-value: < 2.2e-16
anova(possum.lm3)
#> Analysis of Variance Table
#>
#> Response: y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> x 6 8.5667 1.4278 17.643 1.521e-13 ***
#> z 1 5.1876 5.1876 64.102 3.287e-12 ***
#> Residuals 93 7.5262 0.0809
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Include interaction effect
possum.lm4 <- lm(y~x+z+x:z, data=localpossum)
summary(possum.lm4)
#>
#> Call:
#> lm(formula = y ~ x + z + x:z, data = localpossum)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.67953 -0.15793 0.01999 0.14591 0.78255
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3.5842639 1.5008154 -2.388 0.0191 *
#> x2 0.3283443 2.3839136 0.138 0.8908
#> x3 6.0527549 3.3898871 1.786 0.0777 .
#> x4 -5.9670098 4.5482536 -1.312 0.1930
#> x5 -2.4035332 2.7983463 -0.859 0.3927
#> x6 0.8984502 2.4135144 0.372 0.7106
#> x7 -0.1263706 2.4142962 -0.052 0.9584
#> z 0.0748795 0.0167202 4.478 2.27e-05 ***
#> x2:z -0.0031036 0.0280484 -0.111 0.9121
#> x3:z -0.0675469 0.0383301 -1.762 0.0815 .
#> x4:z 0.0663129 0.0494622 1.341 0.1835
#> x5:z 0.0256881 0.0318912 0.805 0.4227
#> x6:z -0.0141738 0.0279034 -0.508 0.6128
#> x7:z -0.0003129 0.0276672 -0.011 0.9910
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2828 on 87 degrees of freedom
#> Multiple R-squared: 0.6731, Adjusted R-squared: 0.6243
#> F-statistic: 13.78 on 13 and 87 DF, p-value: 4.8e-16
anova(possum.lm4)
#> Analysis of Variance Table
#>
#> Response: y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> x 6 8.5667 1.4278 17.8561 2.194e-13 ***
#> z 1 5.1876 5.1876 64.8768 3.833e-12 ***
#> x:z 6 0.5696 0.0949 1.1873 0.3206
#> Residuals 87 6.9565 0.0800
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(possum.lm2, possum.lm3)
#> Analysis of Variance Table
#>
#> Model 1: y ~ z
#> Model 2: y ~ x + z
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 99 9.2665
#> 2 93 7.5262 6 1.7404 3.5842 0.003065 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Check the diagnostics for M3
plot(possum.lm3$fit, possum.lm3$res,xlab="Fitted values",ylab="Residuals",
main="Anscombe plot")
abline(h=0)
qqnorm(possum.lm3$res,main="Normal probability plot", col=2)
qqline(possum.lm3$res, col="blue")
rm(localpossum)
rm(poss)
puffin: Nesting habits of puffins in Newfoundland
This object contains data regarding the nesting habits of the common puffin. Source: Nettleship (1972). This data set has been used to illustrate multiple linear regression modelling in Chapter 18 of the book Sahu (2024).
head(puffin)
#> Nesting_Frequency Grass_Cover Mean_Soil_Depth Slope_Angle Distance_from_Edge
#> 1 0 15 27.8 8 45
#> 2 0 0 41.9 8 54
#> 3 0 20 36.8 5 60
#> 4 0 30 37.7 8 42
#> 5 0 75 45.5 5 48
#> 6 0 15 51.4 8 54
dim(puffin)
#> [1] 38 5
summary(puffin)
#> Nesting_Frequency Grass_Cover Mean_Soil_Depth Slope_Angle Distance_from_Edge
#> Min. : 0.000 Min. : 0.00 Min. :24.30 Min. : 2.00 Min. : 3.00
#> 1st Qu.: 0.000 1st Qu.:40.00 1st Qu.:32.75 1st Qu.: 7.25 1st Qu.:18.00
#> Median : 7.500 Median :60.00 Median :37.40 Median :10.00 Median :30.00
#> Mean : 7.684 Mean :56.45 Mean :37.72 Mean :15.00 Mean :30.39
#> 3rd Qu.:12.750 3rd Qu.:80.00 3rd Qu.:42.83 3rd Qu.:21.25 3rd Qu.:44.25
#> Max. :25.000 Max. :95.00 Max. :51.40 Max. :38.00 Max. :60.00
pairs(puffin)
puffin$sqrtfreq <- sqrt(puffin$Nesting_Frequency)
puff.sqlm <- lm(sqrtfreq~ Grass_Cover + Mean_Soil_Depth + Slope_Angle
+Distance_from_Edge, data=puffin)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2,1))
qqnorm(puff.sqlm$res,main="Normal probability plot", col=2)
qqline(puff.sqlm$res, col="blue")
plot(puff.sqlm$fit, puff.sqlm$res,xlab="Fitted values",ylab="Residuals",
main="Anscombe plot", col="red")
abline(h=0)
summary(puff.sqlm)
#>
#> Call:
#> lm(formula = sqrtfreq ~ Grass_Cover + Mean_Soil_Depth + Slope_Angle +
#> Distance_from_Edge, data = puffin)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.17058 -0.28555 0.07289 0.43606 1.08192
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.4176153 0.7479664 5.906 1.27e-06 ***
#> Grass_Cover 0.0004179 0.0045698 0.091 0.9277
#> Mean_Soil_Depth 0.0486374 0.0181385 2.681 0.0114 *
#> Slope_Angle -0.0318777 0.0182695 -1.745 0.0903 .
#> Distance_from_Edge -0.1189727 0.0134970 -8.815 3.45e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6217 on 33 degrees of freedom
#> Multiple R-squared: 0.8853, Adjusted R-squared: 0.8714
#> F-statistic: 63.67 on 4 and 33 DF, p-value: 4.753e-15
par(old.par)
#####################################
# F test for two betas at the same time:
######################################
puff.sqlm2 <- lm(sqrtfreq~ Mean_Soil_Depth + Distance_from_Edge, data=puffin)
anova(puff.sqlm)
#> Analysis of Variance Table
#>
#> Response: sqrtfreq
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Grass_Cover 1 7.307 7.307 18.904 0.0001241 ***
#> Mean_Soil_Depth 1 0.820 0.820 2.122 0.1546449
#> Slope_Angle 1 60.278 60.278 155.945 4.725e-14 ***
#> Distance_from_Edge 1 30.034 30.034 77.700 3.453e-10 ***
#> Residuals 33 12.756 0.387
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(puff.sqlm2)
#> Analysis of Variance Table
#>
#> Response: sqrtfreq
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Mean_Soil_Depth 1 0.513 0.513 1.2596 0.2694
#> Distance_from_Edge 1 96.437 96.437 236.9547 <2e-16 ***
#> Residuals 35 14.245 0.407
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fval <- 1/2*(14.245-12.756)/0.387 # 1.924
qf(0.95, 2, 33) # 3.28
#> [1] 3.284918
1-pf(fval, 2, 33) # 0.162
#> [1] 0.1620835
anova(puff.sqlm2, puff.sqlm)
#> Analysis of Variance Table
#>
#> Model 1: sqrtfreq ~ Mean_Soil_Depth + Distance_from_Edge
#> Model 2: sqrtfreq ~ Grass_Cover + Mean_Soil_Depth + Slope_Angle + Distance_from_Edge
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 35 14.245
#> 2 33 12.756 2 1.4889 1.926 0.1618
rice: Data set on rice yield
This data set has been used to illustrate multiple linear regression modelling in Chapter 18 of the book Sahu (2024). It has been obtained from the journal research article Bal and Ojha (1975).
summary(rice)
#> Yield Days
#> Min. :2508 Min. :16.0
#> 1st Qu.:3092 1st Qu.:23.5
#> Median :3318 Median :31.0
#> Mean :3283 Mean :31.0
#> 3rd Qu.:3549 3rd Qu.:38.5
#> Max. :3883 Max. :46.0
plot(rice$Days, rice$Yield, pch="*", xlab="Days", ylab="Yield")
rice$daymin31 <- rice$Days-31
rice.lm <- lm(Yield ~ daymin31, data=rice)
summary(rice.lm)
#>
#> Call:
#> lm(formula = Yield ~ daymin31, data = rice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -691.07 -217.65 45.85 271.77 612.14
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3283.12 103.95 31.582 2.05e-14 ***
#> daymin31 12.26 11.28 1.088 0.295
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 415.8 on 14 degrees of freedom
#> Multiple R-squared: 0.07791, Adjusted R-squared: 0.01205
#> F-statistic: 1.183 on 1 and 14 DF, p-value: 0.2951
# Check the diagnostics
plot(rice.lm$fit, rice.lm$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
# Should be a random scatter
# Needs a quadratic term
qqnorm(rice.lm$res, col=2)
qqline(rice.lm$res, col="blue")
rice.lm2 <- lm(Yield ~ daymin31 + I(daymin31^2) , data=rice)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(1, 2))
plot(rice.lm2$fit, rice.lm2$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
# Should be a random scatter
# Much better plot!
qqnorm(rice.lm2$res, col=2)
qqline(rice.lm2$res, col="blue")
summary(rice.lm2)
#>
#> Call:
#> lm(formula = Yield ~ daymin31 + I(daymin31^2), data = rice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -303.96 -118.11 13.86 115.67 319.06
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3668.6682 76.7086 47.826 5.33e-16 ***
#> daymin31 12.2632 5.5286 2.218 0.045 *
#> I(daymin31^2) -4.5358 0.6744 -6.726 1.41e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 203.9 on 13 degrees of freedom
#> Multiple R-squared: 0.7942, Adjusted R-squared: 0.7625
#> F-statistic: 25.08 on 2 and 13 DF, p-value: 3.452e-05
par(old.par) # par(mfrow=c(1,1))
plot(rice$Days, rice$Yield, xlab="Days", ylab="Yield")
lines(rice$Days, rice.lm2$fit, lty=1, col=3)
rice.lm3 <- lm(Yield ~ daymin31 + I(daymin31^2)+I(daymin31^3) , data=rice)
#check the diagnostics
summary(rice.lm3) # Will print the summary of the fitted model
#>
#> Call:
#> lm(formula = Yield ~ daymin31 + I(daymin31^2) + I(daymin31^3),
#> data = rice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -281.97 -113.21 -6.11 97.75 330.92
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3668.66815 79.32668 46.248 6.82e-15 ***
#> daymin31 17.52493 14.49451 1.209 0.25
#> I(daymin31^2) -4.53580 0.69743 -6.504 2.92e-05 ***
#> I(daymin31^3) -0.03457 0.08751 -0.395 0.70
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 210.8 on 12 degrees of freedom
#> Multiple R-squared: 0.7968, Adjusted R-squared: 0.746
#> F-statistic: 15.68 on 3 and 12 DF, p-value: 0.0001876
#### Predict at a new value of Days=32.1465
# Create a new data set called new
new <- data.frame(daymin31=32.1465-31)
a <- predict(rice.lm2, newdata=new, se.fit=TRUE)
# Confidence interval for the mean of rice yield at Days=32.1465
a <- predict(rice.lm2, newdata=new, interval="confidence")
a
#> fit lwr upr
#> 1 3676.766 3511.904 3841.628
# Prediction interval for a future yield at Days=32.1465
b <- predict(rice.lm2, newdata=new, interval="prediction")
b
#> fit lwr upr
#> 1 3676.766 3206.461 4147.071
wgain: Weight gain of students starting college
This data set contains weight gain data from 68 first-year students during their first 12 weeks in college. Source: Levitsky and Mrdjenovic (2004). It has been used to illustrate confidence intervals and \(t\)-tests in Part I of the book Sahu (2024).
summary(wgain)
#> student initial final
#> Min. : 1.00 Min. :42.64 Min. : 43.54
#> 1st Qu.:17.75 1st Qu.:53.86 1st Qu.: 54.32
#> Median :34.50 Median :60.78 Median : 60.78
#> Mean :34.50 Mean :61.72 Mean : 62.59
#> 3rd Qu.:51.25 3rd Qu.:68.04 3rd Qu.: 68.49
#> Max. :68.00 Max. :99.79 Max. :101.60
plot(wgain$initial, wgain$final)
abline(0, 1, col="red")
plot(wgain$initial, wgain$final, xlab="Wt in Week 1", ylab="Wt in Week 12",
pch="*", las=1)
abline(0, 1, col="red")
title("A scatter plot of the weights in Week 12 against the weights in Week 1")
# 95% Confidence interval for mean weight gain
x <- wgain$final - wgain$initial
mean(x) + c(-1, 1) * qt(0.975, df=67) * sqrt(var(x)/68)
#> [1] 0.6334959 1.1008265
# t-test to test the mean difference equals 0
t.test(x)
#>
#> One Sample t-test
#>
#> data: x
#> t = 7.4074, df = 67, p-value = 2.813e-10
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#> 0.6334959 1.1008265
#> sample estimates:
#> mean of x
#> 0.8671612
The function ?butterfly draws a butterfly as on the front cover of the book. The plot is obtained as follows. Initially a sequence of angles, denoted by \(\theta\), is chosen in the range 0 to 24\(\pi\). Then, for two specified parameters \(a\) and \(b\), the following equations are evaluated; the plot is obtained by plotting the resulting \((x, y)\) pairs, with successive points joined by lines. The illustrations below show the effect of the two parameters on the shape of the plot.
\[\begin{aligned}
r &= \exp(\cos \theta) - a \cos(b\,\theta) + \sin\left(\frac{\theta}{12}\right), \\
x &= r \sin \theta, \\
y &= -r \cos \theta.
\end{aligned}\]
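These equations can also be evaluated directly; below is a minimal stand-alone sketch, where the values a=2 and b=4 are illustrative choices and not necessarily the defaults of butterfly:
theta <- seq(from=0, to=24*pi, length=5000)
a <- 2; b <- 4 # illustrative parameter values
r <- exp(cos(theta)) - a*cos(b*theta) + sin(theta/12)
x <- r*sin(theta)
y <- -r*cos(theta)
plot(x, y, type="l", col=6, xlab="", ylab="")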
butterfly(color = 6)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2, 2))
butterfly(a=10, b=1.5, color = "seagreen")
butterfly(color = 6)
butterfly(a=5, b=5, color=2)
butterfly(a=20, b=4, color = "blue")
par(old.par) # par(mfrow=c(1, 1))
The function ?monty simulates the famous Monty Hall game. The function was written by Chivers (2012). It takes the argument strat, short for strategy, which can take one of three values:
* “stay”: do not change the door chosen initially.
* “switch”: switch from the door chosen initially to the other closed door.
* “random”: randomly decide to stay or switch.
The other arguments are \(N\), the number of games to play (defaulting to 1000), and print_games, a logical argument that determines whether to print the result of each of the \(N\) games. Over many games, the switching strategy wins about two-thirds of the time, while staying wins only about one-third, which illustrates the conditional probability argument. Here are three illustrations of the game.
monty("stay", print_games = FALSE)
monty("switch", print_games = FALSE)
monty("random", print_games = FALSE)
There are two functions illustrating the CLT. The first one, ?see_the_clt_for_Bernoulli, simulates nrep=10000 samples of size nsize=10 with probability of success \(p=0.8\) by default. It then finds the means of the nrep samples and standardises the means by subtracting the overall mean and dividing by the sample standard deviation. It then draws a histogram of the standardised sample means and superimposes the theoretical density of the standard normal distribution. The histogram will closely resemble the density of the standard normal distribution if the CLT approximation is good. The quality of the approximation depends on the sample size nsize. This can be observed from the illustrations provided below.
a <- see_the_clt_for_Bernoulli()
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2, 3))
a30 <- see_the_clt_for_Bernoulli(nsize=30)
a50 <- see_the_clt_for_Bernoulli(nsize=50)
a100 <- see_the_clt_for_Bernoulli(nsize=100)
a500 <- see_the_clt_for_Bernoulli(nsize=500)
a1000 <- see_the_clt_for_Bernoulli(nsize=1000)
a5000 <- see_the_clt_for_Bernoulli(nsize=5000)
par(old.par)
The second function, ?see_the_clt_for_uniform, demonstrates the CLT for sampling from the uniform distribution on the interval (0, 1). As in the previous function for the Bernoulli distribution, this also takes two similar arguments and behaves very similarly, as demonstrated below, but the sampling is done from the uniform distribution.
a <- see_the_clt_for_uniform()
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2, 3))
a1 <- see_the_clt_for_uniform(nsize=1)
a2 <- see_the_clt_for_uniform(nsize=2)
a3 <- see_the_clt_for_uniform(nsize=5)
a4 <- see_the_clt_for_uniform(nsize=10)
a5 <- see_the_clt_for_uniform(nsize=20)
a6 <- see_the_clt_for_uniform(nsize=50)
par(old.par)
ybars <- see_the_clt_for_uniform(nsize=12)
zbars <- (ybars - mean(ybars))/sd(ybars) # standardise the sample means
k <- 100
u <- seq(from=min(zbars), to=max(zbars), length=k) # grid of evaluation points
ecdf <- rep(NA, k)
# Empirical cdf of the standardised sample means at each grid point
for(i in 1:k) ecdf[i] <- length(zbars[zbars<u[i]])/length(zbars)
tcdf <- pnorm(u) # theoretical N(0, 1) cdf
plot(u, tcdf, type="l", col="red", lwd=4, xlab="", ylab="cdf")
lines(u, ecdf, lty=2, col="darkgreen", lwd=4)
symb <- c("cdf of sample means", "cdf of N(0, 1)")
legend(x=-3.5, y=0.4, legend = symb, lty = c(2, 1),
col = c("darkgreen","red"), bty="n")
This function also provides a plot of the empirical cumulative distribution function (cdf) of the standardised sample means and superimposes the cdf of the standard normal distribution. The CLT implies that these cdfs will be very close to each other as the sample size \(n \to \infty\).
The function ?see_the_wlln_for_uniform illustrates the WLLN for sampling from the uniform distribution on the interval (0, 1). Like the two functions demonstrating the CLT, this function takes the two arguments nsize for the sample size and nrep for the number of replications. The function draws a histogram of the replicated sample means and superimposes an estimated density function of those means.
The illustration below shows that the estimated density becomes more peaked as the sample size \(n\) increases. The WLLN says that the estimated density will gradually become just a spike as \(n \to \infty\).
a1 <- see_the_wlln_for_uniform(nsize=1, nrep=50000)
a2 <- see_the_wlln_for_uniform(nsize=10, nrep=50000)
a3 <- see_the_wlln_for_uniform(nsize=50, nrep=50000)
a4 <- see_the_wlln_for_uniform(nsize=100, nrep=50000)
plot(a4, type="l", lwd=2, ylim=range(c(a1$y, a2$y, a3$y, a4$y)), col=1,
lty=1, xlab="mean", ylab="density estimates")
lines(a3, type="l", lwd=2, col=2, lty=2)
lines(a2, type="l", lwd=2, col=3, lty=3)
lines(a1, type="l", lwd=2, col=4, lty=4)
symb <- c("n=1", "n=10", "n=50", "n=100")
legend(x=0.37, y=11.5, legend = symb, lty =4:1, col = 4:1)
The data sets included in this package and discussed in this vignette can be used in the teaching and learning of introductory probability, statistics and R for data-based sciences, with or without reading the book by the same author, Sahu (2024). So that a beginner reader can understand the methods, and being introductory in nature, the package uses very basic R commands and scripts throughout, although it does illustrate the use of the advanced graphics package ggplot2 and of function writing through the butterfly function. However, ipsRdbs does not set out to teach R programming and does not expect its users to be proficient in adopting such techniques in their own data analysis and processing. It has been purposefully designed this way, to interest the reader in going on to learn and use the more advanced graphics, data manipulation and analysis capabilities of R.