Chapter 1 Introduction to Statistics

1.1 What is statistics?

1.1.1 Early and modern definitions

The word statistics comes from the Latin status, meaning ‘the state’. In the mid-18th century, it referred to the collection, processing, and use of data by governments. During the rapid industrialisation of Europe in the early 19th century, statistics developed into a formal discipline. In 1834, the Royal Statistical Society was founded, and it became the leading professional association for statisticians in the UK and globally. As the field grew, statistics came to mean the interpretation of data and methods for extracting information to support decision making. Today, statistics refers to the collection, analysis, and interpretation of data. Indeed, the Oxford English Dictionary defines statistics as

The practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

Notice that the original connection to ‘the state’ is no longer present. Now, statistical methods are essential for anyone who wants to answer questions using data.

For example: Will it rain tomorrow? Is smoking harmful during pregnancy? What degree classification will I receive when I graduate? Will the stock market crash tomorrow?

1.1.2 Statistics tames uncertainty

We often make decisions when outcomes are uncertain. Understanding the extent of this uncertainty helps us reduce the chance of making mistakes, or at least minimise the impact of those mistakes.

In other words: \[\text{Uncertain knowledge} + \text{Understanding its uncertainty} = \text{Usable knowledge}\]

1.1.3 Why should I study statistics as part of my degree?

Studying statistics will give you essential skills in data analysis and scientific reasoning. No matter which branch of mathematics, engineering, science, or social science you pursue, a solid understanding of statistics is important. Learning statistical theory also lets you apply your mathematical reasoning to real-world problems. This helps you improve both your mathematical and statistical skills.

All knowledge is, in final analysis, history.
All sciences are, in the abstract, mathematics.
All judgements are, in their rationale, statistics.
– Prof. C. R. Rao

1.1.4 Lies, Damn Lies and Statistics?

People sometimes joke, ‘You can prove anything with statistics!’ Such comments reflect the fact that statistics and statistical methods are often misquoted or misused without proper verification or justification. This has been especially evident during the recent global pandemic, when we were presented with a deluge of numbers every day.

Statistics can easily be misused or misinterpreted. As statisticians, it is our responsibility to apply statistical techniques correctly and develop strong, scientifically sound arguments. In particular, we must be careful about the assumptions we make.

1.1.5 What’s in this module?

Chapter 1: Introduction to Statistics. We begin with basic statistics used in everyday life, such as the mean, median, mode, and standard deviation. We will discuss statistical analysis, report writing, and how to explore data using graphs. The R programming language will be introduced to assist with these tasks.
Chapter 2: Introduction to Probability. We will define and interpret probability as a way to measure uncertainty, learn the rules of probability, and work through examples.
Chapter 3: Probability Distributions. We will study various probability distributions that can be used to model the outcomes of different types of events.
Chapter 4: Statistical Inference. We will cover the basics of statistical inference, including point and interval estimation and hypothesis testing.
Chapter 5: Simple Linear Regression. We will introduce simple linear regression models, which are used to model the relationship between two variables.

1.2 Example data sets

In this module, we will assume that we have data from $n$ randomly selected sampling units, which we will conveniently denote by $x_1, x_2, \dots, x_n$. We will assume that these values are numeric, either discrete like counts, e.g. number of road accidents, or continuous, e.g. heights of 4-year-olds, marks obtained in an examination.

We will use the following example datasets to illustrate concepts throughout the module:

Example 1.1 (Fast food service time) This dataset records the service times (in seconds) for customers at a fast-food restaurant. The first row shows customers served from 9–10am, and the second row shows those served from 2–3pm on the same day.

AM	38	100	64	43	63	59	107	52	86	77
PM	45	62	52	72	81	88	64	75	59	70

We would like to compare these AM and PM service times.

Example 1.2 (Computer failures) This dataset lists the number of weekly failures in a university computer system over two years.

4 0 0 0 3 2 0 0 6 7 6 2 1 11 6 1 2 1 1 2 0 2 2 1 0 12 8 4 5 0 5 4 1 0 8 2 5 2 1 
12 8 9 10 17 2 3 4 8 1 2 5 1 2 2 3 1 2 0 2 1 6 3 3 6 11 10 4 3 0 2 4 2 1 5 3 3 
2 5 3 4 1 3 6 4 4 5 2 10 4 1 5 6 9 7 3 1 3 0 2 2 1 4 2 13

We would like to summarise this data and make predictions about future failures.

Example 1.3 (Weight gain) Do students tend to gain weight during their first year at college? Professor David Levitsky at Cornell recruited students from two large introductory health classes. Although participation was voluntary, the students were similar to the rest of the class in terms of demographics such as sex and ethnicity. Sixty-eight students were weighed in the first week of the semester and again 12 weeks later. The first 10 rows of the data are:

Student number	Initial weight (kg)	Final weight (kg)
1	77.56423	76.20346
2	49.89512	50.34871
3	60.78133	61.68851
4	52.16308	53.97745
5	68.03880	70.30676
6	47.17357	48.08075
7	64.41006	67.13162
8	54.43104	56.24541
9	65.31725	67.13162
10	70.76035	69.85317

We would like to explore the data graphically and to test whether the data support the hypothesis that students gain weight during their first year in college.

Example 1.4 (Billionaires) Each year, Fortune magazine publishes a list of the world’s billionaires. The 1992 list includes 225 individuals, reporting their wealth, age, and geographic region (Asia, Europe, Middle East, United States, and Other).

The variables in the data are:

wealth: Wealth of family or individual in billions of dollars
age: Age in years (for families it is the maximum age of family members)
region: Region of the World (Asia, Europe, Middle East, United States and Other).

The first 10 rows of the data are:

wealth	age	region
37.0	50	Middle East
24.0	88	United States
14.0	64	Asia
13.0	63	United States
13.0	66	United States
11.7	72	Europe
10.0	71	Middle East
8.2	77	United States
8.1	68	United States
7.2	66	Europe

We will investigate differences in wealth of billionaires by age and region using many exploratory graphical tools and statistical methods.

1.3 Introduction to R

R is a programming language designed for statistics. We will use R to perform calculations, summarise data, create exploratory plots, carry out statistical analyses, illustrate theorems, and calculate probabilities. You will gain hands-on experience with R during the computer practical sessions.

R is free to download—just search for ‘download R’ or visit https://cran.r-project.org/. We will use R through the RStudio integrated development environment, which you can download from https://posit.co/download/rstudio-desktop/. RStudio provides a user-friendly editor for R code and a console. Instructions for using R and RStudio will be provided in the practical sessions. If you plan to work on the practicals on your own computer, please install R and RStudio beforehand.

These notes do not cover the details of using R—that will be addressed in the practical sessions. Here, we provide simple R commands for calculating useful quantities and show some example graphical outputs.

1.4 Summarising data sets

1.4.1 Summarising categorical data

Categorical (non-numeric) data can be summarised using tables. For example, the results of 20 coin tosses might be summarised as 12 heads and 8 tails.

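Discrete numeric data that take only a few distinct values can be tabulated in the same way. For the computer failure data (see Example 1.2), we can count the number of weeks in which each number of failures occurred: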

Number of failures	0	1	2	3	4	5	6	7	8	9	10	11	12	13	17
Number of weeks	12	16	21	12	11	8	7	2	4	2	3	2	2	1	1

1.4.2 Measures of location

1.4.2.1 Choosing a representative value for the data

We want a representative value for the data $x_1, x_2, \dots, x_n$ that is a function of the data. If $a$ is our chosen value, how much error does it have? We could measure total error as the sum of squared errors, \[\text{SSE}(a) = \sum_{i=1}^n (x_i - a)^2\] or as the sum of absolute errors, \[\text{SAE}(a) = \sum_{i=1}^n |x_i - a|.\] What value of $a$ will minimise the SSE or the SAE? For SSE the answer is the sample mean and for SAE the answer is the sample median.

1.4.2.2 The sample mean

The sample mean is \[\bar x = \frac{1}{n} (x_1 + x_2 + \dots + x_n) = \frac{1}{n} \sum_{i=1}^n x_i.\] For the service time data from Example 1.1, the AM mean time is 68.9 seconds and the PM mean time is 66.8 seconds.
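As a quick check of these values, the following sketch types the service times from Example 1.1 into R and computes both means, once directly from the formula and once with the built-in `mean()`. The object names `am` and `pm` are our own choice for illustration.

```r
# Service times in seconds, typed in from Example 1.1
# (the names am and pm are illustrative, not from the data source)
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
pm <- c(45, 62, 52, 72, 81, 88, 64, 75, 59, 70)

# Sample mean from the formula: (1/n) * sum of the observations
am_mean <- sum(am) / length(am)
pm_mean <- sum(pm) / length(pm)

# The built-in mean() performs the same calculation
am_mean
mean(am)
pm_mean
mean(pm)
```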

Theorem 1.1 The sample mean $\bar x$ minimises the SSE.

Proof. We have \[\begin{align*} \text{SSE}(a) &= \sum_{i=1}^{n}\left(x_{i}-a\right)^{2} \\ &=\sum_{i=1}^{n}\left(x_{i}-\bar{x}+\bar{x}-a\right)^{2} \quad \text{(Add and subtract $\bar{x}$)} \\ &=\sum_{i=1}^{n}\left\{\left(x_{i}-\bar{x}\right)^{2}+2\left(x_{i}-\bar{x}\right)(\bar{x}-a)+(\bar{x}-a)^{2}\right\} \\ &=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}+2(\bar{x}-a) \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)+\sum_{i=1}^{n}(\bar{x}-a)^{2} \\ &=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}+n(\bar{x}-a)^{2}, \end{align*}\] since $\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)=n \bar{x}-n \bar{x}=0$.

The first term is free of $a$ and the second term is non-negative for any value of $a$. Hence the minimum occurs when the second term is zero, i.e. when $a = \bar x$.

We have established that the sum (or mean) of the squared deviations from any number $a$ is minimised when $a$ is the mean. This is why we often use the mean as a representative value.
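Theorem 1.1 can also be checked numerically. The sketch below uses the AM service times and a helper function `sse()` that we define ourselves. It verifies the decomposition from the proof at one arbitrary value of $a$, then asks R's general-purpose one-dimensional minimiser `optimize()` to search for the minimising value. This is an illustration, not a substitute for the proof.

```r
# AM service times from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)

# Sum of squared errors about a candidate value a (our own helper)
sse <- function(a, x) sum((x - a)^2)

xbar <- mean(am)
n <- length(am)

# Decomposition from the proof: SSE(a) = SSE(xbar) + n * (xbar - a)^2
a <- 60                                   # arbitrary candidate value
lhs <- sse(a, am)
rhs <- sse(xbar, am) + n * (xbar - a)^2
c(lhs = lhs, rhs = rhs)

# Numerical search for the minimiser over the range of the data
fit <- optimize(sse, interval = range(am), x = am)
fit$minimum   # should be very close to xbar
xbar
```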

1.4.2.3 The sample median

The sample median is the middle value in the ordered list of observations $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$. For the AM service time data, the ordered values are \[38 < 43 < 52 < 59 < 63 < 64 < 77 < 86 < 100 < 107.\] If $n$ is odd, there is a unique middle value. If $n$ is even, there are two middle values (63 and 64 for the AM service time data). Any value between those two middle values is a sample median. By convention, we often use the mean of these values (63.5 for the AM service time data).
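The same steps can be carried out in R. This sketch sorts the AM service times, picks out the two middle values (valid here because $n = 10$ is even), and compares their mean with the built-in `median()`. The object names are our own.

```r
# AM service times from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
n <- length(am)

# Ordered observations x_(1) <= ... <= x_(n)
am_sorted <- sort(am)
am_sorted

# n is even here, so there are two middle values
middle <- am_sorted[c(n / 2, n / 2 + 1)]
middle

# The conventional median is their mean, which is what median() returns
mean(middle)
median(am)
```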

Theorem 1.2 The sample median minimises the SAE.

Proof. If $a<x_{(1)}$, then \[\begin{equation} \text{SAE}(a) = \sum_{i=1}^n (x_i - a). \tag{1.1} \end{equation}\] As $a$ increases, each term of (1.1) decreases until $a$ reaches $x_{(1)}$, so $\text{SAE} (x_{(1)})< \text{SAE}(a)$ for all $a<x_{(1)}$. We conclude the minimiser of the SAE is at least $x_{(1)}$.

Now suppose $x_{(k)} \leq a < x_{(k+1)}$. Then \[\begin{align*} \text{SAE}(a) &= \sum_{i=1}^k (a - x_{(i)}) + \sum_{i={k+1}}^n (x_{(i)} - a) \\ &= (2k - n) a - \sum_{i=1}^k x_{(i)} + \sum_{i={k+1}}^n x_{(i)}. \end{align*}\]

The term $- \sum_{i=1}^k x_{(i)} + \sum_{i={k+1}}^n x_{(i)}$ is constant for any $a$ in the interval $[x_{(k)}, x_{(k+1)})$. So the SAE in this interval is a straight line, with slope $2k - n$. This slope is negative if $k < \frac{n}{2}$, zero if $k = \frac{n}{2}$, and positive if $k>\frac{n}{2}$.

For each $k < \frac{n}{2}$, the SAE is decreasing in the interval $[x_{(k)}, x_{(k+1)})$, and we conclude the minimiser of the SAE is at least $x_{(k+1)}$. Starting at $k = 1$, we continue increasing $k$ by one and concluding that the minimiser of the SAE is at least $x_{(k+1)}$, until we reach a $k$ such that $k \geq \frac{n}{2}$:

If $n$ is odd, this happens at $k = \frac{n+1}{2}$. In that case, since $k > \frac{n}{2}$, the SAE is increasing in the interval $[x_{(k)}, x_{(k+1)})$, so we conclude the SAE is minimised at $x_{(k)} = x_{\left(\frac{n+1}{2}\right)}$, the median point.
If $n$ is even, this happens at $k = \frac{n}{2}$. In that case the SAE is constant in the interval $[x_{(k)}, x_{(k+1)})$, so we conclude that the SAE is minimised at any $a$ between the two middle points $x_{\left(\frac{n}{2}\right)}$ and $x_{\left(\frac{n}{2}+1\right)}$, i.e. any median point.

We have established that the sum (or mean) of the absolute deviations from any number $a$ is minimised when $a$ is the median. This is why the median is also often used as a representative value.
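To see the piecewise-linear behaviour from the proof in numbers, the sketch below evaluates the SAE for the AM service times. The helper `sae()` is our own. With $n = 10$, the slope $2k - n$ is zero for $k = 5$, so the SAE should be constant between the two middle values 63 and 64 and larger everywhere else.

```r
# AM service times from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
n <- length(am)

# Sum of absolute errors about a candidate value a (our own helper)
sae <- function(a, x) sum(abs(x - a))

# Slope of the SAE on each interval [x_(k), x_(k+1)), k = 1, ..., n - 1
k <- 1:(n - 1)
slopes <- 2 * k - n
slopes

# SAE at each ordered data point: decreases, is flat in the middle, then increases
sae_at_data <- sapply(sort(am), function(a) sae(a, am))
sae_at_data

# SAE at three points between the two middle values: all equal
sae_middle <- sapply(c(63, 63.5, 64), function(a) sae(a, am))
sae_middle
```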

The mean is more affected by extreme values than the median. For example, for the AM service times, if the next observation is 190, the median will be 64 instead of 63.5, but the mean will jump to 79.9.
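This sensitivity is easy to reproduce. The sketch below appends the hypothetical extra observation of 190 seconds to the AM service times and recomputes both summaries.

```r
# AM service times from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)

# Add one hypothetical extreme observation of 190 seconds
am_extra <- c(am, 190)

# Mean and median before and after adding the extreme value
c(mean = mean(am), median = median(am))
c(mean = mean(am_extra), median = median(am_extra))
```

The median moves by half a second, while the mean moves by about 11 seconds.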

1.4.2.4 The sample mode

The mode, or most frequent value in the data, is the best representative if we use a 0-1 error function instead of SAE or SSE. Here, the error is 0 if our guess $a$ is correct and 1 if it is not. In this case, the best guess $a$ is the mode of the data.
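R has no built-in function for the sample mode (the function `mode()` reports how an object is stored, not its most frequent value). One possible approach, sketched below, is to tabulate the data and pick the value with the largest count. Here we rebuild the computer failure data from its frequency table in Section 1.4.1. The object names are our own.

```r
# Rebuild the weekly failure counts from the frequency table:
# values are the numbers of failures, counts are the numbers of weeks
values <- c(0:13, 17)
counts <- c(12, 16, 21, 12, 11, 8, 7, 2, 4, 2, 3, 2, 2, 1, 1)
compfail <- rep(values, times = counts)

# Tabulate and find the most frequent value(s)
freq <- table(compfail)
sample_mode <- as.numeric(names(freq)[freq == max(freq)])
sample_mode

# Total 0-1 error of a guess a: the number of observations not equal to a
zero_one_error <- function(a, x) sum(x != a)
zero_one_error(sample_mode, compfail)
zero_one_error(1, compfail)
```

If several values tie for the largest count, this sketch returns all of them.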

1.4.3 Measures of spread

A quick way to measure spread is the range, which is the difference between the maximum and minimum values. For the AM service times, the range is $69$ ($107 - 38$) seconds.

The variance is \[\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}.\] We have \[\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}=\sum_{i=1}^{n}\left(x_{i}^{2}-2 x_{i} \bar{x}+\bar{x}^{2}\right)=\sum_{i=1}^{n} x_{i}^{2}-2 \bar{x}(n \bar{x})+n \bar{x}^{2}=\sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2},\] so we calculate variance as \[ \operatorname{Var}(x)=\frac{1}{n-1}\left(\sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2}\right). \] Sometimes variance is defined with $n$ instead of $n - 1$ in the denominator. We use $n - 1$ because this is the default in R. We will discuss this more in Chapter 4.

The standard deviation (sd) is the square root of the variance: \[s = \operatorname{sd}(x) = \sqrt{\operatorname{Var}(x)}.\] The standard deviation for the AM service times is 23.2 seconds. It has the same unit as the data.
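As a check on the algebra, the sketch below computes the variance of the AM service times in three ways: from the definition, from the shortcut formula, and with R's `var()`, which uses the $n - 1$ denominator. It then takes the square root to obtain the standard deviation.

```r
# AM service times from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
n <- length(am)
xbar <- mean(am)

# Variance from the definition
var_def <- sum((am - xbar)^2) / (n - 1)

# Variance from the shortcut formula
var_short <- (sum(am^2) - n * xbar^2) / (n - 1)

# Built-in functions
c(definition = var_def, shortcut = var_short, builtin = var(am))
sqrt(var(am))
sd(am)
```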

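The interquartile range (IQR) is the difference between the third quartile, $Q_3$, and the first quartile, $Q_1$. These are the values at ranks $\frac{1}{4}(3n + 1)$ and $\frac{1}{4}(n + 3)$ in the ordered list. The median is the second quartile, $Q_2$. When a rank is not a whole number, we interpolate between the two neighbouring ordered observations, just as we average the two middle values to find the median when $n$ is even. For the AM service times ($n = 10$), the ranks are 7.75 and 3.25, so $Q_3 = 77 + 0.75 \times (86 - 77) = 83.75$ and $Q_1 = 52 + 0.25 \times (59 - 52) = 53.75$, giving an IQR of $83.75 - 53.75 = 30$ seconds.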
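The rank-based definition can be sketched in R as follows. The helper `value_at_rank()` is our own and interpolates linearly between neighbouring ordered observations when the rank is not a whole number. We compare the result with the built-in `quantile()` and `IQR()` functions using their default settings.

```r
# AM service times from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
n <- length(am)
am_sorted <- sort(am)

# Value at a (possibly fractional) rank r, by linear interpolation (our own helper)
value_at_rank <- function(x_sorted, r) {
  lo <- floor(r)
  hi <- ceiling(r)
  x_sorted[lo] + (r - lo) * (x_sorted[hi] - x_sorted[lo])
}

q1 <- value_at_rank(am_sorted, (n + 3) / 4)      # rank 3.25
q3 <- value_at_rank(am_sorted, (3 * n + 1) / 4)  # rank 7.75
c(Q1 = q1, Q3 = q3, IQR = q3 - q1)

# Built-in equivalents with default settings
quantile(am, c(0.25, 0.75))
IQR(am)
```

Other quartile conventions exist, and `quantile()` offers several through its `type` argument, so results from other software may differ slightly.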

1.4.4 Summarising data in R

We can easily calculate the mean, median, variance, and standard deviation in R.

For example, suppose the weekly counts of computer failures from Example 1.2 are stored in an object called compfail in R. You will learn how to read the data into R during the practical sessions.
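Until then, one minimal way to create such an object is to type the values from Example 1.2 directly into a vector with `c()`. This is only a sketch; the practical sessions may well read the data from a file instead.

```r
# Weekly numbers of computer failures, typed in from Example 1.2
compfail <- c(
  4, 0, 0, 0, 3, 2, 0, 0, 6, 7, 6, 2, 1, 11, 6, 1, 2, 1, 1, 2,
  0, 2, 2, 1, 0, 12, 8, 4, 5, 0, 5, 4, 1, 0, 8, 2, 5, 2, 1,
  12, 8, 9, 10, 17, 2, 3, 4, 8, 1, 2, 5, 1, 2, 2, 3, 1, 2, 0, 2,
  1, 6, 3, 3, 6, 11, 10, 4, 3, 0, 2, 4, 2, 1, 5, 3, 3,
  2, 5, 3, 4, 1, 3, 6, 4, 4, 5, 2, 10, 4, 1, 5, 6, 9, 7, 3, 1,
  3, 0, 2, 2, 1, 4, 2, 13
)

# Two years of weekly data should give 104 observations
length(compfail)
```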

We can calculate these quantities with:

mean(compfail)

## [1] 3.75

median(compfail)

## [1] 3

var(compfail)

## [1] 11.43204

sd(compfail)

## [1] 3.38113

R saves us a lot of effort compared to calculating these by hand.

The summary command provides several useful summary statistics:

summary(compfail)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    3.00    3.75    5.00   17.00

We can create a table summarising the counts by typing:

table(compfail)

## compfail
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 17 
## 12 16 21 12 11  8  7  2  4  2  3  2  2  1  1

1.5 Exploratory data plots

1.5.1 Introduction

Often, the best way to gain insight is to plot the data. Plots help us check the shape of a variable’s distribution and explore relationships between variables in a dataset.

The best type of plot depends on whether we are looking at a single variable or relationships between variables, and whether the variables are discrete or continuous.

Here, we show some plots produced by R. You will learn how to make these plots in the practical sessions.

1.5.2 Distribution of a single discrete variable

To explore the distribution of a single discrete variable, we can use a bar plot. For example, we can plot the weekly counts of computer failures from Example 1.2:

This plot shows the same information as the summary table of counts, but it is easier to understand visually.
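A bar plot along these lines can be drawn with base R's `barplot()`. The sketch below is one possible version and is not necessarily how the plot in these notes was produced. It rebuilds the data from the frequency table in Section 1.4.1 and, as a design choice of ours, includes the unobserved values 14, 15 and 16 so that the horizontal axis is evenly spaced.

```r
# Rebuild the weekly failure counts from the frequency table
values <- c(0:13, 17)
counts <- c(12, 16, 21, 12, 11, 8, 7, 2, 4, 2, 3, 2, 2, 1, 1)
compfail <- rep(values, times = counts)

# Tabulate over every integer from 0 to the maximum, so that
# unobserved values (14, 15, 16) appear as bars of height zero
freq <- table(factor(compfail, levels = 0:max(compfail)))

# Bar plot of the number of weeks with each number of failures
barplot(freq,
        xlab = "Number of failures in a week",
        ylab = "Number of weeks")
```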

1.5.3 Distribution of a single continuous variable

There are several types of plots we can use to summarise the shape of a single continuous variable’s distribution. For example, suppose we are interested in the weight gain in Example 1.3.

We can use a box-and-whiskers plot to summarise the distribution of the weight difference:

The line in the middle of the box is the median weight difference, and the box runs from the first quartile to the third quartile. The whiskers extend from the smallest value no more than $1.5 \times \text{IQR}$ below the first quartile to the largest value no more than $1.5 \times \text{IQR}$ above the third quartile.

Any outliers outside the whiskers are shown as separate points.

We see that most students in the sample gained weight during the study.
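As a rough illustration, the sketch below draws a box-and-whiskers plot using only the first 10 rows of the weight data shown in Example 1.3, since the full set of 68 students is not listed in these notes. The plot will therefore differ from one based on the full data. The object names are our own.

```r
# First 10 rows of the weight data from Example 1.3 (weights in kg);
# the full data set has 68 students
initial <- c(77.56423, 49.89512, 60.78133, 52.16308, 68.03880,
             47.17357, 64.41006, 54.43104, 65.31725, 70.76035)
final <- c(76.20346, 50.34871, 61.68851, 53.97745, 70.30676,
           48.08075, 67.13162, 56.24541, 67.13162, 69.85317)

# Weight difference: positive values mean the student gained weight
diff_kg <- final - initial

# Box-and-whiskers plot of the differences
boxplot(diff_kg, horizontal = TRUE, xlab = "Weight difference (kg)")

# Number of these 10 students who gained weight
sum(diff_kg > 0)
```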

To get more information about the shape of the distribution, we could plot a histogram of the weight difference:

The histogram divides the $x$-axis into small intervals or “bins” (here, each of width 0.5) and counts how many times the variable (here, difference in weight) falls in each bin. Histograms can look very different depending on the bin width.

Alternatively, we could draw a density plot:

This shows an estimate of the probability density function, which we will learn about in Chapter 3.
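Histograms and density plots can be sketched with base R's `hist()` and `density()`. As before, the example uses only the first 10 rows of the weight data from Example 1.3, so the shapes are illustrative only. The bin edges from -2 to 3 in steps of 0.5 are our own choice, made to cover these 10 values with bins of width 0.5.

```r
# First 10 rows of the weight data from Example 1.3 (weights in kg)
initial <- c(77.56423, 49.89512, 60.78133, 52.16308, 68.03880,
             47.17357, 64.41006, 54.43104, 65.31725, 70.76035)
final <- c(76.20346, 50.34871, 61.68851, 53.97745, 70.30676,
           48.08075, 67.13162, 56.24541, 67.13162, 69.85317)
diff_kg <- final - initial

# Histogram with bins of width 0.5 (bin edges are an assumption)
h <- hist(diff_kg, breaks = seq(-2, 3, by = 0.5),
          main = "", xlab = "Weight difference (kg)")
h$counts

# Kernel density estimate with R's default bandwidth
d <- density(diff_kg)
plot(d, main = "", xlab = "Weight difference (kg)")
```

Changing `by = 0.5` to another width shows how strongly the histogram's appearance depends on the bins.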

1.5.4 Relationship between continuous and discrete variables

We can also use a box-and-whiskers plot to show how a continuous variable changes with a discrete one. For example, for the service time data from Example 1.1, we are interested in how AM and PM service times compare:

Since no outliers are shown in this case, the whiskers show the range of the data.

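We see that AM service times have a lower median than PM service times (63.5 seconds compared with 67 seconds), even though the AM mean is slightly higher, and that AM service times are more spread out.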
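A grouped box plot like this can be sketched with the formula interface of `boxplot()`. The data frame `service` and its column names are our own choice. The sketch also checks two of the statements above: that neither group has points flagged as outliers, and how the group medians compare.

```r
# Service times in seconds from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
pm <- c(45, 62, 52, 72, 81, 88, 64, 75, 59, 70)

# Long format: one row per customer, with a grouping column
service <- data.frame(
  time = c(am, pm),
  period = rep(c("AM", "PM"), each = 10)
)

# Side-by-side box-and-whiskers plots
boxplot(time ~ period, data = service,
        xlab = "Period", ylab = "Service time (seconds)")

# Points that would be drawn as outliers in each group (none expected)
boxplot.stats(am)$out
boxplot.stats(pm)$out

# Group medians and standard deviations
tapply(service$time, service$period, median)
tapply(service$time, service$period, sd)
```

Note that `boxplot()` draws the box at the "hinges" returned by `fivenum()`, which can differ slightly from the quartiles as defined in Section 1.4.3.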

We can also use any of the other methods for plotting a single continuous variable and draw separate plots for each group in the discrete variable.

For example, we can use histograms:

Or density plots:

We could also plot the densities for the different groups on a single plot, using line type to distinguish between the groups:
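The per-group histograms and the overlaid density plot can be sketched in base R as follows. The bin edges, axis limits, and legend position are our own choices and may not match the figures in these notes.

```r
# Service times in seconds from Example 1.1
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
pm <- c(45, 62, 52, 72, 81, 88, 64, 75, 59, 70)

# Separate histograms on a common set of bins, side by side
bins <- seq(30, 110, by = 10)
op <- par(mfrow = c(1, 2))
hist(am, breaks = bins, main = "AM", xlab = "Service time (seconds)")
hist(pm, breaks = bins, main = "PM", xlab = "Service time (seconds)")
par(op)  # restore the previous plotting layout

# Density estimates for each group, overlaid on one plot
d_am <- density(am)
d_pm <- density(pm)
plot(d_am, lty = 1, main = "",
     xlim = range(d_am$x, d_pm$x),
     ylim = c(0, max(d_am$y, d_pm$y)),
     xlab = "Service time (seconds)")
lines(d_pm, lty = 2)
legend("topright", legend = c("AM", "PM"), lty = c(1, 2))
```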

1.5.5 Relationship between two continuous variables

A scatter plot is a good way to look at the relationship between two continuous variables. For example, for the weight data from Example 1.3, we can plot final weights against initial weights:

We can add the straight line $y = x$ to the plot:

We see that most points lie above the $y = x$ line, so most students in the sample gained weight during the study, as we saw before. This seems to be true regardless of the student’s initial weight.
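A scatter plot with the reference line can be sketched using `plot()` and `abline()`. Again, only the first 10 rows of the weight data from Example 1.3 are used here, so the plot is illustrative and will contain far fewer points than one based on all 68 students.

```r
# First 10 rows of the weight data from Example 1.3 (weights in kg)
initial <- c(77.56423, 49.89512, 60.78133, 52.16308, 68.03880,
             47.17357, 64.41006, 54.43104, 65.31725, 70.76035)
final <- c(76.20346, 50.34871, 61.68851, 53.97745, 70.30676,
           48.08075, 67.13162, 56.24541, 67.13162, 69.85317)

# Scatter plot of final weight against initial weight
plot(initial, final,
     xlab = "Initial weight (kg)", ylab = "Final weight (kg)")

# Add the line y = x (intercept 0, slope 1)
abline(a = 0, b = 1, lty = 2)

# Points above the line correspond to students who gained weight
above <- final > initial
sum(above)
```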

1.5.6 Relationships between more than two variables

We can combine the methods we’ve seen to plot relationships between more than two variables.

For example, in the billionaires data in Example 1.4, we can plot wealth against age as a scatter plot, using the shape of the points to show different regions:

This plot lets us quickly see some interesting features of the data. For example, there were a few very young billionaires, the richest person in the world at this time was age 50 and from the Middle East, and the richest few people had much more wealth than most others on the list.
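A plot of this kind can be sketched by mapping the region to the plotting symbol through the `pch` argument. The example below uses only the first 10 rows of the billionaires data shown in Example 1.4, so it covers just the wealthiest end of the list. The data frame name and the choice of symbols are our own.

```r
# First 10 rows of the billionaires data from Example 1.4
billionaires <- data.frame(
  wealth = c(37.0, 24.0, 14.0, 13.0, 13.0, 11.7, 10.0, 8.2, 8.1, 7.2),
  age = c(50, 88, 64, 63, 66, 72, 71, 77, 68, 66),
  region = factor(c("Middle East", "United States", "Asia",
                    "United States", "United States", "Europe",
                    "Middle East", "United States", "United States",
                    "Europe"))
)

# Scatter plot of wealth against age; the symbol shape encodes the region
region_levels <- levels(billionaires$region)
plot(billionaires$age, billionaires$wealth,
     pch = as.integer(billionaires$region),
     xlab = "Age (years)", ylab = "Wealth (billions of dollars)")
legend("topright", legend = region_levels, pch = seq_along(region_levels))

# The richest individual or family among these rows
richest <- billionaires[which.max(billionaires$wealth), ]
richest
```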

MATH1063: Introduction to Statistics