job
dataset.
Very Dissatisfied | A Little Dissatisfied | Moderately Satisfied | Very Satisfied | |
---|---|---|---|---|
<6000 | 20 | 24 | 80 | 82 |
6000-15000 | 22 | 38 | 104 | 125 |
15000-25000 | 13 | 28 | 81 | 113 |
>25000 | 7 | 18 | 54 | 92 |
A particularly important application of generalized linear models is the analysis of categorical data. Here, the data consist of observations of one or more categorical variables for each of a number of units (often individuals). Therefore, each unit is cross-classified by the categorical variables
For example, the job
dataset gives the job satisfaction and income band of 901 individuals from the 1984 General Social Survey, which is summarised in Table 6.1.
job
dataset.
Very Dissatisfied | A Little Dissatisfied | Moderately Satisfied | Very Satisfied | |
---|---|---|---|---|
<6000 | 20 | 24 | 80 | 82 |
6000-15000 | 22 | 38 | 104 | 125 |
15000-25000 | 13 | 28 | 81 | 113 |
>25000 | 7 | 18 | 54 | 92 |
A cross-classification table like this is called a contingency table. This is a two-way table, as there are two classifying variables. It might also be describe as a \(4\times 4\) contingency table (as each of the classifying variables takes one of four possible levels).
Each position in a contingency table is called a cell and the number of individuals in a particular cell is the cell count.
Partial classifications derived from the table are called margins. For a two-way table, these are often displayed in the margins, as in Table 6.2. These are one-way margins as they represent the classification of items by a single variable; income group and job satisfaction, respectively.
job
dataset, including one-way margins.
Very Dissatisfied | A Little Dissatisfied | Moderately Satisfied | Very Satisfied | Sum | |
---|---|---|---|---|---|
<6000 | 20 | 24 | 80 | 82 | 206 |
6000-15000 | 22 | 38 | 104 | 125 | 289 |
15000-25000 | 13 | 28 | 81 | 113 | 235 |
>25000 | 7 | 18 | 54 | 92 | 171 |
Sum | 62 | 108 | 319 | 412 | 901 |
The lymphoma
dataset gives information about 30 patients, classified by cell type of lymphoma, sex, and response to treatment, as shown in Table 6.3. This is an example of a three-way contingency table. It is a \(2\times 2\times 2\) or \(2^3\) table.
lymphoma
dataset.
Cell | Sex | No remission | Remission |
---|---|---|---|
Diffuse | Female | 3 | 1 |
Nodular | Female | 2 | 6 |
Diffuse | Male | 12 | 1 |
Nodular | Male | 1 | 4 |
For multiway tables, higher order margins may be calculated. For example, for lymphoma
, the two-way Cell type/Sex margin is shown in Table 6.4.
lymphoma
dataset.
Female | Male | |
---|---|---|
Diffuse | 4 | 13 |
Nodular | 8 | 5 |
We can model contingency table data using generalised linear models. To do this, we assume that the cell counts are observations of independent Poisson random variables. This is intuitively sensible as the cell counts are non-negative integers (the sample space for the Poisson distribution). Therefore, if the table has \(n\) cells, which we label \(1,\ldots ,n\), then the observed cell counts \(y_1,\ldots ,y_n\) are assumed to be observations of independent Poisson random variables \(Y_1,\ldots ,Y_n\). We denote the means of these Poisson random variables by \(\mu_1,\ldots ,\mu_n\). The canonical link function for the Poisson distribution is the log function, and we assume this link function throughout. A generalised linear model for Poisson data using the log link function is called a log-linear model, as we have already seen in Section 5.4.3.
The explanatory variables in a log-linear model for contingency table data are the cross-classifying variables, or factors. As usual with categorical variables, we can include interactions in the model as well as just main effects (see Example 3.6). Such a model will describe how the expected count in each cell depends on the classifying variables, and any interactions between them. Interpretation of these models will be discussed further in Section 6.5.
Table 6.5 shows the original data structure of the job
dataset. It provides exactly the same data as the contingency table in Table 6.1, but in a different format, sometimes called long format. A log-linear model is just a Poisson GLM, where the response variable is Count
, and Income
and Satisfaction
are explanatory variables.
job
dataset.
Income | Satisfaction | Count |
---|---|---|
<6000 | Very Dissatisfied | 20 |
<6000 | A Little Dissatisfied | 24 |
<6000 | Moderately Satisfied | 80 |
<6000 | Very Satisfied | 82 |
6000-15000 | Very Dissatisfied | 22 |
6000-15000 | A Little Dissatisfied | 38 |
6000-15000 | Moderately Satisfied | 104 |
6000-15000 | Very Satisfied | 125 |
15000-25000 | Very Dissatisfied | 13 |
15000-25000 | A Little Dissatisfied | 28 |
15000-25000 | Moderately Satisfied | 81 |
15000-25000 | Very Satisfied | 113 |
>25000 | Very Dissatisfied | 7 |
>25000 | A Little Dissatisfied | 18 |
>25000 | Moderately Satisfied | 54 |
>25000 | Very Satisfied | 92 |
Table 6.6 shows the lymphoma
dataset in long format. Again, a log-linear model for the contingency table (Table 6.3) is just a Poisson GLM for this data, where in this case the response variable is Count
, and Cell
, Sex
, and Remis
are explanatory variables.
lymphoma
dataset.
Cell | Sex | Remis | Count |
---|---|---|---|
Nodular | Male | No | 1 |
Nodular | Male | Yes | 4 |
Nodular | Female | No | 2 |
Nodular | Female | Yes | 6 |
Diffuse | Male | No | 12 |
Diffuse | Male | Yes | 1 |
Diffuse | Female | No | 3 |
Diffuse | Female | Yes | 1 |
Although the assumption of Poisson distributed observations is convenient for the purposes of modelling, it might not be a realistic assumption, because of the way in which the data have been collected. Frequently, when contingency table data are obtained, the total number of observations (i.e., the grand total, or the sum of all the cell counts) is fixed in advance. In this case, no individual cell count can exceed the pre-specified fixed total, so the assumption of Poisson sampling is invalid as the sample space is bounded. Furthermore, with a fixed total, the observed data can no longer be observations of independent random variables.
For example, consider the lymphoma
contingency table from Table 6.3. It may be that these data were collected over a fixed period of time, and that in that time there happened to be 30 patients. In this case, the Poisson assumption is perfectly valid. However, it may have been decided at the outset to collect data on 30 patients, in which case the grand total is fixed at 30, and the Poisson assumption is not valid.
When the grand total is fixed, a more appropriate distribution for the cell counts is the multinomial distribution. The multinomial distribution is the distribution of cell counts arising when a prespecified total of \(N\) items are each independently assigned to one of \(n\) cells, where the probability of being classified into cell \(i\) is \(p_i\), \({i=1,\ldots ,n}\), so \(\sum_{i=1}^n p_i=1.\) The probability function for the multinomial distribution is
\[ \begin{aligned} f_{\boldsymbol{Y}}(\boldsymbol{y};{\boldsymbol p}) &= P(Y_1=y_1,\ldots ,Y_n=y_n) \cr &= \begin{cases} N!\prod_{i=1}^n \frac{p_i^{y_i}}{y_i!} & \text{if $\sum_{i=1}^n y_i=N$} \\ 0 & \text{otherwise.} \end{cases} \end{aligned} \tag{6.1}\]
The binomial is the special case of the multinomial with two cells (\(n=2\)).
We can still use a log-linear model for contingency table data when the data have been obtained by multinomial sampling. We model \(\log\mu_i=\log(Np_i)\), \({i=1,\ldots ,n}\) as a linear function of explanatory variables. However, such a model must preserve \(\sum_{i=1}^n \mu_i=N\), the grand total which is fixed in advance.
From Equation 6.1, the log-likelihood for a multinomial log-linear model is
\[ \ell(\boldsymbol{\mu})= \sum_{i=1}^n y_i\log\mu_i -N\log N -\sum_{i=1}^n \log y_i!+\log N!. \]
Therefore, the maximum likelihood estimates \(\hat{\boldsymbol{\mu}}\) maximise \(\sum_{i=1}^n y_i\log\mu_i\) subject to the constraints \(\sum_{i=1}^n \mu_i=N=\sum_{i=1}^n y_i\) (multinomial sampling) and \(\log\boldsymbol{\mu}=\boldsymbol{X}\boldsymbol{\beta}\) (imposed by the model).
For a Poisson log-linear model,
\[ L(\boldsymbol{\mu})=\prod_{i=1}^n {{e^{-\mu_i} \mu_i^{y_i}}\over{y_i!}}. \]
Therefore,
\[ \begin{aligned} \ell(\boldsymbol{\mu})&=-\sum_{i=1}^n \mu_i +\sum_{i=1}^ny_i\log\mu_i-\sum_{i=1}^n \log y_i! \\ &=-\sum_{i=1}^n \exp(\log\mu_i) +\sum_{i=1}^ny_i\log\mu_i-\sum_{i=1}^n \log y_i!. \end{aligned} \tag{6.2}\]
Now any Poisson log-linear model with an intercept can be expressed as
\[ \log\mu_i=\alpha + \text{other terms depending on $i$},\qquad {i=1,\ldots ,n} \]
so that
\[ \begin{aligned} &\qquad{\partial\over{\partial\alpha}} \ell(\boldsymbol{\mu})=-\sum_{i=1}^n \exp(\log\mu_i) +\sum_{i=1}^ny_i. \\ &\Rightarrow\qquad \sum_{i=1}^n \hat{\mu}_i=\sum_{i=1}^n y_i. \end{aligned} \tag{6.3}\]
From Equation 6.2, we notice that, at \(\alpha=\hat{\alpha}\) the log-likelihood takes the form
\[ \ell(\boldsymbol{\mu})=-\sum_{i=1}^n y_i +\sum_{i=1}^ny_i\log{\mu}_i-\sum_{i=1}^n \log y_i!. \]
Hence, when we maximise the log-likelihood, for a Poisson log-linear model with intercept, with respect to the other log-linear parameters, we are maximising \(\sum_{i=1}^ny_i\log{\mu}_i\) subject to the constraints \(\sum_{i=1}^n \mu_i=\sum_{i=1}^n y_i\) from Equation 6.3 and \(\log\boldsymbol{\mu}=\boldsymbol{X}\boldsymbol{\beta}\) (imposed by the model).
Thus, the maximum likelihood estimates for multinomial log-linear parameters are identical to those for Poisson log-linear parameters. Furthermore, the maximised log-likelihoods for both Poisson and multinomial models take the form \(\sum_{i=1}^ny_i\log\hat{\mu}_i\) as functions of the log-linear parameter estimates. Therefore any inferences based on maximised log-likelihoods (such as likelihood ratio tests) will be the same.
Therefore, we can analyse contingency table data using Poisson log-linear models, even when the data has been obtained by multinomial sampling. All that is required is that we ensure that the Poisson model contains an intercept term.
Sometimes margins other than just the grand total may be prespecified. For example, consider the lymphoma
contingency table in Table 6.3. It may have been decided at the outset to collect data on 18 male patients and 12 female patients. Alternatively, perhaps the distribution of both the Sex and Cell type of the patients was fixed in advance as in Table 6.4. In cases where a margin is fixed by design, the data consist of a number of fixed total subgroups, defined by the fixed margin. Neither Poisson nor multinomial sampling assumptions are valid. The appropriate distribution combines a separate, independent multinomial for each subgroup. For example, if just the Sex margin is fixed as above, then the appropriate distribution for modelling the data is two independent multinomials, one for males with \(N=18\) and one for females with \(N=12\). Each of these multinomials has four cells, representing the cross-classification of the relevant patients by Cell Type and Remission. Alternatively, if it is the Cell type/Sex margin which has been fixed, then the appropriate distribution is four independent two-cell multinomials (binomials) representing the remission classification for each of the four fixed-total patient subgroups.
When the data are modelled using independent multinomials, then the joint distribution of the cell counts \(Y_1, \ldots, Y_n\) is the product of terms of the same form as Equation 6.1, one for each fixed-total subgroup. We call this a distribution a product multinomial. Each subgroup has its own fixed total. The full joint density is a product of \(n\) terms, as before, with each cell count appearing exactly once.
For example, if the Sex margin is fixed for lymphoma
, then the product multinomial distribution has the form
\[f_{\boldsymbol{Y}}(\boldsymbol{y};{\boldsymbol p})= \begin{cases} N_m!\prod_{i=1}^4 {{p_{mi}^{y_{mi}}}\over{y_{mi}!}} N_f!\prod_{i=1}^4 {{p_{fi}^{y_{fi}}}\over{y_{fi}!}} & \text{if $\sum_{i=1}^4 y_{mi}=N_m$ and $\sum_{i=1}^4 y_{fi}=N_f$} \\ 0 & \text{otherwise,} \end{cases} \]
where \(N_m\) and \(N_f\) are the two fixed marginal totals (18 and 12 respectively), \(y_{m1},\ldots ,y_{m4}\) are the cell counts for the Cell type/Remission cross-classification for males and \(y_{f1},\ldots ,y_{f4}\) are the corresponding cell counts for females. Here \(\sum_{i=1}^4 p_{mi}=\sum_{i=1}^4 p_{fi}=1\), \(E(Y_{mi})=N_mp_{mi}\), \(i=1,\ldots ,4\), and \(E(Y_{fi})=N_fp_{fi}\), \(i=1,\ldots ,4\).
Using similar results to those used in Section 6.3 (but not proved here), we can analyse contingency table data using Poisson log-linear models, even when the data has been obtained by product multinomial sampling. However, we must ensure that the Poisson model contains a term corresponding to the fixed margin (and all marginal terms). Then, the estimated means for the specified margin are equal to the corresponding fixed totals.
For example, for the lymphoma
dataset, for inferences obtained using a Poisson model to be valid when the Sex margin is fixed in advance, the Poisson model must contain the Sex main effect (and the intercept). For inferences obtained using a Poisson model to be valid when the Cell type/Sex margin is fixed in advance, the Poisson model must contain the Cell type/Sex interaction, and all marginal terms (the Cell type main effect, the Sex main effect and the intercept).
Therefore, when analysing product multinomial data using a Poisson log-linear model, certain terms must be present in any model we fit. If they are removed, the inferences would no longer be valid.
Log-linear models for contingency tables enable us to determine important properties concerning the joint distribution of the classifying variables. In particular, the form of our preferred log linear model for a table will have implications for how the variables are associated.
Each combination of the classifying variables occurs exactly once in a contingency table. Therefore, the model with the highest order interaction (between all the variables) and all marginal terms (all other interactions) is the saturated model. The implication of this model is that every combination of levels of the variables has its own mean (probability) and that there are no relationships between these means (no structure). The variables are highly dependent.
To consider the implications of simpler models, we first consider a two-way \(r\times c\) table where the two classifying variables \(R\) and \(C\) have \(r\) and \(c\) levels respectively. The saturated model \(R*C\) implies that the two variables are associated. If we remove the RC interaction, we have the model \(R+C\),
\[ \log\mu_i=\alpha+\beta_R(r_i)+\beta_C(c_i),\qquad{i=1,\ldots ,n} \]
where \(n=rc\) is the total number of cells in the table. Because of the equivalence of Poisson and multinomial sampling, we can think of each cell mean \(\mu_i\) as equal to \(Np_i\) where \(N\) is the total number of observations in the table, and \(p_i\) is a cell probability. As each combination of levels of \(R\) and \(C\) is represented in exactly one cell, it is also convenient to replace the cell label \(i\) by the pair of labels \(j\) and \(k\) representing the corresponding levels of \(R\) and \(C\) respectively. Hence
\[ \log p_{jk}=\alpha+\beta_R(j)+\beta_C(k)-\log N, \quad j=1,\ldots ,r,\quad k=1,\ldots ,c. \]
Therefore
\[P(R=j,C=k)=\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N], \quad j=1,\ldots ,r,\quad k=1,\ldots ,c,\]
so
\[\begin{align*} 1&=\sum_{j=1}^r\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N]\cr &={1\over N}\exp[\alpha]\sum_{j=1}^r\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)]. \end{align*}\]
Furthermore \[\begin{align*} P(R=j)&=\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N]\cr &={1\over N}\exp[\alpha]\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)], \quad j=1,\ldots,r, \end{align*}\] and \[\begin{align*} P(C=k)&=\sum_{j=1}^r\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N]\cr &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\sum_{j=1}^r\exp[\beta_R(j)], \quad k=1,\ldots,c. \end{align*}\] Therefore \[\begin{align*} P(R=j)P(C=k)&= {1\over N}\exp[\alpha]\exp[\beta_C(k)]\exp[\beta_R(j)]\times 1\cr &=P(R=j,C=k), \quad j=1,\ldots ,r,\quad k=1,\ldots ,c. \end{align*}\]
Absence of the interaction \(R*C\) in a log-linear model implies that \(R\) and \(C\) are independent variables. Absence of main effects is generally less interesting, and main effects are typically not removed from a log-linear model.
In multiway tables, absence of a two-factor interaction does not necessarily mean that the two variables are independent. For example, consider the lymphoma
dataset, with 3 binary classifying variables Sex (\(S\)), Cell type (\(C\)) and Remission (\(R\)). After comparing the fit of several possible models, we find that a reasonable log-linear model for these data is \(R * C + C * S\). Hence the interaction between remission and sex is absent. The fitted cell probabilities from this log-linear model are shown in Table 6.7.
lymphoma
dataset.
Cell | Sex | No remission | Remission |
---|---|---|---|
Diffuse | Female | 0.1176 | 0.0157 |
Nodular | Female | 0.0615 | 0.2051 |
Diffuse | Male | 0.3824 | 0.0510 |
Nodular | Male | 0.0385 | 0.1282 |
The estimated probabilities for the two-way Sex/Remission margin (together with the corresponding one-way margins) are shown in Table 6.8.
lymphoma
dataset.
No | Yes | Sum | |
---|---|---|---|
Female | 0.1792 | 0.2208 | 0.4 |
Male | 0.4208 | 0.1792 | 0.6 |
Sum | 0.6000 | 0.4000 | 1.0 |
It can immediately be seen that this model does not imply independence of \(R\) and \(S\), as \(\hat{P}(R,S)\ne\hat{P}(R)\hat{P}(S)\). What the model \(R*C+C*S\) implies is that \(R\) is independent of \(S\) conditional on \(C\), that is
\[ P(R,S|C)=P(R|C)P(S|C). \]
Another way of expressing this is
\[ P(R|S,C)=P(R|C), \]
that is, the probability of each level of \(R\) given a particular combination of \(S\) and \(C\), does not depend on which level \(S\) takes. Table 6.9 shows the estimated conditional probabilities for the lymphoma
data. The probability of remission depends only on a patient’s cell type, and not on their sex.
lymphoma
dataset.
Female | Male | |
---|---|---|
Diffuse | 0.118 | 0.118 |
Nodular | 0.769 | 0.769 |
In general, if we have an \(r\)-way contingency table with classifying variables \(X_1,\ldots ,X_r\), then a log linear model which does not contain the \(X_1 * X_2\) interaction (and therefore by the principle of marginality contains no interaction involving both \(X_1\) and \(X_2\)) implies that \(X_1\) and \(X_2\) are conditionally independent given \(X_3,\ldots ,X_r\), that is
\[ P(X_1,X_2|X_3,\ldots ,X_r)=P(X_1|X_3,\ldots ,X_r)P(X_2|X_3,\ldots ,X_r). \]
The proof of this is similar to the proof in the two-way case. Again, an alternative way of expressing conditional independence is
\[ P(X_1|X_2,X_3,\ldots ,X_r)=P(X_1|X_3,\ldots ,X_r) \]
or
\[ P(X_2|X_1,X_3,\ldots ,X_r)=P(X_2|X_3,\ldots ,X_r). \]
Although for the lymphoma
dataset \(R\) and \(S\) are conditionally independent given \(C\), they are not marginally independent. Using the marginal cell probabilities from Table 6.8, we find that the probability of remission is \(0.30\) for men and \(0.55\) for women. Male patients have a much lower probability of remission. The reason for this is that, although \(R\) and \(S\) are not directly associated, they are both associated with \(C\). Observing the estimated values it is clear that patients with nodular cell type have a greater probability of remission, and furthermore, that female patients are more likely to have this cell type than males. Hence, women are more likely to have remission than men.
Suppose the factors for a three-way tables are \(X_1\), \(X_2\) and \(X_3\). We can list all possible models and the implications for the conditional independence structure as follows
Conditional and marginal association of two variables can therefore often appear somewhat different. Sometimes, the association can be ‘reversed’ so that what looks like a positive association marginally, becomes a negative association conditionally. This is known as Simpson’s paradox.
In 1972-74, a survey of women was carried out in an area of Newcastle. A follow-up survey was carried out 20 years later. Among the variables observed in the initial survey was whether or not the individual was a smoker and among those in the follow-up survey was whether the individual was still alive, or had died in the intervening period. A summary of the responses is shown in Table 6.10.
Smoker | Number dead | Number alive | Proportion dead |
---|---|---|---|
yes | 139 | 443 | 0.239 |
no | 230 | 502 | 0.314 |
Looking at this table, it appears that the non-smokers had a greater probability of dying. However, there is an important extra variable to be considered, related to both smoking habit and mortality – age (at the time of the initial survey). When we consider this variable, we get Table 6.11. Conditional on every age at outset, it is now the smokers who have a higher probability of dying. The marginal association is reversed in the table conditional on age, because mortality (obviously) and smoking are associated with age. There are proportionally many fewer smokers in the older age-groups (where the probability of death is greater).
Age | Smoker | Number dead | Number alive | Proportion dead |
---|---|---|---|---|
18-24 | yes | 2 | 53 | 0.036 |
18-24 | no | 1 | 61 | 0.016 |
25-34 | yes | 3 | 121 | 0.024 |
25-34 | no | 5 | 152 | 0.032 |
35-44 | yes | 14 | 95 | 0.128 |
35-44 | no | 7 | 114 | 0.058 |
45-54 | yes | 27 | 103 | 0.208 |
45-54 | no | 12 | 66 | 0.154 |
55-64 | yes | 51 | 64 | 0.443 |
55-64 | no | 40 | 81 | 0.331 |
65-74 | yes | 29 | 7 | 0.806 |
65-74 | no | 101 | 28 | 0.783 |
75+ | yes | 13 | 0 | 1.000 |
75+ | no | 64 | 0 | 1.000 |
When making inferences about associations between variables, it is important that all other variables which are relevant are considered. Marginal inferences may lead to misleading conclusions.