6 Models for Categorical Data

6.1 Contingency tables

A particularly important application of generalized linear models is the analysis of categorical data. Here, the data consist of observations of one or more categorical variables for each of a number of units (often individuals). Therefore, each unit is cross-classified by the categorical variables

For example, the job dataset gives the job satisfaction and income band of 901 individuals from the 1984 General Social Survey, which is summarised in Table 6.1.

Table 6.1: A contingency table of the job dataset.

	Very Dissatisfied	A Little Dissatisfied	Moderately Satisfied	Very Satisfied
<6000	20	24	80	82
6000-15000	22	38	104	125
15000-25000	13	28	81	113
>25000	7	18	54	92

A cross-classification table like this is called a contingency table. This is a two-way table, as there are two classifying variables. It might also be describe as a $4\times 4$ contingency table (as each of the classifying variables takes one of four possible levels).

Each position in a contingency table is called a cell and the number of individuals in a particular cell is the cell count.

Partial classifications derived from the table are called margins. For a two-way table, these are often displayed in the margins, as in Table 6.2. These are one-way margins as they represent the classification of items by a single variable; income group and job satisfaction, respectively.

Table 6.2: A contingency table of the job dataset, including one-way margins.

	Very Dissatisfied	A Little Dissatisfied	Moderately Satisfied	Very Satisfied	Sum
<6000	20	24	80	82	206
6000-15000	22	38	104	125	289
15000-25000	13	28	81	113	235
>25000	7	18	54	92	171
Sum	62	108	319	412	901

The lymphoma dataset gives information about 30 patients, classified by cell type of lymphoma, sex, and response to treatment, as shown in Table 6.3. This is an example of a three-way contingency table. It is a $2\times 2\times 2$ or $2^3$ table.

Table 6.3: A contingency table of the lymphoma dataset.

Cell	Sex	No remission	Remission
Diffuse	Female	3	1
Nodular	Female	2	6
Diffuse	Male	12	1
Nodular	Male	1	4

For multiway tables, higher order margins may be calculated. For example, for lymphoma, the two-way Cell type/Sex margin is shown in Table 6.4.

Table 6.4: The two-way Cell type/Sex margin for the lymphoma dataset.

	Female	Male
Diffuse	4	13
Nodular	8	5

6.2 Log-linear models

We can model contingency table data using generalised linear models. To do this, we assume that the cell counts are observations of independent Poisson random variables. This is intuitively sensible as the cell counts are non-negative integers (the sample space for the Poisson distribution). Therefore, if the table has $n$ cells, which we label $1,\ldots ,n$, then the observed cell counts $y_1,\ldots ,y_n$ are assumed to be observations of independent Poisson random variables $Y_1,\ldots ,Y_n$. We denote the means of these Poisson random variables by $\mu_1,\ldots ,\mu_n$. The canonical link function for the Poisson distribution is the log function, and we assume this link function throughout. A generalised linear model for Poisson data using the log link function is called a log-linear model, as we have already seen in Section 5.4.3.

The explanatory variables in a log-linear model for contingency table data are the cross-classifying variables, or factors. As usual with categorical variables, we can include interactions in the model as well as just main effects (see Example 3.6). Such a model will describe how the expected count in each cell depends on the classifying variables, and any interactions between them. Interpretation of these models will be discussed further in Section 6.5.

Table 6.5 shows the original data structure of the job dataset. It provides exactly the same data as the contingency table in Table 6.1, but in a different format, sometimes called long format. A log-linear model is just a Poisson GLM, where the response variable is Count, and Income and Satisfaction are explanatory variables.

Table 6.5: The job dataset.

Income	Satisfaction	Count
<6000	Very Dissatisfied	20
<6000	A Little Dissatisfied	24
<6000	Moderately Satisfied	80
<6000	Very Satisfied	82
6000-15000	Very Dissatisfied	22
6000-15000	A Little Dissatisfied	38
6000-15000	Moderately Satisfied	104
6000-15000	Very Satisfied	125
15000-25000	Very Dissatisfied	13
15000-25000	A Little Dissatisfied	28
15000-25000	Moderately Satisfied	81
15000-25000	Very Satisfied	113
>25000	Very Dissatisfied	7
>25000	A Little Dissatisfied	18
>25000	Moderately Satisfied	54
>25000	Very Satisfied	92

Table 6.6 shows the lymphoma dataset in long format. Again, a log-linear model for the contingency table (Table 6.3) is just a Poisson GLM for this data, where in this case the response variable is Count, and Cell, Sex, and Remis are explanatory variables.

Table 6.6: The lymphoma dataset.

Cell	Sex	Remis	Count
Nodular	Male	No	1
Nodular	Male	Yes	4
Nodular	Female	No	2
Nodular	Female	Yes	6
Diffuse	Male	No	12
Diffuse	Male	Yes	1
Diffuse	Female	No	3
Diffuse	Female	Yes	1

6.3 Multinomial sampling

Although the assumption of Poisson distributed observations is convenient for the purposes of modelling, it might not be a realistic assumption, because of the way in which the data have been collected. Frequently, when contingency table data are obtained, the total number of observations (i.e., the grand total, or the sum of all the cell counts) is fixed in advance. In this case, no individual cell count can exceed the pre-specified fixed total, so the assumption of Poisson sampling is invalid as the sample space is bounded. Furthermore, with a fixed total, the observed data can no longer be observations of independent random variables.

For example, consider the lymphoma contingency table from Table 6.3. It may be that these data were collected over a fixed period of time, and that in that time there happened to be 30 patients. In this case, the Poisson assumption is perfectly valid. However, it may have been decided at the outset to collect data on 30 patients, in which case the grand total is fixed at 30, and the Poisson assumption is not valid.

When the grand total is fixed, a more appropriate distribution for the cell counts is the multinomial distribution. The multinomial distribution is the distribution of cell counts arising when a prespecified total of $N$ items are each independently assigned to one of $n$ cells, where the probability of being classified into cell $i$ is $p_i$, ${i=1,\ldots ,n}$, so $\sum_{i=1}^n p_i=1.$ The probability function for the multinomial distribution is

\[ \begin{aligned} f_{\boldsymbol{Y}}(\boldsymbol{y};{\boldsymbol p}) &= P(Y_1=y_1,\ldots ,Y_n=y_n) \cr &= \begin{cases} N!\prod_{i=1}^n \frac{p_i^{y_i}}{y_i!} & \text{if $\sum_{i=1}^n y_i=N$} \\ 0 & \text{otherwise.} \end{cases} \end{aligned} \tag{6.1}\]

The binomial is the special case of the multinomial with two cells ($n=2$).

We can still use a log-linear model for contingency table data when the data have been obtained by multinomial sampling. We model $\log\mu_i=\log(Np_i)$, ${i=1,\ldots ,n}$ as a linear function of explanatory variables. However, such a model must preserve $\sum_{i=1}^n \mu_i=N$, the grand total which is fixed in advance.

From Equation 6.1, the log-likelihood for a multinomial log-linear model is

\[ \ell(\boldsymbol{\mu})= \sum_{i=1}^n y_i\log\mu_i -N\log N -\sum_{i=1}^n \log y_i!+\log N!. \]

Therefore, the maximum likelihood estimates $\hat{\boldsymbol{\mu}}$ maximise $\sum_{i=1}^n y_i\log\mu_i$ subject to the constraints $\sum_{i=1}^n \mu_i=N=\sum_{i=1}^n y_i$ (multinomial sampling) and $\log\boldsymbol{\mu}=\boldsymbol{X}\boldsymbol{\beta}$ (imposed by the model).

For a Poisson log-linear model,

\[ L(\boldsymbol{\mu})=\prod_{i=1}^n {{e^{-\mu_i} \mu_i^{y_i}}\over{y_i!}}. \]

Therefore,

\[ \begin{aligned} \ell(\boldsymbol{\mu})&=-\sum_{i=1}^n \mu_i +\sum_{i=1}^ny_i\log\mu_i-\sum_{i=1}^n \log y_i! \\ &=-\sum_{i=1}^n \exp(\log\mu_i) +\sum_{i=1}^ny_i\log\mu_i-\sum_{i=1}^n \log y_i!. \end{aligned} \tag{6.2}\]

Now any Poisson log-linear model with an intercept can be expressed as

\[ \log\mu_i=\alpha + \text{other terms depending on $i$},\qquad {i=1,\ldots ,n} \]

so that

\[ \begin{aligned} &\qquad{\partial\over{\partial\alpha}} \ell(\boldsymbol{\mu})=-\sum_{i=1}^n \exp(\log\mu_i) +\sum_{i=1}^ny_i. \\ &\Rightarrow\qquad \sum_{i=1}^n \hat{\mu}_i=\sum_{i=1}^n y_i. \end{aligned} \tag{6.3}\]

From Equation 6.2, we notice that, at $\alpha=\hat{\alpha}$ the log-likelihood takes the form

\[ \ell(\boldsymbol{\mu})=-\sum_{i=1}^n y_i +\sum_{i=1}^ny_i\log{\mu}_i-\sum_{i=1}^n \log y_i!. \]

Hence, when we maximise the log-likelihood, for a Poisson log-linear model with intercept, with respect to the other log-linear parameters, we are maximising $\sum_{i=1}^ny_i\log{\mu}_i$ subject to the constraints $\sum_{i=1}^n \mu_i=\sum_{i=1}^n y_i$ from Equation 6.3 and $\log\boldsymbol{\mu}=\boldsymbol{X}\boldsymbol{\beta}$ (imposed by the model).

Thus, the maximum likelihood estimates for multinomial log-linear parameters are identical to those for Poisson log-linear parameters. Furthermore, the maximised log-likelihoods for both Poisson and multinomial models take the form $\sum_{i=1}^ny_i\log\hat{\mu}_i$ as functions of the log-linear parameter estimates. Therefore any inferences based on maximised log-likelihoods (such as likelihood ratio tests) will be the same.

Therefore, we can analyse contingency table data using Poisson log-linear models, even when the data has been obtained by multinomial sampling. All that is required is that we ensure that the Poisson model contains an intercept term.

6.4 Product multinomial sampling

Sometimes margins other than just the grand total may be prespecified. For example, consider the lymphoma contingency table in Table 6.3. It may have been decided at the outset to collect data on 18 male patients and 12 female patients. Alternatively, perhaps the distribution of both the Sex and Cell type of the patients was fixed in advance as in Table 6.4. In cases where a margin is fixed by design, the data consist of a number of fixed total subgroups, defined by the fixed margin. Neither Poisson nor multinomial sampling assumptions are valid. The appropriate distribution combines a separate, independent multinomial for each subgroup. For example, if just the Sex margin is fixed as above, then the appropriate distribution for modelling the data is two independent multinomials, one for males with $N=18$ and one for females with $N=12$. Each of these multinomials has four cells, representing the cross-classification of the relevant patients by Cell Type and Remission. Alternatively, if it is the Cell type/Sex margin which has been fixed, then the appropriate distribution is four independent two-cell multinomials (binomials) representing the remission classification for each of the four fixed-total patient subgroups.

When the data are modelled using independent multinomials, then the joint distribution of the cell counts $Y_1, \ldots, Y_n$ is the product of terms of the same form as Equation 6.1, one for each fixed-total subgroup. We call this a distribution a product multinomial. Each subgroup has its own fixed total. The full joint density is a product of $n$ terms, as before, with each cell count appearing exactly once.

For example, if the Sex margin is fixed for lymphoma, then the product multinomial distribution has the form

\[f_{\boldsymbol{Y}}(\boldsymbol{y};{\boldsymbol p})= \begin{cases} N_m!\prod_{i=1}^4 {{p_{mi}^{y_{mi}}}\over{y_{mi}!}} N_f!\prod_{i=1}^4 {{p_{fi}^{y_{fi}}}\over{y_{fi}!}} & \text{if $\sum_{i=1}^4 y_{mi}=N_m$ and $\sum_{i=1}^4 y_{fi}=N_f$} \\ 0 & \text{otherwise,} \end{cases} \]

where $N_m$ and $N_f$ are the two fixed marginal totals (18 and 12 respectively), $y_{m1},\ldots ,y_{m4}$ are the cell counts for the Cell type/Remission cross-classification for males and $y_{f1},\ldots ,y_{f4}$ are the corresponding cell counts for females. Here $\sum_{i=1}^4 p_{mi}=\sum_{i=1}^4 p_{fi}=1$, $E(Y_{mi})=N_mp_{mi}$, $i=1,\ldots ,4$, and $E(Y_{fi})=N_fp_{fi}$, $i=1,\ldots ,4$.

Using similar results to those used in Section 6.3 (but not proved here), we can analyse contingency table data using Poisson log-linear models, even when the data has been obtained by product multinomial sampling. However, we must ensure that the Poisson model contains a term corresponding to the fixed margin (and all marginal terms). Then, the estimated means for the specified margin are equal to the corresponding fixed totals.

For example, for the lymphoma dataset, for inferences obtained using a Poisson model to be valid when the Sex margin is fixed in advance, the Poisson model must contain the Sex main effect (and the intercept). For inferences obtained using a Poisson model to be valid when the Cell type/Sex margin is fixed in advance, the Poisson model must contain the Cell type/Sex interaction, and all marginal terms (the Cell type main effect, the Sex main effect and the intercept).

Therefore, when analysing product multinomial data using a Poisson log-linear model, certain terms must be present in any model we fit. If they are removed, the inferences would no longer be valid.

6.5 Interpreting log-linear models for two-way tables

Log-linear models for contingency tables enable us to determine important properties concerning the joint distribution of the classifying variables. In particular, the form of our preferred log linear model for a table will have implications for how the variables are associated.

Each combination of the classifying variables occurs exactly once in a contingency table. Therefore, the model with the highest order interaction (between all the variables) and all marginal terms (all other interactions) is the saturated model. The implication of this model is that every combination of levels of the variables has its own mean (probability) and that there are no relationships between these means (no structure). The variables are highly dependent.

To consider the implications of simpler models, we first consider a two-way $r\times c$ table where the two classifying variables $R$ and $C$ have $r$ and $c$ levels respectively. The saturated model $R*C$ implies that the two variables are associated. If we remove the RC interaction, we have the model $R+C$,

\[ \log\mu_i=\alpha+\beta_R(r_i)+\beta_C(c_i),\qquad{i=1,\ldots ,n} \]

where $n=rc$ is the total number of cells in the table. Because of the equivalence of Poisson and multinomial sampling, we can think of each cell mean $\mu_i$ as equal to $Np_i$ where $N$ is the total number of observations in the table, and $p_i$ is a cell probability. As each combination of levels of $R$ and $C$ is represented in exactly one cell, it is also convenient to replace the cell label $i$ by the pair of labels $j$ and $k$ representing the corresponding levels of $R$ and $C$ respectively. Hence

\[ \log p_{jk}=\alpha+\beta_R(j)+\beta_C(k)-\log N, \quad j=1,\ldots ,r,\quad k=1,\ldots ,c. \]

Therefore

\[P(R=j,C=k)=\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N], \quad j=1,\ldots ,r,\quad k=1,\ldots ,c,\]

\[\begin{align*} 1&=\sum_{j=1}^r\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N]\cr &={1\over N}\exp[\alpha]\sum_{j=1}^r\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)]. \end{align*}\]

Furthermore \[\begin{align*} P(R=j)&=\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N]\cr &={1\over N}\exp[\alpha]\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)], \quad j=1,\ldots,r, \end{align*}\] and \[\begin{align*} P(C=k)&=\sum_{j=1}^r\exp[\alpha+\beta_R(j)+\beta_C(k)-\log N]\cr &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\sum_{j=1}^r\exp[\beta_R(j)], \quad k=1,\ldots,c. \end{align*}\] Therefore \[\begin{align*} P(R=j)P(C=k)&= {1\over N}\exp[\alpha]\exp[\beta_C(k)]\exp[\beta_R(j)]\times 1\cr &=P(R=j,C=k), \quad j=1,\ldots ,r,\quad k=1,\ldots ,c. \end{align*}\]

Absence of the interaction $R*C$ in a log-linear model implies that $R$ and $C$ are independent variables. Absence of main effects is generally less interesting, and main effects are typically not removed from a log-linear model.

6.6 Interpreting log-linear models for multiway tables

In multiway tables, absence of a two-factor interaction does not necessarily mean that the two variables are independent. For example, consider the lymphoma dataset, with 3 binary classifying variables Sex ($S$), Cell type ($C$) and Remission ($R$). After comparing the fit of several possible models, we find that a reasonable log-linear model for these data is $R * C + C * S$. Hence the interaction between remission and sex is absent. The fitted cell probabilities from this log-linear model are shown in Table 6.7.

Table 6.7: Fitted probabilities of each cell in the lymphoma dataset.

Cell	Sex	No remission	Remission
Diffuse	Female	0.1176	0.0157
Nodular	Female	0.0615	0.2051
Diffuse	Male	0.3824	0.0510
Nodular	Male	0.0385	0.1282

The estimated probabilities for the two-way Sex/Remission margin (together with the corresponding one-way margins) are shown in Table 6.8.

Table 6.8: Fitted marginal probabilities for the lymphoma dataset.

	No	Yes	Sum
Female	0.1792	0.2208	0.4
Male	0.4208	0.1792	0.6
Sum	0.6000	0.4000	1.0

It can immediately be seen that this model does not imply independence of $R$ and $S$, as $\hat{P}(R,S)\ne\hat{P}(R)\hat{P}(S)$. What the model $R*C+C*S$ implies is that $R$ is independent of $S$ conditional on $C$, that is

\[ P(R,S|C)=P(R|C)P(S|C). \]

Another way of expressing this is

\[ P(R|S,C)=P(R|C), \]

that is, the probability of each level of $R$ given a particular combination of $S$ and $C$, does not depend on which level $S$ takes. Table 6.9 shows the estimated conditional probabilities for the lymphoma data. The probability of remission depends only on a patient’s cell type, and not on their sex.

Table 6.9: Conditional probability of remission in the lymphoma dataset.

	Female	Male
Diffuse	0.118	0.118
Nodular	0.769	0.769

In general, if we have an $r$-way contingency table with classifying variables $X_1,\ldots ,X_r$, then a log linear model which does not contain the $X_1 * X_2$ interaction (and therefore by the principle of marginality contains no interaction involving both $X_1$ and $X_2$) implies that $X_1$ and $X_2$ are conditionally independent given $X_3,\ldots ,X_r$, that is

\[ P(X_1,X_2|X_3,\ldots ,X_r)=P(X_1|X_3,\ldots ,X_r)P(X_2|X_3,\ldots ,X_r). \]

The proof of this is similar to the proof in the two-way case. Again, an alternative way of expressing conditional independence is

\[ P(X_1|X_2,X_3,\ldots ,X_r)=P(X_1|X_3,\ldots ,X_r) \]

\[ P(X_2|X_1,X_3,\ldots ,X_r)=P(X_2|X_3,\ldots ,X_r). \]

Although for the lymphoma dataset $R$ and $S$ are conditionally independent given $C$, they are not marginally independent. Using the marginal cell probabilities from Table 6.8, we find that the probability of remission is $0.30$ for men and $0.55$ for women. Male patients have a much lower probability of remission. The reason for this is that, although $R$ and $S$ are not directly associated, they are both associated with $C$. Observing the estimated values it is clear that patients with nodular cell type have a greater probability of remission, and furthermore, that female patients are more likely to have this cell type than males. Hence, women are more likely to have remission than men.

Suppose the factors for a three-way tables are $X_1$, $X_2$ and $X_3$. We can list all possible models and the implications for the conditional independence structure as follows

Model 1 containing the terms $X_1, X_2, X_3$. All factors are mutually independent.
Model 2 containing the terms $X_1*X_2, X_3$. The factor $X_3$ is jointly independent of $X_1$ and $X_2$.
Model 3 containing the terms $X_1*X_2, X_2*X_3$. The factors $X_1$ and $X_3$ are conditionally independent given $X_2$.
Model 4 containing the terms $X_1*X_2, X_2*X_3, X_1*X_3$. There is no conditional independence structure. This is the model without the highest order interaction term.
Model 5 containing $X_1*X_2*X_3$. This is the saturated model. No more simplification of dependence structure is possible.

6.7 Simpson’s paradox

Conditional and marginal association of two variables can therefore often appear somewhat different. Sometimes, the association can be ‘reversed’ so that what looks like a positive association marginally, becomes a negative association conditionally. This is known as Simpson’s paradox.

In 1972-74, a survey of women was carried out in an area of Newcastle. A follow-up survey was carried out 20 years later. Among the variables observed in the initial survey was whether or not the individual was a smoker and among those in the follow-up survey was whether the individual was still alive, or had died in the intervening period. A summary of the responses is shown in Table 6.10.

Table 6.10: Number of respondents dead or alive at follow up, by smoking status

Smoker	Number dead	Number alive	Proportion dead
yes	139	443	0.239
no	230	502	0.314

Looking at this table, it appears that the non-smokers had a greater probability of dying. However, there is an important extra variable to be considered, related to both smoking habit and mortality – age (at the time of the initial survey). When we consider this variable, we get Table 6.11. Conditional on every age at outset, it is now the smokers who have a higher probability of dying. The marginal association is reversed in the table conditional on age, because mortality (obviously) and smoking are associated with age. There are proportionally many fewer smokers in the older age-groups (where the probability of death is greater).

Table 6.11: Number of respondents dead or alive at follow up, by smoking status and age.

Age	Smoker	Number dead	Number alive	Proportion dead
18-24	yes	2	53	0.036
18-24	no	1	61	0.016
25-34	yes	3	121	0.024
25-34	no	5	152	0.032
35-44	yes	14	95	0.128
35-44	no	7	114	0.058
45-54	yes	27	103	0.208
45-54	no	12	66	0.154
55-64	yes	51	64	0.443
55-64	no	40	81	0.331
65-74	yes	29	7	0.806
65-74	no	101	28	0.783
75+	yes	13	0	1.000
75+	no	64	0	1.000

When making inferences about associations between variables, it is important that all other variables which are relevant are considered. Marginal inferences may lead to misleading conclusions.