Goodness-of-fit tests for categorical data
Abstract. A significant aspect of data modeling with categorical predictors is the
definition of a saturated model. In fact, there are different ways of
specifying it—the casewise, the contingency table, and the collapsing
approaches—and they strictly depend on the unit of analysis considered.
The analytical units of reference could be the subjects or, alternatively,
groups of subjects that have the same covariate pattern. In the first case,
the goal is to predict the probability of success (failure) for each
individual; in the second case, the goal is to predict the proportion of
successes (failures) in each group. The analytical unit adopted does not
affect the estimation process; however, it does affect the definition of a
saturated model. Consequently, measures and tests of goodness of fit can lead
to different results and interpretations. Thus one must carefully consider
which approach to choose.
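The dependence of the deviance on the analytical unit can be illustrated with a minimal sketch. It assumes hypothetical data (two covariate patterns, A and B, with 10 subjects each) and an intercept-only logistic model, chosen because its maximum likelihood estimate has a closed form. The names `groups`, `xlogy`, and `group_term` are illustrative and do not come from the article.

```python
import math

# Hypothetical data (an assumption, not from the article):
# covariate pattern -> (successes, trials)
groups = {"A": (3, 10), "B": (7, 10)}

total_successes = sum(s for s, n in groups.values())
total_trials = sum(n for s, n in groups.values())

# Intercept-only logistic model: the MLE of the success probability is the
# overall proportion, whichever analytical unit is used.
p_hat = total_successes / total_trials


def xlogy(x, y):
    """Return x * log(y), with the convention 0 * log(.) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


# Casewise approach: the saturated model fits every subject perfectly, so its
# log likelihood is 0 and the deviance is -2 times the fitted log likelihood.
loglik_fitted = sum(
    xlogy(s, p_hat) + xlogy(n - s, 1.0 - p_hat) for s, n in groups.values()
)
deviance_casewise = -2.0 * loglik_fitted
df_casewise = total_trials - 1  # number of subjects minus fitted parameters


def group_term(s, n, p):
    """Contribution of one covariate pattern to the grouped deviance."""
    return xlogy(s, s / (n * p)) + xlogy(n - s, (n - s) / (n * (1.0 - p)))


# Contingency table approach: the saturated model reproduces the observed
# proportion in each covariate pattern.
deviance_grouped = 2.0 * sum(group_term(s, n, p_hat) for s, n in groups.values())
df_grouped = len(groups) - 1  # number of patterns minus fitted parameters

print(p_hat, deviance_casewise, df_casewise, deviance_grouped, df_grouped)
```

With these hypothetical counts, the estimated probability is 0.5 under both approaches, while the deviance is about 27.73 on 19 degrees of freedom in the casewise approach and about 3.29 on 1 degree of freedom in the contingency table approach.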
In this article, we focus on the deviance test for logistic regression models.
However, the results and the conclusions are easily applicable to other linear
models involving categorical regressors.
We show how Stata 12.1 implements goodness-of-fit tests. In this
situation, it is important to clarify which of the three approaches is
implemented by default. Furthermore, a prominent role is played by the shape
of the dataset considered (individual format or events–trials format) in
accordance with the analytical unit choice. In fact, the same procedure
applied to different data structures leads to different approaches to a
saturated model. Thus one must attend to practical and theoretical
statistical issues to avoid inappropriate analyses.
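The relation between the two data shapes can be sketched as follows. This is a minimal illustration on hypothetical records, not a description of Stata's internals: `individual` and `collapse` are assumed names, and each covariate pattern is represented by a single label.

```python
from collections import defaultdict

# Hypothetical individual-format records (an assumption, not from the
# article): one row per subject, as (covariate pattern, binary outcome).
individual = [("A", 1), ("A", 0), ("A", 0), ("B", 1), ("B", 1), ("B", 0)]


def collapse(records):
    """Collapse individual records into events-trials rows.

    Returns a dict mapping each covariate pattern to (events, trials).
    """
    counts = defaultdict(lambda: [0, 0])  # pattern -> [events, trials]
    for pattern, outcome in records:
        counts[pattern][0] += outcome  # count successes
        counts[pattern][1] += 1        # count subjects
    return {pattern: tuple(pair) for pattern, pair in counts.items()}


grouped = collapse(individual)
print(grouped)
```

Both shapes carry the same information about the regression coefficients. What changes is the number of analytical units, and with it the saturated model against which the fitted model is compared.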
University of Milano–Bicocca
Texas A&M University
College Station, TX
Authors: Rino Bellocco, Sara Algeri
Keywords: saturated models, categorical data, deviance, goodness-of-fit tests