{smcl}
{* 28dec2007}{...}
{hline}
{hi:help ice}, {hi:help uvis}{right:(SJ7-4: st0067_3; SJ5-4: st0067_2;}
{right:SJ5-2: st0067_1; SJ4-3: st0067)}
{hline}

{title:Multiple imputation by the MICE system of chained equations}


{title:Syntax}

{phang2}
{cmd:ice}
{it:mainvarlist}
{ifin}
{weight}
[{cmd:,} {it:ice_major_options ice_minor_options}]


{phang2}
{cmd:uvis}
{it:cmd}
{{it:yvar}|{it:llvar ulvar}}
{it:xvars}
{ifin}
{weight}
[{cmd:,} {it:uvis_options}]


{synoptset 29 tabbed}{...}
{synopthdr:ice_major_options}
{synoptline}
{p2coldent:* {cmdab:sav:ing(}{it:filename}[{opt , replace}]{cmd:)}}imputed and nonimputed variables are stored to {it:filename}{p_end}
{synopt :{opt cm:d(cmdlist)}}defines regression command(s) to be used for imputation{p_end}
{synopt :{opt dry:run}}reports the prediction equations; no imputations are done{p_end}
{synopt :{opt eq(eqlist)}}defines customized prediction equations{p_end}
{synopt :{opt m(#)}}defines the number of imputations{p_end}
{synopt :{opt ma:tch(varlist)}}prediction matching for each member of {it:varlist}{p_end}
{synopt :{opt pass:ive(passivelist)}}passive imputation{p_end}
{synopt :{opt sub:stitute(sublist)}}substitutes dummy variables for
multilevel categorical variables{p_end}
{synoptline}
{p 4 6 2}* {opt saving()} is required.


{synopthdr:ice_minor_options}
{synoptline}
{synopt :{opt bo:ot(varlist)}}estimates regression coefficients
for {it:varlist} in a bootstrap sample{p_end}
{synopt :{opt cc(varlist)}}prevents imputation of missing data in observations
in which {it:varlist} has a missing value{p_end}
{synopt :{opt cond:itional(condlist)}}conditional imputation{p_end}
{synopt :{opt cy:cles(#)}}determines number of cycles of regression switching{p_end}
{synopt :{opt drop:missing}}omits all observations
not in the estimation sample from the output{p_end}
{synopt :{opt g:enmiss(string)}}creates missingness indicator variable(s){p_end}
{synopt :{opt i:d(newvarname)}}creates {it:newvarname} containing
the original sort order of the data{p_end}
{synopt :{opt int:erval(intlist)}}imputes interval-censored variables{p_end}
{synopt :{opt nocons:tant}}suppresses the regression constant{p_end}
{synopt :{opt nopp}}suppresses special treatment of perfect prediction{p_end}
{synopt :{opt nosh:oweq}}suppresses presentation of prediction equations{p_end}
{synopt :{opt nowarn:ing}}suppresses warning messages{p_end}
{synopt :{opt on(varlist)}}imputes each member of {it:mainvarlist} univariately{p_end}
{synopt :{opt ord:erasis}}enters the variables in the order given{p_end}
{synopt :{opt s:eed(#)}}sets random-number seed{p_end}
{synopt :{opt tr:ace(trace_filename)}}monitors convergence of the imputation algorithm{p_end}
{synoptline}


{synopthdr:uvis_options}
{synoptline}
{p2coldent:* {opt g:en(newvarname)}}creates a variable containing imputations{p_end}
{synopt :{opt bo:ot}}estimates regression coefficients in a bootstrap sample{p_end}
{synopt :{opt ma:tch}}does prediction matching{p_end}
{synopt :{opt nocons:tant}}suppresses the regression constant{p_end}
{synopt :{opt nopp}}suppresses special treatment of perfect prediction{p_end}
{synopt :{opt replace}}overwrites {it:newvarname} if it exists{p_end}
{synopt :{opt se:ed(#)}}sets random-number seed{p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}* {cmd:gen()} is required.


{pstd}
where {it:cmd} (with {cmd:uvis}) may be
{helpb intreg},
{helpb logistic},
{helpb logit},
{helpb mlogit},
{helpb ologit},
or
{helpb regress}. {it:llvar} {it:ulvar} are required with {cmd:intreg}.

{pstd}
All weight types supported by the regression command in use ({it:cmd}) are allowed; see {help weight}.


{title:Description}

{pstd}
{cmd:ice} imputes missing values
in {it:mainvarlist} by using switching regression, an iterative multivariable
regression technique. The abbreviation MICE means multiple imputation by
chained equations and was apparently coined by Stef van Buuren. {cmd:ice}
implements MICE for Stata. Sets of imputed and nonimputed variables are
stored to a new file called {it:filename}. Any number of complete imputations
may be created. The original data are stored in {it:filename} as
"imputation number 0" and the new variable {cmd:_mj} is set to 0 for these
observations.

{pstd}
{cmd:uvis} (univariate imputation sampling) imputes missing values in the
single variable {it:yvar} based on multiple regression on {it:xvars}.
{cmd:uvis} is called repeatedly by {cmd:ice} in a regression switching mode to
perform multivariate imputation.

{pstd}
The missing observations are assumed to be missing at random (MAR) or
missing completely at random (MCAR), according to the jargon. See, for
example, van Buuren et al. (1999) for an explanation of these concepts.

{pstd}
{cmd:ice} and {cmd:uvis} require Stata 8 or later.
There have been incompatibility issues with Stata 7 and earlier.


{title:Options for ice}

{phang}
{cmd:saving(}{it:filename} [{cmd:,replace}]{cmd:)} saves the imputation to
{it:filename}. {opt replace} allows {it:filename} to be overwritten
with new data.

{phang}
{opt cmd(cmdlist)} defines the regression commands to be used for
each variable in {it:mainvarlist} when it becomes the dependent variable in
the switching regression procedure used by {cmd:uvis} (see {hi:Remarks}).  The
first item in {it:cmdlist} may be a command, such as {cmd:regress}, or may have
the syntax {it:varlist}{cmd::}{it:cmd}, specifying that command {it:cmd}
applies to all the variables in {it:varlist}.  Subsequent items in
{it:cmdlist} must follow the latter syntax, and each item should be followed
by a comma.

{pin}
The default {it:cmd} for a variable is {cmd:logit} when there are two distinct
values, {cmd:mlogit} when there are 3-5 and {cmd:regress} otherwise.

{phang2} Example:  {cmd:cmd(regress)} specifies that all variables are 
to be imputed by {cmd:regress}, overriding the defaults.

{phang2} Example:  {cmd:cmd(x1 x2:logit, x3:regress)} specifies that
{cmd:x1} and {cmd:x2} are to be imputed by {cmd:logit}, {cmd:x3} by
{cmd:regress} and all others by their default choices.

{phang}
{opt dryrun} does a "dry run"; that is, {cmd:ice} 
reports the prediction equations it has constructed from the various
inputs. No imputation is done, and no files are created. It is not
mandatory to specify an output file with {cmd:saving()} for a dry run.
Sometimes the prediction equation setup needs to be carefully
checked before running what may be a lengthy imputation process.
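
{pin}
As a minimal sketch (the variable names {cmd:y}, {cmd:x1}, {cmd:x2}, and
{cmd:x3} are hypothetical), the prediction equations implied by a set of
options could be inspected before committing to a long run. No output file
is named because none is needed for a dry run:

{phang2}{cmd:. ice y x1 x2 x3, cmd(x1:logit) dryrun}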

{phang}
{opt eq(eqlist)} allows one to define customized prediction
equations for any subset of variables in {it:mainvarlist}. The option,
particularly when used with {cmd:passive()}, allows
great flexibility in the possible imputation schemes. The
syntax of {it:eqlist} is {it:varname1}{cmd::}{it:varlist1}
[{cmd:,}{it:varname2}{cmd::}{it:varlist2} ...], where each
{it:varname#} (or {it:varlist#})
is a member (or subset) of {it:mainvarlist}. It is your responsibility to ensure
that each equation is sensible. {cmd:ice} places no restrictions
except to check that all variables mentioned are indeed in
{it:mainvarlist} and that an equation is not defined
for a variable specified to be passively imputed
(see the {cmd:passive()} option). {cmd:eq()} takes
precedence over all default definitions and assumptions about 
the way a given variable in {it:mainvarlist} will be imputed.
The default, if the {cmd:passive()} and {cmd:substitute()}
options are not invoked, is that each
variable in {it:mainvarlist} with any missing data is imputed from all
the other variables in {it:mainvarlist}.
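
{pin}
For instance, assuming hypothetical variables {cmd:y}, {cmd:x1}, {cmd:x2}, and
{cmd:x3}, the following sketch would impute {cmd:x1} from {cmd:x2} and {cmd:y}
only and {cmd:x2} from {cmd:x1} and {cmd:y} only, leaving any other variable
with missing values to its default equation. {cmd:dryrun} is added here so
that the resulting equations can be checked first:

{phang2}{cmd:. ice y x1 x2 x3, eq(x1:x2 y, x2:x1 y) dryrun}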

{phang}
{opt m(#)} sets the number of imputations required
(minimum 1, no upper limit). The default is {cmd:m(1)}.

{phang}
{cmd:match}[{cmd:(}{it:varlist}{cmd:)}] specifies that each member of
{it:varlist} be imputed with the {cmd:match} option of {cmd:uvis}.
This provides prediction matching for each member of {it:varlist}.
If {cmd:(}{it:varlist}{cmd:)} is omitted then all relevant variables are
imputed with the {cmd:match} option of {cmd:uvis}. The default, if
{cmd:match()} is not specified, is to draw from the posterior
predictive distribution of each variable requiring imputation.

{phang}
{opt passive(passivelist)} allows the use of "passive" imputation
of variables that depend on other variables, some of which are imputed.
The syntax of {it:passivelist} is {it:varname}{cmd::}{it:exp}
[{cmd:\}{it:varname}{cmd::}{it:exp} ...]. Notice the requirement to use "\" as
a separator between items in {it:passivelist}, rather than the usual comma; the
reason is that a comma may be a valid part of an expression.  The option is
easily explained by example. Suppose that {cmd:x1} is a categorical variable
with 3 levels and that two dummy variables {cmd:x1a}, {cmd:x1b} have been
created by the commands

{pin}
     {cmd:. generate byte x1a=(x1==2)}{break}
     {cmd:. generate byte x1b=(x1==3)}

{pin}
Now suppose that {cmd:x1} is to be imputed by the {cmd:mlogit} command and is
to be treated as the two dummy variables {cmd:x1a} and {cmd:x1b} when
predicting other variables.  Use of {cmd:mlogit} is achieved by the option
{cmd:cmd(x1:mlogit)}.  When {cmd:x1} is imputed, we want {cmd:x1a} and
{cmd:x1b} to be updated with new values which depend on the imputed values of
{cmd:x1}.  This may be achieved by specifying
{cmd:passive(x1a:x1==2 \ x1b:x1==3)}.  It is necessary also to remove {cmd:x1}
from the list of predictors when variables other than {cmd:x1} are being
imputed, and this is done by using the {cmd:substitute()} option; in the
present example, you would specify {cmd:substitute(x1:x1a x1b)}.

{pin}
Although in this example {cmd:x1a} will take the (possibly
unintended) value of 0 when {cmd:x1} is missing, {cmd:ice} is careful to
ensure that {cmd:x1a} (and {cmd:x1b}) inherit the missingness of {cmd:x1} and
are passively imputed following active imputation of missing values of
{cmd:x1}. If this were not done, incorrect results could occur. The
responsibility of the user is to create {cmd:x1a} and {cmd:x1b} before running
{cmd:ice} such that their missing values are identical to those of {cmd:x1}.

{pin}
A second example is multiplicative interactions between variables, for
example, between {cmd:x1} and {cmd:x2} (e.g., {cmd:x12}={cmd:x1}*{cmd:x2});
this could be entered as {cmd:passive(x12:x1*x2)}. It would cause the
interaction term {cmd:x12} to be omitted when either {cmd:x1} or {cmd:x2} was
being imputed, since it would make no sense to impute {cmd:x1} from its
interaction with {cmd:x2}.  {cmd:substitute()} is not needed here.

{pin}
It should be stressed that variables to be imputed passively must already
exist and must be included in {it:mainvarlist}; otherwise, they will not be
recognized.
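
{pin}
Putting the first example together, a call might look like the following
sketch. The names {cmd:y}, {cmd:x2}, and {cmd:imputed} and the choice of
{cmd:m(5)} are illustrative assumptions; the {cmd:if} qualifier is one way to
make the dummy variables missing wherever {cmd:x1} is missing:

{pin}
     {cmd:. generate byte x1a=(x1==2) if !missing(x1)}{break}
     {cmd:. generate byte x1b=(x1==3) if !missing(x1)}{break}
     {cmd:. ice y x1 x1a x1b x2, saving(imputed) m(5) cmd(x1:mlogit) passive(x1a:x1==2 \ x1b:x1==3) substitute(x1:x1a x1b)}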

{phang}
{opt substitute(sublist)} is typically used with the 
{cmd:passive()} option to represent multilevel categorical variables
as dummy variables in models for predicting other variables. See
{cmd:passive()} for more details. The syntax of {it:sublist} is
{it:varname}{cmd::}{it:dummyvarlist}
[{cmd:,}{it:varname}{cmd::}{it:dummyvarlist} ...], where {it:varname} is the
name of a variable to be substituted and {it:dummyvarlist} is the list of
dummy variables representing it.

{pin}
Note, however, the following important convenience feature:
{cmd:substitute()} may be used without corresponding expressions
in {cmd:passive()} to recreate dummy variables automatically.
If the values of variables in {it:dummyvarlist} are NOT defined
through expressions involving {it:varname} in the {cmd:passive()} option,
the variables in {it:dummyvarlist} are calculated according to the
actual range of values of {it:varname}. For example, suppose that the options
{cmd:passive(x1a:x1==2 \ x1b:x1==3)}
and {cmd:substitute(x1:x1a x1b)} were specified. Provided that all
the nonmissing values of {cmd:x1} were 2 when {cmd:x1a}==1 and all
the nonmissing values of {cmd:x1} were 3 when {cmd:x1b}==1, then
{cmd:passive(x1a:x1==2 \ x1b:x1==3)} is implied by {cmd:substitute(x1:x1a x1b)}
and can be omitted. The rule applied by {cmd:substitute(x:dummy1 [dummy2...])}
for defining dummy variables dummy1, dummy2, ..., is as follows:

{phang2}
1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0.

{phang2}
2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise.

{phang2}
2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise.

{phang2}
3. Repeat steps 1 and 2a,b for dummy2, dummy3, ..., as necessary.

{pin}
With many such categorical variables this feature can save a lot of typing. 
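
{pin}
As a worked illustration with hypothetical values, suppose that {cmd:x} takes
the values 1 to 5. If {cmd:dummy1} > 0 only where {cmd:x} = 2, then
xmin = xmax = 2, so rule 2b applies and {cmd:dummy1} is recreated as 1 if
{cmd:x} = 2 and 0 otherwise. If {cmd:dummy2} > 0 only where {cmd:x} is 4 or 5,
then xmin = 4 < xmax = 5, so rule 2a applies and {cmd:dummy2} is recreated as
1 if 4 <= {cmd:x} <= 5 and 0 otherwise.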

{phang}
{cmd:boot}[{cmd:(}{it:varlist}{cmd:)}] specifies that each member of
{it:varlist}, a subset of {it:mainvarlist}, be imputed with the {cmd:boot}
option of {cmd:uvis} activated. If {cmd:(}{it:varlist}{cmd:)} is omitted,
all members of {it:mainvarlist} with missing observations are imputed using
the {cmd:boot} option of {cmd:uvis}.

{phang}
{opt cc(varlist)} prevents imputation of missing data in
{it:mainvarlist} where any member of {it:varlist} has a missing
value. "cc" signifies "complete case". Members of {it:varlist} are
used for imputation if they appear in {it:mainvarlist}, but not otherwise. Use
of this option is equivalent to entering {cmd:if}
{cmd:~missing(}{it:var1}{cmd:) &} {cmd:~missing(}{it:var2}{cmd:)} ..., where
{it:var1}, {it:var2}, ... denote the members of {it:varlist}.

{phang}
{opt conditional(condlist)} invokes conditional imputation. Each item of
{it:condlist} has the form
{it:conditional_var}{cmd::}{it:conditioning_var }[{hi:@}]{it:#}|{it:varname},
and items are separated by commas. Suppose that the {it:conditional_var} is
called {hi:y} and the binary {it:conditioning_var} is called {hi:z}. Then
{hi:z} is defined to be 0 if {hi:y} <= {it:#} and 1 if {hi:y} > {it:#}, and
similarly if {it:varname} is supplied instead of {it:#}. In the latter case,
cutoff values are stored in {it:varname} and may therefore vary among
observations.  Either {hi:y} or {hi:z} or both may have missing values, but
{hi:z} cannot be missing when {hi:y} is observed. The reason is that some
missing values of {hi:z} may then be deduced from observed values of {hi:y}
without the need for imputation.

{pin}
The presence of {hi:@} in the alternative syntax
{it:conditional_var}{cmd::}{it:conditioning_var }{hi:@}{it:#}|{it:varname}
modifies the scenario considerably. Now {hi:z} is imputed from other
variables in the usual way, but {hi:y} is imputed from other variables
(except {hi:z}) only for values of {hi:y} greater than {it:#} or {it:varname}.
At the same time, imputation for variables other than {hi:y} and {hi:z} can
include both {hi:y} and {hi:z} as predictors. This scenario is
appropriate when {hi:y} has a substantial proportion of 
observations which take the same value, typically 0. Then the observations
with {hi:y} <= {it:#} (equivalently, with {hi:z} = 0) are regarded
as a separate subpopulation, with a possibly different relationship holding
for values of {hi:y} > {it:#}. An example is when {hi:y} is the amount of
alcohol consumed per week, where {hi:z} would be 1 for drinkers and
0 for teetotallers. Missing values of {hi:z} would be imputed by
logistic regression on the other variables except for {hi:y},
and missing values of {hi:y} by regression on other variables in the
drinking subset {hi:z} = 1 only. It is guaranteed that imputed values
of {hi:y} for {hi:z} = 0 will equal {it:#} and for {hi:z} = 1 will
be greater than {it:#}.

{pin}
See {hi:Remarks} for further information on conditional imputation.

{phang}
{opt cycles(#)} determines the number of cycles of regression switching to be
carried out. The default is {cmd:cycles(10)}.

{phang}
{opt dropmissing} is a feature designed to save memory when using
the file of imputed data created by {cmd:ice}. It omits from {it:filename} all
observations that are not in the estimation sample, that is, observations that
(i) are excluded by {cmd:if}, {cmd:in}, or a nonpositive
weight, or
(ii) have missing values of all variables in {it:mainvarlist}.
This option provides a "clean" analysis file of imputations, with
no missing values. The observations not in the
estimation sample are also omitted from the copy of
the original data stored as imputation 0 in {it:filename}.

{phang}
{opt genmiss(string)} creates an indicator variable for the
missingness of data in any variable in {it:mainvarlist} for which at least one
value has been imputed. The indicator variable is set to missing for
observations excluded by {cmd:if}, {cmd:in}, etc.  The indicator variable for
{it:xvar} is named {it:string}{it:xvar}.
This option is left for backward compatibility, but now that the
original data are stored in the output file, it is no longer really
needed. The information on missingness is implicit in the original
data stored as "imputation 0".

{phang}
{opt id(newvarname)} creates a variable called {it:newvarname} containing
the original sort order of the data. The default is {cmd:id(_mi)}.

{phang}
{opt interval(intlist)} imputes interval-censored variables.
An interval-censored value is known to lie in an interval [a, b],
where a and b are finite and a <= b; in (-infinity, b]; or in [a, infinity).
When either terminal is infinite, we have left or right censoring,
respectively. 
{it:intlist} has the syntax {it:varname}{hi::}{it:llvar ulvar}
[{hi:,} {it:varname}{hi::}{it:llvar ulvar} ...],
where each {it:varname} is an interval-censored variable, each
{it:llvar} contains the lower bound (a) for {it:varname}, and each
{it:ulvar} contains the upper bound (b) for {it:varname} (or a missing
value to represent plus or minus infinity).
The supplied values of {it:varname} are irrelevant since they will be
replaced anyway; it is only required that {it:varname} exist. Observations
with {it:llvar} missing and {it:ulvar} present are left-censored
for {it:varname}. Observations with {it:llvar} present and {it:ulvar}
missing are right-censored for {it:varname}. Observations with
{it:llvar} = {it:ulvar} are complete, and no imputation is done for
them. Observations with both {it:llvar} and {it:ulvar} missing
are imputed assuming an uncensored normal distribution. See {hi:Remarks}
for further information.

{phang}
{opt noconstant} suppresses the regression constant in all regressions.

{phang}
{opt nopp} suppresses treatment of the perfect prediction bug; see
{help ice##nopp:Avoiding the perfect prediction bug}.

{phang}
{opt noshoweq} suppresses the presentation of the prediction equations.

{phang}
{opt nowarning} suppresses the warning messages.

{phang}
{opt on(varlist)} changes the operation of {cmd:ice} in a major
way.  With this option, {cmd:uvis} imputes each member of {it:mainvarlist}
univariately on {it:varlist}. This provides a convenient way of producing
multiple imputations when imputation for each variable in {it:mainvarlist} is
to be done univariately on a set of complete predictors.

{phang}
{opt orderasis} enters the variables in {it:mainvarlist} into the MICE
algorithm in the order given. The default is to order them according
to the number of missing values: the variable with least missingness
gets imputed first, and so on.

{phang}
{opt seed(#)} sets the random-number seed to {it:#}.
To reproduce a set of imputations, the same random-number seed should be used. 
The default is {cmd:seed(0)}, meaning no seed is set by the program.

{phang}
{opt trace(trace_filename)} monitors the convergence of the imputation
algorithm. For each original variable with missing values, the mean of the
imputed values is stored as a variable in {it:trace_filename}, together
with the cycle number at which that
mean was calculated. The results are stored only for the final imputation.
For diagnostic purposes, it is sensible to run {cmd:trace()}
with {cmd:m(1)} and many cycles, such as {cmd:cycles(100)}.
When the run is complete, it is helpful to load {it:trace_filename}
into memory and plot the mean for each imputed
variable against the cycle number. If necessary, smoothing may be applied
to clarify any apparent pattern. Convergence is judged to have occurred
when the pattern of the imputed means is random.  The number of cycles needed
for convergence is usually obvious from the appearance of the plot.
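
{pin}
A diagnostic run along these lines might be sketched as follows. The variable
and file names are hypothetical, and it is assumed here that
{it:trace_filename} is written as an ordinary Stata dataset; {cmd:describe}
is used to find the names of the stored variables before plotting each mean
against the cycle number:

{pin}
     {cmd:. ice y x1 x2 x3, saving(trial, replace) m(1) cycles(100) trace(tracefile)}{break}
     {cmd:. use tracefile, clear}{break}
     {cmd:. describe}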


{title:Options for uvis}

{phang}
{opt gen(newvarname)} is required and creates a variable containing imputations.
{it:newvarname} contains original (nonmissing) and imputed (originally missing)
values of {it:yvar}.

{phang}
{opt boot} invokes a bootstrap method for creating imputed values (see
{hi:Remarks}).

{phang}
{opt match} creates imputations by prediction matching. The default is to
draw imputations at random from the posterior distribution of the
missing values of {it:yvar}, conditional on the observed values and the members
of {it:xvars}. See {hi:Remarks} for further details.

{phang}
{opt noconstant} suppresses the regression constant in all regressions.

{phang}
{opt nopp} suppresses treatment of the perfect prediction bug; see
{help ice##nopp:Avoiding the perfect prediction bug}.

{phang}
{opt replace} permits {it:newvarname} (see {cmd:gen(}{it:newvarname}{cmd:)})
to be overwritten with new data.

{phang}
{opt seed(#)} sets the random-number seed to {it:#}.
See {hi:Remarks} for comments on how to ensure reproducible imputations
by using the {cmd:seed()} option.
The default is {cmd:seed(0)}, meaning no seed is set by the program.
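
{pstd}
The following sketch, with hypothetical variables {cmd:y} (continuous),
{cmd:z} (binary), {cmd:x1}, and {cmd:x2}, shows the three ways of creating
imputations: the default draw, prediction matching, and the bootstrap. The
seed value is arbitrary:

{phang2}{cmd:. uvis regress y x1 x2, gen(yimp) seed(101)}{p_end}
{phang2}{cmd:. uvis regress y x1 x2, gen(yimp) match replace}{p_end}
{phang2}{cmd:. uvis logit z x1 x2, gen(zimp) boot}{p_end}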


{title:Remarks}

{pstd}
{cmd:uvis} imputes {it:yvar} from {it:xvars} according to the following
algorithm (see van Buuren et al. (1999, sec. 3.2) for further technical
details):

{phang2}
1. Estimate the vector of coefficients (beta) and the residual variance
by regressing the nonmissing values of {it:yvar} on the current "completed"
version of {it:xvars}. Predict the fitted values {it:etaobs} at the
nonmissing observations of {it:yvar}.

{phang2}
2. Draw at random a value (sigma_star) from the posterior distribution of the
residual standard deviation.

{phang2}
3. Draw at random a value (beta_star) from the posterior distribution of beta,
allowing, through sigma_star, for uncertainty in beta.

{phang2}
4. Use beta_star to predict the fitted values {it:etamis}
at the missing observations of {it:yvar}.

{phang2}
5. The imputed values are predicted directly from beta_star, sigma_star, and
the covariates. When imputation is by linear regression ({cmd:regress}
command), this step assumes that {it:yvar} is normally distributed, given the
covariates.  For other types of imputation, samples are drawn from the
appropriate distribution.

{pstd}
With the {cmd:match} option, step 5 is replaced by the following.
For each missing observation of {it:yvar} with prediction {it:etamis},
find the nonmissing observation of {it:yvar} whose prediction
({it:etaobs}) on observed data is closest to {it:etamis}. This closest
nonmissing observation is used to impute the missing value of {it:yvar}.

{pstd}
The default draw method is not robust to departures from normality and
may produce implausible imputations. For example, if the original distribution
is skew and positive-valued, the imputed distribution will not necessarily
have the appropriate amount of skewness nor will all the imputed values
necessarily be positive. Log transformation of positive variables may greatly
improve the appropriateness of the imputations.

{pstd}
The alternative {cmd:match} method is recommended only for continuous variables
when the normality assumption is clearly untenable, even approximately.
It is not necessary, nor is it recommended, for binary, ordered categorical, or
nominal variables. {cmd:match} may work well when the distribution of a
continuous variable is nonnormal, but it may sometimes result in biased
imputations.

{pstd}
With the {cmd:boot} option, steps 2-4 are replaced by a bootstrap estimation of
beta_star; beta_star is estimated by regressing {it:yvar} on {it:xvars} after
taking a bootstrap sample of the nonmissing observations. This has the
advantage of robustness because the distribution of beta is no longer assumed
to be multivariate normal.

{pstd}
{cmd:uvis} will not impute observations for which a value of a variable in
{it:xvars} is missing. However, all original (missing or nonmissing)
observations of {it:yvar} will be copied into {it:newvarname} in such cases.
This is a change from the first release of {cmd:uvis} (with {cmd:mvis}).
Previously, {it:newvarname} would be set to missing whenever a value of a
variable in {it:xvars} was missing, irrespective of the value of {it:yvar}.

{pstd}
Missing data for ordered (or unordered) categorical covariates should
be imputed by using the {cmd:ologit} (or {cmd:mlogit}) command. In these cases,
prediction matching is done on the scale of the mean absolute difference
in the predicted class probabilities, preceded by logit transformation.

{pstd}
{cmd:ice} carries out multivariate imputation in {it:mainvarlist} using
regression switching (van Buuren et al. 1999) as follows:

{phang2}
1. Ignore any observations for which {it:mainvarlist} has only missing values,
   or if the {cmd:cc(}{it:varlist}{cmd:)} option has been specified, for
   which any member of {it:varlist} has a missing value.

{phang2}
2. For each variable in {it:mainvarlist} with any missing data, randomly order
   that variable and replicate the observed values across the missing cases.
   This step initializes the iterative procedure by ensuring that no relevant
   values are missing.

{phang2}
3. For each variable in {it:mainvarlist} in turn, impute missing values by
   applying {cmd:uvis} with the remaining variables as covariates.

{phang2}
4. Repeat step 3 {cmd:cycles()} times, replacing the imputed values with updated
   values at the end of each cycle.

{pstd}
One imputation sample is created for each variable with any relevant
missing values.

{pstd}
Van Buuren recommends {cmd:cycles(20)} but goes on to say that 10, or even 5,
iterations are probably sufficient. We have chosen a default of 10.

{pstd}
Multiple imputation (MI) implies the creation and analysis of several
imputed datasets. To do this, one would run {cmd:ice} with {cmd:m()} set
to a suitable number, for example, 5. To obtain final estimates
of the parameters of interest and their standard errors, one would fit a model
in each imputation and carry out the appropriate post-MI averaging procedure on
the results from the {cmd:m()} separate imputations. A suitable estimation tool
for this purpose is {helpb mim} (if installed).
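
{pstd}
As a sketch of this workflow (the variable names, the file name
{cmd:imputed}, and the seed are hypothetical, and {cmd:mim} is assumed to be
installed and used as a prefix command):

{phang2}{cmd:. ice y x1 x2 x3, saving(imputed) m(5) seed(101)}{p_end}
{phang2}{cmd:. use imputed, clear}{p_end}
{phang2}{cmd:. mim: regress y x1 x2 x3}{p_end}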

{pstd}
{hi:{ul:Handling the outcome variable}}

{pstd}
To avoid bias, the outcome variable must always be included in the
list of variables to be used for imputation. In survival analysis,
in particular, it is essential to include the censoring indicator 
as well as the survival time.
Van Buuren et al. (1999) recommend a log transformation of the survival
time, although the "correct" functional form is an open research question.
They also give a detailed discussion of the different types
of covariate that can be included in the imputation model and discuss the
important issue of how to deal with variables which are missing completely at
random (MCAR), missing at random (MAR), and not missing at random (NMAR).

{pstd}
{hi:{ul:Handling categorical variables}}

{pstd}
Binary variables present no difficulty. By default, in the MICE
procedure when such a variable is the response, it is
predicted from other variables by using logistic regression;
when it is a covariate, it is modeled in the only way possible, 
effectively as a dummy variable. Categorical variables with 3 or
more levels may in principle be treated in different ways.
By default in {cmd:ice}, variables with 3-5 levels are modeled
by multinomial logistic regression ({cmd:mlogit} command) when
they are the response and as a single linear term when they are a covariate.
The same behavior occurs with the ordered logistic model ({cmd:ologit}
command), requested with the {cmd:cmd()} option. The use of dummy variables
instead of a linear term may be imposed as described under the
{cmd:passive()} option. The requisite dummy variables must be created before
{cmd:ice} is invoked. Variables with 6 or more levels are treated as ordered
and continuous, but again different choices may be imposed by use of the
{cmd:cmd()}, {cmd:passive()}, and {cmd:substitute()} options.

{pstd}
You should be aware that unless the dataset is large, use of the {cmd:mlogit}
command may produce unstable estimates if the number of levels is too large
and may compromise the accuracy of the imputations. It is hard to predict when
this will occur.

{pstd}
Due to a peculiarity of the way the {cmd:mlogit} command works, variables with
score labels cause problems for {cmd:ice} and {cmd:uvis} when missing data are
imputed using {cmd:mlogit}.  Score labels for such variables are removed in the
file of imputed data.

{pstd}
{hi:{ul:Conditional imputation}}

{pstd}
The first type of conditional imputation ({cmd:conditional()} option)
is implemented by a type of rejection sampling. Unfortunately, 
rejection sampling can be slow and inefficient, since many candidate
imputed values may be rejected. The process may therefore take a
noticeable amount of time. {cmd:ice} proceeds as follows when
imputing a variable y conditional on a binary variable z according to
a cutoff value {it:#}:

{p 9 12 2}
1. Initialize y and z by random sampling from the observed distribution
of each variable.

{p 9 12 2}
2. Impute values of z missing in the estimation sample
according to z's prediction equation and other options.

{p 9 12 2}
3. Impute values of y missing in the estimation sample
according to y's prediction equation and other options.
Accept only imputed values v for y that satisfy v <= {it:#} when z = 0 and
v > {it:#} when z = 1.

{p 9 12 2}
4. Repeat step 3 until no missing values of y remain
in the estimation sample. This may take many calls to {cmd:uvis}.

{p 9 12 2}
5. Repeat steps 2-4 {cmd:cycles()} times.

{pstd}
Of course, this subprocedure is part of the process of imputing
other variables with missing values. y may be binary,
ordinal, or continuous. At present, only dichotomizations of
y are supported, although different cutoff values may be
used for different observations.

{pstd}
The second type of conditional imputation for imputing according to a
subpopulation defined by {hi:y} <= {it:#} or {hi:y} > {it:#} does not involve
rejection sampling and is no slower than any other imputation procedure
performed by {cmd:ice}.

{pstd}
{hi:{ul:Interval censoring}}

{pstd}
Values of a variable y that are interval censored are imputed under the
assumption that y is normally distributed with unknown mean and variance.
The method, which is fast and efficient, is essentially as described
for right-censored variables in section 3.3 of Royston (2001).
A minor extension to allow left or interval censoring is employed.
For example, if A < y < B and A and B are both finite, the imputed
value for y will follow a truncated normal distribution with bounds
A and B, variance parameter estimated from the data and mean given by the
linear predictor for the imputation model for y. Stata's {cmd:intreg} command
is used to estimate the mean and variance of y. When A and B are both
missing (infinite), imputation of y simply assumes the normal
distribution just mentioned, but without bounds.

{pstd}
If you wish to impose range limits on the imputed values, the lower and upper
bound variables may be set accordingly. For example, to impute right-censored
(e.g., survival) data, you would set {it:llvar} equal to the observed time
for every observation, whether censored or not, and set {it:ulvar} equal to
the event time for uncensored observations and to missing for censored ones.
This would cause the right-censored values to be imputed above the censoring
time with no upper limit.
If you wanted to bound the imputed values above, say, by 10,
you would specify {it:ulvar} to be 10 (rather than missing) for all
the censored observations.
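
{pstd}
As a minimal sketch of the setup just described, suppose (hypothetically) that
{hi:t} holds the observed time and {hi:d} is 1 for an event and 0 for a
censored observation. These names, along with {hi:ll}, {hi:ul}, and the
predictors, are illustrative assumptions only. The bounds could be built and
used as follows:

{phang}
{cmd:. generate ll = t}

{phang}
{cmd:. generate ul = t if d == 1}

{phang}
{cmd:. uvis intreg ll ul x1 x2 x3, gen(y)}

{pstd}
To cap the imputed values at 10 instead, one might set the upper bound for
the censored observations before calling {cmd:uvis}, assuming that every
censoring time is below 10:

{phang}
{cmd:. replace ul = 10 if d == 0}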

{pstd}
{marker nopp}{...}
{hi:{ul:Avoiding the perfect prediction bug}}

{pstd}
Perfect prediction may arise in {cmd:logistic}, {cmd:ologit}, or {cmd:mlogit}
regression models when a (usually categorical) predictor variable perfectly
predicts success or failure in the outcome variable.  In {cmd:ice}, perfect
prediction may occur without the user's knowledge because many 
regression models are run silently. Perfect prediction may lead to entirely
inappropriate imputations. To avoid this, {cmd:uvis} checks for perfect
prediction; if it is detected, {cmd:uvis} temporarily augments the data with a
few extra observations with low weight, in such a way as to remove
the perfect prediction.  A message is displayed noting the variable that has
the perfect prediction issue and that the problem has been resolved.  Such
treatment of the perfect prediction bug may be switched off, if desired, by
using the {opt nopp} option.
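
{pstd}
One informal way to see whether a single categorical predictor is likely to
trigger this treatment is to cross-tabulate it with a binary outcome and look
for an empty cell. The variable names here are hypothetical:

{phang}
{cmd:. tabulate x1 y}

{pstd}
A level of {hi:x1} at which every observation has the same value of {hi:y}
indicates perfect prediction in a model of {hi:y} on {hi:x1} alone. This
check does not cover perfect prediction arising from combinations of
predictors.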

{pstd}
{hi:{ul:Errors and diagnostics}}

{pstd}
{cmd:ice} may occasionally detect an anomaly when running
{cmd:uvis} with a particular variable as response and a particular
regression command. {cmd:ice} will then stop and report the {cmd:uvis}
command it was running and the error number returned. Often the problem
lies in a regression of a binary or categorical variable where the
estimation procedure fails to converge; this is usually caused by
sparse cell occupancy of the response variable. If you obtain this
error, you should either omit the offending variable from the
imputation or combine a sparse category with another category.

{pstd}
Another possibility is that, again due to a defect in a particular
regression command in the chained equations structure, the number
of values imputed for a particular variable is less than expected.
This is a serious error and again may arise from estimation problems
involving a binary or categorical variable. In this situation, {cmd:ice}
saves in the working directory
a snapshot of the data it was using in the attempted estimation
to a file called {hi:_ice_dump.dta}, while also reporting the {cmd:uvis}
command it was executing. You can then investigate what may have gone
wrong with the command by loading the data in {hi:_ice_dump.dta} and
rerunning the offending regression command.
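
{pstd}
A session along the following lines might be used to investigate. The
variable names and the {cmd:mlogit} model are hypothetical stand-ins for
whatever {cmd:ice} reported, and {cmd:use} with {cmd:clear} discards the data
in memory, so save your work first:

{phang}
{cmd:. use _ice_dump, clear}

{phang}
{cmd:. tabulate y}

{phang}
{cmd:. mlogit y x1 x2 x3}

{pstd}
The tabulation helps to spot sparsely occupied categories of the response
before the regression is rerun.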

{pstd}
{hi:{ul:Further notes}}

{pstd}
{cmd:ice} saves all the variables in the current data to the output
{it:filename}, whether or not they are involved in the imputation
procedure. This can make the resulting file large. It may
therefore be sensible to drop variables not subsequently needed
for modeling before running {cmd:ice}.
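
{pstd}
For instance, assuming the later analysis needs only the (hypothetical)
variables {hi:pid}, {hi:y}, {hi:x1}, {hi:x2}, and {hi:x3}, the data could be
trimmed before imputing:

{phang}
{cmd:. keep pid y x1 x2 x3}

{phang}
{cmd:. ice y x1 x2 x3, saving(imputed) m(5)}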

{pstd}
{cmd:ice} determines the order of imputing variables in the round
of chained equations according to the amount of missing data.
Variables with the least missingness are imputed first. 
If {cmd:ice} is run twice using identical variables and the same
random-number seed, but with the variables in {it:mainvarlist}
in a different order, slightly different imputations will be
generated. The differences will be purely random and will not produce
bias in subsequent parameter estimates. If the {opt boot()} option
is applied to all variables, the order of variables
no longer affects the results.

{pstd}
An important application of MI is to investigate possible models, for example,
prognostic models, in which selection of influential variables is required
(Clark and Altman 2003). For example, the stability of the final model across
the imputation samples is of interest. This area of inquiry is in its infancy.

{pstd}
See also Van Buuren's web site http://www.multiple-imputation.com for further
information and software sources.


{title:Examples}

{phang}
{cmd:. uvis regress y x1 x2 x3, gen(ym)}

{phang}
{cmd:. uvis intreg ll ul x1 x2 x3, gen(y)}

{phang}
{cmd:. ice x1 x2 x3, saving(imputed) m(5)}

{phang}
{cmd:. ice x1 x2 x3, saving(imputed) m(5) cycles(20) cc(x4 x5)}

{phang}
{cmd:. ice x1-x5, saving(imputed) m(10) boot match(x1 x2 x3) cmd(x1 x2:mlogit, x3:ologit) id(pid) seed(101) genmiss(m_)}

{phang}
{cmd:. ice x1 x1a x1b x2 x3 x23, saving(imputed) m(5) cmd(x1:ologit) passive(x1a:x1==2 \ x1b:x1==3 \ x23:x2*x3) substitute(x1:x1a x1b)}

{phang}
{cmd:. ice y1 y2 y3 x1 x2 x3 x4, saving(imputed) m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2) match(y3)}

{phang}
{cmd:. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) match(x2) dropmissing}

{phang}
{cmd:. ice x1 ll2 ul2 x2 ll3 ul3 x3, saving(imputed) m(5) interval(x2:ll2 ul2, x3:ll3 ul3)}


{title:Author}

{pstd}
Patrick Royston, MRC Clinical Trials Unit, London.{break}
pr@ctu.mrc.ac.uk


{title:References}

{phang}
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of
    missing blood pressure covariates in survival analysis.
    {it:Statistics in Medicine} 18: 681-694. 

{phang}
Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing
multiple imputed datasets. {it:Stata Journal} 3: 226-244.

{phang}
Clark, T. G., and D. G. Altman. 2003. Developing a prognostic model
in the presence of missing data: an ovarian cancer case-study.
{it:Journal of Clinical Epidemiology} 56: 28-37.

{phang}
Royston, P. 2001. The lognormal distribution as a model for survival
time in cancer, with an emphasis on prognostic factors.
{it:Statistica Neerlandica} 55: 89-104.

{phang}
Royston, P. 2004. Multiple imputation of missing values.
{it:Stata Journal} 4: 227-241.

{phang}
Royston, P. 2005a. Multiple imputation of missing values: update.
{it:Stata Journal} 5: 188-201.

{phang}
Royston, P. 2005b. Multiple imputation of missing values: update of ice.
{it:Stata Journal} 5: 527-536.


{title:Acknowledgments}

{pstd}
Ian White has made substantial contributions to the understanding and practical
use of multiple imputation and to the programming of {cmd:ice} and {cmd:uvis}.
Ian wrote the base of the {opt draw()} option; the idea and code for coping
with perfect prediction are essentially all his. I am extremely grateful to him
for his ongoing commitment to this project.

{pstd}
I am grateful also to Gillian Raab for pointing out certain issues with the
prediction matching approach, particularly that it is only useful with
continuous variables.  As a result, the default imputation method has been
changed from matching to drawing from the predictive distribution. Gillian also
suggested imputing the variables in reverse order of the amount of missingness,
and selecting the imputed value at random from the set determined by the
available matching predictions. Both suggestions have been implemented. 


{title:Also see}

{psee}
Online:  {helpb mim} (if installed)
{p_end}