{smcl} {* 28dec2007}{...} {hline} {hi:help ice}, {hi:help uvis}{right:(SJ7-4: st0067_3; SJ5-4: st0067_2;} {right:SJ5-2: st0067_1; SJ4-3: st0067)} {hline} {title:Multiple imputation by the MICE system of chained equations} {title:Syntax} {phang2} {cmd:ice} {it:mainvarlist} {ifin} {weight} [{cmd:,} {it:ice_major_options ice_minor_options}] {phang2} {cmd:uvis} {it:cmd} {{it:yvar}|{it:llvar ulvar}} {it:xvars} {ifin} {weight} [{cmd:,} {it:uvis_options}] {synoptset 29 tabbed}{...} {synopthdr:ice_major_options} {synoptline} {p2coldent:* {cmdab:sav:ing(}{it:filename}[{opt , replace}]{cmd:)}}imputed and nonimputed variables are stored to {it:filename}{p_end} {synopt :{opt cm:d(cmdlist)}}defines regression command(s) to be used for imputation{p_end} {synopt :{opt dry:run}}reports the prediction equations - no imputations are done{p_end} {synopt :{opt eq(eqlist)}}defines customized prediction equations{p_end} {synopt :{opt m(#)}}defines the number of imputations{p_end} {synopt :{opt ma:tch(varlist)}}prediction matching for each member of {it:varlist}{p_end} {synopt :{opt pass:ive(passivelist)}}passive imputation{p_end} {synopt :{opt sub:stitute(sublist)}}substitutes dummy variables for multilevel categorical variables{p_end} {synoptline} {p 4 6 2}* {opt saving()} is required. 
{synopthdr:ice_minor_options} {synoptline} {synopt :{opt bo:ot(varlist)}}estimates regression coefficients for {it:varlist} in a bootstrap sample{p_end} {synopt :{opt cc(varlist)}}prevents imputation of missing data in observations in which {it:varlist} has a missing value{p_end} {synopt :{opt cond:itional(condlist)}}conditional imputation{p_end} {synopt :{opt cy:cles(#)}}determines number of cycles of regression switching{p_end} {synopt :{opt drop:missing}}omits all observations not in the estimation sample from the output{p_end} {synopt :{opt g:enmiss(string)}}creates missingness indicator variable(s){p_end} {synopt :{opt i:d(varname)}}creates {it:varname} containing the original sort order of the data{p_end} {synopt :{opt int:erval(intlist)}}imputes interval-censored variables{p_end} {synopt :{opt nocons:tant}}suppresses the regression constant{p_end} {synopt :{opt nopp}}suppresses special treatment of perfect prediction{p_end} {synopt :{opt nosh:oweq}}suppresses presentation of prediction equations{p_end} {synopt :{opt nowarn:ing}}suppresses warning messages{p_end} {synopt :{opt on(varlist)}}imputes each member of {it:mainvarlist} univariately{p_end} {synopt :{opt ord:erasis}}enters the variables in the order given{p_end} {synopt :{opt s:eed(#)}}sets random-number seed{p_end} {synopt :{opt tr:ace(trace_filename)}}monitors convergence of the imputation algorithm{p_end} {synoptline} {synopthdr:uvis_options} {synoptline} {p2coldent:* {opt g:en(newvarname)}}creates a variable containing imputations{p_end} {synopt :{opt bo:ot}}estimates regression coefficients in a bootstrap sample{p_end} {synopt :{opt ma:tch}}does prediction matching{p_end} {synopt :{opt nocons:tant}}suppresses the regression constant{p_end} {synopt :{opt nopp}}suppresses special treatment of perfect prediction{p_end} {synopt :{opt replace}}overwrites {it:newvarname} if it exists{p_end} {synopt :{opt se:ed(#)}}sets random-number seed{p_end} {synoptline} {p2colreset}{...} {p 4 6 2}* {cmd:gen()} is 
required. {pstd} where {it:cmd} (with {opt uvis}) may be {helpb intreg}, {helpb logistic}, {helpb logit}, {helpb mlogit}, {helpb ologit}, or {helpb regress}. {it:llvar} {it:ulvar} are required with {cmd:intreg}. {pstd} All weight types supported by {it:regression_cmd} are allowed; see {help weight}. {title:Description} {pstd} {cmd:ice} imputes missing values in {it:mainvarlist} by using switching regression, an iterative multivariable regression technique. The abbreviation MICE means multiple imputation by chained equations and was apparently coined by Stef van Buuren. {cmd:ice} implements MICE for Stata. Sets of imputed and nonimputed variables are stored to a new file called {it:filename}. Any number of complete imputations may be created. The original data are stored in {it:filename} as "imputation number 0" and the new variable {cmd:_mj} is set to 0 for these observations. {pstd} {cmd:uvis} (univariate imputation sampling) imputes missing values in the single variable {it:yvar} based on multiple regression on {it:xvars}. {cmd:uvis} is called repeatedly by {cmd:ice} in a regression switching mode to perform multivariate imputation. {pstd} The missing observations are assumed to be missing at random (MAR) or missing completely at random (MCAR), according to the jargon. See, for example, van Buuren et al. (1999) for an explanation of these concepts. {pstd} {cmd:ice} and {cmd:uvis} require Stata 8 or later. There have been incompatibility issues with Stata 7 and earlier. {title:Options for ice} {phang} {cmd:saving(}{it:filename} [{cmd:,replace}]{cmd:)} saves the imputation to {it:filename}. {opt replace} allows {it:filename} to be overwritten with new data. {phang} {opt cmd(cmdlist)} defines the regression commands to be used for each variable in {it:mainvarlist} when it becomes the dependent variable in the switching regression procedure used by {cmd:uvis} (see {hi:Remarks}). 
The first item in {it:cmdlist} may be a command, such as {cmd:regress}, or may have the syntax {it:varlist}{cmd::}{it:cmd}, specifying that command {it:cmd} applies to all the variables in {it:varlist}. Subsequent items in {it:cmdlist} must follow the latter syntax, and each item should be followed by a comma. {pin} The default {it:cmd} for a variable is {cmd:logit} when there are two distinct values, {cmd:mlogit} when there are 3-5, and {cmd:regress} otherwise. {phang2} Example: {cmd:cmd(regress)} specifies that all variables are to be imputed by {cmd:regress}, overriding the defaults. {phang2} Example: {cmd:cmd(x1 x2:logit, x3:regress)} specifies that {cmd:x1} and {cmd:x2} are to be imputed by {cmd:logit}, {cmd:x3} by {cmd:regress} and all others by their default choices. {phang} {opt dryrun} does a "dry run"; that is, {cmd:ice} reports the prediction equations it has constructed from the various inputs. No imputation is done, and no files are created. It is not mandatory to specify an output file with {cmd:using} for a dry run. Sometimes the prediction equation setup needs to be carefully checked before running what may be a lengthy imputation process. {phang} {opt eq(eqlist)} allows one to define customized prediction equations for any subset of variables in {it:mainvarlist}. The option, particularly when used with {cmd:passive()}, allows great flexibility in the possible imputation schemes. The syntax of {it:eqlist} is {it:varname1}{cmd::}{it:varlist1} [{cmd:,}{it:varname2}{cmd::}{it:varlist2} ...], where each {it:varname#} (or {it:varlist#}) is a member (or subset) of {it:mainvarlist}. It is your responsibility to ensure that each equation is sensible. {cmd:ice} places no restrictions except to check that all variables mentioned are indeed in {it:mainvarlist} and that an equation is not defined for a variable specified to be passively imputed (see the {cmd:passive()} option). 
{cmd:eq()} takes precedence over all default definitions and assumptions about the way a given variable in {it:mainvarlist} will be imputed. The default, if the {cmd:passive()} and {cmd:substitute()} options are not invoked, is that each variable in {it:mainvarlist} with any missing data is imputed from all the other variables in {it:mainvarlist}. {phang} {opt m(#)} sets the number of imputations required (minimum 1, no upper limit). The default is {cmd:m(1)}. {phang} {cmd:match}[{cmd:(}{it:varlist}{cmd:)}] specifies that each member of {it:varlist} be imputed with the {cmd:match} option of {cmd:uvis}. This provides prediction matching for each member of {it:varlist}. If {cmd:(}{it:varlist}{cmd:)} is omitted then all relevant variables are imputed with the {cmd:match} option of {cmd:uvis}. The default, if {cmd:match()} is not specified, is to draw from the posterior predictive distribution of each variable requiring imputation. {phang} {opt passive(passivelist)} allows the use of "passive" imputation of variables that depend on other variables, some of which are imputed. The syntax of {it:passivelist} is {it:varname}{cmd::}{it:exp} [{cmd:\}{it:varname}{cmd::}{it:exp} ...]. Notice the requirement to use "\" as a separator between items in {it:passivelist}, rather than the usual comma; the reason is that a comma may be a valid part of an expression. The option is easily explained by example. Suppose that {cmd:x1} is a categorical variable with 3 levels and that two dummy variables {cmd:x1a}, {cmd:x1b} have been created by the commands {pin} {cmd:. generate byte x1a=(x1==2)}{break} {cmd:. generate byte x1b=(x1==3)} {pin} Now suppose that {cmd:x1} is to be imputed by the {cmd:mlogit} command and is to be treated as the two dummy variables {cmd:x1a} and {cmd:x1b} when predicting other variables. Use of {cmd:mlogit} is achieved by the option {cmd:cmd(x1:mlogit)}. 
When {cmd:x1} is imputed, we want {cmd:x1a} and {cmd:x1b} to be updated with new values which depend on the imputed values of {cmd:x1}. This may be achieved by specifying {cmd:passive(x1a:x1==2 \ x1b:x1==3)}. It is necessary also to remove {cmd:x1} from the list of predictors when variables other than {cmd:x1} are being imputed, and this is done by using the {cmd:substitute()} option; in the present example, you would specify {cmd:substitute(x1:x1a x1b)}. {pin} Although in this example {cmd:x1a} will take the (possibly unintended) value of 0 when {cmd:x1} is missing, {cmd:ice} is careful to ensure that {cmd:x1a} (and {cmd:x1b}) inherit the missingness of {cmd:x1} and are passively imputed following active imputation of missing values of {cmd:x1}. If this were not done, incorrect results could occur. The responsibility of the user is to create {cmd:x1a} and {cmd:x1b} before running {cmd:ice} such that their missing values are identical to those of {cmd:x1}. {pin} A second example is multiplicative interactions between variables, for example, between {cmd:x1} and {cmd:x2} (e.g., {cmd:x12}={cmd:x1}*{cmd:x2}); this could be entered as {cmd:passive(x12:x1*x2)}. It would cause the interaction term {cmd:x12} to be omitted when either {cmd:x1} or {cmd:x2} was being imputed, since it would make no sense to impute {cmd:x1} from its interaction with {cmd:x2}. {cmd:substitute()} is not needed here. {pin} It should be stressed that variables to be imputed passively must already exist and must be included in {it:mainvarlist}; otherwise, they will not be recognized. {phang} {opt substitute(sublist)} is typically used with the {cmd:passive()} option to represent multilevel categorical variables as dummy variables in models for predicting other variables. See {cmd:passive()} for more details. 
The syntax of {it:sublist} is {it:varname}{cmd::}{it:dummyvarlist} [{cmd:,}{it:varname}{cmd::}{it:dummyvarlist} ...], where {it:varname} is the name of a variable to be substituted and {it:dummyvarlist} is the list of dummy variables representing it. {pin} Note, however, the following important convenience feature: {cmd:substitute()} may be used without corresponding expressions in {cmd:passive()} to recreate dummy variables automatically. If the values of variables in {it:dummyvarlist} are NOT defined through expressions involving {it:varname} in the {cmd:passive()} option, the variables in {it:dummyvarlist} are calculated according to the actual range of values of {it:varname}. For example, suppose that the options {cmd:passive(x1a:x1==2 \ x1b:x1==3)} and {cmd:substitute(x1:x1a x1b)} were specified. Provided that all the nonmissing values of {cmd:x1} were 2 when {cmd:x1a}==1 and all the nonmissing values of {cmd:x1} were 3 when {cmd:x1b}==1, then {cmd:passive(x1a:x1==2 \ x1b:x1==3)} is implied by {cmd:substitute(x1:x1a x1b)} and can be omitted. The rule applied by {cmd:substitute(x:dummy1 [dummy2...])} for defining dummy variables dummy1, dummy2, ..., is as follows: {phang2} 1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0. {phang2} 2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise. {phang2} 2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise. {phang2} 3. Repeat steps 1 and 2a,b for dummy2, dummy3, ..., as necessary. {pin} With many such categorical variables this feature can save a lot of typing. {phang} {cmd:boot}[{cmd:(}{it:varlist}{cmd:)}] specifies that each member of {it:varlist}, a subset of {it:mainvarlist}, be imputed with the {cmd:boot} option of {cmd:uvis} activated. If {cmd:(}{it:varlist}{cmd:)} is omitted, all members of {it:mainvarlist} with missing observations are imputed using the {cmd:boot} option of {cmd:uvis}. 
{phang} {opt cc(varlist)} prevents imputation of missing data in {it:mainvarlist} where any member of {it:varlist} has a missing value. "cc" signifies "complete case". Members of {it:varlist} are used for imputation if they appear in {it:mainvarlist}, but not otherwise. Use of this option is equivalent to entering {cmd:if} {cmd:~missing(}{it:var1}{cmd:) &} {cmd:~missing(}{it:var2}{cmd:)} ..., where {it:var1}, {it:var2}, ... denote the members of {it:varlist}. {phang} {opt conditional(condlist)} invokes conditional imputation. Each item of {it:condlist} has the form {it:conditional_var}{cmd::}{it:conditioning_var }[{hi:@}]{it:#}|{it:varname}, and items are separated by commas. Suppose that the {it:conditional_var} is called {hi:y} and the binary {it:conditioning_var} is called {hi:z}. Then {hi:z} is defined to be 0 if {hi:y} <= {it:#} and 1 if {hi:y} > {it:#}, and similarly if {it:varname} is supplied instead of {it:#}. In the latter case, cutoff values are stored in {it:varname} and may therefore vary among observations. Either {hi:y} or {hi:z} or both may have missing values, but {hi:z} cannot be missing when {hi:y} is observed. The reason is that some missing values of {hi:z} may then be deduced from observed values of {hi:y} without the need for imputation. {pin} The presence of {hi:@} in the alternative syntax {it:conditional_var}{cmd::}{it:conditioning_var }{hi:@}{it:#}|{it:varname} modifies the scenario considerably. Now {hi:z} is imputed from other variables in the usual way, but {hi:y} is imputed from other variables (except {hi:z}) only for values of {hi:y} greater than {it:#} or {it:varname}. At the same time, imputation for variables other than {hi:y} and {hi:z} can include both {hi:y} and {hi:z} as predictors. This scenario is appropriate when {hi:y} has a substantial proportion of observations which take the same value, typically 0. 
Then the observations with {hi:y} <= {it:#} (equivalently, with {hi:z} = 0) are regarded as a separate subpopulation, with a possibly different relationship holding for values of {hi:y} > {it:#}. An example is when {hi:y} is the amount of alcohol consumed per week, where {hi:z} would be 1 for drinkers and 0 for teetotallers. Missing values of {hi:z} would be imputed by logistic regression on the other variables except for {hi:y}, and missing values of {hi:y} by regression on other variables in the drinking subset {hi:z} = 1 only. It is guaranteed that imputed values of {hi:y} for {hi:z} = 0 will equal {it:#} and for {hi:z} = 1 will be greater than {it:#}. {pin} See {hi:Remarks} for further information on conditional imputation. {phang} {opt cycles(#)} determines the number of cycles of regression switching to be carried out. The default is {cmd:cycles(10)}. {phang} {opt dropmissing} is a feature designed to save memory when using the file of imputed data created by {cmd:ice}. It omits from {it:filename} all observations which are not in the estimation sample, that is for which either (i) they are filtered out by {cmd:if} or {cmd:in}, or a nonpositive weight, or (ii) the values of all variables in {it:mainvarlist} are missing. This option provides a "clean" analysis file of imputations, with no missing values. The observations not in the estimation sample are also omitted from the original data and stored as imputation #0 in {it:filename}. {phang} {opt genmiss(string)} creates an indicator variable for the missingness of data in any variable in {it:mainvarlist} for which at least one value has been imputed. The indicator variable is set to missing for observations excluded by {cmd:if}, {cmd:in}, etc. The indicator variable for {it:xvar} is named {it:string}{it:xvar}. This option is left for backward compatibility, but now that the original data are stored in the output file, it is no longer really needed. 
The information on missingness is implicit in the original data stored as "imputation 0". {phang} {opt id(newvarname)} creates a variable called {it:newvarname} containing the original sort order of the data. The default is {cmd:id(_mi)}. {phang} {opt interval(intlist)} imputes interval-censored variables. An interval-censored value is known to lie in an interval [a, b], where a and b are finite and a <= b; in (-infinity, b]; or in [a, infinity). When either terminal is infinite, we have left or right censoring, respectively. {it:intlist} has the syntax {it:varname}{hi::}{it:llvar ulvar} [{hi:,} {it:varname}{hi::}{it:llvar ulvar} ...], where each {it:varname} is an interval-censored variable, each {it:llvar} contains the lower bound (a) for {it:varname}, and each {it:ulvar} contains the upper bound (b) for {it:varname}; a missing value in {it:llvar} or {it:ulvar} represents minus or plus infinity, respectively. The supplied values of {it:varname} are irrelevant since they will be replaced anyway; it is only required that {it:varname} exist. Observations with {it:llvar} missing and {it:ulvar} present are left-censored for {it:varname}. Observations with {it:llvar} present and {it:ulvar} missing are right-censored for {it:varname}. Observations with {it:llvar} = {it:ulvar} are complete, and no imputation is done for them. Observations with both {it:llvar} and {it:ulvar} missing are imputed assuming an uncensored normal distribution. See {hi:Remarks} for further information. {phang} {opt noconstant} suppresses the regression constant in all regressions. {phang} {opt nopp} suppresses treatment of the perfect prediction bug; see {help ice##nopp:Avoiding the perfect prediction bug}. {phang} {opt noshoweq} suppresses the presentation of the prediction equations. {phang} {opt nowarning} suppresses the warning messages. {phang} {opt on(varlist)} changes the operation of {cmd:ice} in a major way. With this option, {cmd:uvis} imputes each member of {it:mainvarlist} univariately on {it:varlist}. 
This provides a convenient way of producing multiple imputations when imputation for each variable in {it:mainvarlist} is to be done univariately on a set of complete predictors. {phang} {opt orderasis} enters the variables in {it:mainvarlist} into the MICE algorithm in the order given. The default is to order them according to the number of missing values: the variable with least missingness gets imputed first, and so on. {phang} {opt seed(#)} sets the random-number seed to {it:#}. To reproduce a set of imputations, the same random-number seed should be used. The default is {cmd:seed(0)}, meaning no seed is set by the program. {phang} {opt trace(trace_filename)} monitors the convergence of the imputation algorithm. For each original variable with missing values, the mean of the imputed values is stored as a variable in {it:trace_filename}, together with the cycle number at which that mean was calculated. The results are stored only for the final imputation. For diagnostic purposes, it is sensible to run {cmd:trace()} with {cmd:m(1)} and many cycles, such as {cmd:cycles(100)}. When the run is complete, it is helpful to load {it:trace_filename} into memory and plot the mean for each imputed variable against the cycle number. If necessary, smoothing may be applied to clarify any apparent pattern. Convergence is judged to have occurred when the pattern of the imputed means is random. The number of cycles needed for convergence is usually obvious from the appearance of the plot. {title:Options for uvis} {phang} {opt gen(newvarname)} is required and creates a variable containing imputations. {it:newvar} contains original (nonmissing) and imputed (originally missing) values of {it:yvar}. {phang} {opt boot} invokes a bootstrap method for creating imputed values (see {hi:Remarks}). {phang} {opt match} creates imputations by prediction matching. 
The default is to draw imputations at random from the posterior distribution of the missing values of {it:yvar}, conditional on the observed values and the members of {it:xvars}. See {hi:Remarks} for further details. {phang} {opt noconstant} suppresses the regression constant in all regressions. {phang} {opt nopp} suppresses treatment of the perfect prediction bug; see {help ice##nopp:Avoiding the perfect prediction bug}. {phang} {opt replace} permits {it:newvar} (see {cmd:gen(}{it:newvar}{cmd:)}) to be overwritten with new data. {phang} {opt seed(#)} sets the random-number seed to {it:#}. See {hi:Remarks} for comments on how to ensure reproducible imputations by using the {cmd:seed()} option. The default is {cmd:seed(0)}, meaning no seed is set by the program. {title:Remarks} {pstd} {cmd:uvis} imputes {it:yvar} from {it:xvars} according to the following algorithm (see van Buuren et al. (1999, sec. 3.2) for further technical details): {phang2} 1. Estimate the vector of coefficients (beta) and the residual variance by regressing the nonmissing values of {it:yvar} on the current "completed" version of {it:xvars}. Predict the fitted values {it:etaobs} at the nonmissing observations of {it:yvar}. {phang2} 2. Draw at random a value (sigma_star) from the posterior distribution of the residual standard deviation. {phang2} 3. Draw at random a value (beta_star) from the posterior distribution of beta, allowing, through sigma_star, for uncertainty in beta. {phang2} 4. Use beta_star to predict the fitted values {it:etamis} at the missing observations of {it:yvar}. {phang2} 5. The imputed values are predicted directly from beta_star, sigma_star, and the covariates. When imputation is by linear regression ({cmd:regress} command), this step assumes that {it:yvar} is normally distributed, given the covariates. For other types of imputation, samples are drawn from the appropriate distribution. {pstd} With the {cmd:match} option, step 5 is replaced by the following. 
For each missing observation of {it:yvar} with prediction {it:etamis}, find the nonmissing observation of {it:yvar} whose prediction ({it:etaobs}) on observed data is closest to {it:etamis}. This closest nonmissing observation is used to impute the missing value of {it:yvar}. {pstd} The default draw method is not robust to departures from normality and may produce implausible imputations. For example, if the original distribution is skew and positive-valued, the imputed distribution will not necessarily have the appropriate amount of skewness nor will all the imputed values necessarily be positive. Log transformation of positive variables may greatly improve the appropriateness of the imputations. {pstd} The alternative {cmd:match} method is recommended only for continuous variables when the normality assumption is clearly untenable, even approximately. It is not necessary, nor is it recommended, for binary, ordered categorical, or nominal variables. {cmd:match} may work well when the distribution of a continuous variable is nonnormal, but it may sometimes result in biased imputations. {pstd} With the {cmd:boot} option, steps 2-4 are replaced by a bootstrap estimation of beta_star; beta_star is estimated by regressing {it:yvar} on {it:xvars} after taking a bootstrap sample of the nonmissing observations. This has the advantage of robustness because the distribution of beta is no longer assumed to be multivariate normal. {pstd} {cmd:uvis} will not impute observations for which a value of a variable in {it:xvars} is missing. However, all original (missing or nonmissing) observations of {it:yvar} will be copied into {it:newvarname} in such cases. This is a change from the first release of {cmd:uvis} (with {cmd:mvis}). Previously, {it:newvarname} would be set to missing whenever a value of a variable in {it:xvars} was missing, irrespective of the value of {it:yvar}. 
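{pstd} As a rough illustration of the three methods described above, the following commands impute the same variable by the default draw, by prediction matching, and with bootstrap estimation of beta_star. The variable names {cmd:y} and {cmd:x1}-{cmd:x3}, the names of the new variables, and the seed are all hypothetical; this is a sketch of typical usage rather than a recommended analysis.

{phang2} {cmd:. uvis regress y x1 x2 x3, gen(y_draw) seed(101)}

{phang2} {cmd:. uvis regress y x1 x2 x3, gen(y_match) match seed(101)}

{phang2} {cmd:. uvis regress y x1 x2 x3, gen(y_boot) boot seed(101)}

{phang2} {cmd:. summarize y_draw y_match y_boot if missing(y)}

{pstd} The final command summarizes only the imputed values, which may help in judging whether the normality assumption behind the default method gives plausible imputations for {cmd:y}.
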
{pstd} Missing data for ordered (or unordered) categorical covariates should be imputed by using the {cmd:ologit} (or {cmd:mlogit}) command. In these cases, prediction matching is done on the scale of the mean absolute difference in the predicted class probabilities, preceded by logit transformation. {pstd} {cmd:ice} carries out multivariate imputation in {it:mainvarlist} using regression switching (van Buuren et al. 1999) as follows: {phang2} 1. Ignore any observations for which {it:mainvarlist} has only missing values, or if the {cmd:cc(}{it:varlist}{cmd:)} option has been specified, for which any member of {it:varlist} has a missing value. {phang2} 2. For each variable in {it:mainvarlist} with any missing data, randomly order that variable and replicate the observed values across the missing cases. This step initializes the iterative procedure by ensuring that no relevant values are missing. {phang2} 3. For each variable in {it:mainvarlist} in turn, impute missing values by applying {cmd:uvis} with the remaining variables as covariates. {phang2} 4. Repeat step 3 {cmd:cycles()} times, replacing the imputed values with updated values at the end of each cycle. {pstd} One imputation sample is created for each variable with any relevant missing values. {pstd} Van Buuren recommends {cmd:cycles(20)} but goes on to say that 10, or even 5, iterations are probably sufficient. We have chosen a default of 10. {pstd} Multiple imputation (MI) implies the creation and analysis of several imputed datasets. To do this, one would run {cmd:ice} with {cmd:m()} set to a suitable number, for example, 5. To obtain final estimates of the parameters of interest and their standard errors, one would fit a model in each imputation and carry out the appropriate post-MI averaging procedure on the results from the {cmd:m()} separate imputations. A suitable estimation tool for this purpose is {helpb mim} (if installed). 
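{pstd} The workflow just described might look as follows. This is a sketch only: the variable names, the file name {cmd:imputed}, and the seed are hypothetical; {cmd:saving()} is used as in the syntax diagram at the top of this help file; and {cmd:mim} is a separate user-written command that is assumed to be installed.

{phang2} {cmd:. ice y x1 x2 x3, dryrun}

{phang2} {cmd:. ice y x1 x2 x3, saving(imputed, replace) m(5) cycles(10) seed(101)}

{phang2} {cmd:. use imputed, clear}

{phang2} {cmd:. mim: regress y x1 x2 x3}

{pstd} The dry run is optional; it only displays the prediction equations so that they can be checked before the imputations are created.
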
{pstd} {hi:{ul:Handling the outcome variable}} {pstd} To avoid bias, the outcome variable must always be included in the list of variables to be used for imputation. In survival analysis, in particular, it is essential to include the censoring indicator as well as the survival time. Van Buuren et al. (1999) recommend a log transformation of the survival time, although the "correct" functional form is an open research question. Van Buuren et al. (1999) give a detailed discussion of the different types of covariate that can be included in the imputation model and discuss the important issue of how to deal with variables which are missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). {pstd} {hi:{ul:Handling categorical variables}} {pstd} Binary variables present no difficulty. By default, in the MICE procedure when such a variable is the response, it is predicted from other variables by using logistic regression; when it is a covariate, it is modeled in the only way possible, effectively as a dummy variable. Categorical variables with 3 or more levels may in principle be treated in different ways. By default in {cmd:ice}, variables with 3-5 levels are modeled by multinomial logistic regression ({cmd:mlogit} command) when they are the response and as a single linear term when they are a covariate. The same behavior occurs with the ordered logistic model ({cmd:ologit} command), requested with the {cmd:cmd()} option. The use of dummy variables instead of a linear term may be imposed as described under the {cmd:passive()} option. The requisite dummy variables must be created before {cmd:ice} is invoked. Variables with 6 or more levels are treated as ordered and continuous, but again different choices may be imposed by use of the {cmd:cmd()}, {cmd:passive()}, and {cmd:substitute()} options. 
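{pstd} For instance, for a three-level variable {cmd:x1} coded 1, 2, 3 (hypothetical names and coding), dummy variables that share the missing values of {cmd:x1} could be created and used along the following lines. The {cmd:if !missing(x1)} qualifier is what leaves the dummies missing wherever {cmd:x1} is missing; the file name {cmd:imputed} is also an assumption.

{phang2} {cmd:. generate byte x1a = (x1==2) if !missing(x1)}

{phang2} {cmd:. generate byte x1b = (x1==3) if !missing(x1)}

{phang2} {cmd:. ice x1 x1a x1b x2 x3, saving(imputed) m(5) cmd(x1:mlogit) passive(x1a:x1==2 \ x1b:x1==3) substitute(x1:x1a x1b)}
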
{pstd} You should be aware that unless the dataset is large, use of the {cmd:mlogit} command may produce unstable estimates if the number of levels is too large and may compromise the accuracy of the imputations. It is hard to predict when this will occur. {pstd} Due to a peculiarity of the way the {cmd:mlogit} command works, variables with value labels cause problems for {cmd:ice} and {cmd:uvis} when missing data are imputed using {cmd:mlogit}. Value labels for such variables are removed in the file of imputed data. {pstd} {hi:{ul:Conditional imputation}} {pstd} The first type of conditional imputation ({cmd:conditional()} option) is implemented by a type of rejection sampling. Unfortunately, rejection sampling can be slow and inefficient, since many candidate imputed values may be rejected. The process may therefore take a noticeable amount of time. {cmd:ice} proceeds as follows when imputing a variable y conditional on a binary variable z according to a cutoff value {it:#}: {p 9 12 2} 1. Initialize y and z by random sampling from the observed distribution of each variable. {p 9 12 2} 2. Impute values of z missing in the estimation sample according to z's prediction equation and other options. {p 9 12 2} 3. Impute values of y missing in the estimation sample according to y's prediction equation and other options. Accept only imputed values v for y that satisfy v <= {it:#} when z = 0 and v > {it:#} when z = 1. {p 9 12 2} 4. Repeat step 3 until no missing values of y remain in the estimation sample. This may take many calls to {cmd:uvis}. {p 9 12 2} 5. Repeat steps 2-4 {cmd:cycles()} times. {pstd} Of course, this subprocedure is part of the process of imputing other variables with missing values. y may be binary, ordinal, or continuous. At present, only dichotomizations of y are supported, although different cutoff values may be used for different observations. 
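{pstd} As a hedged illustration of this first type of conditional imputation, suppose that {cmd:y} is a variable dichotomized at 0 and {cmd:z} is the corresponding 0/1 indicator (all names, the file name, and the seed are hypothetical). Following the item syntax given under {cmd:conditional()}, a call might look like this:

{phang2} {cmd:. ice y z x1 x2 x3, saving(imputed) m(5) conditional(y:z 0) seed(101)}

{pstd} Because of the rejection sampling in steps 3 and 4, such a run may be noticeably slower than the same call without {cmd:conditional()}.
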
{pstd} The second type of conditional imputation for imputing according to a subpopulation defined by {hi:y} <= {it:#} or {hi:y} > {it:#} does not involve rejection sampling and is no slower than any other imputation procedure performed by {cmd:ice}. {pstd} {hi:{ul:Interval censoring}} {pstd} Values of a variable y that are interval censored are imputed under the assumption that y is normally distributed with unknown mean and variance. The method, which is fast and efficient, is essentially as described for right-censored variables in section 3.3 of Royston (2001). A minor extension to allow left or interval censoring is employed. For example, if A < y < B and A and B are both finite, the imputed value for y will follow a truncated normal distribution with bounds A and B, variance parameter estimated from the data and mean given by the linear predictor for the imputation model for y. Stata's {cmd:intreg} command is used to estimate the mean and variance of y. When A and B are both missing (infinite), imputation of y simply assumes the normal distribution just mentioned, but without bounds. {pstd} If you wish to impose range limits on the imputed values, the lower and upper bound variables may be set accordingly. For example, to impute right-censored (e.g., survival) data, you would set {it:llvar} equal to all the observed times to event, whether censored or not, and {it:ulvar} to all the uncensored event times and missing for the censored times. This would cause the right-censored values to be imputed without restriction. If you wanted to bound the imputed values above, say, by 10, you would specify {it:ulvar} to be 10 (rather than missing) for all the censored observations. {pstd} {marker nopp}{...} {hi:{ul:Avoiding the perfect prediction bug}} {pstd} Perfect prediction may arise in {cmd:logistic}, {cmd:ologit}, or {cmd:mlogit} regression models when a (usually categorical) predictor variable perfectly predicts success or failure in the outcome variable. 
In {cmd:ice}, perfect prediction may occur without the user's knowledge because many regression models are run silently. Perfect prediction may lead to entirely inappropriate imputations. To avoid this, {cmd:uvis} checks for perfect prediction; if it is detected, {cmd:uvis} temporarily augments the data with a few extra observations with low weight, in such a way as to remove the perfect prediction. A message is displayed noting the variable that has the perfect prediction issue and that the problem has been resolved. Such treatment of the perfect prediction bug may be switched off, if desired, by using the {opt nopp} option. {pstd} {hi:{ul:Errors and diagnostics}} {pstd} {cmd:ice} may occasionally detect an anomaly when running {cmd:uvis} with a particular variable as response and a particular regression command. {cmd:ice} will then stop and report the {cmd:uvis} command it was running and the error number returned. Often the problem lies in a regression of a binary or categorical variable where the estimation procedure fails to converge; this is usually caused by sparse cell occupancy of the response variable. If you obtain this error you should either omit the offending variable from the imputation, or seek to combine a sparse category with another category. {pstd} Another possibility is that, again due to a defect in a particular regression command in the chained equations structure, the number of values imputed for a particular variable is less than expected. This is a serious error and again may arise from estimation problems involving a binary or categorical variable. In this situation, {cmd:ice} saves in the working directory a snapshot of the data it was using in the attempted estimation to a file called {hi:_ice_dump.dta}, while also reporting the {cmd:uvis} command it was executing. You can then investigate what may have gone wrong with the command by loading the data in {hi:_ice_dump.dta} and rerunning the offending regression command. 
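{pstd} A possible way to follow this up, assuming that {cmd:ice} reported a failing {cmd:uvis} call with {cmd:mlogit} as the regression command, response {cmd:x1}, and covariates {cmd:x2} and {cmd:x3} (all hypothetical), is to load the snapshot and rerun the underlying regression directly so that its full output can be inspected:

{phang2} {cmd:. use _ice_dump, clear}

{phang2} {cmd:. tabulate x1, missing}

{phang2} {cmd:. mlogit x1 x2 x3}

{pstd} The tabulation is included because sparse categories of the response are a common cause of such failures.
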
{pstd} {hi:{ul:Further notes}} {pstd} {opt ice} saves all the variables in the current data to the output {it:filename}, whether or not they are involved in the imputation procedure. This can make the resulting file large. It may therefore be sensible to drop variables not subsequently needed for modeling before running {opt ice}. {pstd} {cmd:ice} determines the order of imputing variables in the round of chained equations according to the amount of missing data. Variables with the least missingness are imputed first. If {opt ice} is run twice using identical variables and the same random-number seed, but with the variables in {it:mainvarlist} in a different order, slightly different imputations will be generated. The differences will be purely random and will not produce bias in subsequent parameter estimates. If the {opt boot()} option is applied to all variables, the order of variables no longer affects the results. {pstd} An important application of MI is to investigate possible models, for example, prognostic models, in which selection of influential variables is required (Clark and Altman 2003). For example, the stability of the final model across the imputation samples is of interest. This area of inquiry is in its infancy. {pstd} See also Van Buuren's web site http://www.multiple-imputation.com for further information and software sources. {title:Examples} {phang} {cmd:. uvis regress y x1 x2 x3, gen(ym)} {phang} {cmd:. uvis intreg ll ul x1 x2 x3, gen(y)} {phang} {cmd:. ice x1 x2 x3, saving(imputed) m(5)} {phang} {cmd:. ice x1 x2 x3, saving(imputed) m(5) cycles(20) cc(x4 x5)} {phang} {cmd:. ice x1-x5, saving(imputed) m(10) boot match(x1 x2 x3) cmd(x1 x2:mlogit, x3:ologit) id(pid) seed(101) genmiss(m_)} {phang} {cmd:. ice x1 x1a x1b x2 x3 x23, saving(imputed) m(5) cmd(x1:ologit) passive(x1a:x1==2 \ x1b:x1==3 \ x23:x2*x3) substitute(x1:x1a x1b)} {phang} {cmd:. ice y1 y2 y3 x1 x2 x3 x4, saving(imputed) m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2) match(y3)} 
{phang} {cmd:. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) match(x2) dropmissing} {phang} {cmd:. ice x1 ll2 ul2 x2 ll3 ul3 x3, saving(imputed) m(5) interval(x2:ll2 ul2, x3:ll3 ul3)} {title:Author} {pstd} Patrick Royston, MRC Clinical Trials Unit, London.{break} pr@ctu.mrc.ac.uk {title:References} {phang} van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. {it:Statistics in Medicine} 18: 681-694. {phang} Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. {it:Stata Journal} 3: 226-244. {phang} Clark, T. G., and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: an ovarian cancer case-study. {it:Journal of Clinical Epidemiology} 56: 28-37. {phang} Royston, P. 2001. The lognormal distribution as a model for survival time in cancer, with an emphasis on prognostic factors. {it:Statistica Neerlandica} 55: 89-104. {phang} Royston, P. 2004. Multiple imputation of missing values. {it:Stata Journal} 4: 227-241. {phang} Royston, P. 2005a. Multiple imputation of missing values: update. {it:Stata Journal} 5: 188-201. {phang} Royston, P. 2005b. Multiple imputation of missing values: update of ice. {it:Stata Journal} 5: 527-536. {title:Acknowledgments} {pstd} Ian White has made substantial contributions to the understanding and practical use of multiple imputation and to the programming of {cmd:ice} and {cmd:uvis}. Ian wrote the base of the {opt draw()} option; the idea and code for coping with perfect prediction are essentially all his. I am extremely grateful to him for his ongoing commitment to this project. {pstd} I am grateful also to Gillian Raab for pointing out certain issues with the prediction matching approach, particularly that it is only useful with continuous variables. As a result, the default imputation method has been changed from matching to drawing from the predictive distribution. 
Gillian also suggested imputing the variables in reverse order of the amount of missingness, and selecting the imputed value at random from the set determined by the available matching predictions. Both suggestions have been implemented. {title:Also see} {psee} Online: {helpb mim} (if installed) {p_end}