{smcl}
{hline}
help for {hi:boost}{right:(SJ5-3: st0087)}
{hline}

{title:Boosting (boosted regressions)}

{p 8 14 2}
{cmd:boost} {it:varlist} {ifin}{cmd:,}
{cmdab:dist:ribution(}{it:string}{cmd:)}
{cmd:maxiter(}{it:#}{cmd:)}
[{cmdab:in:fluence}
{cmdab:pred:ict(}{it:varname}{cmd:)}
{cmd:shrink(}{it:#}{cmd:)}
{cmd:bag(}{it:#}{cmd:)}
{cmdab:train:fraction(}{it:#}{cmd:)}
{cmdab:inter:action(}{it:#}{cmd:)}
{cmd:seed(}{it:#}{cmd:)}]

{title:Description}

{p 4 4 2}
{cmd:boost} implements the MART boosting algorithm described in Hastie et al.
(2001). {cmd:boost} accommodates Gaussian (normal), logistic, and Poisson
regression. The algorithm is implemented as a C++ plugin and requires Stata 8.1
or higher to run. It runs only under Windows.

{p 4 4 2}
By default, the model is fit using the first 80% of the data (training data).
This percentage can be changed through the option {cmd:trainfraction()}. To
ensure that the training data are a random 80% of the observations, sort the
data in random order before running {cmd:boost}.

{p 4 4 2}
{cmd:boost} determines the number of iterations that maximizes the likelihood
or, equivalently, the pseudo-R-squared. The pseudo-R2 is defined as
R2=1-L1/L0, where L1 and L0 are the log likelihoods of the full model and the
intercept-only model, respectively. Unlike the R2 given in {cmd:regress}, the
pseudo-R2 is an out-of-sample statistic. Out-of-sample R2s tend to be lower
than in-sample R2s.

{title:Output and Return values}

{p 4 4 2}
The standard output consists of the best number of iterations, {it:bestiter};
the R-squared value computed on the test dataset, {cmd:test_R2}; and the number
of observations used for the training data, {cmd:trainn}. {cmd:trainn} is
computed as the number of observations that meet the {cmd:in}/{cmd:if}
conditions times {cmd:trainfraction()}. These statistics can also be retrieved
using {cmd:ereturn}.
In addition, {cmd:ereturn} stores the training R-squared value,
{cmd:train_R2}, as well as the log-likelihood values from which {cmd:train_R2}
and {cmd:test_R2} are computed.

{title:Details}

{p 4 4 2}
For logistic regression, if {cmd:train_R2} is missing but {cmd:test_R2} is not,
{cmd:test_R2} can still be trusted. The missing {cmd:train_R2} is due to
numerical problems in evaluating the log-likelihood functions for very unlikely
parameter values. Resetting the number of iterations to {it:bestiter} often
solves the problem.

{title:Options}

{p 4 8 2}
{cmd:distribution(}{it:string}{cmd:)} specifies the distribution of the
response variable. Possible distributions are {cmd:normal}, {cmd:logistic},
and {cmd:poisson}.

{p 4 8 2}
{cmd:maxiter(}{it:#}{cmd:)} specifies the maximum number of trees to be fit.
The actual number used, {it:bestiter}, can be obtained from the output as
{cmd:e(bestiter)}. When {it:bestiter} is close to {cmd:maxiter()}, the
likelihood-maximizing number of iterations may be larger than {cmd:maxiter()}.
In that case, rerun the model with a larger value for {cmd:maxiter()}. When
{cmd:trainfraction(1.0)} is specified, all {cmd:maxiter()} iterations are used
for prediction ({it:bestiter} is missing because it is computed on a test
dataset).

{p 4 8 2}
{cmd:influence} displays the percentage of variation explained (for nonnormal
distributions, the percentage of log likelihood explained) by each input
variable. The influence matrix is saved in {cmd:e(influence)}.

{p 4 8 2}
{cmd:predict(}{it:varname}{cmd:)} predicts and saves the predictions in the
variable {it:varname}. To allow for out-of-sample predictions, {cmd:predict()}
ignores {cmd:if} and {cmd:in}. For model fitting, only observations that
satisfy {cmd:if} and {cmd:in} are used; predictions are made for all
observations.

{p 4 8 2}
{cmd:shrink(}{it:#}{cmd:)} specifies the shrinkage factor. {cmd:shrink(1)}
corresponds to no shrinkage.
As a general rule of thumb, reducing the value for {cmd:shrink()} requires
increasing the value of {cmd:maxiter()} to achieve a comparable
cross-validation R2. The default is {cmd:shrink(0.01)}.

{p 4 8 2}
{cmd:bag(}{it:#}{cmd:)} specifies the fraction of training observations that is
used to fit an individual tree. {cmd:bag(0.5)} means that half the observations
are used for building each tree. To use all observations, specify
{cmd:bag(1.0)}. The default is {cmd:bag(0.5)}.

{p 4 8 2}
{cmd:trainfraction(}{it:#}{cmd:)} specifies the fraction of the data to be used
as training data. The remainder, the test data, is used to evaluate the best
number of iterations. The default is {cmd:trainfraction(0.8)}.

{p 4 8 2}
{cmd:interaction(}{it:#}{cmd:)} specifies the maximum number of interactions
allowed. For example, {cmd:interaction(1)} means that only main effects are
fit; {cmd:interaction(2)} means that main effects and two-way interactions are
fit; and so forth. The number of terminal nodes in a tree equals the number of
interactions plus 1: {cmd:interaction(1)} means that each tree has 2 terminal
nodes; {cmd:interaction(2)} means that each tree has 3 terminal nodes; and so
forth. The default is {cmd:interaction(5)}.

{p 4 8 2}
{cmd:seed(}{it:#}{cmd:)} specifies the random-number seed. Specifying the same
seed reproduces the same sequence of random numbers. Random numbers are used
only for bagging, which selects a random subset of the observations for each
iteration. The default is {cmd:seed(0)}. The {cmd:seed()} option of {cmd:boost}
is unrelated to Stata's {cmd:set seed} command.

{title:More details}

{p 4 4 2}
The variables may not contain missing values (impute missing values first).
When {cmd:predict()} is specified, even observations excluded by {ifin} may not
contain missing values.

{p 4 4 2}
The boosting model itself cannot be saved. For this reason, predictions are
specified with an option rather than as a postestimation command.
This is different, for example, from {cmd:regress}, where {cmd:predict} can be
invoked afterwards.

{p 4 4 2}
The number of iterations that {cmd:boost} uses for prediction and influence,
{it:bestiter}, cannot be set directly. It is affected indirectly by the choice
of {cmd:maxiter()} because {it:bestiter} cannot exceed {cmd:maxiter()}.

{title:Examples}

{p 4 4 2}
Example 1: Put the data into random order. Run up to 1000 iterations. Assess
the contributions of the x variables and predict values:

{p 4 8 2}{cmd:. gen u=unif()}
{p 4 8 2}{cmd:. sort u}
{p 4 8 2}{cmd:. boost y x1-x7, distribution(logistic) maxiter(1000) trainfraction(0.8) predict(pred) influence}

{p 4 4 2}
Example 1 (cont.): Determine the percentage of correctly classified
observations for both the test and the training datasets:

{p 4 8 2}{cmd:. global trainn=e(trainn)}
{p 4 8 2}{cmd:. gen class=pred>.5}
{p 4 8 2}{cmd:. gen correct_test= class==y}
{p 4 8 2}{cmd:. replace correct_test=. if missing(y)}
{p 4 8 2}{cmd:. gen correct_train= correct_test}
{p 4 8 2}{cmd:. replace correct_test=. if _n<=$trainn}
{p 4 8 2}{cmd:. replace correct_train=. if _n>$trainn}
{p 4 8 2}{cmd:. tab1 correct_test correct_train y}

{p 4 4 2}
Example 1 (cont.): Display the variable influences in a bar chart:

{p 4 8 2}{cmd:. matrix influence = e(influence)}
{p 4 8 2}{cmd:. svmat influence}
{p 4 8 2}{cmd:. gen id=_n}
{p 4 8 2}{cmd:. replace id=. if influence==.}
{p 4 8 2}{cmd:. graph bar (mean) influence, over(id) ytitle(Percentage Influence)}

{p 4 4 2}
Example 2: Five-fold cross-validation. In turn, use a different 20% of the data
as the test dataset and compute an R-squared value each time:

{p 4 8 2}{cmd:. gen u=unif()}
{p 4 8 2}{cmd:. sort u}
{p 4 8 2}{cmd:. local N=_N}
{p 4 8 2}{cmd:. local size=round(`N'/5)}
{p 4 8 2}{cmd:. gen group=0}
{p 4 8 2}{cmd:. replace group=1 if _n>`size'}
{p 4 8 2}{cmd:. replace group=2 if _n>`size'*2}
{p 4 8 2}{cmd:. replace group=3 if _n>`size'*3}
{p 4 8 2}{cmd:. replace group=4 if _n>`size'*4}
{p 4 8 2}{cmd:. matrix input R2 = ( )}
{p 4 8 2}{cmd:. forval i=1/5 {c -(}}
{p 4 8 2}{cmd:. sort group}
{p 4 8 2}{cmd:. boost y x x2, dist(normal) maxiter(100) trainfraction(0.8)}
{p 4 8 2}{cmd:. replace group= mod(group+1,5)}
{p 4 8 2}{cmd:. matrix R2= R2 \ (e(test_R2))}
{p 4 8 2}{cmd:. {c )-}}
{p 4 8 2}{cmd:. svmat R2}
{p 4 8 2}{cmd:. sum R2}

{title:References}

{p 4 8 2}
Hastie, T., R. Tibshirani, and J. Friedman. 2001.
{it:The Elements of Statistical Learning}. New York: Springer.

{p 4 8 2}
Ridgeway, G. 1999. The state of boosting.
{it:Computing Science and Statistics} 31: 172-181. Also available at
{browse "http://www.i-pensieri.com/gregr/papers.shtml":http://www.i-pensieri.com/gregr/papers.shtml}.

{title:Author}

    Matthias Schonlau, RAND
    matt@rand.org
    {browse "http://www.schonlau.net":www.schonlau.net}