{smcl}
{hline}
help for {hi:boost}{right:(SJ5-3: st0087)}
{hline}

{title:Boosting (boosted regressions)}

{p 8 14 2}
{cmd:boost} 
{it:varlist}
{ifin}{cmd:,} 
{cmdab:dist:ribution(}{it:string}{cmd:)}
{cmd:maxiter(}{it:#}{cmd:)}
[{cmdab:in:fluence}
{cmdab:pred:ict(}{it:varname}{cmd:)}
{cmd:shrink(}{it:#}{cmd:)}
{cmd:bag(}{it:#}{cmd:)}
{cmdab:train:fraction(}{it:#}{cmd:)}
{cmdab:inter:action(}{it:#}{cmd:)}
{cmd:seed(}{it:#}{cmd:)}]


{title:Description}

{p 4 4 2}
{cmd:boost} implements the MART boosting algorithm described in Hastie et al.
(2001).  {cmd:boost} accommodates Gaussian (normal), logistic, and Poisson
regression.  The algorithm is implemented as a C++ plugin and requires Stata
8.1 or higher to run. It only runs under Windows.

{p 4 4 2}
By default, the model is fit using the first 80% of the data (training data).
This percentage can be changed through the option {cmd:trainfraction()}. To
ensure that the training data are a random 80% of the observations, sort the
data in random order before running {cmd:boost}.

{p 4 4 2}
{cmd:boost} determines the number of iterations that maximizes the
likelihood or, equivalently, the pseudo-R-squared, on the test data. The
pseudo-R2 is defined as R2=1-L1/L0, where L1 and L0 are the log likelihoods
of the full model and the intercept-only model, respectively.  Unlike the R2
given in {cmd:regress}, the pseudo-R2 is an out-of-sample statistic.
Out-of-sample R2s tend to be lower than in-sample R2s.
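
{p 4 4 2}
As an illustrative calculation with made-up numbers (not output from
{cmd:boost}): if the intercept-only model has a test-data log likelihood of
L0=-100 and the full model has L1=-80, then R2=1-(-80)/(-100)=0.2.  The
arithmetic can be checked in Stata:

{p 4 8 2}{cmd:. display 1 - (-80)/(-100)}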


{title:Output and return values}

{p 4 4 2}
The standard output consists of the best number of iterations, {it:bestiter};
the R-squared value computed on the test dataset, {cmd:test_R2}; and the
number of observations used for the training data, {cmd:trainn}. {cmd:trainn}
is computed as the number of observations that meet the {cmd:in}/{cmd:if}
conditions times {cmd:trainfraction()}. These statistics can also be retrieved
using {cmd:ereturn}. In addition, {cmd:ereturn} stores the training
R-squared value, {cmd:train_R2}, as well as the log-likelihood values from
which {cmd:train_R2} and {cmd:test_R2} are computed.


{title:Details}

{p 4 4 2}
If, for logistic regression, {cmd:train_R2} is missing but {cmd:test_R2} is
not, {cmd:test_R2} can still be trusted. The missing {cmd:train_R2} is due to
numerical problems in evaluating the log-likelihood functions for very
unlikely parameter values. Setting {cmd:maxiter()} to {it:bestiter} and
rerunning the model often solves the problem.
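
{p 4 4 2}
A minimal sketch of this remedy, assuming a binary outcome {cmd:y} and
predictors {cmd:x1}-{cmd:x7} (hypothetical variable names): save
{cmd:e(bestiter)} from the first run and pass it to {cmd:maxiter()} in a
second run.  Whether this removes the missing value depends on the data.

{p 4 8 2}{cmd:. boost y x1-x7, distribution(logistic) maxiter(1000)}

{p 4 8 2}{cmd:. local best=e(bestiter)}

{p 4 8 2}{cmd:. boost y x1-x7, distribution(logistic) maxiter(`best')}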


{title:Options}

{p 4 8 2} {cmd:distribution(}{it:string}{cmd:)} specifies the distribution
and hence the type of regression to be fit.  Possible distributions are
{cmd:normal} (Gaussian regression), {cmd:logistic} (logistic regression), and
{cmd:poisson} (Poisson regression).

{p 4 8 2}
{cmd:maxiter(}{it:#}{cmd:)} specifies the maximum number of trees to be
fitted. The actual number used, {it:bestiter}, can be obtained from the output
as {cmd:e(bestiter)}.  When {it:bestiter} is too close to {cmd:maxiter()}, the
maximum likelihood iteration may be larger than {cmd:maxiter()}. In that case,
it is useful to rerun the model with a larger value for {cmd:maxiter()}.  When
{cmd:trainfraction(1.0)} is specified, all {cmd:maxiter()} iterations are used
for prediction ({it:bestiter} is missing because it is computed on a test
dataset).

{p 4 8 2} {cmd:influence} displays the percentage of variation explained 
(for nonnormal distributions, the percentage of log likelihood explained) 
by each input variable. The influence matrix is saved in {cmd:e(influence)}.

{p 4 8 2} {cmd:predict(}{it:varname}{cmd:)} predicts and saves the predictions
in the variable {it:varname}.  To allow for out-of-sample predictions,
{cmd:predict()} ignores {cmd:if} and {cmd:in}. For model fitting, only
observations that satisfy {cmd:if} and {cmd:in} are used; predictions are made
for all observations.
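
{p 8 8 2}
As a sketch of out-of-sample prediction, assuming hypothetical variables
{cmd:y}, {cmd:x1}-{cmd:x7}, and {cmd:year}: the model is fit only on
observations before 2000, but {cmd:yhat} is filled in for every observation,
so predictions for the excluded years can be inspected afterward.

{p 8 12 2}{cmd:. boost y x1-x7 if year<2000, distribution(normal) maxiter(500) predict(yhat)}

{p 8 12 2}{cmd:. summarize yhat if year>=2000}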

{p 4 8 2}
{cmd:shrink(}{it:#}{cmd:)} specifies the shrinkage factor.  {cmd:shrink(1)}
corresponds to no shrinkage.  As a general rule of thumb, reducing the value
for {cmd:shrink()} requires increasing the value of {cmd:maxiter()} to achieve
a comparable cross-validation R2.  The default is {cmd:shrink(0.01)}.

{p 4 8 2}
{cmd:bag(}{it:#}{cmd:)} specifies the fraction of training observations
that is used to fit an individual tree. {cmd:bag(0.5)} means that half the
observations are used for building each tree. To use all observations, specify
{cmd:bag(1.0)}.  The default is {cmd:bag(0.5)}. 

{p 4 8 2}
{cmd:trainfraction(}{it:#}{cmd:)} specifies the fraction of the data to be
used as training data.  The remainder, the test data, is used to evaluate the
best number of iterations.  The default is {cmd:trainfraction(0.8)}.  

{p 4 8 2}
{cmd:interaction(}{it:#}{cmd:)} specifies the maximum number of interactions
allowed.  For example, {cmd:interaction(1)} means that only main effects are
fitted; {cmd:interaction(2)} means that main effects and two-way interactions
are fitted; and so forth.  The number of terminal nodes in a tree equals the
number of interactions plus 1.  {cmd:interaction(1)} means that each tree
has 2 terminal nodes; {cmd:interaction(2)} means that each tree has 3 terminal
nodes; and so forth.  The default is {cmd:interaction(5)}.

{p 4 8 2}
{cmd:seed(}{it:#}{cmd:)} specifies the random-number seed, so that the same
sequence of random numbers can be reproduced. Random numbers are used only for
bagging, which selects a random subset of the observations for each
iteration.  The default is {cmd:seed(0)}.  The {cmd:seed()} option of
{cmd:boost} is unrelated to Stata's {cmd:set seed} command.
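
{p 8 8 2}
For example, a reproducible run might set both seeds.  This is a sketch with
arbitrary seed values and hypothetical variable names: {cmd:set seed} fixes
the random sort order that determines the training data, and {cmd:seed()}
fixes the bagging samples.

{p 8 12 2}{cmd:. set seed 12345}

{p 8 12 2}{cmd:. gen u=uniform()}

{p 8 12 2}{cmd:. sort u}

{p 8 12 2}{cmd:. boost y x1-x7, distribution(logistic) maxiter(1000) seed(12345)}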

 
{title:More details}

{p 4 4 2} The variables may not contain missing values (impute missing values
first).  When {cmd:predict()} is specified, even the observations excluded by
{cmd:if} and {cmd:in} may not contain missing values.

{p 4 4 2} The boosting model itself cannot be saved. For this reason,
predictions are specified with an option rather than as a postestimation
command. This is different, for example, from {cmd:regress} where
{cmd:predict} can be invoked afterwards. 

{p 4 4 2} The number of iterations that {cmd:boost} uses for prediction and
influence, {it:bestiter}, cannot be set directly. It is affected
indirectly by the choice of {cmd:maxiter()} because {it:bestiter} cannot
exceed {cmd:maxiter()}.


{title:Examples}

{p 4 4 2}
Example 1: Put data into random order.  Run up to 1000 iterations.  Assess
contributions of x variables and predict values:

{p 4 8 2}{cmd:. gen u=uniform()}

{p 4 8 2}{cmd:. sort u}

{p 4 8 2}{cmd:. boost y x1-x7, distribution(logistic) maxiter(1000)  trainfraction(0.8) predict(pred) influence}

{p 4 4 2} Example 1 (cont.): Determine the percentage of correctly classified
observations for both the test and the training datasets:

{p 4 8 2}{cmd:. global trainn=e(trainn)}

{p 4 8 2}{cmd:. gen class=pred>.5 }

{p 4 8 2}{cmd:. gen correct_test= class==y  }

{p 4 8 2}{cmd:. replace correct_test=.   if missing(y)}

{p 4 8 2}{cmd:. gen correct_train=  correct_test}

{p 4 8 2}{cmd:. replace correct_test=.  if _n<=$trainn }

{p 4 8 2}{cmd:. replace correct_train=. if _n>$trainn}

{p 4 8 2}{cmd:. tab1 correct_test correct_train y}

{p 4 4 2} Example 1 (cont.): Display the variable influences in a bar chart:

{p 4 8 2}{cmd:. matrix influence = e(influence)}

{p 4 8 2}{cmd:. svmat influence}

{p 4 8 2}{cmd:. gen id=_n}

{p 4 8 2}{cmd:. replace id=. if influence==.}

{p 4 8 2}{cmd:. graph bar (mean) influence, over(id) ytitle(Percentage Influence)}

{p 4 4 2} Example 2: Five-fold cross-validation.  In turn, use a different 20%
of the data as the test dataset and compute an R-squared value each time:

{p 4 8 2}{cmd:. gen u=uniform()}

{p 4 8 2}{cmd:. sort u}

{p 4 8 2}{cmd:. local N=_N}

{p 4 8 2}{cmd:. local size=round(`N'/5)}

{p 4 8 2}{cmd:. gen group=0}

{p 4 8 2}{cmd:. replace group=1 if _n>`size'}

{p 4 8 2}{cmd:. replace group=2 if _n>`size'*2}

{p 4 8 2}{cmd:. replace group=3 if _n>`size'*3}

{p 4 8 2}{cmd:. replace group=4 if _n>`size'*4}

{p 4 8 2}{cmd:. matrix  input R2 = (   )}

{p 4 8 2}{cmd:. forval i=1/5 {c -(}}

{p 4 8 2}{cmd:. sort group}

{p 4 8 2}{cmd:. boost y x x2, dist(normal) maxiter(100) trainfraction(0.8)}

{p 4 8 2}{cmd:. replace group=mod(group+1,5)}

{p 4 8 2}{cmd:. matrix R2=R2 \ (e(test_R2))}

{p 4 8 2}{cmd:. {c )-}}

{p 4 8 2}{cmd:. svmat R2 }

{p 4 8 2}{cmd:. sum R2 }


{title:References}

{p 4 8 2}
Hastie, T., R. Tibshirani, and J. Friedman. 2001.
{it:The Elements of Statistical Learning}. 
New York: Springer.

{p 4 8 2}
Ridgeway, G. 1999.  The state of boosting.
{it:Computing Science and Statistics} 31: 172-181. 
Also available at {browse "http://www.i-pensieri.com/gregr/papers.shtml":http://www.i-pensieri.com/gregr/papers.shtml}.


{title:Author}

	Matthias Schonlau, RAND
	matt@rand.org
	{browse "http://www.schonlau.net":www.schonlau.net}