Bootstrap regression is the process of re-running the same regression model hundreds to thousands of times on resampled data to generate a consensus forecast. The idea is that, across random resamples of the data, the average forecast of an ensemble of fitted models is more accurate than the forecast from any single fitted model. This is the same concept as the wisdom of the crowd. Risk Simulator provides variations of this bootstrap regression, including resampling the residual errors, and can generate probabilistic Monte Carlo simulation assumptions from the results.
Bootstrapping works well in situations where the dataset consists of independent and identically distributed (i.i.d.) data points, meaning the sequential order of the data points is not important in fitting the underlying process. For example, if we resample rows of data (one row may consist of multiple columns of independent variables) with replacement a sufficient number of times, the fitted parameters will be distributed around the true population parameters. Bootstrap regression can be problematic when the data points are not i.i.d., such as when they are clustered or sensitive to extreme values. In addition, because order is assumed to be unimportant, bootstrap regression is typically applied only to cross-sectional data.
In a traditional Empirical Bootstrap, given original data of n rows of observations on k independent variables, (Y_i, X_i1, …, X_ik), we resample with replacement to generate a new set of i.i.d. observations (Y_l*, X_l1*, …, X_lk*), such that for each l, P(Y_l* = Y_i, X_l* = X_i) = 1/n for all i = 1, …, n. Each time, we draw n new observations (typically fewer than the original total number of observations) from the original dataset and run the regression to obtain the coefficients. The process is then repeated B times in the bootstrap. The fitted coefficients β̂_i will be approximately normally distributed around the true values β_i. This method is fast and effective but can run afoul of small original sample sizes in which certain data elements are highly susceptible to extreme values. For example, if certain influential outlier data points are not resampled, the estimated regression equation may increase the spread or the skew of the final coefficients' distributions. This methodology is available in Risk Simulator | Forecasting | Multiple Regression | Bootstrap Regression | Random X Case Resampling.
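A minimal sketch of empirical (case-resampling) bootstrap regression in Python, assuming a hypothetical dataset with made-up coefficients; this illustrates the general technique, not Risk Simulator's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: n observations of k = 2 independent variables.
n, k, B = 200, 2, 1000
X = rng.normal(size=(n, k))
true_beta = np.array([1.5, -0.8])
y = 2.0 + X @ true_beta + rng.normal(scale=0.5, size=n)

def ols(X, y):
    """Ordinary least squares with an intercept column prepended."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # [intercept, beta_1, ..., beta_k]

# Case resampling: draw whole rows with replacement, refit the
# regression B times, and collect the coefficient vectors.
boot_coefs = np.empty((B, k + 1))
for b in range(B):
    idx = rng.integers(0, n, size=n)  # n row indices, with replacement
    boot_coefs[b] = ols(X[idx], y[idx])

# The bootstrap coefficients are approximately normally distributed
# around the fitted values; their spread estimates the standard errors.
print(boot_coefs.mean(axis=0))  # near [2.0, 1.5, -0.8]
print(boot_coefs.std(axis=0))   # bootstrap standard errors
```

Each bootstrap row index set keeps the pairing between Y and its covariates intact, which is why the method is called case (row) resampling.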
An alternative approach is the Parametric Residual Bootstrap (sometimes also known as the wild residual bootstrap), which helps reduce the impact of influential outlier data points. Using the residuals from the regression, e_i = Y_i − β̂_0 − β̂_1 X_i, we generate i.i.d. residuals ê_1*, …, ê_n*. We then fix the covariates, X_i* = X_i for each i, and resample only the value of Y_i using the residual e_i. The bootstrap sample is obtained by Y_i* = β̂_0 + β̂_1 X_i + Φ_i e_i, where Φ_i ~ N(0,1). The i.i.d. normal random multiplier is applied to reduce any heteroskedasticity issues in the event the variance of the error is unstable over time. This approach is preferred when there are influential outliers or heteroskedasticity issues in the data, whereas the empirical bootstrap approach is preferred in almost all other cases. This methodology is available in Risk Simulator | Forecasting | Multiple Regression | Bootstrap Regression | Fixed X Parametric Residual Resampling.
In the machine learning environment, the AI Machine Learning Bagging Linear Fit Bootstrap supervised model in BizStats applies the empirical bootstrap regression approach. This and other artificial intelligence machine learning methods are discussed in the next section. Finally, for Monte Carlo simulation applications and the parametric residual approach, use Risk Simulator’s multiple regression bootstrap method instead.