## Lines, Sines, and Curve Fittings 12 – heteroskedasticity 1

In statistics, a sequence of random variables is heteroscedastic, or heteroskedastic, if the random variables have different variances. The term means “differing variance” and comes from the Greek “hetero” (‘different’) and “skedasis” (‘dispersion’). In contrast, a sequence of random variables is called homoscedastic if it has constant variance.

http://en.wikipedia.org/wiki/Heteroscedasticity

Having built some nice line+sine and exp+sine models, I’ve been wondering how confidence intervals are best applied to the projected trends. And that led me (through some comments at The Blackboard) to start examining heteroscedasticity. In short, if the variance is time dependent, than using standard deviations based on the assumption of time-independent variance would be wrong. If the variance grows over time, than the projected time invariant sd would be too narrow.

The R package “lmtest” provides a Breusch-Pagan Test for heteroscedasticity. An example from the documentation:

## generate a regressor x <- rep(c(-1,1), 50) ## generate heteroskedastic and homoskedastic disturbances err1 <- rnorm(100, sd=rep(c(1,2), 50)) err2 <- rnorm(100) ## generate a linear relationship y1 <- 1 + x + err1 y2 <- 1 + x + err2 ## perform Breusch-Pagan test bptest(y1 ~ x) studentized Breusch-Pagan test data: y1 ~ x BP = 17.0314, df = 1, p-value = 3.677e-05 bptest(y2 ~ x) studentized Breusch-Pagan test data: y2 ~ x BP = 1.9837, df = 1, p-value = 0.159

Very low p-values indicate a heteroskedastic variance.

Thinking about how best to display the time variance, I thought that placing a trend in the square of the residuals might work. Without much ado, here are the trend, residuals, and residual square plots for a simple linear trend and a line+sine trend.

The line has BP values of BP = 13.2193, df = 4, p-value = 0.01025

The line+sine has BP values of BP = 5.9591, df = 7, p-value = 0.5445

The residuals of the simple linear trend trip the magic p-value <= 0.05 trigger. It is heteroskedastic. Building projections using time-invariant standard deviations is questionable. The 95% CI should be expanding over time.

The residuals of the line+sine are homoskedastic – at least at the sensitivity of the BP test. But … we can see in the residual-squared plot that there is some reason to believe that it is also time variant, although at a slower pace than the linear model.

How about confidence and/or prediction intervals for the different models? Or at least confidence intervals on the coefficients of the models. My guess is that prediction intervals for all models will be floor to ceiling by 2100 even if you don’t correct for heteroskedasticity.

Any hints on how best to build CIs for exponential functions?

Monte Carlo, if all else fails. Take your best fit and the residual error and generate a lot of data sets, fit them and look at the range of fit coefficients. Then you can create a 95% confidence interval from the envelope of the extrapolated lines throwing out 2.5% on the top and bottom. I’d try it out with a linear model first to see if you get the right answer for that, since it’s known.