Home > GIStemp, LSCF, Statistics > Lines, Sines, and Curve Fittings 12 – heteroskedasticity 1

## Lines, Sines, and Curve Fittings 12 – heteroskedasticity 1

2011 January 21

In statistics, a sequence of random variables is heteroscedastic, or heteroskedastic, if the random variables have different variances. The term means “differing variance” and comes from the Greek “hetero” (‘different’) and “skedasis” (‘dispersion’). In contrast, a sequence of random variables is called homoscedastic if it has constant variance.

http://en.wikipedia.org/wiki/Heteroscedasticity

Having built some nice line+sine and exp+sine models, I’ve been wondering how confidence intervals are best applied to the projected trends. And that led me (through some comments at The Blackboard) to start examining heteroscedasticity. In short, if the variance is time dependent, than using standard deviations based on the assumption of time-independent variance would be wrong. If the variance grows over time, than the projected time invariant sd would be too narrow.

The R package “lmtest” provides a Breusch-Pagan Test for heteroscedasticity. An example from the documentation:

```## generate a regressor
x <- rep(c(-1,1), 50)
## generate heteroskedastic and homoskedastic disturbances
err1 <- rnorm(100, sd=rep(c(1,2), 50))
err2 <- rnorm(100)
## generate a linear relationship
y1 <- 1 + x + err1
y2 <- 1 + x + err2
## perform Breusch-Pagan test
bptest(y1 ~ x)

studentized Breusch-Pagan test

data:  y1 ~ x
BP = 17.0314, df = 1, p-value = 3.677e-05

bptest(y2 ~ x)

studentized Breusch-Pagan test

data:  y2 ~ x
BP = 1.9837, df = 1, p-value = 0.159
```

Very low p-values indicate a heteroskedastic variance.

Thinking about how best to display the time variance, I thought that placing a trend in the square of the residuals might work. Without much ado, here are the trend, residuals, and residual square plots for a simple linear trend and a line+sine trend.

The line has BP values of BP = 13.2193, df = 4, p-value = 0.01025

The line+sine has BP values of BP = 5.9591, df = 7, p-value = 0.5445

The residuals of the simple linear trend trip the magic p-value <= 0.05 trigger. It is heteroskedastic. Building projections using time-invariant standard deviations is questionable. The 95% CI should be expanding over time.

The residuals of the line+sine are homoskedastic – at least at the sensitivity of the BP test. But … we can see in the residual-squared plot that there is some reason to believe that it is also time variant, although at a slower pace than the linear model.

1. 2011 January 21 at 2:37 pm

How about confidence and/or prediction intervals for the different models? Or at least confidence intervals on the coefficients of the models. My guess is that prediction intervals for all models will be floor to ceiling by 2100 even if you don’t correct for heteroskedasticity.

2. 2011 January 21 at 4:09 pm

Any hints on how best to build CIs for exponential functions?

3. 2011 January 21 at 7:43 pm

Monte Carlo, if all else fails. Take your best fit and the residual error and generate a lot of data sets, fit them and look at the range of fit coefficients. Then you can create a 95% confidence interval from the envelope of the extrapolated lines throwing out 2.5% on the top and bottom. I’d try it out with a linear model first to see if you get the right answer for that, since it’s known.