Lines, Sines, and Curve Fittings 12 – heteroskedasticity 1
In statistics, a sequence of random variables is heteroscedastic, or heteroskedastic, if the random variables have different variances. The term means “differing variance” and comes from the Greek “hetero” (‘different’) and “skedasis” (‘dispersion’). In contrast, a sequence of random variables is called homoscedastic if it has constant variance.
Having built some nice line+sine and exp+sine models, I’ve been wondering how confidence intervals are best applied to the projected trends. And that led me (through some comments at The Blackboard) to start examining heteroscedasticity. In short, if the variance is time dependent, than using standard deviations based on the assumption of time-independent variance would be wrong. If the variance grows over time, than the projected time invariant sd would be too narrow.
The R package “lmtest” provides a Breusch-Pagan Test for heteroscedasticity. An example from the documentation:
## generate a regressor x <- rep(c(-1,1), 50) ## generate heteroskedastic and homoskedastic disturbances err1 <- rnorm(100, sd=rep(c(1,2), 50)) err2 <- rnorm(100) ## generate a linear relationship y1 <- 1 + x + err1 y2 <- 1 + x + err2 ## perform Breusch-Pagan test bptest(y1 ~ x) studentized Breusch-Pagan test data: y1 ~ x BP = 17.0314, df = 1, p-value = 3.677e-05 bptest(y2 ~ x) studentized Breusch-Pagan test data: y2 ~ x BP = 1.9837, df = 1, p-value = 0.159
Very low p-values indicate a heteroskedastic variance.
Thinking about how best to display the time variance, I thought that placing a trend in the square of the residuals might work. Without much ado, here are the trend, residuals, and residual square plots for a simple linear trend and a line+sine trend.
The line has BP values of BP = 13.2193, df = 4, p-value = 0.01025
The line+sine has BP values of BP = 5.9591, df = 7, p-value = 0.5445
The residuals of the simple linear trend trip the magic p-value <= 0.05 trigger. It is heteroskedastic. Building projections using time-invariant standard deviations is questionable. The 95% CI should be expanding over time.
The residuals of the line+sine are homoskedastic – at least at the sensitivity of the BP test. But … we can see in the residual-squared plot that there is some reason to believe that it is also time variant, although at a slower pace than the linear model.