Home > GHCN, Statistics > GHCN: Pearson's Chi-squared test

GHCN: Pearson's Chi-squared test

2010 May 30

Introduction

Various players have looked at changes in trends due to loss of stations, loss of rural stations, loss of high latitude, and loss of high altitude stations. Other cuts have included brightness and GPW population or population density. Recently, Zeke added airports to the list.

Pearson’s Chi-squared test is used to test independence of variables in categories. Make no mistake, I am only playing at being a statistician in this post. I welcome comments and corrections in what follows and suggestions of more appropriate category tests.

Pearson’s Chi-squared test

I took the standard GHCN inventory and mean files. I dropped the Southern Hemisphere. With an eye towards DMSP and GPW data which only extends that far back, I filtered for just the 1990-2009 records. I dropped stations with less than 17 years of records (thanks Nick). That left me with 1960 Northern Hemisphere stations with 17 years of data between 1990 and 2009. I categorized the trend as above or below the mean ( tbin = (trend-gtrend)/(abs(trend-gtrend)) ). The mean of all the station’s trends is -0.005437 C/yr. This is not gridded.

In the following, we are testing the null hypothesis that the two categories are independent. The alternative hypothesis is that they are related. Low p-values (< 0.05) rejects the null hypothesis and suggests that the categories are dependent.

Airports -v- Trends
c11=nrow(inv[inv$tbin>0&inv$air=="A",])
c12=nrow(inv[inv$tbin>0&inv$air=="x",])
c21=nrow(inv[inv$tbin<0&inv$air=="A",])
c22=nrow(inv[inv$tbin<0&inv$air=="x",])
c1=c(c11,c12)
c2=c(c21,c22)
test.df = matrix(c(c1,c2),ncol=2)

# test table
chisq.test(test.df,correct=F)

# results
data: test.df
X-squared = 299.0116, df = 1, p-value < 2.2e-16
test.df
[,1] [,2]
[1,] 520 139
[2,] 489 812

RSU -v- Trends
c11=nrow(inv[inv$tbin>0&inv$rsu=="R",])
c12=nrow(inv[inv$tbin>0&inv$rsu=="S",])
c13=nrow(inv[inv$tbin>0&inv$rsu=="U",])
c21=nrow(inv[inv$tbin<0&inv$rsu=="R",])
c22=nrow(inv[inv$tbin<0&inv$rsu=="S",])
c23=nrow(inv[inv$tbin<0&inv$rsu=="U",])
c1=c(c11,c12,c13)
c2=c(c21,c22,c23)
test.df = matrix(c(c1,c2),ncol=2)

# test table
chisq.test(test.df,correct=F)

# results
data: test.df
X-squared = 436.8572, df = 2, p-value < 2.2e-16
test.df
[,1] [,2]
[1,] 287 627
[2,] 185 228
[3,] 537 96

ABC -v- Trends
c11=nrow(inv[inv$tbin>0&inv$abc=="A",])
c12=nrow(inv[inv$tbin>0&inv$abc=="B",])
c13=nrow(inv[inv$tbin>0&inv$abc=="C",])
c21=nrow(inv[inv$tbin<0&inv$abc=="A",])
c22=nrow(inv[inv$tbin<0&inv$abc=="B",])
c23=nrow(inv[inv$tbin<0&inv$abc=="C",])
c1=c(c11,c12,c13)
c2=c(c21,c22,c23)
test.df = matrix(c(c1,c2),ncol=2)

# test table
chisq.test(test.df,correct=F)

# results
data: test.df
X-squared = 1.9847, df = 2, p-value = 0.3707

> test.df
[,1] [,2]
[1,] 224 231
[2,] 145 146
[3,] 640 574

Elevation -v- Trends
maxi=max(inv$ielev)
mini=min(inv$ielev)
len=maxi-mini+1
col1=c(len+1)
col2=c(len+1)
for (i in 1:len) {
col1[i]=nrow(inv[inv$tbin>0&inv$ielev==(i-1),])
col2[i]=nrow(inv[inv$tbin<0&inv$ielev==(i-1),])
}
test.df = matrix(c(col1,col2),ncol=2)

# test table
chisq.test(test.df,correct=F)

# results
data: test.df
X-squared = NaN, df = 25, p-value = NA

test.df
[,1] [,2]
[1,] 555 300
[2,] 166 282
[3,] 77 96
[4,] 36 62
[5,] 32 36
[6,] 33 32
[7,] 26 41
[8,] 22 27
[9,] 15 28
[10,] 15 16
[11,] 13 17
[12,] 6 7
[13,] 8 3
[14,] 1 0
[15,] 1 2
[16,] 0 2
[17,] 1 0
[18,] 0 0
[19,] 0 0
[20,] 0 0
[21,] 0 0
[22,] 0 0
[23,] 0 0
[24,] 0 0
[25,] 1 0
[26,] 1 0

Elevation (with ceiling) -v- Trends
# this rebins the elev data with a ceiling
len=14
col1=c(len)
col2=c(len)
for (i in 1:len) {
col1[i]=nrow(inv[inv$tbin>0&inv$ielev==(i-1),])
col2[i]=nrow(inv[inv$tbin0&inv$ielev>=(i-1),])
col2[i]=nrow(inv[inv$tbin=(i-1),])
test.df = matrix(c(col1,col2),ncol=2)

# test table
chisq.test(test.df,correct=F)

# results
data: test.df
X-squared = 124.5414, df = 13, p-value < 2.2e-16
test.df
[,1] [,2]
[1,] 555 300
[2,] 166 282
[3,] 77 96
[4,] 36 62
[5,] 32 36
[6,] 33 32
[7,] 26 41
[8,] 22 27
[9,] 15 28
[10,] 15 16
[11,] 13 17
[12,] 6 7
[13,] 8 3
[14,] 5 4

Latitude -v- Trends
maxi=max(inv$ilats)
mini=min(inv$ilats)
len=maxi-mini+1
col1=c(len+1)
col2=c(len+1)
for (i in 1:len) {
col1[i]=nrow(inv[inv$tbin>0&inv$ilat==(i-1),])
col2[i]=nrow(inv[inv$tbin<0&inv$ilat==(i-1),])
}
test.df = matrix(c(col1,col2),ncol=2)

# test table
chisq.test(test.df,correct=F)

# results
data: test.df
X-squared = 572.5856, df = 8, p-value < 2.2e-16

test.df
[,1] [,2]
[1,] 116 1
[2,] 140 5
[3,] 98 34
[4,] 258 441
[5,] 217 465
[6,] 98 1
[7,] 66 4
[8,] 15 0
[9,] 1 0

Warnings
Results for elevation and latitude prompted a warning similar to the the following:

In chisq.test(table(col1, col2), correct = F) :
Chi-squared approximation may be incorrect

Results

Update 1: I originally just passed the category vectors into the chisq test. That didn’t seem to match the examples I was looking at which explicitly use the category counts. So I rewrote the tests to pass in the category counts Edit: but didn’t properly build the row-col table, see next update. Big difference. The only category in which ‘category independence with the binned trend’ is not rejected is the category of elevation.

Update 2: Further reading warns that Pearson’s does not perform well with sparse cells due to the asymptotic nature of the test. Plenty of sparse cells in the original binning of the elevation data shown above. I was also suspicious of the latitude results which was giving me a bogus #degrees of freedom. So I put a ceiling on the elevation bins and created an explicit table to use for the chisq test. Now I am back to rejection of the null hypothesis that the categories are independent of trend in all cases except ABC brightness.

Also note the relatively short time frame tested: 1990-2009.

And a final note, this is an ‘exercise’ for me – my first attempt to look at this statistical test. So … Caveat Lector!

Code:
doCHI.sh
1990.r
preprocessor-mod.r

About these ads
Follow

Get every new post delivered to your Inbox.

Join 27 other followers