
trb-0.01: GHCN Simple Mean Average

2010 July 10

Introduction

While thinking about a CRU reconstruction, I realized that I did not have a framework for computing my own globally gridded temperature anomalies. At the start of the year, I had some bash, perl, and R scripts strung together in an ungainly manner that took many hours to run when analyzing the CRU data. And six months ago, I barely knew how to spell ‘R’. Since then, I’ve mostly been hacking on GISTEMP and have learned a little more R. So I guess it’s time to build my own, although I don’t expect any cool innovations.

I’m going to do so, publicly, in a step-by-step fashion, mostly because I need to do it that way. I really do hope to get some “U R DOIN ITZ RONG” comments. I’ve chosen R not for speed (for speed, see Chad’s fortran version), but for readability and extensibility, since I’ll be doing major mods as I move along.

In this first step, I’m just going to compute a simple mean average of the GHCN v2 data set. While that may seem brain dead (as in ‘what’s that supposed to show’), both Willmott in 1991 and McKitrick in 2003 used similar mean averages of GHCN v1 data sets.

Methods

###############################################################################
# trb 0.01 : 2010 07 10 : GHCN Simple Mean Average
# https://rhinohide.wordpress.com
###############################################################################

###############################################################################
# Read Data Files
###############################################################################
# ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean
###############################################################################

#------------------------------------------------------------------------------
# Temperature File (GHCN formatted monthly means)
#------------------------------------------------------------------------------

# mm for 'monthly means'
mmwids=c(11,1,4,5,5,5,5,5,5,5,5,5,5,5,5)
mmcols=c("id","dupe","year","jan","feb","mar","apr","may","jun",
                            "jul","aug","sep","oct","nov","dec")

# read the file
mm <- read.fwf("../data/v2.mean", widths=mmwids, col.names=mmcols,
                header=FALSE, na.strings="-9999")

###############################################################################
# Calculate Annual Mean Average
###############################################################################

#------------------------------------------------------------------------------
# set up the range of years to calculate
#------------------------------------------------------------------------------
#year0 <- min(mm[,3])
year0 <- 1880
year1 <- max(mm[,3])
range <- year1 - year0 + 1

#------------------------------------------------------------------------------
# initialize some vectors to hold annual results
#------------------------------------------------------------------------------
lmean <- rep(NA, range)
lyear <- rep(NA, range)
lcount <- rep(NA, range)

#------------------------------------------------------------------------------
# loop through the range
#------------------------------------------------------------------------------
for (i in 1:range) {

    year <- i + year0 -1

    idx=mm[,3]==year
    mmyr=mm[idx,]

    # calculate the annual mean for this year
    # divide by 10 since GHCN stores temps in tenths of a degree C
    # only use station-years with data for every month
    lmean[i]=mean(rowMeans(mmyr[,4:15]),na.rm=T)/10

    # alternative: use every station-year, even those with partial data
    #lmean[i]=mean(rowMeans(mmyr[,4:15],na.rm=T),na.rm=T)/10

    lyear[i]=year
    lcount[i]=nrow(mmyr)
}

#------------------------------------------------------------------------------
# calculate the global land temperature
#------------------------------------------------------------------------------
mean(lmean, na.rm=T) # 11.92536

###############################################################################
# Display
###############################################################################

# plot some stuff
png("trb-001-ghcn-ave-temp.png", width=600, height=400)
plot (lmean ~ lyear, type="l", xlab="Year", ylab="Temp (C)",
      main="GHCN Average Temperature\nSimple Mean of Set",
      sub="https://rhinohide.wordpress.com (trb-0.01)")

# Willmott/McKitrick style plot
# index based on 1880 as start year
png("trb-001-temp-count.png", width=600, height=400)
lyr1950 <- lyear[71:131]
lmn1950 <- lmean[71:131]
lcn1950 <- lcount[71:131]
barplot(lmn1950, ylim=c(10,16), xpd=F, col="tan", space=1)
par(new=T)
plot (lcn1950 ~ lyr1950, type="l", xlab="Year", ylab="Temp (C)",
      main="GHCN Average Temperature\nSimple Mean of Set",
      sub="https://rhinohide.wordpress.com", axes=F, ylim=c(1000,10000))
axis(side=4)
mtext("Station Count", side=4, line=2.5)
axis(side=1)
points(lcn1950 ~ lyr1950, col="blue", pch=18, cex=1.5)
legend(1990, 10000, c("Average Temp","Station Count"),
       pch=c(15,18), col=c("tan","blue"), lty=c(0,1))
dev.off()

Code available here: trb-0.01.R

Results

GHCN ave temp

GHCN ave temp and station count

Discussion

I haven’t read Willmott’s paper, so I’m relying on ‘carrot eater’s’ description that a simple mean was used in part of it.

McKitrick relied on Joe D’Aleo for his data. I’m pretty sure that the station count must be for GHCN v1; however, I cannot find a GHCN v1 data set that includes 1992-1999. There is an archived data set ‘ndp041’ for GHCN, but it stops in 1990.
http://www.uoguelph.ca/~rmckitri/research/nvst.html

Here is the McKitrick chart for comparison:
McKitrick GHCN station count and mean temp

Update 2010 07 11:
changed

    lmean[i]=mean(rowMeans(mmyr[,4:15]),na.rm=T)/10

to

    # only use station-years with data for every month
    lmean[i]=mean(rowMeans(mmyr[,4:15]),na.rm=T)/10

    # alternative: use every station-year, even those with partial data
    #lmean[i]=mean(rowMeans(mmyr[,4:15],na.rm=T),na.rm=T)/10

The second variant allows station-years with partial data to be included in the mean, whereas the first does not. Including partial records can strongly skew a particular year’s mean if there are a lot of ‘summer only’ or ‘winter only’ temps, especially in the high latitudes where seasonal variations are large. This seems to have happened in 2005. So a good case can be made for leaving it the way it was (only using data from complete station-years). At this stage, I don’t really care which, so I include both options here, since I spent some time poking at the difference in trb-0.02. This also explains why 2010 has no value.
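
To make the difference concrete, here is a minimal sketch with made-up numbers (not GHCN data): two toy stations in one year, one reporting all twelve months and one high-latitude station reporting only the warm months. The full-months-only variant drops the partial record; the other variant lets it pull the mean up.

# toy example: two made-up stations, one year, temps in tenths of a degree C
# station 1 reports all 12 months; station 2 reports only May-Sep
toy <- rbind(c(-150,-120, -50,  20,  80, 140, 170, 160,  90,  10, -70,-130),
             c(  NA,  NA,  NA,  NA,  90, 150, 180, 170, 100,  NA,  NA,  NA))

# variant 1: only station-years with all twelve months contribute
mean(rowMeans(toy), na.rm=T)/10           # 1.25  (station 2 is dropped)

# variant 2: every station-year contributes, even partial ones
mean(rowMeans(toy, na.rm=T), na.rm=T)/10  # 7.525 (summer-only record skews the mean warm)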

  1. carrot eater
    2010 July 10 at 10:39 am

    Willmott used NCAR, not GHCN. GHCN didn’t exist yet.

    I think a look at the timeline is in order.

    1992: GHCN v1.0 is released as a compilation of various sources of data. It is a static database, going up to 1990 and ending there. I don’t think it offered any homogeneity adjustments; it was just a source of unadjusted data.

    You can learn about 1.0 here
    http://cdiac.ornl.gov/ftp/ndp041/ndp041.pdf

    1997: GHCN v2.0 is released. It adds data since 1990, and is updated regularly thereafter through collection of CLIMAT reports, which are supposed to be submitted by each country’s weather service on a monthly basis. It gives a much more extensive set of metadata, it provides max/min data, and it gives NOAA-adjusted data, in addition to the unadjusted.

    2010/2011 (?) GHCN v3.x will be released. Featured will be a new adjustment technique, as well as more data. This may not all happen at once, but in stages.

    I’d imagine McKitrick was using v2.0.

    As for Willmott, I think he used gridding, but not anomalies. I take it you haven’t tracked down a copy? Um.. is there a way for me to get in touch with you?

    They came up with a dataset of 18,000 stations (Legates and Willmott, Theor. Appl. Climatol. 41, 11-21 (1990)), to get the average absolute temperature at each place. Trends over time do not appear to be considered; they just give the average absolute temperature at each place. Averaged globally, they get a global absolute near-surface temperature of about 14.0x C.

    They then use the locations of stations in the NCAR database in any given year, and see what global average temperature (average absolute temperature, not anomaly) they get from those, and compare it to that global average absolute temperature from the 18,000 station database. If the result is not the same as the 14.0x C they get, they call it a bias due to undersampling.

    Willmott’s analysis doesn’t seem well suited for, well, anything. I think it’s totally meaningless for assessing undersampling with respect to finding trends over time. But d’Aleo/Watts have latched onto it.

  2. carrot eater
    2010 July 10 at 10:52 am

    Ugh, the second last paragraph is garbled.

    Let me try again:

    Willmott et al had a network of 18,000 stations. Averaged globally, you get an absolute temperature of 14.0x C. I can’t tell what x is. They don’t seem to account for the fact that this average may change over time, but I’ve not read all their papers carefully.

    They then consider the station network for each year in the NCAR database. So say, take the station locations from 1981. Willmott then goes to their 18,000-station database, picks out the locations corresponding to the 1981 NCAR locations, and re-calculates a global absolute mean using only those locations. They consider the 14.0x C to be the true global mean. If the result using the 1981 network is anything but that, they call the difference a bias due to undersampling. This alleged bias comes to a fairly stable 0.2 C for 1980-1990. And that’s what d’Aleo and Watts took away from it.

    But the purpose of GISS, NCDC, NOAA, etc., is NOT to find the absolute mean near-surface temperature of the earth. The purpose is to find out how near-surface temperatures are changing over time. Anomalies are used. And so this Willmott paper just isn’t relevant.

  3. carrot eater
    2010 July 10 at 10:56 am

    I’m not R-literate, so I’ll leave the reviewing of your code to somebody else. Am I the only idiot who uses Matlab for these things? It’s a horrible choice, but it’s my everyday language, so it’s my hammer for all nails.

  4. 2010 July 10 at 12:38 pm

    I once saw someone do a globally gridded anomaly totally in a unix shell script. Wish I had bookmarked that.

    I am going to push back on you a little about ‘anomaly’ -v- ‘trend.’

    You say that the purpose is to see if surface temps are changing over time. ‘Changes over time’ are trends. Why not just plot the trends for each station and average those over an appropriate spatial distribution? Why go through what seems to me to be an intermediate step of anomalization?

  5. 2010 July 10 at 12:55 pm

    And anyone is welcome to send stuff for ronbroberg to that yahoo.com thingee.

  6. carrot eater
    2010 July 10 at 1:19 pm

    Ron Broberg :

    I am going to push back on you a little about ‘anomaly’ -v- ‘trend.’
    You say that the purpose is to see if surface temps are changing over time. ‘Changes over time’ are trends. Why not just plot the trends for each station and average those over an appropriate spatial distribution? Why go through what seems to me to be an intermediate step of anomalization?

    Explain more explicitly what you propose as an alternative. Say, actually do it on this toy example:

    Station A temperatures, annual means, deg C: 8 8 10 7 12
    Station B temperatures, annual means, deg C: 6 6 8 5 10

    Work out exactly what you’d do, with the toy data. From my deliberate choice of numbers, you might anticipate where I’m headed, but I want to make sure I see first what you actually mean.

    (Ignore for now that the math is normally done at monthly resolution).

  7. steven Mosher
    2010 July 11 at 7:30 am

    Hey Ron, I got the whole thing done in R if you want it. I need to organize it for more clarity,
    and I have a bunch of unused stuff. But it might be a good exercise.

  8. steven Mosher
    2010 July 11 at 7:43 am

    Looking at your code, the first thing you probably want to do is handle the “duplicates” and create a file with them already processed. In addition, the station records have missing years in their sequences. This will cause you problems down the road when you come to do a land mask and have to calculate the area weight on a per-month basis.

    If you like, I’ll give you the code for that preprocessing (borrows from Nick Stokes).

  9. 2010 July 11 at 11:41 am

    BTW, do you prefer Steven, Steve, Mosher, or Mosh?

    I wouldn’t consider ‘dupes’ to be the *first* thing to ‘fix.’ But it’s on my list of incremental changes.

    ———-

    CE, here is a snapshot of why I haven’t yet clearly seen the superiority of ‘anomaly’ -v- ‘absolutes’. It’s all about the trend, and the trend for the anomalies *should* be the same as the trend for the absolutes.

  10. carrot eater
    2010 July 11 at 12:10 pm

    Ron:

    But finding the OLS trend at an individual station for some period of time doesn’t move you in the direction of making a global mean temperature record.

    So you have to combine stations. And that’s where absolutes go wrong.

    So proceed with that. Combine stations A and B. At first, you’ll see that it still doesn’t matter, whether you use anomalies or absolutes.

    But then, pretend there is only A and B, but that Station B caught fire and stopped operating in 2003. Then you’ll clearly see why you can’t use absolutes. Station add/drops will mess it up.
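
A minimal R sketch of the scenario above, using the toy numbers from comment 6; treating the station B drop as simply losing its final value, and using each station’s own full-period mean as its baseline (an assumption for this sketch):

# toy annual means from comment 6 (deg C)
a <- c(8, 8, 10, 7, 12)
b <- c(6, 6,  8, 5, 10)

# absolutes: average the two stations, then pretend B stops reporting in the last year
abs_both <- (a + b)/2                         # 7 7 9 6 11
abs_drop <- abs_both; abs_drop[5] <- a[5]     # 7 7 9 6 12 -- spurious jump from losing the cooler station

# anomalies: subtract each station's own mean first, then average
aa <- a - mean(a); ba <- b - mean(b)
anom_both <- (aa + ba)/2                      # -1 -1 1 -2 3
anom_drop <- anom_both; anom_drop[5] <- aa[5] # -1 -1 1 -2 3 -- unaffected by the drop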

  11. carrot eater
    2010 July 11 at 12:14 pm

    Actually, taken to the extreme, you can get a station combining method out of finding trends. That’s the First Difference Method.
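
For reference, a minimal sketch of what a first-difference combination looks like on the same toy data; this is a generic reading of the First Difference Method, not code from the thread: difference each station year over year, average the differences across the stations reporting, then accumulate.

# first difference method, sketched on the toy data from comment 6
a <- c(8, 8, 10, 7, 12)
b <- c(6, 6,  8, 5, 10)
d  <- rbind(diff(a), diff(b))   # per-station year-over-year differences
fd <- colMeans(d, na.rm=T)      # average the differences across stations
cumsum(c(0, fd))                # 0 0 2 -1 4 : combined series relative to year 1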

  12. 2010 July 11 at 9:28 pm

    Yeah, I can see that. The problem becomes: how do you graph the intermediate years used in the trends? Graph a running 30-year average? I think that might be the strength of the anomaly method – it lets you take a clean snapshot in time without smoothing across the years. More of a visualization advantage than an analysis advantage. But I’m still thinking about it. Take nothing for granted.

  13. carrot eater
    2010 July 12 at 3:49 am

    Try different things out with toy or real data (I’d do both). It sounds like you’re considering a smoothed variant of the FDM. We can work out some toy examples of different ideas here in comments, before you go coding things, if you want.

    I don’t think it’s a matter of just visualisation. You need to combine information from the different stations at some point. You want to do this without losing information (as happens with smoothing), without decreasing S/N (an issue when you differentiate), and while trying to capture trends despite station drops/adds.

    I think it’s better to combine first, then smooth as desired, instead of the other way around. What if you’re interested in how the short term variability adds up globally?

    Also, I like expressing things as anomalies, because it’s what you’re trying to find out. If you leave the units in absolute terms, you’d be implying you’re finding some mean absolute temperature, but you aren’t. That’s hard. Tracking down the changes is easier.

  14. carrot eater
    2010 July 12 at 3:51 am

    ok, by using monthly means in the first place, we’ve already smoothed a little bit. so I’m saying, you may want to avoid smoothing the individual stations even more than that, before combining.

  15. 2010 July 12 at 6:43 am

    I agree that using trends instead of anomalies adds additional smoothing. And using absolutes is an issue when combining stations with different characteristics.

    If we have one cell with two stations,
    one at alt=1000m with an ave temp of 10C
    another at alt=0m with an ave temp of 20C,
    is the cell’s ave temp really 15C?

    What if we add one additional coastal station with an ave temp of 20C
    Did the cell’s real ave temp suddenly change to 16.67?

    So it’s trends or anomalies, unless you can justify that the stations you use for a cell reflect some sort of reality (temp smoothing over terrain).

    And trends are a smoothing over time. So you lose some data.

    I agree that anomalies make more sense.
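
The arithmetic in the altitude example above, as a quick check; the +0.3 anomaly values are made up purely for illustration:

# the cell's absolute mean shifts when the station mix changes...
mean(c(10, 20))       # 15.00000 : mountain + coastal station
mean(c(10, 20, 20))   # 16.66667 : add a second coastal station

# ...but if each station runs, say, +0.3 C relative to its own baseline,
# the cell anomaly is +0.3 regardless of the mix (values made up)
mean(c(0.3, 0.3)); mean(c(0.3, 0.3, 0.3))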

  16. carrot eater
    2010 July 12 at 7:15 am

    Ron Broberg :

    If we have one cell with two stations,
    one at alt=1000m with an ave temp of 10C
    another at alt=0m with an ave temp of 20C,
    is the cell’s ave temp really 15C?
    What if we add one additional coastal station with an ave temp of 20C
    Did the cell’s real ave temp suddenly change to 16.67?

    Exactly. Perfect example.

    So then you move on to how to compute the anomalies, or how to combine the stations. Also, how to combine/weed out the duplicates, which are something of an annoyance.

    I suppose you could review the existing methods, or you could pretend you never heard of them, start from scratch with a blank slate, and see what you come up with.

  17. 2010 July 12 at 7:25 am

    KISS!

    I’m not interested in an original or obscure method at this time.
    Going to anomalies and using simple averages for this first pass.

  18. carrot eater
    2010 July 12 at 7:32 am

    ok, then you’ll be in the camp of Zeke, CRU and NCDC.

    GISS, Tamino, Nick and Chad all take a different approach.

    It doesn’t much matter.

  19. 2010 July 12 at 11:32 am

    Yeh, Chad and I get pretty much the same results (as did Nick and I prior to using a land mask). CAM and LSM/RSM are effectively the same globally. The latter only really shines in calculating regional temps with fragmentary records.
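
For readers not following the jargon: CAM is the common anomaly method (each station converted to anomalies against one shared baseline period, then averaged), while RSM/LSM-style methods combine stations without requiring a common baseline. A minimal CAM sketch on made-up annual data, with the 1996-2000 baseline chosen purely for illustration:

# common anomaly method (CAM), minimal sketch on made-up annual data
years    <- 1996:2005
stn1     <- c( 8.0, 8.2, 8.1, 7.9, 8.3, 8.4, 8.2, 8.6, 8.5, 8.7)
stn2     <- c(14.0,14.1,13.9,14.2,14.3,14.2,14.5,14.4,14.6,14.8)
baseline <- years %in% 1996:2000   # shared baseline period (illustrative)

# anomalies against each station's mean over the common baseline, then a simple
# average across stations; stations lacking baseline data would be dropped
anom1 <- stn1 - mean(stn1[baseline])
anom2 <- stn2 - mean(stn2[baseline])
(anom1 + anom2)/2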

  20. carrot eater
    2010 July 12 at 11:54 am

    Yes, the whole point of the latter is to avoid setting a single universal baseline, in case you have non-overlapping, fragmented stations.

  21. carrot eater
    2010 July 12 at 11:55 am

    Is that the official name of that now, LSM? Meaning, would that be recognisable jargon amongst the group?

  22. 2010 July 12 at 2:25 pm

    Carrot,

    Probably not, but I aim to make it so by using that term, unless someone has a better idea for a name (3 words is a plus for acronymic parity with CAM, FDM, RSM).

  23. carrot eater
    2010 July 12 at 3:23 pm

    I was able to figure out what you meant, so that’s a plus.

    Anybody not following these conversations would have no idea, even if you spelled out the acronym, though.
