NCDC DS-9640: CONUS Temperature Anomalies and a couple of ax-grinders
For some reason, two posts at The Blackboard, It’s “Fancy,” Sort of … (Shollenberger) and “To get what he wanted”: Upturned end points. (Lucia), seem to be having difficulty understanding the mechanics of another post at Open Mind, In the Classroom (Tamino). But there is nothing unusual or difficult about the methods Tamino used to create the charts that have generated so much smoke and apparent frustration at The Blackboard – resulting in an outbreak of mcintyretude: scorn, derision, insults, and the questioning of motives. Since this is mostly a quick walk-through of some code to clear the smoke, I will save the generated charts for the end of the post.
Most will recognize this as R code. First, load some common libraries. The reshape package is used for restacking the monthly columns as individual rows. Cairo provides the CairoPNG graphics device I use on Windows and on the headless CentOS webhost.
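The loading step itself did not survive into this write-up, but given the functions used below (melt from reshape, CairoPNG from Cairo), it presumably looks something like this:

```r
# restack monthly columns into rows (provides melt)
library(reshape)
# PNG output that works on a headless host (provides CairoPNG)
library(Cairo)
```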
Next, read the data. Tamino included a pointer to the data portal which held the FTP link to the data. Lucia found it without issue, as did I. While it is not absolutely necessary to use the same data set for verifying general findings, it is necessary when exploring the finer details of another’s methods, such as analyzing end-point behavior.
# read data
fd964 = "ftp://ftp.ncdc.noaa.gov/pub/data/cirs/drd964x.tmpst.txt"
fw = c(3,1,2,4,7,7,7,7,7,7,7,7,7,7,7,7)
d964col = c("code","div","element","year",
            "Jan","Feb","Mar","Apr","May","Jun",
            "Jul","Aug","Sep","Oct","Nov","Dec")
d964 = read.fwf(fd964, fw)
colnames(d964) = d964col
The state.README file in the FTP directory provides the information to find the TOB-adjusted, area-weighted national data series, coded as '110'. Since I missed defining the NA values in the read command, I define them now.
# get lower 48 / CONUS
# state code = 110 (see state.README)
# unique(d964[d964[,1]==110, c(1:3)])
conus = d964[d964[,1]==110, c(4:16)]
conus[conus == -99.90] <- NA
The column format makes it easy to compute monthly anomalies. The anomaly is just the difference from the mean of each month across all the years, excluding 2012 since it has only partial data.
# anomalize: subtract each month's 1895-2011 mean (2012 is partial, so excluded)
anoms = t(conus[,2:13]) - colMeans(conus[1:117,2:13], na.rm=T)
conus[,2:13] = t(anoms)
Reshape the column matrix into a row series, one row per year-month. Convert the month labels into a numeric value (yyyy plus a month fraction). Sort into time order. Convert to Celsius. Nothing fancy here.
# transpose months from columns to rows
conus_ym = melt(conus, id=c("year"))
# convert month labels to numeric values
# merge year and month into a y.m value (mid-month)
conus_ym[,2] = conus_ym[,1] + match(conus_ym[,2], month.abb)/12 - 1/24
# resort into time order
conus_ym = conus_ym[order(conus_ym[,2]),]
# convert to Celsius
conus_ym[,3] = conus_ym[,3]*5/9
The monthly chart is very noisy. To better see the behavior of the data, calculate and plot the 12 month averages. But we run into a common data issue: the last year, 2012, only has data for the first four months of the year. How inconvenient. What should we do? There are three rather obvious methods of dealing with that.
1. Drop the data (and lose four months of information).
2. Use the four-month mean (Jan12-Apr12) as the annual mean (‘dangerous’ with high variability and so few data points).
3. Use the previous 12 months (May11-Apr12) as a proxy for 2012 (uses some data twice).
I usually choose option 1, mostly because I am lazy and I don’t know better. I sometimes use option 2, based on my sense of percent-complete and variability. I have heard of option 3, but I don’t recall a time when I have used it. I don’t know how common it is, either, but using data more than once is generally a no-no. This is rule-of-thumb stuff, and I don’t know the technical merits of when each is most appropriate. I am sure there are more sophisticated treatments for incomplete data at end points (prediction from trend models, for instance), but these are the easy ones.
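Why option 2 is ‘dangerous’ is easy to quantify with a toy simulation (synthetic noise, not the NCDC series): the standard error of a mean shrinks with the square root of the sample size, so a 4-month mean is roughly sqrt(12/4) ≈ 1.7 times noisier than a full 12-month mean.

```r
# Toy illustration (not from the original analysis): compare the spread
# of 4-point means against 12-point means drawn from the same noise.
set.seed(42)
noise = rnorm(120000)                   # stand-in for monthly anomaly noise
m4  = colMeans(matrix(noise, nrow=4))   # means of 4-month samples
m12 = colMeans(matrix(noise, nrow=12))  # means of 12-month samples
ratio = sd(m4) / sd(m12)                # expect ~ sqrt(3) ~ 1.73
```

With real anomalies the months are not independent, so treat the factor as a rough lower bound on how much shakier that blue 2012 point is.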
Did Tamino choose a ‘sophisticated’ method for dealing with the incomplete data for 2012? Nope. He did the same as I would, discarding the data. The lazy man’s way.
# busy monthly chart
plot(conus_ym[,3] ~ conus_ym[,2], type="l")

# 12 month averages
ave12 = aggregate(conus_ym[,3], list(conus_ym[,1]), FUN=mean, na.rm=T)

# But 2012 only has 4 months of data
# We could drop the year (and only lose 4 months of data)
# or we could display the mean of the 4 months (blue)
# or we could use the mean of May2011-Apr2012 (red)
idx = 1408                                        # Apr 2012
jan2012apr2012 = ave12[118,2]
may2011apr2012 = mean(conus_ym[(idx-11):idx,3])   # 12 months: May 2011 - Apr 2012

# drop the last year
ave12 = ave12[1:117,]
I am going to produce two charts here. The first is a near replica of Tamino’s yearly average as found here (option 1 from above). The second shows what happens if you choose option 2 (blue) or option 3 (red) from above.
CairoPNG("conus-loess-ave12-a.png", width=600, height=400)
plot(ave12[,2] ~ ave12[,1], type="l",
     xlab="", ylab="Temperature Anomaly (deg C)")
points(ave12[,2] ~ ave12[,1], pch=20)
title(main="USA48", sub="http://rhinohide.wordpress.com 20120512",
      font.main=1, cex.main=2, cex.sub=.9)
dev.off()

CairoPNG("conus-loess-ave12-cd.png", width=600, height=400)
plot(ave12[,2] ~ ave12[,1], type="l", ylim=c(-1.25,3.2), xlim=c(1895,2012),
     xlab="", ylab="Temperature Anomaly (deg C)")
points(ave12[,2] ~ ave12[,1], pch=20)
points(2012, jan2012apr2012, pch=20, col="blue")
points(2012, may2011apr2012, pch=20, col="red")
title(main="USA48", sub="http://rhinohide.wordpress.com 20120512",
      font.main=1, cex.main=2, cex.sub=.9)
dev.off()
Tamino then moved to a chart of 5-year averages of the monthly data, further reducing variability. He also introduced a lowess smooth of the (monthly) data, presumably to demonstrate that these two common smoothing methods produce essentially the same result.
So we are once again faced with the incomplete-data issue. The five-year bins begin in 1895. The first five-year bin ends in 1899, and so on, until we reach the second-to-last bin, 2005-2009. But the last five-year bin is incomplete, with only 28 months of data: Jan 2010 through Apr 2012. We have the same three options as above.
1. Discard the data (and lose 28 months of information).
2. Use the mean of Jan10-Apr12 as the last ‘five year’ average.
3. Use May07-Apr12 as the last five-year average.
Did Tamino use an unusual method of dealing with the last 5-year bin? Nope. He did exactly what I did in a previous post regarding Canadian oil sands production. In that post, I had only 9 months of data but wanted an annual average. I took the nine-month average and ran with it, but I did note in the post that I had approximated that data point. That Tamino did not do so in his post is the only valid criticism of it that I have read.
But that leaves one other problem: what the heck is a lowess smooth? I have poked at them before when trying (and failing) to reconstruct a chart of David Archibald’s. Tamino apparently uses a hand-coded lowess function, but kindly pointed out that you could use R’s default lowess function to generate a similar fit. Unfortunately, the default lowess also uses a default window of 67% of the data points – a huge window that over-smooths this data. To get a better fit, you need to decrease the window. Tamino suggests a window of about 20%; he also notes that a window of 15% will cause the lowess to dip in the final few years – the result Brandon is searching for to match his intuition. Obviously, the smaller the window, the better the ‘fit’. But which is the better ‘smoothed fit’? At this level (off-hand blog posts), it seems pretty subjective. If you want a close fit, you want smaller windows (how small is too small?), but you lose the smoothing. If you want broader, longer-term behavior, you want larger windows with greater smoothing. Since the data span more than a century, I think we should be looking at the longer windows, not the shorter ones.
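The window trade-off is easy to see on synthetic data (a toy sketch, not the NCDC series): the smoother span f in R’s lowess is the fraction of points in each local window, so shrinking f lets the curve chase the wiggles (and the noise), which shows up as smaller residuals.

```r
# Sketch (synthetic data): how the lowess span f trades smoothness for fit.
set.seed(1)
x = seq(0, 10, length.out=400)
y = sin(x) + rnorm(400, sd=0.5)
fit_wide   = lowess(x, y, f=0.67)   # R's default: very smooth, misses the curve
fit_narrow = lowess(x, y, f=0.15)   # tighter: follows the sine (and some noise)
# residual variance shrinks as the window narrows
res_wide   = var(y - fit_wide$y)
res_narrow = var(y - fit_narrow$y)
```

That smaller residual is not the same thing as a better smooth, of course, which is the whole argument.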
I’ve plotted both on the chart below. For comparison, here is Tamino’s original chart.
# 60 month bins
idx60 = as.integer(1 + (conus_ym[,2] - conus_ym[1,2])/5)
conus_ym = cbind(conus_ym, idx60)
ave60 = aggregate(conus_ym[,3], list(conus_ym[,4]), FUN=mean, na.rm=T)
ym60 = (1:24)*60 - 30   # mid-point month of each bin

# error info (standard error of the mean; count only non-NA values)
sem = function(x) sqrt(var(x, na.rm=T)/sum(!is.na(x)))
sem60 = aggregate(conus_ym[,3], list(conus_ym[,4]), FUN=sem)

# we have the same issue as above
# do we drop 2010-2012 (which loses 28 months of data),
# use the mean of those 28 months,
# or use Apr 2012 + 59 previous months (which uses some data twice)?
alt_last_60 = mean(conus_ym[(idx-59):idx,3])

CairoPNG("conus-loess-ave60.png", width=600, height=400)
plot(ave60[,2] ~ conus_ym[ym60,1], type="l", ylim=c(-.9,1.4),
     xlab="", ylab="Temperature Anomaly (deg C)")
points(ave60[,2] ~ conus_ym[ym60,1], pch=20)
# R default f = .67; the smaller the f, the smaller the window, the less smoothing
lines(lowess(conus_ym[1:1408,2], conus_ym[1:1408,3], f=0.15), col="blue")
lines(lowess(conus_ym[1:1408,2], conus_ym[1:1408,3], f=0.20), col="red")
lines(c(1895,2012), c(0,0), col="gray")
arrows(conus_ym[ym60,1], ave60[,2],
       conus_ym[ym60,1], ave60[,2] + sem60[,2]*2, angle=90, length=.03)
arrows(conus_ym[ym60,1], ave60[,2],
       conus_ym[ym60,1], ave60[,2] - sem60[,2]*2, angle=90, length=.03)
title(main="USA48", sub="http://rhinohide.wordpress.com 20120512",
      font.main=1, cex.main=2, cex.sub=.9)
dev.off()
Approached as a learning activity, this is a neat little exercise. I got a quick look at the NCDC DS-9640 dataset. I also got to read the caveat in the README that the last few years of data are subject to change. While I attempted to use a lowess once before to replicate a chart of Archibald’s, this is my first post featuring one, and I got to play with sliding windows. I don’t recall binning data into 5-year chunks before, and I know I have never charted whisker-like error bars before. So there is a lot of opportunity to use Tamino’s post as a touchstone for exploring the data space and analysis methods.
You can find the R script here.