Home > Uncategorized > Smith 2010: Elementary Reconstruction of the Hockey Stick Curve

Smith 2010: Elementary Reconstruction of the Hockey Stick Curve

2010 October 13

Where Angels Fear to Tread

I suppose I should know better to stick my nose into the ‘hockey stick’ business. There are those who have studied this issue for years and a lively (and frequently acerbic) debate echoes through the halls of academia, frequently boiling out into the public courtyards.

Nevertheless, I stumbled into this recent article by Dr Richard L Smith of the University of Northern Carolina on a *relatively* simple “Hockey Stick” reconstruction using what appears to be straightforward and well-documented methods: Elementary Reconstruction of the Hockey Stick Curve: Discussion of Paper by Li, Nychka and Ammann (JASA, 2010). It certainly deserves wider readership than my quiet blog will provide, but perhaps I can serve as an introduction.

The Abstract

The paper by Li, Nychka, and Ammann (2010) has exemplified the power of Bayesian Hierarchical Models to solve fundamental problems in paleoclimatology. However, much can also be learned by more elementary statistical methods. In this discussion, we use principal components analysis, regression, and time series analysis, to reconstruct the temperature signal since 1400 based on tree rings data. Although the “hockey stick” shape is less clear cut than in the original analysis of Mann, Bradley, and Hughes (1998, 1999), there is still substantial evidence that recent decades are among the warmest of the past 600 years.

In short, while Smith gives props to LNH2010 BHM method, he suggests that a simpler approach is available.

The Method

Suppose we have observed temperatures yt for t = 1902, . . . , 1980, and proxy series {xjt, j = 1, . . . , q} (where, here, q = 70) for t = 1400, . . . , 1980. A natural way to think about the paleoclimate reconstruction problem is first to fit the regression

(1) yt = α0 + ∑qj=1 αjxjt + εt
for 1902–1980, and then, having estimated coefficients ˆαj, j = 0, . . . , q, to apply the fitted regression curve, ˆyt = ˆα0 + ∑qj=1αjxjt, to reconstruct the temperature curve prior to 1902.

Direct application of (1), however, is open to the objection that the number of regressors (70) is very close to the number of data points used to fit the regression (79), creating potentially serious overfitting problems. A standard method for dealing with this problem is first to transform the covariates {xjt, j = 1, . . . , 70} to PCs {ukt, k = 1, . . . , 70}, ordered by decreasing variance. Then, we fit the observed temperatures to a subset of the PCs,

(2) yt = β0 + ∑Kk=1 βkukt + εt

where K is to be determined. The argument for this is that with moderate K, chosen to capture most of the variability in the covariates, the right-hand side of (2) contains almost as much information as the right hand side of (1), but with far fewer regression coefficients to be estimated. For the moment, I assume a conventional ordinary least squares (OLS) regression analysis in which the εt are uncorrelated with mean 0 and common unknown variance σ2.

Smith uses HADCRUT3v to calibrate the proxies.

Fig 5

Smith’s method produces a number of hockey-stick-like constructions for various values of K. But, as he points out, there is no obvious way to choose among them. Which is the right number of PCs to include in the construction? Smith turns to standard statistical tests: AIC, BIC, and AICC.

A Series of Unfortunate Diversions

Being pretty much a complete novice to hockey stick issues, I took a stroll through Wahl and Ammann 2007. The following three excerpts from their paper discuss issues involving multiple principal components

2.2 Retention of instrumental eigenvectors/PC series in calibration

When decomposing a temperature field into its eigenvectors and corresponding PCs, a choice has to be made regarding the number of PCs to be retained. Here, the numbers of PCs from the global instrumental data grid to which the various proxy sets were calibrated were held the same as those used in the corresponding calibrations of the MBH98. We do this for all scenarios (cf. 1,5 and 6, in Section 2.4 below) where the numbers of proxies assimilated into calibration are equal or similar to those used in MBH. For scenarios that assimilate larger numbers of individual proxy series into calibration (especially scenarios 2 and 3, cf. Section 2.4 below), the method is justified by experiments in which results for the two 15th century calibrations were replicated by retaining the first four instrumental PCs (rather than the MBH specification of one PC for 1404–1449 and two for 1450–1499). The results of these replications demonstrate robustness of the Northern Hemisphere temperature reconstructions to the number of retained instrumental PCs; the calibration and verification statistics are, at most, only slightly different from those of the original experiments (with no change of meaning), and the actual reconstructed temperature series that result are virtually indistinguishable from those of the original experiments. Thus justified, the use of the same number of instrumental training” PCs in the scenarios vis-a-vis the corresponding calibrations in the MBH emulation allows the pure effects of eliminating/changing proxy summary PCs and truncations of proxy richness to be isolated, as described in Section 2.4.

Wahl, Ammann (2007) Examination of criticisms based on the nature and processing of proxy climate evidence

… The effect of using “princomp” without specifying that the calculation be performed on the correlation matrix (an alternate argument of “princomp”) [rb: as done in MM05b] forces the routine to extract eigenvectors and PCs on the variance-covariance matrix of the unstandardized proxy data, which by its nature will capture information in the first one or two eigenvectors/PCs that is primarily related to the absolute magnitude of the numerically largest-scaled variables in the data matrix (Ammann and Wahl, 2007, supplemental information). We have demonstrated that this method of PC extraction has the effect of shifting the actual temporal information common to the North American ITRDB proxy data into higher-order PCs, … especially the fourth PC (Ammann and Wahl, 2007; cf. MM05a/b where this shifting of information is also noted).

Wahl, Ammann (2007) Examination of criticisms based on the nature and processing of proxy climate evidence

In scenarios 5a/b, proxy PC calculations differ only by being based on series that were centered over a different period, and thus do not include the fact that MM05a/b omit the step of standardizing the individual records. As shown in Ammann and Wahl (2007), the method used in MM05a/b causes the PC analysis to capture the variance in the data in a very different way, with the first PCs mostly picking up time series with the largest amplitudes, but not what is common among the series. Only subsequent PCs (after the series have been indirectly standardized to the same variance level) capture variability that is in most of the individual series (Ammann andWahl, 2007). Thus, the number of PCs required to summarize the underlying proxy data changes depending on the approach chosen. Here we verify the impact of the choice of different numbers of PCs that are included in the climate reconstruction procedure. Systematic examination of the Gaspe-restricted reconstructions using 2–5 proxy PCs derived from MM-centered but unstandardized data demonstrates changes in reconstruction as more PCs are added, indicating a significant change in information provided by the PC series. When two or three PCs are used, the resulting reconstructions (represented by scenario 5d, the pink (1400–1449) and green (1450–1499) curve in Figure 3) are highly similar (supplemental information). As reported below, these reconstructions are functionally equivalent to reconstructions in which the bristlecone/foxtail pine records are directly excluded (cf. pink/blue curve for scenarios 6a/b in Figure 4). When four or five PCs are used, the resulting reconstructions (represented by scenario 5c, within the thick blue range in Figure 3) are virtually indistinguishable (supplemental information) and are very similar to scenario 5b. The convergence of results obtained using four or five PCs, coupled with the closeness of 5c to 5b, indicates that information relevant to the global eigenvector patterns being reconstructed is no longer added by higher-order PCs beyond the level necessary to capture the temporal information structure of the data (four PCs using unstandardized data, or two PCs using standardized data). More generally, the overall similarity of scenarios 5a–c demonstrates that when the full information in the proxy data is represented by the PC series, regardless of the precise ordering of the PCs and which centering convention is used, the impact of PC calculation methods on climate reconstruction in the MBH method is extremely small.

Wahl, Ammann (2007) Examination of criticisms based on the nature and processing of proxy climate evidence

So How Now Principal Cows?

Smith summarizes MM05b as reproducing MBH with a PC=1. But when Mann explained the number of principal components he used, it appears he looked at the explained variance of each principal component and kept the number of PCs which explained more than 5% of the variance. Or so it seems to me. I haven’t seen any confirmation of that reading.

Wahl and Amman make the case for using multiple principal components and we have taken a quick look at the selection rules for the number of PCs provided by Mann … but is there another way …?

A more systematic comparison may be made by computing various measures of fit. In comparing regression and/or time series models, three common criteria for selecting the best model are the Akaike Information Criterion (AIC; Akaike 1973), the Bayesian information criterion (BIC; Akaike 1978) and the bias-corrected Akaike Information Criterion (AICC; Hurvich and Tsai 1989). For the PC regression model with K up to 10, but still for the moment without making any allowance for time series autocorrelations, these criteria are listed in Table 1. It can be seen that as K increases, each of AIC, BIC, AICC drops sharply (indicating improved fit) at K = 2 and again at K = 8. The minimum is at K = 9 for AIC and AICC and K = 2 for BIC: as typically happens, BIC chooses a more parsimonious model.

Table 1. Table of AIC, BIC, AICC values for OLS regression without allowing for autocorrelation

1 	-40.4 	-33.3	-40.1
2 	-58.2 	-48.7 	-57.7
3 	-58.7 	-46.9 	-57.9
4 	-57.3 	-43.0 	-56.1
5 	-55.4 	-38.8 	-53.8
6 	-57.3 	-38.4	-55.3
7 	-58.1 	-36.8 	-55.5
8 	-63.8 	-40.1 	-60.5
9 	-66.4 	-40.4 	-62.5
10 	-66.1 	-37.7 	-61.4

Smith returns to his previous point regarding autocorrelation…

The omission of autocorrelation from the foregoing analyses is potentially a serious matter, since the width of the prediction interval could be substantially larger if autocorrelation is included. Therefore, I now extend the previous analysis to include a time series component.

To solve the autocorrelation issue, Smith tries a generalized least squares (GLS) analysis but it produced unsatisfactory results with high number of parameters unsuitable for interpretation. He tries again …

As an alternative to full GLS time series regression, therefore, I used the same OLS fits for the regression components produced earlier, but selected the optimal ARMA(p, q) model fitted to the residuals, and then recalculated the width of the prediction intervals to take account of the autocorrelation. This produced more easily interpretable results. For example, with K = 2, the optimal ARMA model had p = 1, q = 2 when selecting by AIC and p = 1, q = 0 when using BIC or AICC. With K = 9, all three selection criteria resulted in AR(1) as the optimal time series model—incidentally relevant to LNA (2010), where they used AR(2) as the time series model for residuals, though Dr. Li remarked in her oral presentation that the AR(1) model appears equally suitable in practice.

Fig 6

In Conclusion …

This analysis has used principal components regression combined with time series analysis of the residuals to reconstruct the global mean temperature series back to 1400. I smoothed the reconstructed series using a 25-year triangular moving average, and calculated 90% prediction intervals on the smoothed reconstruction as a measure of uncertainty. Three standard statistical model selection criteria (AIC, BIC, and AICC) were used to select the model orders K (number of PCs), p and q (for the autoregressive and moving average components of the time series model fitted to the residuals). Although these criteria do not lead to clear-cut selection of the best model, the final reconstructions do not appear to depend too sensitively on the model selected. Taking into account the general desire in applied statistics for a parsimonious model, the model with K = 2 PCs and AR(1) residuals appears adequate.

The idea of PC regression as a technique in paleoclimate reconstruction is not new—for example, it was discussed in chapter 9 of NRC (2006)—but it does not appear to have been systematically developed.

The results support an overall conclusion that the temperatures in recent decades have been higher than at any previous time since 1400. On the other hand, none of the recent reconstructions shows as sharp a hockey stick shape as the widely reproduced figure 3(a) of MBH (1999), so in that respect, critics of the hockey stick are also partially vindicated by these

As I read it, Smith has reconstructed MBH style hockey sticks using relatively well-known PC regression techniques and a 70 tree series as prepared by Nychka and Ammann. While AIC, BIC, and AICC tests were called upon to help select model parameters K, p, and q, the actual selection of the ‘best’ model appears to be somewhat subjective.

There is a review of what is described as “the controversy” in the beginning of the paper that I have skipped over in this post. Interested readers are referred to the article.

And of course, there is significant additional detail on the methods in the article itself. Interested readers are again referred to the article.

My thanks to Dr Smith for making his preprint publicly available for review and comment.

Dr Smith has an interest in ‘extreme events’:

A recent, brief appearance by Richard Smith at Congress:
Statistician’s View Given at Congressional Briefing on Climate Science

I’m listing some of Steve McIntyre’s posts on the number of PCs to retain. If, after reading these, you still don’t know what McIntyre believes to be a good rule for the retention of PCs, then at least I know I’m not alone. If I have missed something, please let me know.

Errors Matter #3: Preisendorfer’s Rule N (Feb 13, 2005 )
A Challenge to Tamino: MBH PC Retention Rules (Mar 14, 2008)
Gavin and the PC Stories (Mar 3, 2009)
Steig’s “Tutorial” (Jun 2, 2009 )

About these ads
  1. 2010 October 14 at 2:41 am | #1

    This is very timely Ron. I had been going through W&A myself, so these recent papers will help a lot.

    I had downloaded the W&A code and data from Nychka’s site. It’s amazingly simple and ran first time. Now, like you, I have a lot to figure out.

  2. 2010 October 14 at 8:31 am | #2

    Well-written MS and an interesting analysis by you, Ron. I wish I knew more maths — enough to comment substantively on Smith’s variations on PC analysis of the treering proxies. A lay reader can at least follow the general thrust of his approach, a seemingly simple task that is much more difficult when looking at BHM analyses.

    I’ll stray a bit off-topic to talk a bit about proxy selection, which Smith specifically excludes from this MS. Here’s a validation process that I would really like to see, performed by a disinterested and trustworthy statistician.

    1. Assemble candidate treerings.

    2. Predefine a first step for the selection process that does not include curve-matching to an recent instrumental record. For instance, it could be “trees within 50 meters of the current treeline” for montane sampling areas.

    3. Predefine a second step in the process that uses curve-matching to the instrumental record (say, 1890-1990) to select a subsample of the treerings that passed the first step.

    4. Predefine the conditions that are to be used for the reconstruction analysis. Say, pick the K=10 conditions from Smith’s Fig. 5.

    5. Proceed with the PC analysis and generate the 1400-1890 reconstruction.

    So far, this is only marginally different from the field’s SOPs.

    6. Now pick a ~ century-long interval where you are fairly sure that temperatures were falling rather than rising*; as during 1890-1990. The onset of the Little Ice Age, maybe — 1730 to 1830, perhaps.

    7. For the second routine, repeat steps 3, 4, and 5, above, and generate a reconstruction covering 1400-1730.

    If the method is robust, the 1400-1730 reconstruction generated from curve-fitting of the temperature-rising era should be quite superimposable on the 1400-1730 reconstruction generated from calibrating to the temperature-falling era.

    The point is to follow the superior statistical practices that ere are now routinely used in other fields, e.g. clinical trials, where an important goal is to minimize biases in data selection and choice of analytical method — in particular, seeking to avoid post hoc procedures.

    There are obstacles to implementing this idea, of course. The two big ones, I think, would be assembling an unbiased initial cohort of candidate treerings, and figuring out what the 1730-1830 temperature anomaly curve for calibration should be.

    I suspect (hey, this is a blog comment, who needs any steenking evidence!) that most of the published paleotemperature reconstructions would fail an exercise such as this.

    Thanks for the heads-up to this post. Hope the digression doesn’t annoy.

    * rb: edit of italic tag

  3. 2010 October 15 at 8:03 am | #3

    AMac, digression isn’t annoying. I invited comment since I knew you were interested. And I see what you are getting at with the post-processing calibration. Something along those lines might be possible. Need to think about it more. EDIT to add: Note the code that Nick’s looking at (Wahl and Ammann) doesn’t use a calibrate-to-temp step. If you calibrated that to T, it would be a post-facto calibration.

    AMac, have you seen this site? It might give you a way into some proxy studies even if you aren’t comfortable setting up an R environment. A bit snarky and I haven’t tried it, but it looks like it might give you a way in …

    Nick, I don’t know when you started looking at Wahl and Ammann, but it was the recent Wegman brouhaha that set me on the trail to Smith’s paper. I’ve now run both the Nychka code and Smith’s 2010b stuff. Run, but not analyzed. Working on more reconstruction posts.

  4. 2010 October 15 at 12:19 pm | #4

    As AMac points out by implication, eliminating subjective selection is impossible because you have to select proxy records, so Eli would not be too concerned with a subjective element in selecting the best model as long as the choice is restricted to similar models. If they diverge wildly from each other that is a different case.

    Good post.

  5. 2010 October 15 at 12:42 pm | #5

    Hi Eli.

    Yeah, it surprises me that at the end of the day, some subjective judgement needs to be applied to the analysis. Not shocked, just a little surprised.

  6. Steven Mosher
    2010 October 15 at 2:05 pm | #6

    A couple points.

    1. They used bristlecones.(24 of them)
    Nothing logically wrong with that, but in my mind it comes down to this. There is a colorable claim that bristlecones have some issues which may preclude them from an analysis.

    A. always do a with/without in these cases
    B. update the proxies ( see Abenah)
    And dont substitute something else in its place and play a stupid shell game.

    2. There are always ‘subjective” judgements in the analysis process. These decisions may or may not cause uncertainty. the only way to estimate that uncertainty is to do what I suggested in #1.

    If Smith does a simple A/B and bristlecones dont matter ( they will) then not much gas left in that debate.

    If they do matter, then it seems to me science has a question it can investigate.

    Basically it moves the debate away from Mann and to the need to do more research on these trees.
    If the HS suffers a bit as a consequence, so be it. If we never knew about those trees we would still know we have a problem with AGW.

    Mann’s a lighting rod, move the debate.

  7. 2010 October 15 at 2:13 pm | #7

    Steve, do you know off-hand which proxies in 2004GL021750-NOAMER.1400.txt are the bristlecones?

  8. 2010 October 15 at 2:21 pm | #8

    I wrote the following a week ago for the relevant thread on Climate Audit. It never passed moderation, too many links perhaps. Pursuant to Steve Mosher’s point #1 supra, here’s a link to Matther Salzer et al.’s article on the Sheep Mountain bristlecone pine proxies. It suggests that the Stripbark Battles aren’t over yet.

    “Recent unprecedented tree-ring growth in bristlecone pine at the highest elevations and possible causes” is in the Dec. 1, 2009 issue of PNAS. Link to full text.

    As the title suggests, they find no post-1960 “Divergence Problem” to speak of. High recent temperatures are accompanied by high recent growth rates, for bristlecones at the treeline. On the strip-bark issue, they write:

    At both upper-treeline (Sheep Mountain) and non-upper-treeline (Cottonwood Lower) locations in the White Mountains, we found no differences in modern growth between the whole-bark and strip-bark trees that adequately explain the wide modern rings at the upper treeline (Fig. 3, Table S2). Thus, it is unlikely that the modern wide rings are a result of a change to the strip-bark growth form in recent centuries after partial cambial dieback.

    A search of the Salzer et al. manuscript for the name “Ababneh” returns no hits. In the Supplementary Information (PDF), Figure S4 discusses Graybill’s and her results. The legend reads (emphasis added),

    Graybill and Idso (1) strip-bark (solid line) and whole-bark (dotted line) collections plotted as raw ring widths as in this study (A) and as standardized indices (e.g., figure 5 in reference 1) (B). Mean segment length 1,052 years for strip-bark and 279 years for whole-bark. Note: the divergence in modern period is clearly a result of the standardization technique used by Graybill and Idso (1). A similar result was obtained by Ababneh (2), with little difference between strip-bark and whole-bark raw ring widths… and more substantial differences after data processing…

    Hat tip, commenter Boris at The Blackboard.

  9. TimG
    2010 October 15 at 3:30 pm | #9

    The reconstruction only goes back to 1400. It tells us nothing about the MWP which is the point of contention. The existence of the LIA is not disputed by anyone.

    There is also the issue of variance loss where the magnitude of past variations is suppressed by the algorithm.

  10. bugs
    2010 October 15 at 4:42 pm | #10

    Mann’s a lighting rod, move the debate.
    Mann is not a lightening rod, Mann has been made a lightening rod by McIntyre’s insistence on making science a matter of personalities. By doing so he has broken one of the fundamental rules of advancing science, and degraded the level of debate to the level of a pigpen.

    McIntyre has it completely within his power to realise his error, apologize, and move on.

  11. 2010 October 15 at 5:10 pm | #11

    Okay … thinking about this on the way home, I realized what I’m groping around.

    Last year, partly due to Watts and D’Aleo and partly due to the release of pseudo-CRU and the buzz leading up to it, a bunch of us built GMST code. “Both sides” participated and most of the technical bloggers came to realize that the methods used to create global surface temperature anomalies weren’t causing spurious warming. Some questions remain about data, but GSOD gives us a good hint that any data issues aren’t unique to GHCN.

    Similarly, “hockey sticks” can be broken down into methods and data. Issues with bristlecones are data issues. “Loss of variance” (attenuation) is a methods issue.

    To date, as far as I know, the climate blogging community hasn’t come up with the “methods” to create “paleo-climate reconstructions” (PCRs?) Without robust “methods”, many of the data issues cannot be analyzed. McIntyre can pursue his mud-flinging with Mann and “The Team”, I want to build hockey-sticks (PCRs).

    Smith titled this paper an “Elementary Reconstruction“, one that uses “methods that are routinely taught in first-year graduate courses in statistics.” Nick is poking at Wahl and Ammann which is very simple code. So I think the door is opening for “the blogging community” to make contributions and not just criticisms.

    I’m the slowest kid in the class. Stokes, Zeke, Jeff, and Roman are all much better at the maths. Barnes and Mosher are better at the code. Doesn’t embarrass me. I’m going to build me a better hockey stick. Not today. Not tomorrow. But sometime this winter. And maybe the idea will catch on.

  12. 2010 October 15 at 7:32 pm | #12


    You’re on to something. That’s a potentially powerful idea, and CCC is a compelling model.

    BTW, what is pseudo-CRU?

  13. Carrick
    2010 October 17 at 6:31 am | #13

    AMac: “BTW, what is pseudo-CRU?”

    Start here.

  14. 2010 October 22 at 6:46 am | #14

    @Ron @Amac

    Pretty much from the beginning of CCC I’ve had a “hockey stick” reconstruction in my mind as the 2nd project (ccc-gistemp being the first). Of course, we barely have time to do ccc-gistemp never mind add another project.

    Ron, you’re welcome to join us at clearclimatecode and get a hockey stick reconstruction running under our “brand”. We don’t insist on Python, we almost did ccc-gistemp in R.

  15. Kevin Anchukaitis
    2010 October 22 at 12:18 pm | #15

    Ron et al.

    The full series of Paper, Comment, Rejoinder in which Dr. Smith’s comment appears is here:


    It is an interesting set of papers, but I was surprised to read in Dr. Smith’s contribution the following statement:

    ‘The idea of PC regression as a technique in paleoclimate reconstruction is not new … but it does not appear to have been systematically developed’

    To the contrary, PC regression makes an appearance with some of the first attempts to do large-scale ‘climate field reconstructions’ – see Fritts 1976 and Fritts 1991, see the 1984 paper by Cook, Briffa, and Jones (http://onlinelibrary.wiley.com/doi/10.1002/joc.3370140404/abstract) and look at the climate field reconstruction papers and chapters by Evans and coauthors, for just a few examples.

  16. 2010 October 22 at 3:33 pm | #16

    Hmm, DIY paleo reconstructions would be a fun subject to turn to once I finish up temperature work. Speak of which, you need to check your Gmail more Ron :-p

Comments are closed.

Get every new post delivered to your Inbox.

Join 27 other followers