## MMH10: The charts I wanted to see

__Introduction__

As described by Jeff at the Air Vent, the take-away chart from McKitrick, McIntyre, and Herman 2010 is the following.

But that’s not the chart I wanted to see. Indeed, the charts I wanted to see are not to be found in the paper or the SI. So I made the charts I wanted to see.

The following charts are made from Table 1 as found in MMH10. I have made no effort to validate that information.

First are the 23 models compared to the 4 observation data sets for the lower troposphere. ~~The bars show the trend. The whiskers show the 2 s.d.~~ The bars show the (~95%) confidence interval for the trends. The solid horizontal lines show the maximum and minimum of the observational data sets taken as a whole. The dashed horizontal lines show the range which satisfies all four observational data sets. Pink bars are those models whose trend +/- 2 sd does not fall into any part of the observational range. Orange bars are models whose trend +/- 2 sd falls into some part of the observational range but not within the combined obs range. The green bars are those models whose range of trend +/- 2 sd includes the range which satisfies all four observational data sets.
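Expressed as code, the coloring rule reads as follows. This is a minimal sketch; the function and every number in the example are hypothetical, not the Table 1 values.

```python
# Sketch of the pink/orange/green classification described above.
# All numeric values in the test calls are invented, not from Table 1.

def classify(lo, hi, obs_min, obs_max, comb_lo, comb_hi):
    """Classify a model's trend interval [lo, hi] (trend +/- 2 sd) against
    the overall observational range [obs_min, obs_max] and the combined
    range [comb_lo, comb_hi] satisfied by all four obs data sets."""
    if hi < obs_min or lo > obs_max:
        return "pink"    # misses the observational range entirely
    if lo <= comb_lo and hi >= comb_hi:
        return "green"   # covers the range satisfying all four obs sets
    return "orange"      # overlaps the obs range but not the combined range

# e.g. classify(0.5, 0.9, 0.05, 0.3, 0.1, 0.2) -> "pink"
```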

The same game for the Middle Troposphere models and observations.

Ironically, I think I first encountered this method via Zeke Hausfather’s analysis (scroll down) of Steve Goddard’s claims about North America snow cover and model skill, and via Chad Herman.

Chad points me to a model-by-model comparison that he did earlier:

http://treesfortheforest.wordpress.com/2009/12/11/ar4-model-hypothesis-tests-results-now-with-tas/

NOTE: The original charts were lost in a server crash. I have reconstructed them (Jul 2012) per the description above. Code is here.

Awesome.

I think in the general discussion here, there’s a somewhat pointless cycle of ‘gotcha’ and ‘not so’. Regardless of how you want to draw the whiskers, and regardless of whether you want to say they overlap or not, there is a discrepancy between model and observation that one would like to narrow a bit more.

Well yes, the question is how to do so. Do the models have the effect of the energy taken to melt ice properly configured? How about the effects of hot damp air rising?

This is just back to the ‘missing’ tropical hot spot thing isn’t it?

yes, it’s the so-called tropical hotspot. Basically, when the surface warms, then, assuming there is moisture available, the air some distance above should warm slightly faster. This isn’t anything fancy that you need a sophisticated model for; it’s something which should just happen. Well, you need a model to estimate by how much, but for the basic principle you don’t need a model.

In short, as the surface air warms, then the air can hold more water. And this has been observed; the specific humidity has been going up over time (see Tamino’s recent post). But as the air moisture goes up, then the lapse rate should decrease. Because the lapse rate decreases, you should be able to see, some ways up there, the air warming faster than the surface air.

On short time scales, it appears that we do observe this. El Nino highs and La Nina lows are exaggerated in UAH, in comparison to the surface air records. Same thing, I think. But for some reason, in the long term trends, you don’t see the expected amplification. This could be because the satellites and balloons aren’t terribly good at picking up long term trends (something you can get when you have to stitch together data from several satellites, because no one satellite lasts for 40 years, or it could be some sort of satellite drift). It could be because there is something missing from the theory, such that over long time scales, the lapse rate for some reason doesn’t change as expected. If anybody has a well-developed theoretical reason for why this could be, I haven’t heard it, but I haven’t really looked, either.

I have a model that predicts the trend will be somewhere between -10 degC/decade and +10 degC/decade. Since the whiskers encompass the observational data my model must be right (at least according to the logic presented here).

From my naive perspective, the purpose of model/obs comparisons is to help identify those models that are performing poorly. And if you have models that you can rely on, you can switch that around to identify poor data sets.

Saying that the model mean fails to match observations doesn’t give you much to work with as to identifying problems. Look at individual models and their performance and you may be able to draw generalizations about model methodology and/or physics that can be used to make improvements.

That’s why (IMO) its important to look at the individual models.

“Since the whiskers encompass the observational data my model must be right (at least according to the logic presented here).”

Not quite. It just means that observations have not invalidated your model, not that the model is correct. 😉

There is a larger question of precision -v- accuracy.

As you say, Ron, this isn’t in MMH. But something very similar is in Santer et al, and in the SI, in Table 7, it’s calculated for up to end 2006. They give numbers of models in your color categories. A version of Santer with SI appended is here.

Ron,

Well, the question that everyone wants to know is whether these models provide useful insights into the future. The argument that the IPCC has used in the past is that the individual models are so bad that none of them provide useful insights on their own, but when taken together as an ensemble they do provide useful insights.

Secondly, your individual model comparison still shows that 22 of 23 models have an upper limit which exceeds the upper limit of observations and none have a lower limit that exceeds observations. That is pretty clear evidence that the models are biased high even if you can’t say that the models have been invalidated by observations.

Carrot:

I can easily live in a world where there is a significant difference of models and observations. I really don’t know, would need to follow it all the way through.

But I know that if Figures 2 and 3 are the “proof” then there is something wrong. The spread of models is just way too tight to be a meaningful representation of how models predict. I think there must be some subaveraging or something to get that figure so tight.

So, really…all I can do is point out the gotcha. Let’s get that fixed and then we can debate other issues. But that figure is just bizarre with those tight whiskers.

TCO: I agree that the whiskers on the model bar in Fig 2 and 3 are unreasonably tight. Broaden them, and this particular ‘gotcha’ of McIntyre goes away. But the point remains that the models are all over the place, and are generally higher than the observations. So there is clearly some work to be done.

Out of curiosity, what’s happened where there are several runs available for the same model? I haven’t read the paper. I think it’s important to see not only the spread of models, but also the spread of possibilities within a model.

After all, reality is also only one of a spread of possibilities itself.

It would be interesting to produce a similar figure that shows the spread of individual model runs compared to observations, rather than the mean of runs for specific models (at least for models with more than one run submitted).

This still doesn’t penalize models with high error bars as it should. Over time (modelers, you pick how long you need to average out natural variation – 50 yrs? – this number keeps going up), the mean trend should be the same as observations for these models – if not, the single-factor CO2 forcing sensitivity is too high. Alternatively, CO2 is not the only game in town, against all the protestations from Gavin et al.

“Not invalidated” is pretty weak beer. Almost any model with high enough error bars would not be invalidated. That’s why people are looking at the long term trend averages, which is really what’s of interest. The error bars around that average don’t save the day for the models, and are not somehow a strength. It’s the first moment that is interesting here as natural variability gets averaged out. It’s not clear to me how to average down the error bars of the ensemble of models (ie uncorrelated rms or what), but it’s also not particularly interesting. The fact is that the individual model errors are large (bad, not good) and the mean of the ensemble is way off (bad, not good).

Well, the models are not independent samplings of a true population. We don’t think that additional models (with trivial differences) improve our predictive understanding. These are not like independent market surveys, where error and uncertainty is only a matter of sampling.

And we don’t even USE models in the way that Figures 2 and 3 would imply. I mean when we look at IPCC or the like, they don’t report some superprecise standard error of the mean like 5.4-5.6 degrees expected rise (for a given scenario). AND WE SHOULDN’T!

I fully leave open the possibility of observations showing model inconsistency. But Figures 2 and 3 are inappropriate to illustrate that. And sure, the models have interesting differences from each other and a lot of run-to-run variation. But then write a paper on THAT. Not on nature differing from “the models” when it’s misleading to lump them as a class that way.

And I also FULLY concede my total lack of hardcore stats ability. That’s why I called in Annan, who is heavy on trend analysis and significance tests and the like. This stuff can be tricky even for the people who know all the math. Look at Briggs messing up the wishcasting for McCain-Obama. It’s EASY to plug and chug some math, but also easy to make the wrong assumptions and reach wrong findings from lack of careful logic.

In fact, since I lack the hard core stats ability, all I can do is look at things and say “does that make sense”, “is it consistent”, etc. That said, that was enough to show the issues with Figure 2/3 in MMH and Briggs’s boner on the McCain wishcasting:

http://wmbriggs.com/blog/?p=344

Invalidated versus “not invalidated” is back to the whole silly excluded middle debate. Douglass made a very strong assertion. Santer showed it flawed. Now others want the very strong counterassertion to be proven.

It’s as silly as “I can prove X”. How? “Well, you have not proven not-X”. Therefore X is proven. I see this crap on the Internet all the time! GRRRR.

Well, this idea of the ensemble of models comes from the modeling community and it doesn’t pass the smell test (to me anyway) right off the bat.

Here are some questions:

–which models are better than others?

–how do we test them? can we test them?

–do we agree that smaller error bars are better (as long as they encompass natural variability)

–why don’t we just use the best models?

–if some models are better than others, why would we equally weight them?

–if they are highly correlated, why would we use 23 or 45 of them?

The suspicion has to be that a wide spread of models is being used to capture any possible medium term observational outcome. Essentially that the weaknesses of the models are being hidden by smearing out any possible scrutiny on anything but the properties of some vague “ensemble”. And, when anyone from the outside examines any particular “ensemble”, even if chosen by the IPCC (insiders/modelers), the composition of that ensemble is criticized by the same insiders/modelers (thimble, meet pea). An amusing new phenomenon in insiders criticizing themselves as they lose track of the pea.

Doesn’t look real tight.

TimG:

“Well, the question that everyone wants to know is whether these models provide useful insights into the future.”

I think that models can provide useful insight into specific processes even when totally useless at predicting the future-at-large. But given that, I agree that AOGCMs are trying to do more (and are presented as doing more) than just giving narrow insight into particular processes.

P/TCO, I feel the same way. My intuition tells me that MMH10 isn’t presenting anything useful when it ‘invalidates a model mean.’ Trying to read meaning into that, it sort of comes out as ‘the model mean includes poorly predictive models.’ Which is what I wanted to get a closer look at.

Zeke: That probably means getting into the CMIP data directly. I am probably on the cusp of being able to do that now that I have some tools to handle netcdf files. You might want to check out the MMH SI which is composed almost entirely of STATA code and data.

Suggested reading:

http://julesandjames.blogspot.com/2010/01/reliability-of-ipcc-ar4-cmip3-ensemble.html

http://julesandjames.blogspot.com/2010/05/assessing-consistency-between-short.html

http://julesandjames.blogspot.com/2010/06/when-is-mean-better-than-all-models.html

http://julesandjames.blogspot.com/2010/08/ipcc-experts-new-clothes.html

“It’s as silly as “I can prove X”. How? “Well, you have not proven not-X”. Therefore X is proven. I see this crap on the Internet all the time! GRRRR.”

You mean like the modelers argument:

I can prove that temperatures will rise 4 degrees C in 100 years. How? Because you have not proven that my high error bar models are not invalid over the last 20 yr period?

Thanks for making the broader and correct point, however unintentionally.

Mesa

“Alternatively CO2 is not the only game in town, against all the protestations from Gavin et al.”

That’s a stupid strawman. If you actually ask Gavin what games are in town, this is what you’d get

http://data.giss.nasa.gov/modelforce/

This is the key point in the Annan methodology (which is fine):

“The question of reliability of the ensemble then simply amounts to asking whether these uncertainties are well-calibrated or not – which as we have shown, is an eminently testable hypothesis (at least in respect of current and historical data) and does not require anyone to “imagine” such bizarre and spurious constructions as the “space of all possible models”. ”

It doesn’t appear from the results that the calibration of many of the models is particularly good, therefore the value of the ensemble is questionable.

TCO:

Seems like you’re getting also frustrated with the “gotcha” and “no you didn’t” game.

It’s rather distracting. Douglass and McIntyre are going for the “gotcha” of trying to get the whisker bars to not overlap each other. In the process, they make poor decisions themselves. The scientists respond by focusing on those poor decisions. But all the while, regardless of whether the whiskers overlap or not, there’s more of a discrepancy there than anybody can just be happy with.

CE – you know perfectly well that the high CO2 sensitivity is what drives the drift on these models over the past 50 yrs – that is not in question.

But the climate sensitivity isn’t coded in. It arises from the multiple pieces of physics and some parameterization (eg clouds). Some of the physics might be wrong. Some of the parameterization may be oversimplified. Or – most likely imo – something is left out.

I think another possibility is one I mentioned above – methodological. Something in the spatial or temporal gridding is off, or something in the way that the atmos, oceanic, ice and land components are synchronized.

Dunno. This is all unknown country to me.

This from Annan on possible model problems – seems reasonable:

# Natural variability – the obs aren’t really that unlikely anyway, they are still within the model range

# Incorrect forcing – eg some of the models don’t include solar effects, but some of them do (according to Gavin on that post – I haven’t actually looked this up). I don’t think the other major forcings can be wrong enough to matter, though missing mechanisms such as stratospheric water vapour certainly could be a factor, let alone “unknown unknowns”

# Models (collectively) over-estimating the forced response

# Models (collectively) under-estimating the natural variability

# Problems with the obs

Yes the CO2 sensitivity is not just a plugged number – that is a fair point.

CE – two references to “McIntyre” and “gotcha”. You presumably are aware that the lead author is McKitrick and Herman is the third author.

This looks like an unhealthy fixation.

On your diagrams, again, Chad (following Santer) did very similar ones here. Example

Carrot, I agree — there appears to be a systematic discrepancy. But where does it come from? Putting it on the models is tempting, but consider that all four observational time series (of which I only really trust, sort-of, RSS) are sampling the same single real climate system, and its same, single realization of natural variability over these 30 years. Suppose it just happens to be trending down?

Mesa,

You are extremely confused. To a first approximation, the sum of all forcings is indeed the appropriate thing to look at. If a model is highly sensitive to CO2, it’s also going to be highly sensitive to solar. It’s really just a matter of how big each forcing is.

Per GISS, the models are similarly sensitive to all forcings. Black carbon aerosols are an outlier on the low side; most things are roughly as effective as CO2, if it were present in equal amounts of forcing, W/m^2. Methane is on the high side, and I think tropospheric aerosols are high.

Gavin’s Pussycat: That’s one thing – the observations are themselves only one realisation. However we don’t have access to the entire ensemble of reality. That’s why I’d want to see the multiple runs from a single model, to at least get that side – given the same physics in the same model, what kind of spreads can you get.

It would be useful to express the charts above in terms of tropical tropospheric amplification. Because again, that’s what we’re talking about in the end. This is observed on the short time scale, but it’s harder to see on the long time scale. But even in the models, there is a spread in the predicted amplification.

If I was going to come at this from the point of view of checking the observations, that’s how I’d think about it. Why is this effect seen in the short term, but not the long term? Is there a physical reason why that would be the case? Is there a measurement artifact, with how different satellites are stitched together, that you lose elements of the long term trend?

TCO,

I see what you mean, but the “silly excluded middle debate” is not, at least in logic, that you have “not proven not-X”. It’s that you deduce X after proving that you can’t have not-X. That inference relies on the law of excluded middle: either you have X or not-X.

This debate is not as silly as it sounds, or else you’re ready to reconstruct most of the mathematics that has been done to date.

What you have in mind is the burden-of-proof game. It’s another game. It’s not as silly either. Most arguments rest on that game.

Nick, thanks for the pointers to Santer and Chad’s earlier work. I’m not surprised that Chad had something out there. My posts yesterday were a fishing expedition for any links someone might have close at hand.

Here’s a link to a more recent TLT/TMT/TAS Santer-like analysis.

Thx Chad!

—

This all reminds me that Easterbrook was discussing uncertainty in the next generation:

http://www.easterbrook.ca/steve/?p=1758

Agreed. I mean the burden-of-proof game, and the place it goes wrong is where people say “I have proven X”, as no one has proven “not-X”.

I don’t mind a nuanced argument over who has the burden of proof. I do object to the fallacy above.

I mean could I say (pre Wiles), I’ve disproven FLT? Based on no one having proven it yet? Haha!

http://video.google.com/videoplay?docid=8269328330690408516#

James Annan would be the first to tell anyone that climate sensitivity is between 2 and 4 degrees, regardless of the particulars of the tropical troposphere.

http://www.jamstec.go.jp/frcgc/research/d5/jdannan/GRL_sensitivity.pdf

Ask James several times in a row. Then take the SE of the mean on his answers.

2-4, 2-4, 2-4, 2-4, 2-4….

WOW! Now our guess of climate sensitivity has narrowed to 2.9 to 3.1! {/MMH whisker mistake}

wooohooo!
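The joke in numbers (all values invented): treating n repeats of the same answer as independent samples shrinks the standard error of the mean by 1/sqrt(n), even though no new information has arrived.

```python
import math

def se_of_mean(sd, n):
    # standard error of the mean -- valid only for *independent* samples
    return sd / math.sqrt(n)

sd_one_answer = 0.5  # hypothetical sd attached to a single "2 to 4" answer
print(se_of_mean(sd_one_answer, 1))   # 0.5
print(se_of_mean(sd_one_answer, 25))  # 0.1 -- the spurious "2.9 to 3.1"
```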

There is an awful lot of confusion here (and on numerous other blogs) about what is being compared and what is sensible to compare. What MMH10 have done is compared the model mean trend with the observed trend, and “shown” (I’ll come back to that) that they are significantly different. As Annan points out, this should not be a surprise. The observed trend is bound to be different to any single trend from an individual model, or a mean from any number of models. The only reason for doing this test is because it is what the IPCC said should be the case – that the model trends should have a distribution centered on the truth. This is clearly not a sensible statement. If you have enough models you will be able to estimate the mean accurately enough to show it isn’t the same as any other modeled or observed series.

The essential problem with the models is that they are deterministic, so only give a single result. Without any measure of uncertainty this is not helpful, because it is guaranteed to be different to the truth. Modeller A says the increase in the LT temperature in the next 20 years will be 0.2134C, and he is wrong. End of story. If he says it will be 0.2134 +/- 0.1, then he might be right, and in 20 years time we can find out. Or if he is “predicting” the past we can find out right now. The trouble is, the predictions never come with this kind of uncertainty (I’ll come back to mmh’s sds later), hence the attempt to say the collection of models form a random distribution. This is a somewhat dubious assertion, because there is nothing random about any of the models, but nevertheless it does provide some means of quantifying uncertainty.

There are then two hypotheses that could be made and tested. One is that the distribution is centered on the truth. As we have seen this is immediately rejected, although the test does enable you to say that on average the models overestimate the trend.

The other is that the distribution contains the truth. This is done by comparing the modeled mean to the truth using the standard deviation rather than the standard error, and would mean calculating the probability of a randomly chosen model from the distribution giving a trend at least as extreme as the truth (the p-value). This hypothesis isn’t immediately rejected as being obviously false, but nor is it very informative. If it is rejected then certainly you can say that the models aren’t doing a good job, but if it isn’t rejected all you are saying is that somewhere among your models (including those that don’t actually exist yet) is one or more that are close to the truth. You don’t know which models these might be, and if you create enough variability among your models it is bound to be true.
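The SE-versus-SD distinction between those two tests can be sketched with invented numbers. This is only an illustration of the point, not MMH’s actual computation:

```python
# (1) "distribution centered on truth": compare model mean to obs using the
#     standard error of the mean, which shrinks as models are added;
# (2) "distribution contains truth": compare using the standard deviation,
#     which does not shrink -- how extreme is the obs among the models?
# All trend values below are made up for illustration.
import math
import statistics

model_trends = [0.28, 0.22, 0.31, 0.25, 0.30, 0.27, 0.24, 0.29]  # hypothetical
obs_trend = 0.12                                                  # hypothetical

mean = statistics.mean(model_trends)
sd = statistics.stdev(model_trends)
se = sd / math.sqrt(len(model_trends))

z_centered = (mean - obs_trend) / se  # test (1): large -> "centered" rejected
z_contains = (mean - obs_trend) / sd  # test (2): smaller, harder to reject
print(z_centered, z_contains)
```

Adding more models shrinks `se` but leaves `sd` alone, which is exactly why test (1) is almost guaranteed to reject while test (2) may not.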

What MMH attempt to do is to find the uncertainty in the modeled trends and the observations by assuming they are autoregressive time series. In other words, they are realizations from a random process which has some true but unobserved linear trend term and an autocorrelated random error. The standard error they calculate is a measure of the range of possible trends that are consistent with the data given this model. I think this is wrong for different reasons for the models and the observations.

It is wrong for the models because they are plainly not random processes. They are produced by a set of mathematical equations, and the result could not have been anything other than what it was. They may look similar to an AR1 or AR6 process, but they look similar to any number of processes. In particular they look similar to a long-memory process with no trend term at all, and also to one with a very large trend (imagine a very smooth series with long periods of increase then of decrease, then add an AR process to it). The standard error derived is totally dependent on the choice of time series model, and since this isn’t a random process, there is no correct choice.

For the observations all the above applies, but there is a bigger problem: Modelers often say that the observations are just one random realization of the “climate”, which is really a way of saying that their model is correct but the world is wrong. In some philosophical sense the observed climate may well be one random version of a bigger reality, but it is the one we are interested in. This means it is the one we want to predict. There is therefore no uncertainty (apart from measurement error) in the observed trend. If the models get it wrong, they get it wrong.
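To make the "standard error depends on the assumed error process" point concrete, here is the common AR(1) "effective sample size" adjustment for a trend (a Santer-style approach, not MMH’s estimator, which allows up to six autoregressive terms in a regression framework). The function name and series are invented, and the adjustment assumes n_eff > 2:

```python
import math

def trend_with_ar1_se(y):
    """OLS trend of y against time index, returning (slope, naive SE,
    AR(1)-adjusted SE). The adjustment uses the effective sample size
    n_eff = n * (1 - r) / (1 + r), where r is the lag-1 autocorrelation
    of the residuals. Valid only when n_eff > 2."""
    n = len(y)
    x = list(range(n))
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss = sum(e * e for e in resid)
    se_naive = math.sqrt(ss / (n - 2) / sxx)
    r = sum(resid[i] * resid[i + 1] for i in range(n - 1)) / ss
    n_eff = n * (1 - r) / (1 + r)
    se_adj = se_naive * math.sqrt((n - 2) / (n_eff - 2))
    return slope, se_naive, se_adj
```

Assume a different error process (AR6, long memory) and the same data yields a different SE, which is snowrunner’s point: the number is a property of the assumed process as much as of the series.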

“The essential problem with the models is that they are deterministic, so only give a single result. ”

This really isn’t true. The whole point of doing multiple runs for an initial condition ensemble is to get a variety of results from the same model.

Thank you. snowrunner. That was well written, interesting, and informative.

I tripped over your ‘no uncertainty in the observed trend’ for a while before the parenthetical ‘apart from measurement error’ sunk in.

But I’m going to have to chew further on this statement: “The standard error derived is totally dependent on the choice of time series model, and since this isn’t a random process, there is no correct choice.”

“The essential problem with the models is that they are deterministic, so only give a single result.”

This really isn’t true. The whole point of doing multiple runs for an initial condition ensemble is to get a variety of results from the same model.

Maybe I’m wrong here, but I think what snowrunner is saying is that for a given set of initial conditions, the model will return the exact same result no matter how many times you run it. It’s the variation in the initial conditions that gives a variety of results in ensemble forecasting.

Maybe that’s what he meant, but it wasn’t clearly written there. That’s why I pointed out that you can get an initial condition ensemble.

It is true that if you change the initial conditions you will get a different result from the same model. However, there is nothing stochastic about the models, so as JR says you will always get the same result from the same set of initial conditions (and parameters). If you wanted to (and were brave enough) you could run your model with a range of different initial values, calculate some measure of uncertainty and say your prediction was x +/- y. This would then be testable (and would certainly be better than just giving a point value). This hasn’t been done in the MMH paper though, and in any case it is unlikely to capture the full uncertainty in the predictions because this depends on how the process is modeled, and because there is often no real way of knowing how much to vary the initial values. The point however is that doing multiple runs is in principle exactly the same as taking runs from different models, so in the absence of any measure of uncertainty in individual runs they might as well all be regarded as new models.
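A toy illustration of that point, with a logistic map standing in for a deterministic model (everything here is invented for illustration): identical initial conditions reproduce a run exactly, while tiny perturbations generate an ensemble from which an x +/- y summary can be computed.

```python
import statistics

def run_model(x0, steps=50, r=3.9):
    # Deterministic "model": the chaotic logistic map. Same x0 -> same output,
    # every time; there is nothing stochastic inside.
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

# Perturb the initial condition slightly to build a small ensemble.
ensemble = [run_model(0.5 + 1e-6 * k) for k in range(10)]
mean = statistics.mean(ensemble)
sd = statistics.stdev(ensemble)
print(f"{mean:.3f} +/- {sd:.3f}")
```

As snowrunner notes, how much to perturb the initial values is itself a judgment call, so even this kind of spread need not capture the full prediction uncertainty.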

It’s interesting that they just used the time series process to try to determine trend uncertainty. (Like a student just grabbing an equation to use it whether or not it makes sense.)

I’d also be very wary of them taking complicated noise and modeling trends as being noise. If you look at their GRL05 paper, they (probably) overmodeled the noise, by making “red noise series” that were almost copies of the proxies (from having so many parameters). Basically what they do is assume long-term persistence.

Re models: there is the uncertainty regarding the input. A particular value is used for various inputs over time (eg solar) but there is uncertainty regarding the real world, both in measurement uncertainty and in future uncertainty (eg the solar sunspot cycle – although I don’t know if models bother with the ~0.1% forcing fluctuation of the sunspot cycle – but you get the idea. What is the uncertainty surrounding future anthropogenically sourced CO2?).

And then there is uncertainty surrounding the modelled physics. Cloud parameterization, for instance.

And then there are known errors accepted for the sake of modeling – methodological errors produced by truncations, gridding, averaging, …

A time series standard error may not be the appropriate one to apply to models – or even to the time series of predictions/projections output by the model – but surely there must be some way to speak meaningfully of “model uncertainty.”

Now we are more firmly in the territory that Easterbrook was discussing

http://www.easterbrook.ca/steve/?p=1758

snowrunner: but the stochastic nature you are looking for comes precisely from varying the initial conditions. this is why some modeling groups submit multiple runs for otherwise the same conditions. Do enough of them submit enough multiples to allow the calculation you describe? Not really.

Computing resources are limiting. If it took only a couple hours to run one of these models, I suspect you’d see more use being made of initial condition ensembles.

carrot, I’m a little puzzled by the tight distribution of trend data for each model ( see Mc’s box plot)

The whole issue of comparing models to observations suddenly became clearer and more confusing in one swell foop.

The side show of Santer/Douglass/MMH is less interesting to me. I think Wigley called Douglass’s work a piece of fraud. Man, the Santer food fight could top the Mann food fight.

I want a cage match with Santer/Gavin/M&M/Annan/Briggs/VS

THAT would be a kick-ass discussion. Provided it didn’t get personal.

Failing that, now that people can get the model data, mass confusion can reign.

——–

rb: Just FYI, a yellow-flag was thrown due to the use of the word fraud.

Can you explain why McI has much tighter box plot whiskers than what you show for your 2SD bars? Is he messing up again by some time series stuff, rather than just looking at the run variation?

Note: regardless of any of the above, his plots of different models would actually show that the models are sampling DIFFERENT populations from each other. Therefore combining them is idiotic. And therefore his whiskers on Figures 2 and 3 are moronic. And he can twist and squiggle and try to blame Santer or Douglass… but he made Figures 2 and 3, and he needs to correct them. As it is just logically idiotic!

I’m talking about this page. Look at GISS-ER on your chart and his. Why’s his so much narrower than yours?

http://climateaudit.org/2010/08/11/within-group-and-between-group-variance/#more-11777

I’m referring to this chart of his:

http://climateaudit.org/2010/08/11/within-group-and-between-group-variance/#more-11777

In your plot, GISS-ER (LT) has 2SD from about 1.2-3.8. McI’s plot has it at

If I’m following this right, he is constructing a standard deviation from the mean of the results of multiple runs of the same model. (In this case, 4 runs minimum.) I think this is similar to snowrunner’s frame above, where each individual model has no uncertainty, but a mean with variance can be constructed by multiple runs of the same model. In this treatment, the trend is handled as if the trend itself had no uncertainty.

I am presenting 2sd for each model (with or without multiple runs) taken straight from MMH10 Table 1. I frankly do not know how that sd was calculated (yet). I suppose I should dig deeper. Too many things to do – not enough time. Maybe Chad can chime in. What I *think* is being shown is the variance in the trend when the trend is treated as a time series.

From the caption for MMH10 Table 1: “LT and MT trends based on linear regression allowing 6 autoregressive terms. Std errors in brackets. * significant at 10%. ** significant at 5%.”

… and the CMIP data portal is offline. 😆

Well, if we just plotted the actual data points of the run trends, no? How can you do an SD of the mean of an individual model? You can do the std dev of the trend. Right?

I suspect he is screwing up again. And also that he is being misleading to the viewers when he says he is doing a “first thing” “standard boxplot”:

“Let me start with a very standard boxplot from a standard mixed effects (random effects) program, produced from a data set (calculated from Chad’s collation) consisting of the 1979-2009 T2LT trends for 57 A1B runs stratified by 24 models, showing here a boxplot of the 6 models with 4 or more runs. Once the trends are calculated and collated (and I’ve placed this data online), this figure can be produced in a couple of lines with standard software. It is one of the very first things that a statistical analyst dealing with stratified populations ought to look at.”

And I suspect (look at some of the commenters’ remarks, for instance those about ENSO) that he is conveying a false sense of the run-to-run variance…

If you look at the table he is reading his trends from …

http://www.climateaudit.info/data/models/santer/2010/info.runs.csv

… you can see that no trend error is included.

I don’t think he is screwing up. He is showing that trends from a single model don’t vary much from one another under the various CMIP3 forcing scenarios.

Is he deliberately trying to conflate the lack of variance between the trends of a single model’s various runs and the variance associated with the trends themselves? :shrug: Who knows? I think that showing the tight intra-model variance in the same graph as the observations invites confusion.

Is he showing the uncertainty of the trend itself? No. I don’t think so. I think that is what is displayed in my charts above. But I am going to have to figure out how to retrieve and read the archived CMIP data sets to prove it. Which isn’t going to happen soon. Especially if the CMIP archive remains off-line and I don’t want to pull too many overnighters. 😀

Here’s what he says in discussion:

“Here’s what sticks out for me like a sore thumb from the boxplot. There is a lot of between-group (i.e. model) variance, but very little within-group variance. For some models, there is negligible difference between trends from different runs.”

So if he is showing run-to-run variation for GISS-ER, what are you showing? And what are the run trend numbers themselves? Like the 4+ actual trend numbers?

Do you think he is confusing variation for REPEATED runs (i.e. doing the same exact thing over again), with run to run (alternate starting conditions) variations?

I tried looking at the spreadsheet but can’t tell how to read it. There are no headings and all the numbers are run together.

Re #56. Dunno. (I seem to be saying that a lot lately!)

From MMH10

This suggests that the various runs are perturbed initial conditions.

http://www-pcmdi.llnl.gov/ipcc/time_correspondence_summary.htm

Some of the comments are pretty funny:

“run on a different platform, equivalent to using slightly different atmospheric initial condition”

I notice that some of the different runs are just started from different seasons or years.

So I suspect that you’re right. Perturbed initial conditions – not perturbed forcings/scenarios.

Perturbed initial conditions should be fine.

So, where do your wide SD lines come from?

OK. I just looked at the MMH data table. Unfortunately, the individual run trends are not shown. But for instance for GISS-ER, LT he gives the mean trend as .258 with an SD of .065. So double the standard deviation would be .13, and we should expect a spread from about .13 to .39. Which is exactly what your whisker plot shows. So what the heck is he drawing??
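To double-check that arithmetic, a two-line Python sketch (the .258 and .065 are the Table 1 numbers quoted above):

```python
# GISS-ER (LT) from MMH10 Table 1: mean trend 0.258, SD 0.065.
# A 2-SD band is mean +/- 0.13.
mean_trend, sd = 0.258, 0.065
lo, hi = mean_trend - 2 * sd, mean_trend + 2 * sd

print(round(lo, 3), round(hi, 3))  # roughly .13 to .39
```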

This is a snippet from Santer (linked by McI)

http://www.climateaudit.info/data/models/santer/2010/info.runs.csv

And this is snipped from his posted code

The key code is here:

(Model$tropo_mean[j] - Trend[i, "trend"]) / sqrt(Model$tropo_sd[j]^2 + Trend[i, "se"]^2)

Trend is the satellite (RSS/UAH) trend

So to pull out the key eqn …

(model_mean - sat_trend) / sqrt(model_sd^2 + sat_se^2)

The model_sd is, as stated above, the standard deviation of the run trends about the model mean. It does not, as far as I can tell, include any information about the sd of each trend itself.
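As a sanity check, here is a Python transcription of that R expression (the numeric inputs below are placeholders I made up, not values from the paper or the code):

```python
import math

def t_like_stat(model_mean, model_sd, sat_trend, sat_se):
    """Difference between the model-mean trend and the satellite trend,
    scaled by the combined spread: the run-to-run SD of the model trends
    and the standard error of the satellite trend."""
    return (model_mean - sat_trend) / math.sqrt(model_sd**2 + sat_se**2)

# Placeholder numbers for illustration only.
print(t_like_stat(model_mean=0.258, model_sd=0.065,
                  sat_trend=0.150, sat_se=0.060))
```

Note that the only trend uncertainty entering the denominator is the satellite side's standard error; on the model side it is purely the spread of run means.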

To my eyes, what McI is showing is that small perturbations in the initial conditions of a particular model leads to only small changes in the output trends.

But the whole paper was done under one scenario. Run-to-run MEANS different perturbing conditions. So what the heck does the (ST DEV) underneath the average trend mean, when we have one model and show an average trend over its different runs?

I understand your plot from table 1. Not McI.

The std deviations in MMH cannot illustrate variations due to random initial conditions, because most of the models only have one run.

I don’t understand how these std devs are calculated. I suppose it has something to do with snowrunner’s comment about AR(1) processes – do they just generate lots of AR(1) random series with the observed autocorrelation from each model? I’m not sure where this is done in the paper.
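To make that AR(1) guess concrete, here is one way such a calculation *could* look — this is my speculation, not anything documented in MMH10: simulate many trendless AR(1) series with a chosen lag-1 autocorrelation, fit an OLS trend to each, and take the spread of the fitted slopes as the trend's standard deviation. All parameter values below are invented.

```python
import random
import statistics

def ols_slope(y):
    """Ordinary least-squares slope of y against its index 0..n-1."""
    n = len(y)
    xbar = (n - 1) / 2
    ybar = sum(y) / n
    num = sum((i - xbar) * (yi - ybar) for i, yi in enumerate(y))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

def ar1_trend_sd(phi, sigma, n=372, nsim=500, seed=0):
    """SD of OLS slopes fitted to simulated trendless AR(1) series
    y[t] = phi*y[t-1] + noise; n=372 is 31 years of monthly data."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(nsim):
        y, prev = [], 0.0
        for _ in range(n):
            prev = phi * prev + rng.gauss(0.0, sigma)
            y.append(prev)
        slopes.append(ols_slope(y))
    return statistics.stdev(slopes)

print(ar1_trend_sd(phi=0.6, sigma=0.1))
```

Whether anything like this is what generated the Table 1 std errors, I have no idea.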

Good point. But then Table 1 is messed up. What an abortion of a paper.

Just FYI, with fresh eyes this morning and a second look at the code, McI’s box plots are just that – simple box plots. The box plot shows the max and min of the data set (ends of the whiskers), the 25% and 75% quartiles (bottom and top of the box), and the median of the data set (thick middle line). In this case, “the data set” is the individual runs of a single model. Nothing fancy is being shown in McI’s graph.
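For concreteness, the five numbers a standard boxplot displays, computed for some made-up run trends of a single hypothetical model (note that Python's `statistics.quantiles` and R's default quantile method can differ slightly; this just illustrates the idea):

```python
import statistics

# Invented trends for the runs of one model; with this few points the
# whiskers simply reach the min and max.
run_trends = [0.22, 0.24, 0.25, 0.26, 0.27, 0.29]

lo, hi = min(run_trends), max(run_trends)          # whisker ends
median = statistics.median(run_trends)             # thick middle line
q1, _, q3 = statistics.quantiles(run_trends, n=4)  # box edges

print(lo, q1, median, q3, hi)
```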

The eqn pulled out above is not the one generating the box plots of the intra-model mean shown at McI’s post. That is being used in the discussion of stratification.

Steve is just following up on his earlier comments about the intra-model (within-model) means. And what that shows is that the changes in initial conditions in the various model runs (perturbations) do not result in large changes in the trended output (a narrow range of results). To put it another way – the models are relatively insensitive to perturbations in their initial states.

It says nothing about sensitivity to forcing scenarios. That is a different but also interesting question.

How the heck do they show an average trend (standard deviation) for models with only one run? Obviously, the standard deviation is not based on a conventional sampling statistic. I wonder if the averages are even averages! What a screwed up paper.

Lazar is pecking away at Ross to get the SE definition. The whole thing is bizarre. They use unconventional definitions of the standard error, and don’t highlight that they are doing so in the paper’s methods. It’s all buried in linear-algebra gobbledygook. I think they might even have got themselves confused in time-series land and forgotten how to take a normal sampling statistic of multiple trends!

Chad is nowhere to be seen. Steve blustered, screwed up, and locked a thread. Ross will have to be tortuously pinned down. What crap these guys do.

Something stuck in your craw, TCO?

The CMIP data store is back online, but I am returning my attention to GSOD and some other projects, even though the stats in MMH10 and MW10 fascinate me.

So much cool stuff to study and learn. Too little time.

Maybe I need to retire so I can get some work done!

I’m sick of the idiots. Very burned out on the Neverending Audit.

Learn to read between the rants.

For me, that usually means following specific links rather than browsing emotionally charged blogs. More likely to be led to something interesting that way. That’s part of the reason I’m trying the Friday round-ups – identify and promote technical blogging. Let the belly-button-gazing and deception-driven-policy wankers stew in their own mess.