
Mosher: Deviant Standards?

2010 September 5

Introduction

Anthony Watts has picked up and posted some investigation by Steve Mosher (here and here) into the effects of the 5 standard deviation filter used by CRUTEM. Steve was looking into the 5SD filter to see whether it could explain the differences between his global averaging and CRUTEM. It doesn't. To see why, let's take a brief look at the effects of the 5SD filter within CRUTEM.

UKMET CRUTEM code and data

In December 2009 and January 2010, the UKMET released some CRUTEM-like global gridding and averaging code to use with the releasable portion of the CRU data set.

Website:
http://www.metoffice.gov.uk/climatechange/science/monitoring/subsets.html

Code:
station_gridder.perl
make_global_average_ts_ascii.perl

Data
All_Jan_2010.zip

Comparing UEA CRUTEM3vGL and UKMET CRUTEM code and data

So does the UKMET CRUTEM-like code and data actually produce CRUTEM-like results? I compare the UEA CRUTEM3vGL global average with the global average produced by the UKMET code.

CRU Code comparisons

It’s a pretty strong match post-WWI. That huge swing in the anomalies in the beginning of the century is WWI. For reference to Mosher’s study, I’ve circled the year 1936.

Comparing UKMET CRUTEM runs with and without the 5-SD filter

So we know that the UKMET CRUTEM is a good match to the official UEA CRUTEM3vGL. What happens if we compare a UKMET run using the 5-SD filter with one without it? It's pretty easy to do. The relevant code from station_gridder.perl is below.

        # Round anomalies to nearest 0.1C - but skip them if too far from normal
            if (   defined( $data{normals}[$i] )
                && $data{normals}[$i] > -90
                && defined( $data{sds}[$i] )
                && $data{sds}[$i] > -90
                && abs( $data{temperatures}{$key} - $data{normals}[$i] ) <=
                ( $data{sds}[$i] * 5 ) )
            {
                $data{anomalies}{$key} = sprintf "%5.1f",
                  $data{temperatures}{$key} - $data{normals}[$i];
            }
        }
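For readers less comfortable with Perl, the same filtering logic can be sketched in Python. This is an illustrative translation, not CRU's code; the function name is mine, and only the -90 missing-value convention and the 5-SD test are taken from the excerpt above.

```python
# Sketch of the 5-SD filter above, translated from the Perl excerpt.
# The -90 missing-value flag mirrors the CRU code; everything else
# here is an illustration, not the actual CRUTEM implementation.

def anomaly_if_within_5sd(temperature, normal, sd):
    """Return the anomaly rounded to 0.1 C, or None when the normal or
    SD is missing (flagged by values below -90) or the temperature lies
    more than 5 standard deviations from the normal."""
    if normal is None or normal <= -90:
        return None
    if sd is None or sd <= -90:
        return None
    if abs(temperature - normal) > 5 * sd:
        return None  # too far from normal: the record is skipped
    return round(temperature - normal, 1)

print(anomaly_if_within_5sd(21.0, 15.0, 1.0))  # None (6 SD away, dropped)
print(anomaly_if_within_5sd(19.0, 15.0, 1.0))  # 4.0 (4 SD away, kept)
```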

A bit of commenting-out and we are ready to make a run without the 5-SD filtering.

        # Round anomalies to nearest 0.1C - but skip them if too far from normal
            if (   defined( $data{normals}[$i] )
                && $data{normals}[$i] > -90 )
                #&& defined( $data{sds}[$i] )
                #&& $data{sds}[$i] > -90
                #&& abs( $data{temperatures}{$key} - $data{normals}[$i] ) <=
                #( $data{sds}[$i] * 5 ) )
            {
                $data{anomalies}{$key} = sprintf "%5.1f",
                  $data{temperatures}{$key} - $data{normals}[$i];
            }
        }

And we are ready to compare the two …

CRU SD comparisons

Comparing UKMET CRUTEM code with GHCN and GHCN_adj

So why is Mosher having trouble matching MoshTemp with CRUTEM? Probably because he is using the GHCN unadjusted data set while CRUTEM uses a homogeneity-adjusted data set. I haven't looked hard at MoshTemp yet, but this was the result of a comparison using the sparser December 2009 UKMET CRUTEM data set.

UKMET CRU GHCN raw

and …

UKMET CRU GHCN adj

Discussion

The 5-SD filter makes almost no difference to the CRUTEM data set on a global averaging scale. I whipped together some quick Perl scripts to read the UKMET data set. Out of the entire released data set, there are about 3.6 million monthly records. Of those, 425 records are more than 5 SD above the norm and 346 records are more than 5 SD below it: only about 0.02% of the records. A note of caution, however. This is only the rate of 5SD exceedances in the released, processed data set. There may have been more in the preprocessed data. But maybe not. Jones and Moberg (2003) suggest that this issue affects less than 0.01% of the data, half of my count but still way down there.
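The arithmetic behind that 0.02% figure is simple enough to sanity-check:

```python
# Exceedance rate from the counts quoted above: 425 high-side plus
# 346 low-side rejections out of roughly 3.6 million monthly records.
high, low, total = 425, 346, 3_600_000
rate = (high + low) / total * 100
print(f"{rate:.4f}% of records fall outside 5 SD")  # 0.0214%
```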

On the other hand, if I have this right (and I may not), 5 sigma should only toss about 0.00005% of the data if the data followed a Gaussian distribution. Does this tell us something about the data?
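For reference, the two-tailed Gaussian expectation comes from the standard error-function identity P(|Z| > 5) = erfc(5/sqrt(2)). A quick check, a sketch using only the Python standard library:

```python
import math

# Two-tailed probability that a Gaussian value lands beyond 5 sigma.
p = math.erfc(5 / math.sqrt(2))  # P(|Z| > 5), about 5.7e-7
print(f"{p * 100:.7f}% expected beyond 5 SD")

# Observed rejection rate (~0.02%) versus the Gaussian expectation:
print(f"observed/expected ratio: {0.02e-2 / p:.0f}")  # a few hundred times
```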

My attention did drift to the 1914-1917 swing. Did WWI cool the planet? Or did changes in record taking/keeping cool the data set? Or is that just coincidence?

  1. 2010 September 5 at 11:35 am | #1

    And there goes my morning. If you have questions, I’ll unfortunately be offline most of the rest of the day. Or maybe fortunately! ;-)

  2. steven Mosher
    2010 September 5 at 1:58 pm | #2

    Thanks Ron.

    I also didn't compare to the variance adjusted figures (CRUTEM3 as opposed to CRUTEM3v)

  3. 2010 September 5 at 7:36 pm | #3

    The thing you want to look at further is the kurtosis of the data. They're, indeed, not gaussian. My experience with different, but related, data, shows that the tails are fat, i.e., the kurtosis is large, and you have more highly deviant values than you'd expect from a perfect gaussian. I also find that the temperature distributions are usually pretty skewed. In high latitude oceans, the long tail is towards warm (you can only get so cold), and in tropics, the long tail is cold (you can only get so hot). Your bulk statistics suggest a net skew towards hot. It'll probably be informative to break things down by area.

    There’s a modest industry in meteorology, particularly data assimilation, regarding what to do for or with non-gaussian data. Life gets very much more complicated if you try to deal rigorously with the non-gaussian behavior. The CRU 5SD filter, however, is not especially (imho, and supported by your results) concerned with gaussian vs. non-gaussian. The intent and result is to eliminate some exceptionally deviant observations. That it’s order 0.02%, rather than 0.00005%, isn’t especially important. At least in this sense: The aim was to eliminate in a consistent way the most exceptionally deviant observations. ‘most exceptionally deviant’ = 0.02%. The 5 SD is, imho, just a way of consistently eliminating the most extreme 1 in 5000 observations.

  4. Steven Mosher
    2010 September 5 at 9:57 pm | #4

    “So why is Mosher having trouble matching MoshTemp with CRUTEM? Probably because he is using a GHCN unadjusted data set while CRUTEM is using a homogenity adjusted data set. ”

    That's probably the best explanation, which then comes down to the question of why a homogeneity adjustment would get rid of record weather events. I'm more of a mind to leave the outliers in unless I have some firm evidence that it is a data error as opposed to an extreme weather event.

  5. Steven Mosher
    2010 September 6 at 9:04 am | #5

    Robert,

    I see no purpose served by eliminating the data, especially if I am planning on looking at correlation between time series for example. The overall effect is that you present a smaller CI. So, while the trimming may not bias the mean, it does present more confidence where there actually may be less.

  6. 2010 September 6 at 12:55 pm | #6

    Steve:
    Outliers can create or destroy correlations between time series. If the outlying data are actually of good quality, that’s to the good. If they’re nonsense, caused by any one of the infinite ways that data can be corrupted, then your conclusion regarding correlation being present or absent is equal nonsense.

  7. 2010 September 6 at 8:29 pm | #7

    Three things to do with outliers:

    1. Leave them. Bob suggests that can be problematic.
    2. Toss them. But you don’t want to toss out good data.
    3. Examine each of them to determine whether to leave them in or out.

    My problem with 3 occurs if it is a ‘manual’ examination. It makes it difficult to reproduce.

    Thanks for the kurtosis hint, Bob. I’ll take a look at that later.

  8. Steven Mosher
    2010 September 13 at 9:15 am | #8

    Steve:
    Outliers can create or destroy correlations between time series. If the outlying data are actually of good quality, that’s to the good. If they’re nonsense, caused by any one of the infinite ways that data can be corrupted, then your conclusion regarding correlation being present or absent is equal nonsense.

    I guess I should have been more clear. I see no reason to toss a 5 sigma event merely because it is a 5 sigma event, especially if a station 10km away has a 4.9865 sigma event. Ron puts the dilemma pretty clearly: if you do a manual inspection you have reproducibility issues.

    The QC screening process in the temperature data hasn't been covered in much detail.
