GSOD: Global Surface Summary of the Day

2010 May 23

On Real Climate on 28 November 2009, Gavin Schmidt wrote in a reply to a comment

If people were looking for a ‘citizen science’ project to work on, coming up with a way for the SYNOP data (available via WeatherUnderground etc.) to be made commensurate with the CLIMAT data (available via GHCN), would be a great one. There are some subtleties involved (definitions of daily and monthly means vary among providers), but that would provide an interesting back-up and comparison to the CLIMAT-derived summaries from GISTEMP, HadCRU or NCDC.

I looked briefly at the SYNOP data available at the OGIMET site, but OGIMET is not set up for bulk transfer of data. Only later did I find the GSOD data at NCDC. And it was later still before a post at Lucia’s Blackboard helped me put the two together, although I’m still not fully clear on the relationship between OGIMET synop and GSOD synop (does OGIMET draw on GSOD, or is it a parallel collection?).

Global Surface Summary of the Day

                     FEDERAL CLIMATE COMPLEX
                           VERSION 7
                  (OVER 9000 WORLDWIDE STATIONS)

Download here:

GSOD Station Locations

From the Readme Overview:

The following is a description of the global surface summary of day product produced by the National Climatic Data Center (NCDC) in Asheville, NC. The input data used in building these daily summaries are the Integrated Surface Data (ISD), which includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929, and are now at the Version 7 software level. Over 9000 stations’ data are typically available.

From the Readme Details:

Global summary of day data for 18 surface meteorological elements are derived from the synoptic/hourly observations contained in USAF DATSAV3 Surface data and Federal Climate Complex Integrated Surface Data (ISD). Historical data are generally available for 1929 to the present, with data from 1973 to the present being the most complete.

For some periods, one or more countries’ data may not be available due to data restrictions or communications problems. In deriving the summary of day data, a minimum of 4 observations for the day must be present (allows for stations which report 4 synoptic observations/day). Since the data are converted to constant units (e.g, knots), slight rounding error from the originally reported values may occur (e.g, 9.9 instead of 10.0).

The mean daily values described below are based on the hours of operation for the station. For some stations/countries, the visibility will sometimes ‘cluster’ around a value (such as 10 miles) due to the practice of not reporting visibilities greater than certain distances. The daily extremes and totals–maximum wind gust, precipitation amount, and snow depth–will only appear if the station reports the data sufficiently to provide a valid value. Therefore, these three elements will appear less frequently than other values. Also, these elements are derived from the stations’ reports during the day, and may comprise a 24-hour period which includes a portion of the previous day. The data are reported and summarized based on Greenwich Mean Time (GMT, 0000Z – 2359Z) since the original synoptic/hourly data are reported and based on GMT.

As for quality control (QC), the input data undergo extensive automated QC to correctly ‘decode’ as much of the synoptic data as possible, and to eliminate many of the random errors found in the original data. Then, these data are QC’ed further as the summary of day data are derived. However, we expect that a very small % of the errors will remain in the summary of day data.

A Note About This Draft

This is only a ‘first run’ – more ‘proof of concept’ than ‘published results.’ Because of the scale of the effort, I present the following more as ‘lab notes’ than rigorous method.

Preparing the GSOD Data

Download data from NCDC into directories binned by year

In each directory, untar and unzip the station files

Iterate through each directory creating a file list (aka a station list)

Iterate through each directory feeding the file list into the monthly mean R script.

The R script opens each station file and calculates the monthly means.

R script output is directed into one aggregate file using a USAF-WMO id number.
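The monthly-mean step can be sketched as follows. This is a minimal Python illustration, not the R script itself, which is not reproduced in this post. It assumes each daily record has already been parsed into a (YYYYMMDD, mean temperature in degrees F) pair, that 9999.9 flags a missing daily mean, and that a month needs a minimum number of valid days before its mean is kept. The `min_days` threshold and the function names are hypothetical.

```python
from collections import defaultdict

MISSING_TEMP = 9999.9  # assumed missing-value flag for the daily mean temperature


def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def monthly_means(daily_records, min_days=15):
    """Average daily mean temperatures into monthly means.

    daily_records: iterable of (yyyymmdd string, daily mean temperature in F).
    Returns {(year, month): mean temperature in C}. Months with fewer than
    min_days valid days are dropped (min_days is a hypothetical threshold).
    """
    by_month = defaultdict(list)
    for yyyymmdd, temp_f in daily_records:
        if temp_f == MISSING_TEMP:
            continue  # skip days flagged as missing
        year, month = int(yyyymmdd[:4]), int(yyyymmdd[4:6])
        by_month[(year, month)].append(fahrenheit_to_celsius(temp_f))
    return {
        key: sum(values) / len(values)
        for key, values in sorted(by_month.items())
        if len(values) >= min_days
    }
```

Writing one line per station-year from the returned dictionary would give the aggregate file described in these steps.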

The old station ids are used to read the station inventory and create a GHCN/GISS-formatted inventory file with new station ids.

Realize that while the 11 char USAF-WMO id is unique, it will confuse programs that expect the last 8 chars to be unique records of a common station.

The country codes are parsed out of the station inventory and are assigned arbitrary 3 digit numbers.

New station ids are created:
if USAF < 999999 then cccUUUUUU00
else if WMO < 99999 then cccWWWWW000

Duplicate station ids are located and manually reviewed. Most "dupes" are the same station with slightly differing names, latitudes, or longitudes. Stations with duplicate records differing by more than 30 minutes (roughly; this was a manual review) are eliminated (count ~400).

Once in proper format, ran the GSOD inventory through the GHCN meta data program.
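The id rule and the duplicate check in the steps above can be sketched like this. It is a Python illustration under stated assumptions, not the code actually used: the country code is taken to be an integer below 1000, "30 minutes" is read as 0.5 degrees of latitude or longitude, and the function names and the "same"/"conflict" labels are hypothetical. The real review was manual.

```python
from collections import defaultdict


def make_station_id(country_code, usaf, wmo):
    """Build an 11-character id: cccUUUUUU00 or cccWWWWW000."""
    if usaf < 999999:
        return "%03d%06d00" % (country_code, usaf)
    if wmo < 99999:
        return "%03d%05d000" % (country_code, wmo)
    return None  # neither number usable; this case is not covered by the notes


def find_duplicates(stations, max_diff_deg=0.5):
    """Flag station ids that occur more than once.

    stations: iterable of (station_id, name, lat, lon) in decimal degrees.
    Returns {station_id: "same" or "conflict"}, where "conflict" means the
    coordinates differ by more than max_diff_deg (assumed to be 30 arc-minutes).
    Longitude wrap-around at +/-180 is ignored in this sketch.
    """
    groups = defaultdict(list)
    for station_id, _name, lat, lon in stations:
        groups[station_id].append((lat, lon))
    report = {}
    for station_id, coords in groups.items():
        if len(coords) < 2:
            continue  # unique id, nothing to review
        lats = [c[0] for c in coords]
        lons = [c[1] for c in coords]
        spread = max(max(lats) - min(lats), max(lons) - min(lons))
        report[station_id] = "conflict" if spread > max_diff_deg else "same"
    return report
```

Ids flagged "conflict" would be the candidates for elimination. Ids flagged "same" still need one record chosen by hand.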


I chose to enter GISTEMP at STEP1. That was a mistake. You will need the additional southern stations to successfully create the “Zone 1” files in STEP2.

So go to STEP0 and do just the station adds, skipping the USHCN and Hohenpeissenberg steps. Hey, and sort these suckers while you are at it.

In STEP1, edit v2_to_bdb.py to ignore the mcdw, ushcn, and sumofday tables. Found a number of stations were giving this script a problem. Created a script to pull out the problem stations and looped until I had a clean run in STEP1. This removed about 6% of the stations.

By STEP2, v2.inv needs to be numerically sorted. Might as well have it done in STEP0. A handful of stations in zone 1 have eaten my lunch and too many hours. Turns out that there are some stations that only report in the summer. So it looks at first pass as if there are a whole bunch of good years in the problem stations, but there are no years suitable for zonal annual anomalies – which craps out in toANNanom.f. Similar story for Zone 2 and some buoys. Hand edited Ts.txt to remove a dozen or so offending stations and reran STEP1.
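The hand edit could be replaced by a screen that runs before STEP2. The sketch below is a Python illustration under a loud assumption: a year counts as usable only if at least 3 of its 4 seasons each have at least 2 monthly values. That is my reading of a GISTEMP-like rule, not something taken from toANNanom.f, so both thresholds are parameters. For simplicity December is grouped with the same calendar year, and all names are hypothetical.

```python
# Seasons as month groups; December is kept in the same calendar year (a simplification).
SEASONS = ((12, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11))


def usable_years(months_present, min_months_per_season=2, min_seasons=3):
    """Return the years that could yield an annual anomaly.

    months_present: {year: set of months (1-12) that have a monthly mean}.
    The thresholds are assumptions, not values read from the GISTEMP source.
    """
    good = []
    for year, months in sorted(months_present.items()):
        seasons_ok = sum(
            1
            for season in SEASONS
            if len(months.intersection(season)) >= min_months_per_season
        )
        if seasons_ok >= min_seasons:
            good.append(year)
    return good


def stations_without_usable_years(station_months):
    """station_months: {station_id: {year: set of months}}.

    Returns the sorted ids of stations with no usable year at all,
    such as stations that only report in the summer.
    """
    return sorted(
        station_id
        for station_id, by_year in station_months.items()
        if not usable_years(by_year)
    )
```

A station that reports June through August every year has plenty of months but no usable year, so it lands in the returned list and can be dropped before the zonal step.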

No issues in STEP3.


Draft Chart of GISTEMP with GSOD and GHCN

Draft Difference of GISTEMP with GSOD and GHCN

Draft GISTEMP GSOD Station Count


This post should be considered a prototype – a proof of concept. Don’t place high confidence in the results. I am posting it because the project is too big for me to continue without creating a ‘benchmark’ to work the next iteration against. There are numerous manual edits of files and odd filters used to get from the ‘raw’ GSOD data to the final GISTEMP output. The next step, now that I know where I’m going, is to return to the first step and automate the entire procedure. Several of these steps took several days to run on my limited hardware. It may take a month or more to deliver the next version.

Note the extreme station drop in 1972. I’m guessing an interruption in my download scripts.

My thanks to Dr Schmidt for the suggestion. While this might not be exactly what Gavin had in mind, it’s been an interesting trail to follow.

The GSOD data begins in 1929. There are about 12 non-GSOD stations that have records going back to 1905 or so which come into the data during the station adds in STEP0 from GISTEMP. But there aren’t enough stations with good temps for GISTEMP until the middle of 1938. Even then, there are wild divergences from GISTEMP with GHCN for a couple of years. I suspect insufficient spatial coverage creates wide error ranges.

A tip of the hat to O’Day’s Climate Charts and Graphs whose scripts I scavenged for graphing code.

  1. 2010 May 23 at 7:15 pm

    My laptop died multiple times posting this today! Dying fan. Thought I had lost the latest portions of this when my vbox and laptop crashed. All backed up now.

  2. carrot eater
    2010 May 24 at 4:37 am

    Wow, good job. I had briefly thought about doing something like this, but the sheer ridiculous volume of the data put me off.

    I don’t see it- can you please describe how you compute monthly means from the sub-daily reports?

    Note that this is complementary to what Roy Spencer did. Roy Spencer compared NH CRU with his new NH, found them to be pretty much the same, and didn’t know what to make of it. So then he repeated it for US only, found a difference between the two at the US-only level (which is to be expected since in the US, adjustments have some impact), and promptly declared that the surface station record must be unreliable and we should start all over again from scratch. Complained about CRU in various ways, without acknowledging that GISS already did everything he wanted. I wonder how far he’d have gone to make that declaration – if the US didn’t give him what he wanted, would he have gone on to Alabama-only?

  3. carrot eater
    2010 May 24 at 4:41 am

    Ha – when did you put that cheat sheet of formulas at the top right? You may as well add in the point of contention – the range of values of lambda.

  4. 2010 May 24 at 6:11 am


    (edit to remove the mkMonthlyMean.sh script. This script loops over the years 1929 to 2010 – but I’ve already included that loop in the R script – if you use the mkMonthlyMean.sh script to drive the R script, you will be doing 80 times more work than you need to!)

  5. 2010 May 24 at 6:18 am

    I put the formula up during my conversation with Dr. Judith Curry at Kloors. I suggested it could be the tool by which we could separate the wheat from the chaff when it came to blog science on AGW. I’m willing to discuss a range of sensitivities down to zero as long as you are able to discuss it in this framework – is the IPCC about right?, is there a leading tail on the range of possibilities?, a trailing tail?, historical evidence for lambda, theoretical?, modeled?, are there known non-linearities that make it nonsense?, unknown non-linearities?

    As I recall, she found herself unable to say unequivocally that she supported the formulation and was unwilling to discuss her objections concretely at the time.

  6. 2010 May 24 at 1:22 pm


    Good work! It’s always nice to have more data to play with.

    Your GSOD station count chart looks quite similar to the ISH station count chart I made awhile back: http://i81.photobucket.com/albums/j237/hausfath/Picture27.png

  7. steven Mosher
    2010 May 25 at 12:10 am

    Ron the other dataset that is interesting is this

    HadGHCND_paper.pdf

  8. steven Mosher
    2010 May 25 at 12:16 am

    At some point it would be cool to do a definitive post that puts to rest (one can hope) some of the spurious issues. Have

  9. 2010 May 25 at 5:40 am

    Got my hands full with this one, thanks! 😉

    Interesting paper. GHCND supplemented with a couple of additional stations in Africa and Greenland. Thank you for bringing this to my attention!

    You might recall that I dipped briefly into the GHCND data (more or less erroneously) when following up Watts’s claim that METAR coding problems were creeping into GHCN-Monthly. Had no idea that it had more stations. Still CLIMAT data, though.

    And of course, that brings us to the third lineage of surface-data: METAR

    Eventually someone is going to have to make all these available in a single format. And while the data sets are large, they aren’t huge by modern standards. Anyone out there want to put me on a stipend for a year?

  10. 2010 May 25 at 5:43 am

    You got cut off … but which spurious issues? The ones raised by McKitrick 7 years ago and echoed again by Smith, Watts, and D’Aleo? They won’t be put to rest. They have no interest in finding answers and will repeat the claims as long as they believe it will raise doubt, kick up dust, or muddy the waters in some audience. Their goal is obfuscation, not clarification. And there will be no end to it.

  11. 2010 May 25 at 7:05 am

    Just a note to anyone who downloaded the scripts. Don’t use mkMonthlyMean.sh to drive the R script. I had already moved the loop to the R script and you don’t need the sh script; you will be doing 80 times more work than necessary if you do use it! 🙂

  12. 2010 June 3 at 5:08 am

    Good work. I recently (as in weeks) made ccc-gistemp more robust against the sorts of input problems that you’ve come across. It already was more robust than the GISTEMP Fortran (for example, it doesn’t require an exact match between USHCN and GHCN station lists).

    As an example, you can run a set of records through ccc-gistemp that entirely excludes the southern hemisphere.

  13. 2010 June 3 at 6:39 am

    How have you guys handled the different urban markers: rural/urban, ghcn ABC, giss DMSP/RC brightness index?

    Can you flag the use of the different data sets?

    Have you looked at running ccc-temp with and without the ‘ice mask’?
