GSOD: Global Surface Summary of the Day
For anyone looking for a 'citizen science' project, a great one would be devising a way to make the SYNOP data (available via Weather Underground etc.) commensurate with the CLIMAT data (available via GHCN). There are some subtleties involved (definitions of daily and monthly means vary among providers), but it would provide an interesting back-up and comparison to the CLIMAT-derived summaries from GISTEMP, HadCRU, or NCDC.
I looked briefly at the SYNOP data available at the OGIMET site. But OGIMET is not set up for bulk transfer of data. Only later did I find the GSOD data at NCDC, and later still a post at Lucia's Blackboard helped me put the two together, although I'm still not fully clear on the relationship between OGIMET SYNOP and GSOD SYNOP (does OGIMET draw on GSOD, or is it a parallel collection?).
Global Surface Summary of the Day
FEDERAL CLIMATE COMPLEX GLOBAL SURFACE SUMMARY OF DAY DATA VERSION 7 (OVER 9000 WORLDWIDE STATIONS) 08/24/2006
From the Readme Overview:
The following is a description of the global surface summary of day product produced by the National Climatic Data Center (NCDC) in Asheville, NC. The input data used in building these daily summaries are the Integrated Surface Data (ISD), which includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929, and are now at the Version 7 software level. Over 9000 stations’ data are typically available.
From the Readme Details:
Global summary of day data for 18 surface meteorological elements are derived from the synoptic/hourly observations contained in USAF DATSAV3 Surface data and Federal Climate Complex Integrated Surface Data (ISD). Historical data are generally available for 1929 to the present, with data from 1973 to the present being the most complete.
For some periods, one or more countries’ data may not be available due to data restrictions or communications problems. In deriving the summary of day data, a minimum of 4 observations for the day must be present (allows for stations which report 4 synoptic observations/day). Since the data are converted to constant units (e.g., knots), slight rounding error from the originally reported values may occur (e.g., 9.9 instead of 10.0).
The mean daily values described below are based on the hours of operation for the station. For some stations/countries, the visibility will sometimes ‘cluster’ around a value (such as 10 miles) due to the practice of not reporting visibilities greater than certain distances. The daily extremes and totals–maximum wind gust, precipitation amount, and snow depth–will only appear if the station reports the data sufficiently to provide a valid value. Therefore, these three elements will appear less frequently than other values. Also, these elements are derived from the stations’ reports during the day, and may comprise a 24-hour period which includes a portion of the previous day. The data are reported and summarized based on Greenwich Mean Time (GMT, 0000Z – 2359Z) since the original synoptic/hourly data are reported and based on GMT.
As for quality control (QC), the input data undergo extensive automated QC to correctly ‘decode’ as much of the synoptic data as possible, and to eliminate many of the random errors found in the original data. Then, these data are QC’ed further as the summary of day data are derived. However, we expect that a very small % of the errors will remain in the summary of day data.
A Note About This Draft
This is only a ‘first run’ – more ‘proof of concept’ than ‘published results.’ Because of the scale of the effort, I present the following more as ‘lab notes’ than rigorous method.
Preparing the GSOD Data
Download data from NCDC into directories binned by year.
In each directory, untar and unzip the station files.
Iterate through each directory, creating a file list (aka a station list).
Iterate through each directory, feeding the file list into the monthly-mean R script.
The R script opens each station file and calculates the monthly means.
The R script's output is directed into one aggregate file using the USAF-WMO id number.
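The monthly-mean step above can be sketched as follows. This is a Python sketch rather than the R script actually used, and it assumes the whitespace-delimited field layout of the GSOD .op files (STN---, WBAN, YEARMODA, TEMP, count, ...) with 9999.9 as the missing-temperature sentinel; a real run would also want a minimum-day-count rule before accepting a month.

```python
# Sketch (Python, not the R actually used) of the monthly-mean step:
# parse a GSOD .op daily file and average TEMP by year-month.
# Field positions follow the GSOD readme: whitespace-delimited
# STN---, WBAN, YEARMODA, TEMP, count, ...; 9999.9 marks missing TEMP.
from collections import defaultdict

MISSING = 9999.9

def monthly_means(lines):
    """Return {(year, month): mean TEMP in deg F} from GSOD daily rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "STN---":   # skip the header row
            continue
        yearmoda, temp = fields[2], float(fields[3])
        if temp == MISSING:                       # drop missing days
            continue
        key = (int(yearmoda[:4]), int(yearmoda[4:6]))
        sums[key] += temp
        counts[key] += 1
    # Simplification: no minimum-day-count requirement is enforced here.
    return {k: sums[k] / counts[k] for k in sums}
```

Each station file would be fed through this and the results appended to the aggregate file under its USAF-WMO id.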
The old station ids are used to read the station inventory and create a GHCN/GISS-formatted inventory file with new station ids.
Note that while the 11-character USAF-WMO id is unique, it will confuse programs that expect the last 8 characters to be shared only by duplicate records of a common station.
The country codes are parsed out of the station inventory and assigned arbitrary 3-digit numbers.
New station ids are created:
if USAF < 999999 then cccUUUUUU00
else if WMO < 99999 then cccWWWWW000
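The id rule above can be sketched in Python. The function name is mine, and treating 999999 (USAF) and 99999 (WMO) as missing-value sentinels is my reading of the inventory, not code from the actual run:

```python
# Sketch of the 11-character station-id rule: ccc is the arbitrary
# 3-digit country code assigned above; USAF 999999 and WMO 99999 are
# assumed to be the missing-value sentinels in the GSOD inventory.
def make_station_id(ccc, usaf, wmo):
    if usaf < 999999:                      # valid USAF id: cccUUUUUU00
        return "%03d%06d00" % (ccc, usaf)
    elif wmo < 99999:                      # fall back to WMO: cccWWWWW000
        return "%03d%05d000" % (ccc, wmo)
    raise ValueError("station has neither a USAF nor a WMO id")
```

Either branch yields an 11-character id, which is what sets up the duplicate-id problem noted above.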
Duplicate station ids are located and manually reviewed. Most "dupes" are the same station with slightly differing names, latitudes, or longitudes. Stations with duplicate records differing by more than 30 minutes of arc (roughly; this was a manual review) are eliminated (count ~400).
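The 30-arcminute screen could be sketched like this. The (id, lat, lon) tuples and the function name are hypothetical, and the real pass was a manual review rather than a purely automatic cut:

```python
# Sketch of the duplicate-id check: stations that ended up with the
# same new id are flagged when their positions differ by more than
# 30 minutes of arc (0.5 degrees) in latitude or longitude.
def flag_bad_dupes(stations, tol=0.5):
    by_id = {}   # first position seen for each id
    bad = set()
    for sid, lat, lon in stations:
        if sid in by_id:
            lat0, lon0 = by_id[sid]
            if abs(lat - lat0) > tol or abs(lon - lon0) > tol:
                bad.add(sid)   # same id, positions too far apart
        else:
            by_id[sid] = (lat, lon)
    return bad
```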
Once the inventory was in the proper format, I ran it through the GHCN metadata program.
I chose to enter GISTEMP at STEP1. That was a mistake. You will need the additional southern stations to successfully create the “Zone 1” files in STEP2.
So go to STEP0 and do just the station adds, skipping the USHCN and Hohenpeissenberg steps. Hey, and sort these suckers while you are at it.
In STEP1, edit v2_to_bdb.py to ignore the mcdw, ushcn, and sumofday tables. A number of stations were giving this script problems, so I created a script to pull out the problem stations and looped until I had a clean run in STEP1. This removed roughly 6% of the stations.
By STEP2, v2.inv needs to be numerically sorted; might as well have that done in STEP0. A handful of stations in Zone 1 ate my lunch and too many hours. It turns out that some stations report only in the summer, so at first pass the problem stations look like they have a whole bunch of good years, but none of those years is suitable for zonal annual anomalies, which causes toANNanom.f to crap out. Similar story for Zone 2 and some buoys. I hand-edited Ts.txt to remove the dozen or so offending stations and reran STEP1.
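A pre-filter for the summer-only problem might look like the sketch below: a station year can only contribute to a zonal annual anomaly if all four seasons have data, so stations that never complete a seasonal cycle could be dropped before toANNanom.f ever sees them. The season grouping mirrors GISTEMP's DJF/MAM/JJA/SON convention (ignoring that GISTEMP's DJF takes December from the previous year), and the {year: {month: temp}} layout is hypothetical:

```python
# Sketch: flag stations that never have a year with data in all four
# seasons (e.g. summer-only reporters). Season grouping follows the
# GISTEMP DJF/MAM/JJA/SON convention; the simplification here is that
# December is counted with its own year, not the following DJF.
SEASONS = {12: 0, 1: 0, 2: 0,    # DJF
            3: 1, 4: 1, 5: 1,    # MAM
            6: 2, 7: 2, 8: 2,    # JJA
            9: 3, 10: 3, 11: 3}  # SON

def has_annual_coverage(monthly):
    """True if any year has at least one month in each of the 4 seasons.

    monthly: {year: {month: temperature}} for one station (hypothetical layout).
    """
    for year, months in monthly.items():
        if len({SEASONS[m] for m in months}) == 4:
            return True
    return False
```

Running something like this over the station set would replace the hand edit of Ts.txt in the next, automated iteration.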
No issues in STEP3.
GISTEMP with GSOD
This post should be considered a prototype – a proof of concept. Don’t place high confidence in the results. I am posting it because the project is too big for me to continue without creating a ‘benchmark’ to work the next iteration against. There are numerous manual edits of files and odd filters used to get from the ‘raw’ GSOD data to the final GISTEMP output. The next step, now that I know where I’m going, is to return to the first step and automate the entire procedure. Some of these steps took several days to run on my limited hardware. It may take a month or more to deliver the next version.
Note the extreme station drop in 1972. I’m guessing an interruption in my download scripts.
My thanks to Dr Schmidt for the suggestion. While this might not be exactly what Gavin had in mind, it’s been an interesting trail to follow.
The GSOD data begins in 1929. There are about 12 non-GSOD stations that have records going back to 1905 or so which come into the data during the station adds in STEP0 from GISTEMP. But there aren’t enough stations with good temps for GISTEMP until the middle of 1938. Even then, there are wild divergences from GISTEMP with GHCN for a couple of years. I suspect insufficient spatial coverage creates wide error ranges.
A tip of the hat to O’Day’s Climate Charts and Graphs whose scripts I scavenged for graphing code.