
GISTEMP with GSOD: Round 2

2010 June 29


The GSOD data developed in the previous post is fed into the GISTEMP code.


Skip this step entirely. No additional Antarctic data. No USHCN swaps. No Hohenpeissenberg. None of it.


My GSOD processing has a flaw: 348 stations in the gsod.mean file are missing from the gsod.inv file. These need to be removed before v2_to_bdb.py will process the data successfully. In addition, that script was modified to ignore the input_file tables mcdw.tbl, ushcn.tbl, and sumofday.tbl.
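One way to do that filtering is sketched below. This is a minimal illustration, not the script actually used here: it assumes both files follow the GHCN v2 layout, with the station id in the first 11 characters of every line, and the sample records are made up. The function names are hypothetical.

```python
def inventory_ids(inv_lines):
    """Collect station ids, assumed to be the first 11 characters of each inventory line."""
    return {line[:11] for line in inv_lines if line.strip()}


def filter_mean(mean_lines, known_ids):
    """Split data lines into (kept_lines, dropped_ids) by inventory membership."""
    kept, dropped = [], set()
    for line in mean_lines:
        if not line.strip():
            continue  # ignore blank lines
        station = line[:11]
        if station in known_ids:
            kept.append(line)
        else:
            dropped.add(station)
    return kept, dropped


# Tiny made-up sample, not real GSOD records.
inv = ["42572469000 DENVER", "42572476000 GRAND JUNCTION"]
mean = [
    "4257246900001990  -12   15",
    "4259999900001990   30   41",
    "4257247600001990  -20    5",
]
kept, dropped = filter_mean(mean, inventory_ids(inv))
print(len(kept), sorted(dropped))
```

Returning the dropped ids alongside the kept lines makes it easy to check the count of removed stations against the 348 noted above.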

Ran normally

Skipped this script

Skipped this script

Ran normally



Twelve stations failed in toANNanom.exe. They share one problem: each has at least one calendar month with no valid data across the station’s entire span of years. This causes the “annual ave from 4 seasonal aves from 3 monthly aves” subroutine (annav) to fail.
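Stations with this problem could be screened out before they reach toANNanom.exe. The sketch below is not part of GISTEMP; it assumes each station's record has already been parsed into one 12-value row per year, and that missing months are marked with -9999 as in GHCN v2 files. The function name and sample data are hypothetical.

```python
MISSING = -9999  # assumed missing-value marker, as in GHCN v2 files


def empty_months(yearly_rows):
    """Return the 1-based calendar months that have no valid value in any year.

    yearly_rows is a list of 12-element lists, one per year of record.
    """
    return [
        month + 1
        for month in range(12)
        if all(row[month] == MISSING for row in yearly_rows)
    ]


# Hypothetical station: February is missing in every year of the record,
# while April is missing in only one year.
rows = [
    [10, MISSING, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    [11, MISSING, 31, MISSING, 51, 61, 71, 81, 91, 101, 111, 121],
]
print(empty_months(rows))
```

Any station for which this returns a non-empty list has a month that can never contribute to a seasonal average, which is the condition described above.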

The only change to GISTEMP code was to change the label from “GHCN.CL” to “GSOD.CL” in the do_comb_step2.sh script.



No changes here except to change the label from “GHCN.CL.PA” to “GSOD.CL.PA” in the do_comb_step3.sh script.



GISTEMP GSOD Station Count


Filter out of the gsod.mean file those stations without inventory data.

Improve the handling of station ids – there are WMO “dupes” with different USAF ids for many US stations. These could be left in by improving the station id labeling.

Improve the country code mapping. There are hundreds of stations dropping out because I don’t find a useful country code. Can I improve that part of the code?

Initial station count: 23570
Final station count: 19540
Of the 4030 stations removed, all but 12 were dropped in the various GSOD processing stages.

  1. carrot eater
    2010 June 29 at 6:49 pm

    Hmm. Looking at the last post, you had insufficient spatial coverage into the 1960s, despite decent station counts. And then who knows what happened in 1972.

    Can you make a plot of how global your coverage really is? Something like figure (c) here, or % of grid boxes with x stations inside, or somesuch


    Once your spatial coverage gets good, (1973+), your match to using GHCN becomes amazingly good. I’m surprised, really. And preliminary or not, I really do think this work is significant. I hope you’ve made Reto Ruedy aware of it.

    Doesn’t the thing spit out text files with gridded data? Some of the things I’m curious about (like checking GISS’s interpolation over areas that are sparse in GHCN), I could look at myself, if it existed.

  2. 2010 June 29 at 6:59 pm

    I’ve been thinking about the spatial coverage. GISTEMP is not a simple 5×5 grid and that complicates things a bit – although I can ignore GISTEMP gridding and just lay down a simple one myself. I did pull some ocean/land fractional area data down today and I’ve just begun to chew on Chad’s code. I’ll see what I can come up with. I agree that it is an important data point.

  3. carrot eater
    2010 June 29 at 7:28 pm

    it’s still the constant area grids in the 1987 paper, no?

    Could you at least have it drop a flag if there was a point with no stations within 1200 km? Just counting those would give some idea.

  4. 2010 June 30 at 5:29 am

    I was overcomplicating the GISTEMP gridding. With the 100 box subgrids and the 1/2 grid overlaps, I thought there was more going on than there really is.

    Poking at the to.SBBXgrid.f file now to see if I can extract the same data as shown in the 1987 map.

  5. 2010 June 30 at 12:36 pm

    You could also produce a version of GSOD with monthly mean temps in the standard GHCN data file / metadata file format for us to play around with. I have a feeling that daily might be a tad too bulky to easily work with, and you aren’t really losing that much interesting data in going from daily temps to monthly means.

  6. carrot eater
    2010 June 30 at 12:50 pm

    Not a bad idea. Now that you did all that thankless grunt work with the big unwieldy source files, make a nice manageable text file for the rest of us, in the format we’re used to.

    We’ll buy you a beer or something.

    You do lose something by not having daily, but since you never had it before, you maybe don’t realise it. It lets you look more at everyday weather, and how climate shifts might be affecting weather.

    Am I correct that GSOD, like the ISH, should be free of TOB issues? I never bothered to download the raw GSOD files to see what was actually in there.

  7. 2010 June 30 at 5:28 pm

    While the GSOD is created from multiple SYNOP records per day, the GSOD files themselves only include one summary per day including Tmin, Tmax, and Tmean.

  8. carrot eater
    2010 June 30 at 5:32 pm

    so it should be free of TOB problems, but there isn’t enough information there for you to test that yourself.

    In that case, a comparison for the US section to USHCN, TOB version, could be good. Though that would be extra work for you, since GISS doesn’t naturally spit out the US 48 result normally, I don’t think.

  9. 2010 June 30 at 5:32 pm

    No comment on TOB. I dunno (but I wouldn’t think so – okay, I guess that counts as a comment!)

  10. cce
    2010 June 30 at 11:30 pm


    In your original post, you thought your download script may have been interrupted, causing the weird dropoff. Did you check into that?

  11. 2010 July 1 at 7:28 am

    Nope, that wasn’t it. Repeated downloads lead to the same drop off. The data gap is real.

    My latest *guess* is that there was a switch between reporting methods which took longer than expected to implement. But that is still completely speculative.

  12. 2010 July 1 at 3:37 pm

    Thanks Ron! Is that only the GSOD records corresponding with GHCN stations, or the whole set?

  13. 2010 July 1 at 3:52 pm

    Not too big a difference between the two with my model, with a few years being exceptions.

    Slightly higher trend over 1950-present and 1960-present vis-a-vis GHCN (not showing 1930-1950 as GSOD has too few stations to be usable; it’s similar to GHCN but noisy during that period as Ron shows).

    Method was standard 1961-1990 CAM, 5×5 gridboxes, and land-mask applied to grid weights.

  14. 2010 July 1 at 4:19 pm

    The U.S. is rather interesting. For the last 10 years or so GSOD tracks much closer to USHCN raw than USHCN tobs, but it’s pretty noisy:

  15. 2010 July 1 at 4:21 pm

    Ack, it should read GSOD 1985-2005 anomalies, not baseline. All lines are baselined relative to 1961-1990.

    I decided to use a more recent baseline for CAM for GSOD because lots of stations only have records post-70s, and it lets me use ~500 U.S. stations instead of only 250 or so.

  16. 2010 July 1 at 4:22 pm

    Is that all of GSOD, or just the US stns in GSOD?
    Gridded or bulk average of anomalies?

  17. 2010 July 1 at 4:24 pm

    Nevermind 😉

    I’ve also been thinking that for GSOD – a post 1973 baseline makes sense.

  18. 2010 July 1 at 4:24 pm

    Just CONUS GSOD stations, 5×5 gridboxes (the same are used for USHCN in this analysis).

  19. 2010 July 1 at 4:27 pm

    Here is the same CONUS map with a 1980-2009 baseline period: http://i81.photobucket.com/albums/j237/hausfath/Picture451.png

  20. cce
    2010 July 1 at 10:10 pm

    So that gives you nearly total coverage of the non-ice covered land surface for the last 30 years. Combine that with the AVHRR SST (and possibly the ice “skin temperature”) and that should allow for all sorts of correlation analysis for interpolating sparser years.

  21. carrot eater
    2010 July 2 at 4:07 am

    I don’t know what to make of those. But your legend is wrong I think, it still says 1985-2005.

  22. carrot eater
    2010 July 2 at 7:03 am

    Ron, I gave you a plug at WUWT and raised Watts’ station drop accusations in the context of this work, to see if I could elicit a reply. No such luck.


  23. 2010 July 2 at 9:35 am


    I hadn’t been tracking that thread. For that matter, I don’t read many WUWT threads any more unless someone points one out on another blog. I think it’s the SSDD factor.

    I don’t see a ‘station drop’ meme in that thread, that’s probably why you aren’t getting much response. But there is a great quote in that thread:

    The adjustment code used is publicly available, of course, but not a hint of all the code discarded because it didn’t produce the desired result.

    Love it! :LOL:

    As to responding to GSOD, that will require new talking points to be generated in the Man Cave of Denial, disseminated by pigeons to the Loyal Minions, floated on a few select blogs, vetted by a Senator’s aide, published by the SSPI, and echoed over and over and over again until the bots learn their new rote.

    OTOH, there is no reason to assume that GSOD is immune to the ‘station siting’ issues and criticism. The major advantage that GSOD has: better spatial coverage in the later decades than GHCN. The major disadvantage: Shorter period of good data.

    Possible future project: splice in data from earlier time periods.

    I really, really want to get to understand the reporting networks better. This is a subject on which you would think a working meteorologist might have a head start. AWOS, ASOS, AWSS. How else are SYNOP reports collected by the Federal Climate Complex? Any connections with GTS? I might actually have to start ‘investigating’ and make some emails and phone calls.

  24. carrot eater
    2010 July 2 at 9:51 am

    If anything, I would think the stations which are in GSOD but not GHCN would actually be worse, in terms of inhomogeneity problems. And definitely also QC; I don’t know if anybody does much QC on the GSOD, either at the sending country side, or at the receiving agency.

    I just raised the station drop thing because your use of this data set is another way to beat that meme upside the head. Since every comment goes through moderation, I thought there was some chance of getting an in-line response from Watts.

    But yes, I also loved that quote. The logical prowess of some of those commenters is something.

  25. 2010 July 2 at 10:06 am

    Ah! But inhomogeneity and lack of QC are advantages. Bwahahahahaha!

    Since there has been little effort to clean these up statistically, there has been little chance for (cue minor chords) “scientists” to manipulate the data to their desired results which could be used to enable a world conspiracy and enrich the participating scientists. Fame! Power! Glory! Cash! (end minor chords)

    You can’t point to GSOD and say “manipulated.” (cue music in the key of C) 😀

  26. carrot eater
    2010 July 2 at 10:15 am

    See how that works out for you. They’ll start complaining about the various inhomogeneities that you haven’t adjusted for. That is, after all, the point of the surface stations project. So then you’ll have come full circle.

    But yes, you should absolutely be talking to the NCDC guys now, to understand everything there is to know about how the GSOD is put together. It’s quite irritating when WUWT posters talk about data sets they know nothing about. Don’t be like them.

  27. 2010 July 2 at 11:54 am

    It would be fairly trivial to create a coupled record of all GSOD and GHCN monthly data. Just need to snoop out obvious duplicates.

  28. carrot eater
    2010 July 2 at 12:23 pm

    Actually I’d be interested in looking at the duplicates between GSOD and GHCN. In theory, they should match, but I bet many of them will have some minor differences here and there. To some extent, because extra QC is done on climate data, compared to the SYNOP (cue the conspiracy music, Ron). To some extent because of the multiple ways people have calculated Tmean, etc, which leads to there being the slightly differing duplicates in GHCN to begin with.

  29. carrot eater
    2010 July 2 at 12:23 pm

    Is this your most commented thread so far?

  30. 2010 July 2 at 1:02 pm

    I *think* it’s second – so far.
    But still only my 3 faithful readers 😉

    Edit: Nope, not even second. We have to break 47 to win 1st.

  31. 2010 July 2 at 1:08 pm

    Creating a matched pairing of GSOD and GHCN is on my to-do list. That’s the major reason I went through the effort of matching GSOD countries to GHCN country codes.

    While the GSOD metadata/readme mentions SYNOP, I’m beginning to think that is an oversimplification. At the moment, I’m leaning towards GSOD being a “daily summary” of (most) all the stations in the ISH. And the ISH is compiled from numerous sources: “The database includes data originating from various codes such as synoptic, airways, METAR (Meteorological Routine Weather Report), and SMARS (Supplementary Marine Reporting Station), as well as observations from automatic weather stations.” Also, it is easy to conflate “synoptic” with SYNOP – and I’m not sure that every reference to “synoptic” data can be inferred to read “submitted as SYNOP” data. What is becoming clear to me is that the ISH is aggregated from numerous different instrumental data sources and several different weather networks.

  32. 2010 July 2 at 1:09 pm

    I don’t like this commenting format. Too difficult to follow replies to a previous comment. Someone should kick the blog owner to fix that.

  33. carrot eater
    2010 July 2 at 1:29 pm

    In that case, your results should match Spencer’s pretty well, but you got to deal with a less massive pile of data.

    I would say that all SYNOPs are examples of synoptic data, but not all synoptic data becomes encoded as a SYNOP.

  34. carrot eater
    2010 July 3 at 2:35 am

    Nothing from Anthony, but I did draw EM Smith out of the woodwork on that thread. He claims he’s got a source for the missing data. I wonder if he’s stumbled onto ISH or GSOD. On past form, odds are he’s making a hash of it, but we’ll see.

  35. 2010 July 3 at 8:42 am

    I bumped into this comment while googling around a couple of weeks ago:

    There are two ways that this data can be used for ‘filling in’.

    One, there are more stations for most regions than in GHCN, so GSOD can be used to add stations.

    Two, there are probably stations in the GSOD dataset that dropped out of the GHCN data set, so they can be spliced together to create longer records in GHCN.

    Of course to do option 1 successfully, you need to be able to create metadata for rural/urban or bright/dim to populate the inventory file (if using GISTEMP).

  36. carrot eater
    2010 July 3 at 3:37 pm

    he still hadn’t figured out the difference between CLIMAT, METAR and SYNOP by that point I see. He did at some point.

  37. carrot eater
    2010 July 4 at 7:58 pm

    Why not just compare the history at a sparse grid point (sparse, when using GHCN, but hopefully less so when using GSOD), using GSOD vs GHCN? That’d be your first indication.

  38. Daniel the Yooper
    2010 July 5 at 9:13 am

    Ron Broberg :
    I *think* it’s second – so far.
    But still only my 3 faithful readers
    Edit: Nope, not even second. We have to break 47 to win 1st.

    The Yooper will try to be more faithful with his reading assignments, Ron. 🙂
    Nice Key of C bit, BTW.

    On behalf of your silent readership, keep up the good work!

    Daniel the Yooper

  39. 2010 July 6 at 5:59 am

    Thanks for the note.
    Once more into the breach, dear friends!

  40. 2010 July 12 at 4:38 am

    I downloaded your GHCN-format files, and got it working with TempLS – some results here. As expected, the results are quite similar to other replications with GHCN.

    Thanks for doing all that hard work. The original files look really daunting.

  41. carrot eater
    2010 July 12 at 4:48 am

    There was indeed a huge barrier to using this source, and Ron removed it for everybody.

  42. carrot eater
    2010 July 19 at 5:43 am

    Here is something from Peterson 1998, the paper about QC.

    “Another consideration in the compilation of GHCN is the origin of the data. The most reliable monthly data come from sources that have serially complete data for every monthly report. Monthly data derived from synoptic reports transmitted over the Global Telecommunication System (GTS) are not as reliable as CLIMAT type monthly reports. This may be due to missing data or the orders of magnitude more digitization and corresponding greater likelihood of keypunch errors. Schneider (1992) showed that synoptically derived monthly precipitation typically differs from CLIMAT monthly precipitation by 20–40%. A similar analysis performed on temperature found synoptically derived monthly temperatures differ by as much as 0.5°C from CLIMAT temperatures (M. Halpert, personal communication, 1992). Therefore, GHCN does not include monthly data that were derived from transmitted synoptic reports. While this decision does not significantly impact the quantity of historical data available for GHCN, it does decrease the quantity of near real time data available because many more stations currently report synoptically (ca. 8000) than send in CLIMAT reports (ca. 1650).”

    When you append GSOD data to GHCN, watch out for the two not matching where they do overlap. Different ways of calculating Tmean will also show up here.

  43. 2010 July 19 at 8:42 am

    Sorry I missed your note the first time around, Nick.

    Glad you are looking at the files. Let me know about any further issues. I found your posts using them interesting.

  44. Maxim
    2010 August 20 at 7:09 pm

    It seems that the very first link is broken.

  45. 2010 August 20 at 7:47 pm

    Thank you for noting this.

    I have searched through the complete listing of posts and pages and the post referred to appears to be deleted. I have tried google cache and archive.org, but cannot retrieve a copy. I recall ‘bumping’ a post to a new date two weeks ago to see how that works and then reverting to the old date. I suspect that the missing post is the one I bumped and appear to have accidentally deleted.

    The post in question described the scripts used to retrieve and process the GSOD data. The code exists. I will have to pull together a new post. My apologies.

  46. 2010 August 20 at 9:09 pm

    Thanks. I hope you will have time to recreate it. I was interested in looking at GSOD data, so I’m patiently waiting.

  47. 2010 August 20 at 10:17 pm

    I was able to locate a copy in Yahoo cache. It isn’t a completely clean copy, but the descriptions and links are available.
