
Kuska and Serahs

2010 July 21

Introduction

kushka 1966

RomanM has a recent post up looking at “duplicate” station records in GHCN. I decided to follow up on one of those that he had selected: Kuska.

GHCN

Kuska is listed in the GHCN inventory (v2.temperature.inv) as the second record for WMO ID 38974. Serahs is the other entry (aka Saragt aka Serakhs).

22938974001 SERAHS 36.53 61.22 279 286R   -9FLDEno-9x-9WARM GRASS/SHRUBC
22938974002 KUSKA  36.53 61.22 625 286R   -9FLDEno-9x-9WARM GRASS/SHRUBC

This is strange, since KUSKA (aka Kyshka aka Gyshgy) has its own WMO ID, 38987, per the WMO Global Observing System.

2 ASIA / ASIE TURKMENISTAN 2018	1726 38974 0 SARAGT 36 32N 61 13E 275
2 ASIA / ASIE TURKMENISTAN 2018	1727 38987 0 GYSHGY 35 17N 62 21E 625
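The WMO listing gives coordinates in degrees and minutes, while the GHCN inventory uses decimal degrees. A minimal sketch of the conversion, using only the values quoted above (the function and column names are my own, for illustration): Saragt's 36 32N, 61 13E works out to the 36.53, 61.22 that GHCN lists for both entries, whereas Gyshgy sits more than a degree away. Only the 625 m elevation on the KUSKA line agrees with the WMO entry for Gyshgy.

```r
# Convert degrees and minutes to decimal degrees.
dm_to_decimal <- function(deg, min) deg + min / 60

# Coordinates copied from the two WMO lines quoted above.
wmo <- data.frame(
  name    = c("SARAGT", "GYSHGY"),
  lat_deg = c(36, 35), lat_min = c(32, 17),
  lon_deg = c(61, 62), lon_min = c(13, 21)
)

wmo$lat <- round(dm_to_decimal(wmo$lat_deg, wmo$lat_min), 2)
wmo$lon <- round(dm_to_decimal(wmo$lon_deg, wmo$lon_min), 2)

print(wmo[, c("name", "lat", "lon")])
```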

The GHCN mean temperature file (v2.mean) has 5 records for 22938974:

229389740010 1903-1907,1914,1935-1989
229389740011 1936-1989
229389740012 1936-1989

229389740020 1904-1908,1913-1917,1921-1989
229389740021 1904-1908,1911-1918,1921-1989
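For readers who want to pull these records out themselves, here is a rough sketch of reading v2.mean lines and splitting the 12-digit identifier into country code, WMO number, station modifier and duplicate digit. It assumes the fixed-width layout of the v2 files as I understand it (12-character ID, 4-character year, twelve 5-character values in tenths of a degree C, -9999 for missing). The monthly values below are made up purely to exercise the parser, and the helper names are hypothetical.

```r
# Build one made-up v2.mean line: 12-char ID, 4-char year, twelve 5-char values.
make_line <- function(id, year, vals) {
  paste0(id, sprintf("%4d", year), paste(sprintf("%5d", vals), collapse = ""))
}

# Hypothetical monthly values in tenths of a degree C; December is missing.
vals  <- c(12L, 45L, 98L, 160L, 221L, 270L, 295L, 280L, 224L, 158L, 90L, -9999L)
lines <- c(make_line("229389740011", 1936L, vals),
           make_line("229389740021", 1936L, vals))

parse_v2_mean <- function(lines) {
  starts <- 17 + 5 * (0:11)                 # first character of each monthly field
  months <- t(sapply(lines, function(ln) {
    as.integer(substring(ln, starts, starts + 4))
  }, USE.NAMES = FALSE))
  months[months == -9999L] <- NA            # missing-value code
  colnames(months) <- month.abb
  id_parts <- data.frame(
    country   = substr(lines, 1, 3),
    wmo       = substr(lines, 4, 8),
    modifier  = substr(lines, 9, 11),
    duplicate = substr(lines, 12, 12),
    year      = as.integer(substr(lines, 13, 16)),
    stringsAsFactors = FALSE
  )
  cbind(id_parts, months / 10)              # tenths of a degree C -> degrees C
}

rec <- parse_v2_mean(lines)
print(rec[, c("country", "wmo", "modifier", "duplicate", "year")])
```

Read this way, the five records above are two stations (modifiers 001 and 002) under one WMO number, with three and two duplicates respectively.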

Dupes have been described as flags marking the same data record as received through two different delivery channels. So I took a quick look at the GHCN data source table to see if I could discern likely candidates for these stations. Two immediately stood out: “USSR Network of CLIMAT stations” and “Daily Temperature and Precipitation Data for 223 USSR Stations (NDP-040)”. The search was on.

NDP048

I tracked down the “USSR Network of CLIMAT stations” as NDP048 aka “Six- and Three-Hourly Meteorological Observations from 223 U.S.S.R. Stations (1998)” first.
http://cdiac.ornl.gov/ndps/ndp048.html
http://cdiac.ornl.gov/ftp/ndp048/

This database contains 6- and 3-hourly meteorological observations from a 223-station network of the former Soviet Union. These data have been made available through cooperation between the two principal climate data centers of the United States and Russia: the National Climatic Data Center (NCDC), in Asheville, North Carolina, and the All-Russian Research Institute of Hydrometeorological Information-World Data Centre (RIHMI-WDC) in Obninsk, Russia. The first version of this database extended through the mid-1980s (ending year dependent upon station) and was made available in 1995 by the Carbon Dioxide Information Analysis Center (CDIAC) as NDP-048. A second version of the database extended the data records through 1990. This third, and current version of the database includes data through 2000 for over half of the stations (mainly for Russia), whereas the remainder of the stations have records extending through various years of the 1990s. Because of the break up of the Soviet Union in 1991, and since RIHMI-WDC is a Russian institution, only Russian stations are generally available through 2000. The non-Russian station records in this database typically extend through 1991. Station records consist of 6- and 3-hourly observations of some 24 meteorological variables including temperature, past and present weather type, precipitation amount, cloud count and type, sea level pressure, relative humidity, and wind direction and speed. The 6-hourly observations extend from 1936 through 1965; the 3-hourly observations extend from 1966 through 2000 (or through the latest year available). These data have undergone extensive quality assurance checks by RIHMI-WDC, NCDC, and CDIAC. The database represents a wealth of meteorological information for a large and climatologically important portion of the earth’s land area, and should prove extremely useful for a wide variety of regional climate change studies.

There is more information here:
http://cdiac.ornl.gov/ftp/ndp048/ndp048.pdf (7.5 mb)

min(ndp48_38987$Year) # 1935
max(ndp48_38987$Year) # 1991

min(ndp48_38974$Year) # 1935
max(ndp48_38974$Year) # 1991

However, 1935 includes just 1 entry, the last record on Dec 31.
This provided a clue as to how to process this data.

See also:
ds475.0 – U.S.S.R. Surface 6- and 3-hourly Surface Synoptic Observations 1936-1983

NDP040

Next came another CDIAC data archive: ndp040
http://cdiac.ornl.gov/ndps/ndp040.html
http://cdiac.ornl.gov/ftp/ndp040/

The stations in this dataset are considered by RIHMI to comprise one of the best networks suitable for temperature and precipitation monitoring over the former-USSR. Factors involved in choosing these 223 stations included length of record, amount of missing data, and achieving reasonably good geographic coverage. There are indeed many more stations with daily data over this part of the world, and hundreds more station records are available through NOAA’s Global Historical Climatology Network – Daily (GHCND) database. The 223 stations comprising this database are included in GHCND, but different data processing, updating, and quality assurance methods/checks mean that the agreement between records will vary depending on the station. The relative quality and accuracy of the common station records in the two databases also cannot be easily assessed. As of this writing, most of the common stations contained in the GHCND have more recent records, but not necessarily records starting as early as the records available here.

This database contains four variables: daily mean, minimum, and maximum temperature, and daily total precipitation (liquid equivalent). Temperatures were taken three times a day from 1881-1935, four times a day from 1936-65, and eight times a day since 1966. Daily mean temperature is defined as the average of all observations for each calendar day. Daily maximum/minimum temperatures are derived from maximum/minimum thermometer measurements. See the measurement description file for further details.
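These variables allow two different notions of a "daily mean": the average of all fixed-hour observations, and the midpoint of the maximum and minimum thermometer readings. A small sketch with made-up readings (all numbers and variable names here are hypothetical) shows that the two need not agree, and that the observation average itself shifts when the number of observations per day changes.

```r
# Hypothetical readings for one day at 3-hour spacing (deg C).
obs  <- c(10.0, 9.0, 8.5, 12.0, 18.0, 21.0, 17.0, 12.5)
tmax <- 22.0   # hypothetical maximum-thermometer reading
tmin <- 8.0    # hypothetical minimum-thermometer reading

# Average of all observations, as with eight readings a day.
mean_obs_8 <- mean(obs)

# Same day seen through only four readings (every second one, chosen for illustration).
mean_obs_4 <- mean(obs[c(1, 3, 5, 7)])

# Midpoint of the max and min thermometers.
mean_maxmin <- (tmax + tmin) / 2

print(c(obs8 = mean_obs_8, obs4 = mean_obs_4, maxmin = mean_maxmin))
```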

http://cdiac.ornl.gov/ftp/ndp040/

There is a station history file with some information. The metadata file describes these entries.

38974 MOVE 1935  2 -9  0 W
38974 MOVE 1938 -9 -9  0 S
38974 MOVE 1942 -9 -9  0 E
38974 PRCP 1950 10 -9
38974 MOVE 1961  6 12  2 NE

38987 MOVE 1904  4 -9 -9 -99
38987 MOVE 1910 -9 -9 -9 -99
38987 MOVE 1913  8 -9 -9 -99
38987 MOVE 1927  5 -9 -9 -99
38987 PRCP 1953  1  4

A 134-page description of the data set is available here, which includes a reprint of A New Perspective on Recent Global Warming: Asymmetric Trends of Daily Maximum and Minimum Temperature (“Bulletin of the American Meteorological Society, Vol 74, No 6, June 1993”) http://cdiac.ornl.gov/ftp/ndp040/ndp040.pdf (4 mb)

min(ndp40_38974$Year) # 1936
max(ndp40_38974$Year) # 2001

min(ndp40_38987$Year) # 1904
max(ndp40_38987$Year) # 2001

See also:
ds524.0 Russian Summary of Day, 1881-1989

A map of the 223 stations in NDP040 and NDP048
223 stations

Results

My initial cut at the NDP048 (6 hour records) indicated a strong correlation with the GHCN *011 and *021 records, but it was off. The leading record (the last entry from the day, and year, before) gave me the hint I needed: slip the temperature records down one ‘slot’, so that the last entry from the previous day counts as today’s and today’s last record belongs to the next day. Why would you do this? TOB (time of observation bias). I had read a paper two weeks ago that I was planning on writing a post on, and it planted the seed I needed. The slight TOB adjustment gave a much better match.
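The one-slot slip can be sketched as follows. This is not the code from kuska4.R; it is a minimal illustration that assumes a data frame with Date, Hour and Temp columns and assumes the last observation of each day carries Hour 18. The observation hours and temperatures are invented. Relabelling the last observation of each day with the next day's date has the same effect on the daily grouping as slipping every record down one slot.

```r
# Hypothetical 6-hourly observations: the trailing record of 1935-12-31
# followed by two complete days.
obs <- data.frame(
  Date = as.Date(c("1935-12-31", rep("1936-01-01", 4), rep("1936-01-02", 4))),
  Hour = c(18, 0, 6, 12, 18, 0, 6, 12, 18),
  Temp = c(2.0, 1.0, 0.0, 5.0, 3.0, 2.0, 1.0, 6.0, 4.0)
)

# Count each day's last observation as part of the following day.
shift_last_obs <- function(obs, last_hour = 18) {
  is_last <- obs$Hour == last_hour
  obs$Date[is_last] <- obs$Date[is_last] + 1
  obs
}

# Daily mean of whatever observations carry a given date.
daily_mean <- function(obs) aggregate(Temp ~ Date, data = obs, FUN = mean)

plain   <- daily_mean(obs)
shifted <- daily_mean(shift_last_obs(obs))

print(plain)
print(shifted)
```

With the shift applied, the lone 1935 record becomes the first of the four values averaged for 1 January 1936, which is why a single leading entry from the previous year is needed to start the series.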

But the real match came from the NDP040 data, formatted as Tmean, Tmax, and Tmin. Taking the monthly means of the daily Tmean provided a near perfect match with the GHCN *011 and *021 records.
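Collapsing a daily series into a year-by-month table can be sketched like this. The layout (an ID column, a Year column, then twelve monthly columns) is an assumption chosen so that the months land in columns 3 to 14; the daily values are synthetic (each day simply carries its month number) and the function name is hypothetical.

```r
# Synthetic daily series: every day's value equals its month number,
# so the expected monthly means are obvious.
dates <- seq(as.Date("1936-01-01"), as.Date("1937-12-31"), by = "day")
daily <- data.frame(
  Year  = as.integer(format(dates, "%Y")),
  Month = as.integer(format(dates, "%m")),
  Tmean = as.numeric(format(dates, "%m"))
)

# One row per year, one column per month, monthly mean of the daily values.
monthly_table <- function(daily, id) {
  m <- tapply(daily$Tmean, list(daily$Year, daily$Month), mean, na.rm = TRUE)
  data.frame(ID = id, Year = as.integer(rownames(m)), round(m, 1),
             check.names = FALSE, row.names = NULL)
}

tab <- monthly_table(daily, "38974")
print(tab)
```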

This shows the difference between GHCN 229389740011 and NDP040 38974.
NDP040 38974

The match with the hourly data from NDP048 is clearly very similar, but a little off.
NDP048 38974

Likewise for Station 38987 (known in GHCN as 229389740021).

NDP040 38987

NDP048 38987

The R-code is not completely automated, but my notes are recorded here: kuska4.R

sum(abs(ghcn_389740011[,3:14] - ndp40_38974a[,3:14]),na.rm=T) # 3
sum(abs(ghcn_389740011[,3:14] - ndp48_38974a[,3:14]),na.rm=T) # 113
sum(abs(ghcn_389740021[,3:14] - ndp40_38987a[,3:14]),na.rm=T) # 0
sum(abs(ghcn_389740021[,3:14] - ndp48_38987a[,3:14]),na.rm=T) # 92
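The score used above is a sum of absolute monthly differences, and it only makes sense when row i of one table is the same year as row i of the other. A hedged sketch of a helper that lines the two tables up by year first is below. It assumes each table has a Year column and the twelve months in columns 3 to 14; the toy tables and the names `sum_abs_diff` and `mk` are made up for illustration and are not from kuska4.R.

```r
# Sum of absolute monthly differences over the years both tables share.
sum_abs_diff <- function(a, b, cols = 3:14) {
  years <- intersect(a$Year, b$Year)
  a <- a[match(years, a$Year), cols]
  b <- b[match(years, b$Year), cols]
  sum(abs(as.matrix(a) - as.matrix(b)), na.rm = TRUE)
}

# Toy tables: an ID column, a Year column, then twelve constant monthly values.
mk <- function(years, value) {
  data.frame(ID = "x", Year = years,
             matrix(value, nrow = length(years), ncol = 12))
}
a <- mk(1936:1938, 100)
b <- mk(1937:1939, 101)
b[1, 3] <- NA   # one missing month in 1937

# Two shared years x 12 months, each differing by 1, minus the one missing cell.
score <- sum_abs_diff(a, b)
print(score)
```

Because missing cells are dropped, a low score between two short or gappy records is weaker evidence than the same score over a long overlap.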

Discussion

The DSxxx data sets are archived at UCAR and require email contact. I have not pursued that route at this time.

I took a very quick look at GSOD data for these stations, but the match was not close enough to indicate a direct match with the GHCN records.

I processed the NDP048 first and fairly quickly found the match with the GHCN *011 and *021 records. So I *thought* that NDP040 was going to give me GHCN *010 and *020. Imagine my disappointment when it didn’t pan out. On the other hand, the near perfect match of NDP040 with GHCN was quite a pleasant surprise. Maybe someone else can locate the original source for the *010 and *020 records.

In addition, the GHCN records are longer than the related NDP records.

Update

JR in the comments below identified 012 and 020 as derivations of Tmean=(Tmax+Tmin)/2 from the NDP040, and also identified 010 as another derivation of Tmid from NDP040.

So I went back, made a few tweaks, and took a look.

# a = monthly mean of Daily Tmids
 sum(abs(ghcn_389740010[,3:14] - ndp40_38974a[,3:14]),na.rm=T) # 649
 sum(abs(ghcn_389740011[,3:14] - ndp40_38974a[,3:14]),na.rm=T) # 3
 sum(abs(ghcn_389740012[,3:14] - ndp40_38974a[,3:14]),na.rm=T) # 4035
 sum(abs(ghcn_389740020[,3:14] - ndp40_38987a[,3:14]),na.rm=T) # 5348
 sum(abs(ghcn_389740021[,3:14] - ndp40_38987a[,3:14]),na.rm=T) # 0

# b = monthly mean of Daily Tmeans = (Tmax + Tmin)/2
 sum(abs(ghcn_389740010[,3:14] - ndp40_38974b[,3:14]),na.rm=T) # 3849
 sum(abs(ghcn_389740011[,3:14] - ndp40_38974b[,3:14]),na.rm=T) # 4125
 sum(abs(ghcn_389740012[,3:14] - ndp40_38974b[,3:14]),na.rm=T) # 159
 sum(abs(ghcn_389740020[,3:14] - ndp40_38987b[,3:14]),na.rm=T) # 176
 sum(abs(ghcn_389740021[,3:14] - ndp40_38987b[,3:14]),na.rm=T) # 5309

I agree with his analysis that 0012 and 0020 are the monthly mean of the daily Tmean = (Tmax + Tmin)/2 from NDP040.
I’m not convinced that the 0010 series originates from either method.
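Putting the ten scores above side by side makes the pattern easier to see. The numbers in this sketch are copied from the two blocks above; the table layout and variable names are just for illustration.

```r
# Scores from the update: rows are GHCN records, columns are the two candidates.
# a = monthly mean of daily Tmid, b = monthly mean of daily (Tmax + Tmin)/2
scores <- rbind(
  "0010" = c(a = 649,  b = 3849),
  "0011" = c(a = 3,    b = 4125),
  "0012" = c(a = 4035, b = 159),
  "0020" = c(a = 5348, b = 176),
  "0021" = c(a = 0,    b = 5309)
)

# For each record, the better-fitting candidate and how well it fits.
best <- colnames(scores)[apply(scores, 1, which.min)]
summary_tab <- data.frame(record = rownames(scores),
                          best   = best,
                          score  = apply(scores, 1, min),
                          row.names = NULL)
print(summary_tab)
```

The 0011 and 0021 records match the Tmid candidate almost exactly, 0012 and 0020 sit closest to the Tmax/Tmin candidate, and 0010 is nearer to Tmid but with a residual several times larger than any of the other best matches.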

  1. carrot eater
    2010 July 21 at 9:38 pm

    Good going. This is the first I’ve seen of a blogger trying to hunt down the sources of GHCN.

    I think NOAA should add a column to the inv file, denoting the sources used for each duplicate number.

    I’m a little surprised that source was easy to find online. I always assumed that getting the historical archives meant trudging to the library.

  2. 2010 July 21 at 9:57 pm

    Well, even this isn’t the full span for just these records – still need the 1904-1936 data. But, yeah, it’s a pretty strong first step. Anybody want to buy me a full set of World Weather Records? 😉

  3. 2010 July 21 at 10:26 pm

    Ron,
    That’s great detective work. I loved the picture too! Just summarising, as I understand it:
    1.Serahs and Kuska are different places – Serahs at (36.53N, 61.22E) 279m and Kuska at (35 17N, 62 21E) 625m
    2. GHCN records *011 and *021 are the same place, apparently Kuska.
    3. *020 we don’t know – it doesn’t seem the same as *010, but probably isn’t Kuska.

    I tracked a few years of *021 and *020 at Roman’s. *021 seemed more seasonally variable, but about the same mean, which isn’t consistent with a 350m altitude difference. Ah well – I kept having that old song in my head – “the vessel with the pestle has the brew that is true…”.

    Practical implications? Clearly “duplicates” crossing over is going to mess up any algorithmic refinements designed to treat them accurately. On the other hand, just taking the mean, and then aggregating the means over a grid cell (all these seem to be in the same cell) is probably unaffected by the crossover. The same is true of my way of treating them, which is anyway equivalent. The contributions to the regional and global means may not be correctly attributed, but they add to the right amount.

  4. carrot eater
    2010 July 21 at 10:44 pm

    Ron, I’m not buying you anything, but if you ask really nicely, I might scan in some pages from the library for you if you have focused requests.

    Nick, in terms of algorithms, I think it’d be best if the method detected whether the duplicates can actually be considered duplicates. I do agree that in some implementations, it’ll end up counting the same either way, anyway.

    But what bothers me more is what can get crammed into a single duplicate. If there’s a gap of 20 years, I think it ought to be treated as a separate station. You have no idea if there was some discontinuity in that gap.

  5. 2010 July 22 at 4:29 am

    Nick,

    *01* is Serahs
    *02* is Kuska

    One of the oddest things about GHCN on this issue is that they placed these two stations into one WMO id since Kuska has its own WMO id 38987.

    We don’t know the origin of the data for
    *010
    *012
    *020

    The “223 USSR/FSU Stations” data in NDP040, a data set noted in the GHCN description, is the origin of the records for
    *011
    *021

  6. 2010 July 22 at 8:03 am

    I wonder how useful WMSSC (http://dss.ucar.edu/datasets/ds570.0/) is as a companion to GHCN, as far as duplicates and whatnot go? I have the datafile at http://drop.io/0yhqyon/asset/wmssc-temperature-txt for those who don’t want to register with UCAR.

  7. RomanM
    2010 July 22 at 8:28 am

    Very nice sleuthing, Ron. You appear to have done a thorough job.

    What this indicates to me is that clearing up the greater portion of these inconsistent looking “duplicates” would require independent temperature sources and a monumental effort without any notion of how different the final global results would be at the end of the process.

    I had chosen this example somewhat haphazardly to indicate what the more extreme differences might look like. There are some that are more interesting. For example, for several of the stations in Canada:

    4033714310030 and 4033714310031 Peterborough Ontario
    403716250050 and 4033716250051 Pembroke Ontario

    the difference is a series of constant monthly adjustments over a good part of the earlier record. There is no indication of which would be the “unadjusted” version.

    As well, there seem to be an inordinately large number of Korean stations which display highly variable differences across their entire record (country code 221 and station numbers typically of the form 47xxx).

    Nick, given the uncertainty of the provenances of some of these series, I would be loath to start averaging them either as individual series or by first averaging the 0 and 1 series together and then throwing the average into the mix. The latter method would give each a relative weight of .5 to each series as opposed to an individual weight of 1 by the former method. This can affect the size of any calculated error bounds of the result.

    When I get some time, I will look at the stations with three or more duplicates as well. Despite being retired, I still have some work commitments which interfere with the fun of blogging. 😦

  8. toto
    2010 July 22 at 9:06 am

    Ron Broberg :
    Nick,
    *01* is Serahs
    *02* is Kuska
    One of the oddest things about GHCN on this issue is that they placed these two stations into one WMO id since Kuska has its own WMO id 38987.

    But they did not place them as “duplicates” either, right? I mean, the 001* and 002* numbers indicate different stations. It is the last digit only that enumerates duplicates of the same station (that’s how I interpret the info in the readme file quoted on Roman’s blog post).

    So in short, it seems that GHCN failed to recognise that Kuska has its own WMO number, but nevertheless recognised (somehow) that it constitutes a different station. So they gave it the WMO number of nearby Serahs, and tacked an additional “modifier” (3-digit) number on it (following the procedure on the readme file).

    The question is whether the duplicates within each station are actually bona fide duplicates (i.e. whether all the 001* are actual Serahs records, and all the 002* are actual Kuska records). If it were not so, then the “duplicate” flag would not be reliable in this case (and thus, presumably, in others). Right?

    rb: edit to fix quote tags

  9. 2010 July 22 at 9:18 am

    toto, I think you have the gist of it.

    0010,0011,0012 are all likely to be records from Serahs
    0020,0021 are all likely to be records from Kuska

    But we can only confirm that by locating data sources that match 0010,0012, and 0020. I think we *might* be able to find that other data set.

    In fact, I think that there may be more than “one” other data set. The very early records 1904-1935 are possibly available in the early editions of World Weather Records. Later records may be in MCDW or some other more modern aggregate. Just don’t know *yet*.

  10. carrot eater
    2010 July 22 at 10:15 am

    Ron, do you have access to the MCDW? I checked, and unless I missed something it isn’t the source you want (it only has 38974 as Saragt for recent years). It’s a horrific format for this sort of work, by the way – a separate PDF for every month.

    I’d have thought the NCAR wmssc set would be promising, but unless I missed it, I don’t think it’s the answer here.

    So maybe it’s the world weather records we want. If I get bored, I might take a look.

  11. RomanM
    2010 July 22 at 10:26 am

    #8 toto

    GISS seems to think that both of these records belong to the same station because it uses the same station number AND the same three digit modifier for them:

    http://data.giss.nasa.gov/cgi-bin/gistemp/findstation.py?datatype=gistemp&data_set=0&name=Kuska

    Only the duplicate number is different.

  12. JR
    2010 July 22 at 10:39 am

    Good job identifying ndp040 as the source.

    0010, 0011, and 0012 are all from station 38974. 0010 and 0011, which have numbers that are not identical but very close are from the average of TMID. 0012 is from the average of TMAX and TMIN.

    0020 and 0021 are from station 38987. 0020 is from the average of TMAX and TMIN. 0021 is from the average of TMID.

  13. 2010 July 22 at 10:45 am

    Roman, GISS isn’t a data source. They are a data consumer. >90% of their stations are simply GHCN stations. For the US, they swap in USHCN. They pick up additional Antarctic data from SCAR.

    And JR finishes the analysis! Nice work.

  14. carrot eater
    2010 July 22 at 10:59 am

    on the terminology – what you are meaning by “average of Tmid”?

  15. RomanM
    2010 July 22 at 11:21 am

    Ron, I didn’t imply that they were a data source. I merely pointed out that, like myself, they seemed to be making the same interpretation that these were duplicate records from the same station.

  16. JR
    2010 July 22 at 11:25 am

    TMID is defined as the “average of all observations for each calendar day”. Since the number of observations per day varies over time, the TMID time series should be avoided.

  17. carrot eater
    2010 July 22 at 11:29 am

    JR, That’s what I assumed. Thank you.

  18. 2010 July 22 at 3:28 pm

    Posted an update based on JR’s comments. Agree that 012 and 020 derive from Tmean = (Tmax + Tmin)/2. Not convinced that 010 is another Tmid record from NDP040.

  19. carrot eater
    2010 July 22 at 4:26 pm

    So where do we stand? This whole thing was essentially prompted by Roman being wary of the magnitude of the difference between 020 and 021, which should be duplicates, which often are due to different ways of calculating the monthly mean.

    So we have indeed confirmed that 020 and 021 are just coming from different ways of calculating the monthly mean?

  20. 2010 July 22 at 5:03 pm

    I think we know where 0020 and 0021 originate. The first is likely a Tmean=(Tmax+Tmin)/2 record and the second is definitely a Tmid record from the same source that was archived as NDP040. The difference between the two records remains and Roman’s concern about what might be covered by “dupe” is well placed.

    I think it also raises some questions about the “over 100 methods” of calculating a monthly mean.

  21. carrot eater
    2010 July 22 at 5:47 pm

    OK, but the difference is due to the advertised reason.

    They never hid that the different ways of calculating a mean can give some bigger-than-you-might expect differences.

    “Indeed, the differences between two different methods of calculating mean temperature at a particular station can be greater than the temperature difference from two neighboring stations.”

    But being more aware of this, if the bloggers want to change their approach with the duplicates, that’s good. I think using the RSM or LSM for combining duplicates is the way to go. I don’t know what they all do, though.

    The other thing to not gloss over is how to handle the sublocations within the same ID. Have to make sure the spatial weighting is appropriate.

  22. 2010 July 22 at 7:50 pm

    KISS: Toss out all the dupes and leave only the “0010” record.

    What you lose in information you gain in simplicity.

    Don’t get me wrong, CE, I personally am not opposed to trying to cram in more information through the use of multiple records and dupes, to try to suppress the noise and increase the signal through the use of homogenization and other station specific or data set general techniques.

    But one of the design goals where I come from is to “Marine-proof” things. Keep it simple; make it damn hard to break. You design for the 18 year old kids in the field; kids who are going to use it in environments you haven’t dreamed of in ways that would make you blush.

    One of the issues with CRUTEM and GISTEMP is that they are written for specialty audiences and lack some degree of transparency – GISTEMP through complexity of code and CRUTEM from failure to fully disclose. I don’t condemn them for that. Their primary audience was specialists. GHCN has some of the same faults – lack of traceability for instance.

    But the game is changing. And transparency and simplicity have additional value today … because of the wider audience, because of the money, because of the political policy fallout.

    Trying to do “too much”, to cram in “too many features,” has a negative effect for this broader audience. Again, this is not pointing fingers, but a broader message that needs to be heard. Some things can’t be “dumbed down,” but other things can, and I think that both the code and the data can use a bit of “dumbing down.”

    Of course, that’s a lot of chutzpah for a guy who hasn’t mastered homogenization techniques! 🙂

  23. carrot eater
    2010 July 23 at 3:25 am

    You’re only setting the stage for the next meme: “they cherrypicked the data and put the hockey sticks in 0010, and their meat-grinders throw away the rest!!!”

  24. carrot eater
    2010 July 23 at 4:40 am

    Well, I see your drive for simplicity, and there is room for both intricate and simple out there; in fact there’s no reason why one blogger can’t produce both.

    But I think it’s worthwhile continuing this line of poking around. Maybe find a few worst case scenarios – duplicates that have large absolute differences, and correlate poorly with each other – and see what’s going on there. In the end, you want to have made sure that simply using RSM/LSM to combine duplicates is reasonable in all or almost all cases.

    But still, the worst cases I worry more about is where there are big gaps in data.

  25. 2010 July 29 at 7:16 am

    Anyone not wanting to deal with duplicates could use the output from Step 1 of ccc-gistemp:

    python tool/run.py -s 0,1
    

    That will run Steps 0 and 1; Step 1 output is in work/v2.step1.out in v2.mean format (and is therefore at 0.1C precision).

    Step 1 in ccc-gistemp is where duplicate records are combined (by RSM). There are still stations with duplicates in Step 1 output (because they could not be combined for lack of overlap).

  26. 2010 July 29 at 7:18 am

    I second what carrot eater said. Large gaps in records could be a problem. At some point in the Glorious Future I intend to modify ccc-gistemp so that it will split records with large gaps and treat them as separate “duplicates”.
