GHCN: A Preview of Version 2
Now that GHCN v3 is almost upon us … a look back to a time when GHCN v2 was almost upon us …
(and, yes, the section on Quality Control really does have two sections 3.2 :lol:)
3. QUALITY CONTROL
GHCN version 2.0 will be primarily constructed from what Guttman (1991) calls “secondary data sources … that have been compiled, adjusted, or summarized by anyone other than the researcher using the data.” When using such data, it is imperative that data “validity” and “accuracy” be assessed. “Validity” refers to whether the data fit the particular application; “accuracy” refers to the reliability of the data. To address both concerns, an extensive quality control procedure was developed. In theory, a quality control procedure is intended to ensure that data meet certain standards of excellence. In practice, quality control implies looking for gross data errors. These “errors” take many forms, ranging from inappropriate data to magnetic media problems to formatting errors to data gaps to outliers to time series inhomogeneities. The procedure employed in GHCN version 2.0 is designed to catch the most pervasive and/or egregious problems that were discussed at the International Workshop on Quality Control of Monthly Climatic Data (Peterson, 1994). The procedure consists of four parts:
1. selection of data sets,
2. preprocessing of data files,
3. duplicate station elimination, and
4. outlier checks.
3.1 The Right Stuff
Data set “validity” was determined by reviewing any documentation and publications related to a particular data set. A number of data sets (or portions thereof) will not be included in GHCN version 2.0 because they do not meet the “validity” criterion; that is, something inherent in the way they were compiled rendered them inappropriate for this application.
Among those not included are any sets in which monthly means/totals were derived from synoptic reports. Comparisons between monthly temperature and precipitation data created from synoptic reports and CLIMAT data, which are monthly data created by the country taking the observations, indicate that synoptically derived values are not of the quality required for GHCN (C. Ropelewski, pers. comm., 1993).
Archives in which climatic values have undergone major “adjustments” (e.g., corrections for discontinuities) will also be excluded. Although it is sometimes difficult to find data that have not been sanitized in any manner, the intermingling of “pure” and “adjusted” records can create some acute spatial problems. For example, data for Carlsbad, New Mexico, were available from two data sets that were used to compile GHCN version 1.0. One series only contained raw data, whereas the other contained data in which discontinuity adjustments were applied. Because the annual temperatures between the two correlated at an r of less than 0.01 and their metadata were dissimilar, both series were included in version 1.0 as separate stations.
3.2 Preprocessing of Data Files
Data received are rarely in a condition that would permit immediate use, regardless of the source or data set documentation. To guarantee data of the highest possible quality, extensive preprocessing reviews are conducted. The reviews involve checking data sets for completeness, reasonableness, and consistency. Although they have common objectives, the reviews are tailored to each data set, often requiring extensive programming efforts and days (or even weeks) of manpower. Some of the routine preprocessing checks applied to each new data set include the following:
1. determining if the physical characteristics of each data file (e.g., number of lines, record length) agree with supplied documentation (if any);
2. determining if variable storage locations (i.e., the columns in which variables are located) and variable types (i.e., integer, real, or character) are consistent throughout each file and agree with supplied documentation (if any);
3. determining the number of unique stations in each file, whether the file contains any duplicate stations, and (by comparison with documentation) if any stations are missing or if any undocumented stations are present;
4. determining if all date variables have reasonable values, if each file is chronologically sorted, if the period of record for each station agrees with supplied documentation (if any), and if duplicate station/date entries are present;
5. determining if the units of each climate variable agree with supplied documentation (if any) or determining the units if documentation is lacking; searching for specially defined values (e.g., missing value codes or trace rainfall codes) and undocumented, physically meaningless values; checking the frequency of occurrence of all possible data values;
6. searching for the presence of various flag codes and whether they have documented meanings; searching for contradictions between flag codes and data values and between flag codes and other flag codes; and
7. determining if the units of each metadata variable (e.g., coordinates or elevation) agree with the supplied documentation (if any) or determining the units if documentation is lacking; searching for the presence of specially defined metadata values or undocumented, physically meaningless values.
When problems are detected, the original data set compiler is contacted for additional information or revised files, if possible.
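Checks of this kind are straightforward to automate. The sketch below illustrates a few of them (record length, chronological order, duplicate station/date entries) on a hypothetical fixed-width record layout; the field widths and station numbers are invented for illustration and are not the actual GHCN format:

```python
# Sketch of preprocessing checks on a hypothetical fixed-width layout:
# 5-char station id, 4-char year, then twelve 5-char monthly values.
RECORD_LEN = 5 + 4 + 12 * 5

def preprocess_report(lines):
    """Return a dict describing gross structural problems in raw records."""
    problems = {"bad_length": [], "unsorted": [], "dup_station_year": []}
    seen = set()
    prev_key = None
    for i, line in enumerate(lines):
        if len(line) != RECORD_LEN:            # physical characteristics check
            problems["bad_length"].append(i)
            continue
        key = (line[:5], int(line[5:9]))       # (station, year)
        if key in seen:                        # duplicate station/date entries
            problems["dup_station_year"].append(key)
        seen.add(key)
        if prev_key is not None and key < prev_key:   # chronological sorting
            problems["unsorted"].append(i)
        prev_key = key
    return problems

records = [
    "10001" + "1991" + " -12 " * 12,
    "10001" + "1990" + " -10 " * 12,   # out of chronological order
    "10001" + "1990" + " -10 " * 12,   # duplicate station/year entry
    "10002" + "1991",                  # truncated record
]
report = preprocess_report(records)
```

In practice each such finding would be reported back to the data set compiler rather than silently repaired, consistent with the procedure described above.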
3.2 Too Much of a Good Thing
A time series for a given station can frequently be obtained from more than one source data set. For example, precipitation data for Beijing, China, are available in both Eischeid et al. (1991) and Shiyan et al. (1991). When “merging” data from multiple sources, it is important to identify these duplicate station time series because (1) the inclusion of multiple time series for a particular location can create acute spatial problems, as in the case of Carlsbad, New Mexico; (2) each may have a different period of record that, when “mingled,” results in a single series that has a longer or more complete record than either of the originals; and (3) the intercomparison of comparable year/month values in each can help identify other data quality problems. Unfortunately, a number of factors complicate the identification of duplicate stations. These include:
1. nonidentical or inaccurate station numbers, names, coordinates, and elevations (i.e., metadata); and
2. data which are the same for some years and not others (this can be caused by poor data processing practices or inhomogeneity adjustments having been applied to one or both series).
The duplicate station elimination procedure used in GHCN version 2.0 will take such problems into account.
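One simple ingredient of such a procedure is to compare overlapping year/month values between two candidate series, tolerating the second complication above (agreement in some years but not others). A minimal sketch, in which the series layout and any decision threshold are illustrative assumptions rather than the actual GHCN algorithm:

```python
def overlap_agreement(a, b):
    """Compare two monthly series for possible duplication.

    a, b: dicts mapping (year, month) -> value, or None if missing.
    Returns (number of overlapping months, fraction with identical values).
    """
    common = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if not common:
        return 0, 0.0
    same = sum(1 for k in common if a[k] == b[k])
    return len(common), same / len(common)

# Two series that agree for 1951-1952 but diverge in 1953 (e.g., an
# inhomogeneity adjustment applied to only one of them): despite the
# disagreement, a high agreement fraction suggests one underlying station.
s1 = {(y, m): 100 + y + m for y in (1951, 1952, 1953) for m in range(1, 13)}
s2 = dict(s1)
for m in range(1, 13):
    s2[(1953, m)] += 5          # adjustment applied to one series only

n, frac = overlap_agreement(s1, s2)   # 24 of 36 overlapping months identical
```

A real merging procedure would combine such value comparisons with the (possibly inaccurate) metadata, since neither alone is decisive.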
3.3 The Needle in the Haystack
GHCN version 2.0 will contain a massive quantity of data (at least 10,000 temperature and precipitation stations). Considering the available resources, it will not be possible to search for extreme values/outliers in the same manner employed when compiling version 1.0 (i.e., by plotting and visually inspecting every time series). As a result, it has been necessary to develop an inventory of the most common and egregious errors detected when compiling version 1.0 and then to devise tests that can detect those problems. In general, the tests fall into three categories: serial checks, spatial checks, and intervariable checks.
Serial checks involve comparing an observation in a time series with some other observation in the same time series and then deciding, based upon some threshold, whether the observation is problematic. Examples of problematic values from a serial perspective include the following:
1. cases in which the same data value occurs for several consecutive months in one or more years;
2. cases in which consecutive years have the same data;
3. cases in which the same data value occurs in the same month of several consecutive years;
4. cases in which a month with missing data was “left out” of the file rather than being set to missing, causing all subsequent data values to be placed in the preceding month (e.g., October’s data value would be listed under September);
5. cases in which an observation appears extreme when compared with other values in the same month;
6. cases in which an observation appears extreme when compared with values in adjacent months; and
7. cases in which there is a severe change in the mean or variance of the series.
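The fifth serial check (an observation extreme relative to other values in the same calendar month) can be sketched as a simple z-score screen; the threshold of three standard deviations is an illustrative assumption, not the value used in GHCN:

```python
from statistics import mean, stdev

def same_month_outliers(series, z_max=3.0):
    """Flag (year, month) entries far from that calendar month's mean.

    series: dict mapping (year, month) -> value.
    """
    by_month = {}
    for (year, month), v in series.items():
        by_month.setdefault(month, []).append(v)
    flagged = []
    for (year, month), v in series.items():
        vals = by_month[month]
        if len(vals) < 3:
            continue                       # too few values to judge
        m, s = mean(vals), stdev(vals)
        if s > 0 and abs(v - m) / s > z_max:
            flagged.append((year, month))
    return flagged

# Thirty years of plausible January temperatures plus one gross error
jan = {(1900 + y, 1): -5.0 + (y % 7) * 0.5 for y in range(30)}
jan[(1931, 1)] = 25.0                      # e.g., a sign or units error
outliers = same_month_outliers(jan)
```

The other serial checks (repeated values, shifted months, changes in mean or variance) would be separate tests over the same series structure.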
Spatial checks involve comparing data at one station to data at another and then deciding, based upon some threshold, whether the data are problematic. Examples of problematic values from a spatial perspective include the following:
1. cases in which an observation appears extreme when compared with values at adjacent stations; and
2. cases in which the long-term mean at a station appears extreme when compared with the long-term means at nearby stations.
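The second spatial test can be sketched by comparing each station's long-term mean with the median of its neighbors' means; the departure threshold and the neighbor lists here are illustrative assumptions:

```python
from statistics import median

def spatial_mean_outliers(station_means, neighbors, max_departure=5.0):
    """Flag stations whose long-term mean departs strongly from neighbors.

    station_means: dict station id -> long-term mean
    neighbors: dict station id -> list of neighboring station ids
    """
    flagged = []
    for stn, m in station_means.items():
        nbr_means = [station_means[n] for n in neighbors.get(stn, [])
                     if n in station_means]
        if len(nbr_means) < 3:
            continue                        # too few neighbors to judge
        if abs(m - median(nbr_means)) > max_departure:
            flagged.append(stn)
    return flagged

# Four mutually consistent stations and one whose mean is far afield,
# e.g., because of bad coordinates or a units problem
means = {"A": 14.8, "B": 15.1, "C": 14.6, "D": 15.3, "E": 27.0}
nbrs = {s: [t for t in means if t != s] for s in means}
suspect = spatial_mean_outliers(means, nbrs)
```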
Intervariable checks are quite straightforward. They include the following:
1. cases which violate the relationship: minimum temperature <= mean temperature <= maximum temperature; and
2. cases which violate the relationship: station pressure <= sea level pressure (unless the station is below sea level).
It should be noted that erroneous values will not be corrected; rather, they will be flagged as problematic, stored in a separate file, or set to missing.
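The two intervariable relationships are direct to encode; a minimal sketch (the record format and field names are assumptions for illustration):

```python
def intervariable_violations(records):
    """Flag records violating tmin <= tmean <= tmax or p_stn <= p_slp.

    records: list of dicts; temperature keys are tmin/tmean/tmax,
    pressure keys are p_stn/p_slp with station elevation in elev (m).
    """
    flagged = []
    for i, r in enumerate(records):
        if {"tmin", "tmean", "tmax"} <= r.keys():
            if not (r["tmin"] <= r["tmean"] <= r["tmax"]):
                flagged.append(i)
                continue
        if {"p_stn", "p_slp"} <= r.keys():
            # station pressure may exceed sea level pressure only when
            # the station itself is below sea level
            if r["p_stn"] > r["p_slp"] and r.get("elev", 0.0) >= 0.0:
                flagged.append(i)
    return flagged

obs = [
    {"tmin": 2.0, "tmean": 8.0, "tmax": 14.0},          # consistent
    {"tmin": 9.0, "tmean": 8.0, "tmax": 14.0},          # tmin > tmean
    {"p_stn": 1015.0, "p_slp": 1010.0, "elev": 350.0},  # violation
    {"p_stn": 1020.0, "p_slp": 1013.0, "elev": -80.0},  # below sea level: ok
]
bad = intervariable_violations(obs)
```

Consistent with the note above, flagged records would be set aside rather than altered.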
4. DATA ADJUSTMENTS
The importance of using homogeneous climatological time series in climate research has recently received much attention. A homogeneous time series is defined as one where variations are caused only by variations in weather and climate (Conrad and Pollak, 1962). Using climatological time series containing artificially induced variations can lead to inconsistent conclusions. For example, Hansen and Lebedeff (1987) used data that had not been adequately examined for inhomogeneities and provided analyses of temperature trends for different regions of the globe. One station, St. Helena Island, was used to represent a large area in the tropical south Atlantic, and indicated a considerable amount of warming over the past century. However, most of the warming apparent in the temperature record occurred when the elevation of the station decreased approximately 200 meters because of a move in 1972, causing an abrupt warming of approximately 2°C (P. Michaels, pers. comm., 1992).
A wide variety of factors can create an inhomogeneous time series. Sometimes inhomogeneities in time series occur as a gradual, artificially induced trend (e.g., urban warming, drift in instrument calibration). Others are manifest as abrupt changes in scale or variance (e.g., station relocations, instrumentation replacements). Because numerous applications require homogeneous time series, GHCN version 2.0 will include temperature and precipitation time series in which discontinuities have been removed.
The technique to detect discontinuities in station temperature time series consists of four parts. First, a homogeneous reference time series for each station is created from nearby stations (Peterson and Easterling, 1994). Second, the station time series is subtracted from its reference series to create a difference series. Third, the technique tests the difference series for changes in statistical characteristics indicating a discontinuity (Easterling and Peterson, 1994). And fourth, the magnitude of the inhomogeneity is determined and an adjustment is made in the station data to account for the discontinuity. This technique will be applied to all GHCN temperature data to create a data set in which much of the effect of inhomogeneities such as station moves, changes in instrumentation, and changes in station environment has been systematically and objectively removed.
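The four steps can be illustrated with a toy example. The actual tests of Peterson and Easterling are regression-based; the simple mean-difference split and fixed threshold below are assumptions for illustration only:

```python
from statistics import mean

def detect_and_adjust(station, reference, threshold=1.0):
    """Toy discontinuity detection and adjustment on annual series (lists).

    Builds the difference series (step 2), finds the split point with the
    largest change in mean (a crude stand-in for step 3), and adjusts the
    earlier segment if the change exceeds the threshold (step 4).
    """
    diff = [s - r for s, r in zip(station, reference)]
    best_i, best_delta = None, 0.0
    for i in range(2, len(diff) - 2):
        delta = mean(diff[i:]) - mean(diff[:i])
        if abs(delta) > abs(best_delta):
            best_i, best_delta = i, delta
    if best_i is None or abs(best_delta) < threshold:
        return station, None
    adjusted = [v + best_delta for v in station[:best_i]] + station[best_i:]
    return adjusted, (best_i, best_delta)

ref = [15.0] * 10                  # step 1: homogeneous reference series
stn = [14.0] * 5 + [16.0] * 5     # a 2-degree jump, e.g., a station move
adj, found = detect_and_adjust(stn, ref)
```

After adjustment the earlier segment is raised to match the later one, removing the artificial step while leaving the reference-relative climate signal intact.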
Adjustments to precipitation data will be primarily based on metadata, with an emphasis on accounting for changes in the catch of frozen precipitation caused by changes in instrumentation, especially the installation of wind shields (Groisman, 1991). In general, stations where metadata indicate no change in observing practices or instrumentation will comprise the bulk of the homogeneous precipitation data set.
The Global Historical Climatology Network: A Preview of Version 2
Vose, Peterson, Schmoyer, Eischeid