GSOD to GHCN: Round 2
UPDATE 2010 08 20: THIS POST IS RESTORED FROM A YAHOO CACHE AND SUFFERS FROM FORMATTING ISSUES. MY APOLOGIES FOR THE TECHNICAL DIFFICULTY
Introduction
My initial cut at processing the Global Surface Summary of the Day (GSOD) data was made at GSOD: Global Surface Summary of the Day. I noted that I had made numerous hand edits and manually run command line scripts for that initial processing. In this post, I lay out my first pass at a completely automated run.
Scripts
I used four scripts to process the data.
In addition, I hand-crafted a GHCN/GSOD country code chart:
runGSOD.sh
#!/bin/bash # initialize variables dir=/mnt/usb/sysadmin/projects/co2/gsod ddir=$dir/data fdir=$dir/data/ftproot sdir=$dir/scripts rdir=$dir/results mkdir -p $ddir mkdir -p $fdir mkdir -p $rdir # required input files ishfile=$fdir/ish-history.txt ccfile=$ddir/ghcn.gsod.cc.csv tmpM1=$ddir/tmp.gsod.mean.1 tmpM2=$ddir/tmp.gsod.mean.2 tmpM3=$ddir/tmp.gsod.mean.3 tmpM4=$ddir/tmp.gsod.mean.4 tmpM5=$ddir/tmp.gsod.mean.5 tmpM6=$ddir/tmp.gsod.mean.6 tmpM7=$ddir/tmp.gsod.mean.7 tmpI1=$ddir/tmp.gsod.inv.1 tmpI2=$ddir/tmp.gsod.inv.2 tmpI3=$ddir/tmp.gsod.inv.3 tmpI4=$ddir/tmp.gsod.inv.4 tmpI5=$ddir/tmp.gsod.inv.5 tmpI6=$ddir/tmp.gsod.inv.6 tmpC1=$ddir/tmp.gsod.stncc.1 tmpC2=$ddir/tmp.gsod.stncc.2 tmpC3=$ddir/tmp.gsod.stncc.3 tmpC4=$ddir/tmp.gsod.stncc.4 tmpC5=$ddir/tmp.gsod.stncc.5
Nothing very exciting above. There are some unused variables. Note, though, the ccfile with the GHCN-GSOD country names and codes.
#------------------------------------------------------------------------------
# Download the GSOD tarballs from the NCDC ftp server
#------------------------------------------------------------------------------
# approx 2.5 hours
getData(){
echo "Getting FTP Data"
cd $fdir
for i in `seq 1929 2010`
do
echo $i
mkdir $i
cd $i
wget -np ftp://ftp.ncdc.noaa.gov/pub/data/gsod/$i/gsod_$i.tar
cd $fdir
done
}
#------------------------------------------------------------------------------
# Unpack the yearly tarballs
#------------------------------------------------------------------------------
unzipData() {
echo "Unpacking FTP Data"
cd $fdir
for i in `seq 1929 2010`
do
cd $i
tar -xf gsod_$i.tar
# arg list is too long in 2010?
for k in `seq 0 9`
do
for j in `ls -b $k*gz`
do
gunzip $j
done
done
cd $fdir
done
}
# make file list for each year
makeFileList() {
cd $fdir
for i in `seq 1929 2010`
do
cd $i
echo $i
# get rid of the old list
if [[ -e gsod_${i}_stnlist.txt ]]
then
rm gsod_${i}_stnlist.txt
fi
# make a new list
for j in `tar tvf gsod_${i}.tar | cut -b 49-`
do
echo $j | sed 's/\.\///' | sed 's/\.gz//' >> gsod_${i}_stnlist.txt
done/
cd $fdir
done
}
The station list for each year is read by the synopMonthlyTmean.R script below.
makeMonthlyMean() {
cd $fdir
R --slave --no-save --no-restore --no-environ --silent < $sdir/synopMonthlyTmean.R >> $tmpM1
# clean up station ids
sed 's/\.op,/,/' < $tmpM1 > $tmpM2
# sort by station id (not years)
sort $tmpM2 > $tmpM3
}
Obviously, refer to synopMonthlyTmean.R at this point.
getGsodCountry() {
cut -b 1-12 $tmpM3 | sort -u -n > $tmpC1
rm $tmpC2 > /dev/null
rm $tmpC3 > /dev/null
rm $tmpC4 > /dev/null
# create new station list with country abbrev
for i in `cat $tmpC1`
do
j=`echo $i | sed 's/-/ /'`
cc1=`grep "^$j" $ishfile | cut -b 47-48`
cc2=`grep "^$j" $ishfile | cut -b 44-45`
cc=""
if [ ! "${cc2}" == "" ]; then cc=${cc2}; fi;
if [ ! "${cc1}" == "" ]; then cc=${cc1}; fi;
echo ${cc}-${i} >> $tmpC2
done
# remove the unknown ids
grep "^-" $tmpC2 > $tmpC2.nocc
grep -v "^-" $tmpC2 > $tmpC3
# change country abbrv in to 3 digit ids
for i in `cat $tmpC3`
do
cc=`echo $i | cut -b 1-2`
ccc=`grep -m 1 ",${cc}," $ccfile | cut -b 1-3`
wmo=`echo $i | cut -d '-' -f 2`
usaf=`echo $i | cut -d '-' -f 3`
id=0
if [[ $wmo < 999999 ]]; then id=${ccc}${wmo}00; fi
if [[ $wmo > 999998 && $usaf < 99999 ]]; then id=${ccc}${usaf}000; fi
if [[ ! $id == 0 ]]; then echo "$i,$id" >> $tmpC4; fi
if [[ $id == 0 ]]; then echo ERROR:$wmo-$usaf; fi
done
# still seem to have a few phantoms
grep -v "^.\{25\}00" $tmpC4 > $tmpC4.nocc
grep "^.\{25\}00" $tmpC4 > $tmpC5
}
Here we begin to convert the native GSOD station naming conventions into GHCN-like station ids. I was unable to map 901 stations into an appropriate country code at this stage. There is surely room for improvement on the next pass.
formatGhcnGsodMean () {
src=$ddir/my.gsod.mean.3
idfile=$ddir/my.gsod.stncc.5
dest=$ddir/my.gsod.mean.4
dest2=$ddir/my.gsod.mean.5
dest3=$ddir/my.gsod.mean.6
rm $dest
for i in `cat $tmpC5`
do
oldid=`echo $i | cut -b 4-15`
newid=`echo $i | cut -b 17-27`
grep "^${oldid}" $tmpM3 | sed 's/'${oldid}'-/'${newid}'/' >> $tmpM4
done
# sort by stn id
sort $tmpM4 > $tmpM5
# the last step
$sdir/formatGhcnGsodMean.pl $tmpM5 > $tmpM6
}
We go back to the “mean” file and replace the native GSOD station ids with GHCN station ids and move from csv format to the GHCN fixed width.
makeGhcnGsodStationInv () {
# pull out matching station data from the ish station history
rm $tmpI2
rm $tmpI3
cut -b 1-11 $tmpM6 | sort -u > $tmpM7
for i in `cat $tmpM7`
do
stn6=`echo $i | cut -b 4-9`
stn5=`echo $i | cut -b 4-8`
grep -m 1 "^$stn6 " $ishfile | sed 's/,/ /g' >> $tmpI5
if [[ $? > 0 ]]; then grep -m 1 "^...... $stn5" $ishfile | sed 's/,/ /g' >> $tmpI5; fi
done
# do the format
$sdir/formatGhcnGsodInventory.pl $tmpI5 $tmpC5 > $tmpI6
}
For every station in the mean file, create a GHCN type entry in a station inventory. This entry will have dummy values for most of the metadata.
getGhcnInventory() {
mkdir results 2>/dev/null
fileIn=$tmpI6
fileOut=$rdir/my.gsod.inv
cd $dir
cd ../ghcn-java
java -cp src/cdm/src/main/java:src cx.rhinohide.ghcn.WhiteboardGhcn $fileIn > $fileOut
cd $dir
}
Here we use the GHCN metadata code explored in GHCN Station Inventory: A Reconstruction.
That’s it. We have taken the GSOD data, computed monthly means, created GHCN style station ids with GHCN country codes (where possible), reformatted the monthly means into a GHCN v2.mean format, and created a GHCN style station inventory file to go with it. Data linked below.
synopMonthlyTmean.R
for (year in 1929:2010) {
filelist <- paste(year,"/gsod_",year,"_stnlist.txt",sep="")
files <- read.table(filelist,header=F,skip="1")
len <- length(files$V1)
cn <- c("STN","WBAN","YEARMODA","TEMP","A","DEWP","B","SLP","C","STP","D","VISIB","E","WDSP","F","MXSPD","GUST","MAX","MIN","PRCP","SNDP","FRSHTT")
for (f in 1:len){
dirfile <- paste(year,"/",files[f,],sep="")
gsod <- read.table(dirfile,header=F, skip=1, col.names=cn)
trow <- numeric(12)
minyrm <- year * 100 + 01
maxyrm <- year * 100 + 12
for (ym in minyrm:maxyrm) {
minyrmoda <- ym * 100 + 00
maxyrmoda <- ym * 100 + 32
this_mon minyrmoda & gsod$YEARMODA < maxyrmoda,]
trow[ym-minyrm+1] <- as.integer((mean(this_mon$TEMP) - 32) * (5/9) * 10)
}
line <- files[f,]
for (i in 1:12) {
line <- paste(line,trow[i],sep=",")
}
cat (line,"\n")
}}
Hmmm. I haven’t looked at this since I first slapped this into place. Does the mean handle nulls correctly? I need to bang on this again to make sure it is doing what I intended.
Data
There are some redistribution restrictions I am obliged to post. You can read the original conditions here:
http://www.ncdc.noaa.gov/cgi-bin/res40.pl?page=gsod.html
The following data and products may have conditions placed on their international commercial use. They can be used within the U.S. or for non-commercial international activities without restriction. The non-U.S. data cannot be redistributed for commercial purposes. Re-distribution of these data by others must provide this same notification. A log of IP addresses accessing these data and products will be maintained and may be made available to data providers.
For details, please consult:
http://www.wmo.int/pages/about/Resolution40.html
Discussion
I have only run this script in stages, not yet end-to-end. So there may be a few glitches left. I have begun the end-to-end run today and will update this post in 4 days!
A couple of links for a little more background information about GSOD and its relationship to ISH.
GCMD: Global Surface Summary of the Day – GSOD; Entry ID: gov.noaa.ncdc.C00516
Integrated Surface Hourly (DSI-3505) Entry ID: gov.noaa.ncdc.C00532
The FCC Integrated Surface Hourly Database, A New Resource for Global Climate Data
Some of these scripts are no more
Is there any possibility to download all scripts needed to make my.gsod.mean file?