Home > Uncategorized > GSOD to GHCN: Round 2

GSOD to GHCN: Round 2

2010 June 26

UPDATE 2010 08 20: THIS POST IS RESTORED FROM A YAHOO CACHE AND SUFFERS FROM FORMATTING ISSUES. MY APOLOGIES FOR THE TECHNICAL DIFFICULTY

Introduction

My initial cut at processing the Global Surface Summary of the Day (GSOD) data was made at GSOD: Global Surface Summary of the Day. I noted that I had made numerous hand edits and manually run command line scripts for that initial processing. In this post, I lay out my first pass at a completely automated run.

Scripts

I used four scripts to process the data.

runGSOD.sh

formatGhcnGsodMean.pl

formatGhcnGsodInventory.pl

synopMonthlyTmean.R

In addition, I hand-crafted a GHCN/GSOD country code chart:

ghcn.gsod.cc.csv

runGSOD.sh

#!/bin/bash

# initialize variables

dir=/mnt/usb/sysadmin/projects/co2/gsod

ddir=$dir/data

fdir=$dir/data/ftproot

sdir=$dir/scripts

rdir=$dir/results

mkdir -p $ddir

mkdir -p $fdir

mkdir -p $rdir

# required input files

ishfile=$fdir/ish-history.txt

ccfile=$ddir/ghcn.gsod.cc.csv

tmpM1=$ddir/tmp.gsod.mean.1

tmpM2=$ddir/tmp.gsod.mean.2

tmpM3=$ddir/tmp.gsod.mean.3

tmpM4=$ddir/tmp.gsod.mean.4

tmpM5=$ddir/tmp.gsod.mean.5

tmpM6=$ddir/tmp.gsod.mean.6

tmpM7=$ddir/tmp.gsod.mean.7

tmpI1=$ddir/tmp.gsod.inv.1

tmpI2=$ddir/tmp.gsod.inv.2

tmpI3=$ddir/tmp.gsod.inv.3

tmpI4=$ddir/tmp.gsod.inv.4

tmpI5=$ddir/tmp.gsod.inv.5

tmpI6=$ddir/tmp.gsod.inv.6

tmpC1=$ddir/tmp.gsod.stncc.1

tmpC2=$ddir/tmp.gsod.stncc.2

tmpC3=$ddir/tmp.gsod.stncc.3

tmpC4=$ddir/tmp.gsod.stncc.4

tmpC5=$ddir/tmp.gsod.stncc.5

Nothing very exciting above. There are some unused variables. Note, though, the ccfile with the GHCN-GSOD country names and codes.

#------------------------------------------------------------------------------

# Download the GSOD tarballs from the NCDC ftp server

#------------------------------------------------------------------------------

# approx 2.5 hours

getData(){

	echo "Getting FTP Data"

	cd $fdir

	for i in `seq 1929 2010`

	do

		echo $i

		mkdir $i

		cd $i

		wget -np ftp://ftp.ncdc.noaa.gov/pub/data/gsod/$i/gsod_$i.tar

		cd $fdir

	done

}
#------------------------------------------------------------------------------

# Unpack the yearly tarballs

#------------------------------------------------------------------------------

unzipData() {

	echo "Unpacking FTP Data"

	cd $fdir

	for i in `seq 1929 2010`

	do

		cd $i

		tar -xf gsod_$i.tar

		# arg list is too long in 2010?

		for k in `seq 0 9`

		do

		for j in `ls -b $k*gz`

		do

			gunzip $j

		done

		done

		cd $fdir

	done

}
# make file list for each year

makeFileList() {

	cd $fdir

	for i in `seq 1929 2010`

	do

		cd $i

		echo $i

		# get rid of the old list

		if [[ -e  gsod_${i}_stnlist.txt ]]

		then

			rm gsod_${i}_stnlist.txt

		fi

		# make a new list

		for j in `tar tvf gsod_${i}.tar | cut -b 49-`

		do

echo $j | sed 's/\.\///' | sed 's/\.gz//' >> gsod_${i}_stnlist.txt

		done/

		cd $fdir

	done

}

The station list for each year is read by the synopMonthlyTmean.R script below.

makeMonthlyMean() {

	cd $fdir

R --slave --no-save --no-restore --no-environ --silent < $sdir/synopMonthlyTmean.R >> $tmpM1

	# clean up station ids

	sed 's/\.op,/,/' < $tmpM1 > $tmpM2

	# sort by station id (not years)

	sort $tmpM2  > $tmpM3

}

Obviously, refer to synopMonthlyTmean.R at this point.

getGsodCountry() {

	cut -b 1-12 $tmpM3 | sort -u -n > $tmpC1

	rm $tmpC2 > /dev/null

	rm $tmpC3 > /dev/null

	rm $tmpC4 > /dev/null

	# create new station list with country abbrev

	for i in `cat $tmpC1`

	do

		j=`echo $i | sed 's/-/ /'`

		cc1=`grep "^$j" $ishfile | cut -b 47-48`

		cc2=`grep "^$j" $ishfile | cut -b 44-45`

		cc=""

		if [ ! "${cc2}" == "" ]; then cc=${cc2}; fi;

		if [ ! "${cc1}" == "" ]; then cc=${cc1}; fi;

		echo ${cc}-${i} >> $tmpC2

	done

	# remove the unknown ids

	grep "^-" $tmpC2 > $tmpC2.nocc

	grep -v "^-" $tmpC2 > $tmpC3

	# change country abbrv in to 3 digit ids

	for i in `cat $tmpC3`

	do

		cc=`echo $i | cut -b 1-2`

		ccc=`grep -m 1 ",${cc}," $ccfile | cut -b 1-3`

		wmo=`echo $i | cut -d '-' -f 2`

		usaf=`echo $i | cut -d '-' -f 3`

		id=0

		if [[ $wmo < 999999 ]]; then id=${ccc}${wmo}00; fi

		if [[ $wmo > 999998 && $usaf < 99999 ]]; then id=${ccc}${usaf}000; fi

		if [[ ! $id == 0  ]]; then echo "$i,$id" >> $tmpC4; fi

		if [[ $id == 0 ]]; then echo ERROR:$wmo-$usaf; fi

	done

	# still seem to have a few phantoms

	grep -v "^.\{25\}00" $tmpC4 > $tmpC4.nocc

	grep "^.\{25\}00" $tmpC4 > $tmpC5

}

Here we begin to convert the native GSOD station naming conventions into GHCN-like station ids. I was unable to map 901 stations into an appropriate country code at this stage. There is surely room for improvement on the next pass.

formatGhcnGsodMean () {

	src=$ddir/my.gsod.mean.3

	idfile=$ddir/my.gsod.stncc.5

	dest=$ddir/my.gsod.mean.4

	dest2=$ddir/my.gsod.mean.5

	dest3=$ddir/my.gsod.mean.6

	rm $dest

	for i in `cat $tmpC5`

	do

		oldid=`echo $i | cut -b 4-15`

		newid=`echo $i | cut -b 17-27`

		grep "^${oldid}" $tmpM3 | sed 's/'${oldid}'-/'${newid}'/' >> $tmpM4

	done

	# sort by stn id

	sort $tmpM4 > $tmpM5

	# the last step

	$sdir/formatGhcnGsodMean.pl $tmpM5 > $tmpM6

}

We go back to the “mean” file and replace the native GSOD station ids with GHCN station ids and move from csv format to the GHCN fixed width.

makeGhcnGsodStationInv () {

	# pull out matching station data from the ish station history

	rm $tmpI2

	rm $tmpI3

	cut -b 1-11 $tmpM6 | sort -u > $tmpM7

	for i in `cat $tmpM7`

	do

		stn6=`echo $i | cut -b 4-9`

		stn5=`echo $i | cut -b 4-8`

		grep -m 1 "^$stn6 " $ishfile | sed 's/,/ /g' >> $tmpI5

		if [[ $? > 0 ]]; then grep -m 1 "^...... $stn5" $ishfile | sed 's/,/ /g' >> $tmpI5; fi

	done

	# do the format

	$sdir/formatGhcnGsodInventory.pl $tmpI5 $tmpC5 > $tmpI6

}

For every station in the mean file, create a GHCN type entry in a station inventory. This entry will have dummy values for most of the metadata.

getGhcnInventory() {

        mkdir results 2>/dev/null

        fileIn=$tmpI6

        fileOut=$rdir/my.gsod.inv

	cd $dir

        cd ../ghcn-java

        java -cp src/cdm/src/main/java:src cx.rhinohide.ghcn.WhiteboardGhcn $fileIn > $fileOut

        cd $dir

}

Here we use the GHCN metadata code explored in GHCN Station Inventory: A Reconstruction.

That’s it. We have taken the GSOD data, computed monthly means, created GHCN style station ids with GHCN country codes (where possible), reformatted the monthly means into a GHCN v2.mean format, and created a GHCN style station inventory file to go with it. Data linked below.

synopMonthlyTmean.R

for (year in 1929:2010) {

  filelist <- paste(year,"/gsod_",year,"_stnlist.txt",sep="")

  files <- read.table(filelist,header=F,skip="1")

  len <- length(files$V1)

  cn <- c("STN","WBAN","YEARMODA","TEMP","A","DEWP","B","SLP","C","STP","D","VISIB","E","WDSP","F","MXSPD","GUST","MAX","MIN","PRCP","SNDP","FRSHTT")

  for (f in 1:len){

	dirfile <- paste(year,"/",files[f,],sep="")

	gsod <- read.table(dirfile,header=F, skip=1, col.names=cn)

	trow <- numeric(12)

	minyrm <- year * 100 + 01

	maxyrm <- year * 100 + 12

	for (ym in minyrm:maxyrm) {

		minyrmoda <- ym * 100 + 00

		maxyrmoda <- ym * 100 + 32

		this_mon  minyrmoda & gsod$YEARMODA < maxyrmoda,]

		trow[ym-minyrm+1] <- as.integer((mean(this_mon$TEMP) - 32) * (5/9) * 10)

	}

	line <- files[f,]

	for (i in 1:12) {

		line <- paste(line,trow[i],sep=",")

	}

	cat (line,"\n")

}}

Hmmm. I haven’t looked at this since I first slapped this into place. Does the mean handle nulls correctly? I need to bang on this again to make sure it is doing what I intended.

Data

There are some redistribution restrictions I am obliged to post. You can read the original conditions here:

http://www.ncdc.noaa.gov/cgi-bin/res40.pl?page=gsod.html

The following data and products may have conditions placed on their international commercial use. They can be used within the U.S. or for non-commercial international activities without restriction. The non-U.S. data cannot be redistributed for commercial purposes. Re-distribution of these data by others must provide this same notification. A log of IP addresses accessing these data and products will be maintained and may be made available to data providers.

For details, please consult:

http://www.wmo.int/pages/about/Resolution40.html

my.gsod.mean

my.gsod.inv

Discussion

I have only run this script in stages, not yet end-to-end. So there may be a few glitches left. I have begun the end-to-end run today and will update this post in 4 days!

A couple of links for a little more background information about GSOD and its relationship to ISH.

GCMD: Global Surface Summary of the Day – GSOD; Entry ID: gov.noaa.ncdc.C00516

Integrated Surface Hourly (DSI-3505) Entry ID: gov.noaa.ncdc.C00532

The FCC Integrated Surface Hourly Database, A New Resource for Global Climate Data

Advertisements
  1. p
    2013 May 3 at 6:32 am

    Some of these scripts are no more 😉 Is there any possibility to download all scripts needed to make my.gsod.mean file?

  1. No trackbacks yet.
Comments are closed.