DMSP: Unfinished Business
Building my own gridded temperature anomaly code introduced me to the R “raster” package. One of the things that both I and Mosher realized pretty quickly is that much of the GHCN metadata work I had been doing this spring could be refactored in R. The original work was done with a mix of original and extended java classes developed from netcdf-java as well as a Linux version of the gdal toolkit. With Peter O’Neill quietly asking me some questions about those results, it’s well past time to take a another look.
The R raster code was ridiculously easy to implement. What originally took a few weeks of creating new java classes, coding some station objects, writing some wrapper scripts, and testing different tiff files in my free time took less than an hour to reproduce in R.
The first thing to look at is whether the new code reproduces the same results as the old code. Here I pull a list of brightness values off the GISS-style metadata file I built back in May and compare it to a list of brightness values using the R raster code. This DMSP file is “world_stable_lights.tif”
library("raster") # http://data.giss.nasa.gov/gistemp/sources/GISTEMP_sources.tar.gz # http://rhinohide.org/rhinohide.cx/co2/ghcn-java/results/v2.giss.inv-latest.txt # http://ngdc.noaa.gov/dmsp/data/web_data/stable_lights (world_stable_lights.tar) # http://www.ngdc.noaa.gov/dmsp/download_rad_cal_96-97.html (rad_cal.tar) # get table with coordinates invcols <-c("id","name","lat","lon","alt","elev","ru","pop","terr","arid", "coast","cdist","airport","adist","vegetation","abc","123","bright") invwids <- c(11,32,6,8,5,5,1,5,2,2,2,2,1,2,16,1,1,5) giss_inv <- read.fwf("../giss/data/v2.inv", widths=invwids,col.names=invcols,as.is=T) my_inv <- read.fwf("v2.giss.inv-latest.txt", widths=invwids,col.names=invcols,as.is=T) # read radiance values radcal <- raster("../dmsp/data/world_stable_lights.tif") # for each coord, get radiance radiance <- xyValues(radcal,cbind(my_inv$lon,my_inv$lat)) # compare the old with the new sd(my_inv$bright - radiance, na.rm=T) # 0 sd(giss_inv$bright - radiance, na.rm=T) # 29.25
So far so good. The R code exactly reproduces my earlier GeoTiffDataReader class. No feat of genius, to be sure, but I nice confirmation nevertheless. But the values from the “world_stable_lights” have a poor match rate with the GISS brightness values. So why am I using “world_stable_lights” for this field? That choice actually arose because I was looking for the imagery that would best let me reproduce the A/B/C brightness categories in GHCN. And I struggled a bit in my last attempt to investigate the relation between the brightness index and the ABC brightness categories. One problem is that the index is developed by GISS and the categories by GHCN and they are not clearly distinct and overlap as a quartile display will clearly show. I forgo that for the moment as a distraction but will come back to it in a later post. Today I’m concentrating on the GISS brightness.
If we want the best match for the GISS brightness index, Hansen 2010 tells us where to look:
We use a nightlight radiance data set [Imhoff et al., 1997] that is publicly available
[http://www.ngdc.noaa.gov/dmsp/download_rad_cal_96-97.html] (measurements made between
March 1996 and February 1997)] at a resolution of 30″ × 30″ (0.0083°×0.0083°), which is a
linear scale of about 1 km. In Figure 1(a) the global radiances are shown after averaging to 0.5°
× 0.5° resolution, i.e., a pixel (resolution element) in Figure 1(a) is an average over 3600 high
resolution pixels. This averaging reduces the maximum radiance from about 3000 μW/m2/sr/μm
to about 670 μW/m2/sr/μm.
There are two ‘tif’ files included in the referenced tar ball, one with 0.008333 and one with half that at 0.01666 deg. This is convenient. I didn’t notice the low res version earlier and used the R raster ‘aggregate’ function to decrease the resolution (aggregrate neighboring cells into one) during some alternate investigations into the data. It took about 4 hours or so. Far less efficient than doing the same function with gdal.
So taking a quick look at the ‘fit’ of the GISS brightness field with both the hires and lowres versions. Same basic code as above.
radcal_hi = raster("../dmsp/data/world_avg.tif") rad_hi = xyValues(radcal_hi,cbind(giss_inv$lon,giss_inv$lat)) sd(giss_inv$bright - rad_hi, na.rm=T) #  13.08520 radcal_lo = raster("../dmsp/data/world_avg_low_res.tif") rad_lo = xyValues(radcal_lo,cbind(giss_inv$lon,giss_inv$lat)) sd(giss_inv$bright - rad_lo, na.rm=T) #  12.59317
So we get a slightly better fit with the low res version.
Is it good enough? Let’s look. The defining parameter for the rural/urban split given in Hansen 2010 is 32 μW/m2/sr/μm. This is the radiance value. The tif file is scaled to a ‘DN’ also known as a ‘digital number.’ The readme in the NOAA radiance calibration tarball gives the formula for the the conversion.
world_avg – The final Radiance Calibrated world product. Each pixel
contains a DN value between 0 and 255. The values between 0
and 254 can be converted to radiance values with the equation
Radiance = DN^(3/2) * 10^(-10) watts/cm^2/sr.
The value 255 signifies that no unsaturated pixels were found
over that location for the duration of the reduced gain experiment.
32 = DN^(3/2)
32^(2/3) = DN
10 ~= DN
So the ‘brightness’ number (DN) listed in the world_avg.tif which marks the break between rural and urban (and peri-urban) for GISTEMP is roughly ’10.’
I create two masks. One for GISTEMP brightness above and below ’10′. The other for the radiance DN read from the radiance calibrated tif file. Then I look for the stations that don’t fit both dark or both bright categories. I use the low resolution file.
radcal_lo = raster("../dmsp/data/world_avg_low_res.tif") #rad_lo = xyValues(radcal_lo,cbind(giss_inv$lon,giss_inv$lat)) rad_lo = xyValues(radcal_lo,cbind(giss_inv$lon-(5*d),giss_inv$lat+(2*d))) sd(giss_inv$bright - rad_lo, na.rm=T) mask_a = giss_inv$bright > 10 mask_b = rad_lo > 10 stns_b1 = giss_inv[(mask_a) & ! mask_b,] # urban in giss, rural in tif stns_b2 = giss_inv[(!mask_a) & mask_b,] # rural in giss, urban in tif c <- nrow(giss_inv) #  7364 a <- nrow(stns_b1) #  1536 b <- nrow(stns_b2) #  156 1 - (a+b)/c # 0.7702336
The above code snippet basically takes the GISS v2.inv file and selects two categories: stations that are urban in GISTEMP but are rural in the world_avg_low_res and stations that are rural in GISTEMP but are urban in the world_avg_low_res. The fit is only fair. A 77% match rate for brightness seems pretty low if we are using the same coordinates (v2.inv) and the same data files (world_avg.tif).
Can we do better? Recalling my concerns when I last visited this data, I start searching nearby cell-offsets for a better fit.
Starting with the low resolution file and an offset that is one half cell width, I begin a manual search. My best fit is below.
d = .01666 / 2 rad_lo = xyValues(radcal_lo,cbind(giss_inv$lon-(5*d),giss_inv$lat+(2*d))) sd(giss_inv$bright - rad_lo, na.rm=T) # 8.965257 mask_a = giss_inv$bright > 10 mask_b = rad_lo > 10 stns_b1 = giss_inv[(mask_a) & ! mask_b,] # urban in giss, rural in tif stns_b2 = giss_inv[(!mask_a) & mask_b,] # rural in giss, urban in tif c <- nrow(giss_inv) #  7364 a <- nrow(stns_b1) #  760 b <- nrow(stns_b2) #  218 1 - (a+b)/c # 0.8671917
Moving to the hi res file …
d = .008333 / 2 rad_hi = xyValues(radcal_hi,cbind(giss_inv$lon-(11*d),giss_inv$lat+(5*d))) sd(giss_inv$bright - rad_hi, na.rm=T) # 8.795083 mask_a = giss_inv$bright > 10 mask_b = rad_hi > 10 stns_b1 = giss_inv[(mask_a) & ! mask_b,] # urban in giss, rural in tif stns_b2 = giss_inv[(!mask_a) & mask_b,] # rural in giss, urban in tif c <- nrow(giss_inv) #  7364 a <- nrow(stns_b1) #  746 b <- nrow(stns_b2) #  224 1 - (a+b)/c # 0.8682781
We can look at the histogram of the difference between GISTEMP and the DMSP/RC lookup. Here I use the hi-res file.
Similarly, a scatter plot using GISTEMP brightness and the DMSP/RC lookup data.
So what have I shown?
You cannot extract the GISS Brightness values straight from the world_avg.tif or world_avg_low_res.tif files that Hansen 2010 points to in the descriptive GISTEMP paper.
The ‘fit’ as defined both by standard deviation and by categorization +/- 32 10^(-10) watts/cm^2/sr 32 μW/m2/sr/μm is better when the lookup uses a cell that is shifted to the west (lon – 0.04165) and north (lat + 0.01666). . This might be indicative of a coordinate error in the DMSP/RC tif file or in the algorithms used by GISS. It might also be an artificat stemming from the use of a resolution change to the DMSP/RC at GISS. I dunno.
As a note, GISTEMP uses a threshold of DN>10 for periurban and DN>35 for urban as defined in the text_to_binary.f file
If I was using a data source different from GISS to calculate this value, I might consider my 86% match rate to be sufficient and declare victory. But 86% is not good enough when I am supposedly using the same data for input as GISS. Look for more on this.
Oh. And the R raster package makes the code for this ridiculously easy.
I leave you with some related reading on why a ‘radiative calibrated’ data set exists …
Imhoff et al. (1997) investigated how well DMSP-OLS defined urban areas in the US correspond
with census data. Using an iterative thresholding technique, they established a threshold of 89%
for the 1994-95 city lights data set. This threshold was identified as being the level where, on
average, on could shrink the city-lights but preserve the integrity of the lit urban cluster (i.e. no
internal fragmentation within the cluster). This was detected via analysis of the perimeter vs. area
of clusters, with a sudden jump in perimeter length indicative of fragmentation. When overlaid
with US Census bureau data, there was no significant difference between the two areal
assessments. This approach was further exploited by Small et al. (2005) who used this technique
to investigate more general corrections for the overglow phenomenon associated with DMSP-
OLS data (see section 3.3.1). The Imhoff et al. (1997) analysis applies only to the conterminous
United States and it is unknown how this threshold performs in other parts of the world.
Thematic Guide on Night-time light Satellite Data
Previous analyses have revealed a consistent disparity
between various spatial measures of urban extent and the
spatial extent of lighted areas in the night lights datasets
(Welch, 1980; Welch & Zupko, 1980). Specifically, the
lighted areas detected by the OLS are consistently larger
than the geographic extents of the settlements they are
associated with. The larger spatial extent of lighted area,
relative to developed land area, is sometimes referred to as
‘‘blooming’’. The blooming is the result of three primary
phenomena, including: (1) the relatively coarse spatial
resolution of the OLS sensor and the detection of diffuse
and scattered light over areas containing no light source , (2)
large overlap in the footprints of adjacent OLS pixels, and
(3) the accumulation of geolocation errors in the composi-
ting process (Elvidge et al., 2004). In the context of this
study, blooming refers to spurious indication of light in a
location that does not contain a light source.
In order to offset the blooming effect, Imhoff et al.
(1997) proposed using a threshold of 89% detection
frequency to eliminate less frequently detected lighted
pixels at the peripheries of large urban areas. Imposing a
detection frequency threshold effectively shrinks the
lighted areas so they are more consistent with admin-
istrative definitions of urban extent. The drawback
associated with detection frequency thresholds is that they
also attenuate large numbers of smaller, less frequently
detected settlements. The 89% threshold proposed by
Imhoff et al. (1997) was derived from an average of
85%, 89% and 94% thresholds determined for the cities of
Chicago, Sacramento and Miami, respectively. At the state
level, the 89% threshold significantly increased the
correlation (from 0.87 to 0.97) between lighted area and
US. Census-defined urban areas, despite the numerous
caveats of using administrative boundaries that were
discussed by Imhoff et al. (1997). In a subsequent analysis,
Sutton et al. (2001) obtained a correlation of 0.68 between
Ln(lighted area) and Ln(population) when using a threshold
of 80%. These authors recognized the limitations of using a
single threshold for a global analysis and also used a
combination of 40%, 80% and 90% thresholds for a global
analysis of lighted area and aggregate population (Sutton et
al., 2001). In another recent study (Henderson et al., 2003),
he authors were able to avoid some of the confounding
factors associated with administrative (cadastral) delinea-
tions of urban extent by comparing lighted area with urban
boundaries derived from Landsat TM imagery. Using
Supervised Maximum Likelihood classifications of Beijing,
Lhasa and San Francisco, these authors obtained optimum
thresholds of 97%, 88% and 92%, respectively. They also
found that these thresholds resulted in significant lateral
shifts between the lighted area and the Landsat-derived
urban boundary. These studies all suggest that it may be
possible to obtain reasonable agreement between lighted
area and various measures of city size but these studies also
reveal significant variability in the relationship between
lighted area and different definitions of urban extent. All of
these studies have emphasized the need for more extensive
analyses of lighted area, detection thresholds and urban
Spatial analysis of global urban extent from DMSP-OLS night lights