Eschenbach: Poisson Pill
Recently, Dr Masters of Weather Underground publicized an estimate of the chance of having 13 months in a row, each with an average temperature in the top tercile of temperatures for that month, for the continental US (CONUS). The probability that a random month falls in the top tercile for that month is, by definition, 33.33% (one in three). He calculated the probability of 13 in a row as the probability of 13 independent events, which is (1/3)^13, or 1 in 3^13. This works out to about 1 in 1.6 million.
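The independent-events arithmetic is easy to check. Here is a minimal sketch; the function name is just a label chosen for illustration, and exact fractions are used to avoid rounding:

```python
from fractions import Fraction

def run_probability_independent(p, n):
    """Probability that n independent events, each with probability p, all occur."""
    return p ** n

p_run = run_probability_independent(Fraction(1, 3), 13)
print(p_run)               # 1/1594323
print(f"1 in {1 / p_run}")  # 1 in 1594323
```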
The error here is that monthly averages are not independent events. An above average temperature in one month will “carry over” into the next month, so that the probability of that next month being above average is greater than 50%. Likewise, when one month is in the upper tercile, the probability that the next month will be in the upper tercile is greater than 33%. Monthly temperatures are “events” with “memory”: the probability that a particular month will have temps that are above average, or in the top tercile, or in the top quartile, and so on, is influenced by the temperature of the preceding month.
At WUWT, Willis Eschenbach took the same data, broke it into overlapping 13 month chunks, counted the number of months in the upper tercile in each chunk, plotted the results, and thought he saw a Poisson distribution in it. At that point, he worked a least squares fit to determine the Poisson parameter. Ironically, in light of his internet preening about looking at the data first, he made the exact same mistake as Dr Masters and then compounded it. Poisson distributions are for counts generated by independent events. Not only are the “success or failure” outcomes for the months within each 13 month period not independent (for the reason explained above), neither are the overlapping 13 month periods themselves. For example, say his first 13 month period is “June 2011 to June 2012” and the second is “May 2011 to May 2012.” These two periods share 12 of their 13 data points – hardly independent.
But there is something interesting that arises out of the Poisson plot. The discrete Poisson distribution is characterized by only one parameter: lambda. This parameter is the mean of the distribution. In this case, it is the number of months expected to be found in the upper tercile (for that month) in a run of 13 monthly temperatures. If the average monthly temperatures were independent events, the mean would be 13 months * 33.33% = 4.333 months in the upper tercile per 13 month period. Eschenbach’s fit uses a lambda of approximately 5.2. Reversing the previous calculation, 5.2/13 = 40% is the implied chance that any particular month will be found in the upper tercile.
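The two numbers in that paragraph follow from one line of arithmetic each. A quick sketch, taking the fitted lambda of about 5.2 as quoted above (the variable names are illustrative only):

```python
WINDOW_MONTHS = 13

# Expected number of upper-tercile months per window if months were independent.
lambda_independent = WINDOW_MONTHS * (1 / 3)

# Per-month probability implied by a fitted Poisson mean of about 5.2.
lambda_fitted = 5.2  # approximate value quoted in the text
p_implied = lambda_fitted / WINDOW_MONTHS

print(f"independent mean: {lambda_independent:.3f} months")    # 4.333 months
print(f"implied monthly probability: {p_implied:.1%}")         # 40.0%
```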
Let’s go back to Masters’ calculation.
He calculated the probability that a string of 13 months will exist given that each month has a 33% chance of being in the upper tercile. That calculation is …
(.333)(.333)(.333) … (.333) (for a total of 13 terms), which can be expressed as
(.333)^13
But we know that if a given month is in the upper tercile, the following month is more likely than “random” to also be in the upper tercile. So, assuming that the memory for monthly temperatures is only one month long, the calculation looks something like
(.333)(.333 + a)(.333 + a)(.333 + a) … (.333 + a) for a total of 13 terms, or
(.333)(.333 + a)^12
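Written out, the one-month-memory assumption amounts to treating the upper-tercile flag as a first-order Markov chain. A sketch of that factorization follows, where U_k (a label introduced here for illustration) denotes the event that month k is in its upper tercile, and a is the extra probability carried over from an upper-tercile month:

```latex
\begin{align*}
P(U_1, U_2, \dots, U_{13})
  &= P(U_1)\,\prod_{k=2}^{13} P(U_k \mid U_{k-1}) \\
  &= \tfrac{1}{3}\left(\tfrac{1}{3} + a\right)^{12}
\end{align*}
```

Setting a = 0 recovers the independent-events result (1/3)^13.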
Did Eschenbach stumble into a method of finding ‘a’?
We can discard the Poisson distribution model and go back to the data for the answer.
Using the NOAA NCDC drd964x.tmpst dataset, extract the temperatures for the lower 48 (CONUS). This data is in monthly columns and annual rows. Sorting each column, define the upper tercile for each month. Convert the matrix to a year_month list and re-sort the data into a time series. Calculate the probability of the following month being in the upper tercile given that the previous month was in the upper tercile. That probability is 44.31%. Checking our work, we see that 33.62% of the months are flagged as upper tercile. This should be precisely 33.33% but is not, due to rounding of the columns and the presence of a few NAs. Code is here.
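The procedure just described can be sketched in a few lines. This is not the code linked above, only an illustration under stated assumptions: the table is a list of yearly rows in chronological order, missing values are None, and the function names and the tiny two-column demo table are made up for the example. Ties at the cutoff are all flagged, which is one way the flagged share can drift above one third.

```python
def upper_tercile_flags(table):
    """Flag each value that falls in the top third of its own column (month).

    table: list of yearly rows; each row holds one value per month, or None.
    Returns a same-shaped table of True / False / None.
    """
    n_months = len(table[0])
    flags = [[None] * n_months for _ in table]
    for m in range(n_months):
        column = sorted(row[m] for row in table if row[m] is not None)
        n_top = max(1, len(column) // 3)   # size of the top third
        cutoff = column[-n_top]            # smallest value still in the top third
        for y, row in enumerate(table):
            if row[m] is not None:
                flags[y][m] = row[m] >= cutoff
    return flags

def conditional_upper_probability(flags):
    """P(month is flagged | previous month was flagged), over the time series."""
    series = [f for row in flags for f in row]   # row-major order is chronological
    followers = [b for a, b in zip(series, series[1:])
                 if a is not None and b is not None and a]
    return sum(followers) / len(followers)

# Hypothetical demo data: 6 "years" by 2 "months", not real temperatures.
demo = [[6, 6], [5, 1], [1, 5], [2, 2], [3, 3], [4, 4]]
demo_flags = upper_tercile_flags(demo)
flat = [f for row in demo_flags for f in row if f is not None]

print(f"{sum(flat) / len(flat):.3f}")             # 0.333
print(conditional_upper_probability(demo_flags))  # 0.5
```

On the real dataset the same two numbers would be the share of flagged months and the conditional probability reported in the text.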
Note that this “measured” 44.31% probability is close to, but a bit higher than, the 40% inferred from the Poisson distribution fit of the data.
Now we can return to the original calculation in question. How likely is it that the average temperature for every month in a string of 13 months will be in the upper tercile of averages for that month … assuming that the “memory” of this data is only one month deep? By definition, the first month in a random string has a 33.33% probability of being in the upper tercile, but, per the data, each of the following months in an uninterrupted string will have a 44.44% chance of also being in the upper tercile for that month. Thus …
(.3333)(.4444)(.4444) … (.4444) for a total of thirteen terms, or …
(.3333)(.4444)^12 = about one chance in 50,500.
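The same chain can be evaluated for any assumed conditional probability. A minimal sketch, assuming one-month memory as above; the function name and labels are illustrative, and the first month uses exactly 1/3 rather than .3333, so the last digits differ slightly from hand-rounded figures:

```python
def run_probability_one_month_memory(p_first, p_next, n=13):
    """Probability of n flagged months in a row when only the previous month matters."""
    return p_first * p_next ** (n - 1)

cases = [
    ("independent months", 1 / 3),
    ("40% from the Poisson fit", 0.40),
    ("44.44% used in the calculation above", 0.4444),
]

odds = {}
for label, p_next in cases:
    odds[label] = 1 / run_probability_one_month_memory(1 / 3, p_next)
    print(f"{label}: about 1 in {odds[label]:,.0f}")
```

Small changes in the conditional probability move the answer a lot, because the value is raised to the twelfth power.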
There’s my cut at the problem. I’ll note that if we calculate the string of 13 using the 40% probability inferred from the Poisson distribution model, then we get a value (1 in 179,000) notably close to Lucia’s Monte Carlo modeling (1 in 167,000).
Update 20120712 2026: Reading Chris’ comment at Tamino’s about excluding the data that you want to ‘test’ from the model that you build for it, I recognized the error. I’ve rerun this setup using the same NOAA NCDC CONUS dataset, but excluding 2011 and 2012 from the analysis. This gives me a probability of 43.68% that a month will be in its upper tercile given that the previous month was in the upper tercile. This slight decrease means that the chance of 13 in a row is slightly less than that calculated above, but only slightly. My estimate now stands at 1 in 62,100.