Archive for the ‘Statistics’ Category

Contemplating Cultural Boundaries

2013 July 18 Comments off

The Mesh of Civilizations and International Email Flows

Abstract: In The Clash of Civilizations, Samuel Huntington argued that the primary axis of global conflict was no longer ideological or economic but cultural and religious, and that this division would characterize the “battle lines of the future.” In contrast to the “top down” approach in previous research focused on the relations among nation states, we focused on the flows of interpersonal communication as a bottom-up view of international alignments. To that end, we mapped the locations of the world’s countries in global email networks to see if we could detect cultural fault lines. Using IP-geolocation on a worldwide anonymized dataset obtained from a large Internet company, we constructed a global email network. In computing email flows we employ a novel rescaling procedure to account for differences due to uneven adoption of a particular Internet service across the world. Our analysis shows that email flows are consistent with Huntington’s thesis. In addition to location in Huntington’s “civilizations,” our results also attest to the importance of both cultural and economic factors in the patterning of inter-country communication ties.


Changing Mass Priorities: The Link between Modernization and Democracy

(modified from original in cited paper)


Huntington: The Clash of Civilizations


I find it interesting that Huntington’s cultural boundaries are to some degree quantifiable.


See also Culturomics 2.0:

Zaman: A Bayesian Approach for Predicting the Popularity of Tweets

2013 May 3 Comments off

FIG 7. Graphical model of the Bayesian log-normal-binomial model for the evolution of retweet graphs. Hyper-priors are omitted for simplicity. The plates denote replication over tweets x and users v_j^x.

We predict the popularity of short messages called tweets created in the micro-blogging site known as Twitter. We measure the popularity of a tweet by the time-series path of its retweets, which is when people forward the tweet to others. We develop a probabilistic model for the evolution of the retweets using a Bayesian approach, and form predictions using only observations on the retweet times and the local network or “graph” structure of the retweeters. We obtain good step ahead forecasts and predictions of the final total number of retweets even when only a small fraction (i.e. less than one tenth) of the retweet paths are observed. This translates to good predictions within a few minutes of a tweet being posted and has potential implications for understanding the spread of broader ideas, memes, or trends in social networks and also revenue models for both individuals who “sell tweets” and for those looking to monetize their reach.

A Bayesian Approach for Predicting the Popularity of Tweets
Tauhid Zaman, Emily B. Fox, Eric T. Bradlow
arXiv:1304.6777 [cs.SI]

Colorado Counties Have More Voters Than People … Not So Much

2012 November 10 1 comment

A review of voter registration data for ten counties in Colorado details a pattern of voter bloat inflating registration rolls to numbers larger than the total voting age population. Using publicly available voter data and comparing it to U.S. Census records reveals the ten counties having a total registration ranging between 104 to 140 percent of the respective populations. …

…All ten counties investigated by Media Trackers reported voter turnout greater than the national average. Nine out of ten also showed voter turnout well above the Colorado average. Mineral and San Juan counties, which have voter registration numbers of 126 percent and 112 percent respectively, had voter turnout of 96 and 83 percent respectively. …

–Aron Gardner, Sep 4 2012
Colorado Counties Have More Voters Than People

The above article was written well before the election but was presented to me after Romney’s defeat by someone trying to make the case that this shows that Colorado is experiencing voter fraud.

So I decided to look at the numbers.


Pollitricks: More Election Doodle Dandy

2012 October 29 1 comment

Election Doodle Dandy

2012 October 22 3 comments

Original image replaced by one that auto-updates
The original image can be found here


Eschenbach’s Poisson Pill: Take Two, Call Me In The Morning

2012 August 2 Comments off


Eschenbach: I have tried to fit it myself with a number of other distributions, without success … but you guys are better theoreticians than I am, I await your answers.

I’m no statistical theoretician, but I did sleep in a Holiday Inn Express once. So before I forget it all, I want to look at this one more time.

To review, the question on the table is “What is the chance that thirteen hot months in a row will occur” if a hot month is defined as being in the upper third for all such months across all years (eg … one of the 33% hottest Junes on record).

A simple binomial distribution is wrong because hot months are more likely to follow hot months. The chance of one month being hot is not independent of the state of the previous month and binomial distributions assume independent events.

Willis Eschenbach fitted a Poisson distribution to a data set derived from the NCDC CONUS wherein he took a running count of the number of hot months in an overlapping series of thirteen across the full range of the data. The Poisson distribution, however, extends across all the non-negative integers from 0 to infinity, which in this case means that there is a nonzero chance of getting 14 hot months in a series of 13. Obviously that makes no sense – so what is going on?

Eschenbach used all the data available to develop his “Poisson model” to answer the question regarding the likelihood of a thirteen month run of hot months, but it has been pointed out that it is preferable to exclude the actual event to be “predicted” from the data used to develop the model intended to predict the rare event. By including the last two “hot years” in the modeling data, Eschenbach fattens the tail of the data distribution, which is one reason he has overestimated the Poisson lambda.

In my previous post, I suggested that we could just look at the data and find what the correlated probability of a hot month following a hot month would be and from that, explicitly calculate the chance for 13 hot months. Excluding 2011 and 2012 from consideration, that correlated probability is .4368. From that we can calculate the chance of a run of 13 hot months: (.3333)(.4368)^12 or about 1 in 62000.
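As a quick check on that arithmetic, here is a minimal Python sketch. The function name and its parameters are mine, chosen for illustration; the only inputs taken from the post are the 33.33% chance for the first month and the 43.68% chance for each following month, and the model assumes the "memory" is only one month deep.

```python
def run_odds(p_first, p_next, run_length=13):
    """Return N such that an unbroken hot run of `run_length` months is a 1-in-N event.

    Assumes one-month memory: the first month is hot with probability p_first,
    and every later month is hot with probability p_next given a hot predecessor.
    """
    probability = p_first * p_next ** (run_length - 1)
    return 1.0 / probability

odds = run_odds(p_first=0.3333, p_next=0.4368)
print(f"about 1 in {odds:,.0f}")  # on the order of 1 in 62,000
```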

But what about other runs? What is the chance of 12 in 13 … or 6 in 13 … or 0 in 13? To find those answers, we can build a matrix of all possible permutations (order matters) of 13 months either hot or not and then use a set of conditional probabilities to calculate each permutation. We need four such conditionals. To do this, we need only calculate the additional probability that a not-hot month follows a not-hot month similar to how we calculated the two hot months previously. This gives the following 4 probabilities:

P(H|H) = 0.4368 which is the probability that a hot month follows a hot month
P(H|N) = 1 – P(N|N) = 0.2854 which is the probability that a hot month follows a not-hot month
P(N|H) = 1 – P(H|H) = 0.5632 which is the probability that a not-hot month follows a hot month
P(N|N) = 0.7146 which is the probability that a not-hot month follows a not-hot month

The probability for each permutation of hot and not-hot months can then be calculated. The probabilities of all such permutations sum to 1 by definition, which is verified in the code to ensure that the permutation matrix was created properly. Then we can bin the resulting sample space by the number of hot months and sum the probabilities in each bin. This gives the probability distribution displayed above as “Modified Binomial.”
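The enumeration can be sketched in a few lines of Python. This is not the linked code, just an illustration of the same idea under stated assumptions: the first month of each sequence is given the unconditional one-in-three chance of being hot (my assumption; the post does not say how the first month is handled), and each remaining conditional is taken as the complement of the measured one that shares its "given" month. All names are hypothetical.

```python
from itertools import product

P_HH = 0.4368          # P(hot | previous month hot), from the post
P_NN = 0.7146          # P(not-hot | previous month not-hot), from the post
P_NH = 1.0 - P_HH      # P(not-hot | previous month hot)
P_HN = 1.0 - P_NN      # P(hot | previous month not-hot)
P_FIRST_HOT = 1.0 / 3  # assumption: first month uses the unconditional tercile chance

# TRANSITION[previous][current]
TRANSITION = {
    "H": {"H": P_HH, "N": P_NH},
    "N": {"H": P_HN, "N": P_NN},
}

def sequence_probability(seq):
    """Probability of one ordered sequence of 'H'/'N' months under one-month memory."""
    prob = P_FIRST_HOT if seq[0] == "H" else 1.0 - P_FIRST_HOT
    for prev, cur in zip(seq, seq[1:]):
        prob *= TRANSITION[prev][cur]
    return prob

def hot_count_distribution(n_months=13):
    """Bin all 2**n_months sequences by their number of hot months."""
    dist = [0.0] * (n_months + 1)
    for seq in product("HN", repeat=n_months):
        dist[seq.count("H")] += sequence_probability(seq)
    return dist

dist = hot_count_distribution()
assert abs(sum(dist) - 1.0) < 1e-9  # the sample space must sum to 1
for k, p in enumerate(dist):
    print(f"{k:2d} hot months: {p:.6f}")
```

The last bin, 13 hot months out of 13, contains a single sequence, so it reduces to the product of one first-month term and twelve P(H|H) terms.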

This is a better fit to the sample data (pre-2011) than Eschenbach’s truncated Poisson with lambda 5.2 by either the unweighted variance (which fits the visual PDF curve) or the probability-weighted variance (which fits the actual data occurrence).

One thing to note about this “modified binomial” (dependent events) is that it is wider and flatter than the “unmodified binomial” (independent events). The autocorrelation makes the tails, the unlikely events, more common and, therefore, reduces the probability of the more common events.

However, the “modified binomial” is not fat enough in the tails and is too tall in the middle. This is likely due to the data being correlated for longer than one month. Accounting for two month lag or three month lag will likely make the modified binomial even flatter and fatter.

But we can also fit the data to a truncated Poisson using the mean of the number of hot months (4.35) found in the actual sample of hot months (pre 2011), and we get an even better fit than either my modified binomial with one month correlation or Eschenbach’s Poisson. I’m not surprised that Eschenbach’s Poisson is too fat in the upper tail. This is partly due to including the recent, rare event in his data set and I suspect that his ‘fit’ process was to find the least variance in this overly fat tail. I am surprised that a different Poisson (using the data mean 4.35) beats the modified binomial developed with a one month correlation. As noted in the previous paragraph, I believe that this is due to the correlation extending deeper than 1 month.

So what does the Poisson from the data mean tell us? For one, my objection regarding the range of the Poisson is negligible – we can pretty much ignore the possibility of events where the number of hot months in a series of 13 will exceed 13. In the Poisson from the data mean, this will occur only once every 5500 times. In Eschenbach’s fatter Poisson, this occurs once every 1000 times. In the Poisson from the mean, the chance of 13 in a row is once in every 2400 events, far more frequently than in the dependent modified binomial. Perhaps coincidentally, this comes close to matching Lucia’s HK estimate of 1 in 2000.
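Those Poisson figures are easy to reproduce, at least approximately. The sketch below uses a plain (untruncated) Poisson, which is my simplification; a truncated and renormalized version would shift the numbers slightly. The helper names are hypothetical, and the two lambdas are the ones quoted in the post.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_tail_above(k, lam):
    """P(X > k): the 'impossible' region when k is the window length."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k + 1))

for label, lam in (("data mean", 4.35), ("Eschenbach fit", 5.2)):
    exactly_13 = poisson_pmf(13, lam)
    above_13 = poisson_tail_above(13, lam)
    print(f"{label} (lambda={lam}): "
          f"P(X=13) about 1 in {1 / exactly_13:,.0f}, "
          f"P(X>13) about 1 in {1 / above_13:,.0f}")
```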

But why does the Poisson fit the data so well? Fundamentally, it has to do with the autocorrelation of hot/not months flattening the height and fattening the tails of what would be a binomial distribution if we could assume independent events. Canonically, Poisson can be used as an approximation to the binomial when n is large and p is small, such that lambda is less than 5 or so (larger np can be approximated using normal distributions). In this case, we have a mean of 4.35 derived from the data, although n is not large nor p small. Dividing the mean by n=13, we can calculate p = .335 – which would be very close to correct if the hot months were independent. So the Poisson approximation, flatter and fatter than the binomial, mimics the behavior of an explicitly calculated PDF using the probabilities for (autocorrelated) dependent events. Pretty nifty, all in all.
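To see the "flatter and fatter" claim in numbers, one can put the independent-events binomial next to a Poisson with the same mean. This is only a sketch: it uses n = 13 and the data mean of 4.35 from the post, and the function names are mine.

```python
import math

N_MONTHS = 13
MEAN_HOT = 4.35                 # mean hot months per 13-month window, from the post
P_HOT = MEAN_HOT / N_MONTHS     # about .335

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Same mean, different spread: the Poisson variance equals its mean,
# while the binomial variance is shrunk by the factor (1 - p).
binomial_variance = N_MONTHS * P_HOT * (1 - P_HOT)
poisson_variance = MEAN_HOT
print(f"variance: binomial {binomial_variance:.2f}, Poisson {poisson_variance:.2f}")

print(" k  binomial   Poisson")
for k in range(N_MONTHS + 1):
    print(f"{k:2d}  {binomial_pmf(k, N_MONTHS, P_HOT):.6f}  {poisson_pmf(k, MEAN_HOT):.6f}")
```

The binomial is taller near the mean and the Poisson carries more weight in the upper tail, which is the same direction that autocorrelation pushes the explicitly calculated distribution.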

Code here

Eschenbach: Poisson Pill

2012 July 12 7 comments

Recently, Dr Masters of Weather Underground publicized an estimate of the chance of having 13 months in a row each with an average temperature in the top tercile of temperatures for that month for the continental US (CONUS). The probability for the temperatures of a random month to be in the top tercile for that month is, by definition, 33.33% (one in three). He calculated the probability of 13 in a row as the probability of 13 independent events, which would be (1/3)^13. This works out to be about 1 in 1.6 million.

The error here is that monthly averages are not independent events. An above average temperature in one month will “carry over” into the next month so that the probability of that next month being above average is greater than 50%. Likewise, when one month is in the upper tercile, the probability that the next month will be in the upper tercile is greater than 33%. Monthly temperatures are “events” with “memory”, and the probability that a particular month will have temps that are above average, or in the top tercile, or in the top quartile, etc., is influenced by the temperature of the preceding month.

At WUWT, Willis Eschenbach took the same data, broke it into overlapping 13 month chunks, counted the number of months in the upper tercile, plotted the results, and thought he saw a Poisson distribution in it. He then worked a least squares fit to determine the Poisson parameters. Ironically, in light of his internet preening about looking at the data first, he made the exact same mistake as Dr Masters and then compounded it. Poisson distributions are for data generated by independent events. Not only is the “success or failure” of a particular month being in the upper tercile not independent within each 13 month period (for the reason explained above), neither are the overlapping 13 month periods independent of each other. For example, say his first 13 month period is for “June 2011 to June 2012” and the second 13 month period is for “May 2011 to May 2012.” These two periods share 12 data points – hardly independent.
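The overlap between adjacent windows is easy to count directly. The sketch below is only an illustration: it slides a 13-month window one month at a time over a two-year run of (year, month) labels, and the function name is hypothetical.

```python
def overlapping_windows(months, width=13):
    """Every run of `width` consecutive months, advancing one month at a time."""
    return [months[i:i + width] for i in range(len(months) - width + 1)]

# Two years of (year, month) labels, January 2011 through December 2012.
months = [(year, month) for year in (2011, 2012) for month in range(1, 13)]
windows = overlapping_windows(months)

shared = set(windows[0]) & set(windows[1])
print(f"{len(windows)} windows; adjacent windows share {len(shared)} of 13 months")
```

Any two windows whose start dates are one month apart differ only in the month dropped at one end and the month added at the other, so their hot-month counts are very strongly correlated.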

But there is something interesting that arises out of the Poisson plot. The discrete Poisson distribution is characterized by only one parameter: lambda. This parameter lambda is the mean of the data. In this case, it is the number of months expected to be found in the upper tercile (for that month) given a set of monthly temperatures over a range of 13 months. If the average monthly temperatures were independent events, the mean would be 13 months * 33.33% = 4.333 months in the upper tercile per 13 month period. Eschenbach’s fit uses a lambda of approximately 5.2. Reversing the previous calculation, 5.2/13 gives a 40% chance that any particular month will be found in the upper tercile.

That’s odd.

Let’s go back to Masters’ calculation.

He calculated the probability that a string of months 13 long will exist given that each month has a 33% chance of being in the upper tercile. That calculation is …
(.333)(.333)(.333) … (.333) (for a total of 13 terms), which can be expressed as
(.333)^13

But we know that if a given month is in the upper tercile, the following month is more likely than “random” to also be in the upper tercile. So, assuming that memory for monthly temperatures is only one month long, the calculation looks something like
(.333)(.333 + a)(.333 + a)(.333 + a) … (.333 + a) for a total of 13 terms, or
(.333)(.333 + a)^12

Did Eschenbach stumble into a method of finding ‘a’?

We can discard the Poisson distribution model and go back to the data for the answer.

Using the NOAA NCDC drd964x.tmpst dataset, extract the temperatures for the lower 48 (CONUS). This data is in monthly columns and annual rows. Sorting each column, define the upper tercile for each month. Convert the matrix to a year_month list and re-sort the data into a time series. Calculate the probability of the following month being in the upper tercile given that the previous month was in the upper tercile. That probability is 44.31%. Checking our work, we see that 33.62% of the months are flagged as upper tercile. This should be precisely 33.33% but is not due to rounding of the columns and the presence of a few NAs. Code is here.
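For readers who want to see the shape of that procedure without the dataset, here is a rough Python sketch. It is not the linked code and does not read the NCDC file: it runs on synthetic, randomly generated anomalies with some month-to-month persistence built in (a labeled assumption), so the numbers it prints are not the 44.31% or 33.62% quoted above. The function names are hypothetical, and ties and NAs are ignored.

```python
import random

def flag_upper_tercile(table):
    """table[year][month] -> flags[year][month], True when that value is in the
    top third of all values for the same calendar month (no ties or NAs assumed)."""
    n_years = len(table)
    n_top = n_years // 3
    flags = [[False] * 12 for _ in range(n_years)]
    for m in range(12):
        ranked = sorted(range(n_years), key=lambda y: table[y][m], reverse=True)
        for y in ranked[:n_top]:
            flags[y][m] = True
    return flags

def conditional_hot_probability(series):
    """P(hot this month | hot previous month) from a chronological list of flags."""
    after_hot = [cur for prev, cur in zip(series, series[1:]) if prev]
    return sum(after_hot) / len(after_hot)

# SYNTHETIC data only: 120 years of monthly anomalies with one-month persistence.
random.seed(0)
table, anomaly = [], 0.0
for _ in range(120):
    row = []
    for _ in range(12):
        anomaly = 0.5 * anomaly + random.gauss(0.0, 1.0)
        row.append(anomaly)
    table.append(row)

flags = flag_upper_tercile(table)
series = [flag for row in flags for flag in row]  # rows are years, so this is chronological

share_hot = sum(series) / len(series)
p_hot_given_hot = conditional_hot_probability(series)
print(f"share of months flagged hot: {share_hot:.4f}")
print(f"P(hot | previous month hot): {p_hot_given_hot:.4f}")
```

With a year count divisible by three and no missing values the flagged share comes out at exactly one third, which is the check described in the paragraph above.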

Note that this “measured” 44.31% probability is close to, but a bit higher than, that inferred from the Poisson distribution fit of the data.

Now we can return to the original calculation in question. How likely is it that the average temperature for every month in a string of 13 months will be in the upper tercile of averages for that month … assuming that the “memory” of this data is only one month deep? By definition, the first month in a random string has a 33.33% probability of being in the upper tercile, but, per the data, each of the following months in an uninterrupted string will have a 44.44% chance of also being in the upper tercile for that month. Thus …
(.3333)(.4444)(.4444) … (.4444) for a total of thirteen terms, or …
(.3333)(.4444)^12 = one chance in 50500.

There’s my cut at the problem. I’ll note that if we calculate the string of 13 using the 40% probability inferred from the Poisson distribution model, then we get a value (1 in 179000) notably close to Lucia’s Monte Carlo modeling (1 in 167000).
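The estimate is quite sensitive to the assumed carry-over probability, which a short comparison makes visible. This sketch simply plugs the three probabilities discussed in this post into the same one-month-memory product; the function name and the dictionary labels are mine.

```python
def odds_of_run(p_next, p_first=0.3333, run_length=13):
    """1-in-N odds of an unbroken hot run, with one-month memory."""
    return 1.0 / (p_first * p_next ** (run_length - 1))

# Probability that a hot month follows a hot month, under three assumptions.
cases = {
    "independent months": 0.3333,
    "implied by the Poisson fit (5.2/13)": 0.40,
    "carry-over value used in the product above": 0.4444,
}

results = {label: odds_of_run(p) for label, p in cases.items()}
for label, odds in results.items():
    print(f"{label}: about 1 in {odds:,.0f}")
```

Because the conditional probability is raised to the twelfth power, moving it from one third to roughly 0.44 changes the odds by a factor of about thirty.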


Update 20120712 2026: Reading Chris’ comment at Tamino’s about excluding the data that you want to ‘test’ from the model that you build for it, I recognized the error. I’ve rerun this setup using the same NOAA NCDC CONUS dataset, but excluding 2011 and 2012 from the analysis. This gives me a probability of 43.68% that a month will be in its upper tercile given that the previous month was in the upper tercile. This slight downward movement means that the chance of 13 in a row is slightly less than that calculated above, but only slightly. My estimate now stands at 1 in 62,100.