FIG 7. Graphical model of the Bayesian log-normal-binomial model for the evolution of retweet graphs. Hyper-priors are omitted for simplicity. The plates denote replication over tweets x and users v_j^x.
We predict the popularity of short messages called tweets created in the micro-blogging site known as Twitter. We measure the popularity of a tweet by the time-series path of its retweets, which occur when people forward the tweet to others. We develop a probabilistic model for the evolution of the retweets using a Bayesian approach, and form predictions using only observations on the retweet times and the local network or “graph” structure of the retweeters. We obtain good step-ahead forecasts and predictions of the final total number of retweets even when only a small fraction (i.e., less than one tenth) of the retweet paths are observed. This translates to good predictions within a few minutes of a tweet being posted and has potential implications for understanding the spread of broader ideas, memes, or trends in social networks, and also for revenue models for both individuals who “sell tweets” and those looking to monetize their reach.
A Bayesian Approach for Predicting the Popularity of Tweets
Tauhid Zaman, Emily B. Fox, Eric T. Bradlow
This Web-Project represents an accounting of temperature change that is projected for North America in 2041-2070. Regional Climate Models (RCMs) are run 60 years into the future for small, 50 km x 50 km regions in North America, and their results are analyzed statistically for all regions and all four Boreal seasons. The preponderance of results throughout all of North America is one of warming, usually more than 2°C (3.6°F). A Bayesian, spatial, two-way analysis of variance (ANOVA) model is used to analyze RCM data from the North American Regional Climate Change Assessment Program (NARCCAP).
For some reason, two posts at The Blackboard, It’s “Fancy,” Sort of … (Shollenberger) and “To get what he wanted”: Upturned end points. (Lucia), seem to be having difficulty understanding the mechanics of another post at “Open Mind”, In the Classroom (Tamino). But there is nothing unusual or difficult about the methods Tamino used to create the charts which have generated so much smoke and apparent frustration at The Blackboard – resulting in an outbreak of mcintyretude: scorn, derision, insults, and the questioning of motives. Since this is mostly a quick walk-through of some code to clear the smoke, I will leave the generated charts for the end of the post.
In statistics, a sequence of random variables is heteroscedastic, or heteroskedastic, if the random variables have different variances. The term means “differing variance” and comes from the Greek “hetero” (‘different’) and “skedasis” (‘dispersion’). In contrast, a sequence of random variables is called homoscedastic if it has constant variance.
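A small simulation makes the definition concrete. The sketch below is only an illustration: the function names, the sample size, and the choice of a standard deviation that grows linearly from 1 to 5 are all assumptions made for the example, not anything taken from the post. It draws one sequence with constant variance and one whose variance grows with the index, then compares the sample variance of the first and second halves of each.

```python
import random
import statistics


def simulate(n, sd_of_index, seed=0):
    """Draw n zero-mean normal values; sd_of_index(i) gives the sd of draw i."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sd_of_index(i)) for i in range(n)]


def half_variances(xs):
    """Sample variance of the first half and of the second half of xs."""
    mid = len(xs) // 2
    return statistics.variance(xs[:mid]), statistics.variance(xs[mid:])


n = 1000
# Homoscedastic: every draw has the same standard deviation.
homo = simulate(n, lambda i: 1.0)
# Heteroscedastic (illustrative assumption): sd grows linearly from 1 to 5.
hetero = simulate(n, lambda i: 1.0 + 4.0 * i / n)

print("homoscedastic halves:  ", half_variances(homo))
print("heteroscedastic halves:", half_variances(hetero))
```

For the homoscedastic sequence the two half-sample variances should be close to each other; for the heteroscedastic one the second half should be clearly larger.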
I’ve mostly been working through GISTEMP in this series, but the exp+sine results were interesting enough that I wanted to pause and look at both line+sine and exp+sine in all three data sets.
The “nonlinear least squares” (nls) function is part of the core of R. John Fox wrote an introduction to it: Nonlinear Regression and Nonlinear Least Squares. This function will in a few dozen iterations return a better fit than my brain-dead looping around parameter space a few tens of thousands of times.
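To show why an iterative solver needs so few steps, here is a minimal Gauss-Newton sketch (the default algorithm in R's nls) written in plain Python. Everything in it is an assumption made for illustration: the two-parameter model y = a*exp(b*x), the noise-free synthetic data, the starting values, and the function name. It is not the nls implementation and has none of its safeguards such as step halving.

```python
import math


def gauss_newton_exp(xs, ys, a, b, max_iter=50, tol=1e-12):
    """Fit y = a*exp(b*x) by undamped Gauss-Newton from the start (a, b)."""
    for _ in range(max_iter):
        # Accumulate the normal equations (J^T J) d = J^T r for this step.
        s_aa = s_ab = s_bb = g_a = g_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e          # residual at the current parameters
            ja = e                 # d(model)/da
            jb = a * x * e         # d(model)/db
            s_aa += ja * ja
            s_ab += ja * jb
            s_bb += jb * jb
            g_a += ja * r
            g_b += jb * r
        det = s_aa * s_bb - s_ab * s_ab
        if det == 0.0:
            break                  # singular system; give up on this sketch
        # Solve the 2x2 system by Cramer's rule.
        da = (g_a * s_bb - g_b * s_ab) / det
        db = (s_aa * g_b - s_ab * g_a) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Synthetic, noise-free data generated from a = 2.0, b = 0.5 (assumed values).
xs = [0.5 * i for i in range(7)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = gauss_newton_exp(xs, ys, a=1.5, b=0.4)
print(a_hat, b_hat)
```

Each pass linearizes the model around the current parameters and solves a small linear least-squares problem, which is why a reasonable starting guess converges in a handful of iterations instead of a brute-force sweep of parameter space.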
y = (m*(x-1880) + b) + (A * cos(((x-1880)/T)*(2*pi)))
intercept (1880) b = -0.53
slope m = 0.0059 C / yr
amplitude A = 0.3 C
period T = 60 years
Which he displays as such:
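The line-plus-cosine model and parameter values quoted above can be evaluated directly. The sketch below is in Python; the function name is made up for the example, and the defaults are simply the quoted values for m, b, A, and T.

```python
import math


def line_plus_cosine(year, m=0.0059, b=-0.53, A=0.3, T=60.0):
    """Evaluate y = (m*(x-1880) + b) + A*cos(((x-1880)/T)*2*pi) at x = year."""
    t = year - 1880
    return (m * t + b) + A * math.cos((t / T) * 2 * math.pi)


# Cosine peak (1880), trough (1910), and the next peak (1940).
for year in (1880, 1910, 1940):
    print(year, round(line_plus_cosine(year), 3))
```

With a 60-year period that starts at its maximum in 1880, the cosine term adds A at 1880 and 1940 and subtracts A at 1910, on top of the linear trend.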
The eyeball and quick sigma population checks in the previous post provided some confidence that the global temperature anomalies are normally distributed about the mean. But there are more formal tests, including the D’Agostino normality test.
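For readers who want to see what such a test computes, here is a standard-library Python sketch of the D'Agostino-Pearson K² omnibus statistic, which combines a transformed skewness score and a transformed kurtosis score. The formulas are transcribed from the commonly published form of the test, the function name and the two synthetic samples are assumptions made for the example, and the result should be checked against a reference implementation such as scipy.stats.normaltest before being relied on.

```python
import math
from statistics import NormalDist


def dagostino_k2(xs):
    """Return (K2, p) for the D'Agostino-Pearson omnibus normality test."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    g1 = m3 / m2 ** 1.5            # sample skewness
    b2 = m4 / m2 ** 2              # sample kurtosis (normal value is near 3)

    # Skewness z-score (D'Agostino's transformation).
    y = g1 * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1.0 + math.sqrt(2.0 * (beta2 - 1.0))
    delta = 1.0 / math.sqrt(0.5 * math.log(w2))
    alpha = math.sqrt(2.0 / (w2 - 1.0))
    z1 = delta * math.asinh(y / alpha)

    # Kurtosis z-score (Anscombe-Glynn transformation).
    e = 3.0 * (n - 1) / (n + 1)
    var = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    x = (b2 - e) / math.sqrt(var)
    sb1 = (6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9))
           * math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    a = 6.0 + 8.0 / sb1 * (2.0 / sb1 + math.sqrt(1.0 + 4.0 / sb1 ** 2))
    denom = 1.0 + x * math.sqrt(2.0 / (a - 4.0))
    cube = math.copysign(abs((1.0 - 2.0 / a) / denom) ** (1.0 / 3.0), denom)
    z2 = (1.0 - 2.0 / (9.0 * a) - cube) / math.sqrt(2.0 / (9.0 * a))

    k2 = z1 * z1 + z2 * z2
    p = math.exp(-k2 / 2.0)        # chi-squared survival function, 2 dof
    return k2, p


# Deterministic illustrative samples (assumed, not the temperature data):
# evenly spaced quantiles of a normal and of an exponential distribution.
n = 200
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
print("normal-like:", dagostino_k2(normal_like))
print("skewed:     ", dagostino_k2(skewed))
```

A large p-value means the sample's skewness and kurtosis are consistent with a normal distribution; a very small one, as for the skewed sample, is evidence against normality.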
Various players have looked at changes in trends due to loss of stations, loss of rural stations, loss of high latitude, and loss of high altitude stations. Other cuts have included brightness and GPW population or population density. Recently, Zeke added airports to the list.
Pearson’s Chi-squared test is used to test independence of variables in categories. Make no mistake, I am only playing at being a statistician in this post. I welcome comments and corrections in what follows and suggestions of more appropriate category tests.
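As a reminder of the arithmetic behind the test, the sketch below computes the Pearson chi-squared statistic and its degrees of freedom for a contingency table in plain Python. The function name and the two small tables are invented for illustration and are not data from the post; the p-value lookup is left out.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a table.

    table is a list of rows of observed counts. Expected counts assume the
    row and column categories are independent.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows and columns are two made-up categories each.
dependent = [[10, 20], [20, 10]]
independent = [[10, 20], [20, 40]]
print(chi_squared_independence(dependent))
print(chi_squared_independence(independent))
```

The statistic is zero when every observed count equals its expected count under independence, and it grows as the table departs from that pattern.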