[Those are comments sent yesterday by Shravan Vasishth in connection with my post. Since they are rather lengthy, I made them into a post. Shravan is also the author of The Foundations of Statistics and we got in touch through my review of the book. I may address some of his points later but, for now, I find the perspective of a psycholinguist quite interesting to hear.]
Christian, is the problem for you that the p-value, however low, is only going to tell you the probability of your data (roughly speaking) assuming the null is true? It is not going to tell you anything about the probability of the alternative hypothesis, which is the real hypothesis of interest.
However, limiting the discussion to (Bayesian) hierarchical models (linear mixed models), which is the type of model people often fit in repeated measures studies in psychology (or at least in psycholinguistics), as long as the problem is about figuring out P(θ>0) or P(θ<0), the decision (to act as if θ>0) is going to be the same regardless of whether one uses p-values or a fully Bayesian approach. This is because the likelihood is going to dominate in the Bayesian model.
Andrew has objected to this line of reasoning by saying that making a decision like θ>0 is not a reasonable one in the first place. That is true in some cases, where the result of one experiment never replicates because of study effects or whatever. But there are a lot of effects which are robust and replicable, and where it makes sense to ask these types of questions.
One central issue for me is: in situations like these, using a low p-value to make such a decision is going to yield pretty similar outcomes compared to doing inference using the posterior distribution. The machinery needed to do a fully Bayesian analysis is very intimidating; you need to know a lot, and you need to do a lot more coding and checking than when you fit an lmer type of model.
It took me 1.5 to 2 years of hard work (=evenings spent not reading novels) to get to the point that I knew roughly what I was doing when fitting Bayesian models. I don’t blame anyone for not wanting to put their life on hold to get to such a point. I find the Bayesian method attractive because it actually answers the question I really asked, namely is θ>0 or θ<0? This is really great, I don’t have to beat around the bush any more! (there; I just used an exclamation mark). But for the researcher unwilling (or more likely: unable) to invest the time into the maths and probability theory and the world of BUGS, the distance between a heuristic like a low p-value and the more sensible Bayesian approach is not that large.
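[A small numerical illustration of this point, as a sketch of mine rather than Shravan's analysis: for a normal mean under the usual noninformative prior, the posterior probability that θ>0 is exactly one minus the one-sided p-value, so the two decision rules coincide.]

```r
## A minimal sketch (not Shravan's code): one-sided p-value versus P(theta > 0 | y)
## for a normal mean under the standard noninformative prior on (mu, sigma).
set.seed(101)
y <- rnorm(30, mean = 0.3, sd = 1)                   # toy "effect" data
pval <- t.test(y, alternative = "greater")$p.value   # p-value for H0: theta <= 0
tobs <- mean(y) * sqrt(length(y)) / sd(y)
post <- pt(tobs, df = length(y) - 1)                 # P(theta > 0 | y), Student posterior
c(one_sided_p = pval, posterior_prob = post, check = 1 - pval)
## post equals 1 - pval: both criteria lead to the same "act as if theta > 0" decision
```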
Last year, John Seaman (III), John Seaman (Jr.), and James Stamey published a paper in The American Statistician with the title Hidden dangers of specifying noninformative priors. (It does not seem to be freely available on-line.) I gave it to my PhD students to read, with the goal of writing a critical reply to the authors. In the meantime, here are my own two cents on the paper.
“Applications typically employ Markov chain Monte Carlo (MCMC) methods to obtain posterior features, resulting in the need for proper priors, even when the modeler prefers that priors be relatively noninformative.” (p.77)
Apart from the above quote, which confuses proper priors with proper posteriors (maybe as the result of a contagious BUGS!), and which is used to focus solely and somewhat inappropriately on proper priors, there is no hard fact to bite into, but rather a collection of soft decisions and options that end up weakly supporting the authors’ thesis. (Obviously, following an earlier post, there is no such thing as a “noninformative” prior.) The paper is centred on four examples where a particular choice of (“noninformative”) prior leads to peaked or informative priors on some transform(s) of the parameters. Note that there is no definition provided for informative, non-informative, or diffuse priors, except the BUGS choice of an “extremely large variance” (p.77). (The quote below seems to settle on a uniform prior if one understands the “likely” as evaluated through the prior density.) The argument of the authors is that “if parameters with diffuse proper priors are subsequently transformed, the resulting induced priors can, of course, be far from diffuse, possibly resulting in unintended influence on the posterior of the transformed parameters” (p.77).
“…a prior is informative to the degree it renders some values of the quantity of interest more likely than others.” (p.77)
The first example is a one-covariate logistic regression. The first surprising choice is that of an identical prior on both the intercept and the regression coefficient, instead of, say, a g-prior that would rescale the coefficients according to the variation of the corresponding covariate. Since x corresponds to age, the second part of the regression varies 50 times more. When plotting the resulting logistic probability function across a few thousand simulations from the prior, the curves mostly end up constant at 0 or 1. Not particularly realistic, since the predicted phenomenon is the occurrence of coronary heart disease. The prior is thus using the wrong scale: the simulated curves should have a reasonable behaviour over the range (20,100) of the covariate x, for instance focussing on a -5 log-odds ratio at age 20 and a +5 log-odds ratio at age 100, leading to the comparison pictured below. Furthermore, allowing the coefficient of x to be negative ignores a basic feature of the model and already answers the later (dishonest) criticism that “the [prior] probability is 0.5 that the ED50 is negative” (p.78). Using a flat prior in this example is just fine and would avoid criticisms about the prior behaviour, since this behaviour is then meaningless from a probabilistic viewpoint.
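Here is a quick simulation of this degeneracy (my own sketch, not the authors' code: the exact prior variances used in the paper may differ, I simply take a large common variance as a stand-in for a “diffuse” BUGS prior and contrast it with a rescaled slope prior):

```r
## Fraction of prior draws for which the logistic curve over ages 20-100 is
## essentially constant at 0 or 1.
set.seed(42)
ages <- 20:100
flat_fraction <- function(sd_b0, sd_b1, nsim = 5000) {
  b0 <- rnorm(nsim, 0, sd_b0)
  b1 <- rnorm(nsim, 0, sd_b1)
  mean(sapply(seq_len(nsim), function(i) {
    p <- plogis(b0[i] + b1[i] * ages)      # one prior draw of the probability curve
    all(p < 0.01) || all(p > 0.99)         # degenerate over the whole age range?
  }))
}
flat_fraction(100, 100)    # identical "diffuse" priors: almost every curve degenerates
flat_fraction(5, 5 / 80)   # slope prior rescaled by the 80-year span of x: far fewer do
```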
“…in a more complicated model, it may be hard to determine the sample size beyond which induced prior influence on the posterior is negligible.” (p.79)
There is also the undercurrent in the paper (not that under!) that Bayesian inference should look like MLE inference and that, if it does not, then something is wrong. If the MLE outcome is taken as “right”, there is indeed no point in running a Bayesian analysis. Strange argument. (Example 2.4 uses the MLE of the evenness as “the true evenness”, p.80.)
The second example is inspired by Barnard, McCulloch and Meng (2000, Statistica Sinica) estimating a covariance matrix with a proper hyperprior on regression coefficient variances that results in a peaked prior on the covariances. The paper falls short of demonstrating a clear impact on the posterior inference. And the solution (p.82) of using another proper prior resulting in a wider dispersion requires a prior knowledge of how wide is wide enough.
Example 2.3 is about a confusing model, inspired by Cowles (2002, Statistics in Medicine), inferring about surrogate endpoints. If I understand the model correctly, there are two models under comparison, one with the surrogate and one without. What puzzles me is that the quantity of interest, the proportion of treatment effect, involves parameters from both models. Even if this can be turned into a meaningful quantity, the criticism that the “proportion” may take values outside (0,1) is rather stillborn, as it suffices to impose a joint prior that ensures the ratio stays within (0,1). Which is the solution proposed by the authors (pp.82-83). The fourth and last example concentrates on estimating a Shannon entropy for a vector of eight probabilities. Using a uniform (Dirichlet) prior induces a prior on the relative entropy (or evenness) that is concentrated on (0.5,1). Since there is nothing special about the uniform (!), re-running the evaluation with a Jeffreys prior Dir(½,½,…,½) reduces this feature, which anyway is a characteristic of the prior distribution, not of the posterior distribution, which accounts for the data. The authors actually propose to use (p.83) a Dir(¼,¼,…,¼) prior, presumably on the basis that the induced prior on the evenness is then centred close to 0.5.
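The induced prior on the evenness is easy to simulate, as in the following sketch (mine, not the paper's code):

```r
## Induced prior on the evenness H/log(8) under symmetric Dirichlet priors.
set.seed(84)
rdirich <- function(n, alpha) {              # simple Dirichlet sampler via gammas
  g <- matrix(rgamma(n * length(alpha), alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}
evenness_prior <- function(a, k = 8, nsim = 1e4) {
  p <- rdirich(nsim, rep(a, k))
  H <- -rowSums(ifelse(p > 0, p * log(p), 0))   # Shannon entropy of each prior draw
  quantile(H / log(k), c(0.05, 0.5, 0.95))
}
evenness_prior(1)      # uniform Dirichlet: evenness mostly within (0.5, 1)
evenness_prior(0.5)    # Jeffreys-type Dir(1/2): somewhat less concentrated
evenness_prior(0.25)   # the authors' Dir(1/4): evenness roughly centred near 0.5
```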
“A related solution to this problem is to specify a joint prior for meaningful summaries of the parameters in the sampling model. Then the induced prior on the original parameters can be computed.” (p.81)
I thus find the level of the criticism found in the paper rather superficial, as it either relies on a specific choice of a proper prior distribution or on ignoring basic prior information. The paper concludes with recommendations for prior checks. Again, nothing fundamentally wrong there: the recommendations are mostly sensible in that they express the fact that some prior information is almost always available on some quantities of interest, as translated in the above quote. My only point of contention is the repeated reference to MLE, since it implies assessing/building the prior from the data… The most specific (if related to the above) recommendation is to use conditional mean priors as exposed in Christensen et al. (2010). (I did not spot this notion in my review of two years ago.) For instance, in the first (logistic) example, this meant putting a prior on the cdfs at age 40 and age 60 (i.e., the disease probabilities at those ages), as sketched below. The authors picked a uniform in both cases, which sounds inconsistent with the presupposed shape of the probability function.
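Here is a sketch of the conditional means prior construction for that example; the Beta priors below are hypothetical choices of mine (precisely the kind of genuine prior information the uniform choice ignores), not the authors' specification:

```r
## Conditional means prior for the logistic example: put priors on the disease
## probabilities at ages 40 and 60 and deduce the induced prior on (beta0, beta1).
set.seed(7)
nsim <- 1e4
p40 <- rbeta(nsim, 2, 8)                          # hypothetical prior: low risk at age 40
p60 <- rbeta(nsim, 4, 6)                          # hypothetical prior: higher risk at age 60
beta1 <- (qlogis(p60) - qlogis(p40)) / (60 - 40)  # solve the two logit equations
beta0 <- qlogis(p40) - 40 * beta1
summary(cbind(beta0, beta1))
## the induced prior on the regression coefficients is informative and on a sensible scale
```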
“…it is more natural for experts to think in terms of observables than parameters…” (p.81)
In conclusion, there is nothing pathologically wrong with either this paper or the use of “noninformative” priors! Looking at induced priors on more intuitive transforms of the original parameters is a commendable suggestion, provided some intuition or prior information is indeed available on those. Using a collection of priors, including reference or invariant priors, helps as well. And (in connection with the above quote) looking at the induced datasets obtained by simulating from the corresponding prior predictive cannot hurt.
Valen Johnson made the headlines in Le Monde last week. (More precisely, on the scientific blog Passeur de Sciences; thanks, Julien, for the pointer!) With the alarming title of “Une étude ébranle un pan de la méthode scientifique” (a study shakes a pillar of the scientific method). The reason for this French fame is Valen’s recent paper in PNAS, Revised standards for statistical evidence, where he puts forward his uniformly most powerful Bayesian tests (recently discussed on the ‘Og) to argue against the standard 0.05 significance level and in favour of “the 0.005 or 0.001 level of significance.”
“…many statisticians have noted that P values of 0.05 may correspond to Bayes factors that only favor the alternative hypothesis by odds of 3 or 4–1…” V. Johnson, PNAS
While I do plan to discuss the PNAS paper later (and possibly write a comment letter to PNAS with Andrew), I find it interesting how quickly it made the headlines, within days of its (early edition) publication: the argument of replacing .05 with .005 or .001 to increase the proportion of reproducible studies is both simple and convincing for a scientific journalist. If only the issue with p-values and statistical testing could be that simple… For instance, the above quote from Valen is reproduced as “an [alternative] hypothesis that stands right below the significance level has in truth only 3 to 5 chances to 1 to be true”, the “truth” popping out of nowhere. (If you read French, the 300+ comments on the blog are also worth their weight in jellybeans…)
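To put some numbers behind the quoted odds, here is a back-of-the-envelope calibration. It relies on the Sellke, Bayarri and Berger (2001) bound rather than on Johnson's UMPBT computation, so take it as an analogous but distinct calculation:

```r
## Largest odds in favour of the alternative compatible with a p-value p,
## according to the -1/(e p log p) bound (valid for p < 1/e).
max_odds <- function(p) 1 / (-exp(1) * p * log(p))
round(sapply(c(0.05, 0.005, 0.001), max_odds), 1)
## about 2.5, 13.9 and 53.3 to 1: a p-value of 0.05 can never amount to strong
## evidence, while the 0.005 or 0.001 thresholds can.
```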
Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates, the latter being rather costly since one importance weight involves B×L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarios.
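To make the estimator concrete, here is a toy sketch of the underlying importance sampling identity (my own illustration, not the authors' implementation, using crude normal approximations to the marginal posteriors in place of permuted MCMC draws, and priors of my own choosing):

```r
## Evidence of a normal model with unknown mean and log-sd, estimated by importance
## sampling from a product of (approximate) marginal posteriors, against a grid "truth".
set.seed(1)
y <- rnorm(50, 1, 2)
logpost <- function(mu, ls)                      # log prior x likelihood
  sum(dnorm(y, mu, exp(ls), log = TRUE)) +
  dnorm(mu, 0, 10, log = TRUE) + dnorm(ls, 0, 10, log = TRUE)
mus <- seq(-2, 4, length = 200); lss <- seq(-1, 2, length = 200)
lp  <- outer(mus, lss, Vectorize(logpost))       # brute-force reference value
ref <- max(lp) + log(sum(exp(lp - max(lp))) * diff(mus)[1] * diff(lss)[1])
fit <- optim(c(mean(y), log(sd(y))), function(th) -logpost(th[1], th[2]), hessian = TRUE)
sds <- 1.5 * sqrt(diag(solve(fit$hessian)))      # inflated marginal scales, for safety
N <- 2e4
mu_s <- rnorm(N, fit$par[1], sds[1]); ls_s <- rnorm(N, fit$par[2], sds[2])
lw <- mapply(logpost, mu_s, ls_s) -
  dnorm(mu_s, fit$par[1], sds[1], log = TRUE) - dnorm(ls_s, fit$par[2], sds[2], log = TRUE)
c(grid = ref, importance = max(lw) + log(mean(exp(lw - max(lw)))))  # log-evidence estimates
```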
After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers 2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkhof et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias, especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known to be quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al.’s (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood, which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis. (There is no comparison with other methods.)
Barber, Voss, and Webster recently posted and arXived a paper entitled The Rate of Convergence for Approximate Bayesian Computation. The paper is essentially theoretical and establishes the optimal rate of convergence of the MSE (for approximating a posterior moment) at n^{-2/(q+4)}, where q is the dimension of the summary statistic, associated with an optimal tolerance in n^{-1/4}. I was first surprised at the role of the dimension of the summary statistic, but rationalised it as being the dimension where the non-parametric estimation takes place. I may have read the paper too quickly as I did not spot any link with earlier convergence results found in the literature: for instance, Blum (2010, JASA) links ABC with standard kernel density non-parametric estimation and finds a tolerance (bandwidth) of order n^{-1/(q+4)} and an MSE of order n^{-2/(q+4)} as well. Similarly, Biau et al. (2013, Annales de l’IHP) obtain precise convergence rates for ABC interpreted as a k-nearest-neighbour estimator. And, as already discussed at length on this blog, Fearnhead and Prangle (2012, JRSS Series B) derive rates similar to Blum’s with a tolerance of order n^{-1/(q+4)} for the regular ABC and of order n^{-1/(q+2)} for the noisy ABC…
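For readers who have never run an ABC algorithm, here is a minimal rejection sampler (a sketch of mine, not taken from the paper) that makes the tolerance-as-bandwidth analogy concrete, with a one-dimensional summary statistic (q=1):

```r
## ABC rejection for a normal location parameter, with the sample mean as summary.
set.seed(2)
y <- rnorm(100, 2, 1); sobs <- mean(y)
nsim <- 1e5; eps <- 0.05                      # the tolerance plays the role of a bandwidth
theta <- runif(nsim, -10, 10)                 # draws from a flat prior
ssim  <- rnorm(nsim, theta, 1 / sqrt(100))    # sampling distribution of the summary
keep  <- abs(ssim - sobs) < eps
c(abc_mean = mean(theta[keep]), abc_sd = sd(theta[keep]),
  exact_sd = 0.1, acceptance = mean(keep))
## a smaller eps reduces the bias (abc_sd -> exact_sd) but lowers the acceptance rate,
## hence increases the Monte Carlo variance: the tradeoff behind the rates above.
```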
“…many authors prefer to replace these improper priors by vague priors, i.e. probability measures that aim to represent very few knowledge on the parameter.”
Christèle Bioche and Pierre Druilhet arXived a few days ago a paper with this title. They aim at shedding new light on the convergence of vague priors to their limit. Their notion of convergence is a pointwise convergence in the quotient space of Radon measures, the quotient being defined by the removal of the “normalising” constant. The first results contained in the paper do not show particularly enticing properties of the improper limit of proper measures, as the limit cannot be given any (useful) probabilistic interpretation. (A feature already noticeable when reading Jeffreys.) The first result that truly caught my interest in connection with my current research is the fact that the Haar measures appear as a (weak) limit of conjugate priors (Section 2.5). And that the Jeffreys prior is the limit of the parametrisation-free conjugate priors of Druilhet and Pommeret (2012, Bayesian Analysis, a paper I will discuss soon!). The result about the convergence of posterior means is rather anticlimactic as the basic assumption is the uniform integrability of the sequence of the prior densities. An interesting counterexample (somehow familiar to invariance fans): the sequence of Poisson distributions with mean n has no weak limit. And the Haldane prior does appear as a limit of Beta distributions (less surprising). On (0,1) if not on [0,1].
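The Haldane limit is easy to check numerically, as in this small sketch (mine, with arbitrary toy counts):

```r
## Posterior means for binomial data, x successes out of n, under Beta(eps, eps) priors:
## as eps -> 0 they converge to the MLE x/n, i.e. to the Haldane prior answer.
x <- 3; n <- 20                                   # hypothetical counts
eps <- c(1, 0.1, 0.01, 0.001)
rbind(eps = eps, posterior_mean = (x + eps) / (n + 2 * eps))
## tends to x/n = 0.15
```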
The paper contains a section on the Jeffreys-Lindley paradox, which is only considered from the second perspective, the one I favour. There is however a mention made of the noninformative answer, which is the (meaningless) one associated with the Lebesgue measure of normalising constant one. This Lebesgue measure also appears as a weak limit in the paper, even though the limit of the posterior probabilities is 1. Except when the likelihood has bounded variations outside compacts. Then the limit of the probabilities is the prior probability of the null… Interesting, truly, but not compelling enough to change my perspective on the topic. (And thanks to the authors for their thanks!)
Following my earlier high school composition (or, as my daughter would stress, a first draft of vague ideas towards a composition!), I came upon an article in the Science leaflet of Le Monde (as of October 25) by the physicist Marco Zito (already commented on the ‘Og): “How natural is Nature?“. The following is my (commented) translation of the column. I cannot say I understand more than half of the words, or hardly anything of its meaning, although checking some Wikipedia entries helped (I wonder how many readers have gotten to the end of this tribune)…
The above question is related to physics in that (a) the electroweak interaction scale is about the mass of the Higgs boson, at which scale [order of 100 GeV] the electromagnetic and the weak forces are of the same intensity. And (b) there exists a gravitation scale, Planck’s mass, which is the energy [about 1.2209×10^19 GeV] where gravitation [general relativity] and quantum physics must be considered simultaneously. The difficulty is that this second fundamental scale differs from the first one, being larger by 17 orders of magnitude [so what?!]. The difference is puzzling, as a world with two fundamental scales that are so far apart does not sound natural [how does he define natural?]. The mass of the Higgs boson depends on the other elementary particles and on the fluctuations of the related fields. Those fluctuations can be very large, of the same order as Planck’s scale. The sum of all those terms [which terms, dude?!] has no reason to be weak. In most possible universes, the mass of this boson should thus compare with Planck’s mass, hence a contradiction [uh?!].
And then enters this apparently massive probabilistic argument:
If you ask passersby each to select a number between two large bounds, like −10000 and 10000, it is very unlikely to obtain exactly zero as the sum of those numbers. So if you observe zero as the sum, you will consider the result is not «natural» [I'd rather say that the probabilistic model is wrong]. The physicists’ reasoning so far was «Nature cannot be unnatural. Thus the problem of the mass of Higgs’ boson must have a solution at energy scales that can be explored by CERN. We could then uncover a new and interesting physics». Sadly, CERN has not (yet?) discovered new particles or new interactions. There is therefore no «natural» solution. Some of us imagine an unknown symmetry that bounds the mass of Higgs’ boson.
And a conclusion that could work for a high school philosophy homework:
This debate is typical of how science proceeds forward. Current theories are used to predict beyond what has been explored so far. This extrapolation works for a little while, but some facts eventually come to invalidate them [sounds like philosophy of science 101, no?!]. Hence the importance of validating our theories through experience, to abstain from attributing to Nature discourses that only reflect our own prejudices.
This Le Monde Science leaflet also had a short entry on a meteorite called Hypatia, because it was found in Egypt, home of the 4th century Alexandria mathematician Hypatia. And a book review of (the French translation of) Perfect Rigor, a second-hand biography of Grigory Perelman by Masha Gessen. (Terrible cover, by the way: don’t they know at Houghton Mifflin that the integral sign is an elongated S, for sum, and not an f?! We happened to discuss and deplore with Andrew the other day this ridiculous tendency to mix wrong math symbols and Greek letters in the titles of general public math books. The title itself is not much better: what is imperfect rigor?!) And the Le Monde math puzzle #838…
A number theory Le Monde mathematical puzzle whose R coding is not really worth it (and which rings a bell, of a similar puzzle in the past that I cannot trace…):
The set Ξ is made of pairs of integers (x,y) such that (i) both x and y are written as a sum of two squared integers (i.e., are bisquare numbers) and (ii) both xy and (x+y) are bisquare numbers. Why is the product condition superfluous? For which values of (a,b) is the pair (13^a,13^b) in Ξ?
In the first question, the property follows from the fact that the product of two bisquare numbers is again a bisquare number, thanks to the remarkable identity
(a²+b²)(c²+d²) = (ac+bd)²+(ad-bc)²
(since the double products cancel). For the second question, once I realised that 13 = 2²+3², it followed that any number 13^a was the sum of two squares, hence a bisquare number, and thus that the only remaining constraint was that 13^a+13^b = 13^a(1+13^{b-a}) (b≥a) is also bisquare. If b-a is even, 1+13^{b-a} = 1²+(13^{(b-a)/2})², so this sum is then the product of two bisquare numbers and hence a bisquare number. If b-a is odd, I do not have a general argument to bar the case (it certainly does not work for 13+13² and the four next ones), as checked numerically below.
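Despite the above disclaimer about the R coding, here is a short numerical check of mine for which pairs of small exponents qualify:

```r
## Is n a sum of two squares? Then test 13^a + 13^b for small exponents.
is_bisquare <- function(n) any(sqrt(n - (0:floor(sqrt(n)))^2) %% 1 == 0)
outer(1:6, 1:6, Vectorize(function(a, b) is_bisquare(13^a + 13^b)))
## within this range, TRUE occurs exactly when b - a is even; e.g. 13 + 13^2 = 182 fails
```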
We received the good news from the Program Committee of the next ISBA World meeting in Cancún, Quintana Roo, México, that our proposal of a short course on ABC methods was accepted. So, along with Jean-Michel Marin, I hope to introduce ABC to a large group of interested participants on either July 13 or the morning of July 14. Here is the abstract for the short course:
ABC methods appeared in 1999 to solve complex genetics problems where the likelihood of the model was impossible to compute. They are now a standard tool in the statistical genetics community but have also addressed many other problems where likelihood computation is an issue, including dynamic models in signal processing and financial data analysis. However, these methods suffer to some degree from calibration difficulties that make them rather volatile in their implementation and thus render them suspicious to the users of more traditional Monte Carlo methods. Nonetheless, ABC techniques have several claims to validity: first, they are connected with econometric methods like indirect inference. Second, they can be expressed in terms of various non-parametric estimators of the likelihood or of the posterior density and follow standard convergence patterns. Last, they appear as regular Bayesian inference on noisy data. The lectures cover those validation steps but also detail the different implementations of ABC algorithms and the calibration of their parameters. The second part of the course illustrates those issues in the special case of the coalescent model used in population genetics, where many of the early advances of ABC were first implemented.
and the complete description is available on the ISBA website.
In addition, the special sessions proposed by the BayesComp and O’Bayes ISBA sections have been accepted too. (As well as sessions for most of the other ISBA sections.) Thanks to Raquel Prado and to the whole scientific committee for working hard towards another successful ISBA World meeting! Note that the early bird conference registration begins today, so make sure to book your seat for Cancún!
Last night I was cooking buckwheat pancakes (galettes de sarrasin) from Brittany with an egg-and-ham filling. The first egg I used contained a double yolk, a fairly rare occurrence, at least in my kitchen! Then came the second pancake and, unbelievably!, a second egg with a double yolk! This sounded too unbelievable to be…unbelievable! The experiment stopped there as no one else wanted another galette, but tonight, when making chocolate mousse, I checked whether or not the four remaining eggs also were double-yolkers…and indeed they were. Which does not help when separating yolks from whites, by the way. Esp. with IX fingers. At some stage, during the day, I remembered a talk by Prof of Risk David Spiegelhalter mentioning the issue, even including a picture of an egg-box with the double-yolker guarantee, as in the attached picture. But all I could find first was this explanation on BBC News. Which made sense for my eggs, as those were from a large calibre egg-box (which I usually do not buy)… (And then I typed “David Spiegelhalter” plus “double-yolker” on Google and all those references came out!)
While Nobel prizes never get close to mathematics, they sometimes border on statistics. It was the case two years ago with Chris Sims (soon to speak at Bayes 250 at Duke) winning the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel (often shortened into the Nobel Prize in Economic Sciences). This year was also a borderline year, since an econometrician, Lars Peter Hansen, got a third of the prize… Hansen is the current co-editor of Econometrica and one of the main contributors to the theory of the generalised method of moments. (Which can be re-interpreted as a precursor to ABC!) He has most of his papers in Econometrica, Journal of Econometrics, and other econometrics journals, but also has a 2009 paper in the Annals of Statistics. In addition, even though the case is even more borderline, the fact that simulation techniques are at the core of the 2013 Nobel Prize in Chemistry is also a good thing for our field (although I did not see many mentions of statistics in the reports I read, apart from their methods making “good predictions”…)
In one of his posts, my friend Larry mentioned that popular posts had to mention the Bayes/frequentist opposition in the title… I think mentioning machine learning is also a good buzzword to increase the traffic! I did spot this phenomenon last week when publishing my review of Kevin Murphy’s Machine Learning: the number of views and visitors jumped by at least a half, exceeding the (admittedly modest) 1000 bar on two consecutive days. Interestingly, the number of copies of Machine Learning (sold via my amazon associate link) did not follow this trend: so far, I have only spotted a few copies sold, in similar amounts to the number of copies of Spatio-temporal Statistics I reviewed the week before. Or of most books I review, positively or negatively! (However, I did spot a correlated increase in overall amazon associate orderings and brazenly attributed the order of a Lego robotics set to a “machine learner”! And as of yesterday Og‘s readers massively ordered 152 236 copies of the latest edition of Andrew’s Bayesian Data Analysis. Thanks!)