Today, I took a look at a recently arXived paper posted in physics, lifting – A non reversible MCMC algorithm by Marija Vucelja, but I simply could not understand the concept of lifting. Presumably because of the physics perspective. And also because the paper is mostly a review, referring to the author’s earlier work. The notion of lifting is to create a duplicate of a regular Markov chain with given stationary distribution towards cancelling reversibility and hence speeding up the exploration of the state space. The central innovation in the paper seems to be in imposing a lifted reversibility, which means using the reverse dynamics on the lifted version of the chain, that is, the dual proposal
However, the paper does not make explicit how the resulting Markov transition matrix on the augmented space is derived from the original matrix. I now realise my description is most likely giving the impression of two coupled Markov chains, which is not the case: the new setting is made of a duplicated sample space, in the sense of Nummelin split chain (but without the specific meaning for the binary variable found in Nummelin!). In the case of the 1-d Ising model, the implementation of the method means for instance picking a site at random, proposing to change its spin value by a Metropolis acceptance step and then, if the proposal is rejected, possibly switching to the corresponding value in the dual part of the state. Given the elementary proposal in the first place, I fail to see where the improvement can occur… I’d be most interested in seeing a version of this lifting in a realistic statistical setting.
Filed under: Books, Statistics, University life Tagged: arXiv, MCMC algorithms, reversible Markov chain
Last week, Likelihood-free Bayesian inference on the minimum clinically important difference was arXived by Nick Syring and Ryan Martin and I read it over the weekend, slowly coming to the realisation that their [meaning of] “likelihood free” was not my [meaning of] “likelihood free”, namely that it has nothing to do with ABC! The idea therein is to create a likelihood out of a loss function, in the spirit of Bissiri, Holmes and Walker, the loss being inspired here by a clinical trial concept, the minimum clinically important difference, defined as
which defines a loss function per se when considering the empirical version. In clinical trials, Y is a binary outcome and X a vector of explanatory variables. This model-free concept avoids setting a joint distribution on the pair (X,Y), since creating a distribution on a large vector of covariates is always an issue. As a marginalia, the authors actually mention our MCMC book in connection with a logistic regression (Example 7.11) and for a while I thought we had mentioned MCID therein, realising later it was a standard description of MCMC for logistic models.
The central and interesting part of the paper is obviously defining the likelihood-free posterior as
The authors manage to obtain the rate necessary for the estimation to be asymptotically consistent, which seems [to me] to mean that a better representation of the likelihood-free posterior should be
(even though this rescaling does not appear verbatim in the paper). This is quite an interesting application of the concept developed by Bissiri, Holmes and Walker, even though it also illustrates the difficulty of defining a specific prior, given that the minimised target above can be transformed by an arbitrary increasing function. And the mathematical difficulty in finding a rate.
Filed under: Uncategorized Tagged: Amnesty International, blogging, freedom of speech, Je suis Charlie, Raif Badawi, religions, Saudi Arabia
Now my grading is over, I can reflect on the unexpected difficulties in the mathematical statistics exam. I knew that the first question in the multiple choice exercise, borrowed from Cross Validation, was going to be quasi-impossible and indeed only one student out of 118 managed to find the right solution. More surprisingly, most students did not manage to solve the (absence of) MLE when observing that n unobserved exponential Exp(λ) were larger than a fixed bound δ. I was also amazed that they did poorly on a N(0,σ²) setup, failing to see that
and determine an unbiased estimator that can be improved by Rao-Blackwellisation. No student reached the conditioning part. And a rather frequent mistake more understandable due to the limited exposure they had to Bayesian statistics: many confused parameter λ with observation x in the prior, writing
hence could not derive a proper posterior.
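A sketch of why the MLE is absent in the exponential question, assuming the only information available is the event that all n variates exceed δ (one reading of the exam question, which is not spelled out above):

```latex
\[
L(\lambda) \;=\; \prod_{i=1}^{n} \mathbb{P}_\lambda(X_i > \delta)
\;=\; \exp\{-n\lambda\delta\}, \qquad \lambda > 0,
\]
\[
\frac{\mathrm{d}}{\mathrm{d}\lambda}\,\log L(\lambda) \;=\; -n\delta \;<\; 0 .
\]
```

Under this reading the likelihood is strictly decreasing in λ, so its supremum (equal to one) is only approached as λ goes to zero, a value excluded from the parameter space: no maximiser exists.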
Filed under: Kids, pictures, Statistics, University life Tagged: Bayesian statistics, copies, final exam, grading, mathematical statistics, MLE, Université Paris Dauphine
Le Monde illustrated an article about discriminations against women with this graph which gives the number of men for 100 women per continent. This is a fairly poor graph, fit for one of Tufte’s counterexamples, as the bars are truncated at 85, make little sense as they do not convey the time dimension, are dwarfed by the legend on the left that is not of the same colors, and also miss the population dimension, which makes the title inappropriate since the graph does not show why there are more men than women on the planet, even if the large percentage of the population of Asia in the World’s population hints at the result.
Filed under: Books, pictures, Statistics Tagged: awful graphs, barplot, Le Monde, World's population
As mentioned in my recent review of Redshirts, I was planning to read John Scalzi’s most recent novel, Lock In, if only to check whether or not Redshirts was an isolated accident! This was the third book from “the pile” that I read through the Yule break and, indeed, it was a worthwhile attempt as the book stands miles above Redshirts…
The story is set in a very convincing near-future America where a significant part of the population is locked by a super-flu into a full paralysis that forces them to rely on robot-like interfaces to interact with unlocked humans. While the book is not all that specific on how the robotic control operates, except from using an inserted “artificial neural network” inside the “locked-in” brains, Scalzi manages to make it sound quite realistic, with societal and corporation issues at the forefront. To the point of selling really well the (usually lame) notion of instantaneous relocation at the other end of the US. And with the bare minimum of changes to the current society, which makes it easier to buy. I have not been that enthralled by a science-fiction universe for quite a while. I also enjoyed how the economics of this development of a new class of citizens was rendered, the book rotating around the consequences of the ending of heavy governmental intervention in lock in research.
Now, the story itself is of a more classical nature in that the danger threatening the locked-in population is uncovered single-handedly by the rookie detective who conveniently happens to be the son of a very influential ex-basketball-player and hence to meet all the characters involved in the plot. This is pleasant but somewhat thin with a limited number of players considering the issues at stake and a rather artificial ending.
Look here for a more profound review by Cory Doctorow.
Filed under: Books, Kids, Travel Tagged: Cory Doctorow, flu, John Scalzi, Lock In, redshirts, robots, science fiction
“One of Jeffreys’ goals was to create default Bayes factors by using prior distributions that obeyed a series of general desiderata.”
The paper Harold Jeffreys’s default Bayes factor hypothesis tests: explanation, extension, and application in Psychology by Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers is both a survey and a reinterpretation cum explanation of Harold Jeffreys‘ views on testing. At about the same time, I received a copy from Alexander and a copy from the journal it had been submitted to! This work starts with a short historical entry on Jeffreys’ work and career, which includes four of his principles, quoted verbatim from the paper:
- “scientific progress depends primarily on induction”;
- “in order to formalize induction one requires a logic of partial belief” [enters the Bayesian paradigm];
- “scientific hypotheses can be assigned prior plausibility in accordance with their complexity” [a.k.a., Occam’s razor];
- “classical “Fisherian” p-values are inadequate for the purpose of hypothesis testing”.
“The choice of π(σ) therefore irrelevant for the Bayes factor as long as we use the same weighting function in both models”
A very relevant point made by the authors is that Jeffreys only considered embedded or nested hypotheses, a fact that allows for having common parameters between models and hence some form of reference prior. Even though (a) I dislike the notion of “common” parameters and (b) I do not think it is entirely legit (I was going to write proper!) from a mathematical viewpoint to use the same (improper) prior on both sides, as discussed in our Statistical Science paper. And in our most recent alternative proposal. The most delicate issue however is to derive a reference prior on the parameter of interest, which is fixed under the null and unknown under the alternative. Hence preventing the use of improper priors. Jeffreys tried to calibrate the corresponding prior by imposing asymptotic consistency under the alternative. And exact indeterminacy under “completely uninformative” data. Unfortunately, this is not a well-defined notion. In the normal example, the authors recall and follow the proposal of Jeffreys to use an improper prior π(σ)∝1/σ on the nuisance parameter and argue in his defence the quote above. I find this argument quite weak because suddenly the prior on σ becomes a weighting function... A notion foreign to the Bayesian cosmology. If we use an improper prior for π(σ), the marginal likelihood on the data is no longer a probability density and I do not buy the argument that one should use the same measure with the same constant both on σ alone [for the nested hypothesis] and on the σ part of (μ,σ) [for the nesting hypothesis]. We are considering two spaces with different dimensions and hence orthogonal measures. This quote thus sounds more like wishful thinking than like a justification. Similarly, the assumption of independence between δ=μ/σ and σ does not make sense for σ-finite measures. 
Note that the authors later point out that (a) the posterior on σ varies between models despite using the same data [which shows that the parameter σ is far from common to both models!] and (b) the [testing] Cauchy prior on δ is only useful for the testing part and should be replaced with another [estimation] prior when the model has been selected. Which may end up as a backfiring argument about this default choice.
“Each updated weighting function should be interpreted as a posterior in estimating σ within their own context, the model.”
The re-derivation of Jeffreys’ conclusion that a Cauchy prior should be used on δ=μ/σ makes it clear that this choice only proceeds from an imperative of fat tails in the prior, without solving the calibration of the Cauchy scale. (Given the now-available modern computing tools, it would be nice to see the impact of this scale γ on the numerical value of the Bayes factor.) And maybe it also proceeds from a “hidden agenda” to achieve a Bayes factor that solely depends on the t statistic. Although this does not sound like a compelling reason to me, since the t statistic is not sufficient in this setting.
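As a rough illustration of the sensitivity to the Cauchy scale mentioned in the parenthesis above, here is a minimal R sketch. It assumes (this is not taken from the paper) that the Bayes factor can be written from the t statistic alone, with t given δ following a noncentral t distribution with n−1 degrees of freedom and noncentrality √n δ, and δ following a Cauchy(0,γ) prior. The function name bf10 and the values of t and n are hypothetical.

```r
# Bayes factor B10 (alternative over null) for a one-sample t test,
# as a function of the scale gamma of the Cauchy prior on delta = mu / sigma.
# Assumption: t | delta is noncentral t with n - 1 degrees of freedom and
# noncentrality sqrt(n) * delta; under the null, t is a central t.
bf10 <- function(tstat, n, gamma) {
  df <- n - 1
  integrand <- function(delta) {
    dt(tstat, df = df, ncp = sqrt(n) * delta) * dcauchy(delta, scale = gamma)
  }
  marginal_h1 <- integrate(integrand, lower = -Inf, upper = Inf)$value
  marginal_h1 / dt(tstat, df = df)
}

tstat <- 2.5   # hypothetical observed t statistic
n <- 30        # hypothetical sample size
gammas <- c(0.1, 0.5, 1, 2, 5, 10)
bfs <- sapply(gammas, function(g) bf10(tstat, n, g))
print(data.frame(gamma = gammas, B10 = bfs))
```

For large γ the prior mass near the bulk of the likelihood shrinks roughly like 1/γ, so the Bayes factor in favour of the alternative is expected to decrease accordingly, which is the calibration issue at stake.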
In a differently interesting way, the authors mention the Savage-Dickey ratio (p.16) as a way to represent the Bayes factor for nested models, without necessarily perceiving the mathematical difficulty with this ratio that we pointed out a few years ago. For instance, in the psychology example processed in the paper, the test is between δ=0 and δ≥0; however, if I set π(δ=0)=0 under the alternative prior, which should not matter [from a measure-theoretic perspective where the density is uniquely defined almost everywhere], the Savage-Dickey representation of the Bayes factor returns zero, instead of 9.18!
“In general, the fact that different priors result in different Bayes factors should not come as a surprise.”
The second example detailed in the paper is the test for a zero Gaussian correlation. This is a sort of “ideal case” in that the parameter of interest is between -1 and 1, hence makes the choice of a uniform U(-1,1) easy or easier to argue. Furthermore, the setting is also “ideal” in that the Bayes factor simplifies down into a marginal over the sample correlation only, under the usual Jeffreys priors on means and variances. So we have a second case where the frequentist statistic behind the frequentist test[ing procedure] is also the single (and insufficient) part of the data used in the Bayesian test[ing procedure]. Once again, we are in a setting where Bayesian and frequentist answers are in one-to-one correspondence (at least for a fixed sample size). And where the Bayes factor allows for a closed form through hypergeometric functions. Even in the one-sided case. (This is a result obtained by the authors, not by Jeffreys who, as the proper physicist he was, obtained approximations that are remarkably accurate!)
“The fact that the Bayes factor is independent of the intention with which the data have been collected is of considerable practical importance.”
The authors have a side argument in this section in favour of the Bayes factor against the p-value, namely that the “Bayes factor does not depend on the sampling plan” (p.29), but I find this fairly weak (or tongue in cheek) as the Bayes factor does depend on the sampling distribution imposed on top of the data. It appears that the argument is mostly used to defend sequential testing.
“The Bayes factor (…) balances the tension between parsimony and goodness of fit, (…) against overfitting the data.”
In fine, I liked very much this re-reading of Jeffreys’ approach to testing, maybe the more because I now think we should get away from it! I am not certain it will help in convincing psychologists to adopt Bayes factors for assessing their experiments as it may instead frighten them away. And it does not bring an answer to the vexing issue of the relevance of point null hypotheses. But it constitutes a lucid and innovative account of the major advance represented by Jeffreys’ formalisation of Bayesian testing.
Filed under: Books, Statistics, University life Tagged: Bayesian hypothesis testing, Dickey-Savage ratio, Harold Jeffreys, overfitting, Statistical Science, testing, Theory of Probability
I found another paper on the Jeffreys-Lindley paradox. Entitled “A Misleading Intuition and the Bayesian Blind Spot: Revisiting the Jeffreys-Lindley’s Paradox”. Written by Guillaume Rochefort-Maranda, from Université Laval, Québec.
This paper starts by assuming an unbiased estimator of the parameter of interest θ and under test for the null θ=θ0. (Which makes me wonder at the reason for imposing unbiasedness.) Another highly innovative (or puzzling) aspect is that the Lindley-Jeffreys paradox presented therein is described without any Bayesian input. The paper stands “within a frequentist (classical) framework”: it actually starts with a confidence-interval-on-θ-vs.-test argument to argue that, with a fixed coverage interval that excludes the null value θ0, the estimate of θ may converge to θ0 without ever accepting the null θ=θ0. That is, without the confidence interval ever containing θ0. (Although this is an event whose probability converges to zero.) Bayesian aspects come later in the paper, even though the application to a point null versus a point null test is of little interest since a Bayes factor is then a likelihood ratio.
As I explained several times, including in my Philosophy of Science paper, I see the Lindley-Jeffreys paradox as being primarily a Bayesiano-Bayesian issue. So just the opposite of the perspective taken by the paper. That frequentist solutions differ does not strike me as paradoxical. Now, the construction of a sequence of samples such that all partial samples in the sequence exclude the null θ=θ0 is not a likely event, so I do not see this as a paradox even or especially when putting on my frequentist glasses: if the null θ=θ0 is true, this cannot happen in a consistent manner, even though a single occurrence of a p-value less than .05 is highly likely within such a sequence.
Unsurprisingly, the paper relates to the three most recent papers published by Philosophy of Science, discussing first and foremost Spanos‘ view. When the current author introduces Mayo and Spanos’ severity, i.e. the probability to exceed the observed test statistic under the alternative, he does not define this test statistic d(X), which makes the whole notion incomprehensible to a reader not already familiar with it. (And even for one familiar with it…)
“Hence, the solution I propose (…) avoids one of [Freeman’s] major disadvantages. I suggest that we should decrease the size of tests to the extent where it makes practically no difference to the power of the test in order to improve the likelihood ratio of a significant result.” (p.11)
One interesting if again unsurprising point in the paper is that one reason for the paradox stands in keeping the significance level constant as the sample size increases. While it is possible to decrease the significance level and to increase the power simultaneously. However, the solution proposed above does not sound rigorous hence I fail to understand how low the significance has to be for the method to stop/work. I cannot fathom a corresponding algorithmic derivation of the author’s proposal.
“I argue against the intuitive idea that a significant result given by a very powerful test is less convincing than a significant result given by a less powerful test.”
The criticism on the “blind spot” of the Bayesian approach is supported by an example where the data is issued from a distribution other than either of the two tested distributions. It seems reasonable that the Bayesian answer fails to provide a proper answer in this case. Even though it illustrates the difficulty with the long-term impact of the prior(s) in the Bayes factor and (in my opinion) the need to move away from this solution within the Bayesian paradigm.
Filed under: Books, Statistics, University life Tagged: Bayes factor, frequentist inference, Jeffreys-Lindley paradox, Philosophy of Science, Québec, Université Laval
In conjunction with the recent PNAS paper on massive model choice, Rob Johnson†, Paul Kirk and Michael Stumpf published in Bioinformatics an implementation of nested sampling that is designed for biological applications, called SYSBIONS. Hence the NS for nested sampling! The C software is available on-line. (I had planned to post this news next to my earlier comments but it went under the radar…)
Filed under: Books, Statistics, University life Tagged: Bayesian model choice, bioinformatics, nested sampling, PNAS, SYSBIONS
Another Cross Validated forum question that led me to an interesting (?) reconsideration of certitudes! When simulating from a normal distribution, is the Box-Muller algorithm better or worse than using the inverse cdf transform? My first reaction was to state that Box-Muller was exact while the inverse cdf relied on the coding of the inverse cdf, like qnorm() in R. Upon reflection and commenting by other members of the forum, like William Huber, I came to moderate this perspective since Box-Muller also relies on transcendental functions like sin and log, hence writing
also involves approximating in the coding of those functions. While it is feasible to avoid the call to trigonometric functions (see, e.g., Algorithm A.8 in our book), the call to the logarithm seems inescapable. So it ends up with the issue of which of the two functions is better coded, both in terms of speed and precision. Surprisingly, when coding in R, the inverse cdf may be the winner: here is the comparison I ran at the time I wrote my comments:

    > system.time(qnorm(runif(10^8)))
    utilisateur     système      écoulé
         10.137       0.120      10.251
    > system.time(rnorm(10^8))
    utilisateur     système      écoulé
         13.417       0.060      13.472

However, rerunning it today, I get opposite results (pardon my French, I failed to turn the messages to English: utilisateur, système and écoulé stand for user, system and elapsed):

    > system.time(qnorm(runif(10^8)))
    utilisateur     système      écoulé
         10.137       0.144      10.274
    > system.time(rnorm(10^8))
    utilisateur     système      écoulé
          7.894       0.060       7.948

(There is coherence in the system time, which shows rnorm as twice as fast as the call to qnorm.) In terms of precision, I could not spot a divergence from normality, either through a ks.test over 10⁸ simulations or in checking the tails:
“Only the inversion method is inadmissible because it is slower and less space efficient than all of the other methods, the table methods excepted”. Luc Devroye, Non-uniform random variate generation, 1985
Update: As pointed out by Radford Neal in his comment, the above comparison is meaningless because the function rnorm() is by default based on inversion, that is, on qnorm()! As indicated by Alexander Blocker in another comment, using another generator requires calling RNGkind, as in RNGkind(normal.kind = "Box-Muller").
(And thanks to Jean-Louis Foulley for salvaging this quote from Luc Devroye, which does not appear to apply to the current coding of the Gaussian inverse cdf.)
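Given the update, a fairer comparison would pit an explicit Box-Muller transform against the inverse cdf. The following is only a sketch under stated assumptions: box_muller() is a hand-written illustrative helper (not R's internal Box-Muller code), the sample size is reduced to 10⁶ to keep the run short, and timings will vary across machines and R versions.

```r
set.seed(2015)
n <- 1e6  # smaller than the 10^8 used above, to keep the run short

# Hand-coded Box-Muller transform (illustrative helper)
box_muller <- function(n) {
  m <- ceiling(n / 2)
  u1 <- runif(m)
  u2 <- runif(m)
  r <- sqrt(-2 * log(u1))
  c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))[seq_len(n)]
}

# Speed: Box-Muller versus inverse cdf applied to uniforms
time_bm  <- system.time(x_bm  <- box_muller(n))
time_inv <- system.time(x_inv <- qnorm(runif(n)))
print(rbind(box_muller  = unclass(time_bm)[1:3],
            inverse_cdf = unclass(time_inv)[1:3]))

# Precision: Kolmogorov-Smirnov p-values and extreme quantiles
print(ks.test(x_bm, "pnorm")$p.value)
print(ks.test(x_inv, "pnorm")$p.value)
probs <- c(1e-5, 1e-4, 1 - 1e-4, 1 - 1e-5)
print(rbind(box_muller  = quantile(x_bm, probs),
            inverse_cdf = quantile(x_inv, probs),
            exact       = qnorm(probs)))

# Making rnorm() itself use Box-Muller, then restoring the previous setting
old_kind <- RNGkind(normal.kind = "Box-Muller")
x_rn <- rnorm(n)
RNGkind(normal.kind = old_kind[2])
```

The last three lines use RNGkind() as suggested in the comments; the previous normal generator is restored afterwards so that later calls to rnorm() are unaffected.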
Filed under: Books, Kids, R, Statistics, University life Tagged: Box-Muller algorithm, cross validated, inverse cdf, logarithm, normal distribution, qnorm()
Goshdarnit! On the way home tonight, I stopped as usual at the bakery nearby and as usual I left my bike outside next to the entrance. Without bothering with locking as it takes me ages to unlock it with the left hand… When I came out a minute later, the bike was gone!!! At least, this did not happen with my borrowed high-range bike in Warwick… And definitely nothing comparable with the jogger being shot nearby on my running route a week ago. But, still, a real bummer. And I cannot even use my Amazon credit to get the new one! (The local deity I had installed a few months ago on the front fork did not have the intended impact. I may have to switch to another creed…)
Filed under: Kids, pictures, Running Tagged: baguette, bakery, Barbamama, Decathlon, mountain bike, stolen bike
Today (01/06) was a double epiphany in that I realised that one of my long-time beliefs about unbiased estimators did not hold. Indeed, when checking on Cross Validated, I found this question: For which distributions is there a closed-form unbiased estimator for the standard deviation? And the presentation includes the normal case for which indeed there exists an unbiased estimator of σ, namely
which derives directly from the chi-square distribution of the sum of squares divided by σ². When thinking further about it, if a posteriori!, it is now fairly obvious given that σ is a scale parameter. Better, any power of σ can be similarly estimated in an unbiased manner, since
And this property extends to all location-scale models.
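The displayed formulas did not survive in this copy of the post. As a sketch, here is the standard derivation in the normal case, writing S² for the sum of squared deviations (a notation introduced here for convenience), which follows from the moments of the chi-square distribution:

```latex
\[
S^2 \;=\; \sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sim\; \sigma^2 \chi^2_{n-1},
\]
\[
\mathbb{E}\left[ S^{\alpha} \right]
\;=\; \sigma^{\alpha}\, 2^{\alpha/2}\,
\frac{\Gamma\{(n-1+\alpha)/2\}}{\Gamma\{(n-1)/2\}},
\qquad \alpha > -(n-1),
\]
\[
\hat{\sigma} \;=\; \frac{\Gamma\{(n-1)/2\}}{\sqrt{2}\,\Gamma(n/2)}\; S
\qquad\text{satisfies}\qquad
\mathbb{E}[\hat{\sigma}] = \sigma .
\]
```

The last line is the case α=1 of the middle one; dividing S to the power α by the corresponding constant gives an unbiased estimator of σ to the power α.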
So how on Earth was I so convinced that there was no unbiased estimator of σ?! I think it stems from reading too quickly a result in, I think, Lehmann and Casella, result due to Peter Bickel and Erich Lehmann that states that, for a convex family of distributions F, there exists an unbiased estimator of a functional q(F) (for a sample size n large enough) if and only if q(αF+(1−α)G) is a polynomial in 0≤α≤1. Because of this, I had this [wrong!] impression that only polynomials of the natural parameters of exponential families can be estimated by unbiased estimators… Note that Bickel’s and Lehmann’s theorem does not apply to the problem here because the collection of Gaussian distributions is not convex (a mixture of Gaussians is not a Gaussian).
This leaves open the question as to which transforms of the parameter(s) are unbiasedly estimable (or U-estimable) for a given parametric family, like the normal N(μ,σ²). I checked in Lehmann’s first edition earlier today and could not find an answer, besides the definition of U-estimability. Not only is the question interesting per se but the answer could come to correct my long-going impression that unbiasedness is a rare event, i.e., that the collection of transforms of the model parameter that are U-estimable is a very small subset of the whole collection of transforms.
Filed under: Books, Kids, Statistics, University life Tagged: cross validated, epiphany, Erich Lehmann, George Casella, mathematical statistics, Theory of Point Estimation, U-estimability, unbiased estimation
“Our approach in handling the model uncertainty has some resemblance to statistical ‘‘emulators’’ (Kennedy and O’Hagan, 2001), approximative methods used to express the model uncertainty when simulating data under a mechanistic model is computationally intensive. However, emulators are often motivated in the context of Gaussian processes, where the uncertainty in the model space can be reasonably well modeled by a normal distribution.”
Pierre Pudlo pointed out to me the paper AABC: Approximate approximate Bayesian computation for inference in population-genetic models by Buzbas and Rosenberg that just appeared in the first 2015 issue of Theoretical Population Biology. Despite the claim made above, including a confusion on the nature of Gaussian processes, I am rather reserved about the appeal of this AA rated ABC…
“When likelihood functions are computationally intractable, likelihood-based inference is a challenging problem that has received considerable attention in the literature (Robert and Casella, 2004).”
The ABC approach suggested therein is doubly approximate in that simulation from the sampling distribution is replaced with simulation from a substitute cheaper model. After a learning stage using the costly sampling distribution. While there is convergence of the approximation to the genuine ABC posterior under infinite sample and Monte Carlo sample sizes, there is no correction for this approximation and I am puzzled by its construction. It seems (see p.34) that the cheaper model is built by a sort of weighted bootstrap: given a parameter simulated from the prior, weights based on its distance to a reference table are constructed and then used to create a pseudo-sample by weighted sampling from the original pseudo-samples. Rather than using a continuous kernel centred on those original pseudo-samples, as would be the suggestion for a non-parametric regression. Each pseudo-sample is accepted only when a distance between the summary statistics is small enough. This bootstrap flavour is counter-intuitive in that it requires a large enough sample from the true sampling distribution to operate with some confidence… I also wonder at what happens when the data is not iid. (I added the quote above as another source of puzzlement, since the book is about cases when the likelihood is manageable.)
Filed under: Books, Statistics, University life Tagged: ABC, bootstrap, Dirichlet prior, Gaussian processes, Monte Carlo methods, Theoretical Population Biology
Filed under: Kids, pictures, Statistics, University life Tagged: copies, final exam, grading, mathematical statistics, Université Paris Dauphine
Another question I picked on Cross Validated during the Yule break is about the connection between the Bhattacharyya distance and the Kullback-Leibler divergence, i.e.,
Although this Bhattacharyya distance sounds close to the Hellinger distance,
the ordering I got by a simple Jensen inequality is
and I wonder how useful this ordering could be…
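The displayed formulas are missing from this copy of the post. Under the usual definitions (the normalisation of the Hellinger distance below is an assumption, as some authors include a factor ½), the Jensen argument can be sketched as follows:

```latex
\[
d_B(p,q) = -\log \int \sqrt{p(x)\,q(x)}\,\mathrm{d}x, \qquad
d_{KL}(p\|q) = \int p(x) \log\frac{p(x)}{q(x)}\,\mathrm{d}x, \qquad
d_H(p,q) = \Big\{ 1 - \int \sqrt{p(x)\,q(x)}\,\mathrm{d}x \Big\}^{1/2},
\]
\[
d_B(p,q) = -\log \mathbb{E}_p\!\left[ \sqrt{q(X)/p(X)} \right]
\;\le\; -\mathbb{E}_p\!\left[ \log \sqrt{q(X)/p(X)} \right]
= \tfrac{1}{2}\, d_{KL}(p\|q),
\]
\[
d_H(p,q)^2 = 1 - \exp\{-d_B(p,q)\} \;\le\; d_B(p,q)
\quad\Longrightarrow\quad
d_{KL}(p\|q) \;\ge\; 2\, d_B(p,q) \;\ge\; 2\, d_H(p,q)^2 .
\]
```

The first inequality is Jensen's inequality applied to the convex function −log, the second uses 1−exp(−x) ≤ x.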
Filed under: Books, Kids, Statistics Tagged: Bhattacharyya distance, cross validated, Hellinger distance, Jensen inequality, Kullback-Leibler divergence
A paper on the comparison of emulation methods for Approximate Bayesian Computation was recently arXived by Jabot et al. The idea is to bypass costly simulations of pseudo-data by running cheaper simulation from a pseudo-model or emulator constructed via a preliminary run of the original and costly model. To borrow from the paper introduction, ABC-Emulation runs as follows:
- design a small number n of parameter values covering the parameter space;
- generate n corresponding realisations from the model and store the corresponding summary statistics;
- build an emulator (model) based on those n values;
- run ABC using the emulator in lieu of the original model.
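The four steps above can be sketched in R on a toy problem. Everything below is an illustrative assumption rather than the setting of the paper: the "costly" model is a normal mean with a single summary statistic, the prior is uniform on (0,10), the emulator is a loess regression of the summary on the parameter with Gaussian noise added, and the observed summary s_obs is made up.

```r
set.seed(42)

# Toy stand-in for a costly simulator: summary = mean of 20 N(theta, 1) draws
simulate_summary <- function(theta) mean(rnorm(20, mean = theta, sd = 1))
s_obs <- 3  # hypothetical observed summary statistic

# Step 1: small design of parameter values covering the parameter space
n_design <- 100
theta_design <- seq(0, 10, length.out = n_design)

# Step 2: run the costly model on the design and store the summaries
s_design <- vapply(theta_design, simulate_summary, numeric(1))

# Step 3: emulator = local regression of the summary on the parameter,
# made stochastic by adding noise with the residual standard deviation
design <- data.frame(theta = theta_design, s = s_design)
fit <- loess(s ~ theta, data = design)
sigma_hat <- sd(residuals(fit))
emulate <- function(theta) {
  predict(fit, newdata = data.frame(theta = theta)) +
    rnorm(length(theta), sd = sigma_hat)
}

# Step 4: ABC rejection with the emulator in lieu of the original model
n_abc <- 1e5
theta_prior <- runif(n_abc, 0, 10)
s_emul <- emulate(theta_prior)
dist <- abs(s_emul - s_obs)
keep <- dist <= quantile(dist, 0.01)  # keep the closest 1%
theta_abc <- theta_prior[keep]

print(c(mean = mean(theta_abc), sd = sd(theta_abc)))
```

Only 100 calls to the toy simulator are made, against 10⁵ calls to the emulator, which is the whole point of the approach; the quality of the output obviously hinges on how well the emulator reproduces the distribution of the summary.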
A first emulator proposed in the paper is to use local regression, as in Beaumont et al. (2002), except that it goes the reverse way: the regression model predicts a summary statistic given the parameter value. The second and last emulator relies on Gaussian processes, as in Richard Wilkinson‘s as well as Ted Meeds’s and Max Welling‘s recent work [also quoted in the paper]. The comparison of the above emulators is based on an ecological community dynamics model. The results are that the stochastic version is superior to the deterministic one, but overall not very useful when implementing the Beaumont et al. (2002) correction. The paper however does not define what deterministic and what stochastic mean…
“We therefore recommend the use of local regressions instead of Gaussian processes.”
While I find the conclusions of the paper somewhat over-optimistic given the range of the experiment and the limitations of the emulator options (like non-parametric conditional density estimation), it seems to me that this is a direction to be pursued as we need to be able to simulate directly a vector of summary statistics instead of the entire data process, even when considering an approximation to the distribution of those summaries.
Filed under: Books, Statistics Tagged: ABC, ABC-Emulation, Approximate Bayesian computation, ecological models, emulation, Gaussian processes, Monte Carlo methods, summary statistics
“Religion, a mediaeval form of unreason, when combined with modern weaponry becomes a real threat to our freedoms. This religious totalitarianism has caused a deadly mutation in the heart of Islam and we see the tragic consequences in Paris today. I stand with Charlie Hebdo, as we all must, to defend the art of satire, which has always been a force for liberty and against tyranny, dishonesty and stupidity. ‘Respect for religion’ has become a code phrase meaning ‘fear of religion.’ Religions, like all other ideas, deserve criticism, satire, and, yes, our fearless disrespect.” Salman Rushdie
Filed under: Books Tagged: Charlie Hebdo, Charpentier, Je suis Charlie, leçons de ténèbres, Salman Rushdie