## Xian's Og

### maximum likelihood: an introduction

*“Basic Principle 0. Do not trust any principle.” L. Le Cam (1990)*

**H**ere is the abstract of an International Statistical Review 1990 paper by Lucien Le Cam on maximum likelihood. ISR keeping a tradition of including an abstract in French for every paper, Le Cam (most presumably) wrote his own translation [or maybe wrote the French version first], which sounds much funnier to me and so I cannot resist posting both, pardon my/his French! *[I just find “Ce fait” rather unusual, as I would have rather written “Ceci fait”…]*:

Maximum likelihood estimates are reported to be best under all circumstances. Yet there are numerous simple examples where they plainly misbehave. One gives some examples for problems that had not been invented for the purpose of annoying maximum likelihood fans. Another example, imitated from Bahadur, has been specially created with just such a purpose in mind. Next, we present a list of principles leading to the construction of good estimates. The main principle says that one should not believe in principles but study each problem for its own sake.

L’auteur a ouï dire que la méthode du maximum de vraisemblance est la meilleure méthode d’estimation. C’est bien vrai, et pourtant la méthode se casse le nez sur des exemples bien simples qui n’avaient pas été inventés pour le plaisir de montrer que la méthode peut être très désagréable. On en donne quelques-uns, plus un autre, imité de Bahadur et fabriqué exprès pour ennuyer les admirateurs du maximum de vraisemblance. Ce fait, on donne une savante liste de principes de construction de bons estimateurs, le principe principal étant qu’il ne faut pas croire aux principes.

The entire paper is just as witty, as in describing the mixture model as “contaminated and not fit to drink”! Or in “Everybody knows that taking logarithms is unfair”. Or, again, in “biostatisticians, being complicated people, prefer to work out not with the dose y but with its logarithm”… And a last line: “One possibility is that there are too many horse hairs in e”.

Filed under: Books, Statistics Tagged: Bahadur, International Statistical Review, Likelihood Principle, Lucien Le Cam, maximum likelihood estimation

### a neat (theoretical) Monte Carlo result

**M**ark Huber just arXived a short paper where he develops a Monte Carlo approach that bounds the probability of large errors

by computing a lower bound on the sample size r and I wondered at the presence of μ in the bound as it indicates the approach is not translation invariant. One reason is that the standard deviation of the simulated random variables is bounded by cμ. Another reason is that Mark uses as his estimator the median

where the S’s are partial averages of sufficient length and the R’s are independent uniforms over (1-ε,1+ε): using those uniforms may improve the coverage of given intervals but it also means that the absolute scale of the error is multiplied by the scale of S, namely μ. I first thought that some a posteriori recentering could improve the bound but since this does not impact the variance of the simulated random variables, I doubt it is possible.

Filed under: Books, Statistics, University life Tagged: confidence sets, importance sampling, Monte Carlo integration, unbiased estimation, variance reduction

### full Bayesian significance test

**A**mong the many comments (thanks!) I received when posting our Testing via mixture estimation paper came the suggestion to relate this approach to the notion of full Bayesian significance test (FBST) developed by (Julio, not Hal) Stern and Pereira, from São Paulo, Brazil. I thus had a look at this alternative and read the Bayesian Analysis paper they published in 2008, as well as a paper recently published in Logic Journal of IGPL. (I could not find what the IGPL stands for.) The central notion in these papers is the *e-value*, which provides the *posterior probability that the posterior density is larger than the largest posterior density over the null set*. This definition bothers me, first because the *null* set has a measure equal to zero under an absolutely continuous prior (BA, p.82). Hence the posterior density is defined in an arbitrary manner over the *null* set and the maximum is itself arbitrary. (An issue that invalidates my 1993 version of the Lindley-Jeffreys paradox!) And second because it considers the posterior probability of an event that does not exist a priori, being conditional on the data. This sounds in fact quite similar to *Statistical Inference*, Murray Aitkin’s (2009) book using a posterior distribution of the likelihood function. With the same drawback of using the data twice. And the other issues discussed in our commentary of the book. (As a side-much-on-the-side remark, the authors incidentally forgot me when citing our 1992 Annals of Statistics paper about decision theory on accuracy estimators..!)
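To make the definition of the e-value concrete, here is a minimal sketch under assumptions that are mine and not taken from the FBST papers: the posterior is a Normal distribution (the values 1.0 and 0.5 are hypothetical), the null set is the single point θ=0, and the usual continuous version of the Normal density is used on the null set. That last choice is exactly the arbitrary one criticised above.

```python
import random
from statistics import NormalDist

def e_value_mc(post, null_point, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P( pi(theta|x) > pi(null_point|x) | x )."""
    rng = random.Random(seed)
    # largest posterior density over the (one-point) null set
    threshold = post.pdf(null_point)
    hits = 0
    for _ in range(n_draws):
        theta = rng.gauss(post.mean, post.stdev)  # draw from the posterior
        if post.pdf(theta) > threshold:
            hits += 1
    return hits / n_draws

# Hypothetical posterior N(1, 0.5^2), chosen only for illustration.
posterior = NormalDist(mu=1.0, sigma=0.5)
mc = e_value_mc(posterior, 0.0)

# For a Normal posterior the event is |theta - mean| < |mean - 0|,
# which has a closed-form probability.
exact = 2 * NormalDist().cdf(abs(posterior.mean) / posterior.stdev) - 1
print(mc, exact)
```

Changing the value assigned to the density at θ=0 changes `threshold` and hence the answer, even though it leaves the posterior distribution unchanged.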

Filed under: Books, Statistics Tagged: Bayes factor, Bayesian Analysis, Bayesian model choice, e-values, full Bayesian significance test, logic journal of the IGPL, measure theory, Murray Aitkin, p-values, São Paulo, statistical inference

### Topological sensitivity analysis for systems biology

**M**ichael Stumpf sent me Topological sensitivity analysis for systems biology, written by Ann Babtie and Paul Kirk, *en avant-première* before it came out in PNAS and I read it during the trip to NIPS in Montréal. (The paper is published in open access, so everyone can read it now!) The topic is quite central to a lot of debates about climate change, economics, ecology, finance, &tc., namely to assess the impact of using the wrong model to draw conclusions and make decisions about a real phenomenon. (Which reminded me of the distinction between mechanical and phenomenological models stressed by Michael Blum in his NIPS talk.) And it is of much interest from a Bayesian point of view since assessing the worth of a model requires modelling the “outside” of a model, using for instance Gaussian processes as in the talk Tony O’Hagan gave in Warwick earlier this term. I would even go as far as saying that the issue of assessing [and compensating for] how wrong a model is, given available data, may be the (single) most under-assessed issue in statistics. We (statisticians) have yet to reach our Boxian era.

In Babtie et al., the space or universe of models is represented by network topologies, each defining the set of “parents” in a semi-Markov representation of the (dynamic) model. At which stage Gaussian processes are also called for help. Alternative models are ranked in terms of fit according to a distance between simulated data from the original model (sounds like a form of ABC?!). Obviously, there is a limitation in the number and variety of models considered this way, I mean there are still assumptions made on the possible models, while this number of models is increasing quickly with the number of nodes. As pointed out in the paper (see, e.g., Fig.4), the method has a parametric bootstrap flavour, to some extent.

What is unclear is how one can conduct Bayesian inference with such a collection of models. Unless all models share the same “real” parameters, which sounds unlikely. The paper mentions using uniform prior on all parameters, but this is difficult to advocate in a general setting. Another point concerns the quantification of how much one can trust a given model, since it does not seem models are penalised by a prior probability. Hence they all are treated identically. This is a limitation of the approach (or an indication that it is only a preliminary step in the evaluation of models) in that some models within a large enough collection will eventually provide an estimate that differs from those produced by the other models. So the assessment may become altogether highly pessimistic for this very reason.

*“If our parameters have a real, biophysical interpretation, we therefore need to be very careful not to assert that we know the true values of these quantities in the underlying system, just because–for a given model–we can pin them down with relative certainty.”*

In addition to its relevance for moving towards approximate models and approximate inference, and in continuation of yesterday’s theme, the paper calls for nested sampling to generate samples from the posterior(s) and to compute the evidence associated with each model. (I realised I had missed this earlier paper by Michael and co-authors on nested sampling for systems biology.) There is no discussion in the paper on why nested sampling was selected, compared with, say, a random walk Metropolis-Hastings algorithm. Unless it is used in a fully automated way, but the paper is rather terse on that issue… And running either approach on 10⁷ models in comparison sounds like an awful lot of work!!! Using importance [sampling] nested sampling as we proposed with Nicolas Chopin could be a way to speed up this exploration if all parameters are identical between all or most models.

Filed under: Books, Statistics, Travel, University life Tagged: Gaussian processes, model choice, model validation, nested sampling, network, PNAS, topology

### Montréal snapshot

### an extension of nested sampling

**I** was reading [in the Paris métro] *Hastings-Metropolis algorithm on Markov chains for small-probability estimation*, arXived a few weeks ago by François Bachoc, Lionel Lenôtre, and Achref Bachouch, when I came upon their first algorithm that reminded me much of nested sampling: the following was proposed by Guyader et al. in 2011,

To approximate a tail probability **P**(H(X)>h),

- start from an iid sample of size N from the reference distribution;
- at each iteration m, select the point x with the smallest H(x)=ξ and replace it with a new point y simulated under the constraint H(y)≥ξ;
- stop when all points in the sample are such that H(X)>h;
- take

as the unbiased estimator of **P**(H(X)>h).
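The steps above can be sketched in a few lines. This is only an illustration under assumptions of mine: the reference distribution is N(0,1), H is the identity, the constrained simulation is done exactly by inverse cdf (no Metropolis-Hastings step), and the estimator is taken to be (1-1/N)^M with M the number of replacements made before stopping, which is my reading of the formula missing from the text above.

```python
import random
from statistics import NormalDist, fmean

STD = NormalDist()  # reference distribution N(0,1), an assumption of this sketch

def last_particle_estimate(h, n_particles=200, seed=0):
    """Estimate P(X > h) for X ~ N(0,1) by repeatedly replacing the lowest point."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    moves = 0  # number of replacements, M
    while True:
        i_min = min(range(n_particles), key=xs.__getitem__)
        xi = xs[i_min]
        if xi > h:  # all points are now above the level h
            break
        # exact draw from N(0,1) restricted to [xi, infinity) by inverse cdf
        lo = STD.cdf(xi)
        u = min(lo + rng.random() * (1.0 - lo), 1.0 - 1e-12)
        xs[i_min] = STD.inv_cdf(u)
        moves += 1
    return (1.0 - 1.0 / n_particles) ** moves

truth = 1.0 - STD.cdf(2.0)
single = last_particle_estimate(2.0)
average = fmean(last_particle_estimate(2.0, seed=s) for s in range(20))
print(truth, single, average)
```

Averaging independent replications, as in the last lines, is one way to check empirically that the estimator is centred on the true tail probability.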

Hence, except for the stopping rule, this is the same implementation as nested sampling. Furthermore, Guyader et al. (2011) also take advantage of the nested sampling fact that, if direct simulation under the constraint H(y)≥ξ is infeasible, simulating via one single step of a Metropolis-Hastings algorithm is as valid as direct simulation. (I could not access the paper, but the reference list of Guyader et al. (2011) includes both original papers by John Skilling, so the connection must be made in the paper.) What I find most interesting in this algorithm is that it even achieves unbiasedness (even in the MCMC case!).

Filed under: Books, Statistics, University life Tagged: arXiv, extreme value theory, Hastings-Metropolis sampler, Metropolis-Hastings algorithms, nested sampling, Poisson process, rare events, unbiasedness

### NIPS 2014

**S**econd and last day of the NIPS workshops! The collection of topics was quite broad and would have made my choosing an ordeal, except that I was invited to give a talk at the probabilistic programming workshop, solving my dilemma… The first talk by Kathleen Fisher was quite enjoyable in that it gave a conceptual discussion of the motivations for probabilistic languages, drawing an analogy with the early days of computer programming that saw a separation between higher level computer languages and machine programming, with a compiler interface. And calling for a similar separation between the models faced by statistical inference and machine-learning and the corresponding code, if I understood her correctly. This was connected with Frank Wood’s talk of the previous day where he illustrated the concept through a generation of computer codes to approximately generate from standard distributions like Normal or Poisson. Approximately as in ABC, which is why the organisers invited me to talk in this session. However, I was a wee bit lost in the following talks and presumably lost part of my audience during *my* talk, as I realised later to my dismay when someone told me he had not perceived the distinction between the trees in the random forest procedure and the phylogenetic trees in the population genetic application. Still, while it had for me a sort of Twilight Zone feeling of having stepped in another dimension, attending this workshop was a worthwhile experiment as an eye-opener into a highly different albeit connected field, where code and simulator may take the place of a likelihood function… To the point of defining Hamiltonian Monte Carlo directly on the former, as Vikash Mansinghka showed me at the break.

I completed the day with the final talks in the variational inference workshop, if only to get back on firmer ground! Apart from attending my third talk by Vikash in the conference (but on a completely different topic on variational approximations for discrete particle-ar distributions), a talk by Tim Salimans linked MCMC and variational approximations, using MCMC and HMC to derive variational bounds. (He did not expand on the opposite use of variational approximations to build better proposals.) Overall, I found these two days and my first NIPS conference quite exciting, if somewhat overpowering, with a different atmosphere and a different pace compared with (small or large) statistical meetings. (And a staggering gender imbalance!)

Filed under: Kids, pictures, Statistics, Travel, University life Tagged: ABC, Andrey Markov, Canada, compiler, Montréal, mug, NIPS 2014, phylogenetic tree, population genetics, probabilistic programming, random forests, variational Bayes methods

### further up North

Filed under: pictures, Travel, University life Tagged: Atlantic ocean, Canada, clouds, dusk, Greenland, Montréal, NIPS 2014, sunset

### up North

Filed under: pictures, Travel, University life Tagged: Canada, clouds, Montréal, NIPS 2014, plane picture, Scotland

### ABC à Montréal

**S**o today was the NIPS 2014 workshop, “ABC in Montréal”, which started with a fantastic talk by Juliane Liepe on some exciting applications of ABC to the migration of immune cells, with the analysis of movies involving those cells acting to heal a damaged fly wing and a cut fish tail. Quite amazing videos, really. (With the great entry line of ‘We have all cut a finger at some point in our lives’!) The statistical model behind those movies was a random walk on a grid, with different drift and bias features that served as model characteristics. Frank Wood managed to deliver his talk despite a severe case of food poisoning, with a great illustration of probabilistic programming that made me understand (at last!) the very idea of probabilistic programming. And Vikash Mansinghka presented some applications in image analysis. Those two talks led me to realise why probabilistic programming was so close to ABC, with a programming touch! Hence why I was invited to talk today! Then Dennis Prangle exposed his latest version of lazy ABC, that I have already commented on the ‘Og, somewhat connected with our delayed acceptance algorithm, to the point that maybe something common can stem out of the two notions. Michael Blum ended the day with provocative answers to the provocative question of Ted Meeds as to whether or not machine learning needed ABC (*Ans.* No!) and whether or not machine learning could help ABC (*Ans.* ???). With a happy mix-up between mechanistic and phenomenological models that helped generate discussion from the floor.

The posters were also of much interest, with calibration as a distance measure by Michael Guttman, in continuation of the poster he gave at MCMski, Aaron Smith presenting his work with Luke Bornn, Natesh Pillai and Dawn Woodard, on why a single pseudo-sample is enough for ABC efficiency. This gave me the opportunity to discuss with him the apparent contradiction with the result of Kryz Łatunsziński and Anthony Lee about the geometric convergence of ABC-MCMC only attained with a random number of pseudo-samples… And to wonder if there is a geometric *versus* binomial dilemma in this setting, namely, whether or not simulating pseudo-samples until one is accepted would be more efficient than just running one and discarding it in case it is too far. So, although the audience was not that large (when compared with the other “ABC in…” *and* when considering the 2500+ attendees at NIPS over the week!), it was a great day where I learned a lot, did not have a doze during talks (!), *[and even had an epiphany of sorts at the treadmill when I realised I just had to take longer steps to reach 16km/h without hyperventilating!]* So thanks to my fellow organisers, Neil D Lawrence, Ted Meeds, Max Welling, and Richard Wilkinson for setting the program of that day! And, by the way, where’s the next “ABC in…”?! (Finland, maybe?)

Filed under: Kids, pictures, Running, Statistics, Travel, University life Tagged: ABC, ABC in Montréal, ABC model choice, ABC-MCMC, ABC-SMC, arXiv, Canada, conference, geometric ergodicity, machine learning, Montréal, NIPS 2014, poster, pseudo-data, Québec, snow, treadmill

### broken homes [book review]

**E**ven though this is the fourth volume in the Peter Grant series, I did read it first [due to my leaving volume one in my office in Coventry and coming across this one in an airport bookstore in Düsseldorf], an experiment I do not advise anyone to repeat as it kills some of the magic in *Rivers of London* [renamed *Midnight Riots* on the US market, for an incomprehensible reason!, with the series being recalled *Rivers of London*, but at least they left the genuine and perfect covers…, not like some of the other foreign editions!] and makes reading *Broken homes* an exercise in guessing. *[Note for ‘Og’s readers suffering from Peter Grant fatigue: the next instalment, taking the seemingly compulsory trip Outside!—witness the Bartholomew series—, is waiting for me in Warwick, so I will not read it before the end of January!]*

*“I nodded sagely. `You’re right,’ I said. `We need a control.’*

*`Seriously?’ she asked.*

*`Otherwise, how do you know the variable you’ve changed is the one having the effect?’ I said.”*

Now, despite this inauspicious entry, I did enjoy *Broken homes* as much [almost!] as the other volumes in the series. It mostly takes place in a less familiar [for a French tourist like me] part of London, but remains nonetheless true to its spirit of depicting London as a living organism! There are mostly characters from the earlier novels, but the core of the story is an infamous housing estate built by a mad architect in Elephant and Castle, not that far from Waterloo [Station], but sounding almost like a suburb from Aaronovitch’s depiction! Actually, the author has added a google map for the novel locations on his blog, wish I had it at the time [kind of difficult to get in a plane!].

*“Search as I might, nobody else was offering free [wifi] connections to the good people of Elephant and Castle.”*

The plot itself is centred on this estate [not really a spoiler, is it?] and the end is outstanding in that it is nothing like one would expect. With or without reading the other volumes. I still had trouble understanding the grand scheme of the main villain, while I have now entirely forgotten about the reasons for the crime scene at the very beginning of *Broken homes*. Rereading the pages where the driver, Robert Weil, appears did not help. What was his part in the story?! Despite this [maybe entirely personal] gap, the story holds well together, somewhat cemented by the characters populating the estate, who are endowed with enough depth to make them truly part of the story, even when they last only a few pages [spoiler!]. And as usual style and grammar and humour are at their best!

Filed under: Books, pictures, Travel Tagged: Coventry, Düsseldorf, Elephant and Castle, London, magic, Rivers of London, Thames, Waterloo

### optimal mixture weights in multiple importance sampling

**M**ultiple importance sampling is back!!! I am always interested in this improvement upon regular importance sampling, even or especially after publishing a recent paper about our AMIS (for adaptive multiple importance sampling) algorithm, so I was quite eager to see what was in Hera He’s and Art Owen’s newly arXived paper. The paper is definitely exciting and set me on a new set of importance sampling improvements and experiments…

Some of the most interesting developments in the paper are that, (i) when using a collection of importance functions qi with the same target p, every ratio qi/p is a *control variate* function with expectation 1 [assuming each of the qi’s has a support smaller than the support of p]; (ii) the weights of a mixture of the qi’s can be chosen in an optimal way towards minimising the variance for a certain integrand; (iii) multiple importance sampling incorporates quite naturally stratified sampling, i.e. the qi’s may have disjoint supports; (iv) control variates contribute little, esp. when compared with the optimisation over the weights [which does not surprise me that much, given that the control variates have little correlation with the integrands]; (v) Veach’s (1997) seminal PhD thesis remains a driving force behind those results [and in getting Eric Veach an Academy Oscar in 2014!].
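As a small illustration of points (i) and (ii), here is a sketch of importance sampling from a mixture of proposals. Everything specific in it is an assumption of mine rather than something taken from He and Owen: the target N(0,1), the two Normal proposals, the equal weights, and the integrand (a tail indicator). The ratio in the second output is taken against the sampling mixture rather than against p, since the draws come from the mixture here.

```python
import random
from statistics import NormalDist, fmean

def mixture_is(alpha, proposals, target, f, n=20_000, seed=0):
    """Importance sampling from the mixture sum_i alpha_i q_i of the proposals."""
    rng = random.Random(seed)
    terms, ratios = [], []
    for _ in range(n):
        q = rng.choices(proposals, weights=alpha)[0]   # pick a component
        x = rng.gauss(q.mean, q.stdev)                 # draw from it
        q_mix = sum(a * qi.pdf(x) for a, qi in zip(alpha, proposals))
        terms.append(f(x) * target.pdf(x) / q_mix)     # weighted integrand
        ratios.append(proposals[0].pdf(x) / q_mix)     # candidate control variate
    return fmean(terms), fmean(ratios)

target = NormalDist(0.0, 1.0)
proposals = [NormalDist(0.0, 1.0), NormalDist(3.0, 1.0)]  # hypothetical choices

def tail_indicator(x):
    return 1.0 if x > 3.0 else 0.0

estimate, ratio_mean = mixture_is([0.5, 0.5], proposals, target, tail_indicator)
truth = 1.0 - target.cdf(3.0)
print(estimate, truth, ratio_mean)
```

The average of the ratio should sit close to 1, which is what makes it usable as a control variate, and rerunning with other values of `alpha` gives a crude feel for how much the weights matter for a given integrand.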

One extension that I would find of the uttermost interest deals with unscaled densities, both for p and the qi’s. In that case, the weights do not even sum up to a known value and I wonder at how much more difficult it is to analyse this realistic case. And unscaled densities led me to imagine using geometric mixtures instead. Or even harmonic mixtures! (Maybe not.)

Another one is more in tune with our adaptive multiple mixture paper. The paper works with *regret*, but one could also work with *remorse*! Besides the pun, this means that one could adapt the weights along iterations and even possibly design new importance functions from the past outcome, i.e., be adaptive once again. He and Owen suggest mixing their approach with our adaptive sequential Monte Carlo model.

Filed under: Statistics, University life Tagged: adaptive importance sampling, AMIS, Art Owen, control variates, multiple mixtures, Oscars (Academy Awards), population Monte Carlo

### using mixtures towards Bayes factor approximation

**P**hil O’Neill and Theodore Kypraios from the University of Nottingham have arXived last week a paper on “Bayesian model choice via mixture distributions with application to epidemics and population process models”. Since we discussed this paper during my visit there earlier this year, I was definitely looking forward to the completed version of their work. Especially because there are some superficial similarities with our most recent work on… Bayesian model choice via mixtures! (To the point that I misunderstood at the beginning their proposal for ours…)

The central idea in the paper is that, by considering the mixture likelihood

where *x* corresponds to the entire sample, it is straightforward to relate the moments of α with the Bayes factor, namely

which means that estimating the mixture weight α by MCMC is equivalent to estimating the Bayes factor.
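To see how such a relation can arise, here is a short derivation under assumptions that are mine and may differ from the paper's exact setting: the mixture likelihood is α f₁(x|θ₁) + (1−α) f₂(x|θ₂), the priors on θ₁, θ₂ and α are independent, the prior on α is uniform on (0,1), and mᵢ(x) denotes the marginal likelihood of model i.

```latex
\begin{align*}
\pi(\alpha \mid x) &\propto \alpha\, m_1(x) + (1-\alpha)\, m_2(x),
  \qquad 0 < \alpha < 1, \\
\mathbb{E}[\alpha \mid x]
  &= \frac{\int_0^1 \alpha \{\alpha\, m_1(x) + (1-\alpha)\, m_2(x)\}\, \mathrm{d}\alpha}
          {\int_0^1 \{\alpha\, m_1(x) + (1-\alpha)\, m_2(x)\}\, \mathrm{d}\alpha}
   = \frac{m_1(x)/3 + m_2(x)/6}{\{m_1(x) + m_2(x)\}/2}
   = \frac{2 B_{12} + 1}{3\,(B_{12} + 1)}, \\
B_{12} &= \frac{m_1(x)}{m_2(x)}
   = \frac{3\,\mathbb{E}[\alpha \mid x] - 1}{2 - 3\,\mathbb{E}[\alpha \mid x]} .
\end{align*}
```

Under these assumptions the posterior mean of α can only move between 1/3 and 2/3, which is one way of reading the remark that the whole sample acts as a single datum.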

What puzzled me at first was that the mixture weight is in fine estimated with a single “datapoint”, made of the entire sample. So the posterior distribution on α is hardly different from the prior, since it solely varies by one unit! But I came to realise that this is a numerical tool and that the estimator of α is not meaningful from a statistical viewpoint (thus differing completely from our perspective). This explains why the Beta prior on α can be freely chosen so that the mixing and stability of the Markov chain is improved: This parameter is solely an algorithmic entity.

There are similarities between this approach and the pseudo-prior encompassing perspective of Carlin and Chib (1995), even though the current version does not *require* pseudo-priors, using *true* priors instead. But thinking of weakly informative priors and of the MCMC consequence (see below) leads me to wonder if pseudo-priors would not help in this setting…

Another aspect of the paper that still puzzles me is that the MCMC algorithm mixes at all: indeed, depending on the value of the binary latent variable z, one of the two parameters is updated from the true posterior while the other is updated from the prior. It thus seems unlikely that the value of z would change quickly. Creating a huge imbalance in the prior can counteract this difference, but the same problem occurs once z has moved from 0 to 1 or from 1 to 0. It seems to me that resorting to a common parameter [if possible] and using as a proposal the model-based posteriors for both parameters is the only way out of this conundrum. (We do certainly insist on this common parametrisation in our approach as it is paramount to the use of improper priors.)

*“In contrast, we consider the case where there is only one datum.”*

The idea in the paper is therefore fully computational and relates to other linkage methods that create bridges between two models. It differs from our new notion of Bayesian testing in that we consider estimating the mixture between the two models in comparison, hence considering instead the mixture

which is another model altogether and does not recover the original Bayes factor (Bayes factor that we altogether dismiss in favour of the posterior median of α and its entire distribution).
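For contrast, here is a rough sketch of what estimating the weight of an observation-level mixture can look like. The simplifications are mine: both component densities are fully specified (no unknown parameters), the prior on α is a Beta(a0, a0), and a plain Gibbs sampler alternates between allocations and α. The data, the densities N(0,1) and N(2,1), and the value of a0 are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, fmean

def alpha_posterior_draws(data, f1, f2, a0=0.5, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for alpha in alpha*f1(x_i) + (1-alpha)*f2(x_i), Beta(a0,a0) prior."""
    rng = random.Random(seed)
    dens = [(f1.pdf(x), f2.pdf(x)) for x in data]  # component densities, computed once
    alpha, draws = 0.5, []
    for it in range(n_iter):
        # allocate each observation to component 1 with its posterior probability
        n1 = 0
        for d1, d2 in dens:
            p1 = alpha * d1 / (alpha * d1 + (1.0 - alpha) * d2)
            if rng.random() < p1:
                n1 += 1
        # conjugate Beta update of the weight given the allocations
        alpha = rng.betavariate(a0 + n1, a0 + len(data) - n1)
        if it >= burn_in:
            draws.append(alpha)
    return draws

data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(200)]  # simulated from the first model
draws = alpha_posterior_draws(data, NormalDist(0.0, 1.0), NormalDist(2.0, 1.0))
posterior_mean = fmean(draws)
print(posterior_mean)
```

Here every observation carries information about α, so the posterior on α moves well away from the prior, in contrast with the single-datum construction discussed above.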

Filed under: Statistics, Travel, University life Tagged: Bayes factors, Jean Diebolt, mixtures, Monte Carlo Statistical Methods, reversible jump, testing as mixture estimation, University of Nottingham

### Quasi-Monte Carlo sampling

*“The QMC algorithm forces us to write any simulation as an explicit function of uniform samples.” (p.8)*

**A**s posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper. Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of *Jan 2, 2015*).

Quasi-Monte Carlo methods are definitely *not* popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, since being Read to the Royal Statistical Society is a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).

*“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)*

At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators *and* error rates of order

means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not *always* used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a *deterministic* function of a *fixed* number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step *had to* appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms. For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform

which is the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal

has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain be in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.
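The constraint discussed above, writing each simulation as an explicit deterministic function of uniforms, is easy to illustrate in one dimension. The following sketch is not the SQMC algorithm of the paper: it uses a van der Corput sequence with a single random shift (a basic randomisation, chosen by me for brevity) and the Normal inverse cdf as the deterministic transform, with `to_normal` a hypothetical name.

```python
import random
from statistics import NormalDist, fmean

STD = NormalDist()

def van_der_corput(i, base=2):
    """i-th term of the base-b van der Corput low-discrepancy sequence."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x

def shifted_points(n, seed=0):
    """Randomised sequence: one uniform shift applied to all points, modulo 1."""
    shift = random.Random(seed).random()
    return [(van_der_corput(i) + shift) % 1.0 for i in range(n)]

def to_normal(u):
    """Deterministic transform from a uniform to a N(0,1) draw (inverse cdf)."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against the endpoints
    return STD.inv_cdf(u)

n = 1024
rqmc = fmean(to_normal(u) ** 2 for u in shifted_points(n))   # estimates E[X^2] = 1
rng = random.Random(0)
plain_mc = fmean(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
print(rqmc, plain_mc)
```

When no such closed-form transform exists, for instance when the inverse cdf has to be computed numerically at each step, the extra cost lands in `to_normal`, which is the concern raised above.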

The paper obviously cannot get into every detail, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.

Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?

In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?

What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with Polson et al.’s particle learning literature?

Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?

Filed under: Books, Kids, Statistics, Travel, University life, Wines Tagged: CREST, forward-backward formula, JRSSB, London, MCMC, particle learning, quasi-Monte Carlo methods, Rao-Blackwellisation, Read Pap, reproducing kernel Hilbert space, Royal Statistical Society, SMC, systematic resampling

### off to Montréal [NIPS workshops]

On Thursday, I will travel to Montréal for the two days of NIPS workshops there. On Friday, there is the ABC in Montréal workshop that I cannot but attend! (First occurrence of an “ABC in…” in North America! Sponsored by ISBA as well.) And on Saturday, there is the 3rd NIPS Workshop on Probabilistic Programming where I am invited to give a talk on… ABC! And maybe will manage to get a sneak at the nearby workshop on Advances in variational inference… (On a very personal side, I wonder if the weather will remain warm enough to go running in the early morning.)

Filed under: Statistics, Travel, University life Tagged: ABC in Montréal, ISBA, ISBA@NIPS, Montréal, NIPS, probabilistic programming, variational Bayes methods

### amazonish thanks (& repeated warning)

**A**s in previous years, at about this time, I want to (re)warn unaware ‘Og readers that all links to Amazon.com and more rarely to Amazon.fr found on this blog are actually susceptible to earn me an advertising percentage if a purchase is made by the reader *in the 24 hours following the entry on Amazon through this link*, thanks to the “*Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to amazon.com/fr*”. Unlike last year, I did not benefit as much from the new edition of Andrew’s book, and the link he copied from my blog entry… Here are some of the most Og-unrelated purchases:

- Mr. Beer Deluxe Beer Bottling System
- Kyjen 2518 Dog Life Jacket
- Fisher-Price Learn-to-Flush Potty
- Way Huge Green Rhino
- WWII Helmets and Headgear

Once again, books I reviewed, positively or negatively, were among the top purchases… Like a dozen Monte Carlo simulation and resampling methods for social science, a few copies of Naked Statistics. And again a few of The Cartoon Introduction to Statistics. (Despite a most critical review.) Thanks to all of you using those links and further feeding my book addiction, with the drawback of inducing even more fantasy book reviews.

Filed under: Books, Kids, R, Statistics Tagged: Amazon, amazon associates, book reviews, dog life jacket, Monte Carlo Statistical Methods, Og

### the demise of the Bayes factor

**W**ith Kaniav Kamary, Kerrie Mengersen, and Judith Rousseau, we have just arXived (and submitted) a paper entitled “Testing hypotheses via a mixture model”. (We actually presented some earlier version of this work in Cancún, Vienna, and Gainesville, so you may have heard of it already.) The notion we advocate in this paper is to replace the posterior probability of a model or a hypothesis with the posterior distribution of the weights of a mixture of the models under comparison. That is, given two models under comparison,

$$\mathfrak{M}_1: x \sim f_1(x|\theta_1) \quad\text{versus}\quad \mathfrak{M}_2: x \sim f_2(x|\theta_2),$$

we propose to estimate the (artificial) mixture model

$$\mathfrak{M}_\alpha: x \sim \alpha f_1(x|\theta_1) + (1-\alpha) f_2(x|\theta_2),$$

and in particular derive the posterior distribution of α. One may object that the mixture model is neither of the two models under comparison, but it does reduce to one of them at the boundary, i.e., when α=0 or α=1. Thus, if we use prior distributions on α that favour the neighbourhoods of 0 and 1, we should be able to see the posterior concentrate near 0 or 1, depending on which model is true. And indeed this is the case: for any given Beta prior on α, we observe a higher and higher concentration at the right boundary as the sample size increases. And we establish a convergence result to this effect. Furthermore, the mixture approach offers numerous advantages, among which *[verbatim from the paper]*:

- relying on a Bayesian estimator of the weight α rather than on the posterior probability of the corresponding model does remove the need of overwhelmingly artificial prior probabilities on model indices;
- the interpretation of this estimator is at least as natural as handling the posterior probability, while avoiding the caricaturesque zero-one loss setting. The quantity α and its posterior distribution provide a measure of proximity to both models for the data at hand, while being also interpretable as a propensity of the data to stand with (or to stem from) one of the two models. This representation further allows for alternative perspectives on testing and model choices, through the notions of predictive tools, cross-validation, and information indices like WAIC;
- the highly problematic computation of the marginal likelihoods is bypassed, standard algorithms being available for Bayesian mixture estimation;
- the extension to a finite collection of models to be compared is straightforward, as this simply involves a larger number of components. This approach further allows to consider all models at once rather than engaging in pairwise costly comparisons and thus to eliminate the least likely models by simulation, those being not explored by the corresponding algorithm;
- the (simultaneously conceptual and computational) difficulty of “label switching” that plagues both Bayesian estimation and Bayesian computation for most mixture models completely vanishes in this particular context, since components are no longer exchangeable. In particular, we compute neither a Bayes factor nor a posterior probability related with the substitute mixture model and we hence avoid the difficulty of recovering the modes of the posterior distribution. Our perspective is solely centred on estimating the parameters of a mixture model where both components are always identifiable;
- the posterior distribution of α evaluates more thoroughly the strength of the support for a given model than the single figure outcome of a Bayes factor or of a posterior probability. The variability of the posterior distribution on α allows for a more thorough assessment of the strength of the support of one model against the other;
- an additional feature missing from traditional Bayesian answers is that a mixture model also acknowledges the possibility that, for a finite dataset, *both* models or *none* could be acceptable.
- while standard (proper and informative) prior modelling can be painlessly reproduced in this novel setting, non-informative (improper) priors now are manageable therein, provided both models under comparison are first reparametrised towards common-meaning and shared parameters, as for instance with location and scale parameters. In the special case when all parameters can be made common to both models [While this may sound like an extremely restrictive requirement in a traditional mixture model, let us stress here that the presence of common parameters becomes quite natural within a testing setting. To wit, when comparing two different models for the *same* data, moments are defined in terms of the observed data and hence *should* be the *same* for both models. Reparametrising the models in terms of those common meaning moments does lead to a mixture model with some and maybe *all* common parameters. We thus advise the use of a common parametrisation, whenever possible.] the mixture model reads as $x \sim \alpha f_1(x|\theta) + (1-\alpha) f_2(x|\theta)$. For instance, if θ is a location parameter, a flat prior can be used with no foundational difficulty, in opposition to the testing case;
- continuing from the previous argument, using the *same* parameters or some *identical* parameters on both components is an essential feature of this reformulation of Bayesian testing, as it highlights the fact that the opposition between the two components of the mixture is not an issue of enjoying different parameters, but quite the opposite. As further stressed below, this or even *those* common parameter(s) is (are) nuisance parameters that need be integrated out (as they also are in the traditional Bayesian approach through the computation of the marginal likelihoods);
- the choice of the prior model probabilities is rarely discussed in a classical Bayesian approach, even though those probabilities linearly impact the posterior probabilities and can be argued to promote the alternative of using the Bayes factor instead. In the mixture estimation setting, prior modelling only involves selecting a prior on α, for instance a Beta B(a,a) distribution, with a wide range of acceptable values for the hyperparameter a. While the value of a impacts the posterior distribution of α, it can be argued that (a) it nonetheless leads to an accumulation of the mass near 1 or 0, i.e., to favour the most favourable or the true model over the other one, and (b) a sensitivity analysis on the impact of a is straightforward to carry on;
- in most settings, this approach can furthermore be easily calibrated by a parametric bootstrap experiment providing a posterior distribution of α under each of the models under comparison. The prior predictive error can therefore be directly estimated and can drive the choice of the hyperparameter a, if need be.
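
To make the mixture idea concrete, here is a minimal sketch in Python (with NumPy), not taken from the paper. It assumes two fully specified models, N(0,1) and N(2,1), picked purely for illustration, a Beta(a,a) prior on α with a=0.5, and a standard data-augmentation Gibbs sampler: allocate each observation to one of the two components given α, then draw α from its Beta conditional given the allocations. The function names `normal_pdf` and `gibbs_mixture_weight` are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean):
    """Density of a N(mean, 1) distribution evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)


def gibbs_mixture_weight(x, f1, f2, a=0.5, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for the weight alpha in alpha*f1 + (1-alpha)*f2.

    Both components are assumed fully known here (no theta to update),
    and alpha has a Beta(a, a) prior. Returns the post-burn-in draws.
    """
    rng = np.random.default_rng(seed)
    d1, d2 = f1(x), f2(x)          # component densities at the data, computed once
    n = x.size
    alpha = 0.5                    # arbitrary starting value
    draws = np.empty(n_iter)
    for t in range(n_iter):
        # 1. allocation step: P(observation i comes from f1 | alpha)
        p1 = alpha * d1 / (alpha * d1 + (1.0 - alpha) * d2)
        n1 = int((rng.random(n) < p1).sum())
        # 2. weight step: conjugate Beta update given the allocations
        alpha = rng.beta(a + n1, a + n - n1)
        draws[t] = alpha
    return draws[burn_in:]


# Illustration: data simulated from the first model, N(0,1)
data = np.random.default_rng(1).normal(0.0, 1.0, size=200)
alpha_draws = gibbs_mixture_weight(
    data, lambda v: normal_pdf(v, 0.0), lambda v: normal_pdf(v, 2.0)
)
print("posterior mean of alpha:", alpha_draws.mean())
```

With data simulated from the first component, the draws of α should pile up near 1, and near 0 when the data come from the second one. Re-running with different values of `a` gives a crude version of the sensitivity analysis mentioned in the list above.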

Filed under: Books, Kids, Statistics, Travel, University life Tagged: Bayes factor, Bayesian hypothesis testing, component of a mixture, consistency, hyperparameter, model posterior probabilities, posterior, prior, testing as mixture estimation

### Statistics slides (5)

**H**ere is the fifth and last set of slides for my third year statistics course, trying to introduce Bayesian statistics in the most natural way and hence starting with… Rasmus’ socks and ABC!!! This is an interesting experiment as I have no idea how my students will react. Either they will see the point besides the anecdotal story or they’ll miss it (being quite unhappy so far about the lack of mathematical rigour in my course and exercises…). We only have two weeks left so I am afraid the concept will not have time to seep through!

Filed under: Books, Kids, Statistics, University life Tagged: Bayesian statistics, Don Rubin, HPD region, map, Paris, Université Paris Dauphine

### Whispers underground [book review]

*“Dr. Walid said that normal human variations were wide enough that you’d need samples of hundreds of subjects to test that. Thousands if you wanted a statistically significant answer.*

* Low sample size—one of the reasons why magic and science are hard to reconcile.”*

**T**his is the third volume in the Rivers of London series, brought back from Gainesville, and possibly the least successful (in my opinion). It indeed takes place underground and not only in the Underground and the underground sewers of London. Which is this literary trick that always irks me in fantasy novels, namely the sudden appearance of a massive underground complex with unsuspected societies that are large and evolved enough to reach the Industrial Age. *(Sorry if this is too much of a spoiler!)*

*“It was the various probability calculations that stuffed me—they always do. I’d have been a bad scientist.”*

Not that everything is bad in this novel: I still like the massive infodump about London, the style and humour, the return of PC Lesley trying to get over the (literal) loss of her face, and the appearance of new characters. But the story itself, revolving around a murder investigation, is rather shallow and the (compulsory?) English policeman versus American cop competition is too contrived to be funny. Most of the major plot is hidden from this volume, unless there are clues I missed. (For instance, one death from a previous volume which seemed to get ignored at that time is finally explained here.) Definitely not the book to read on its own, as it still relates to and borrows much from the previous volumes, but presumably one to read nonetheless as the next instalment, *Broken homes*.

Filed under: Books, pictures, Travel Tagged: Gainesville, Isaac Newton, London, PC Peter Grant, Thames, Underground

### nested sampling with a test

**O**n my way back from Warwick, I read through a couple of preprints, including this statistical test for nested sampling algorithms by Johannes Buchner. As it happens, I had already read and commented on it in July! However, without the slightest memory of it (sad, isn’t it?!), I focussed this time much more on the modification proposed to MultiNest than on the test itself, which is in fact a Kolmogorov-Smirnov test applied to a specific target function.

Indeed, when reading the proposed modification of Buchner, I thought of a modification to the modification that sounded more appealing. Without getting back to defining nested sampling in detail, this algorithm follows a swarm of N particles within upper-level sets of the likelihood surface, each step requiring a new simulation above the current value of the likelihood. The remark that set me off this time was that we should exploit the fact that (N-1) particles were already available within this level set. And uniformly distributed therein. Therefore this particle cloud should be exploited as much as possible to return yet another particle distributed just as uniformly as the other ones (!). Buchner proposes an alternative to MultiNest based on a randomised version of the maximal distance to a neighbour and a ball centre picked at random (but not uniformly). But it would be just as feasible to draw a distance from the empirical cdf of the distances to the nearest neighbours or to the k-nearest neighbours. With some possible calibration of k. And somewhat more accurate, because this distribution represents the spread of the particles within the upper-level set. Although I looked at it briefly in the [sluggish] metro from Roissy airport, I could not figure out a way to account for the additional point to be included in the (N-1) existing particles. That is, how to deform the empirical cdf of those distances to account for an additional point. Unless one included the just-removed particle, which is at the boundary of this upper-level set. (Or rather, which defines the boundary of this upper-level set.) I have no clear intuition as to whether or not this would amount to a uniform generation over the true upper-level set. But simulating from the distance distribution would remove (I think) the clustering effect mentioned by Buchner.
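
A rough sketch of what drawing a distance from the empirical cdf of the k-nearest-neighbour distances could look like, in Python with NumPy. Everything in it is an assumption made for illustration: the function names `knn_distances` and `propose_point`, the choice of a centre picked uniformly among the live particles, and the move along a uniformly random direction. It ignores the likelihood constraint altogether and makes no claim that the proposed point is uniform over the upper-level set, which is exactly the question left open above.

```python
import numpy as np


def knn_distances(points, k=1):
    """Distance from each particle to its k-th nearest neighbour (requires k < N)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # full N x N distance matrix
    dist.sort(axis=1)                          # column 0 is the zero self-distance
    return dist[:, k]


def propose_point(points, k=1, rng=None):
    """Propose a new point at a radius resampled from the k-NN distances.

    Drawing from the empirical cdf of the distances amounts to picking
    one of the observed distances uniformly at random.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, dim = points.shape
    radius = rng.choice(knn_distances(points, k))
    centre = points[rng.integers(n)]           # assumption: uniform choice of centre
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)     # uniform direction on the unit sphere
    return centre + radius * direction, radius


# Illustration on particles uniform over the unit cube
live = np.random.default_rng(0).random((50, 3))
new_point, r = propose_point(live, k=2, rng=np.random.default_rng(1))
```

In an actual nested sampling step, the proposal would still have to be checked against the current likelihood bound and rejected if it falls outside the level set.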

*“Other priors can be mapped [into the uniform prior over the unit hypercube] using the inverse of the cumulative prior distribution.”*

Hence another illustration of the addictive features of nested sampling! Each time I get back to this notion, a new understanding or reinterpretation comes to mind. In any case, an equally endless source of projects for Master students. (Not that I agree with the above quote, mind you!)
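
For reference, the map in the above quote is straightforward in the univariate case where the inverse cdf is available in closed form. A minimal sketch in Python with NumPy, assuming an Exponential prior with a hypothetical `rate` parameter chosen for illustration (nothing here comes from Buchner's paper or from MultiNest):

```python
import numpy as np


def exponential_from_unit(u, rate=1.0):
    """Map u in [0, 1) to an Exponential(rate) draw via the inverse cdf.

    The cdf is F(x) = 1 - exp(-rate * x), hence F^{-1}(u) = -log(1 - u) / rate.
    """
    return -np.log1p(-u) / rate


# Uniform draws on the unit interval, pushed through the inverse cdf
u = np.random.default_rng(0).random(100_000)
theta = exponential_from_unit(u, rate=2.0)
print("sample mean:", theta.mean(), "to compare with 1/rate =", 0.5)
```

This only covers the easy case of a closed-form univariate inverse cdf.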

Filed under: Books, pictures, Statistics, Travel, University life Tagged: hypercube, Kolmogorov-Smirnov distance, Multinest, nested sampling, uniform distribution, University of Warwick