A referee of our paper with Jeong Eun Lee on approximating the evidence for mixture models pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below, or above my radar.
Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms,

f(x|ω,θ) = ω₁ f(x|θ₁) + … + ωₖ f(x|θₖ),
the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.
This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instance, “component #1” would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and that Markov chains would therefore have trouble visiting all those modes in a symmetric manner, despite the symmetry being guaranteed by the shape of the posterior. And we concluded with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.
And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of a definitive answer, from both simulation and inference perspectives. Rodríguez and Walker's paper provides a survey of label switching strategies in the Bayesian processing of mixtures, but its innovative part lies in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired by k-means clustering. Which is connected with both Stephens' and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens' Kullback-Leibler loss approach.
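For readers curious about what relabelling means in practice, here is a quick and purely illustrative Python sketch of the generic idea (mine, not the authors' algorithm): permute the components of each MCMC draw to match a running reference by least squares, which conveys the k-means flavour of the loss.

```python
import itertools

def relabel(draws):
    """Greedy relabelling: permute the components of each draw to match a
    running reference, here the mean of the already-relabelled draws."""
    ref = list(draws[0])
    out = [list(draws[0])]
    for draw in draws[1:]:
        # permutation minimising the squared distance to the reference
        best = min(itertools.permutations(draw),
                   key=lambda p: sum((a - b) ** 2 for a, b in zip(p, ref)))
        out.append(list(best))
        n = len(out)
        ref = [r + (b - r) / n for r, b in zip(ref, best)]  # running mean
    return out

# label-switched draws of two component means, the modes being near (0, 4)
draws = [(0.1, 3.9), (4.1, -0.1), (3.8, 0.2), (0.0, 4.2)]
print(relabel(draws))  # every draw relabelled with component #1 near 0
```

The exhaustive search over permutations is obviously only workable for small k; the paper's point is precisely to do this minimisation efficiently.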
“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”
One section of the paper puzzles me, although it does not impact the methodology or the conclusions. In Section 2.1 (p.27), the authors consider the quantity P(zᵢ = j | x₁,…,xₙ),
which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later, in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seems to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)
does not hold when the two permutations σ and τ give different images of zi…
Filed under: Books, Statistics, University life Tagged: component of a mixture, convergence, finite mixtures, identifiability, ill-posed problem, invariance, label switching, loss function, MCMC algorithms, missing data, multimodality, relabelling
Our paper about evaluating statistics used for ABC model choice has just appeared in Series B! It is somewhat paradoxical that it comes out just a few days after we submitted our paper on using random forests for Bayesian model choice, thus bypassing the need for selecting those summary statistics by incorporating all statistics available and letting the trees automatically rank those statistics in terms of their discriminating power. Nonetheless, this paper remains an exciting piece of work (!), as it addresses the more general and pressing question of the validity of running a Bayesian analysis with only part of the information contained in the data. Quite useful in my (biased) opinion when considering the emergence of approximate inference already discussed on this ‘Og…
[As a trivial aside, I had first used fresh from the press(es) as the bracketed comment, before I realised the meaning was not necessarily the same in English and in French.]
Filed under: Books, Statistics, University life Tagged: ABC model choice, Approximate Bayesian computation, JRSSB, Royal Statistical Society, Series B, statistical methodology, summary statistics
An email from one of my Master students who sent his problem sheet (taken from Monte Carlo Statistical Methods) late:
Je « suis » votre cours du mercredi dont le formalisme mathématique me fait froid partout
Avec beaucoup de difficulté je vous envoie mes exercices du premier chapitre de votre livre.
which translates as
Good evening Professor,
I “follow” your Wednesday class, whose mathematical formalism makes me cold all over. With much hardship, I send you the first batch of problems from your book.
I know that winter is coming, but, still, making students shudder from mathematical cold is not my primary goal when teaching Monte Carlo methods!
Filed under: Books, Kids, Statistics, University life Tagged: computational statistics, ENSAE, Master program, MCMC algorithms, Monte Carlo Statistical Methods, statistical computing, Université Paris Dauphine, Winter is coming
After a somewhat prolonged labour (!), we have at last completed our paper on ABC model choice with random forests and submitted it to PNAS for possible publication. While the paper is entirely methodological, the primary domain of application of ABC model choice methods remains population genetics, and the diffusion of this new methodology to its users is thus more likely via a medium like PNAS than via a machine learning or statistics journal.
When compared with our recent update of the arXived paper, there is not much difference in contents, as it is mostly an issue of fitting the PNAS publication canons. (Which makes the paper less readable in the posted version [in my opinion!], as fitting the main document within the compulsory six pages relegated part of the experiments and of the explanations to the Supplementary Information section.)
Filed under: pictures, R, Statistics, University life Tagged: 1000 Genomes Project, ABC, ABC model choice, machine learning, model posterior probabilities, posterior predictive, random forests, summary statistics
While I was in Warwick, Dan Simpson [newly arrived from Norway on a postdoc position] mentioned to me he had attended a talk by Aki Vehtari in Norway where my early work with Jérôme Dupuis on projective priors was used. He gave me the link to this paper by Peltola, Havulinna, Salomaa and Vehtari that indeed refers to the idea that a prior on a given Euclidean space defines priors by projections on all subspaces, despite the zero measure of all those subspaces. (This notion first appeared in a joint paper with my friend Costas Goutis, who alas died in a diving accident a few months later.) The projection further allowed for a simple expression of the Kullback-Leibler deviance between the corresponding models and for a Pythagorean theorem on the additivity of the deviances between embedded models. The weakest spot of this approach of ours was, in my opinion and unsurprisingly, about deciding when a submodel was too far from the full model. The lack of explanatory power introduced therein had no absolute scale and later discussions led me to think that the bound should depend on the sample size to ensure consistency. (The recent paper by Nott and Leng that was expanding on this projection has now appeared in CSDA.)
“Specifically, the models with subsets of covariates are found by maximizing the similarity of their predictions to this reference as proposed by Dupuis and Robert (2003). Notably, this approach does not require specifying priors for the submodels and one can instead focus on building a good reference model. Dupuis and Robert (2003) suggest choosing the size of the covariate subset based on an acceptable loss of explanatory power compared to the reference model. We examine using cross-validation based estimates of predictive performance as an alternative.” T. Peltola et al.
The paper also connects with the Bayesian Lasso literature, concluding that the horseshoe prior is more informative than the Laplace prior. It applies the selection approach to identify biomarkers with predictive performances in a study of diabetic patients. The authors rank models according to their (log) predictive density at the observed data, using cross-validation to avoid exploiting the data twice. On the MCMC front, the paper implements the NUTS version of HMC with Stan.
Filed under: Mountains, pictures, Statistics, Travel, University life Tagged: Aki Vehtari, Bayesian lasso, Dan Simpson, embedded models, Hamiltonian Monte Carlo, horseshoe prior, Kullback-Leibler divergence, MCMC, Norway, NUTS, predictive power, prior projection, STAN, variable selection, zero measure set
After several clones of our SAME algorithm appeared in the literature, it is rather fun to see another paper acknowledging the connection. SAME but different was arXived today by Zhao, Jiang and Canny. The point of this short paper is to show that the parallel implementation of SAME leads to efficient performances compared with existing standards. Since the duplicated latent variables are independent [given θ] they can be simulated in parallel. They further assume independence between the components of those latent variables. And finite support. As in document analysis. So they can sample the replicated latent variables all at once. Parallelism is thus used solely for the components of the latent variable(s). SAME is normally associated with an annealing schedule but the authors could not detect an improvement over a fixed and large number of replications. They reported gains comparable to state-of-the-art variational Bayes on two large datasets. Quite fun to see SAME getting a new life thanks to computer scientists!
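For the record, SAME targets the posterior raised to a power κ by augmenting with κ independent copies of the latent variables; here is a toy (and entirely hypothetical) Python rendering of the idea for a two-component Gaussian mixture with unit variances, equal weights, a flat prior on the means, and a fixed number of replications, as in the setting the authors retained:

```python
import math, random

random.seed(1)

def same_map(x, kappa=20, iters=100):
    """SAME sketch: MAP of the two component means (unit variances, equal
    weights, flat prior) via kappa independent replicas of the allocations."""
    mu = [min(x), max(x)]
    for _ in range(iters):
        n, s = [0.0, 0.0], [0.0, 0.0]
        for _ in range(kappa):              # the kappa replicated latent vectors
            for xi in x:
                w0 = math.exp(-0.5 * (xi - mu[0]) ** 2)
                w1 = math.exp(-0.5 * (xi - mu[1]) ** 2)
                j = 0 if random.random() < w0 / (w0 + w1) else 1
                n[j] += 1
                s[j] += xi
        # conditional on all replicas, mu_j ~ N(s_j/n_j, 1/n_j): the extra
        # replicas shrink the variance, concentrating the chain at the MAP
        mu = [random.gauss(s[j] / max(n[j], 1.0), 1.0 / math.sqrt(max(n[j], 1.0)))
              for j in range(2)]
    return mu

x = [random.gauss(0.0, 1.0) for _ in range(50)] + \
    [random.gauss(5.0, 1.0) for _ in range(50)]
print(same_map(x))  # close to the true means 0 and 5
```

In the paper's setting the replicas are further independent across components and of finite support, hence trivially parallelisable; the loop over replicas above is exactly what gets farmed out.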
Filed under: Statistics, University life Tagged: data cloning, document analysis, map, Monte Carlo Statistical Methods, parallel MCMC, SAME, simulated annealing, simulation, stochastic optimisation, variational Bayes methods
The editors of a new blog entitled Marauders of the Lost Sciences (Learn from the giants) sent me an email to signal the start of this blog, with a short excerpt from a giant in maths or stats posted every day:

“There is a new blog I wanted to tell you about which excerpts one interesting or classic paper or book a day from the mathematical sciences. We plan on daily posting across the range of mathematical fields and at any level, but about 20-30% of the posts in queue are from statistics. The goal is to entice people to read the great works of old. The first post today was from an old paper by Fisher applying Group Theory to the design of experiments.”
Interesting concept, which will hopefully generate comments to put the quoted passage into context. Somewhat connected to my Reading Statistical Classics posts. Which incidentally should take place in the end since more students registered! (I am unsure about the references behind the title of that blog, besides Spielberg’s Raiders of the Lost Ark and Norman’s Marauders of Gor… I just hope Statistics does not qualify as a lost science!)
Filed under: Books, Statistics, University life Tagged: blogging, classics, graduate course, marauders of Gor, R.A. Fisher, Raiders of the Lost Ark, reading list
Yet another book I grabbed on impulse while in Birmingham last month. And which had been waiting for me on a shelf of my office in Warwick. Another buy I do not regret! Rivers of London is delightful, as much for taking place in all corners of London as for the story itself. Not mentioning the highly enjoyable writing style!
“I thought you were a sceptic,” said Lesley. “I thought you were scientific.”
The first volume in this detective+magic series, Rivers of London, sets the universe of this mix of traditional Metropolitan Police work and of urban magic, the title being about the deities of the rivers of London, including a Mother and a Father Thames… I usually dislike any story mixing modern life and fantasy but this is a definite exception! What I enjoy primarily in this book is the language, so uniquely English (to the point of having the U.S. edition edited!, if the author’s blog is to be believed). And the fact that it is so much about London, its history and inhabitants. But mostly about London, as an entity on its own. Even though my experience of London is limited to a few boroughs, there are many passages where I can relate to the location and this obviously makes the story much more appealing. The style is witty, ironic and full of understatements, a true pleasure.
“The tube is a good place for this sort of conceptual breakthrough because, unless you’ve got something to read, there’s bugger all else to do.”
The story itself is rather fun, with at least three levels of plots and two types of magic. It centres around two freshly hired London constables, one of them discovering magical abilities and being drafted into the supernatural section of the Metropolitan Police. And delivering all the monologues in the book. The supernatural section is made of a single Inspector, plus a few side characters, but with enough fancy details to give it life. In particular, Isaac Newton is credited with having started the section, called The Folly. Which is also the name of Ben Aaronovitch’s webpage.
“There was a poster (…) that said: `Keep Calm and Carry On’, which I thought was good advice.”
This quote is involuntarily funny in that it takes place in a cellar holding material from World War II. Except that the now invasive red and white poster was never distributed during the war… On the contrary, it was pulped to save paper, and the fact that a few copies survived is a sort of (minor) miracle. Hence a double anachronism, in that the poster did not belong in a WWII room and that Peter Grant should have seen its modern avatars all over London.
“Have you ever been to London? Don’t worry, it’s basically just like the country. Only with more people.”
The last part of the book is darker and feels less well-written, maybe simply because of the darker side and of the accumulation of events, while the central character gets rather too central and too much of an unexpected hero saving the day. There is in particular a part where he seems to forget about his friend Lesley, who is in deep trouble at the time, and this does not make much sense. But, except for this lapse (maybe due to my quick reading of the book over the week in Warwick), the flow and pace are great, with this constant undertone of satire and wit from the central character. I am definitely looking forward to reading tomes 2 and 3 in the series (having already read tome 4 in Austria!, which was a mistake as there were spoilers about earlier volumes).
Filed under: Books, Kids, Travel Tagged: Ben Aaronnovitch, book review, cockney slang, ghosts, Isaac Newton, Keep calm posters, London, magics, Metropolitan Police, Peter Grant series, Thames, Warwick
Yesterday, Rasmus Bååth [of puppies’ fame!] posted a very nice blog entry using ABC to derive the posterior distribution of the total number of socks in the laundry when only pulling out orphan socks and no pair at all in the first eleven draws. Maybe not the most pressing issue for Bayesian inference in the era of Big data, but still a challenge of sorts!
Rasmus set a prior on the total number m of socks, a negative Binomial Neg(15,1/3) distribution, and another prior on the proportion of socks that come in pairs, a Beta B(15,2) distribution, then simulated pseudo-data by picking eleven socks at random, and at last applied ABC (in Rubin’s 1984 sense) by waiting for the observed event, i.e. only orphans and no pair [of socks]. Brilliant!
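Rasmus' rejection sampler is easily reproduced; here is a sketch of my own (in Python rather than R, and presumably differing from his code in details such as the rounding of the number of pairs):

```python
import random

random.seed(42)

def rneg_binomial(size, p):
    """number of failures before the size-th success (success probability p)"""
    fails, succ = 0, 0
    while succ < size:
        if random.random() < p:
            succ += 1
        else:
            fails += 1
    return fails

def sim_socks(n_draws=11):
    """one ABC proposal: simulate a laundry, draw 11 socks, accept if no pair"""
    m = rneg_binomial(15, 1 / 3)           # prior on the total number of socks
    prop = random.betavariate(15, 2)       # prior on the proportion of paired socks
    n_pairs = min(round(m * prop / 2), m // 2)
    n_odd = m - 2 * n_pairs
    socks = list(range(n_pairs)) * 2 + list(range(n_pairs, n_pairs + n_odd))
    if len(socks) < n_draws:
        return m, False
    draw = random.sample(socks, n_draws)
    return m, len(set(draw)) == n_draws    # no pair iff all labels distinct

posterior = [m for m, ok in (sim_socks() for _ in range(20000)) if ok]
print(len(posterior), sum(posterior) / len(posterior))  # accepted draws, posterior mean
```

The acceptance rate is low but far from prohibitive, which is what makes Rubin's exact-matching version of ABC workable here.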
The overall simplicity of the problem set me wondering about an alternative solution using the likelihood. Cannot be that hard, can it?! After a few computations, rejected by checking them against experimental frequencies, I put the problem on hold until I was back home with access to my Feller volume 1, one of the few [math] books I keep at home… As I was convinced one of the exercises in Chapter II would cover this case. After checking, I found a partial solution, namely Exercise 26:
A closet contains n pairs of shoes. If 2r shoes are chosen at random (with 2r<n), what is the probability that there will be (a) no complete pair, (b) exactly one complete pair, (c) exactly two complete pairs among them?
This is not exactly a solution, but rather a problem; however, it leads to the value

C(n,j) C(n−j, 2r−2j) 2^{2r−2j} / C(2n, 2r)
as the probability of obtaining j pairs among those 2r shoes. Which also works for an odd number t of shoes:

C(n,j) C(n−j, t−2j) 2^{t−2j} / C(2n, t)
as I checked against my large simulations. So I solved Exercise 26 in Feller volume 1 (!), but not Rasmus’ problem, since there are those orphan socks on top of the pairs. If one draws 11 socks out of m socks made of f orphans and g pairs, with f+2g=m, the number k of socks from the orphan group is a hypergeometric H(11,m,f) rv and the probability to observe 11 orphan socks total (either from the orphan or from the paired groups) is thus the marginal over all possible values of k:

Σₖ C(f,k) C(g, 11−k) 2^{11−k} / C(m, 11)
so it could be argued that we are facing a closed-form likelihood problem. Even though it presumably took me longer to achieve this formula than for Rasmus to run his exact ABC code!
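The closed-form marginal is easily checked against brute force; here is a small Python verification (the names f and g for the numbers of orphans and of pairs being mine):

```python
import math, random

def p_no_pair(f, g, n):
    """P(no complete pair among n socks drawn from f orphans plus g pairs),
    marginalising over k, the number of socks coming from the orphan group"""
    m = f + 2 * g
    total = 0
    for k in range(max(0, n - g), min(f, n) + 1):
        # k orphans, then n-k socks taken from n-k distinct pairs (2 choices each)
        total += math.comb(f, k) * math.comb(g, n - k) * 2 ** (n - k)
    return total / math.comb(m, n)

def mc_no_pair(f, g, n, reps=100000):
    """brute-force check by direct simulation of the draws"""
    random.seed(0)
    socks = list(range(g)) * 2 + list(range(g, g + f))
    hits = sum(len(set(random.sample(socks, n))) == n for _ in range(reps))
    return hits / reps

print(p_no_pair(3, 21, 11), mc_no_pair(3, 21, 11))  # closed form vs simulation
```

With f=0 the formula reduces to Feller's case (a), no complete pair among 2r shoes from n pairs.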
Filed under: Books, Kids, R, Statistics, University life Tagged: ABC, capture-recapture, combinatorics, subjective prior, William Feller
Yesterday night, just before leaving for Coventry, I realised I had about 30 versions of my “mother of all .bib” bib file, spread over directories and with broken links with the original mother file… (I mean, I always create bib files in new directories by a hard link,

ln ~/mother.bib
but they eventually and inexplicably end up with a life of their own!) So I decided a Spring clean-up was in order and installed BibTool on my Linux machine to gather all those versions into a new encompassing all-inclusive bib reference. I did not take advantage of the many possibilities of the program, written by Gerd Neugebauer, but it certainly solved my problem: once I realised I had to set the variables

check.double = on
check.double.delete = on
pass.comments = off
all I had to do was to call

bibtool -s -i ../*/*.bib -o mother.bib
bibtool -d -i mother.bib -o mother.bib
bibtool -s -i mother.bib -o mother.bib
to merge all bib files and then to get rid of the duplicated entries in mother.bib (the -d option commented out the duplicates and the second call with -s removed them). And to remove the duplicated definitions in the preamble of the file. This took me very little time in the RER train from Paris-Dauphine (where I taught this morning, having a hard time making the students envision the empirical cdf as an average of Dirac masses!) to Roissy airport, in contrast with my pedestrian replacement of all stray siblings of the mother bib with new proper hard links, one by one. I am sure there is a bash command that could have done it in one line, but I spent my flight to Birmingham instead switching all existing bib files, one by one…
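For the record, here is a candidate for that one-line command, based on find and ln -f, sketched on a throwaway directory rather than on the real mother file (to be tried on a copy first!):

```shell
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b"
echo '@book{feller1968,}' > "$tmp/mother.bib"
echo 'stray copy' > "$tmp/a/mother.bib"        # a diverged stray sibling
cp "$tmp/mother.bib" "$tmp/b/mother.bib"       # another stray
# replace every stray sibling by a hard link to the mother file
find "$tmp" -mindepth 2 -name mother.bib -exec ln -f "$tmp/mother.bib" {} \;
[ "$tmp/a/mother.bib" -ef "$tmp/mother.bib" ] && echo 'all strays relinked'
```

The -ef test checks that two paths now share the same inode, i.e. that the stray has indeed become a hard link to the mother.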
Filed under: Books, Linux, Travel, University life Tagged: bash, BibTeX, BibTool, Birmingham, Charles de Gaulle, LaTeX, link, Linux, RER B, Roissy, University of Warwick
In a comment on our Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching paper, Philip commented that he had experimented with an alternative splitting technique retaining the right stationary measure: the idea behind his alternative acceleration is again (a) to divide the target into bits and (b) to run the acceptance step by parts, towards a major reduction in computing time. The difference with our approach is to represent the overall acceptance probability as a product of terms,

α(x,y) = ∏ᵢ min{1, ρᵢ(x,y)},

where the ρᵢ’s multiply to the usual Metropolis–Hastings acceptance ratio,
and, even more surprisingly than in our case, this representation remains associated with the right (posterior) target!!! Provided the ordering of the terms is random with a symmetric distribution on the permutation. This property can be directly checked via the detailed balance condition.
In a toy example, I compared the acceptance rates (acrat) for our delayed solution (letabin.R), for this alternative (letamin.R), and for a non-delayed reference (letabaz.R), when considering more and more fractured decompositions of a Bernoulli likelihood.

> system.time(source("letabin.R"))
   user  system elapsed
225.918   0.444 227.200
> acrat
0.3195 0.2424 0.2154 0.1917 0.1305 0.0958
> system.time(source("letamin.R"))
   user  system elapsed
340.677   0.512 345.389
> acrat
0.4045 0.4138 0.4194 0.4003 0.3998 0.4145
> system.time(source("letabaz.R"))
   user  system elapsed
 49.271   0.080  49.862
> acrat
0.6078 0.6068 0.6103 0.6086 0.6040 0.6158
A very interesting outcome, since the acceptance rate does not change with the number of terms in the decomposition for the alternative delayed acceptance method… Even though it logically takes longer than our solution. However, the drawback is that detailed balance implies picking the order at random, hence losing the gain from computing the cheap terms first. If reversibility could be bypassed, then this alternative would definitely get very appealing!
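For the curious, here is a Python toy version of the alternative (my own sketch, with per-observation factors of a Gaussian likelihood and a random symmetric ordering of the terms, stopping at the first rejected factor):

```python
import math, random

random.seed(3)
data = [random.gauss(2.0, 1.0) for _ in range(20)]

def rho(i, cur, prop):
    """per-observation factor of the MH ratio (Gaussian likelihood, flat prior);
    the product of the rho's over i is the usual Metropolis-Hastings ratio"""
    return math.exp(0.5 * ((data[i] - cur) ** 2 - (data[i] - prop) ** 2))

def step(cur, scale=0.5):
    prop = cur + random.gauss(0.0, scale)
    order = list(range(len(data)))
    random.shuffle(order)        # symmetric random ordering of the terms
    for i in order:
        if random.random() >= min(1.0, rho(i, cur, prop)):
            return cur           # reject as soon as one factor fails
    return prop                  # overall acceptance: product of the min(1, rho_i)

theta, chain = 0.0, []
for _ in range(5000):
    theta = step(theta)
    chain.append(theta)
print(sum(chain[1000:]) / 4000)  # close to the sample mean of the data
```

The early exit is where the computing gain lies, since a rejection on a cheap factor spares the evaluation of the remaining ones; but, as said above, the random ordering prevents systematically putting the cheap factors first.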
Filed under: Books, Kids, Statistics, University life Tagged: acceleration of MCMC algorithms, delayed acceptance, detailed balance, MCMC, Monte Carlo Statistical Methods, reversibility, simulation
This new arXival by Chris Oates, Mark Girolami, and Nicolas Chopin (warning: they all are colleagues & friends of mine!, at least until they read those comments…) is a variation on control variates, but with a surprising twist namely that the inclusion of a control variate functional may produce a sub-root-n (i.e., faster than √n) convergence rate in the resulting estimator. Surprising as I did not know one could get to sub-root-n rates..! Now I had forgotten that Anne Philippe and I used the score in an earlier paper of ours, as a control variate for Riemann sum approximations, with faster convergence rates, but this is indeed a new twist, in particular because it produces an unbiased estimator.
The control variate writes
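Presumably, the missing display is the score-based (Stein) form that such control functionals take, reconstructed here for a vector-valued φ:

```latex
\psi_{\varphi}(x) \;=\; \nabla_x\!\cdot\varphi(x) \;+\; \varphi(x)\cdot\nabla_x\log\pi(x),
\qquad
\int \psi_{\varphi}(x)\,\pi(x)\,\mathrm{d}x
\;=\;\int \nabla_x\!\cdot\bigl(\varphi(x)\,\pi(x)\bigr)\,\mathrm{d}x \;=\; 0,
```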
where π is the target density and φ is a free function to be optimised. (Under the constraint that πφ is integrable, the expectation of ψφ is indeed zero.) The “explanation” for the sub-root-n behaviour is that ψφ is chosen as an L2 regression. When looking at the sub-root-n convergence proof, the explanation is more of a Rao-Blackwellisation type, assuming a first-level convergent (or presistent) approximation to the integrand of the above form ψφ can be found. The optimal φ is the solution of a differential equation that needs estimating and the paper concentrates on approximating strategies. This connects with Antonietta Mira’s zero variance control variates, but in a non-parametric manner, adopting a Gaussian process as the prior on the unknown φ. And this is where the huge innovation in the paper resides, I think, i.e. in assuming a Gaussian process prior on the control functional and in managing to preserve unbiasedness. As in many of its implementations, modelling by Gaussian processes offers nice features, like ψφ being itself a Gaussian process. Except that it cannot be shown to lead to presistency on a theoretical basis. Even though it appears to hold in the examples of the paper. Apart from this theoretical difficulty, the potential hardship with the method seems to be in the implementation, as there are several parameters and functionals to be calibrated, hence calling for cross-validation, which may often be time-consuming. The gains are humongous, so the method should be adopted whenever the added cost of implementing it is reasonable, a cost whose evaluation is not clearly provided by the paper. In the toy Gaussian example where everything can be computed, I am surprised at the relatively poor performance of a Riemann sum approximation to the integral, wondering at the level of quadrature involved therein. The paper also interestingly connects with O’Hagan’s (1991) Bayes-Hermite [polynomials] quadrature and quasi-Monte Carlo [obviously!].
Filed under: Books, Statistics, University life Tagged: control variate, convergence rate, Gaussian processes, Monte Carlo Statistical Methods, simulation, University of Warwick
Filed under: Travel, Wines Tagged: carignan, Faugères, grenache, Languedoc, mourvèdre, red wine, Syrah
Taking advantage of his visit to Paris this month, Shravan Vasishth, from the University of Potsdam, Germany, will give a talk at 10.30am, next Friday, October 24, at ENSAE on:
Using Bayesian Linear Mixed Models in Psycholinguistics: Some open issues
With the arrival of the probabilistic programming language Stan (and JAGS), it has become relatively easy to fit fairly complex Bayesian linear mixed models. Until now, the main tool that was available in R was lme4. I will talk about how we have fit these models in recently published work (Husain et al 2014, Hofmeister and Vasishth 2014). We are trying to develop a standard approach for fitting these models so that graduate students with minimal training in statistics can fit such models using Stan.
I will discuss some open issues that arose in the course of fitting linear mixed models. In particular, one issue is: should one assume a full variance-covariance matrix for random effects even when there is not enough data to estimate all parameters? In lme4, one often gets convergence failure or degenerate variance-covariance matrices in such cases and so one has to back off to a simpler model. But in Stan it is possible to assume vague priors on each parameter, and fit a full variance-covariance matrix for random effects. The advantage of doing this is that we faithfully express in the model how the data were generated—if there is not enough data to estimate the parameters, the posterior distribution will be dominated by the prior, and if there is enough data, we should get reasonable estimates for each parameter. Currently we fit full variance-covariance matrices, but we have been criticized for doing this. The criticism is that one should not try to fit such models when there is not enough data to estimate parameters. This position is very reasonable when using lme4; but in the Bayesian setting it does not seem to matter.
Filed under: Books, Statistics, University life Tagged: Bayesian linear mixed models., Bayesian modelling, JAGS, linear mixed models, lme4, prior domination, psycholinguistics, STAN, Universität Potsdam
This past week in Warwick has been quite enjoyable and profitable, from staying once again in a math house, to taking advantage of the new bike, to having several long discussions on several prospective and exciting projects, to meeting with some of the new postdocs and visitors, to attending Tony O’Hagan’s talk on “wrong models”. And then having Simo Särkkä, who was visiting Warwick this week, discussing his paper with me. And Chris Oates doing the same with his recent arXival with Mark Girolami and Nicolas Chopin (soon to be commented, of course!). And managing to run in dry conditions despite the heavy rains (but in pitch dark as sunrise is now quite late, with the help of a headlamp and the beauty of a countryside starry sky). I also evaluated several students’ projects, two of which led me to wonder when using RJMCMC was appropriate in comparing two models. In addition, I also escaped one evening to visit old (1977!) friends in Northern Birmingham, despite fairly dire London Midlands performances between Coventry and Birmingham New Street, the only redeeming feature being that the connecting train there was also late by one hour! (Not mentioning the weirdest taxi-driver ever on my way back, trying to get my opinion on whether or not he should have an affair… which at least kept me awake the whole trip!) Definitely looking forward to my next trip there at the end of November.
Filed under: Books, Kids, Running, Statistics, University life Tagged: Birmingham, control variate, Coventry, English train, goose, London Midlands, Mark Girolami, Nicolas Chopin, particle MCMC, simulation model, taxi-driver, Tony O'Hagan, University of Warwick
A very refreshing email from a PhD candidate from abroad:
“Franchement j’ai pas lu encore vos papiers en détails, mais j’apprécie vos axes de recherche et j’aimerai bien en faire autant avec votre collaboration, bien sûr. Actuellement, je suis à la recherche d’un sujet de thèse et c’est pour cela que je vous écris. Je suis prêt à négocier sur tout point et de tout coté.”
[Frankly, I have not yet read your papers in detail, but I appreciate your research areas and I would love to do the same with your help, of course. Currently, I am looking for a thesis topic and this is why I write to you. I am willing to negotiate on any point and on any side.]
Filed under: Kids, Statistics, University life Tagged: foreign students, PhD s, PhD topic
Approximate Bayesian computation techniques are the 2000s’ successors of MCMC methods, handling new models where MCMC algorithms are at a loss, in the same way the latter were able in the 1990s to cover models that regular Monte Carlo approaches could not reach. While they first sounded like “quick-and-dirty” solutions, only to be considered until more elaborate solutions could (not) be found, they have been progressively incorporated within the statistician’s toolbox as a novel form of non-parametric inference handling partly defined models. A statistically relevant feature of those ABC methods is that they require replacing the data with smaller-dimension summaries or statistics, because of the complexity of the former. In almost every case where calling upon ABC is the unique solution, those summaries are not sufficient and the method thus implies a loss of statistical information, at least at a formal level, since relying on the raw data is out of the question. This forced reduction of statistical information raises many relevant questions, from the choice of summary statistics to the consistency of the ensuing inference.
In this paper of the special MCMSki 4 issue of Statistics and Computing, Stoehr et al. attack the recurrent problem of selecting summary statistics for ABC in a hidden Markov random field, since there is no fixed dimension sufficient statistics in that case. The paper provides a very broad overview of the issues and difficulties related with ABC model choice, which has been the focus of some advanced research only for a few years. Most interestingly, the authors define a novel, local, and somewhat Bayesian misclassification rate, an error that is conditional on the observed value and derived from the ABC reference table. It is the posterior predictive error rate
integrating over both the model index m and the corresponding random variable Y (and the hidden intermediary parameter), given the observation. Or rather given the transform of the observation by the summary statistic S. The authors even go further to define the error rate of a classification rule based on a first (collection of) statistic, conditional on a second (collection of) statistic (see Definition 1). A notion rather delicate to validate on a fully Bayesian basis. And they advocate the substitution of the unreliable (estimates of the) posterior probabilities by this local error rate, estimated by traditional non-parametric kernel methods. Methods that are calibrated by cross-validation. Given a reference summary statistic, this perspective leads (at least in theory) to selecting the optimal summary statistic as the one leading to the minimal local error rate. Besides its application to hidden Markov random fields, which is of interest per se, this paper thus opens a new vista on calibrating ABC methods and evaluating their true performance conditional on the actual data. (The advocated abandonment of the posterior probabilities could almost justify the denomination of a paradigm shift. This is also the approach advocated in our random forest paper.)
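To make the local error rate concrete, here is a toy sketch of the kernel-estimation idea on a one-dimensional summary (the k-NN classifier, the Gaussian kernel, and all bandwidth and sample-size choices are mine, standing in for Stoehr et al.’s exact construction and cross-validated calibration):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical ABC reference table: model index m_i and a 1-D summary s_i,
# drawn from two toy models whose summaries differ in mean
n = 5000
m = rng.integers(0, 2, size=n)
s = rng.normal(np.where(m == 0, -1.0, 1.0), 1.5)

def knn_classify(s0, k=50):
    """Toy plug-in model-choice rule: majority vote among the k nearest summaries."""
    idx = np.argpartition(np.abs(s - s0), k)[:k]
    return int(m[idx].mean() > 0.5)

# misclassification indicator for each entry of the reference table
miss = np.array([knn_classify(si) != mi for si, mi in zip(s, m)], dtype=float)

def local_error(s_obs, h=0.3):
    """Nadaraya-Watson kernel estimate of the local error rate at s_obs."""
    w = np.exp(-0.5 * ((s - s_obs) / h) ** 2)    # Gaussian kernel weights
    return float(w @ miss / w.sum())

print(local_error(0.0), local_error(3.0))
```

As expected, the estimated error is large near the overlap of the two models (summaries around zero) and small where one model dominates, which is exactly the conditional-on-the-data information the posterior predictive error rate is meant to convey.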
Filed under: Books, Kids, Statistics, University life Tagged: ABC, arXiv, Cross Validation, Gibbs random field, hidden Markov models, Markov random field, Monte Carlo Statistical Methods, paradigm shift, Pierre Pudlo, predictive loss, simulation, summary statistics
This paper by Weixuan Zhu, Juan Miguel Marín [from Carlos III in Madrid, not to be confused with Jean-Michel Marin, from Montpellier!], and Fabrizio Leisen proposes an alternative to our 2013 PNAS paper with Kerrie Mengersen and Pierre Pudlo on empirical likelihood ABC, or BCel. The alternative is based on Davison, Hinkley and Worton’s (1992) bootstrap likelihood, which relies on a double bootstrap to produce a non-parametric estimate of the distribution of a given estimator of the parameter θ. Including a smooth curve-fitting step, for which little description is available in the paper.
“…in contrast with the empirical likelihood method, the bootstrap likelihood doesn’t require any set of subjective constrains taking advantage from the bootstrap methodology. This makes the algorithm an automatic and reliable procedure where only a few parameters need to be specified.”
The spirit is indeed quite similar to ours in that a non-parametric substitute plays the role of the actual likelihood, with no correction for the substitution. Both approaches are convergent, with similar or identical convergence speeds. While the empirical likelihood relies on a choice of parameter identifying constraints, the bootstrap version starts directly from the [subjectively] chosen estimator of θ. For it indeed needs to be chosen. And computed.
“Another benefit of using the bootstrap likelihood (…) is that the construction of bootstrap likelihood could be done once and not at every iteration as the empirical likelihood. This leads to significant improvement in the computing time when different priors are compared.”
This is an improvement that could apply to the empirical likelihood approach as well, once a large enough collection of likelihood values has been gathered. But only in small enough dimensions for smooth curve-fitting algorithms to operate. The same criticism applies to the derivation of a non-parametric density estimate for the distribution of the estimator of θ. Critically, the paper only processes examples with a few parameters.
In the comparisons between BCel and BCbl produced in the paper, the advantage indeed goes to BCbl. Since this paper is mostly based on examples and illustrations, not unlike ours, I would like to see more details on the calibration of the non-parametric methods and of regular ABC, as well as on the computing time. And on the variability of both methods over more than a single Monte Carlo experiment.
I am however uncertain as to how the authors process the population genetics example. They refer to the composite likelihood used in our paper to set the moment equations. Since this is not the true likelihood, how do the authors select their parameter estimates in the double-bootstrap experiment? The inclusion of Crakel’s and Flegal’s (2013) bivariate Beta is somewhat superfluous, as this example sounds to me like an artificial setting.
In the case of the Ising model, maybe the pre-processing step in our paper with Matt Moores could be compared with the other algorithms. In terms of BCbl, how does the bootstrap operate on an Ising model, i.e., (a) how does one subsample pixels and (b) what are the validity guarantees?
A test that would be of interest is to start from a standard ABC solution and use this solution as the reference estimator of θ, then apply BCbl to that estimator. Given that the reference table would have to be produced only once, this would not necessarily increase the computational cost by a large amount…
Filed under: Books, R, Statistics, University life Tagged: ABC, ABCel, bivariate Beta distribution, bootstrap, bootstrap likelihood, double bootstrap, empirical likelihood, Ising model, population genetics