Another paper addressing the estimation of the normalising constant and the wealth of available solutions just came out on arXiv, with the full title of “Target density normalization for Markov chain Monte Carlo algorithms“, written by Allen Caldwell and Chang Liu. (I became aware of it by courtesy of Ewan Cameron, as it appeared in the physics section of arXiv. It is actually a wee bit annoying that papers in the subcategory “Data Analysis, Statistics and Probability” of physics do not get an automated reposting on the statistics lists…)
In this paper, the authors compare three approaches to the problem of finding
when the density f is unormalised, i.e., in more formal terms, when f is proportional to a probability density (and available):
- an “arithmetic mean”, which is an importance sampler based on (a) reducing the integration volume to a neighbourhood ω of the global mode. This neighbourhood is chosen as an hypercube and the importance function turns out to be the uniform over this hypercube. The corresponding estimator is then a rescaled version of the average of f over uniform simulations in ω.
- an “harmonic mean”, of all choices!, with again an integration over the neighbourhood ω of the global mode in order to avoid the almost sure infinite variance of harmonic mean estimators.
- a Laplace approximation, using the target at the mode and the Hessian at the mode as well.
The paper then goes to comparing those three solutions on a few examples, demonstrating how the diameter of the hypercube can be calibrated towards a minimum (estimated) uncertainty. The rather anticlimactic conclusion is that the arithmetic mean is the most reliable solution as harmonic means may fail in larger dimension and more importantly fail to signal its failure, while Laplace approximations only approximate well quasi-Gaussian densities…
What I find most interesting in this paper is the idea of using only one part of the integration space to compute the integral, even though it is not exactly new. Focussing on a specific region ω has pros and cons, the pros being that the reduction to a modal region reduces needs for absolute MCMC convergence and helps in selecting alternative proposals and also prevents from the worst consequences of using a dreaded harmonic mean, the cons being that the region needs be well-identified, which means requirements on the MCMC kernel, and that the estimate is a product of two estimates, the frequency being driven by a Binomial noise. I also like very much the idea of calibrating the diameter Δof the hypercube ex-post by estimating the uncertainty.
As an aside, the paper mentions most of the alternative solutions I just presented in my Monte Carlo graduate course two days ago (like nested or bridge or Rao-Blackwellised sampling, including our proposal with Darren Wraith), but dismisses them as not “directly applicable in an MCMC setting”, i.e., without modifying this setting. I unsurprisingly dispute this labelling, both because something like the Laplace approximation requires extra-work on the MCMC output (and once done this work can lead to advanced Laplace methods like INLA) and because other methods could be considered as well (for instance, bridge sampling over several hypercubes). As shown in the recent paper by Mathieu Gerber and Nicolas Chopin (soon to be discussed at the RSS!), MCqMC has also become a feasible alternative that would compete well with the methods studied in this paper.
Overall, this is a paper that comes in a long list of papers on constant approximations. I do not find the Markov chain of MCMC aspect particularly compelling or specific, once the effective sample size is accounted for. It would be nice to find generic ways of optimising the visit to the hypercube ω and to estimate efficiently the weight of ω. The comparison is solely run over examples, but they all rely on a proper characterisation of the hypercube and the ability to simulate efficiently f over that hypercube.
Filed under: Statistics, University life Tagged: harmonic mean estimator, importance sampling, Laplace approximation, MCMC, MCQMC, Monte Carlo Statistical Methods, normalising constant, Rao-Blackwellisation, untractable normalizing constant
Ten days ago, Gersende Fort, Benjamin Jourdain, Tony Lelièvre, and Gabriel Stoltz arXived a study about an adaptive umbrella sampler that can be re-interpreted as a Wang-Landau algorithm, if not the most efficient version of the latter. This reminded me very much of the workshop we had all together in Edinburgh last June. And even more of the focus of the molecular dynamics talks in this same ICMS workshop about accelerating the MCMC exploration of multimodal targets. The self-healing aspect of the sampler is to adapt to the multimodal structure thanks to a partition that defines a biased sampling scheme spending time in each set of the partition in a frequency proportional to weights. While the optimal weights are the weights of the sets against the target distribution (are they truly optimal?! I would have thought lifting low density regions, i.e., marshes, could improve the mixing of the chain for a given proposal), those are unknown and they need to be estimated by an adaptive scheme that makes staying in a given set the less desirable the more one has visited it. By increasing the inverse weight of a given set by a factor each time it is visited. Which sounds indeed like Wang-Landau. The plus side of the self-healing umbrella sampler is that it only depends on a scale γ (and on the partition). Besides converging to the right weights of course. The downside is that it does not reach the most efficient convergence, since the adaptivity weight decreases in 1/n rather than 1/√n.
Note that the paper contains a massive experimental side where the authors checked the impact of various parameters by Monte Carlo studies of estimators involving more than a billion iterations. Apparently repeated a large number of times.
The next step in adaptivity should be about the adaptive determination of the partition, hoping for a robustness against the dimension of the space. Which may be unreachable if I judge by the apparent deceleration of the method when the number of terms in the partition increases.
Filed under: Kids, pictures, Statistics, University life Tagged: acceleration of MCMC algorithms, adaptive MCMC methods, Monte Carlo experiment, multimodality, Tintin, umbrella sampling, Wang-Landau algorithm, well-tempered algorithm
There is an open call of the Fondation Sciences Mathématiques de Paris (FSMP) about a postdoctoral funding program with 18 position-years available for staying in Université Paris-Dauphine (and other participating universities). The net support is quite decent (wrt French terms and academic salaries) and the application form easy to fill. So, if you are interested in coming to Paris to work on ABC, MCMC, Bayesian model choice, &tc., feel free to contact me (or another Parisian statistician) and to apply! The deadline is December 01, 2014. And the decision will be made by January 15, 2015. The starting date for the postdoc is October 01, 2015.
Filed under: Kids, Statistics, Travel, University life Tagged: ABC, Bayesian model choice, Fondation Sciences Mathématiques de Paris, MCMC, Monte Carlo Statistical Methods, Paris, postdoctoral position, Université Paris Dauphine
I read this arXived paper of Goodman, Lin and Morzfeld on a (long) train trip to the North of Paris, paper that deals with asymptotics of importance sampling, but I still have trouble about its impact on statistical (Bayesian) problems. The paper is indeed mostly about asymptotics of symmetrized versions of importance sampling, but there is a noise parameter ε that does not make sense [to me] when I try to apply it to statistical problems…
The first symmetrization sounds like a special case of the generic group operator of Kong et al., 2003, in their Read Paper. Namely that if one takes advantage in potential symmetries in the importance distribution to multiply the number of proposed values and integrates this choice into a symmetrized importance distribution, hence using a sort of Rao-Blackwellisation, the effective sample size gets better… Beside symmetry, another transform of importance sampling proposals is rescaling so that the importance density at the proposal equates the target density at the rescaled version. (Those notions of symmetry and importance proposal require some specific knowledge about the mode and Hessian of the target.) This version is called a random map, even though I have trouble spotting the randomness in the transform. The third notion in this paper is the small noise version when a small ε is introduced along with the rescaled target
which is used to evaluate the efficiency of those different transforms as ε goes to zero. Symmetrized versions improve the effective sample size by a power of ε. Namely from ε to ε² . But I still fail to see why this matters in the original problem when ε=1.
Filed under: Books, Statistics, University life
Information about social entities is often spread across multiple large databases, each degraded by noise, and without unique identifiers shared across databases.Entity resolution—reconstructing the actual entities and their attributes—is essential to using big data and is challenging not only for inference but also for computation.
In this talk, I motivate entity resolution by the current conflict in Syria. It has been tremendously well documented, however, we still do not know how many people have been killed from conflict-related violence. We describe a novel approach towards estimating death counts in Syria and challenges that are unique to this database. We first introduce computational speed-ups to avoid all-to-all record comparisons based upon locality-sensitive hashing from the computer science literature. We then introduce a novel approach to entity resolution by discovering a bipartite graph, which links manifest records to a common set of latent entities. Our model quantifies the uncertainty in the inference and propagates this uncertainty into subsequent analyses. Finally, we speak to the success and challenges of solving a problem that is at the forefront of national headlines and news.
This is joint work with Rob Hall (Etsy), Steve Fienberg (CMU), and Anshu Shrivastava (Cornell University).
[Note that Rebecca will visit the maths department in Paris-Dauphine for two weeks and give a short course in our data science Master on data confidentiality, privacy and statistical disclosure (syllabus).]
Filed under: Books, Statistics, University life Tagged: Carnegie Mellon University, CEREMADE, course, data science, MASH, privacy, PSL, Rebecca Steorts, seminar, Syria, Université Paris Dauphine
“But is the existence of God just a philosophical question, like, say, the definition of knowledge or the existence of Plato’s forms?” Gary Gutting, NYT
Although I stopped following The Stone‘s interviews of philosophers about their views on religion, six more took place and Gary Gutting has now closed the series he started a while ago with a self-interview. On this occasion, I went quickly through the last interviews, which had the same variability in depth and appeal as the earlier ones. A lot of them were somewhat misplaced in trying to understand or justify the reasons for believing in a god (a.k.a., God), which sounds more appropriate for a psychology or sociology perspective. I presume that what I was expecting from the series was more a “science vs. religion” debate, rather than entries into the metaphysics of various religions…
“Marin Mersenne, Descartes’s close friend and sponsor, and an important mathematician and scientific thinker in his own right, claimed that there were over 50,000 atheists living in Paris in 1623. In his massive commentary on Genesis, where he advanced that claim, he offered 35 arguments for the existence of God. For Mersenne and his contemporaries, the idea of the atheist was terrifying.” Daniel Garber, NYT.
For instance, Daniel Garber quoted above discusses why he remains an atheist while being “convinced that [he] should want to believe” but can’t. That is, he cannot find convincing reasons to believe. And states that following Pascal’s wager argument would be self-deception. The whole thing sounds more like psychology than philosophy. [Incidentally correcting my long-going mistake of writing Mersenne as Meresme!]
“The existence of God is not just any philosophical issue. It’s intimately tied up with what very many in our society feel gives meaning to their lives (…) many are subject to often subtle, but also often powerful, pressure from their religious groups to feel and to act and even to try to be certain of their position. This no doubt creates special dangers. But it also seems that a life of religious faith can lead us to special values we can’t find elsewhere. At any rate, this too is a philosophical issue. In light of all that, I would not want to make any blanket pronouncement (…) that the most reasonable stance on the existence of God is to stay on the sidelines.” Keith DeRose, NYT
Another argument outside philosophy in that it invokes psychology at the individual level and sociology at the societal level. So this contradicts the statement it is a philosophical issue, doesn’t it? This interview of Keith DeRose mentions a quote from Kant: “I had to deny knowledge in order to make room for faith” that I find particularly relevant, as the existence or non existence of God cannot be considered as knowledge in the usual sense but a belief in the former case (with all sorts of causes, as discussed throughout most interviews) and a disbelief in the latter case, albeit supported by rational scepticism that there is not the slightest hint that a deity could exist. (This seems to be the whole point of the interview, which mostly conveys uncertainty and goes round in circles.)
“…karma is more like what Kant called a postulate of practical reason, something one does well to believe in and act according to (for Kant, belief in God was a practical postulate of this sort).” Jonardon Ganeri, NYT
Two more entries from the series are by religious philosophers, illustrating the difficulty of the exercise by engaging into a non-philosophical entry about their religion (Islam and Hinduism) and/or a comparison with other religions. Just like earlier for Buddhism, and the Jewish religion (if not Christianity, which anyway appears as a reference religion in most other interviews). A third interview adopts an unconvincing relativist approach to religion versus science, arguing that science cannot explain everything and that the fact that religion itself is a by-product of evolution does not give a reason for its dismissal. Michael Ruse however dismisses Dawkins’ arguments as “first-year undergraduate philosophy”, which I find a bit short an argument. Just like mentioning Nazis, as a supposedly definitive “argument from evil”.
Filed under: Books, Kids Tagged: atheism, Blaise Pascal, evolution, Mersenne, Philosophy of religions, psychology, religions, Richard Dawkins, sociology, The New York Times, The Stone
Filed under: Kids, pictures, Travel Tagged: Halloween, Massachusset, Roger Conant, Salem, statue, USA, vacations, witchery
A referee of our paper on approximating evidence for mixture model with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.
Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms
the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.
This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instant, “component #1″ would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble to visit all those modes in a symmetric manner, despite the symmetry being guaranteed from the shape of the posterior. And we conclude with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1″.
And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, both from simulation and inference perspectives. Rodriguez’s and Walker’s paper provides a survey on the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired from k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.
“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”
One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity
which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seem to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)
does not hold when the two permutations σ and τ give different images of zi…
Filed under: Books, Statistics, University life Tagged: component of a mixture, convergence, finite mixtures, identifiability, ill-posed problem, invariance, label switching, loss function, MCMC algorithms, missing data, multimodality, relabelling
Our paper about evaluating statistics used for ABC model choice has just appeared in Series B! It somewhat paradoxical that it comes out just a few days after we submitted our paper on using random forests for Bayesian model choice, thus bypassing the need for selecting those summary statistics by incorporating all statistics available and letting the trees automatically rank those statistics in term of their discriminating power. Nonetheless, this paper remains an exciting piece of work (!) as it addresses the more general and pressing question of the validity of running a Bayesian analysis with only part of the information contained in the data. Quite usefull in my (biased) opinion when considering the emergence of approximate inference already discussed on this ‘Og…
[As a trivial aside, I had first used fresh from the press(es) as the bracketted comment, before I realised the meaning was not necessarily the same in English and in French.]
Filed under: Books, Statistics, University life Tagged: ABC model choice, Approximate Bayesian computation, JRSSB, Royal Statistical Society, Series B, statistical methodology, summary statistics
An email from one of my Master students who sent his problem sheet (taken from Monte Carlo Statistical Methods) late:
Je « suis » votre cours du mercredi dont le formalisme mathématique me fait froid partout
Avec beaucoup de difficulté je vous envoie mes exercices du premier chapitre de votre livre.
which translates as
Good evening Professor,
I “follow” your Wednesday class which mathematical formalism makes me cold all over. With much hardship, I send you the first batch of problems from your book.
I know that winter is coming, but, still, making students shudder from mathematical cold is not my primary goal when teaching Monte Carlo methods!
Filed under: Books, Kids, Statistics, University life Tagged: computational statistics, ENSAE, Master program, MCMC algorithms, Monte Carlo Statistical Methods, statistical computing, Université Paris Dauphine, Winter is coming
After a somewhat prolonged labour (!), we have at last completed our paper on ABC model choice with random forests and submitted it to PNAS for possible publication. While the paper is entirely methodological, the primary domain of application of ABC model choice methods remains population genetics and the diffusion of this new methodology to the users is thus more likely via a media like PNAS than via a machine learning or statistics journal.
When compared with our recent update of the arXived paper, there is not much different in contents, as it is mostly an issue of fitting the PNAS publication canons. (Which makes the paper less readable in the posted version [in my opinion!] as it needs to fit the main document within the compulsory six pages, relegated part of the experiments and of the explanations to the Supplementary Information section.)
Filed under: pictures, R, Statistics, University life Tagged: 1000 Genomes Project, ABC, ABC model choice, machine learning, model posterior probabilities, posterior predictive, random forests, summary statistics
While I was in Warwick, Dan Simpson [newly arrived from Norway on a postdoc position] mentioned to me he had attended a talk by Aki Vehtari in Norway where my early work with Jérôme Dupuis on projective priors was used. He gave me the link to this paper by Peltola, Havulinna, Salomaa and Vehtari that indeed refers to the idea that a prior on a given Euclidean space defines priors by projections on all subspaces, despite the zero measure of all those subspaces. (This notion first appeared in a joint paper with my friend Costas Goutis, who alas died in a diving accident a few months later.) The projection further allowed for a simple expression of the Kullback-Leibler deviance between the corresponding models and for a Pythagorean theorem on the additivity of the deviances between embedded models. The weakest spot of this approach of ours was, in my opinion and unsurprisingly, about deciding when a submodel was too far from the full model. The lack of explanatory power introduced therein had no absolute scale and later discussions led me to think that the bound should depend on the sample size to ensure consistency. (The recent paper by Nott and Leng that was expanding on this projection has now appeared in CSDA.)
“Specifically, the models with subsets of covariates are found by maximizing the similarity of their predictions to this reference as proposed by Dupuis and Robert . Notably, this approach does not require specifying priors for the submodels and one can instead focus on building a good reference model. Dupuis and Robert (2003) suggest choosing the size of the covariate subset based on an acceptable loss of explanatory power compared to the reference model. We examine using cross-validation based estimates of predictive performance as an alternative.” T. Peltola et al.
The paper also connects with the Bayesian Lasso literature, concluding on the horseshoe prior being more informative than the Laplace prior. It applies the selection approach to identify biomarkers with predictive performances in a study of diabetic patients. The authors rank model according to their (log) predictive density at the observed data, using cross-validation to avoid exploiting the data twice. On the MCMC front, the paper implements the NUTS version of HMC with STAN.
Filed under: Mountains, pictures, Statistics, Travel, University life Tagged: Aki Vehtari, Bayesian lasso, Dan Simpson, embedded models, Hamiltonian Monte Carlo, horseshoe prior, Kullback-Leibler divergence, MCMC, Norway, NUTS, predictive power, prior projection, STAN, variable selection, zero measure set
After several clones of our SAME algorithm appeared in the literature, it is rather fun to see another paper acknowledging the connection. SAME but different was arXived today by Zhao, Jiang and Canny. The point of this short paper is to show that the parallel implementation of SAME leads to efficient performances compared with existing standards. Since the duplicated latent variables are independent [given θ] they can be simulated in parallel. They further assume independence between the components of those latent variables. And finite support. As in document analysis. So they can sample the replicated latent variables all at once. Parallelism is thus used solely for the components of the latent variable(s). SAME is normally associated with an annealing schedule but the authors could not detect an improvement over a fixed and large number of replications. They reported gains comparable to state-of-the-art variational Bayes on two large datasets. Quite fun to see SAME getting a new life thanks to computer scientists!
Filed under: Statistics, University life Tagged: data cloning, document analysis, map, Monte Carlo Statistical Methods, parallel MCMC, SAME, simulated annealing, simulation, stochastic optimisation, variational Bayes methods
The editors of a new blog entitled Marauders of the Lost Sciences (Learn from the giants) sent me an email to signal the start of this blog with a short excerpt from a giant in maths or stats posted every day:There is a new blog I wanted to tell you about which excerpts one interesting or classic paper or book a day from the mathematical sciences. We plan on daily posting across the range of mathematical fields and at any level, but about 20-30% of the posts in queue are from statistics. The goal is to entice people to read the great works of old. The first post today was from an old paper by Fisher applying Group Theory to the design of experiments.
Interesting concept, which will hopefully generate comments to put the quoted passage into context. Somewhat connected to my Reading Statistical Classics posts. Which incidentally if sadly will not take place this year since only two students registered. should take place in the end since more students registered! (I am unsure about the references behind the title of that blog, besides Spielberg’s Raiders of the Lost Ark and Norman’s Marauders of Gor… I just hope Statistics does not qualify as a lost science!)
Filed under: Books, Statistics, University life Tagged: blogging, classics, graduate course, marauders of Gor, R.A. Fisher, Raiders of the Lost Ark, reading list
Yet another book I grabbed on impulse while in Birmingham last month. And which had been waiting for me on a shelf of my office in Warwick. Another buy I do not regret! Rivers of London is delightful, as much for taking place in all corners of London as for the story itself. Not mentioning the highly enjoyable writing style!
“I though you were a sceptic, said Lesley. I though you were scientific”
The first volume in this detective+magic series, Rivers of London, sets the universe of this mix of traditional Metropolitan Police work and of urban magic, the title being about the deities of the rivers of London, including a Mother and a Father Thames… I usually dislike any story mixing modern life and fantasy but this is a definitive exception! What I enjoy in this book setting is primarily the language used in the book that is so uniquely English (to the point of having the U.S. edition edited!, if the author’s blog is to be believed). And the fact that it is so much about London, its history and inhabitants. But mostly about London, as an entity on its own. Even though my experience of London is limited to a few boroughs, there are many passages where I can relate to the location and this obviously makes the story much more appealing. The style is witty, ironic and full of understatements, a true pleasure.
“The tube is a good place for this sort of conceptual breakthrough because, unless you’ve got something to read, there’s bugger all else to do.”
The story itself is rather fun, with at least three levels of plots and two types of magic. It centres around two freshly hired London constables, one of them discovering magical abilities and been drafted to the supernatural section of the Metropolitan Police. And making all the monologues in the book. The supernatural section is made of a single Inspector, plus a few side characters, but with enough fancy details to give it life. In particular, Isaac Newton is credited with having started the section, called The Folly. Which is also the name of Ben Aaronovitch’s webpage.
“There was a poster (…) that said: `Keep Calm and Carry On’, which I thought was good advice.”
This quote is unvoluntarily funny in that it takes place in a cellar holding material from World War II. Except that the now invasive red and white poster was never distributed during the war… On the opposite it was pulped to save paper and the fact that a few copies survived is a sort of (minor) miracle. Hence a double anachronism in that it did not belong to a WWII room and that Peter Grant should have seen its modern avatars all over London.
“Have you ever been to London? Don’t worry, it’s basically just like the country. Only with more people.”
The last part of the book is darker and feels less well-written, maybe simply because of the darker side and of the accumulation of events, while the central character gets rather too central and too much of an unexpected hero that saves the day. There is in particular a part where he seems to forget about his friend Lesley who is in deep trouble at the time and this does not seem to make much sense. But, except for this lapse (maybe due to my quick reading of the book over the week in Warwick), the flow and pace are great, with this constant undertone of satire and wit from the central character. I am definitely looking forward reading tomes 2 and 3 in the series (having already read tome 4 in Austria!, which was a mistake as there were spoilers about earlier volumes).
Filed under: Books, Kids, Travel Tagged: Ben Aaronnovitch, book review, cockney slang, ghosts, Isaac Newton, Keep calm posters, London, magics, Metropolitan Police, Peter Grant series, Thames, Warwick
Yesterday, Rasmus Bååth [of puppies' fame!] posted a very nice blog using ABC to derive the posterior distribution of the total number of socks in the laundry when only pulling out orphan socks and no pair at all in the first eleven draws. Maybe not the most pressing issue for Bayesian inference in the era of Big data but still a challenge of sorts!
Rasmus set a prior on the total number m of socks, a negative Binomial Neg(15,1/3) distribution, and another prior of the proportion of socks that come by pairs, a Beta B(15,2) distribution, then simulated pseudo-data by picking eleven socks at random, and at last applied ABC (in Rubin’s 1984 sense) by waiting for the observed event, i.e. only orphans and no pair [of socks]. Brilliant!
The overall simplicity of the problem set me wondering about an alternative solution using the likelihood. Cannot be that hard, can it?! After a few computations rejected by opposing them to experimental frequencies, I put the problem on hold until I was back home and with access to my Feller volume 1, one of the few [math] books I keep at home… As I was convinced one of the exercises in Chapter II would cover this case. After checking, I found a partial solution, namely Exercice 26:
A closet contains n pairs of shoes. If 2r shoes are chosen at random (with 2r<n), what is the probability that there will be (a) no complete pair, (b) exactly one complete pair, (c) exactly two complete pairs among them?
This is not exactly a solution, but rather a problem, however it leads to the value
as the probability of obtaining j pairs among those 2r shoes. Which also works for an odd number t of shoes:
as I checked against my large simulations. So I solved Exercise 26 in Feller volume 1 (!), but not Rasmus’ problem, since there are those orphan socks on top of the pairs. If one draws 11 socks out of m socks made of f orphans and g pairs, with f+2g=m, the number k of socks from the orphan group is an hypergeometric H(11,m,f) rv and the probability to observe 11 orphan socks total (either from the orphan or from the paired groups) is thus the marginal over all possible values of k:
so it could be argued that we are facing a closed-form likelihood problem. Even though it presumably took me longer to achieve this formula than for Rasmus to run his exact ABC code!
Filed under: Books, Kids, R, Statistics, University life Tagged: ABC, capture-recapture, combinatorics, subjective prior, William Feller
Yesterday night, just before leaving for Coventry, I realised I had about 30 versions of my “mother of all .bib” bib file, spread over directories and with broken links with the original mother file… (I mean, I always create bib files in new directories by a hard link,ln ~/mother.bib
but they eventually and inexplicably end up with a life of their own!) So I decided a Spring clean-up was in order and installed BibTool on my Linux machine to gather all those versions into a new encompassing all-inclusive bib reference. I did not take advantage of the many possibilities of the program, written by Gerd Neugebauer, but it certainly solved my problem: once I realised I had to set the variatescheck.double = on check.double.delete = on pass.comments = off
all I had to do was to callbibtool -s -i ../*/*.bib -o mother.bib bibtool -d -i mother.bib -o mother.bib bibtool -s -i mother.bib -o mother.bib
to merge all bib file and then to get rid of the duplicated entries in mother.bib (the -d option commented out the duplicates and the second call with -s removed them). And to remove the duplicated definitions in the preamble of the file. This took me very little time in the RER train from Paris-Dauphine (where I taught this morning, having a hard time to make the students envision the empirical cdf as an average of Dirac masses!) to Roissy airport, in contrast with my pedestrian replacement of all stray siblings of the mother bib into new proper hard links, one by one. I am sure there is a bash command that could have done it in one line, but I spent instead my flight to Birmingham switching all existing bib files, one by one…
Filed under: Books, Linux, Travel, University life Tagged: bash, BibTeX, BibTool, Birmingham, Charles de Gaulle, LaTeX, link, Linux, RER B, Roissy, University of Warwick
In a comment on our Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching paper, Philip commented that he had experimented with an alternative splitting technique retaining the right stationary measure: the idea behind his alternative acceleration is again (a) to divide the target into bits and (b) run the acceptance step by parts, towards a major reduction in computing time. The difference with our approach is to represent the overall acceptance probability
and, even more surprisingly than in our case, this representation remains associated with the right (posterior) target!!! Provided the ordering of the terms is random with a symmetric distribution on the permutation. This property can be directly checked via the detailed balance condition.
In a toy example, I compared the acceptance rates (acrat) for our delayed solution (letabin.R), for this alternative (letamin.R), and for a non-delayed reference (letabaz.R), when considering more and more fractured decompositions of a Bernoulli likelihood.> system.time(source("letabin.R")) user system elapsed 225.918 0.444 227.200 > acrat  0.3195 0.2424 0.2154 0.1917 0.1305 0.0958 > system.time(source("letamin.R")) user system elapsed 340.677 0.512 345.389 > acrat  0.4045 0.4138 0.4194 0.4003 0.3998 0.4145 > system.time(source("letabaz.R")) user system elapsed 49.271 0.080 49.862 > acrat  0.6078 0.6068 0.6103 0.6086 0.6040 0.6158
A very interesting outcome since the acceptance rate does not change with the number of terms in the decomposition for the alternative delayed acceptance method… Even though it logically takes longer than our solution. However, the drawback is that detailed balance implies picking the order at random, hence loosing on the gain in computing the cheap terms first. If reversibility could be bypassed, then this alternative would definitely get very appealing!
Filed under: Books, Kids, Statistics, University life Tagged: acceleration of MCMC algorithms, delayed acceptance, detailed balance, MCMC, Monte Carlo Statistical Methods, reversibility, simulation
This new arXival by Chris Oates, Mark Girolami, and Nicolas Chopin (warning: they all are colleagues & friends of mine!, at least until they read those comments…) is a variation on control variates, but with a surprising twist namely that the inclusion of a control variate functional may produce a sub-root-n (i.e., faster than √n) convergence rate in the resulting estimator. Surprising as I did not know one could get to sub-root-n rates..! Now I had forgotten that Anne Philippe and I used the score in an earlier paper of ours, as a control variate for Riemann sum approximations, with faster convergence rates, but this is indeed a new twist, in particular because it produces an unbiased estimator.
The control variate writes
where π is the target density and φ is a free function to be optimised. (Under the constraint that πφ is integrable. Then the expectation of ψφ is indeed zero.) The “explanation” for the sub-root-n behaviour is that ψφ is chosen as an L2 regression. When looking at the sub-root-n convergence proof, the explanation is more of a Rao-Blackwellisation type, assuming a first level convergent (or presistent) approximation to the integrand [of the above form ψφ can be found. The optimal φ is the solution of a differential equation that needs estimating and the paper concentrates on approximating strategies. This connects with Antonietta Mira’s zero variance control variates, but in a non-parametric manner, adopting a Gaussian process as the prior on the unknown φ. And this is where the huge innovation in the paper resides, I think, i.e. in assuming a Gaussian process prior on the control functional and in managing to preserve unbiasedness. As in many of its implementations, modelling by Gaussian processes offers nice features, like ψφ being itself a Gaussian process. Except that it cannot be shown to lead to presistency on a theoretical basis. Even though it appears to hold in the examples of the paper. Apart from this theoretical difficulty, the potential hardship with the method seems to be in the implementation, as there are several parameters and functionals to be calibrated, hence calling for cross-validation which may often be time-consuming. The gains are humongous, so the method should be adopted whenever the added cost in implementing it is reasonable, cost which evaluation is not clearly provided by the paper. In the toy Gaussian example where everything can be computed, I am surprised at the relatively poor performance of a Riemann sum approximation to the integral, wondering at the level of quadrature involved therein. The paper also interestingly connects with O’Hagan’s (1991) Bayes-Hermite [polynomials] quadrature and quasi-Monte Carlo [obviously!].
Filed under: Books, Statistics, University life Tagged: control variate, convergence rate, Gaussian processes, Monte Carlo Statistical Methods, simulation, University of Warwick