## Bayesian News Feeds

### BAYSM ’14 in Wien, Sep. 18-19

**I**t all started in Jim Berger’s basement, drinking with the utmost reverence an otherworldly Turley Zinfandel during the great party Ann and Jim Berger hosted for the O’Bayes’13 workshop at Duke. I then mentioned to Angela Bitto and Alexandra Posekany, from WU Wien, that I was going to be in Austria next September for a seminar in Linz, at the Johannes Kepler Universität, and, as it happens to take place the day before BAYSM ’14, the second conference of young Bayesian statisticians, run in connection with the j-ISBA section, they most kindly invited me to the meeting! As a senior Bayesian, most obviously! This is quite exciting, all the more because I have never visited Vienna before. (Unlike other parts of Austria, like the Großglockner, where I briefly met Peter Habeler. *Trivia: the cover picture of the ‘Og is actually taken from the Großglockner.*)

Filed under: Kids, Mountains, pictures, Statistics, Travel, University life, Wines Tagged: Austria, BAYSM, Duke, Großglockner, j-ISBA, Johannes Kepler Universität, Linz, O-Bayes 2013, Studlgrat, Wien, WU Wien

### talk in Orsay (message in a beetle)

**Y**esterday (March 27), I gave a seminar at Paris-Sud University, Orsay, in the stats department, on ABC model choice. It was an opportunity to talk about recent advances we have made with Jean-Michel Marin and Pierre Pudlo on using machine-learning devices to improve ABC. (More to come soon!) And to chat with Gilles Celeux about machine learning and classification. Actually, given that one of my examples was about the Asian lady beetle invasion and that the buildings of Paris-Sud University have suffered from this invasion, I should have advertised the talk under the catchier title of “message in a beetle”…

**T**his seminar was also an opportunity to experiment with mixed transportation. Indeed, since I had some errands to run in Paris in the morning, I decided to bike there, work at CREST, and then take my bike on the RER train down to Orsay, as I did not have the time and leisure to bike the whole 20k. Since it was the middle of the day, the carriage was mostly empty and I managed to type a blog entry without having to worry about the bike being a nuisance… The only drag was entering the platform in Paris (Cité Universitaire), as there was no clear access for bikes. Fortunately, a student kindly helped me get my bike over the gate, as I could not manage on my own… Nonetheless, I will certainly repeat the experience on my next trip to Orsay (but would not dare take the bike inside/under Paris *per se* because of the (over-)crowded carriages there).

Filed under: Kids, Statistics, Travel, University life Tagged: ABC, ABC model choice, bike, Cité Universitaire, message in a beetle, Orsay, RER, seminar, Université Paris-Sud

### Bill Fitzgerald (1948-2014)

**J**ust heard a very sad item of news: our colleague and friend Bill Fitzgerald, Head of Research in the Signal Processing Laboratory in the Department of Engineering at the University of Cambridge, Fellow of Christ’s College, co-founder and Chairman of Featurespace, and fanatic guitar player, passed away yesterday. He wrote one of the very first books on MCMC with Joseph Ó Ruanaidh, Numerical Bayesian Methods Applied to Signal Processing, in 1996. On a more personal level, he invited me to Cambridge for my first visit there in 1998 and he thus was influential in introducing me to my friends Christophe Andrieu and Arnaud Doucet. Farewell, Bill!, and may the blessing of the rain be on you…

Filed under: Books, Statistics, University life Tagged: Bill Fitzgerald, Christ's College, Engineering, signal processing, University of Cambridge

### Le Monde puzzle [#860]

**A** Le Monde mathematical puzzle that connects to my awalé post of last year:

*For N≤18, N balls are placed in N consecutive holes. Two players, Alice and Bob, alternately remove two balls at a time, provided those balls sit in contiguous holes. The loser is the player left with only orphaned balls. What are the values of N such that Bob can win, no matter what Alice’s strategy is?*

**I** solved this puzzle with R code that works recursively on N, eliminating every possible adjacent pair of balls and checking whether or not the other player is left with a winning strategy. Brute-force enumeration returns the solution

[1] 1 [1] 1 [1] 1 [1] 0 [1] 1 [1] 1 [1] 1 [1] 0 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 0 [1] 1 [1] 1 [1] 1

for N=2,…,18, answering the question: N=5, 9, 15 are the values for which Alice has no winning strategy when Bob plays optimally. (The case N=5 is obvious, as there always remain two adjacent 1's once Alice has removed any adjacent pair. The case N=9 can also be shown to be a lost cause by enumerating Alice's options.)
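Since the R code itself does not appear above, here is a sketch of the same recursion, written in Python for concreteness (the function name and the tuple encoding of the holes are mine):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def first_player_wins(balls):
    # balls: tuple of 0/1 flags, one per hole; a move removes two balls
    # sitting in contiguous holes
    for i in range(len(balls) - 1):
        if balls[i] and balls[i + 1]:
            after = balls[:i] + (0, 0) + balls[i + 2:]
            if not first_player_wins(after):
                return True   # this move leaves the opponent in a losing position
    return False              # no adjacent pair left: the player to move loses

# Bob, the second player, wins exactly when the first player loses
bob_wins = [n for n in range(2, 19) if not first_player_wins((1,) * n)]
# bob_wins == [5, 9, 15], matching the output above
```

The memoisation keeps the brute-force search well within reach, since there are at most 2¹⁸ board states for N=18.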

Filed under: Books, Kids, R Tagged: awalé, Le Monde, mathematical puzzle, R, recursive function

### [more] parallel MCMC

**S**cott Schmidler and his Ph.D. student Douglas VanDerwerken have arXived a paper on parallel MCMC the very day I left for Chamonix, prior to MCMSki IV, so it is no wonder I missed it at the time. This work is somewhat in the spirit of earlier parallelisation papers (Scott et al.’s consensus Bayes, Neiswanger et al.’s embarrassingly parallel MCMC, Wang and Dunson’s Weierstrassed MCMC, and even White et al.’s parallel ABC), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. In their short survey of previous works on parallelisation, VanDerwerken and Schmidler overlooked our neat (!) JCGS Rao-Blackwellisation paper with Pierre Jacob and Murray Smith, maybe because it sounds more like post-processing than genuine parallelisation (in that it does not speed up the convergence of the chain but rather improves the Monte Carlo uses one can make of that chain), maybe because they did not know of it.

*“This approach has two shortcomings: first, it requires a number of independent simulations, and thus processors, equal to the size of the partition; this may grow exponentially in dim(Θ). Second, the rejection often needed for the restriction doesn’t permit easy evaluation of transition kernel densities, required below. In addition, estimating the relative weights wi with which they should be combined requires care.” (p.3)*

**T**he idea of the authors is to replace the exploration of the whole space operated via a single Markov chain (or by parallel chains acting independently, all of which have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. “Small is beautiful”: it takes less time to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel with the others. More specifically, given a partition of the space into sets Ai with posterior weights wi, parallel chains are associated with targets equal to the original target restricted to those Ai's. This is therefore an MCMC version of partitioned sampling. With regard to the shortcomings listed in the quote above, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. I am a bit worried about this argument, since it amounts to getting a *random* number of simulations within each partition set Ai. In my (maybe biased) perception of partitioned sampling, this sounds somewhat counter-productive, as it increases the variance of the overall estimator. (Of course, not restricting a chain to a given partition set Ai has the appeal of avoiding a possibly massive number of rejection steps. It is however unclear (a) whether or not it impacts ergodicity (it all depends on the way the chain is constructed, i.e. against which target(s)…), as it could lead to an over-representation of some boundaries, and (b) whether or not it improves the overall convergence properties of the chain(s).)
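As a toy illustration of the partitioned-sampling idea (my own sketch, not the authors’ algorithm: a bimodal one-dimensional target, a two-set partition, restricted random-walk Metropolis chains, and weights taken as known rather than estimated, which is precisely the delicate part in general):

```python
import math
import random

def target(x):
    # toy bimodal target: 0.3 N(-3,1) + 0.7 N(3,1), unnormalised
    return (0.3 * math.exp(-0.5 * (x + 3) ** 2)
            + 0.7 * math.exp(-0.5 * (x - 3) ** 2))

def restricted_mh(in_set, x0, n_iter, step=1.0, seed=0):
    # random-walk Metropolis restricted to one partition set:
    # any proposal falling outside the set is rejected outright
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)
        if in_set(y) and rng.random() < target(y) / target(x):
            x = y
        chain.append(x)
    return chain

# one independent chain per partition set, no mode-crossing required
left = restricted_mh(lambda x: x < 0, -3.0, 20000, seed=1)
right = restricted_mh(lambda x: x >= 0, 3.0, 20000, seed=2)

# combine the restricted estimates with the posterior weights of the sets
w_left, w_right = 0.3, 0.7
post_mean = (w_left * sum(left) / len(left)
             + w_right * sum(right) / len(right))  # close to 0.3*(-3)+0.7*3 = 1.2
```

Each chain mixes quickly within its own mode, and the combination step is where the weights wi enter.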

*“The approach presented here represents a solution to this problem which can completely remove the waiting times for crossing between modes, leaving only the relatively short within-mode equilibration times.” (p.4)*

**A** more delicate issue with the partitioned MCMC approach (in my opinion!) lies with the partitioning itself. Indeed, in a complex and high-dimensional model, the construction of an appropriate partition is a challenge in itself, as we often have no prior idea where the modal areas are. Waiting for a correct exploration of the modes is indeed faster than waiting for crossings between modes, *provided* all modes are represented and the chain for each partition set Ai has enough energy to explore that set. It actually sounds (slightly?) unlikely that a target with huge gaps between modes will see a considerable improvement from the partitioned version when the partition sets Ai are selected on the go, because some of the boundaries between the partition sets may be hard to reach with an off-the-shelf proposal. (Obviously, the second part of the method, on the adaptive construction of partitions, is still in the writing and I am looking forward to its arXival!)

**F**urthermore, as noted by Pierre Jacob (of Statisfaction fame!), the adaptive construction of the partition has a lot in common with Wang-Landau schemes, whose goal is to produce a flat-histogram proposal from the current exploration of the state space. Connections with Atchadé’s and Liu’s (2010, Statistica Sinica) extension of the original Wang-Landau algorithm could have been spelled out, especially as the Voronoï tessellation construct seems quite innovative in this respect.

Filed under: Books, Mountains Tagged: Banff, batch sampling, Chamonix-Mont-Blanc, Duke University, embarrassingly parallel, Markov chain Monte Carlo, partition, partitioned sampling, Rao-Blackwellisation, split chain, Voronoi tessellation

### firefly Monte Carlo

**A**nd here is yet another arXived paper using a decomposition of the posterior distribution as a product of terms to run faster, better and higher MCMC algorithms! This one is by Douglas Maclaurin and Ryan Adams: “Firefly Monte Carlo: Exact MCMC with Subsets of Data“. (While a swarm of fireflies makes sense to explain the name, I may be missing some cultural subliminal meaning in the title, as Firefly and Monte Carlo seem to be places in Las Vegas (?) and car brands, Firefly is also a TV series, a clothing brand, and maybe other things…)

*“The evolution of the chain evokes an image of fireflies, as the individual data blink on and out due to updates of the zn.”*

**T**he fundamental assumption of Maclaurin’s and Adams’ approach is that each term in the likelihood (expressed as a product) can be bounded from below by a cheaper function. This lower bound is used to create a Bernoulli auxiliary variable, with probability equal to the ratio of the lower bound to the likelihood term, that helps reduce the number of evaluations of the original likelihood terms. Obviously, there is a gain only if (a) the lower bound is close or tight enough and (b) simulating the auxiliary variables is cheap enough.
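The augmentation can be checked on a toy example (my own sketch, with artificial bounds: any 0 < Bₙ < Lₙ works for the identity): a term contributes Bₙ when its auxiliary variable is “dark” and Lₙ−Bₙ when “bright”, and summing the auxiliary variables out recovers the exact likelihood, which is what makes the scheme exact.

```python
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta = 0.7
xs = [2.3, -1.0, 0.4]                    # three toy observations
L = [sigmoid(theta * x) for x in xs]     # likelihood terms
B = [0.5 * l for l in L]                 # crude, artificial lower bounds

# augmented model: a term contributes B[n] when z_n = 0 ("dark")
# and L[n] - B[n] when z_n = 1 ("bright")
total = 0.0
for z in itertools.product([0, 1], repeat=len(xs)):
    total += math.prod(L[n] - B[n] if z[n] else B[n] for n in range(len(xs)))

# marginalising the z_n's recovers the exact likelihood
assert abs(total - math.prod(L)) < 1e-12
# and, at a given theta, z_n stays dark with probability B[n] / L[n]
p_dark = [b / l for b, l in zip(B, L)]   # all 0.5 with this crude bound
```

The computational gain comes from dark terms needing only the cheap bound, not the likelihood, at each θ-move.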

**A**bout (a), the paper gives the example of a logistic likelihood, with a case of 98% tightness. How generic is that, and how can those bounds be derived in a cheap or automated manner? If one needs to run a variational Bayes approximation first, the gain in efficiency is unlikely to hold. About (b), I do not fully get it: if generating zn requires the evaluation of the original likelihood, we lose the entire appeal of the method. Admittedly, I can see the point in changing only a very small portion α of the zn’s between moves on the parameter θ, since the number of likelihood evaluations is then the same portion α of the total number of terms N. But decreasing the portion α also reduces the mixing efficiency of the algorithm. Among the efficient ways of updating the auxiliary brightness variables proposed in the paper, I get the idea of making a proposal first before eventually computing the true probability of the Bernoulli. A proposal making use of the previous value of the probability (i.e., for the previous value of the parameter θ) could also reduce the number of evaluations of likelihood terms. However, using a “cached” version of the likelihood is only relevant within the same simulation step, since a change in θ requires recomputing the likelihood.

*“In each experiment we compared FlyMC, with two choices of bound selection, to regular full-posterior MCMC. We looked at the average number of likelihoods queried at each iteration and the number of effective samples generated per iteration, accounting for autocorrelation.”*

**T**his comparison does not seem adequate to me: by construction, the algorithm in the paper reduces the number of likelihood evaluations, so this is not a proper comparative instrument. And the effective sample size is a transform of the correlation, not an indicator of convergence. For instance, if the zn’s hardly changed between iterations, so that the overall sampler was definitely far from converging, we would still get θ’s simulated from almost the same distribution, hence nearly uncorrelated. In other words, if the joint chain in (θ,zn) does not converge, it is harder to establish that the subchain in θ converges at all. Indeed, in this logistic example, where the computation of the likelihood is not a massive constraint, I am surprised there is any possibility of a huge gain in using the method, unless the lower bound is essentially the likelihood, which is actually the case for logistic regression models. Another point, made by Dan Simpson, is that the whole dataset needs to remain at hand, full-time, which may be a challenge for computer memory. And stops short of providing really Big Data solutions.
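The point about effective sample size can be made concrete with a (hypothetical) sketch: a chain that only ever samples one mode of a bimodal posterior is essentially uncorrelated, hence reports a near-maximal ESS, while being hopeless as a representation of the full target.

```python
import random

def ess(chain, max_lag=200):
    # effective sample size via the truncated-autocorrelation formula:
    # ESS = n / (1 + 2 * sum of positive-lag autocorrelations)
    n = len(chain)
    mu = sum(chain) / n
    var = sum((x - mu) ** 2 for x in chain) / n
    acc = 0.0
    for k in range(1, max_lag):
        rho = sum((chain[i] - mu) * (chain[i + k] - mu)
                  for i in range(n - k)) / (n * var)
        if rho < 0.05:       # truncate once the correlation dies out
            break
        acc += rho
    return n / (1 + 2 * acc)

rng = random.Random(42)
# a "chain" that only ever visits the mode at +3 of a bimodal posterior:
# iid draws, hence essentially uncorrelated, yet zero convergence overall
stuck = [3.0 + rng.gauss(0.0, 1.0) for _ in range(5000)]
high_ess = ess(stuck)   # close to the nominal 5000
```

High ESS here measures low autocorrelation, nothing more: the missing mode leaves no trace in the statistic.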

Filed under: Books, Statistics, University life Tagged: auxiliary variable, big data, logistic regression, parallel MCMC

### penalising model component complexity

*“Prior selection is the fundamental issue in Bayesian statistics. Priors are the Bayesian’s greatest tool, but they are also the greatest point for criticism: the arbitrariness of prior selection procedures and the lack of realistic sensitivity analysis (…) are a serious argument against current Bayesian practice.” (p.23)*

**A** paper that I first read and annotated in the very early hours of the morning in Banff, when temperatures were down in the mid minus 20s, has now appeared on arXiv: “Penalising model component complexity: A principled, practical approach to constructing priors” by Thiago Martins, Dan Simpson, Andrea Riebler, Håvard Rue, and Sigrunn Sørbye. It is a highly timely and pertinent paper on the selection of default priors! Which shows that the field of “objective” Bayes is still full of open problems and significant advances, and makes a great argument for the future president [that I am] of the O’Bayes section of ISBA to encourage young Bayesian researchers to consider this branch of the field.

*“On the other end of the hunt for the holy grail, “objective” priors are data-dependent and are not uniformly accepted among Bayesians on philosophical grounds.” (p.2)*

**A**part from quibbling at the above quote, since objective priors are *not* data-dependent (this is presumably a typo, used instead of *model-dependent*), I like very much the introduction (appreciating the reference to the very recent Kamary (2014) that just got rejected by TAS for quoting my blog post way too much… and that we jointly resubmitted to Statistics and Computing). Maybe it is missing the alternative solution of going hierarchical as far as needed and ending up with default priors [at the top of the ladder]. And it does not discuss the difficulty in specifying the sensitivity of weakly informative priors.

*“Most model components can be naturally regarded as a flexible version of a base model.” (p.3)*

**T**he starting point for the modelling is the *base model*. How easy is it to define this base model? Does it [always?] translate into a null hypothesis formulation? Is there an automated derivation? I assume this somewhat follows from the “block” idea that I do like but how generic is model construction by blocks?

*“Occam’s razor is the principle of parsimony, for which simpler model formulations should be preferred until there is enough support for a more complex model.” (p.4)*

**I** also like this idea of putting a prior on the distance from the base model! Even more because it is parameterisation invariant (at least at the hyperparameter level). (This vaguely reminded me of a paper we wrote with George a while ago, replacing tests with distance evaluations.) And because it gives a definitive meaning to Occam’s razor. However, unless the hyperparameter ξ is one-dimensional, this does not define a prior on ξ per se. I equally like Eqn (2), as it shows how the base constraint takes one away from Jeffreys’ prior. Plus, if one takes the Kullback-Leibler divergence as an intrinsic loss function, this also sounds related to Holmes’s and Walker’s substitute-loss pseudopriors, no? Now, Eqn (2) does not sound right in the general case. Unless one implicitly takes a uniform prior on the Kullback sphere of radius d? There is a feeling of one-d-ness in the description of the paper (at least up to page 6) and I wanted to see how it extends to models with many (≥2) hyperparameters. Until I reached Section 6, where the authors state exactly that! There is also a potential difficulty in that d(ξ) cannot be computed in a general setting. (Assuming that d(ξ) has a non-vanishing Jacobian, as on page 19, sounds rather unrealistic.) Still about Section 6, handling reference priors on correlation matrices is a major endeavour, which should produce a steady flow of followers…!
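For the record, my own rendering of the one-dimensional construction (to be checked against the paper’s Eqn (2)): the distance to the base model carries an exponential prior, pushed back to the hyperparameter by change of variable,

```latex
d(\xi) = \sqrt{2\,\mathrm{KLD}\big(f(\cdot\mid\xi)\,\big\|\,f(\cdot\mid\xi=0)\big)},
\qquad
\pi(\xi) = \lambda\, e^{-\lambda\, d(\xi)}\,
\left|\frac{\partial d(\xi)}{\partial \xi}\right|,
```

whence the requirement that d(ξ) be differentiable with non-vanishing Jacobian, as discussed above.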

*“The current practice of prior specification is, to be honest, not in a good shape. While there has been a strong growth of Bayesian analysis in science, the research field of “practical prior specification” has been left behind.” (p.23)*

**T**here are still quantities to specify and calibrate in the PC priors, which may actually be deemed a good thing by Bayesians (and some modellers). But overall I think this paper and its message constitute a terrific step for Bayesian statistics and I hope the paper can make it to a major journal.

Filed under: Books, Mountains, pictures, Statistics, University life Tagged: Banff, default prior, Fisher information, ISBA, Jeffreys priors, Kullback-Leibler divergence, model complexity, noninformative priors, O'Bayes, penalisation, Riemann manifold

### Bayesian Data Analysis [BDA3 - part #2]

**H**ere is the second part of my review of Gelman et al.’s *Bayesian Data Analysis* (third edition):

*“When an iterative simulation algorithm is “tuned” (…) the iterations will not in general converge to the target distribution.” (p.297)*

**P**art III covers advanced computation, obviously including MCMC but also model approximations like variational Bayes and expectation propagation (EP), with even a few words on ABC. The novelties in this part are centred on Stan, the language Andrew is developing around Hamiltonian Monte Carlo techniques, a sort of BUGS of the 10s! (And of course on Hamiltonian Monte Carlo techniques themselves.) A few (nit)pickings: the book advises importance resampling without replacement (p.266), which makes some sense when using a poor importance function but ruins the fundamentals of importance sampling. Plus, no trace of infinite-variance importance sampling? Of harmonic means and their dangers? In the Metropolis-Hastings algorithm, the proposal is called the jumping rule and denoted by Jt, which, besides giving the impression of a Jacobian, seems to allow for time-varying proposals and hence time-inhomogeneous Markov chains, whose convergence properties are much hairier. (The warning comes much later, as exemplified in the above quote.) Moving from “burn-in” to “warm-up” to describe the beginning of an MCMC simulation. Being somewhat 90s about convergence diagnostics (as shown by the references in Section 11.7), although the book also proposes new diagnostics and relies much more on effective sample sizes. Particle filters are evacuated in hardly half a page, maybe because Stan does not handle particle filters. A lack of intuition about Hamiltonian Monte Carlo algorithms, as the book plunges immediately into a two-page pseudo-code description. Still using physics vocabulary that puts *me* (and maybe only *me*) off. Although I appreciated the advice to check analytical gradients against their numerical counterparts.

*“In principle there is no limit to the number of levels of variation that can be handled in this way. Bayesian methods provide ready guidance in handling the estimation of the unknown parameters.” (p.381)*

**I** also enjoyed reading the part about modes that stand at the boundary of the parameter space (Section 13.2), even though I do not think modes are great summaries in Bayesian frameworks, and I do not see how picking the prior to avoid modes at the boundary prevents the data from impacting the prior, *in fine*. The variational Bayes section (13.7) is equally enjoyable, with a properly spelled-out illustration, an unusual feature for Bayesian textbooks. (Except that sampling without replacement is back!) Same comments for the expectation propagation (EP) section (13.8), which covers brand new notions. (Will they stand the test of time?!)

*“Geometrically, if β-space is thought of as a room, the model implied by classical model selection claims that the true β has certain prior probabilities of being in the room, on the floor, on the walls, in the edge of the room, or in a corner.” (p.368)*

**P**art IV is a series of five chapters about regression(s). This is somewhat of a classic; nonetheless, Chapter 14 surprised me with an elaborate election example that dabbles in advanced topics like causality and counterfactuals. I did not spot any reference to the *g*-prior or to its intuitive justifications, and the chapter mentions the lasso as a regularisation technique but without any proper definition of this “popular non-Bayesian form of regularisation” (p.368). In French: with not a single equation! Additional novelty may lie in the numerical prior information about the correlations. What is rather crucially (cruelly?) missing, though, is a clearer processing of variable selection in regression models. I know Andrew opposes any notion of a coefficient being exactly equal to zero, as ridiculed through the above quote, but the book does not reject model selection, so why not in this context?! Chapter 15, on hierarchical extensions, stresses the link with exchangeability, once again, with another neat election example justifying the progressive complexification of the model and the cranks and toggles of model building. (I am not certain the reparameterisation advice on p.394 is easily ingested by a newcomer.) The chapters on robustness (Chap. 17) and missing data (Chap. 18) sound slightly less convincing to me, especially the one on robustness, as I never got how to make robustness agree with my Bayesian perspective. The book states “we do not have to abandon Bayesian principles to handle outliers” (p.436), but I would object that the Bayesian paradigm compels us to define an alternative model for those outliers and the way they are produced. One can always resort to a drudging exploration of which subsample of the dataset is at odds with the model, but this may be unrealistic for large datasets and further tells us nothing about how to handle those datapoints.
The missing-data chapter is certainly relevant to such a comprehensive textbook, and I liked the survey illustration where the missing data was in fact made of missing questions. However, I felt the multiple imputation part was not well presented, fearing readers would not understand how to handle it…

*“You can use MCMC, normal approximation, variational Bayes, expectation propagation, Stan, or any other method. But your fit must be Bayesian.” (p.517)*

**P**art V concentrates the most advanced material: Chapter 19 is mostly an illustration of a few complex models, slightly superfluous in my opinion; Chapter 20 is a very short introduction to functional bases, including a basis-selection section (20.2) that implements the “zero coefficient” variable-selection principle refuted in the regression chapter(s), and does not go beyond splines (what about wavelets?); Chapter 21 is a (quick) coverage of Gaussian processes, with the motivating birth-date example (and two mixture datasets I used eons ago…); Chapter 22 is a more (too much?) detailed study of finite mixture models, with no coverage of reversible-jump MCMC; and Chapter 23 is an entry on Bayesian non-parametrics through Dirichlet processes.

*“In practice, for well separated components, it is common to remain stuck in one labelling across all the samples that are collected. One could argue that the Gibbs sampler has failed in such a case.” (p.535)*

**T**o get back to mixtures, I liked the quote about the label-switching issue above, as I was “one” who argued that the Gibbs sampler fails to converge! The corresponding section seems to favour providing a density estimate for mixture models, rather than component-wise evaluations, but it nonetheless mentions the relabelling-by-permutation approach (albeit missing our 2000 JASA paper). The section about inferring on the unknown number of components suggests conducting a regular Gibbs sampler on a model with an upper bound on the number of components and then checking for empty components, an idea I (briefly) considered in the mid-1990s, before the advent of RJMCMC. Of course, the prior on the components matters, and the book suggests using a Dirichlet with a fixed sum, like 1, on the coefficients for all numbers of components.

*“14. Objectivity and subjectivity: discuss the statement `People tend to believe results that support their preconceptions and disbelieve results that surprise them. Bayesian methods tend to encourage this undisciplined mode of thinking.’” (p.100)*

**O**bviously, this being a third edition begets the question, *what’s up, doc?!,* i.e., what’s new [when compared with the second edition]? Quite a lot, even though I am not enough of a Gelmanian exegete to produce a comparison table. Well, for a starter, David Dunson and Aki Vehtari joined the authorship, mostly contributing to the advanced sections on non-parametrics, Gaussian processes, and EP algorithms. Then the Hamiltonian Monte Carlo methodology and Stan of course, which is now central to Andrew’s interests. The book does include a short appendix on running computations in R and in Stan. Further novelties were mentioned above, like the vision of weakly informative priors taking over noninformative priors, but I think this edition of *Bayesian Data Analysis* puts more stress on clever and critical model construction and on the fact that it can be done in a Bayesian manner. Hence the insistence on predictive and cross-validation tools. The book may be deemed somewhat short on exercises, providing between 3 and 20 mostly well-developed problems per chapter, often associated with datasets, rather than the less exciting counter-example above. Even though Andrew disagrees, and his students at ENSAE this year certainly did not complain, I personally feel a total of 220 exercises is not enough for instructors and self-study readers. (At least, this reduces the number of email requests for solutions! Especially when 50 of those are solved on the book website.) But this aspect is a minor quibble: overall this is truly the reference book for a graduate course on Bayesian statistics, and not only on Bayesian data analysis.

Filed under: Books, Kids, R, Statistics, University life Tagged: Andrew Gelman, Bayesian data analysis, Bayesian model choice, Bayesian predictive, finite mixtures, graduate course, hierarchical Bayesian modelling, rats, STAN

### speed [quick book review]

**U**eli Steck is a Swiss alpinist who climbed solo the three “last” north-face routes of the Alps (Eiger, Jorasses, and Cervino/Matterhorn) in the record times of 2:47, 2:27, and 1:56… He also recently climbed Annapurna in 27 hours from base camp, again solo and with no oxygen. (Which led some to doubt his record time, as he had lost his camera on the way.) A climb for which he got one of the 2014 Piolets d’Or. (In connection with this climb, he also faced death threats from the sherpas installing fixed ropes on Everest, as reported in an earlier post.) He wrote a book called Speed, where he describes in rather detailed fashion how he managed the three above records. (It is published in German, Italian and French, the three major languages of the Swiss Confederation, but apparently not in English.) The book reads fast as well, but it should not be very appealing to non-climbers, as it concentrates mostly on the three climbs and their difficulties. The book also contains three round-tables, between Messner and Steck, Bonatti and Steck, and Profit and Steck, which are of some further interest. The most fascinating part of the book is when he describes deciding to go completely free, forsaking existing protection and hence any survival opportunity were he to fall. Looking at the level of the routes he climbed, this sounds to me like an insane Russian roulette, even with a prior reconnaissance of the routes (not in the Jorasses, where he even climbed on-sight). I also liked the recollection of his gift to his wife, for her birthday, of an Eiger Nordwand climb together! *(I am unsure every spouse would appreciate such a gift to the same extent!)* The book concludes with Steck envisioning moving away from those speed solos and towards other approaches to climbing and mountains…

**A**s a coincidence, I also watched the documentary film Messner on Arte. A very well-done docu-fiction, with reconstructions of some of Messner's most impressive climbs in the Alps and the Himalayas… Like the solo climb of the north face of Les Droites. With a single ice axe. The film is also an inquiry into what made Messner the unique climber he is, from a very strict family environment to coping with the loss of his brother Günther on Nanga Parbat. With a testimony from his companion on the ski traverse of the North Pole, who saw Messner repeatedly calling him Günther under stress.

Filed under: Books, Mountains, Running Tagged: Annapurna, Cervino, Eiger, Everest, Grandes Jorasses, Matterhorn, north faces, Reinhold Messner, speed climbing, Ueli Steck, Walter Bonatti

### ski with deviation

**I** just learned that a micro-brew brand of homemade skis has connections with statistics and, who knows, could become a sponsor of the next MCMSki… Indeed, the brand is called deviation (as in standard deviation), is located in Gresham, Oregon, and sells locally made skis and snowboards with names like The Moment Generator or The Mode! The logo clearly indicates a statistical connection:

**A**s it happens, two of the founding partners of deviation, Tim and Peter Wells, are the sons of my long-time friend Marty Wells from Cornell University. When I first met them, they were great kids, young enough to give no inkling they would end up producing beautiful hardwood core skis in a suburb of Portland, Oregon!!! Best wishes to them and to deviation, the most statistical of all ski brands! *(Here is a report in The Oregonian that tells the story of how deviation was created.)*

Filed under: Kids, Mountains, pictures, Travel Tagged: Cornell University, deviation, Gresham, homemade skis, Ithaca, MCMSki, Oregon, ski brand

### Bayesian Data Analysis [BDA3]

**A**ndrew Gelman and his coauthors, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Don Rubin, have now published the latest edition of their book *Bayesian Data Analysis*. David and Aki are newcomers to the authors’ list, contributing an extended section on non-linear and non-parametric models. I have been asked by Sam Behseta to write a review of this new edition for JASA (since Sam is now the JASA book review editor). After wondering about my ability to produce an objective review (on the one hand, this is The Competition to Bayesian Essentials!; on the other hand, Andrew is a good friend spending the year with me in Paris), I decided to jump at it and write a most subjective review, with the help of Clara Grazian, who was Andrew’s teaching assistant this year in Paris, and maybe of some of my Master students who took Andrew’s course. The second edition was reviewed in the September 2004 issue of JASA and we now stand ten years later with an even more impressive textbook. Which truly is what Bayesian data analysis should be.

**T**his edition has five parts, Fundamentals of Bayesian Inference, Fundamentals of Bayesian Data Analysis, Advanced Computation, Regression Models, and Non-linear and Non-parametric Models, plus three appendices. For a total of xiv+662 pages. And a weight of 2.9 pounds (1395g on my kitchen scale!) that makes it hard to carry around in the metro…. I took it to Warwick (and then Nottingham and Oxford and back to Paris) instead.

*“We could avoid the mathematical effort of checking the integrability of the posterior density (…) The result would clearly show the posterior contour drifting off toward infinity.” (p.111)*

**W**hile I cannot go into a detailed reading of those 662 pages (!), I want to highlight a few gems. (I already wrote a detailed and critical analysis of Chapter 6 on model checking in that post.) The very first chapter provides all the necessary items for understanding Bayesian Data Analysis without getting bogged down in propaganda or pseudo-philosophy. Then the other chapters of the first part unroll in a smooth way, cruising on the B highway… With the unique feature of introducing weakly informative priors (Sections 2.9 and 5.7), like the half-Cauchy distribution on scale parameters. It may not be completely clear how weak a weakly informative prior is, but this novel notion is worth including in a textbook. Maybe a mild reproach at this stage: Chapter 5 on hierarchical models is too verbose for my taste, as it essentially focuses on the hierarchical linear model. Of course, this is an essential chapter as it links exchangeability, the “atom” of Bayesian reasoning used by de Finetti, with hierarchical models. Still. Another comment on that chapter: it broaches the topic of improper posteriors by suggesting to run a Markov chain that can expose improperness through its improper behaviour. When it happens as in the quote above, fine!, but there is no guarantee this is always the case! For instance, improperness may be due to regions near zero rather than infinity. And a last barb: there is a dense table (Table 5.4, p.124) that seems to run contrary to Andrew’s avowed dislike of tables. I could also object to the idea of a “true prior distribution” (p.128), or comment on the trivia that hierarchical chapters seem to attract rats (as I also included a rat example in the hierarchical Bayes chapter of *Bayesian Choice* and so does the BUGS Book! Hence, a conclusion that Bayesian textbooks are best avoided by muriphobiacs…)

*“Bayes factors do not work well for models that are inherently continuous (…) Because we emphasize continuous families of models rather than discrete choices, Bayes factors are rarely relevant in our approach to Bayesian statistics.” (p.183 & p.193)*

**P**art II is about “the creative choices that are required, first to set up a Bayesian model in a complex problem, then to perform the model checking and confidence building that is typically necessary to make posterior inferences scientifically defensible” (p.139). It is certainly one of the strengths of the book that it allows for a critical look at models and tools that are rarely discussed in more theoretical Bayesian books. As detailed in my earlier post on Chapter 6, model checking is strongly advocated, via posterior predictive checks and… posterior predictive p-values, which are at best empirical indicators that something could be wrong, definitely not that everything’s all right! Chapter 7 is the model comparison equivalent of Chapter 6, starting with the predictive density (aka the evidence or the marginal likelihood), but completely bypassing the Bayes factor in favour of information criteria like the Watanabe-Akaike or widely applicable information criterion (WAIC), and advocating cross-validation, which is empirically satisfying but formally hard to integrate within a full Bayesian perspective. Chapter 8 is about data collection, sample surveys, randomization and related topics, another entry that is missing from most Bayesian textbooks, maybe not that surprising given the research topics of some of the authors. And Chapter 9 is its symmetric counterpart in that it focuses on the post-modelling step of decision making.
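As a side note for readers meeting WAIC for the first time, the criterion only requires a matrix of pointwise log-likelihood values over posterior draws. Here is a minimal sketch on a hypothetical toy model (a normal mean with known unit variance; this example is mine, not one from the book):

```python
import numpy as np
from scipy.stats import norm

def waic(loglik):
    """WAIC from an (S draws, n observations) matrix of pointwise log-likelihoods:
    -2 * (lppd - p_waic), with lppd the log pointwise predictive density and
    p_waic the summed posterior variances of the log-likelihoods."""
    lppd = np.sum(np.log(np.mean(np.exp(loglik), axis=0)))
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))
    return -2 * (lppd - p_waic)

rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, size=50)                       # toy data, unit variance
# posterior draws of the mean under a flat prior: N(ybar, 1/n)
mus = rng.normal(y.mean(), 1 / np.sqrt(50), size=1000)
loglik = norm.logpdf(y[None, :], loc=mus[:, None], scale=1.0)
w = waic(loglik)
```

The penalty p_waic plays the role of an effective number of parameters, which is part of what makes the criterion attractive to the authors.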

*(Second part of the review to appear on Monday, leaving readers the weekend to recover!)*

Filed under: Books, Kids, R, Statistics, University life Tagged: Andrew Gelman, Bayesian data analysis, Bayesian model choice, Bayesian predictive, finite mixtures, graduate course, hierarchical Bayesian modelling, rats, STAN

### ¼th i-like workshop in St. Anne’s College, Oxford

**D**ue to my previous travelling to and from Nottingham for the seminar and back home early enough to avoid the dreary evening trains from Roissy airport *(no luck there, even at 8pm, the RER train was not operating efficiently!, and no fast lane is planned prior to 2023…)*, I did not see many talks at the i-like workshop. About ¼th, roughly… I even missed the poster session (and the most attractive title of Lazy ABC by Dennis Prangle) thanks to another dreary train ride from Derby to Oxford.

**A**s it happened, I had already heard or read parts of the talks in the Friday morning session, but this helped me understand them better. As in Banff, Paul Fearnhead‘s talk on reparameterisations for pMCMC on hidden Markov models opened a wide door to possible experiments on those algorithms. The examples in the talk were mostly of the parameter duplication type, somewhat creating unidentifiability to decrease correlation, but I also wondered at the possibility of introducing frequent replicas of the hidden chain in order to fight degeneracy. Then Sumeet Singh gave a talk on the convergence properties of noisy ABC for approximate MLE. Although I had read some of the papers behind the talk, it made me realise how keeping balls around each observation in the ABC acceptance step prevented extinction as the number of observations increased. (Sumeet also had a good line with his ABCDE algorithm, standing for *ABC done exactly*!) Anthony Lee covered his joint work with Krys Łatuszyński on the ergodicity conditions of the ABC-MCMC algorithm, the only positive case being the 1-hit algorithm as discussed in an earlier post. This result will hopefully get more publicity, as I frequently read that increasing the number of pseudo-samples has no clear impact on the ABC approximation. Krys Łatuszyński concluded the morning with an aggregate of the various results he and his co-authors had obtained on the fascinating Bernoulli factory. Including constructive derivations.

**A**fter a few discussions on and around research topics, it was all too soon time to take advantage of the grand finale of a March shower to walk from St. Anne’s College to Oxford Station, in order to start the trip back home. I was lucky enough to find a seat and could start experimenting in R with the new idea my trip to Nottingham had raised! I also chatted a wee bit with my neighbour, a delightful old lady from the New Forest travelling to Coventry, recovering from a brain seizure, who wondered about my LaTeX code syntax despite the tiny fonts, and who most suddenly popped a small screen out of her bag to start playing Candy Crush!, apologizing all the same. The overall trip was just long enough for my R code to validate this idea of mine, making this week in England quite a profitable one!!!

Filed under: pictures, Statistics, Travel, University life Tagged: ABC-MCMC, ABC-SMC, Bernouilli factory, Derby, HMM, i-like, Nottingham, pMCMC, St. Anne's College, University of Oxford, University of Warwick

### métro static

*[heard in the métro this morning]*

“…les équations à deux inconnues ça va encore, mais à trois inconnues, c’est trop dur!”

*["...systems of equations with two unknowns are still ok, but with three unknowns it is too hard!"]*

Filed under: Kids, Travel Tagged: high school mathematics, métro, Paris

### Seminar in Nottingham

**L**ast Thursday, I gave a seminar in Nottingham, the true birthplace of the Gibbs sampler!, and I had a quite enjoyable half-day of scientific discussions in the Department of Statistics, with a fine evening tasting a local ale in the oldest (?) inn in England (Ye Olde Trip to Jerusalem) and sampling Indian dishes at 4550 Miles from Delhi (plus or minus epsilon, since the genuine distance is 4200 miles), plus a short morning run on the very green campus. In particular, I discussed parallel ABC with Theo Kypraios and Simon Preston, in connection with their recent paper in Statistics and Computing on the splitting technique of Neiswanger et al. I discussed earlier, but aimed here at a better ABC approximation since (a) each term in the product could correspond to a single observation and (b) hence no summary statistic was needed and a zero tolerance could be envisioned. The paper discusses how to handle samples from terms in a product of densities, either by a Gaussian approximation or by a product of kernel estimates. And mentions connections with expectation propagation (EP), albeit not at the ABC level.

**A** minor idea that came to me during this discussion was to check whether or not a reparameterisation towards a uniform prior was a good idea: the plus of a uniform prior was that the discussion about the prior powers became irrelevant, making both versions of the parallel MCMC algorithm coincide. The minus was not the computational issue, since most priors are from standard families with easily invertible cdfs, but rather why this was supposed to make a difference. When writing this on the train to Oxford, I started wondering, as an ABC implementation is impervious to this reparameterisation. Indeed, simulating θ from π and pseudo-data given θ, versus simulating μ from a uniform and pseudo-data given T(μ), makes no difference in the simulated pseudo-sample, hence in the θ’s selected by the distance, and still in one case the power does not matter while in the other case it does..!
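The invariance is easy to check numerically: simulating θ from the prior directly, or simulating μ uniformly and pushing it through the inverse prior cdf T=F⁻¹, yields the same prior distribution, hence the same simulated pseudo-data and the same ABC output. A quick sketch (in Python rather than R, with a hypothetical exponential prior standing in for π):

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(42)
n = 100_000

theta_direct = rng.exponential(scale=2.0, size=n)  # theta ~ pi, an Exp prior
mu = rng.uniform(size=n)                           # mu ~ U(0,1)
theta_reparam = expon.ppf(mu, scale=2.0)           # T(mu) = F^{-1}(mu)

# both routes produce the same prior distribution: compare a few quantiles
qs = np.quantile(theta_direct, [0.25, 0.5, 0.75])
qr = np.quantile(theta_reparam, [0.25, 0.5, 0.75])
gap = np.max(np.abs(qs - qr))
```

Up to Monte Carlo noise the two samples are indistinguishable, which is the point of the paradox above.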

**A**nother discussion I had during my visit led me to conclude a bit hastily that a thesis topic I had suggested to a new PhD student a few months ago had already been considered locally and earlier, although it ended up as a different, more computational than conceptual, perspective (so not all was lost for my student!). In a wider discussion around lunch, we also had an interesting foray on possible alternatives to Bayes factors and their shortcomings, which was a nice preparation to my seminar on giving up posterior probabilities for posterior error estimates. And an opportunity to mention the arXival of a proper scoring rules paper by Phil Dawid, Monica Musio and Laura Ventura, related to the one I had blogged about after the Padova workshop. And then again a connected paper with Steve Fienberg. This lunch discussion even included some (mild) debate about Murray Aitkin’s integrated likelihood.

**A**s a completely irrelevant aside, this trip gave me the opportunity of a “pilgrimage” to Birmingham New Street train station, 38 years after “landing” for the first time in Britain! And to experience *a fresco* the multiple delays and apologies of East Midlands trains (*“we’re sorry we had to wait for this oil train in York”*, *“we have lost more time since B’ham”*, *“running a 37 minutes delay now”*, *“we apologize for the delay, due to trespassing”*, …), the only positive side being that delayed trains made delayed connections possible!

Filed under: Kids, pictures, Running, Statistics, Travel, University life Tagged: 4550 Miles from Dehli, English train, Gibbs sampler, Sherwood Forest, University of Nottingham, Ye Olde Trip to Jerusalem

### MCMC on zero measure sets

**S**imulating a bivariate normal under the constraint (or conditional on the fact) that x²-y²=1 (a non-linear zero measure curve in the 2-dimensional Euclidean space) is not that easy: if running a random walk along that curve (by running a random walk on y, deducing x from x²=y²+1, and accepting with a Metropolis-Hastings ratio based on the bivariate normal density), the outcome differs from the target predicted by a change of variable and the proper derivation of the conditional. The *above* graph resulting from the R code *below* illustrates the discrepancy!

If instead we add the proper Jacobian as in

```r
ace = (runif(1) < (dnorm(propy)*dnorm(propx)/propx) /
                  (dnorm(ys[t-1])*dnorm(xs[t-1])/xs[t-1]))
```

the fit is there. My open question is how to make this derivation generic, i.e. without requiring the (dreaded) computation of the (dreadful) Jacobian.
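For readers wanting to replicate the experiment, here is a self-contained version of the corrected sampler, rewritten in Python for convenience (same target and same 1/x Jacobian correction as the R line above; the step size 0.5 is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
ys = np.empty(T)
ys[0] = 0.0

def dnorm(z):
    # standard normal density
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

for t in range(1, T):
    propy = ys[t-1] + 0.5 * rng.normal()   # random walk on y
    propx = np.sqrt(1 + propy**2)          # x deduced from x^2 = y^2 + 1
    curx = np.sqrt(1 + ys[t-1]**2)
    # Metropolis-Hastings ratio including the 1/x Jacobian correction
    ratio = (dnorm(propy) * dnorm(propx) / propx) / \
            (dnorm(ys[t-1]) * dnorm(curx) / curx)
    ys[t] = propy if rng.uniform() < ratio else ys[t-1]
# the chain in y then targets the conditional, symmetric around zero
```

Dropping the two /propx and /curx terms reproduces the uncorrected sampler and the discrepancy discussed above.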

Filed under: R, Statistics Tagged: conditional density, Hastings-Metropolis sampler, Jacobian, MCMC, measure theory, measure zero set, projected measure, random walk

### Bared Blade [book review]

**A**s mentioned in my recent review of *Broken Blade* by Kelly McCullough, I had already ordered the sequel *Bared Blade*. And I read this second volume within a few days. Conditional on enjoying fantasy-world detective stories with supernatural beings popping in (or out) at the most convenient times, this volume is indeed very pleasant, with a proper whodunnit, a fairly irrelevant McGuffin, a couple of dryads (that actually turn into… well, no spoiler!), several false trails, a radical variation on the “good cop-bad cop” duo, and the compulsory climactic reversal of fortune at the very end (not a spoiler since it is the same in every novel!). Once again, a very light read, to the point of being almost ethereal, with no pretence at depth or epics or myth, but rather funny and guaranteed 100% free of the living dead, which is a relief. I actually found this volume better than the first one, which is a rarity if you have had enough spare time to read through my non-scientific book reviews. I am thus looking forward to the next break when I can skip through my next volume of Kelly McCullough, *Crossed Blades*. (And I hope I will not get more crossed with that one than I was bored with the current volume!)

Filed under: Books Tagged: barren blade, book reviews, broken blade, crossed blades, heroic fantasy, Kelly McCullough, magics

### to keep the werewolves at Bayes…

### Le Monde puzzle [#857]

**A** rather bland case of Le Monde mathematical puzzle:

*Two positive integers x and y are turned into s=x+y and p=xy. If Sarah and Primrose are given s and p, respectively, how can the following dialogue happen?*

*– I am sure you cannot find my number.*
*– Now you told me that, I can: it is 46.*

*and what are the values of x and y?*

**I**n the original version, it was unclear whether or not each person knew she had the sum or the product. Anyway, the first person in the dialogue has to be Sarah, since a product p equal to a prime integer would lead Primrose to figure out x=1 and hence s=p+1. (Conversely, having observed the sum s cannot lead to deducing x and y.) Sarah’s statement thus means x+y-1 is *not* a prime integer. Now the deduction by Primrose that the sum is 46 implies p can be decomposed only once into a product such that x+y-1 is not a prime integer. If p=45, this is the case since 45=15×3 and 45=5×9 lead to 15+3-1=17 and 5+9-1=13, both primes, while 45=45×1 leads to 45+1-1=45. Other solutions fail, as demonstrated by the code below:
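(The original R code did not survive the transfer to this page; here is a Python sketch of the same enumeration, confirming that p=45 is the only product allowing Primrose’s deduction.)

```python
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

def surviving_sums(p):
    """Sums x+y over factorisations x*y = p with x+y-1 not prime,
    i.e. those compatible with Sarah's opening statement."""
    return {x + p // x for x in range(1, int(p**0.5) + 1)
            if p % x == 0 and not is_prime(x + p // x - 1)}

# products p for which Primrose can deduce that the sum is 46
solutions = [p for p in range(2, 2000) if surviving_sums(p) == {46}]
print(solutions)  # [45]
```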

**B**usser and Cohen argue much more wisely in their solution that any non-prime product p other than 45 would lead to p+1 as an acceptable sum s, hence would prevent Primrose from guessing s.

Filed under: Books, Kids, R Tagged: is.prim, Le Monde, mathematical puzzle, number theory, prime factor decomposition, prime.factor, R, schoolmath

### Pre-processing for approximate Bayesian computation in image analysis

**W**ith Matt Moores and Kerrie Mengersen, from QUT, we wrote this short paper just in time for the MCMSki IV Special Issue of *Statistics & Computing*. And arXived it as well. The global idea is to cut down on the cost of running an ABC experiment by removing the simulation of a humongous state-space vector, as in Potts and hidden Potts models, and replacing it with an approximate simulation of the 1-d sufficient (summary) statistic. In that case, we use a division of the 1-d parameter interval to simulate the distribution of the sufficient statistic for each of those parameter values and to compute the expectation and variance of the sufficient statistic. Then the conditional distribution of the sufficient statistic is approximated by a Gaussian with these two parameters. And those Gaussian approximations substitute for the true distributions within an ABC-SMC algorithm à la Del Moral, Doucet and Jasra (2012).
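The precomputation step can be illustrated on a hypothetical toy model (a normal mean problem with the sample mean as summary, standing in for the Potts sufficient statistic); this Python sketch is only an illustration of the idea, not the algorithm of the paper:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100  # toy model: y_i ~ N(theta, 1), summary statistic = sample mean

# (1) offline precomputation: mean and sd of the summary on a grid of theta
grid = np.linspace(-3, 3, 61)
mu_hat = np.empty(grid.size)
sd_hat = np.empty(grid.size)
for i, th in enumerate(grid):
    s = rng.normal(th, 1.0, size=(200, n)).mean(axis=1)  # 200 replicate summaries
    mu_hat[i], sd_hat[i] = s.mean(), s.std(ddof=1)

# (2) ABC rejection where a Gaussian surrogate replaces full data simulation
obs = rng.normal(1.0, 1.0, size=n).mean()        # observed summary, true theta = 1
theta_prop = rng.uniform(-3, 3, size=50_000)     # draws from a uniform prior
mu_s = np.interp(theta_prop, grid, mu_hat)       # interpolated mapping functions
sd_s = np.interp(theta_prop, grid, sd_hat)
pseudo = rng.normal(mu_s, sd_s)                  # surrogate summaries
keep = theta_prop[np.abs(pseudo - obs) < 0.05]   # tolerance 0.05
```

The expensive simulation of pseudo-data is thus paid once, on the grid, and each ABC proposal only costs one Gaussian draw.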

**A**cross twenty simulated images of 125 × 125 pixels, Matt’s algorithm took an average of 21 minutes per image for between 39 and 70 SMC iterations, while resorting to pseudo-data and deriving the genuine sufficient statistic took an average of 46.5 hours for 44 to 85 SMC iterations. On a realistic Landsat image, with a total of 978,380 pixels, the precomputation of the mapping function took 50 minutes, while the total CPU time on 16 parallel threads was 10 hours 38 minutes. By comparison, it took 97 hours for 10,000 MCMC iterations on this image, with a poor effective sample size of 390 values. Regular SMC-ABC algorithms cannot handle this scale: it takes 89 hours to perform *a single* SMC iteration! (Note that path sampling also operates in this framework, thanks to the same precomputation: in that case it took 2.5 hours for 10⁵ iterations, with an effective sample size of 10⁴…)

**S**ince my student’s paper on Seaman et al (2012) got promptly rejected by *TAS* for quoting too extensively from my post, we decided to include me as an extra author and submitted the paper to this special issue as well.

Filed under: R, Statistics, University life Tagged: ABC, Chamonix, image processing, MCMC, MCMSki IV, Monte Carlo Statistical Methods, path sampling, Potts model, QUT, simulation, SMC-ABC, Statistics and Computing, sufficient statistics, summary statistics

### Saint-Joseph pierres sèches

Filed under: Mountains, Wines Tagged: Côtes du Rhône, Chamonix-Mont-Blanc, French wine, Saint-Joseph