Bayesian News Feeds

MCqMC 2014 [day #1]

Xian's Og - Tue, 2014-04-08 18:14

As I have been kindly invited to give a talk at MCqMC 2014, here I am in Leuven, Belgium, for this conference I have never attended before. (I was also invited for MCqMC 2012 in Sydney.) The talk topics and the attendees’ “sociology” are quite similar to those of the IMACS meeting in Annecy last summer. Namely, rather little on MCMC, particle filters, and other tools familiar in Bayesian computational statistics, but a lot on diffusions and stochastic differential equations and of course quasi-Monte Carlo methods. I thus find myself at a boundary of the conference range and a wee bit lost by some talks, whose very titles make little sense to me.

For instance, I have trouble connecting multi-level Monte Carlo with my own frame of reference. My understanding of the method is one of a control variate version of tempering, namely of using a sequence of approximations to the true target and using rougher approximations as control variates for the finer approximations. But I cannot find on the Web a statistical application of the method outside of diffusions and SDEs, i.e. outside of continuous time processes… Maybe using a particle filter from one approximation to the next, down in terms of roughness, could help.
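
To make the control variate reading concrete, here is the telescoping identity usually credited to Giles (2008), written as a sketch. The notation is an assumption and not taken from the talks: P_0, …, P_L denote increasingly fine approximations of the quantity of interest, and N_ℓ is the number of simulations run at level ℓ, with each pair (P_ℓ, P_{ℓ-1}) simulated from the same random input.

```latex
\mathbb{E}[P_L] \;=\; \mathbb{E}[P_0] \;+\; \sum_{\ell=1}^{L} \mathbb{E}\bigl[P_\ell - P_{\ell-1}\bigr],
\qquad
\widehat{Y} \;=\; \frac{1}{N_0}\sum_{i=1}^{N_0} P_0^{(i)}
\;+\; \sum_{\ell=1}^{L} \frac{1}{N_\ell}\sum_{i=1}^{N_\ell}
\bigl(P_\ell^{(i)} - P_{\ell-1}^{(i)}\bigr).
```

Each correction term has a small variance when consecutive levels are strongly correlated, so most simulations can be spent on the cheap coarse levels, which is where the rougher approximations act as control variates for the finer ones.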

“Several years ago, Giles (2008) introduced an intriguing multi-level idea to deal with such biased settings that can dramatically improve the rate of convergence and can even, in some settings, achieve the canonical “square root” convergence rate associated with unbiased Monte Carlo.” Rhee and Glynn, 2012

Those were my thoughts before lunchtime today (namely April 7, 2014). And then, after lunch, Peter Glynn gave his plenary talk that just answered those questions of mine!!! Essentially, he showed the formula Pierre Jacob also used in his Bernoulli factory paper to transform a converging biased estimator into an unbiased one, based on a telescopic series representation and a random truncation… This approach is described in a paper with Chang-han Rhee, arXived a few years ago. The talk also covered more recent work (presumably related to Chang-han Rhee’s thesis) extending the above to Markov chains. As explained to me later by Pierre Jacob [of Statisfaction fame!], a regular chain does not converge fast enough to compensate for the explosive behaviour of the correction factor, which is why Rhee and Glynn used instead a backward chain, linking to the exact or perfect samplers of the 1990s (whose origin can be traced to a 1992 paper of Asmussen, Glynn and Thorisson).

This was certainly the most riveting talk I attended in the past years in that it brought a direct answer to a question I was starting to investigate. And more. I was also wondering how connected it was with our “exact” representation of the stationary distribution (in an Annals of Probability paper with Jim Hobert), since we use a stopping rule based on renewal and a geometric waiting time, a somewhat empirical version of the inverse probability found in Peter’s talk. This talk also led me to re-consider a recent discussion we had in my CREST office with Andrew about using square root(ed) importance weights, since one of Peter’s slides exhibited those square roots as optimal. Paradoxically, Peter started the talk by down-playing it, stating there was a single idea therein and a single important slide, making it a perfect after-lunch talk: I wish I had had three times as much time to examine each slide!
(In the afternoon session, Éric Moulines also gave a thought-provoking talk on particle islands and double bootstrap, a research project I will comment on in more detail the day it gets arXived.)
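
The debiasing idea from Peter Glynn's talk (telescopic series plus random truncation) can be illustrated on a toy case. What follows is a minimal sketch under assumptions of mine, not the construction of Rhee and Glynn: the "biased" sequence is the deterministic partial sums of the series for e, the truncation level N is geometric, and the names `debiased_draw`, `delta` and `p` are hypothetical.

```r
# One unbiased draw: sum the increments delta(k) up to a random level N,
# each increment being divided by P(N >= k)
debiased_draw <- function(delta, p = 0.5) {
  N <- rgeom(1, p)               # random truncation, P(N >= k) = (1 - p)^k
  k <- 0:N
  sum(delta(k) / (1 - p)^k)      # inverse-probability correction factors
}

# Increments of the partial sums Y_k = sum_{j <= k} 1/j!, which converge to exp(1)
delta <- function(k) 1 / factorial(k)

set.seed(2014)
z <- replicate(1e4, debiased_draw(delta))
mean(z)   # should be close to exp(1), although each draw uses finitely many terms
```

The correction factors 1/(1 - p)^k grow geometrically, so the increments must shrink fast enough for the variance to stay finite. This is the "explosive behaviour of the correction factor" mentioned above for chains that converge too slowly.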


Filed under: pictures, Running, Statistics, Travel, University life Tagged: Belgium, Bernoulli factory, Leuven, MCMC, MCQMC2014, Monte Carlo Statistical Methods, multi-level Monte Carlo, particle filters, SDEs, unbiasedness
Categories: Bayesian Bloggers

Leuven snapshot [#2]

Xian's Og - Tue, 2014-04-08 06:24
Categories: Bayesian Bloggers

data scientist position

Xian's Og - Mon, 2014-04-07 18:14

Our newly created Chaire “Economie et gestion des nouvelles données” in Paris-Dauphine, ENS Ulm, École Polytechnique and ENSAE is recruiting a data scientist starting as early as May 1, the call remaining open till the position is filled. The location is in one of the above labs in Paris, the duration is at least one year, the salary varies with the applicant’s profile, and the contacts are Stephane Gaiffas (stephane.gaiffas AT cmap DOT polytechnique.fr), Robin Ryder (ryder AT ceremade DOT dauphine.fr), and Gabriel Peyré (peyre AT ceremade DOT dauphine.fr). Here are more details:

Job description

The chaire “Economie et gestion des nouvelles données” is recruiting a talented young engineer specialized in large scale computing and data processing. The targeted applications include machine learning, imaging sciences and finance. This is a unique opportunity to join a newly created research group between the best Parisian labs in applied mathematics and computer science (Paris-Dauphine, ENS Ulm, Ecole Polytechnique and ENSAE) working hand in hand with major industrial companies (Havas, BNP Paribas, Warner Bros.). The proposed position consists in helping researchers of the group to develop and implement large scale data processing methods, and applying these methods to real life problems in collaboration with the industrial partners.

A non-exhaustive list of methods that are currently investigated by researchers of the group, and that will play a key role in the computational framework developed by the recruited engineer, includes:
● Large scale non-smooth optimization methods (proximal schemes, interior points, optimization on manifolds).
● Machine learning problems (kernelized methods, Lasso, collaborative filtering, deep learning, learning for graphs, learning for time-dependent systems), with a particular focus on large scale problems and stochastic methods.
● Imaging problems (compressed sensing, super-resolution).
● Approximate Bayesian Computation (ABC) methods.
● Particle and Sequential Monte Carlo methods.

Candidate profile

The candidate should have a very good background in computer science with various programming environments (e.g. Matlab, Python, C++) and knowledge of high performance computing methods (e.g. GPU, parallelization, cloud computing). He/she should adhere to the open source philosophy and possibly be able to interact with the relevant communities (e.g. scikit-learn initiative). Typical curriculum includes engineering school or Master studies in computer science / applied maths / physics, and possibly a PhD (not required).

Working environment

The recruited engineer will work within one of the labs of the chaire. He/she will benefit from a very stimulating working environment and all required computing resources. He/she will work in close interaction with the 4 research labs of the chaire, and will also have regular meetings with the industrial partners. More information about the chaire can be found online at http://www.di.ens.fr/~aspremon/chaire/


Filed under: R, Statistics, University life Tagged: ABC, advanced Monte Carlo methods, École Polytechnique, CREST, Economie et gestion des nouveles données, ENSAE, job offer, machine learning, Matlab, Python, Université Paris Dauphine
Categories: Bayesian Bloggers

accelerating MCMC via parallel predictive prefetching

Xian's Og - Sun, 2014-04-06 18:14

“The idea is to calculate multiple likelihoods ahead of time (“pre-fetching”), and only use the ones which are needed.” A. Brockwell, 2006

Yet another paper on parallel MCMC, just arXived by Elaine Angelino, Eddie Kohler, Amos Waterland, Margo Seltzer, and Ryan P. Adams. Now,  besides “prefetching” found in the title, I spotted “speculative execution”, “slapdash treatment”, “scheduling decisions” in the very first pages: this paper definitely is far from shying away from using fancy terminology! I actually found the paper rather difficult to read to the point I had to give up my first attempt during an endless university board of governors meeting yesterday. (I also think “prefetching” is awfully painful to type!)

What is “prefetching” then? It refers to a 2006 JCGS paper by Anthony Brockwell. As explained in the above quote from Brockwell, prefetching means computing the 2², 2³, … values of the likelihood that will be needed in 2, 3, … iterations. Running a regular Metropolis-Hastings algorithm then means building a decision tree back to the current iteration and drawing 2, 3, … uniforms to go down the tree to the appropriate branch. So in the end only one path of the tree is exploited, which does not seem particularly efficient when vanilla Rao-Blackwellisation and recycling could be implemented almost for free.
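
The tree mechanics can be sketched on a toy random-walk Metropolis sampler. This is a minimal illustration under assumptions of mine, not Brockwell's code: the proposal increments and the uniforms are drawn in advance, the target is a standard normal, and the function names (`sequential_mh`, `prefetch_mh`) are hypothetical. In this sketch the tree holds 2^h - 1 candidates for h iterations, all of which are evaluated before a single branch is followed.

```r
# Sequential random-walk Metropolis, driven by given increments and uniforms
sequential_mh <- function(x, logtarget, eps, u) {
  path <- numeric(length(eps))
  lp <- logtarget(x)
  for (t in seq_along(eps)) {
    y <- x + eps[t]
    lpy <- logtarget(y)
    if (log(u[t]) < lpy - lp) { x <- y; lp <- lpy }
    path[t] <- x
  }
  path
}

# Prefetching version: evaluate the target over the whole binary tree first,
# then follow one branch with the same uniforms
prefetch_mh <- function(x, logtarget, eps, u) {
  h <- length(eps)
  states <- x                        # states reachable before step t (2^(t-1) of them)
  cand <- vector("list", h)
  for (t in 1:h) {
    cand[[t]] <- states + eps[t]     # one candidate per reachable state
    states <- c(states, cand[[t]])   # reject keeps the index, accept shifts it by 2^(t-1)
  }
  # the 2^h - 1 costly evaluations: mutually independent, hence parallelisable
  lpc <- lapply(cand, function(z) vapply(z, logtarget, numeric(1)))
  i <- 1
  cur_lp <- logtarget(x)
  path <- numeric(h)
  for (t in 1:h) {
    if (log(u[t]) < lpc[[t]][i] - cur_lp) {   # accept: move to the prefetched candidate
      x <- cand[[t]][i]
      cur_lp <- lpc[[t]][i]
      i <- i + 2^(t - 1)
    }
    path[t] <- x
  }
  path
}

set.seed(1)
h <- 3
eps <- rnorm(h)
u <- runif(h)
logtarget <- function(x) dnorm(x, log = TRUE)
all.equal(prefetch_mh(0, logtarget, eps, u), sequential_mh(0, logtarget, eps, u))
```

Both functions return the same chain, which makes the waste visible: only h of the 2^h - 1 prefetched values end up being used.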

“Another intriguing possibility, suggested to the author by an anonymous referee, arises in the case where one can guess whether or not acceptance probabilities will be “high” or “low.” In this case, the tree could be made deeper down “high” probability paths and shallower in the “low” probability paths.” A. Brockwell, 2006

The current paper stems from Brockwell’s 2006 final remark, as reproduced above, through those “speculative moves” that consider the reject branch of the prefetching tree more often than not, based on some preliminary or dynamic evaluation of the acceptance rate. Using a fast but close enough approximation to the true target (and a fixed sequence of uniforms) may also produce a “single most likely path on which” prefetched simulations can be run. The basic idea is thus to run simulations and costly likelihood computations on many parallel processors along a prefetched path, a path that has been prefetched for its high approximate likelihood. (With of course cases where this speculative simulation is not helpful because we end up following another path with the genuine target.) The paper actually goes further than the basic idea to avoid spending useless time on paths that will not be chosen, by constructing sequences of approximations for the precomputations. The proposition for the sequence found therein is to subsample the original data and use a normal approximation to the difference of the log (sub-)likelihoods. Even though the authors describe the system implementation of the progressive approximation idea, it remains rather unclear (to me) how the adaptive estimation of the acceptance probability is compatible with the parallelisation idea, because it seems (to me) that it induces a lot of communication between the cores. Also, the method is advocated mainly for burnin’ (or warmup, to follow Andrew’s terminology!), which seems to remove the need to use exact targets: if the approximation is close enough, the Markov chain will quickly reach a region of interest for the true target and from there there seems to be little speedup in implementing this nonetheless most interesting strategy.
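
The subsampling step can be sketched as follows. This is loosely inspired by the description above and is not the authors' implementation: it assumes a flat prior and a symmetric proposal (so that the log acceptance ratio is the sum of the per-observation log-likelihood differences), sampling without replacement, and a hypothetical helper name `approx_accept_prob`.

```r
# Approximate probability that a Metropolis-Hastings move is accepted,
# from a subsample of the per-observation log-likelihood differences
approx_accept_prob <- function(d_sub, N, log_u) {
  m <- length(d_sub)
  s_hat <- N * mean(d_sub)                          # estimated full-data difference
  se <- N * sd(d_sub) / sqrt(m) * sqrt(1 - m / N)   # with finite-population correction
  pnorm((s_hat - log_u) / se)                       # normal approximation of P(sum > log u)
}

set.seed(3)
N <- 1e4
x <- rnorm(N, mean = 1)                 # toy data
theta <- 1                              # current value
theta_new <- 0.9                        # proposed value
d <- dnorm(x, theta_new, log = TRUE) - dnorm(x, theta, log = TRUE)
sub <- sample(N, 200)                   # subsample without replacement
p_hat <- approx_accept_prob(d[sub], N, log_u = log(0.5))
accepted <- sum(d) > log(0.5)           # exact decision, for comparison
```

A scheduler could use `p_hat` to decide which branch of the prefetching tree deserves the next processors, refining the estimate with larger subsamples only when it stays close to 1/2.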


Filed under: Books, Statistics, University life Tagged: approximate target, baobab trees, board of governors, Monte Carlo Statistical Methods, parallel MCMC, parallel processing, precise pangolin, prefetching, speculative moves
Categories: Bayesian Bloggers

BAYSM ’14 im Wien, Sep. 18-19

Xian's Og - Sat, 2014-04-05 18:14

It all started in Jim Berger’s basement, drinking with the uttermost reverence an otherworldly Turley Zinfandel during the great party Ann and Jim Berger hosted for the O’Bayes’13 workshop at Duke. I then mentioned to Angela Bitto and Alexandra Posekany, from WU Wien, that I was going to be in Austria next September for a seminar in Linz, at the Johannes Kepler Universität, and, as it happened to take place the day before BAYSM ’14, the second conference of the young Bayesian statisticians, in connection with the j-ISBA section, they most kindly invited me to the meeting! As a senior Bayesian, most obviously! This is quite exciting, all the more because I have never visited Vienna before. (Contrary to other parts of Austria, like the Großglockner, where I briefly met Peter Habeler. Trivia: the cover picture of the ‘Og is actually taken from the Großglockner.)


Filed under: Kids, Mountains, pictures, Statistics, Travel, University life, Wines Tagged: Austria, BAYSM, Duke, Großglockner, j-ISBA, Johannes Kepler Universität, Linz, O-Bayes 2013, Studlgrat, Wien, WU Wien
Categories: Bayesian Bloggers

talk in Orsay (message in a beetle)

Xian's Og - Fri, 2014-04-04 18:14

Yesterday (March 27), I gave a seminar at Paris-Sud University, Orsay, in the stats department, on ABC model choice. It was an opportunity to talk about recent advances we have made with Jean-Michel Marin and Pierre Pudlo on using machine-learning devices to improve ABC. (More to come soon!) And to chat with Gilles Celeux about machine learning and classification. Actually, given that one of my examples was about the Asian lady beetle invasion and that the buildings of the Paris-Sud University have suffered from this invasion, I should have advertised the talk with the more catchy title of “message in a beetle”…

This seminar was also an opportunity to experiment with mixed transportation. Indeed, since I had some errands to run in Paris in the morning, I decided to bike there (in Paris), work at CREST, and then take my bike in the RER train down to Orsay as I did not have the time and leisure to bike all the 20k there. Since it was the middle of the day, the carriage was mostly empty and I managed to type a blog entry without having to worry about the bike being a nuisance… The only drag was to enter the platform in Paris (Cité Universitaire) as there was no clear access for bikes. Fortunately, a student kindly helped me to get over the gate with my bike, as I could not manage on my own… Nonetheless, I will certainly repeat the experience on my next trip to Orsay (but would not dare take the bike inside/under Paris per se because of the (over-)crowded carriages there).


Filed under: Kids, Statistics, Travel, University life Tagged: ABC, ABC model choice, bike, Cité Universitaire, message in a beetle, Orsay, RER, seminar, Université Paris-Sud
Categories: Bayesian Bloggers

Bill Fitzgerald (1948-2014)

Xian's Og - Fri, 2014-04-04 11:32

 

Just heard a very sad item of news: our colleague and friend Bill Fitzgerald, Head of Research in the Signal Processing Laboratory in the Department of Engineering at the University of Cambridge, Fellow of Christ’s College, co-founder and Chairman of Featurespace, and fanatic guitar player, passed away yesterday. He wrote one of the very first books on MCMC with Joseph Ó Ruanaidh, Numerical Bayesian Methods Applied to Signal Processing, in 1996. On a more personal level, he invited me to Cambridge for my first visit there  in 1998 and he thus was influential in introducing me to my friends Christophe Andrieu and Arnaud Doucet. Farewell, Bill!, and may the blessing of the rain be on you…


Filed under: Books, Statistics, University life Tagged: Bill Fitzgerald, Christ's College, Engineering, signal processing, University of Cambridge
Categories: Bayesian Bloggers

Le Monde puzzle [#860]

Xian's Og - Thu, 2014-04-03 18:14

A Le Monde mathematical puzzle that connects to my awalé post of last year:

For N≤18, N balls are placed in N consecutive holes. Two players, Alice and Bob, consecutively take two balls at a time provided those balls are in contiguous holes. The loser is left with orphaned balls. What are the values of N such that Bob can win, whatever Alice’s strategy?

I solved this puzzle with the following R code that works recursively on N by eliminating all possible adjacent pairs of balls and checking whether or not there is a winning strategy for the other player.

    topA=function(awale){
      # return 1 if the current player can win, 0 otherwise
      # (awale is a 0/1 vector of length N, N being a global variable)
      best=0
      if (max(awale[-1]*awale[-N])==1){ # there are adjacent balls remaining
        for (i in (1:(N-1))[awale[1:(N-1)]==1]){
          if (awale[i+1]==1){
            bwale=awale
            bwale[c(i,i+1)]=0 # remove the adjacent pair (i,i+1)
            best=max(best,1-topA(bwale)) # win if the other player then loses
          }
        }}
      return(best)
    }
    for (N in 2:18) print(topA(rep(1,N)))

which returns the solution

    [1] 1
    [1] 1
    [1] 1
    [1] 0
    [1] 1
    [1] 1
    [1] 1
    [1] 0
    [1] 1
    [1] 1
    [1] 1
    [1] 1
    [1] 1
    [1] 0
    [1] 1
    [1] 1
    [1] 1

(brute-force) answering the question that N=5,9,15 are the values where Alice has no winning strategy if Bob plays in an optimal manner. (The case N=5 is obvious as there always remain two adjacent 1’s once Alice has removed any adjacent pair. The case N=9 can also be shown to be a lost cause by enumeration of Alice’s options.)
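
As a cross-check on the brute-force recursion, the same answer can be obtained much faster through Sprague-Grundy values, since removing an adjacent pair splits a row of n balls into two independent rows of sizes i and n-2-i. This is an alternative technique, not the code used above, and the names `grundy` and `nmax` are hypothetical.

```r
# Grundy values for rows of 0, 1, ..., nmax contiguous balls (nmax >= 2)
grundy <- function(nmax) {
  g <- integer(nmax + 1)             # g[n + 1] stores the value for a row of n balls
  for (n in 2:nmax) {
    left <- 0:(n - 2)                # number of balls left of the removed pair
    reach <- bitwXor(g[left + 1], g[n - 2 - left + 1])  # values of reachable positions
    k <- 0L
    while (k %in% reach) k <- k + 1L # minimum excluded value
    g[n + 1] <- k
  }
  g
}

g <- grundy(18)
which(g[-1] == 0)   # values of N for which the first player (Alice) loses: 1 5 9 15
```

A zero Grundy value means the player about to move loses under optimal play, which recovers N=5,9,15 (plus the trivial N=1, where Alice cannot move at all).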


Filed under: Books, Kids, R Tagged: awalé, Le Monde, mathematical puzzle, R, recursive function
Categories: Bayesian Bloggers

[more] parallel MCMC

Xian's Og - Wed, 2014-04-02 18:14

Scott Schmidler and his Ph.D. student Douglas VanDerwerken have arXived a paper on parallel MCMC the very day I left for Chamonix, prior to MCMSki IV, so it is no wonder I missed it at the time. This work is somewhat in the spirit of the parallel papers (Scott et al.’s consensus Bayes, Neiswanger et al.’s embarrassingly parallel MCMC, Wang and Dunson’s Weierstrassed MCMC, and even White et al.’s parallel ABC), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. In their short survey of previous works on parallelization, VanDerwerken and Schmidler overlooked our neat (!) JCGS Rao-Blackwellisation with Pierre Jacob and Murray Smith, maybe because it sounds more like post-processing than genuine parallelization (in that it does not speed up the convergence of the chain but rather improves the Monte Carlo usages one can make of this chain), maybe because they did not know of it.

“This approach has two shortcomings: first, it requires a number of independent simulations, and thus processors, equal to the size of the partition; this may grow exponentially in dim(Θ). Second, the rejection often needed for the restriction doesn’t permit easy evaluation of transition kernel densities, required below. In addition, estimating the relative weights wi with which they should be combined requires care.” (p.3)

The idea of the authors is to replace an exploration of the whole space operated via a single Markov chain (or by parallel chains acting independently which all have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. “Small is beautiful”: it takes a shorter while to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel to the others. More specifically, given a partition of the space, into sets Ai with posterior weights wi, parallel chains are associated with targets equal to the original target restricted to those Ai’s. This is therefore an MCMC version of partitioned sampling. With regard to the shortcomings listed in the quote above, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. I am a bit worried about this argument since it amounts to getting a random number of simulations within each partition set Ai. In my (maybe biased) perception of partitioned sampling, this sounds somewhat counter-productive, as it increases the variance of the overall estimator. (Of course, not restricting a chain to a given partition set Ai has the incentive of avoiding a possibly massive amount of rejection steps. It is however unclear (a) whether or not it impacts ergodicity (it all depends on the way the chain is constructed, i.e. against which target(s)…) as it could lead to an over-representation of some boundaries and (b) whether or not it improves the overall convergence properties of the chain(s).)
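
The restricted-chain version of partitioned sampling described above can be sketched on a toy bimodal target. This is a minimal illustration under assumptions of mine, not the authors' algorithm: the target is a two-component normal mixture, the partition is made of two half-lines, the weights wi are computed exactly (in practice they must be estimated), and the function names are hypothetical.

```r
# Toy bimodal target: 0.3 N(-4,1) + 0.7 N(4,1)
log_target <- function(x) log(0.3 * dnorm(x, -4) + 0.7 * dnorm(x, 4))

# Random-walk Metropolis restricted to a set A: proposals outside A are rejected
restricted_mh <- function(n, start, in_set, sd = 1) {
  x <- numeric(n)
  cur <- start
  for (t in 1:n) {
    prop <- cur + rnorm(1, sd = sd)
    if (in_set(prop) && log(runif(1)) < log_target(prop) - log_target(cur)) cur <- prop
    x[t] <- cur
  }
  x
}

set.seed(42)
chain1 <- restricted_mh(5000, -4, function(x) x <= 0)   # chain on A1 = (-Inf, 0]
chain2 <- restricted_mh(5000,  4, function(x) x > 0)    # chain on A2 = (0, Inf)

# Posterior weights of the two sets, available in closed form for this toy target
w1 <- 0.3 * pnorm(0, -4) + 0.7 * pnorm(0, 4)
w <- c(w1, 1 - w1)

# Weighted combination of the within-set averages; the true mean is 1.6
est <- w[1] * mean(chain1) + w[2] * mean(chain2)
```

Neither chain ever has to cross the gap between the modes, which is the appeal of the approach. The hard parts left out of this sketch are exactly the ones discussed here: choosing the partition and estimating the weights.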

“The approach presented here represents a solution to this problem which can completely remove the waiting times for crossing between modes, leaving only the relatively short within-mode equilibration times.” (p.4)

A more delicate issue with the partitioned MCMC approach (in my opinion!) lies with the partitioning. Indeed, in a complex and high-dimensional model, the construction of the appropriate partition is a challenge in itself as we often have no prior idea where the modal areas are. Waiting for a correct exploration of the modes is indeed faster than waiting for crossing between modes, provided all modes are represented and the chain for each partition set Ai has enough energy to explore this set. It actually sounds (slightly?) unlikely that a target with huge gaps between modes will see a considerable improvement from the partitioned version when the partition sets Ai are selected on the go, because some of the boundaries between the partition sets may be hard to reach with an off-the-shelf proposal. (Obviously, the second part of the method on the adaptive construction of partitions is still in the writing and I am looking forward to its arXival!)

Furthermore, as noted by Pierre Jacob (of Statisfaction fame!), the adaptive construction of the partition has a lot in common with Wang-Landau schemes, whose goal is to produce a flat histogram proposal from the current exploration of the state space. Connections with Atchadé’s and Liu’s (2010, Statistica Sinica) extension of the original Wang-Landau algorithm could have been spelled out, especially as the Voronoï tessellation construct seems quite innovative in this respect.


Filed under: Books, Mountains Tagged: Banff, batch sampling, Chamonix-Mont-Blanc, Duke University, embarassingly parallel, Markov chain Monte Carlo, partition, partitioned sampling, Rao-Blackwellisation, split chain, Voronoi tesselation
Categories: Bayesian Bloggers

firefly Monte Carlo

Xian's Og - Tue, 2014-04-01 18:14

And here is yet another arXived paper using a decomposition of the posterior distribution as a product of terms to run faster, better and higher MCMC algorithms! This one is by Douglas Maclaurin and Ryan Adams: “Firefly Monte Carlo: Exact MCMC with Subsets of Data”. (While a swarm of fireflies makes sense to explain the name, I may miss some cultural subliminal meaning in the title as Firefly and Monte Carlo seem to be places in Las Vegas (?), and a car brand, Firefly is a TV series, a clothes brand, and maybe other things…)

“The evolution of the chain evokes an image of fireflies, as the individual data blink on and out due to updates of the zn.”

The fundamental assumption of Maclaurin’s and Adams’ approach is that each product term in the likelihood (expressed as a product) can be bounded by a cheaper lower bound. This lower bound is used to create a Bernoulli auxiliary variable with probability equal to the ratio of the lower bound to the likelihood term, auxiliary variable that helps to reduce the number of evaluations of the original likelihood terms. Obviously, there is a gain only if (a) the lower bound is close or tight enough and (b) simulating the auxiliary variables is cheap enough.
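
The auxiliary variable construction can be written out as follows. This is a sketch of the construction as described above, with notation that is an assumption: L_n(θ) denotes the n-th likelihood term, B_n(θ) its cheap lower bound with 0 < B_n(θ) ≤ L_n(θ), and z_n the binary "brightness" variable.

```latex
\Pr(z_n = 0 \mid \theta) = \frac{B_n(\theta)}{L_n(\theta)}, \qquad
\Pr(z_n = 1 \mid \theta) = \frac{L_n(\theta) - B_n(\theta)}{L_n(\theta)},
\\[6pt]
p(\theta, z \mid x) \;\propto\; p(\theta) \prod_{n=1}^{N}
B_n(\theta)^{\,1 - z_n}\,\bigl(L_n(\theta) - B_n(\theta)\bigr)^{z_n},
\qquad
\sum_{z_n \in \{0,1\}} B_n(\theta)^{\,1 - z_n}\bigl(L_n(\theta) - B_n(\theta)\bigr)^{z_n} = L_n(\theta).
```

Summing out the z_n's thus returns the original posterior, while conditional on z only the terms with z_n = 1 require an evaluation of L_n(θ). The tighter the bound, the fewer such terms there are.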

About (a), the paper gives the tight example of a logistic, with a case of a 98% tightness. How generic is that and how can those bounds be derived in a cheap or automated manner? If one needs to run a variational Bayes approximation first, the gain in efficiency is unlikely to hold. About (b), I do not fully get it: if generating zn requires the evaluation of the original likelihood we lose the entire appeal of the method. Admittedly, I can see the point in changing a very small portion α of the zn’s between moves on the parameter θ, since the number of likelihood evaluations is the same portion α of the total number of terms N. But decreasing the portion α is also reducing the mixing efficiency of the algorithm. In the efficient ways of updating the auxiliary brightness variables (ways proposed in the paper), I get the idea of making a proposal first before eventually computing the true probability of a Bernoulli. A proposal making use of the previous value of the probability (i.e. for the previous value of the parameter θ) could also reduce the number of evaluations of likelihood terms. However, using a “cached” version of the likelihood is only relevant within the same simulation step since a change in θ requires recomputing the likelihood.

“In each experiment we compared FlyMC, with two choices of bound selection, to regular full-posterior MCMC. We looked at the average number of likelihoods queried at each iteration and the number of effective samples generated per iteration, accounting for autocorrelation.”

This comparison does not seem adequate to me: by construction, the algorithm in the paper reduces the number of likelihood evaluations, so this is not a proper comparative instrument. The effective sample size is a transform of the correlation, not an indicator of convergence. For instance, if the zn’s hardly changed between iterations, so that the overall sampler was definitely far from converging, we would get θ’s simulated from almost the same distribution, hence being uncorrelated. In other words, if the joint chain in (θ,zn) does not converge, it is harder to establish that the subchain in θ converges at all. Indeed, in this logistic example where the computation of the likelihood is not a massive constraint, I am surprised there is any possibility of a huge gain in using the method, unless the lower bound is essentially the likelihood, which is actually the case for logistic regression models. Another point made by Dan Simpson is that the whole dataset needs to remain on-hold, full-time, which may be a challenge to the computer memory and stops short of providing really Big Data solutions.


Filed under: Books, Statistics, University life Tagged: auxiliary variable, big data, logistic regression, parallel MCMC
Categories: Bayesian Bloggers

penalising model component complexity

Xian's Og - Mon, 2014-03-31 18:14

“Prior selection is the fundamental issue in Bayesian statistics. Priors are the Bayesian’s greatest tool, but they are also the greatest point for criticism: the arbitrariness of prior selection procedures and the lack of realistic sensitivity analysis (…) are a serious argument against current Bayesian practice.” (p.23)

A paper that I first read and annotated in the very early hours of the morning in Banff, when temperatures were down in the mid minus 20s, has now appeared on arXiv: “Penalising model component complexity: A principled, practical approach to constructing priors” by Thiago Martins, Dan Simpson, Andrea Riebler, Håvard Rue, and Sigrunn Sørbye. It is a highly timely and pertinent paper on the selection of default priors! Which shows that the field of “objective” Bayes is still full of open problems and significant advances and makes a great argument for the future president [that I am] of the O’Bayes section of ISBA to encourage young Bayesian researchers to consider this branch of the field.

“On the other end of the hunt for the holy grail, “objective” priors are data-dependent and are not uniformly accepted among Bayesians on philosophical grounds.” (p.2)

Apart from the above quote, as objective priors are not data-dependent (this is presumably a typo, used instead of model-dependent!), I very much like the introduction (appreciating the reference to the very recent Kamary (2014) that just got rejected by TAS for quoting my blog post way too much… and that we jointly resubmitted to Statistics and Computing). Maybe missing the alternative solution of going hierarchical as far as needed and ending up with default priors [at the top of the ladder]. And not discussing the difficulty in specifying the sensitivity of weakly informative priors.

“Most model components can be naturally regarded as a flexible version of a base model.” (p.3)

The starting point for the modelling is the base model. How easy is it to define this base model? Does it [always?] translate into a null hypothesis formulation? Is there an automated derivation? I assume this somewhat follows from the “block” idea that I do like but how generic is model construction by blocks?

     

“Occam’s razor is the principle of parsimony, for which simpler model formulations should be preferred until there is enough support for a more complex model.” (p.4)

I also like this idea of putting a prior on the distance from the base! Even more because it is parameterisation invariant (at least at the hyperparameter level). (This vaguely reminded me of a paper we wrote with George a while ago replacing tests with distance evaluations.) And because it gives a definitive meaning to Occam’s razor. However, unless the hyperparameter ξ is one-dimensional this does not define a prior on ξ per se. I equally like Eqn (2) as it shows how the base constraint takes one away from Jeffreys’ prior. Plus, if one takes the Kullback divergence as an intrinsic loss function, this also sounds related to Holmes’s and Walker’s substitute loss pseudopriors, no? Now, eqn (2) does not sound right in the general case. Unless one implicitly takes a uniform prior on the Kullback sphere of radius d? There is a feeling of one-d-ness in the description of the paper (at least till page 6) and I wanted to see how it extends to models with many (≥2) hyperparameters. Until I reached Section 6 where the authors state exactly that! There is also a potential difficulty in that d(ξ) cannot be computed in a general setting. (Assuming that d(ξ) has a non-vanishing Jacobian as on page 19 sounds rather unrealistic.) Still about Section 6, handling reference priors on correlation matrices is a major endeavour, which should produce a steady flow of followers..!
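To make the idea of a prior on the distance from the base model concrete, here is a minimal sketch on a toy case chosen purely for illustration (it is not an example from the paper). It assumes the distance is d(ξ) = √(2 KLD) between the flexible and base models and that d receives an exponential prior with rate λ, then maps this back to ξ by a change of variables. The toy model N(μ,1) with base N(0,1), the rate value, the function names and the even split of mass between ±μ are all assumptions of this sketch.

```python
import numpy as np

def kld_normal_shift(mu):
    """Kullback-Leibler divergence from N(mu, 1) to the base model N(0, 1)."""
    return 0.5 * mu ** 2

def distance(mu):
    """Distance from the base model, taken here as d = sqrt(2 * KLD)."""
    return np.sqrt(2.0 * kld_normal_shift(mu))

def pc_prior_density(mu, lam=1.0, eps=1e-6):
    """Exponential(lam) prior on d(mu), mapped back to mu by change of variables."""
    # |dd/dmu| approximated by central finite differences
    jac = np.abs(distance(mu + eps) - distance(mu - eps)) / (2.0 * eps)
    # factor 1/2 (an assumption of this sketch): +mu and -mu share the same distance
    return 0.5 * lam * np.exp(-lam * distance(mu)) * jac

# Even number of grid points so that the kink at mu = 0 is not a grid point
grid = np.linspace(-30.0, 30.0, 600000)
dens = pc_prior_density(grid, lam=1.0)
# Trapezoidal rule to check that the induced prior on mu integrates to one
total_mass = float(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)))
print(total_mass)
```

In this one-dimensional toy case the distance reduces to |μ| and the induced prior is a double-exponential centred at the base model, which is one way of seeing the shrinkage towards the simpler model. With two or more hyperparameters the same recipe only gives a prior on the distance, not on ξ itself, which is the point raised above.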

“The current practice of prior specification is, to be honest, not in a good shape. While there has been a strong growth of Bayesian analysis in science, the research field of “practical prior specification” has been left behind.” (p.23)

There are still quantities to specify and calibrate in the PC priors, which may actually be deemed a good thing by Bayesians (and some modellers). But overall I think this paper and its message constitute a terrific step for Bayesian statistics and I hope the paper can make it to a major journal.


Filed under: Books, Mountains, pictures, Statistics, University life Tagged: Banff, default prior, Fisher information, ISBA, Jeffreys priors, Kullback-Leibler divergence, model complexity, noninformative priors, O'Bayes, penalisation, Riemann manifold
Categories: Bayesian Bloggers

Bayesian Data Analysis [BDA3 - part #2]

Xian's Og - Sun, 2014-03-30 18:14

Here is the second part of my review of Gelman et al.’s Bayesian Data Analysis (third edition):

“When an iterative simulation algorithm is “tuned” (…) the iterations will not in general converge to the target distribution.” (p.297)

Part III covers advanced computation, obviously including MCMC but also model approximations like variational Bayes and expectation propagation (EP), with even a few words on ABC. The novelties in this part are centred on Stan, the language Andrew is developing around Hamiltonian Monte Carlo techniques, a sort of BUGS of the 10′s! (And of course Hamiltonian Monte Carlo techniques themselves.) A few (nit)pickings: the book advises importance resampling without replacement (p.266) which makes some sense when using a poor importance function but ruins the fundamentals of importance sampling. Plus, no trace of infinite variance importance sampling? of harmonic means and their dangers? In the Metropolis-Hastings algorithm, the proposal is called the jumping rule and denoted by Jt, which, besides giving the impression of a Jacobian, seems to allow for time-varying proposals and hence time-inhomogeneous Markov chains, whose convergence properties are much hairier. (The warning comes much later, as exemplified in the above quote.) Moving from “burn-in” to “warm-up” to describe the beginning of an MCMC simulation. Being somewhat 90′s about convergence diagnoses (as shown by the references in Section 11.7), although the book also proposes new diagnoses and relies much more on effective sample sizes. Particle filters are evacuated in hardly half-a-page. Maybe because Stan does not handle particle filters. A lack of intuition about the Hamiltonian Monte Carlo algorithms, as the book plunges immediately into a two-page pseudo-code description. Still using physics vocabulary that put me (and maybe only me) off. Although I appreciated the advice to check analytical gradients against their numerical counterpart.
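As an illustration of that last piece of advice, here is a minimal sketch of a gradient check, comparing a hand-derived gradient of a log-density with a central finite-difference approximation. The toy target (a standard bivariate normal), the function names and the step size are assumptions made for this example, not taken from the book.

```python
import numpy as np

def log_density(theta):
    """Unnormalised log-density of a standard bivariate normal (toy target)."""
    return -0.5 * np.sum(theta ** 2)

def analytic_gradient(theta):
    """Hand-derived gradient of log_density."""
    return -theta

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite-difference approximation to the gradient of f at theta."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

theta0 = np.array([0.3, -1.2])
# Largest componentwise discrepancy between the two gradients
max_abs_error = float(np.max(np.abs(
    analytic_gradient(theta0) - numerical_gradient(log_density, theta0))))
print(max_abs_error)
```

A discrepancy well above the finite-difference error at a few test points is usually a sign of an algebra slip in the analytic gradient, which is worth catching before running a Hamiltonian Monte Carlo sampler that relies on it.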

“In principle there is no limit to the number of levels of variation that can be handled in this way. Bayesian methods provide ready guidance in handling the estimation of the unknown parameters.” (p.381)

I also enjoyed reading the part about modes that stand at the boundary of the parameter space (Section 13.2), even though I do not think modes are great summaries in Bayesian frameworks and while I do not see how picking the prior to avoid modes at the boundary avoids the data impacting the prior, in fine. The variational Bayes section (13.7) is equally enjoyable, with a proper spelled-out illustration, introducing an unusual feature for Bayesian textbooks.  (Except that sampling without replacement is back!) Same comments for the Expectation Propagation (EP) section (13.8) that covers brand new notions. (Will they stand the test of time?!)

“Geometrically, if β-space is thought of as a room, the model implied by classical model selection claims that the true β has certain prior probabilities of being in the room, on the floor, on the walls, in the edge of the room, or in a corner.” (p.368)

Part IV is a series of five chapters about regression(s). This is somewhat of a classic, nonetheless  Chapter 14 surprised me with an elaborate election example that dabbles in advanced topics like causality and counterfactuals. I did not spot any reference to the g-prior or to its intuitive justifications and the chapter mentions the lasso as a regularisation technique, but without any proper definition of this “popular non-Bayesian form of regularisation” (p.368). In French: with not a single equation! Additional novelty may lie in the numerical prior information about the correlations. What is rather crucially (cruelly?) missing though is a clearer processing of variable selection in regression models. I know Andrew opposes any notion of a coefficient being exactly equal to zero, as ridiculed through the above quote, but the book does not reject model selection, so why not in this context?! Chapter 15 on hierarchical extensions stresses the link with exchangeability, once again. With another neat election example justifying the progressive complexification of the model and the cranks and toggles of model building. (I am not certain the reparameterisation advice on p.394 is easily ingested by a newcomer.) The chapters on robustness (Chap. 17) and missing data (Chap. 18) sound slightly less convincing to me, esp. the one about robustness as I never got how to make robustness agree with my Bayesian perspective. The book states “we do not have to abandon Bayesian principles to handle outliers” (p.436), but I would object that the Bayesian paradigm compels us to define an alternative model for those outliers and the way they are produced. One can always resort to a drudging exploration of which subsample of the dataset is at odds with the model but this may be unrealistic for large datasets and further tells us nothing about how to handle those datapoints. 
The missing data chapter is certainly relevant to such a comprehensive textbook and I liked the survey illustration where the missing data was in fact made of missing questions. However, I felt the multiple imputation part was not well-presented, fearing readers would not understand how to handle it…

“You can use MCMC, normal approximation, variational Bayes, expectation propagation, Stan, or any other method. But your fit must be Bayesian.” (p.517)

Part V concentrates the most advanced material, with Chapter 19 being mostly an illustration of a few complex models, slightly superfluous in my opinion, Chapter 20 a very short introduction to functional bases, including a basis selection section (20.2) that implements the “zero coefficient” variable selection principle refuted in the regression chapter(s), and does not go beyond splines (what about wavelets?), Chapter 21 a (quick) coverage of Gaussian processes with the motivating birth-date example (and two mixture datasets I used eons ago…), Chapter 22 a more (too much?) detailed study of finite mixture models, with no coverage of reversible-jump MCMC, and Chapter 23 an entry on Bayesian non-parametrics through Dirichlet processes.

“In practice, for well separated components, it is common to remain stuck in one labelling across all the samples that are collected. One could argue that the Gibbs sampler has failed in such a case.” (p.535)

To get back to mixtures, I liked the quote about the label switching issue above, as I was “one” who argued that the Gibbs sampler fails to converge! The corresponding section seems to favour providing a density estimate for mixture models, rather than component-wise evaluations, but it nonetheless mentions the relabelling by permutation approach (if missing our 2000 JASA paper). The section about inferring on the unknown number of components suggests conducting a regular Gibbs sampler on a model with an upper bound on the number of components and then checking for empty components, an idea I (briefly) considered in the mid-1990′s before the occurrence of RJMCMC. Of course, the prior on the components matters and the book suggests using a Dirichlet with fixed sum like 1 on the coefficients for all numbers of components.

“14. Objectivity and subjectivity: discuss the statement ‘People tend to believe results that support their preconceptions and disbelieve results that surprise them. Bayesian methods tend to encourage this undisciplined mode of thinking.’” (p.100)

Obviously, this being a third edition begets the question, what’s up, doc?!, i.e., what’s new [when compared with the second edition]? Quite a lot, even though I am not enough of a Gelmanian exegete to produce a comparison table. Well, for starters, David Dunson and Aki Vehtari joined the authorship, mostly contributing to the advanced section on non-parametrics, Gaussian processes, EP algorithms. Then the Hamiltonian Monte Carlo methodology and Stan of course, which is now central to Andrew’s interests. The book does include a short Appendix on running computations in R and in Stan. Further novelties were mentioned above, like the vision of weakly informative priors taking over noninformative priors but I think this edition of Bayesian Data Analysis puts more stress on clever and critical model construction and on the fact that it can be done in a Bayesian manner. Hence the insistence on predictive and cross-validation tools. The book may be deemed somewhat short on exercises, providing between 3 and 20 mostly well-developed problems per chapter, often associated with datasets, rather than the less exciting counter-example above. Even though Andrew disagrees and his students at ENSAE this year certainly did not complain, I personally feel a total of 220 exercises is not enough for instructors and self-study readers. (At least, this reduces the number of email requests for solutions! Esp. when 50 of those are solved on the book website.) But this aspect is a minor quibble: overall this is truly the reference book for a graduate course on Bayesian statistics and not only Bayesian data analysis.


Filed under: Books, Kids, R, Statistics, University life Tagged: Andrew Gelman, Bayesian data analysis, Bayesian model choice, Bayesian predictive, finite mixtures, graduate course, hierarchical Bayesian modelling, rats, STAN
Categories: Bayesian Bloggers

speed [quick book review]

Xian's Og - Sat, 2014-03-29 19:14

Ueli Steck is a Swiss alpinist who climbed solo the three “last” north face routes of the Alps (Eiger, Jorasses, and Cervino/Matterhorn) in the record times of 2:47, 2:27, and 1:56… He also recently climbed Annapurna in 27 hours from base camp, again solo and with no oxygen. (Which led some to doubt his record time as he had lost his camera on the way.) A climb for which he got one of the 2014 Piolets d’Or. (In connection with this climb, he also faced death threats from the sherpas installing fixed ropes on Everest as reported in an earlier post.) He wrote a book called Speed, where he described how he managed the three above records in a rather detailed way. (It is published in German, Italian and French, the three major languages of the Swiss Confederation, but apparently not in English.) The book reads fast as well but it should not be very appealing to non-climbers as it concentrates mostly on the three climbs and their difficulties. The book also contains three round-tables between Messner and Steck, Bonatti and Steck, and Profit and Steck, which are of some further interest. The most fascinating part in the book is when he describes deciding to go completely free, forsaking existing protection and hence any survival opportunity were he to fall. When looking at the level of the places he climbed, this sounds to me like an insane Russian roulette, even with a prior reconnaissance of the routes (not in the Jorasses where he even climbed on-sight). I also liked the recollection of his gift of an Eiger Nordwand climb with his wife for her birthday! (I am unsure any spouse would appreciate such a gift to the same extent!) The book concludes with Steck envisioning moving away from those speed solos and towards other approaches to climbing and mountains…

As a coincidence, I also watched the film documentary Messner on Arte. A very well-done docu-fiction with reconstitutions of some of the most impressive climbs of Messner in the Alps and the Himalayas… Like the solo climb of the north face of Les Droites. With a single icepick. The film is also an entry into what made Messner the unique climber he is, from a very strict family environment to coping with the literal loss of his brother Guenther on the Nanga Parbat. With a testimony from his companion to the traverse by ski of the North Pole who saw Messner repeatedly calling him Guenther under stress.


Filed under: Books, Mountains, Running Tagged: Annapurna, Cervino, Eiger, Everest, Grandes Jorasses, Matterhorn, north faces, Reinhold Messner, speed climbing, Ueli Steck, Walter Bonatti
Categories: Bayesian Bloggers

ski with deviation

Xian's Og - Fri, 2014-03-28 19:14

I just learned that a micro-brew brand of homemade skis has connections with statistics and, who knows, could become a sponsor to the next MCMSki… Indeed, the brand is called deviation (as in standard deviation), is located in Gresham, Oregon, and sells locally made skis and snowboards with names like The Moment Generator or The Mode! The logo clearly indicates a statistical connection:

As it happens, two of the founding partners of deviation, Tim and Peter Wells, are the sons of my long-time friend Marty Wells from Cornell University. When I first met them, they were great kids, young enough to give no inkling they would end up producing beautiful hardwood core skis in a suburb of Portland, Oregon!!! Best wishes to them and to deviation, the most statistical of all ski brands! (Here is a report in The Oregonian that tells the story of how deviation was created.)


Filed under: Kids, Mountains, pictures, Travel Tagged: Cornell University, deviation, Gresham, homemade skis, Ithaca, MCMSki, Oregon, ski brand

Categories: Bayesian Bloggers

Bayesian Data Analysis [BDA3]

Xian's Og - Thu, 2014-03-27 19:14

Andrew Gelman and his coauthors, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Don Rubin, have now published the latest edition of their book Bayesian Data Analysis. David and Aki are newcomers to the authors’ list, with an extended section on non-linear and non-parametric models. I have been asked by Sam Behseta to write a review of this new edition for JASA (since Sam is now the JASA book review editor). After wondering about my ability to produce an objective review (on the one hand, this is The Competition to Bayesian Essentials!, on the other hand Andrew is a good friend spending the year with me in Paris), I decided to jump for it and write a most subjective review, with the help of Clara Grazian who was Andrew’s teaching assistant this year in Paris and maybe some of my Master students who took Andrew’s course. The second edition was reviewed in the September 2004 issue of JASA and we now stand ten years later with an even more impressive textbook. Which is truly what Bayesian data analysis should be.

This edition has five parts, Fundamentals of Bayesian Inference, Fundamentals of Bayesian Data Analysis, Advanced Computation, Regression Models, and Non-linear and Non-parametric Models, plus three appendices. For a total of xiv+662 pages. And a weight of 2.9 pounds (1395g on my kitchen scale!) that makes it hard to carry around in the metro…. I took it to Warwick (and then Nottingham and Oxford and back to Paris) instead.

“We could avoid the mathematical effort of checking the integrability of the posterior density (…) The result would clearly show the posterior contour drifting off toward infinity.” (p.111)

While I cannot go into a detailed reading of those 662 pages (!), I want to highlight a few gems. (I already wrote a detailed and critical analysis of Chapter 6 on model checking in that post.) The very first chapter provides all the necessary items for understanding Bayesian Data Analysis without getting bogged down in propaganda or pseudo-philosophy. Then the other chapters of the first part unroll in a smooth way, cruising on the B highway… With the unique feature of introducing weakly informative priors (Sections 2.9 and 5.7), like the half-Cauchy distribution on scale parameters. It may not be completely clear how weak a weakly informative prior is, but this novel notion is worth including in a textbook. Maybe a mild reproach at this stage: Chapter 5 on hierarchical models is too verbose for my taste, as it essentially focuses on the hierarchical linear model. Of course, this is an essential chapter as it links exchangeability, the “atom” of Bayesian reasoning used by de Finetti, with hierarchical models. Still. Another comment on that chapter: it broaches the topic of improper posteriors by suggesting to run a Markov chain that can exhibit improperness by enjoying an improper behaviour. When it happens as in the quote above, fine!, but there is no guarantee this is always the case! For instance, improperness may be due to regions near zero rather than infinity. And a last barb: there is a dense table (Table 5.4, p.124) that seems to run contrariwise to Andrew’s avowed dislike of tables. I could also object to the idea of a “true prior distribution” (p.128), or comment on the trivia that hierarchical chapters seem to attract rats (as I also included a rat example in the hierarchical Bayes chapter of Bayesian Choice and so does the BUGS Book! Hence, a conclusion that Bayesian textbooks are better avoided by muriphobiacs…)

“Bayes factors do not work well for models that are inherently continuous (…) Because we emphasize continuous families of models rather than discrete choices, Bayes factors are rarely relevant in our approach to Bayesian statistics.” (p.183 & p.193)

Part II is about “the creative choices that are required, first to set up a Bayesian model in a complex problem, then to perform the model checking and confidence building that is typically necessary to make posterior inferences scientifically defensible” (p.139). It is certainly one of the strengths of the book that it allows for a critical look at models and tools that are rarely discussed in more theoretical Bayesian books. As detailed in my earlier post on Chapter 6, model checking is strongly advocated, via posterior predictive checks and… posterior predictive p-values, which are at best empirical indicators that something could be wrong, definitely not that everything’s all right! Chapter 7 is the model comparison equivalent of Chapter 6, starting with the predictive density (aka the evidence or the marginal likelihood), but completely bypassing the Bayes factor for information criteria like the Watanabe-Akaike or widely available information criterion (WAIC), and advocating cross-validation, which is empirically satisfying but formally hard to integrate within a full Bayesian perspective. Chapter 8 is about data collection, sample surveys, randomization and related topics, another entry that is missing from most Bayesian textbooks, maybe not that surprising given the research topics of some of the authors. And Chapter 9 is the symmetric counterpart in that it focuses on the post-modelling step of decision making.

(Second part of the review to appear on Monday, leaving readers the weekend to recover!)


Filed under: Books, Kids, R, Statistics, University life Tagged: Andrew Gelman, Bayesian data analysis, Bayesian model choice, Bayesian predictive, finite mixtures, graduate course, hierarchical Bayesian modelling, rats, STAN
Categories: Bayesian Bloggers

¼th i-like workshop in St. Anne’s College, Oxford

Xian's Og - Wed, 2014-03-26 19:09

Due to my previous travelling to and from Nottingham for the seminar and back home early enough to avoid the dreary evening trains from Roissy airport (no luck there, even at 8pm, the RER train was not operating efficiently!, and no fast lane is planned prior to 2023…), I did not see many talks at the i-like workshop. About ¼th, roughly… I even missed the poster session (and the most attractive title of Lazy ABC by Dennis Prangle) thanks to another dreary train ride from Derby to Oxford.

As it happened I had already heard or read parts of the talks in the Friday morning session, but this made them easier to understand. As in Banff, Paul Fearnhead‘s talk on reparameterisations for pMCMC on hidden Markov models opened a wide door to possible experiments on those algorithms. The examples in the talk were mostly of the parameter duplication type, somewhat creating unidentifiability to decrease correlation, but I also wondered at the possibility of introducing frequent replicas of the hidden chain in order to fight degeneracy. Then Sumeet Singh gave a talk on the convergence properties of noisy ABC for approximate MLE. Although I had read some of the papers behind the talk, it made me realise how keeping balls around each observation in the ABC acceptance step was not leading to extinction as the number of observations increased. (Sumeet also had a good line with his ABCDE algorithm, standing for ABC done exactly!) Anthony Lee covered his joint work with Krys Łatuszyński on the ergodicity conditions on the ABC-MCMC algorithm, the only positive case being the 1-hit algorithm as discussed in an earlier post. This result will hopefully get more publicity, as I frequently read that increasing the number of pseudo-samples has no clear impact on the ABC approximation. Krys Łatuszyński concluded the morning with an aggregate of the various results he and his co-authors had obtained on the fascinating Bernoulli factory. Including constructive derivations.
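For readers unfamiliar with the Bernoulli factory problem (turning a coin with unknown success probability p into a coin with success probability f(p)), here is a minimal sketch of the simplest constructive case, von Neumann's trick for f(p) = 1/2. It is not one of the constructions from the talk. The function names, the seed and the value p = 0.2 are assumptions of this example, and p is only passed along to simulate the input coin, since the factory itself never uses its value.

```python
import random

def biased_coin(p, rng):
    """Simulate one flip of the input coin; the factory never looks at p itself."""
    return rng.random() < p

def von_neumann_fair_coin(p, rng):
    """Flip the p-coin in pairs until the two flips differ, then return the first."""
    while True:
        first, second = biased_coin(p, rng), biased_coin(p, rng)
        if first != second:
            # (heads, tails) and (tails, heads) both have probability p(1-p)
            return first

rng = random.Random(2014)
n = 20000
# Empirical frequency of heads for the output coin
freq = sum(von_neumann_fair_coin(0.2, rng) for _ in range(n)) / n
print(freq)
```

The number of input flips is random, with an expected 1/(p(1−p)) flips per output, which hints at why more general functions f need much more elaborate constructions.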

After a few discussions on and around research topics, it was too soon time to take advantage of the grand finale of a March shower to walk from St. Anne’s College to Oxford Station, in order to start the trip back home. I was lucky enough to find a seat and could start experimenting in R the new idea my trip to Nottingham had raised! While discussing a wee bit with my neighbour, a delightful old lady from the New Forest travelling to Coventry, recovering from a brain seizure, wondering about my LaTeX code syntax despite the tiny fonts, and who most suddenly popped a small screen from her bag to start playing Candy Crush!, apologizing all the same. The overall trip was just long enough for my R code to validate this idea of mine, making this week in England quite a profitable one!!! 


Filed under: pictures, Statistics, Travel, University life Tagged: ABC-MCMC, ABC-SMC, Bernouilli factory, Derby, HMM, i-like, Nottingham, pMCMC, St. Anne's College, University of Oxford, University of Warwick
Categories: Bayesian Bloggers