Bayesian News Feeds
When in Warwick last October, I met Simo Särkkä, who told me he had published an IMS monograph on Bayesian filtering and smoothing the year before. I thought it would be an appropriate book to review for CHANCE and tried to get a copy from Oxford University Press, unsuccessfully. I thus bought my own copy, which I received two weeks ago, and took the opportunity of my Czech vacations to read it… [A warning pre-empting accusations of self-plagiarism: this is a preliminary draft for a review to appear in CHANCE under my true name!]
“From the Bayesian estimation point of view both the states and the static parameters are unknown (random) parameters of the system.” (p.20)
Bayesian filtering and smoothing is an introduction to the topic that essentially starts from ground zero. Chapter 1 motivates the use of filtering and smoothing through examples and highlights the naturally Bayesian approach to the problem(s). Two graphs illustrate the difference between filtering and smoothing by plotting, for the same series of observations, the successive confidence bands. The performance is obviously poorer with filtering, but it should be stressed that those intervals are point-wise rather than joint, i.e., that the graphs do not provide a genuine confidence band. (The exercise section of that chapter is superfluous in that it suggests re-reading Kalman’s original paper and rephrases the Monty Hall paradox in a story unconnected with filtering!) Chapter 2 gives an introduction to Bayesian statistics in general, with a few pages on Bayesian computational methods. A first remark is that the above quote is both correct and mildly confusing, in that the parameters can be consistently estimated while the latent states cannot. A second remark is that justifying the MAP as associated with the 0-1 loss is incorrect in continuous settings. The third chapter deals with the batch updating of the posterior distribution, i.e., the fact that the posterior at time t serves as the prior at time t+1, with applications to state-space systems including the Kalman filter. The fourth to sixth chapters concentrate on this Kalman filter and its extensions, and I find them somewhat unsatisfactory in that the collection of such filters is overwhelming for a neophyte. No assessment of the estimation error when the model is misspecified appears at this stage. And, as usual, I find the unscented Kalman filter hard to fathom! The same feeling applies to the smoothing chapters, Chapters 8 to 10, which mimic the earlier ones.
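For readers new to the area, the predict-update recursion at the heart of the Kalman filter fits in a few lines. The sketch below is my own minimal one-dimensional illustration (a random walk observed in Gaussian noise, with made-up noise values), not code from the book:

```python
import numpy as np

def kalman_filter_1d(y, q, r, m0=0.0, p0=1.0):
    """Kalman filter for x_t = x_{t-1} + N(0,q), y_t = x_t + N(0,r).
    The filtering posterior at time t becomes the prior at time t+1."""
    m, p = m0, p0
    means, variances = [], []
    for obs in y:
        # predict: propagate the last posterior through the dynamics
        m_pred, p_pred = m, p + q
        # update: condition the prediction on the new observation
        k = p_pred / (p_pred + r)          # Kalman gain
        m = m_pred + k * (obs - m_pred)
        p = (1 - k) * p_pred
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0, 0.1, 100))     # latent random walk
y = x + rng.normal(0, 0.5, 100)            # noisy observations
m, p = kalman_filter_1d(y, q=0.01, r=0.25)
```

The filtered means track the latent state much more closely than the raw observations do, while the point-wise variances p quantify the remaining (marginal, not joint) uncertainty.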
“The degeneracy problem can be solved by a resampling procedure.” (p.123)
By comparison, the seventh chapter, on particle filters, appears too introductory from my biased perspective. For instance, the above motivation for resampling in sequential importance (re)sampling is not clear enough: as stated, it sounds too much like a trick, with no mention of the fast decrease in the number of first-generation ancestors as the number of generations grows, and thus of the need either to increase the number of particles fast enough or to check for quick forgetting. Chapter 11 is the equivalent of the above for particle smoothing. I would have liked more details on the full posterior smoothing distribution, instead of the marginal posterior smoothing distribution at a given time t, and more of a discussion on the comparative merits of the different algorithms.
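To make the degeneracy issue concrete, here is a minimal bootstrap particle filter in Python, on the same toy random-walk model as above (my own choices of model and tuning values, not the book's), with resampling triggered by the effective sample size:

```python
import numpy as np

def bootstrap_pf(y, n_particles, q, r, rng):
    """Bootstrap particle filter for x_t = x_{t-1}+N(0,q), y_t = x_t+N(0,r).
    Resampling when the effective sample size drops fights weight degeneracy,
    at the price of depleting the pool of distinct first-generation ancestors."""
    x = rng.normal(0, 1, n_particles)
    w = np.full(n_particles, 1 / n_particles)
    means = []
    for obs in y:
        x = x + rng.normal(0, np.sqrt(q), n_particles)       # propagate
        w = w * np.exp(-0.5 * (obs - x) ** 2 / r)            # reweight
        w /= w.sum()
        means.append(np.sum(w * x))
        ess = 1.0 / np.sum(w ** 2)                           # effective sample size
        if ess < n_particles / 2:                            # degeneracy check
            idx = rng.choice(n_particles, n_particles, p=w)  # multinomial resampling
            x, w = x[idx], np.full(n_particles, 1 / n_particles)
    return np.array(means)

rng = np.random.default_rng(1)
x_true = np.cumsum(rng.normal(0, 0.3, 50))
y = x_true + rng.normal(0, 0.5, 50)
est = bootstrap_pf(y, n_particles=1000, q=0.09, r=0.25, rng=rng)
```

Tracing the ancestors of the final particles back through the resampling steps would show how fast the number of distinct first-generation ancestors collapses.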
Chapter 12 is much longer than the other chapters as it caters to the much more realistic issue of parameter estimation. The chapter borrows at times from Cappé, Moulines and Rydén (2007), where I contributed to the Bayesian estimation chapter. This is actually the first time in Bayesian filtering and smoothing that MCMC is mentioned, including references to adaptive MCMC and HMC. The chapter also covers some EM versions, as well as pMCMC à la Andrieu et al. (2010), although a picture like Fig. 12.2 seems to convey the message that this particle MCMC approach is actually quite inefficient.
“An important question (…) which of the numerous methods should I choose?”
The book ends with an Epilogue (Chapter 13), suggesting to use (Monte Carlo) sampling only after all other methods have failed, which implies assessing that those methods have indeed failed. Maybe the suggestion of first running what seems like the most appropriate method on synthetic data (rather than the real data) could be included; for one thing, it does not add much to the computing cost. All in all, and despite some criticisms voiced above, I find the book quite a handy and compact introduction to the field, albeit slightly terse for an undergraduate audience.
Filed under: Books, Statistics, Travel, University life Tagged: book review, CHANCE, EM algorithm, filtering, IMS Textbooks, Kalman filter, MAP estimators, particle filter, particle MCMC, plagiarism, Simo Särkkä, smoothing, The Monty Hall problem
Today was the final session of our Reading Classics Seminar for the academic year 2014-2015. I have not reported much on this seminar so far because it had starting problems, namely hardly any students present in the first classes and therefore several re-starts until we reached a small group of interested students. And this is truly The End for this enjoyable experiment, as this is the final year for my TSI Master at Paris-Dauphine, which will become integrated within the new MASH Master next year.
As a last presentation for the entire series, my student picked John Skilling’s Nested Sampling, not that it was in my list of “classics”, but he had worked on the paper in a summer project and was thus reasonably fluent with the topic. As he did a good enough job (!), here are his slides.
Some of the questions that came to me during the talk were on how to run nested sampling sequentially, both in the data and in the number of simulated points, and on incorporating more deterministic moves in order to remove some of the Monte Carlo variability. I was about to ask about (!) the Hamiltonian version of nested sampling but then he mentioned his last summer internship on this very topic! I also realised during that talk that the formula (for positive random variables)

E[X] = ∫₀^∞ P(X > x) dx

does not require absolute continuity of the distribution F.
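The identity E[X] = ∫₀^∞ P(X > x) dx is easy to check numerically, both for an absolutely continuous and for a purely discrete distribution (a toy check of mine, not from the talk):

```python
import numpy as np

# For a positive random variable X with cdf F, E[X] = ∫_0^∞ {1 - F(x)} dx,
# whether or not F is absolutely continuous.
x = np.linspace(0.0, 50.0, 2_000_001)
dx = x[1] - x[0]

def integral(tail):
    # trapezoidal rule, written out to stay NumPy-version agnostic
    return dx * (tail.sum() - 0.5 * (tail[0] + tail[-1]))

# continuous case: X ~ Exponential(rate 2), so E[X] = 0.5 and P(X>x) = exp(-2x)
cont = integral(np.exp(-2.0 * x))
# discrete case: X = 0 w.p. 0.3 and X = 2 w.p. 0.7, so E[X] = 1.4
disc = integral(np.where(x < 2.0, 0.7, 0.0))
```

Both numerical integrals match the exact expectations, even though the second distribution has no density.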
Filed under: Books, Kids, Statistics, University life Tagged: advanced Monte Carlo methods, classics, efficient importance sampling, evidence, Hamiltonian Monte Carlo, Monte Carlo Statistical Methods, nested sampling, seminar, slides, Université Paris Dauphine
Filed under: Kids, pictures, Travel Tagged: Czech Republic, Gothic cathedral, Prague, Prague Castle, Saint Vitus cathedral
Last week, Michael Betancourt, from Warwick, arXived a neat wee note on the fundamental difficulties in running HMC on a subsample of the original data. The core message is that using only a fraction of the data to run an HMC, with the hope that it will preserve the stationary distribution, does not work. The only way to remove the bias is to add a Metropolis-Hastings step using the whole data, a step that both kills most of the computing gain and has very low acceptance probabilities. Even the strategy that subsamples for each step within a single trajectory fails: there cannot be a significant gain in time without a significant bias in the outcome. Too bad..! Now, there are ways of accelerating HMC, for instance by parallelising the computation of gradients but, just as in any other approach (?), the information provided by the whole data is only available when looking at the whole data.
Filed under: Books, Statistics, University life Tagged: Bayesian computing, Hamiltonian Monte Carlo, leapfrog generator, limited information inference, Monte Carlo Statistical Methods, subsampling, University of Warwick
Filed under: Kids, pictures, Travel Tagged: baroque, church, Church of Saint Nicolas, Czech Republic, Kostel svatého Mikuláše, Prague
“I don’t trust my own intuition when an apparent coincidence occurs; I have to sit down and do the calculations to check whether it’s the kind of thing I might expect to occur at some time and place.” D. Spiegelhalter
I just read in The Guardian an article on the case of the nurse Benjamin Geen, whose 2006 conviction to 30 years in jail for the murder of two elderly patients relied on inappropriate statistical expertise. As for Sally Clark, the evidence was built around “unusual patterns” of deaths associated with a particular nurse, without taking into account the possible biases in building such patterns. The case against the 2006 expertise is based on reports by David Spiegelhalter, Norman Fenton, Stephen Senn and Sheila Bird, who constitute enough of a dream team to warrant a revision of the conviction. As put forward by Prof Fenton, “at least one hospital in the country would be expected to see this many events over a four-year period, purely by chance.”
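The order of magnitude behind such statements is simple to reproduce: if each of n hospitals independently records a Poisson number of events over the period, the chance that at least one of them shows an “unusual” cluster can be large even when each individual hospital is unlikely to. The figures below are purely hypothetical, not those of the Geen case:

```python
import math

def prob_some_hospital_sees(k, mu, n_hospitals):
    """P(at least one of n hospitals records >= k events over the period),
    assuming each hospital's count is Poisson(mu), independently."""
    p_lt_k = sum(math.exp(-mu) * mu**j / math.factorial(j) for j in range(k))
    return 1.0 - p_lt_k ** n_hospitals

# one given hospital is unlikely to see 18+ events when 10 are expected...
one = prob_some_hospital_sees(18, 10.0, 1)
# ...but across 150 hospitals such a cluster becomes more likely than not
some = prob_some_hospital_sees(18, 10.0, 150)
```

Selecting the nurse after spotting the cluster is exactly the multiple-comparisons bias the reports point at.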
Filed under: Statistics, University life Tagged: Benjamin Geen, David Spiegelhalter, Norman Fenton, Sally Clark, Sheila Bird, statistical evidence, Stephen Senn
Filed under: Mountains, pictures, Running, Travel Tagged: Czech Republic, Giant Mountains, Snow Crash, Špindlerův Mlýn
The slides are directly extracted from the paper but it still took me quite a while to translate the paper into those, during the early hours of our Czech break this week.
One added perk of travelling to Nice is the flight there, as it parallels the entire French Alps, a terrific view in nice weather!
Filed under: Books, Statistics, Travel, University life Tagged: Alps, ANR, Bayesian testing, calibration, finite mixtures, France, improper priors, Nice, objective Bayes
Filed under: Kids, pictures, Running, Travel Tagged: art nouveau, Czech Republic, Municipal Building, Obecní dům, Prague, Smetana Hall
When playing with Peter Rossi’s bayesm R package during a visit of Jean-Michel Marin to Paris last week, we came up with the above Gibbs outcome. The setting is a Gaussian mixture model with three components in dimension 5, and the prior distributions are standard conjugate ones. In this case, with 500 observations and 5,000 Gibbs iterations, the Markov chain (for one component of one mean of the mixture) has two highly distinct regimes: one that revolves around the true value of the parameter, 2.5, and one that explores a much broader area (and is associated with a much smaller value of the component weight). What we found amazing is the Gibbs sampler’s ability to entertain both regimes simultaneously.
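For readers wanting to experiment without bayesm, here is a stripped-down analogue of the kind of Gibbs sampler involved: a one-dimensional, two-component mixture with known unit variances and conjugate priors of my own choosing (so a sketch of the technique, not Rossi's implementation, and the regime-switching behaviour above need not reproduce on this simpler target):

```python
import numpy as np

rng = np.random.default_rng(2)
# synthetic data from a two-component mixture (weights .8/.2, means 2.5 and -1)
n = 500
z_true = rng.random(n) < 0.8
y = np.where(z_true, rng.normal(2.5, 1.0, n), rng.normal(-1.0, 1.0, n))

# Gibbs sampler: alternate allocations z | mu, w and parameters mu, w | z
n_iter = 2000
mu = np.array([0.0, 1.0])        # component means, N(0, 10^2) priors
w = 0.5                          # weight of component 0, Beta(1,1) prior
trace = np.empty((n_iter, 2))
for it in range(n_iter):
    # sample allocations given current parameters
    p0 = w * np.exp(-0.5 * (y - mu[0]) ** 2)
    p1 = (1 - w) * np.exp(-0.5 * (y - mu[1]) ** 2)
    z = rng.random(n) < p1 / (p0 + p1)        # True means component 1
    # sample the weight given allocation counts
    n1 = int(z.sum()); n0 = n - n1
    w = rng.beta(1 + n0, 1 + n1)
    # sample the means (conjugate normal updates, unit data variance)
    for j, (nj, sj) in enumerate([(n0, y[~z].sum()), (n1, y[z].sum())]):
        post_var = 1.0 / (nj + 1.0 / 100)
        mu[j] = rng.normal(post_var * sj, np.sqrt(post_var))
    trace[it] = mu
```

Plotting trace for the mean attached to the smaller-weight component is where the broader-regime excursions would show up when the weight collapses towards zero.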
Filed under: Books, pictures, R, Statistics, University life Tagged: bayesm, convergence assessment, Gibbs sampler, Jean-Michel Marin, Markov chain Monte Carlo, mixtures, R
Filed under: Mountains, pictures, Travel Tagged: Bohemia, Czech Republic, Giant Mountains, Krkonošská magistrála, sunset, vacations, Špindlerův Mlýn
When preparing my OxWaSP projects a few weeks ago, I came perchance upon a set of slides entitled “Hierarchical models are not Bayesian”, written by Brian Dennis (University of Idaho), where the author argues against Bayesian inference in hierarchical models in ecology, much in line with the previously discussed paper of Subhash Lele. The argument is the same, namely a possibly major impact of the prior modelling on the resulting inference, in particular when some parameters are hardly identifiable, all the more so when the model is complex and there are many parameters. And the claim that, “data cloning” being available since 2007, frequentist methods have “caught up” with Bayesian computational abilities.
Let me remind the reader that “data cloning” means constructing a sequence of Bayes estimators corresponding to the data being duplicated (or cloned) once, twice, &tc., until the point estimator stabilises. Since this corresponds to using increasing powers of the likelihood, the posteriors concentrate more and more around the maximum likelihood estimator, and even recover the Hessian matrix. This technique is actually older than 2007, since I proposed it in the early 1990’s under the name of prior feedback, with earlier occurrences in the literature like D’Epifanio (1989) and even the discussion of Aitkin (1991). A more efficient version of this approach is the SAME algorithm we developed in 2002 with Arnaud Doucet and Simon Godsill, where the power of the likelihood is increased during iterations in a simulated annealing fashion (with a preliminary version to be found in Duflo, 1996).
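Data cloning can be illustrated on the simplest possible case, a Beta-Binomial toy model (my own example, not from the slides): powering the likelihood k times drives the posterior mean to the MLE and the posterior variance to zero.

```python
# s successes out of n trials with a Beta(3,3) prior: cloning the data k times
# gives a Beta(3 + k*s, 3 + k*(n - s)) posterior, concentrating at the MLE s/n.
s, n = 7, 10                                   # MLE = 0.7
means, variances = [], []
for k in [1, 10, 100, 1000]:
    a, b = 3.0 + k * s, 3.0 + k * (n - s)
    means.append(a / (a + b))
    variances.append(a * b / ((a + b) ** 2 * (a + b + 1)))
```

Here the cloned posterior is available in closed form; in realistic models each power of the likelihood has to be handled by MCMC, which is where the simulated annealing difficulties discussed below enter.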
I completely agree with the author that a hierarchical model does not have to be Bayesian: when the random parameters in the model are analysed as sources of additional variation, as for instance in animal breeding or ecology, and integrated out, the resulting model can be analysed by any statistical method. Even though one may wonder at the motivations for selecting this particular randomness structure in the model, and at the increasing blurring between prior modelling and sampling modelling as the number of levels in the hierarchy goes up. This rather amusing set of slides somewhat misses a few points, in particular the inability of data cloning to overcome identifiability and multimodality issues. Indeed, as with all simulated annealing techniques, there is a practical difficulty in avoiding the fatal attraction of a local mode when using MCMC techniques; there is thus a high chance that data cloning ends up in the “wrong” mode. Moreover, when the likelihood is multimodal, it is a general issue to decide which of the modes is most relevant for inference. In which sense is the MLE more objective than a Bayes estimate, then? Further, the impact of a prior on some aspects of the posterior distribution can be tested by re-running a Bayesian analysis with different priors, including empirical Bayes versions or, why not?!, data cloning, in order to understand where and why huge discrepancies occur. This is part of model building, in the end.
Filed under: Books, Kids, Statistics, University life Tagged: Bayes estimators, Bayesian foundations, data cloning, Idaho, maximum likelihood estimation, prior feedback, SAME algorithm, simulated annealing
Filed under: Mountains, Travel Tagged: Czech Republic, Giant Mountains, Krkonošská magistrála, ski resorts, sunset, vacations, Špindlerův Mlýn
Bayesian optimization for likelihood-free inference of simulator-based statistical models [guest post]
Here are some comments on the paper of Gutmann and Corander. My brief skim through the paper concentrated on its second half, the applied methodology, so my comments should be quite complementary to Christian’s on the theoretical part!
ABC algorithms generally follow the template of proposing parameter values, simulating datasets and accepting/rejecting/weighting the results based on similarity to the observations. The output is a Monte Carlo sample from a target distribution, an approximation to the posterior. The most naive proposal distribution for the parameters is simply the prior, but this is inefficient if the prior is highly diffuse compared to the posterior. MCMC and SMC methods can be used to provide better proposal distributions. Nevertheless they often still seem quite inefficient, requiring repeated simulations in parts of parameter space which have already been well explored.
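As a reference point for the efficiency discussion, the naive version of this template fits in a dozen lines; the toy normal-mean model, prior scale and acceptance threshold below are arbitrary choices of mine:

```python
import numpy as np

def abc_rejection(y_obs, n_draws, eps, rng):
    """Naive ABC: propose theta from the prior, simulate a dataset,
    keep theta when the simulated summary is within eps of the observed one."""
    s_obs = y_obs.mean()                        # summary statistic
    kept = []
    for _ in range(n_draws):
        theta = rng.normal(0, 10)               # diffuse prior, hence inefficient
        y_sim = rng.normal(theta, 1, y_obs.size)
        if abs(y_sim.mean() - s_obs) < eps:
            kept.append(theta)
    return np.array(kept)

rng = np.random.default_rng(3)
y_obs = rng.normal(2.0, 1.0, 25)
post = abc_rejection(y_obs, n_draws=20000, eps=0.2, rng=rng)
```

The acceptance rate is on the order of a percent here, precisely because the diffuse prior keeps proposing parameter values the data have already ruled out; MCMC, SMC and the surrogate-model approach of this paper all aim at avoiding that waste.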
The strategy of this paper is to instead attempt to fit a non-parametric model to the target distribution (or in fact to a slight variation of it). Hopefully this will require many fewer simulations. This approach is quite similar to Richard Wilkinson’s recent paper. Richard fitted a Gaussian process to the ABC analogue of the log-likelihood. Gutmann and Corander introduce two main novelties:
- They model the expected discrepancy (i.e., distance) Δθ between the simulated and observed summary statistics. This is then transformed to estimate the likelihood, in contrast to Richard, who transformed the discrepancy before modelling it, following the standard ABC approach of weighting the discrepancy depending on how close to 0 it is. The drawback of the latter approach is that it requires picking a tuning parameter (the ABC acceptance threshold or bandwidth) in advance of the algorithm. The new approach still requires a tuning parameter but its choice can be delayed until the transformation is performed.
- They generate the θ values on-line using “Bayesian optimisation”. The idea is to pick θ to concentrate on the region near the minimum of the objective function, and also to reduce uncertainty in the Gaussian process. Thus well explored regions can usually be neglected. This is in contrast to Richard who chose θs using space filling design prior to performing any simulations.
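The two ingredients can be sketched together: a GP fitted to raw discrepancies plus a lower-confidence-bound acquisition rule for the next θ. This is a schematic stand-in with my own simplifications (fixed kernel hyperparameters, grid search instead of a proper optimiser, LCB instead of the paper's criterion in Equation (45)), not the authors' algorithm:

```python
import numpy as np

def rbf(a, b, scale=1.0, ell=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(theta_tr, d_tr, theta_grid, noise=1e-4):
    """GP regression of observed discrepancies on parameter values."""
    K = rbf(theta_tr, theta_tr) + noise * np.eye(len(theta_tr))
    Ks = rbf(theta_grid, theta_tr)
    mean = Ks @ np.linalg.solve(K, d_tr)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

def next_theta(theta_tr, d_tr, theta_grid, kappa=2.0):
    """Lower-confidence-bound acquisition: favour small expected
    discrepancy (near the likelihood mode) and large GP uncertainty."""
    mean, var = gp_posterior(theta_tr, d_tr, theta_grid)
    return theta_grid[np.argmin(mean - kappa * np.sqrt(var))]

# toy discrepancy: noisy distance between simulated and observed summaries,
# minimised at theta = 1.5
rng = np.random.default_rng(4)
disc = lambda t: abs(t - 1.5) + 0.05 * rng.normal()
theta_tr = np.array([-2.0, 0.0, 2.0])
d_tr = np.array([disc(t) for t in theta_tr])
grid = np.linspace(-3, 3, 301)
for _ in range(10):                      # sequential, on-line design loop
    t_new = next_theta(theta_tr, d_tr, grid)
    theta_tr = np.append(theta_tr, t_new)
    d_tr = np.append(d_tr, disc(t_new))
```

The acquisition rule is what lets well-explored regions be neglected: once the GP is confident a region has a large discrepancy, no further simulations are spent there.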
I didn’t read the paper’s theory closely enough to decide whether (1) is a good idea, although the results for the paper’s examples certainly look convincing. Also, one issue with Richard’s approach was that, because the log-likelihood varied over such a wide range of magnitudes, he needed to fit several “waves” of GPs. It would be nice to know whether the approach of modelling the discrepancy has removed this problem, or whether a single GP is still sometimes an insufficiently flexible model.
Novelty (2) is a very nice and natural approach to take here. I did wonder why the particular criterion in Equation (45) was used to decide on the next θ. Does this correspond to optimising some information theoretic quantity? Other practical questions were whether it’s possible to parallelise the method (I seem to remember talking to Michael Gutmann about this at NIPS but can’t remember his answer!), and how well the approach scales up with the dimension of the parameters.
Filed under: Books, Statistics, University life Tagged: ABC, arXiv, Dennis Prangle, dimension curse, Gaussian processes, guest post, NIPS, nonparametric probability density estimation
Filed under: Kids, pictures, Travel Tagged: Apostles, astronomical clock, Czech Republic, mechanical clock, old town, Prague
I read this book by Albert Camus over my week in Oxford, having found it on my daughter’s bookshelf (as she had presumably read it in high school…). It is a very special book in that (a) Camus was working on it when he died in a car accident, (b) the manuscript was found among the wreckage, and (c) it differs very much from Camus’ other books. Indeed, the book is partly autobiographical and written with an unsentimental realism that is raw and brutal. It describes the youth of Jacques, the son of French colonists in Algiers, whose father died in the first days of WW I and whose family lives in utter poverty, with both his mother and grandmother doing menial jobs simply to survive. Thanks to a supportive teacher, he manages to get a grant to attend secondary school. What is most moving about the book is how Camus describes the numbing effects of poverty, namely how his relatives see their universe shrink so much that notions like the Mother Country (France) or books lose meaning for them, without moving them towards or against the native Algerians, who never penetrate the inner circles of the novel, remaining as if behind a glass screen. It is not that the tensions and horrors of colonisation and of the resistance to colonisation are hidden, quite the opposite, but the narrator considers them with a sort of fatalism, without questioning the colonisation itself. (The book reminded me very much of my grandfather’s childhood, with a father also among the dead soldiers of WW I, being raised by a single mother in harsh conditions, with the major difference that my grandfather decided to stop school very early to become a gardener…) There are also obvious parallels with Pagnol’s autobiographical novels like My Father’s Glory, written at about the same time, from the boyhood friendship to the major role of the instituteur, to the hunting party, to the funny uncle; but everything opposes the two authors, from Pagnol’s light truculence to Camus’ tragic depiction.
Pagnol’s books are great teen books (and I still remember my mother buying the first one on a vacation road trip) but nothing more. Camus’ book could have been his greatest book, had he survived the car accident of January 1960.
Filed under: Books, Kids, pictures, Travel Tagged: Albert Camus, Algeria, Algerian colons, Algiers, book review, Carlos Sampayo, hunting, José Muñoz, Marcel Pagnol, The First Man, WW I