While I now try to abstain from participating in the Cross Validated forum, as it proves too time-consuming an activity with little added value (in the sense that answers are much too often treated as disposable napkins by users who cannot be bothered to open a textbook and who usually do not exhibit any long-term impact of the provided answer, while clogging the forum with so many questions that individual entries get very little traffic when compared, say, with the stackoverflow forum, to the point of making the analogy with disposable wipes even more appropriate!), I came across a truly interesting question the other night. Truly interesting for me in that I had never considered the issue before.
The question essentially asks how to simulate from a distribution defined by its failure rate (or hazard) function η, which is connected with the density f of the distribution by
$\eta(x)=\dfrac{f(x)}{1-F(x)}$
From a purely probabilistic perspective, defining the distribution through f or through η is equivalent, as shown by the relation
$f(x)=\eta(x)\,\exp\Big\{-\int_0^x \eta(t)\,\text{d}t\Big\}$
but, from a simulation point of view, it may provide a different entry. Indeed, all that is needed is the ability to solve (in X) the equation
$\int_0^X \eta(t)\,\text{d}t = -\log U$
when U is a Uniform (0,1) variable, since the survival function satisfies $1-F(X)=\exp\{-\int_0^X\eta(t)\,\text{d}t\}$ and can thus be inverted at U. This may help in that it does not require a derivation of f. Obviously, it also begs the question of why a distribution would be defined by its failure rate function in the first place.
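As a minimal sketch of the inversion (assuming, purely for illustration, the Weibull-type hazard η(t)=kt^{k−1} with k=2, for which the closed-form answer is known), the equation can be solved by bisection since the cumulative hazard is increasing:

```python
import math
import random

K = 2.0  # shape of the assumed Weibull hazard eta(t) = K * t**(K-1)

def cum_hazard(x):
    # cumulative hazard H(x) = integral_0^x eta(t) dt = x**K for this hazard
    return x ** K

def simulate_from_hazard(u, hi=1e3):
    # solve H(X) = -log(u) in X by bisection; works for any increasing H
    target, lo = -math.log(u), 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cum_hazard(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(42)
draws = [simulate_from_hazard(random.random()) for _ in range(10_000)]
# sanity check: for H(x) = x^2 the closed-form inverse is sqrt(-log u),
# i.e. a Weibull(2) variate with mean Gamma(1.5) = sqrt(pi)/2, about 0.886
print(sum(draws) / len(draws))
```

The bisection only queries the cumulative hazard, so neither f nor F ever needs to be written down, which is precisely the point made above.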
Filed under: Books, Kids, Statistics, University life Tagged: cross validated, failure rate, Monte Carlo Statistical Methods, probability theory, reliability, simulation, StackExchange, stackoverflow, survival analysis
[A rather stinky piece in The Guardian today, written by a consultant self-styled Higher Education expert… No further comments needed!]
“The reasons cited for this laggardly response [to innovations] will be familiar to any observer of the university system: an inherently conservative and risk-averse culture in most institutions; sclerotic systems and processes designed for a different world, and a lack of capacity, skills and willingness to change among an ageing academic community. All these are reinforced by perceptions that most proposed innovations are over-hyped and that current ways of operating have plenty of life left in them yet.”
Filed under: Books, Kids, pictures, University life Tagged: marketing, privatisation, reform, The Guardian, United Kingdom
The paper about Kamiltonian MCMC arXived by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton generated comments from Michael Betancourt, Dan Simpson, and myself, which themselves induced the following reply from Heiko, detailed enough to deserve a post of its own.
Adaptation and ergodicity.
We certainly agree that the naive approach of using a non-parametric kernel density estimator on the chain history (as in [Christian’s book, Example 8.8]) as a *proposal* fails spectacularly on simple examples: the probability of proposing in unexplored regions is extremely small, independent of the current position of the MCMC trajectory. This is not what we do though. Instead, we use the gradient of a density estimator, and not the density itself, for our HMC proposal. Just like KAMH, KMC lite in fact falls back to Random Walk Metropolis in previously unexplored regions and therefore inherits geometric ergodicity properties. This in particular includes the ability to explore previously “unseen” regions, even if adaptation has stopped. I implemented a simple illustration and comparison here.
The main point of the ABC example is that our method does not suffer from the additional bias of Gaussian synthetic likelihoods when confronted with skewed models. But there is also a computational efficiency aspect. The scheme by Meeds et al. relies on finite differences and requires $2D$ simulations from the likelihood *every time* the gradient is evaluated (i.e., at every leapfrog iteration), and H-ABC subsequently discards this valuable information. In contrast, KMC accumulates gradient information from simulations: it only requires simulating from the likelihood *once*, in the accept/reject step after the leapfrog integration (where gradients are available in closed form). The density is only updated then, and not during the leapfrog integration. Similar work on speeding up HMC via energy surrogates can be applied in the tall-data scenario.
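To unpack the $2D$ count, here is a toy stand-in (not the actual H-ABC implementation; the Gaussian simulator and every name below are made up for illustration): a central finite-difference gradient of a synthetic log-likelihood costs two simulator batches per coordinate, hence 2D batches at every leapfrog step:

```python
import random

D = 3                  # parameter dimension
calls = {"n": 0}       # counts simulator invocations

def simulate(theta, rng, n=50):
    """Stand-in ABC simulator: n draws from N(theta, I); counts each batch."""
    calls["n"] += 1
    return [[rng.gauss(t, 1.0) for t in theta] for _ in range(n)]

def synthetic_loglik(theta, rng):
    """Crude Gaussian synthetic log-likelihood of a fixed 'observed' summary
    (zeros), with the mean re-estimated from one fresh simulation batch."""
    sims = simulate(theta, rng)
    means = [sum(col) / len(sims) for col in zip(*sims)]
    return -0.5 * sum(m * m for m in means)

def fd_gradient(theta, rng, eps=1e-2):
    """Central finite differences: 2 simulator batches per coordinate."""
    grad = []
    for d in range(D):
        up, dn = theta[:], theta[:]
        up[d] += eps
        dn[d] -= eps
        grad.append((synthetic_loglik(up, rng) - synthetic_loglik(dn, rng)) / (2 * eps))
    return grad

rng = random.Random(1)
for _ in range(10):            # 10 leapfrog steps in one HMC proposal
    fd_gradient([0.5] * D, rng)
print(calls["n"])              # 2 simulations per coordinate x D coordinates x 10 steps = 60
```

KMC's point is that these 60 batches per proposal shrink to one simulation in the accept/reject step, since past simulations feed the gradient surrogate instead of being thrown away.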
Monte Carlo gradients.
Approximating HMC when gradients aren’t available is in general a difficult problem. One approach (like surrogate models) may work well in some scenarios while a different approach (i.e. Monte Carlo) may work better in others, and the ABC example showcases such a case. We very much doubt that one size will fit all — but rather claim that it is of interest to find and document these scenarios.
Michael raised the concern that intractable gradients in the pseudo-marginal case can be avoided by running an MCMC chain on the joint space (e.g., $(f,\theta)$ for the GP classifier). To us, however, the situation is not that clear. In many cases, the correlations between variables can cause convergence problems for the MCMC (see e.g. here) and have to be addressed by de-correlation schemes (as here), or e.g. by incorporating geometric information, which also needs fixes such as Michael's very own one. Which is the method of choice for a particular statistical problem at hand? Which method gives the smallest estimation error (if that is the goal) for a given problem? Estimation error per unit of time? A thorough comparison of these different classes of algorithms, in terms of performance related to problem class, would help here. Most papers (including ours) only show experiments favouring their own method.
GP estimator quality.
Finally, to address Michael's point on the consistency of the GP estimator of the density gradient: this is discussed in the original paper on the infinite-dimensional exponential family. As Michael points out, higher-dimensional problems are unavoidably harder; however, the specific details are rather involved. First, in terms of theory: in both the well-specified case (when the natural parameter is in the RKHS, Section 4) and the ill-specified case (when the natural parameter is in a "reasonable", larger class of functions, Section 5), the estimate is consistent. Consistency is obtained in various metrics, including the L² error on gradients. The rates depend on how smooth the natural parameter is (and indeed a poor choice of hyper-parameter will mean slower convergence). The key point, with regard to Michael's question, is that the smoothness requirement becomes more restrictive as the dimension increases: see Section 4.2, "range space assumption".
Second, in terms of practice: we have found in experiments that the infinite-dimensional exponential family does perform considerably better than a kernel density estimator when the dimension increases (Section 6). In other words, our density estimator can take advantage of smoothness properties of the "true" target density to get good convergence rates. As a practical strategy for hyper-parameter choice, we cross-validate, which works well empirically despite being distasteful to Bayesians. Experiments in the KMC paper also indicate that we can scale these estimators up to dimensions in the 100s on laptop computers (unlike most other gradient estimation techniques in HMC, e.g., the ones in your HMC & sub-sampling note, or the finite differences in Meeds et al).
Filed under: Books, Statistics, University life Tagged: adaptive MCMC methods, Bayesian quadrature, Gatsby, Hamiltonian Monte Carlo, London, Markov chain, Monte Carlo Statistical Methods, non-parametric kernel estimation, reproducing kernel Hilbert space, RKHS, smoothness
“Unfortunately, the factorization does not make it immediately clear how to aggregate on the level of samples without first having to obtain an estimate of the densities themselves.” (p.2)
The recently arXived variational consensus Monte Carlo is a paper by Maxim Rabinovich, Elaine Angelino, and Michael Jordan that approaches the consensus Monte Carlo principle from a variational perspective. As in the embarrassingly parallel version, the target is split into a product of K terms, each being interpreted as an unnormalised density and being fed to a different parallel processor. The most natural partition is to break the data into K subsamples and to raise the prior to the power 1/K in each term. While this decomposition makes sense from a storage perspective, since each bit corresponds to a different subsample of the data, it raises the question of the statistical pertinence of splitting the prior and my feelings about it are now more lukewarm than when I commented on the embarrassingly parallel version, mainly for the reason that it is not reparameterisation invariant—getting different targets if one does the reparameterisation before or after the partition—and hence does not treat the prior as the reference measure it should be. I therefore prefer the version where the same original prior is attached to each part of the partitioned likelihood (and even more the random subsampling approaches discussed in the recent paper of Bardenet, Doucet, and Holmes). Another difficulty with the decomposition is that a product of densities is not a density in most cases (it may even be of infinite mass) and does not offer a natural path to the analysis of samples generated from each term in the product. Nor an explanation as to why those samples should be relevant to construct a sample for the original target.
“The performance of our algorithm depends critically on the choice of aggregation function family.” (p.5)
Since the variational Bayes approach is a common answer to complex products models, Rabinovich et al. explore the use of variational Bayes techniques to build the consensus distribution out of the separate samples. As in Scott et al., and Neiswanger et al., the simulation from the consensus distribution is a transform of simulations from each of the terms in the product, e.g., a weighted average. Which determines the consensus distribution as a member of an aggregation family defined loosely by a Dirac mass. When the transform is a sum of individual terms, variational Bayes solutions get much easier to find and the authors work under this restriction… In the empirical evaluation of this variational Bayes approach as opposed to the uniform and Gaussian averaging options in Scott et al., it improves upon those, except in a mixture example with a large enough common variance.
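To make the split and the averaging concrete in a toy conjugate case (a sketch with made-up numbers, using the Gaussian precision-weighted rule of Scott et al. rather than the variational scheme of the paper): with a Gaussian likelihood and Gaussian prior, each subposterior built from the prior raised to the power 1/K times the likelihood of one shard is again Gaussian, and the precision-weighted average of subposterior draws recovers the full posterior exactly:

```python
import random, math

random.seed(0)
N, K, sigma2 = 1000, 4, 1.0
data = [random.gauss(2.0, 1.0) for _ in range(N)]
shards = [data[k::K] for k in range(K)]

def subposterior(shard, prior_prec=1.0):
    # N(0, 1/prior_prec) prior raised to the power 1/K, times the shard likelihood
    prec = prior_prec / K + len(shard) / sigma2
    mean = (sum(shard) / sigma2) / prec
    return mean, prec

subs = [subposterior(s) for s in shards]

# full posterior, for reference: the K subposterior precisions add up to it
full_prec = 1.0 + N / sigma2
full_mean = (sum(data) / sigma2) / full_prec

# precision-weighted average of subposterior draws (the Gaussian rule of Scott et al.)
T = 20000
w = sum(p for _, p in subs)
consensus = []
for _ in range(T):
    xs = [random.gauss(m, 1 / math.sqrt(p)) for m, p in subs]
    consensus.append(sum(p * x for (_, p), x in zip(subs, xs)) / w)

print(abs(sum(consensus) / T - full_mean))  # close to zero: exact in this conjugate case
```

Of course the exactness is an artefact of the Gaussian setting; the objections above about products of (pseudo-)densities and mismatched tails kick in as soon as the subposteriors are not Gaussian.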
In fine, despite the relevance of variational Bayes to improve the consensus approximation, I still remain unconvinced about the use of the product of (pseudo-)densities and the subsequent mix of simulations from those components, for the reason mentioned above and also because the tail behaviour of those components is not related with the tail behaviour of the target. Still, this is a working solution to a real problem and as such is a reference for future works.
Filed under: Books, Statistics, University life Tagged: big data, consensus Monte Carlo, embarrassingly parallel, large data problems, subsampling, tall data, variational Bayes methods
Following my earlier comments on Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers, from Amsterdam, Joris Mulder, a special issue editor of the Journal of Mathematical Psychology, kindly asked me for a written discussion of that paper, discussion that I wrote last week and arXived this weekend. Besides the above comments on ToP, this discussion contains some of my usual arguments against the use of the Bayes factor as well as a short introduction to our recent proposal via mixtures. Short introduction as I had to restrain myself from reproducing the arguments in the original paper, for fear it would jeopardize its chances of getting published and, who knows?, discussed.
Filed under: Books, Kids, pictures, Running, Statistics, Travel, University life Tagged: Amsterdam, Bayes factor, boat, Harold Jeffreys, Holland, Journal of Mathematical Psychology, psychometrics, sunrise, Theory of Probability, XXX
While cooking a late Sunday lunch today [sweet-potato röstis], I was listening as usual to the French public radio (France Inter) and at some point heard the short [10 minute] Périphéries, which every weekend gives an insight into the suburbs [on the “other side” of the Parisian Périphérique boulevard]. The idea, proposed by a geographer from Montpellier, Emmanuel Vigneron, was to point out the health inequalities between the wealthy 5th arrondissement of Paris and the not-so-far-away suburbs, by following the RER B train line from Luxembourg to La Plaine-Stade de France…
The disparities between the heart of Paris and some suburbs are numerous and massive, growing the further one gets from the lifeline represented by the RER A and RER B train lines, so far be it from me to negate this opposition, but the presentation made during those 10 minutes of Périphéries was quite approximate in statistical terms. For instance, the mortality rate in La Plaine is 30% higher than the mortality rate in Luxembourg, and this was translated into the chances for a given individual from La Plaine to die in the coming year being 30% higher than if he [or she] lived in Luxembourg. Then, a few minutes later, the chances for a given individual from Luxembourg to die became 30% lower than if he [or she] lived in La Plaine…, which cannot hold simultaneously. Reading from the above map, it appears that the reference is the mortality rate for Greater Paris. (Those are 2010 figures.) This opposition, which Vigneron attributes to differing access to health facilities, like the number of medical general practitioners per inhabitant, does not account for the huge socio-demographic differences between both places, for instance the much younger and maybe larger population in suburbs like La Plaine. Nor for other confounding factors: see, e.g., the equally large difference between the neighbouring stations of Luxembourg and Saint-Michel, where there is no socio-demographic difference and the accessibility of health services is about the same. Or the similar opposition between the southern suburban stops of Bagneux and [my local] Bourg-la-Reine, with the same access to health services… Or yet again the massive decrease in the Yvette valley near Orsay. The analysis is thus statistically poor and somewhat ideologically biased, in that I am unsure the data discussed during this radio show tells us much more than the sad fact that suburbs with less favoured populations show a higher mortality rate.
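For the record, the asymmetry is elementary arithmetic (illustrated with a made-up baseline rate): a rate 30% higher in one direction corresponds to a rate only about 23% lower in the other direction, not 30%:

```python
lux = 100.0                          # hypothetical Luxembourg mortality rate (per 100,000, say)
plaine = 1.30 * lux                  # "30% higher" in La Plaine
excess = (plaine - lux) / lux        # La Plaine relative to Luxembourg: 0.30
deficit = (lux - plaine) / plaine    # Luxembourg relative to La Plaine: -0.3/1.3, about -0.23
print(excess, round(deficit, 3))
```

So quoting "30% higher" one minute and "30% lower" the next amounts to comparing against two different baselines.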
Filed under: Statistics, Travel Tagged: Bagneux, boulevard périphérique, Bourg-la-Reine, France, France Inter, inequalities, Luxembourg, national public radio, Orsay, Paris, Paris suburbs, Périphéries, RER B, Saint-Michel, Stade de France, Yvette
Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton arXived a paper last week about Kamiltonian MCMC, the K being related to RKHS. (RKHS as in another KAMH paper for adaptive Metropolis-Hastings by essentially the same authors, plus Maria Lomeli and Christophe Andrieu. And another paper by some of the authors on density estimation via infinite exponential family models.) The goal here is to bypass the computation of the derivatives in the moves of the Hamiltonian MCMC algorithm by using a kernel surrogate. While the genuine RKHS approach operates within an infinite exponential family model, two versions are proposed, KMC lite with an increasing sequence of RKHS subspaces, and KMC finite, with a finite-dimensional space. In practice, this means using a leapfrog integrator with a different potential function, hence with a different dynamics.
The estimation of the infinite exponential family model is somewhat of an issue, as it is estimated from the past history of the Markov chain, simplified into a random subsample of this history [presumably drawn without replacement, meaning the Markovian structure is lost on the subsample]. This is puzzling because the dependence on the whole past voids ergodicity guarantees… For instance, we gave an illustration in Introducing Monte Carlo Methods with R [Chapter 8] of the poor impact of approximating the target by non-parametric kernel estimates. I would thus lean towards requiring a secondary Markov chain to build this kernel estimate. The authors are obviously aware of this difficulty and advocate an attenuation scheme. There is also the issue of the cost of a kernel estimate, in O(n³) for a subsample of size n. If, instead, a fixed dimension m is selected for the RKHS, the cost is in O(tm²+m³), with the advantage of a feasible on-line update, making it an O(m³) cost in fine. But again there is the worry of using the whole past of the Markov chain to set its future path…
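The overall structure (a surrogate gradient driving the leapfrog integrator, with the exact unnormalised density in the accept/reject step) can be sketched as follows. Note this is only a caricature: the RKHS score estimator is replaced by the gradient of a Gaussian moment-matched to a frozen chain history, and the target is a standard normal whose density we pretend to know only pointwise:

```python
import random, math

random.seed(3)

def logpi(x):
    # target: standard normal; density available, gradient pretended unavailable
    return -0.5 * x * x

history = [random.gauss(0, 1.2) for _ in range(200)]  # frozen warm-up chain history

# hypothetical surrogate: gradient of a Gaussian fitted to the history
mu = sum(history) / len(history)
var = sum((h - mu) ** 2 for h in history) / len(history)

def surrogate_grad(x):
    return -(x - mu) / var

def kmc_step(x, eps=0.3, L=10):
    """One HMC step: leapfrog driven by the surrogate gradient, Metropolis
    correction with the exact logpi (leapfrog stays reversible and
    volume-preserving whatever gradient field is plugged in)."""
    p = random.gauss(0, 1)
    x_new, p_new = x, p + 0.5 * eps * surrogate_grad(x)
    for _ in range(L):
        x_new += eps * p_new
        p_new += eps * surrogate_grad(x_new)
    p_new -= 0.5 * eps * surrogate_grad(x_new)  # halve the last momentum update
    log_acc = (logpi(x_new) - 0.5 * p_new ** 2) - (logpi(x) - 0.5 * p ** 2)
    return x_new if math.log(random.random()) < log_acc else x

x, chain = 0.0, []
for _ in range(5000):
    x = kmc_step(x)
    chain.append(x)
print(sum(chain) / len(chain))  # near zero: the exact accept step corrects the surrogate
```

A mismatched surrogate only costs acceptance rate, not correctness, which is the appeal of the scheme; the ergodicity worry above concerns what happens when the history keeps being updated by the chain itself.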
Among the experiments stands a KMC for ABC that follows the recent proposal of Hamiltonian ABC by Meeds et al. The arguments are interesting albeit sketchy: KMC-ABC does not require simulations at each leapfrog step, but is this because the kernel approximation does not get updated at each step? Puzzling.
I also discussed the paper with Michael Betancourt (Warwick) and here his comments:
“I’m hesitant for the same reason I’ve been hesitant about algorithms like Bayesian quadrature and GP emulators in general. Outside of a few dimensions I’m not convinced that GP priors have enough regularization to really specify the interpolation between the available samples, so any algorithm that uses a single interpolation will be fundamentally limited (as I believe is born out in non-trivial scaling examples) and trying to marginalize over interpolations will be too awkward.
They’re really using kernel methods to model the target density which then gives the gradient analytically. RKHS/kernel methods/ Gaussian processes are all the same math — they’re putting prior measures over functions. My hesitancy is that these measures are at once more diffuse than people think (there are lots of functions satisfying a given smoothness criterion) and more rigid than people think (perturb any of the smoothness hyper-parameters and you get an entirely new space of functions).
When using these methods as an emulator you have to set the values of the hyper-parameters which locks in a very singular definition of smoothness and neglects all others. But even within this singular definition there are a huge number of possible functions. So when you only have a few points to constrain the emulation surface, how accurate can you expect the emulator to be between the points?
In most cases where the gradient is unavailable it’s either because (a) people are using decades-old Fortran black boxes that no one understands, in which case there are bigger problems than trying to improve statistical methods or (b) there’s a marginalization, in which case the gradients are given by integrals which can be approximated with more MCMC. Lots of options.”
Filed under: Books, Statistics, University life Tagged: adaptive MCMC methods, Bayesian quadrature, Gatsby, Hamiltonian Monte Carlo, Introducing Monte Carlo Methods with R, London, Markov chain, non-parametric kernel estimation, reproducing kernel Hilbert space, RKHS, smoothness
Filed under: pictures, Travel, University life Tagged: bike, tangerine, University of Warwick, winter
When visiting a bookstore in Florence last month, during our short trip to Tuscany, I came upon this book with enough of a funny cover and enough of a funny title (possibly capitalising on the similarity with “The Girl Who Played with Fire”) to make me buy it. I am glad I gave in to this impulse as the book is simply hilarious! The style and narrative relate rather strongly to the series of similarly [mostly] hilarious picaresque tales written by Paasilinna, and not only because both authors are from Scandinavia. There is the same absurd feeling that the book's characters should not have this sort of thing happening to them, and still the morbid fascination of watching catastrophe after catastrophe being piled upon them. While the story is deeply embedded within the recent history of South Africa and [not so much] of Sweden over the past 30 years, including major political figures, there is no true attempt at making the story in the least realistic, which is another characteristic of the best stories of Paasilinna. Here, a young girl escapes the poverty of the slums of Soweto to eventually make her way to Sweden, along with a spare nuclear bomb and a fistful of diamonds. Which alas are not eternal… Her intelligence helps her overcome most difficulties, but even she needs from time to time to face absurd situations as another victim. All is well that ends well for most characters in the story, some of whom one would prefer to see vanish in a gruesome accident. Which seemed about to happen until another thread in the story saved the idiot. The satire of South Africa and of Sweden is most enjoyable, if somewhat easy! Now I have to read the previous volume in the series, The Hundred-Year-Old Man Who Climbed Out of the Window and Disappeared!
Filed under: Books, Kids, Travel Tagged: Arto Paasilinna, book review, Finland, Firenze, Italy, Scandinavia, South Africa, Soweto, Sweden, the girl who saved the king of Sweden, The Hundred-Year-Old Man Who Climbed Out of the Window and Disappeared, Tuscany
Here are the download figures for my e-book with George, as sent to me last week by my publisher Springer-Verlag. With an interesting surge in the past year. Maybe simply due to new selling strategies of the publisher rather than to a wider interest in the book. (My royalties have certainly not increased!) Anyway, thanks to all readers. As an aside for WordPress wannabe bloggers, I realised it is now almost impossible to write tables with WordPress, another illustration of the move towards small-device-supported blogs. Along with a new, annoying, “simpler” (or more accurately dumber) interface and a default font far too small for my eyesight. So I advise alternatives to WordPress that are more sympathetic to maths contents (e.g., using MathJax) and comfortable editing.
And the same for the e-book with Jean-Michel, which only appeared in late 2013. And contains more chapters than Introducing Monte Carlo Methods with R. Incidentally, a reader recently pointed out to me the availability of a pirated version of The Bayesian Choice on a Saudi (religious) university website. And of a pirated version of Introducing Monte Carlo Methods with R on a São Paulo (Brazil) university website. This may alas be inevitable, given the diffusion by publishers of e-chapters that can be copied with no limitations…
Filed under: Books, R, Statistics, University life Tagged: Bayesian Essentials with R, book sales, Brazil, copyright, Introduction to Monte Carlo Methods with R, Saudi Arabia, Springer-Verlag
As I was putting together a proposal for a special ABC session at ISBA 2016 in Santa Margherita di Pula, Sardinia, I received worried replies from would-be participants about the affordability of the meeting given their research funds! Since I had a similar worry about supporting myself and several of my PhD students, I looked around for low-cost alternatives and [already] booked a nearby villa in Santa Margherita di Pula for about 100€ per person for the whole week. Including bikes. Plus, several low-cost airlines like easyJet and Ryanair fly to Cagliari from European cities like Berlin, Paris, London, Geneva, and most Italian cities, for less than 100€ round-trip [with enough advance planning], and if one really is on half a shoestring, there are regular buses connecting Cagliari to Santa Margherita di Pula for a few euros. This means in the end that supporting a PhD student, or a postdoc within 5 years of a PhD, to attend ISBA 2016 from Europe can be budgeted as low as a tight 500€ under limited funding resources, including the registration fees of 290€… So definitely affordable with long-term planning!
Filed under: Kids, pictures, Travel, University life, Wines Tagged: ABC, Bayesian conference, budget, Cagliari, pink flamingos, registration fees, research grant, Santa Margherita di Pula, Sardinia
Following yesterday’s surprise at the unpleasant conference business run by WASET, I was once again confronted today with conference fees that sound like an unacceptable siphoning of research funds and public money. One of my PhD students was personally invited earlier to present a talk at EUSIPCO 2015, a European signal processing conference taking place in Nice next September, and she accepted the invitation. Now, contrary to yesterday’s example, EUSIPCO 2015 is a genuine conference sponsored by several European signal processing societies. From what I understand, speakers and poster presenters must submit papers that are reviewed and then published in the conference proceedings, part of the IEEE Xplore on-line digital library (impact factor of 0.04). As the conference is drawing near, my student is asked to register and is “reminded” of the small print in the conference rules, namely that “at least one author per paper must register by June 19, 2015 at the full rate”, student or not, which means a 300€ difference in the fees and has absolutely no justification whatsoever since the papers are only processed electronically…
I checked a few of the past editions of EUSIPCO and the same rip-off rule applies to those as well. I see no rational explanation for this rule, which sounds like highway robbery and leads to the de facto exclusion of students from conferences… In fine, my student withdrew her paper and her participation in EUSIPCO.
Filed under: Kids, University life Tagged: conference fees, EUSIPCO 2015, IEEE, IEEE Xplore, Nice, signal processing, student fees
“The Statistics and Computing journal gratefully acknowledges the contributions for this special issue, celebrating 25 years of publication. In the past 25 years, the journal has published innovative, distinguished research by leading scholars and professionals. Papers have been read by thousands of researchers world-wide, demonstrating the global importance of this field. The Statistics and Computing journal looks forward to many more years of exciting research as the field continues to expand.” Mark Girolami, Editor in Chief for The Statistics and Computing journal
Our joint [Peter Green, Krzysztof Łatuszyński, Marcelo Pereyra, and myself] review [open access!] on the important features of Bayesian computation has already appeared in the special 25th anniversary issue of Statistics & Computing! Along with the following papers
- Statistics and computing: the genesis of data science, David J. Hand, Founding Editor
- EM for mixtures: Initialization requires special care, Jean-Patrick Baudry, Gilles Celeux
- Sequential Monte Carlo methods for Bayesian elliptic inverse problems, Alexandros Beskos, Ajay Jasra, Ege A. Muzaffer, Andrew M. Stuart
- Bayesian inference via projections, Ricardo Silva, Alfredo Kalaitzis
- Computing functions of random variables via reproducing kernel Hilbert space representations, Bernhard Schölkopf, Krikamol Muandet, Kenji Fukumizu, Stefan Harmeling, Jonas Peters
- The Poisson transform for unnormalised statistical models, Simon Barthelmé, Nicolas Chopin
- Scalable estimation strategies based on stochastic approximations: classical results and new insights, Panos Toulis, Edoardo M. Airoldi
- de Finetti Priors using Markov chain Monte Carlo computations, Sergio Bacallado, Persi Diaconis, Susan Holmes
- Simulation-efficient shortest probability intervals, Ying Liu, Andrew Gelman, Tian Zheng
- Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters, Christian Hennig, Chien-Ju Lin
which means very good company, indeed! And happy B’day to Statistics & Computing!
Filed under: Books, Statistics, University life Tagged: 25th anniversary, Bayesian computation, computational statistics, David Hand, Gilles Celeux, Mark Girolami, Monte Carlo Statistical Methods, open access, Statistics & Computing
- 1 – 6 February, 2016 Learning
- 8 – 12 February, 2016 Mathematical statistics
- 15 – 19 February, 2016 Processes
- 22 – 26 February, 2016 Extremes, Copulas and Actuarial Science
- 29 February – 4 March, 2016 Bayesian statistics and algorithms
Each week will see minicourses of a few hours (2-3) and advanced talks, leaving time for interactions and collaborations. (I will give one of those minicourses on Bayesian foundations.) The scientific organisers of the B’ week are Gilles Celeux and Nicolas Chopin.
The CIRM is a wonderful meeting place, in the mountains between Marseilles and Cassis, with many trails to walk and run, and hundreds of fantastic climbing routes in the Calanques at all levels. (In February, the sea is too cold to contemplate swimming. The good side is that it is not too warm to climb and the risk of bush fire is very low!) We stayed there with Jean-Michel Marin a few years ago when preparing Bayesian Essentials. The maths and stats library is well provided, with permanent access for quiet working sessions. This is the French version of the equally fantastic German Mathematisches Forschungsinstitut Oberwolfach. There will be financial support available from the supporting societies and research bodies, at least for young participants, and the costs, if any, are low for excellent food and excellent lodging. Definitely not a scam conference!
Filed under: Books, Kids, Mountains, pictures, Running, Statistics, Travel, University life, Wines Tagged: Bayesian Essentials with R, Bayesian statistics, bouillabaisse, calanques, Cassis, CIRM, CNRS, copulas, extremes, France, machine learning, Marseille, minicourse, SMF, stochastic processes
Earlier today, I received an invitation to give a plenary talk at a probability and statistics conference in Marrakech, a nice location if any! As it came from a former graduate student from the University of Rouen (where I taught before Paris-Dauphine), and despite an already heavy travelling schedule for 2016!, I considered his offer. And looked for the conference webpage to find the dates, as my correspondent had forgotten to include those. Instead of the genuine conference webpage, which had not yet been created, what I found was a fairly unpleasant scheme playing on the same conference name and location, but run by a predator conglomerate called WASET, which stands for World Academy of Science, Engineering, and Technology. Their website lists thousands of conferences, all in nice, touristy places, and all with an identical webpage. For instance, there is the ICMS 2015: 17th International Conference on Mathematics and Statistics next week. With a huge “conference committee” but not a single name I can identify. And no one from France. Actually, the website kindly offers entry by city as well as by topic, which helps in spotting that a large number of ICMS conferences all take place on the same dates and at the same hotel in Paris… The trick is indeed to attract speakers with the promise of publication in a special issue of a bogus journal and to have them pay 600€ in registration and publication fees, only to have all topics mixed together in a few conference rooms, according to many testimonies I later found on the web. And as is clear from the posted conference program! In the “best” of cases, since other testimonies mention lost fees and rejected registrations. Testimonies also mention this tendency to reproduce the acronym of a local conference.
While it is not unheard of for conferences to amount to academic tourism, even from the most established scientific societies!, I am quite amazed at the scale of this enterprise, even though I cannot completely understand how people can fall for it. Looking at the website, the fees, the unrelated scientific committee, and the lack of scientific program should be enough to put would-be victims off. Unless they truly want to partake in academic tourism, obviously.
Filed under: Kids, Mountains, pictures, Travel, University life Tagged: academic tourism, conferences, France, ICMS 2015, Marrakech, MICPS 2016, Morocco, Paris, registration fees, Rouen, scam conference, WASET, Western Union
In the past few days, there have been so many arXiv postings of interest—presumably the NIPS submission effect!—that I cannot hope to cover them in the coming weeks! Hopefully, some will still come out on the ‘Og in the near future:
- arXiv:1506.06629: Scalable Approximations of Marginal Posteriors in Variable Selection by Willem van den Boom, Galen Reeves, David B. Dunson
- arXiv:1506.06285: The MCMC split sampler: A block Gibbs sampling scheme for latent Gaussian models by Óli Páll Geirsson, Birgir Hrafnkelsson, Daniel Simpson, Helgi Sigurðarson [also deserves a special mention for gathering only ***son authors!]
- arXiv:1506.06268: Bayesian Nonparametric Modeling of Higher Order Markov Chains by Abhra Sarkar, David B. Dunson
- arXiv:1506.06117: Convergence of Sequential Quasi-Monte Carlo Smoothing Algorithms by Mathieu Gerber, Nicolas Chopin
- arXiv:1506.06101: Robust Bayesian inference via coarsening by Jeffrey W. Miller, David B. Dunson
- arXiv:1506.05934: Expectation Particle Belief Propagation by Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
- arXiv:1506.05860: Variational Gaussian Copula Inference by Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin
- arXiv:1506.05855: The Frequentist Information Criterion (FIC): The unification of information-based and frequentist inference by Colin H. LaMont, Paul A. Wiggins
- arXiv:1506.05757: Bayesian Inference for the Multivariate Extended-Skew Normal Distribution by Mathieu Gerber, Florian Pelgrin
- arXiv:1506.05741: Accelerated dimension-independent adaptive Metropolis by Yuxin Chen, David Keyes, Kody J.H. Law, Hatem Ltaief
- arXiv:1506.05269: Bayesian Survival Model based on Moment Characterization by Julyan Arbel, Antonio Lijoi, Bernardo Nipoti
- arXiv:1506.04778: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression by Anirban Bhattacharya, Antik Chakraborty, Bani K. Mallick
- arXiv:1506.04416: Bayesian Dark Knowledge by Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling [a special mention for this title!]
- arXiv:1506.03693: Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference by Edward Meeds, Max Welling
- arXiv:1506.03074: Variational consensus Monte Carlo by Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
- arXiv:1506.02564: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton [comments coming soon!]
Filed under: R, Statistics, University life Tagged: arXiv, Bayesian statistics, MCMC, Monte Carlo Statistical Methods, Montréal, NIPS 2015, particle filter
“The results in this paper suggest that ABC can scale to large data, at least for models with a fixed number of parameters, under the assumption that the summary statistics obey a central limit theorem.”
In a week rich with arXiv submissions about MCMC and “big data”, like the Variational consensus Monte Carlo of Rabinovich et al., or scalable Bayesian inference via particle mirror descent by Dai et al., Wentao Li and Paul Fearnhead contributed an impressive paper entitled Behaviour of ABC for big data. However, a word of warning: the title is somewhat misleading in that the paper does not address the issue of big or tall data per se, e.g., the impossibility to handle the whole data at once and to reproduce it by simulation, but rather the asymptotics of ABC. The setting is not dissimilar to the earlier Fearnhead and Prangle (2012) Read Paper. The central theme of this theoretical paper [with 24 pages of proofs!] is to study the connection between the number N of Monte Carlo simulations and the tolerance value ε when the number of observations n goes to infinity. A main result in the paper is that the ABC posterior mean can have the same asymptotic distribution as the MLE when ε=o(n^{-1/4}). This is however of no direct use in practice, as the second main result is that the Monte Carlo variance is well-controlled only when ε=O(n^{-1/2}). There is therefore a sort of contradiction in the conclusion, between the positive equivalence with the MLE and the constraint that controlling the Monte Carlo variance imposes on the tolerance.
Something I have (slight) trouble with is the construction of an importance sampling function proportional to fABC(s|θ)^α when, obviously, this function cannot be used for simulation purposes. The authors point out this fact, but still build an argument about the optimal choice of α, namely away from 0 and 1, like ½. Actually, any value different from 0 and 1 is sensible, meaning that the range of acceptable importance functions is wide. Most interestingly (!), the paper constructs an iterative importance sampling ABC in a spirit similar to the ABC-PMC of Beaumont et al. (2009). Even more interestingly, the ½ factor amounts to updating the scale of the proposal as twice the scale of the target, just as in PMC.
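For readers curious about what such an iterative importance sampling ABC looks like in practice, here is a minimal toy sketch of my own, not the authors' algorithm: a normal-mean model with the sample mean as summary statistic, a decreasing sequence of tolerances, and the Beaumont et al. (2009) rule of taking the proposal variance as twice the weighted variance of the current particle cloud. All tuning constants (prior scale, tolerances, particle number) are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy setting: x_1..x_n ~ N(theta, 1), summary = sample mean,
# prior theta ~ N(0, 10^2); all numerical choices are illustrative
n_obs = 50
theta_true = 2.0
s_obs = rng.normal(theta_true, 1.0, n_obs).mean()

def simulate_summary(theta):
    return rng.normal(theta, 1.0, n_obs).mean()

N = 500                             # fixed particle number across iterations
epsilons = [1.0, 0.5, 0.25, 0.1]    # decreasing tolerances

# iteration 0: plain rejection ABC from the prior
accepted = []
while len(accepted) < N:
    th = rng.normal(0.0, 10.0)
    if abs(simulate_summary(th) - s_obs) < epsilons[0]:
        accepted.append(th)
particles = np.array(accepted)
weights = np.full(N, 1.0 / N)

for eps in epsilons[1:]:
    # Beaumont et al. (2009) rule: proposal variance = twice the
    # weighted empirical variance of the current particle cloud
    mu = np.average(particles, weights=weights)
    tau2 = 2.0 * np.average((particles - mu) ** 2, weights=weights)
    new_particles = np.empty(N)
    new_weights = np.empty(N)
    i = 0
    while i < N:
        th0 = rng.choice(particles, p=weights)      # resample a particle
        th = rng.normal(th0, np.sqrt(tau2))         # perturb it
        if abs(simulate_summary(th) - s_obs) < eps:
            new_particles[i] = th
            # importance weight: prior over mixture proposal density
            # (normalising constants cancel after renormalisation)
            prior = np.exp(-0.5 * th ** 2 / 100.0)
            mix = np.sum(weights * np.exp(-0.5 * (th - particles) ** 2 / tau2))
            new_weights[i] = prior / mix
            i += 1
    particles = new_particles
    weights = new_weights / new_weights.sum()

print(np.average(particles, weights=weights))  # close to s_obs
```

With the final tolerance at 0.1, the weighted particle mean lands near the observed summary, as the nearly flat prior makes the ABC posterior concentrate around s_obs.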
Another aspect of the analysis I do not quite catch is the reason for keeping the Monte Carlo sample size at a fixed value N while setting a sequence of acceptance probabilities (or of tolerances) along iterations. It is a very surprising result that the Monte Carlo error does remain under control and does not dominate the overall error!
“Whilst our theoretical results suggest that point estimates based on the ABC posterior have good properties, they do not suggest that the ABC posterior is a good approximation to the true posterior, nor that the ABC posterior will accurately quantify the uncertainty in estimates.”
Overall, this is clearly a paper worth reading for understanding the convergence issues related with ABC, with more theoretical support than the earlier Fearnhead and Prangle (2012). However, it does not provide guidance on the construction of the sequence of Monte Carlo samples, nor does it discuss the selection of the summary statistic, which obviously has a major impact on the efficiency of the estimation. And, to return to the earlier warning, it does not cope with “big data” in that it reproduces the original simulation of the n-sized sample.
Filed under: Books, Statistics, University life Tagged: ABC, ABC-PMC, asymptotics, big data, iterated importance sampling, MCMC, particle system, simulation
Rémi Bardenet, Arnaud Doucet, and Chris Holmes arXived a long paper (with the above title) a month ago, a paper that I did not have time to read in detail till today. The paper is quite comprehensive in its analysis of the current literature on MCMC for huge, tall, or big data. Even including our delayed acceptance paper! Now, it is indeed the case that we are all still struggling with this size difficulty, making proposals in a wide range of directions and hopefully improving the efficiency of dealing with tall data. However, we are not there yet, in that the outcome is either about as costly as the original MCMC implementation or of an unknown degree of approximation, even when bounds are available.
Most of the paper's proposal is based on aiming at an unbiased estimator of the likelihood function in a pseudo-marginal manner à la Andrieu and Roberts (2009), and on a random subsampling scheme that presumes (a) iid-ness and (b) a lower bound on each term in the likelihood. It seems to me slightly unrealistic to assume that a much cheaper yet tight lower bound on those terms could be available. Firmly set in the iid framework, the problem itself is unclear: do we need 10⁸ observations of a logistic model with a few parameters? The real challenge rather lies in non-iid hierarchical models with random effects and complex dependence structures, for which subsampling gets much more delicate. None of the methods surveyed in the paper touches upon such situations, where the entire data cannot be explored at once.
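To see why such lower bounds matter at all, here is a small numerical illustration of my own, not taken from the paper: subsampling yields an unbiased estimator of the log-likelihood, but exponentiating it, as a pseudo-marginal scheme requires, reintroduces a bias through Jensen's inequality, and the large variance of the subsampled estimator makes that bias enormous. The model and sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative iid setting: x_i ~ N(0, 1), log-likelihood evaluated
# at theta = 0.5; n, m, and the replication count are arbitrary
n, m = 10_000, 100          # full data size, subsample size
x = rng.normal(0.0, 1.0, n)
theta = 0.5

def loglik_terms(theta, data):
    # individual Gaussian log-likelihood contributions
    return -0.5 * np.log(2 * np.pi) - 0.5 * (data - theta) ** 2

full = loglik_terms(theta, x).sum()

# subsampled estimator: (n/m) times the sum over a random subsample,
# which is unbiased for the full log-likelihood...
ests = np.array([
    (n / m) * loglik_terms(theta, rng.choice(x, m, replace=False)).sum()
    for _ in range(2000)
])

print(full, ests.mean())    # close on the log scale

# ...but exp(ests) is a biased estimator of the likelihood:
# E[exp(L_hat)] >= exp(E[L_hat]) by Jensen's inequality, and the
# variance of L_hat (here in the hundreds on the log scale) blows
# the bias up, which is why cheap yet tight lower bounds are needed
```

The average of the subsampled estimates matches the full log-likelihood to within Monte Carlo error, while the standard deviation of a single estimate is several hundred on the log scale, so the exponentiated estimator is useless without variance reduction.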
An interesting experiment therein, based on the Glynn and Rhee (2014) unbiased representation, shows that the approach does not work well. This could lead the community to reconsider the focus on unbiasedness by coming full circle to the opposition between bias and variance. And between intractable likelihood and representative subsample likelihood.
Reading the (superb) coverage of earlier proposals made me reconsider the perceived appeal of the decomposition of Neiswanger et al. (2014), as I came to realise that the product of functions renormalised into densities has no immediate probabilistic connection with its components. As an extreme example, terms may fail to integrate. (Of course, there are many Monte Carlo features that exploit such a decomposition, from the pseudo-marginal to accept-reject algorithms. And more to come.) Taking samples from terms in the product is thus not directly related to taking samples from each term, in contrast with the arithmetic mixture representation. I was first convinced by using a fraction of the prior in each term but now find it unappealing, because there is no reason the prior should change for a smaller sample and there is no equivalent of the prohibition against using the data several times. At this stage, I would be much more in favour of raising a random portion of the likelihood function to the right power. An approach that I suggested to a graduate student earlier this year and which is also discussed in the paper. And considered too naïve and a “very poor approach” (Section 6, p.18), even though there must be versions that do not run afoul of the non-Gaussian nature of the log likelihood ratio. I am certainly going to peruse more thoroughly this Section 6 of the paper.
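To make the fractionated-prior decomposition concrete, here is a toy conjugate Gaussian case of my own where everything is analytic and the product of the subposterior densities does renormalise exactly into the full posterior, because Gaussian precisions and natural parameters simply add up. The point made above is that outside such nice cases the product carries no similar guarantee and may even fail to integrate.

```python
import numpy as np

rng = np.random.default_rng(1)

# conjugate toy: x_i ~ N(theta, 1), prior theta ~ N(0, tau^2);
# split the data into K shards and give each shard the fractionated
# prior N(0, K*tau^2), i.e. (original prior)^(1/K) up to a constant
tau2, K = 4.0, 5
x = rng.normal(1.0, 1.0, 1000)
shards = np.array_split(x, K)

# each Gaussian subposterior in natural form:
# precision = 1/(K*tau2) + n_k, precision*mean = sum of shard data
prec = np.array([1.0 / (K * tau2) + len(s) for s in shards])
pm = np.array([s.sum() for s in shards])

# product of the K subposterior densities: natural parameters add
prec_prod = prec.sum()
mean_prod = pm.sum() / prec_prod

# full-data posterior for comparison
prec_full = 1.0 / tau2 + len(x)
mean_full = x.sum() / prec_full

print(mean_prod, mean_full)   # identical up to floating point
print(prec_prod, prec_full)   # identical up to floating point
```

The K copies of the fractionated prior precision 1/(K·τ²) recombine into the single prior precision 1/τ², which is exactly the feature that breaks down once the prior is not raised to the power 1/K, or once the subposteriors are not Gaussian.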
Another interesting suggestion in this definitely rich paper is the foray into an alternative that bypasses the uniform sampling in the Metropolis-Hastings step, using instead the subsampled likelihood ratio. The authors call this “exchanging acceptance noise for subsampling noise” (p.22). However, there is no indication about the resulting stationary distribution, and I find the notion of only moving to higher likelihoods (or estimates thereof) counter to the spirit of Metropolis-Hastings algorithms. (I have also eventually realised the meaning of the log-normal “difficult” benchmark that I had missed earlier: it means log-normal data is modelled by a normal density.) And yet another innovation along the lines of a control variate for the log likelihood ratio, even though it sounds somewhat surrealistic.
Filed under: Books, Statistics, University life Tagged: big data, divide-and-conquer strategy, Metropolis-Hastings algorithm, parallel MCMC, subsampling, tall data