Bayesian News Feeds

ABC model choice by random forests [guest post]

Xian's Og - Sun, 2014-08-10 17:14

[Dennis Prangle sent me his comments on our ABC model choice by random forests paper. Here they are! And I very much appreciate contributors commenting on my papers or on others', so please feel free to join in.]

This paper proposes a new approach to likelihood-free model choice based on random forest classifiers. These are fit to simulated model/data pairs and then run on the observed data to produce a predicted model. A novel “posterior predictive error rate” is proposed to quantify the degree of uncertainty placed on this prediction. Another interesting use of this is to tune the threshold of the standard ABC rejection approach, which is outperformed by random forests.
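
In code terms, the basic step could look like the following minimal R sketch (not the authors' implementation; the object names sim_stats, sim_model and obs_stats are hypothetical stand-ins for the reference table of simulated summary statistics, the model indices that generated them, and the observed summaries):

    library(randomForest)

    ## Hypothetical inputs: sim_stats is a data frame of summary statistics for
    ## simulated datasets, sim_model the factor of model indices that generated
    ## them, and obs_stats a one-row data frame of the observed summaries.
    rf <- randomForest(x = sim_stats, y = sim_model, ntree = 500)

    ## Predicted ("most favoured") model for the observed data
    predict(rf, newdata = obs_stats)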

The paper has lots of thought-provoking new ideas and was an enjoyable read, as well as giving me the encouragement I needed to read another chapter of the indispensable Elements of Statistical Learning. However, I'm not fully convinced by the approach yet, for a few reasons given below along with other comments.

Alternative schemes

The paper shows that random forests outperform rejection-based ABC. I'd like to see a comparison with more efficient ABC model choice algorithms such as that of Toni et al. (2009). I'd also like to see whether the output of random forests could be used as summary statistics within ABC rather than as a separate inference method.

Posterior predictive error rate (PPER)

This is proposed to quantify the performance of a classifier given a particular data set. The PPER is the proportion of times the classifier’s most favoured model is incorrect for simulated model/data pairs drawn from an approximation to the posterior predictive. The approximation is produced by a standard ABC analysis.
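
As a rough sketch of this computation (continuing the hypothetical R objects from the earlier snippet, with pseudo_stats and pseudo_model standing for summaries and model indices of pairs simulated from the ABC approximation to the posterior predictive):

    ## PPER: proportion of posterior-predictive pseudo-datasets for which the
    ## forest's favoured model differs from the model that generated them.
    pper <- mean(predict(rf, newdata = pseudo_stats) != pseudo_model)
    pper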

Misclassification could be due to (a) a poor classifier or (b) uninformative data, so the PPER aggregates these two sources of uncertainty. I think it is still very desirable to have an estimate of the uncertainty due to (b) only, i.e. a posterior weight estimate. However, the PPER is useful. Firstly, end users may sometimes only care about the aggregated uncertainty. Secondly, relative PPER values for a fixed dataset are a useful measure of uncertainty due to (a), for example in tuning the ABC threshold. Finally, one drawback of the PPER is its dependence on an ABC estimate of the posterior: how robust are the results to the details of how this is obtained?

Classification

This paper illustrates an important link between ABC and machine learning classification methods: model choice can be viewed as a classification problem. There are some other links: some classifiers make good model choice summary statistics (Prangle et al. 2014) or good estimates of ABC-MCMC acceptance ratios for parameter inference problems (Pham et al. 2014). So the good performance of random forests makes them seem a generally useful tool for ABC (indeed they are used in the Pham et al. paper).


Filed under: pictures, R, Statistics, University life Tagged: ABC, ABC model choice, arXiv, classification, Dennis Prangle, Elements of Statistical Learning, machine learning, model posterior probabilities, posterior predictive, PPER, random forests
Categories: Bayesian Bloggers

Dracula [book review]

Xian's Og - Sat, 2014-08-09 17:14

As I was waiting for my plane to Bangalore a week ago, I spotted a cheap English edition of Bram Stoker’s Dracula in De Gaulle airport. I had not re-read the book since my teenage years (quite a while ago, even by wampyr standards!), so I bought it for the trip ahead. I remembered very little of the style of the [French translation of the] book, even though the story itself was still rather fresh in my mind (as were the uneasy nights after reading the novel!).

“I can hazard no opinion. I do not know what to think and I have no data on which to found a conjecture.”

Dracula is definitely a Victorian gothic novel in the same spirit as Radcliffe’s Mysteries of Udolpho, which I read last year, if of a later and lighter style… Characters do not feel very realistic (!), maybe because the novel is written in the epistolary style, which makes those characters only express noble or proper sentiments and praise virtues in their companions. (The book could obviously be re-read with this filter, attempting to guess the true feelings of those poor characters forced into a mental straitjacket by the Victorian moral codes.) However, even without this deconstructive approach, the book is quite fascinating as a representation of the codes of the time. More than for a rather unconvincing plot which leaves the main protagonist mostly in the dark [of a coffin, obviously!]. The small band of wampyr-hunters pursuing Dracula seems bound to commit every mistake in the book, missing clues about his local victims and opportunities to end Dracula’s taste of England earlier… And the progress of Dracula in his invasion is too slow to be frightening. Anyway, what I found highly interesting in Dracula is the position and treatment of women in this novel, from innocent vaporous victims to wanton seductresses once un-dead, from saintly and devoted wives to unusually bright women “more clever than men” but still prone to hysteria… Once again, many filters of (modern) societal and sociological constraints could be lifted from this presentation. I also noticed that no legal authority ever appears in the novel: the few policemen therein lift rescued children from cemeteries or nod at the heroes breaking into Dracula’s house in London. This absence may point to issues with Victorian society that may prove impossible to solve without radical changes. (Or I may be reading too much into it!)


Filed under: Books, Kids, pictures Tagged: book review, Bram Stoker, Dracula, gothic novels, Transylvania, Victorian society
Categories: Bayesian Bloggers

JSM 2014, Boston [#4]

Xian's Og - Fri, 2014-08-08 17:14

Last and final day and post at and about JSM 2014! It is very rare that I stay till the last day and it is solely due to family constraints that I attended the very last sessions. It was a bit eerie, walking through the huge structure of the Boston Convention Centre, which could easily house several A380s, and meeting a few souls dragging a suitcase to the mostly empty rooms… Getting scheduled on the final day of the conference is not the nicest thing and I offer my condolences to all speakers ending up speaking today! Including my former Master’s student Anne Sabourin.

I first attended the Frontiers of Computer Experiments: Big Data, Calibration, and Validation session, with a talk by David Higdon on the extrapolation limits of computer models, a talk that linked very nicely with Stephen Stigler’s Presidential Address and stressed the need for incorporating the often neglected fact that models are not reality. Jared Niemi also presented an approximate way of dealing with Gaussian process modelling for large datasets. It was only natural to link this talk with David’s and wonder about the extrapolability of the modelling, the risk of over-fitting, and the potential for detecting sudden drops in the function.

The major reason why I made the one-hour trip back to the Boston Convention Centre was however the Human Rights Violations: How Do We Begin Counting the Dead? session. It was both of direct interest to me, as I had wondered in the past days about statistically assessing the number of political kidnappings and murders in Eastern Ukraine, and of methodological relevance, as the techniques were connected with capture-recapture and random forests. And of close connection with two speakers who alas could not make it and were replaced by co-authors. The first talk by Samuel Ventura considered ways of accelerating the comparison of entries across multiple lists for identifying unique individuals, with the open methodological question of handling populations of probabilities, as produced by random forests. My virtual question related to this talk was why the causes of duplications and errors in the records were completely ignored. At least in the example of the Syrian deaths, some analysis could be conducted on the reasons for differences between the entries, and maybe a prior model constructed. The second talk by Daniel Manrique-Vallier was about using non-parametric capture-recapture to count the number of dead from several lists, once again bypassing the use of potential covariates for explaining the differences. As I noticed a while ago when analysing the population of (police-)captured drug addicts in the Greater Paris area, the prior modelling has a strong impact on the estimated population size. Another point I would have liked to discuss was the repeated argument that Arabic (script?) made the identification of individuals more difficult: my naïve reaction was to wonder whether or not this was due to the absence of fluent Arabic speakers in the team, who could have further helped to build a model of the potential alternative spellings and derivations of Arabic names. But I may have missed more subtle difficulties.
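
To fix ideas about the counting-from-lists problem, here is the textbook two-list Lincoln-Petersen estimator in R, with made-up counts; the talks dealt with far more elaborate non-parametric multi-list versions, so this is only a toy illustration:

    ## Two-list capture-recapture (Lincoln-Petersen), toy numbers only
    n1 <- 320   # victims recorded on list 1
    n2 <- 275   # victims recorded on list 2
    m  <- 90    # victims matched across both lists
    N_hat <- n1 * n2 / m   # estimated total number of victims, recorded or not
    N_hat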


Filed under: Books, Statistics, Travel, University life Tagged: Boston, capture-recapture, Gaussian processes, JSM 2014, Massachusetts, record linkage, Syria, Ukraine
Categories: Bayesian Bloggers

Mystic sunset

Xian's Og - Fri, 2014-08-08 07:18
Categories: Bayesian Bloggers

JSM 2014, Boston [#3]

Xian's Og - Thu, 2014-08-07 17:14

Today I gave a talk in the Advances in model selection session, organised by Veronika Rockova and Ed George. (A bit of pre-talk stress: I actually attempted to change my slides at 5am and only managed to erase the current version! I thus left early enough to stop by the presentation room…) Here are the final slides, which have much in common with earlier versions, but also borrow from Jean-Michel Marin’s talk in Cambridge. A posteriori, I think the talk missed one slide on the practical running of the ABC random forest algorithm, since later questions showed some misunderstanding in the audience.

The other talks in this session were by Andreas Buja [whom I last met in Budapest last year] on valid post-modelling inference, a very relevant reflection on the fundamental bias in statistical modelling. Then by Nick Polson, about efficient ways to compute MAP estimates for irregular objective functions, a great entry into optimisation methods I had never heard of before! (The abstract is unrelated.) And last but not least by Veronika Rockova, on mixing Indian buffet processes with spike-and-slab priors for factor analysis with an unknown number of factors. A definitely advanced contribution to factor analysis, with a very nice idea of introducing a non-identifiable rotation to align on orthogonal designs. (Here too the abstract is unrelated, a side effect of the ASA requiring abstracts to be sent very long in advance.)

Although discussions lasted well into the following Bayesian Inference: Theory and Foundations session, I managed to listen to a few talks there. In particular, a talk by Keli Liu on constructing non-informative priors, a question of direct relevance to me. The notion of objectivity there is to achieve a frequentist distribution of the Bayes factor associated with the point null that is constant, or that has a constant quantile at a given level. The second talk, by Alexandra Bolotskikh, related to older interests of mine, namely the construction of improved confidence regions in the spirit of Stein. (Not that surprising, given that a coauthor is Marty Wells, who worked with George and me on the topic.) A third talk by Abhishek Pal Majumder (jointly with Jan Hannig) dealt with a new type of fiducial distribution with matching prior properties. This notion popped up a lot over the past days, but this is yet another area where I remain puzzled by the very notion, I mean the notion of fiducial distribution, especially in this case where the matching prior gets even closer to being plain Bayesian.


Filed under: Statistics, University life Tagged: ABC, Bayes factor, Bayesian model choice, fiducial distribution, JSM 2014, objective Bayes, random forest, slides
Categories: Bayesian Bloggers

Boston sunrise

Xian's Og - Thu, 2014-08-07 06:03
Categories: Bayesian Bloggers

JSM 2014, Boston [#2]

Xian's Og - Wed, 2014-08-06 17:14

Day #2 at JSM started quite early as I had to be on site by 7am for the CHANCE editors’ breakfast. No running then, except to Porter metro station. Interesting exchange full of new ideas to keep the journal cruising. In particular, a call for proposals for special issues on sexy topics (reproducible research anyone? I already have some book reviews.). And directions to increase the international scope and readership. And possibly adding or reporting on a data challenge. After this great start, I attended the Bayesian Time Series and Dynamic Models session, where David Scott Matteson from Cornell University presented an extension of the Toronto ambulance data analysis Dawn Woodard had presented in Banff at an earlier workshop. The extension dealt with the spatio-temporal nature of the data, using a mixture model with time-dependent weights that revolved cyclically in an autoexponential manner. And rekindling the interest in the birth-and-death alternative to reversible jump. Plus another talk by Scott Holan mixing Bayesian analysis with frequency data, an issue that always puzzled me. The second session I attended was Multiscale Modeling for Complex Massive Data, with a modelling of brain connections through a non-parametric mixture by David Dunson. And a machine learning talk by Mauro Maggioni on a projection-cum-optimisation technique to fight the curse of dimensionality, who proposed a solution to an optimal transport problem that is much more convincing than the one I discussed a while ago. Unfortunately, this made me miss the Biometrics showcase session, where Debashis Mondal presented a joint work with Julian Besag on Exact Goodness-of-Fit Tests for Markov Chains, and where both my friends Michael Newton and Peter Green were discussants… An idle question that came to me during this last talk was about the existence of particle filters for spatial Markov structures (rather than the usual ones on temporal Markov models).

After a [no-]lunch break spent pondering a conjecture posed to me by Natesh Pillai yesterday, I eventually joined the Feature Allocation session. Eventually, as I basically had to run the entire perimeter of the conference centre! The three talks by Finale Doshi-Velez, Tamara Broderick, and Yuan Ji were all impressive and this may have been my best session so far at JSM! Thanks to Peter Müller for organising it! Tamara Broderick focussed on a generic way to build conjugate priors for non-parametric models, with all talks involving Indian buffets. Maybe a suggestion for tonight’s meal…! (In the end, great local food on Harvard Square.)


Filed under: Statistics, Travel, University life Tagged: Boston, CHANCE, conjugate priors, dynamic model, feature allocation model, Harvard Square, JSM 2014, Markov random field, optimal transport
Categories: Bayesian Bloggers

JSM 2014, Boston

Xian's Og - Tue, 2014-08-05 17:14

A new Joint Statistical Meeting (JSM), my first one since JSM 2011 in Miami Beach. After solving [or not] a few issues on the home front (late arrival, one lost bag, morning run, flat in a purely residential area with no grocery store nearby and hence no milk for tea!), I “trekked” to [and then through] the faraway and sprawling Boston Convention Centre and was there in (plenty of) time for Mathias Drton’s Medallion Lecture on linear structural equations. (The room was small and crowded and I was glad to be there early enough, although there were no Cerberus [Cerberi?] to prevent additional listeners from sitting on the ground, as in Washington D.C. a few years ago.) The award was delivered to Mathias by Nancy Reid from Toronto (and reminded me of my own Medallion Lecture in exotic Fairbanks ten years ago). I had alas missed Gareth Roberts’ Blackwell Lecture on Rao-Blackwellisation, as I was still in the plane from Paris, trying to cut down my slides and to spot known Icelandic locations from glancing sideways at the movie The Secret Life of Walter Mitty played on my neighbour’s screen. (Vik?)

Mathias started his wide-ranging lecture by linking linear structural models with graphical models and specific features of covariance matrices. I did not spot a motivation for the introduction of confounding factors, a point that always puzzles me in this literature [as I must have repeatedly mentioned here]. The “reality check” slide made me hopeful but it was mostly about causality [another one of, or the same as, my stumbling blocks]… What I have trouble understanding is how much results from the modelling and how much follows from this “reality check”. A novel notion (to me) revealed by the talk was the “trek rule“, expressing the covariance between variables as a sum over “treks” (sequences of edges) linking those variables. This is not a new notion, as it goes back to Wright (1921), but it is a very elegant representation of the matrix inversion of (I−Λ) as a power series. Mathias made it sound quite intuitive even though I would have difficulties rephrasing the principle solely from memory! It made me [vaguely] wonder at computational implications for simulation of posterior distributions on covariance matrices, although I missed the fundamental motivation for those mathematical representations. The last part of the talk was a series of mostly open questions about the maximum likelihood estimation of covariance matrices, from existence to unimodality to likelihood-ratio tests. And an interesting instance of favouring bootstrap subsampling. As in random forests.
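
From memory, and so as a sketch rather than Mathias’ exact formulation: for a linear structural equation model with coefficient matrix Λ and error covariance Ω, the power-series representation behind the trek rule can be written as

    X = \Lambda^{\mathsf{T}} X + \varepsilon, \qquad
    \Sigma = \operatorname{Cov}(X) = (I-\Lambda)^{-\mathsf{T}}\,\Omega\,(I-\Lambda)^{-1}, \qquad
    (I-\Lambda)^{-1} = I + \Lambda + \Lambda^{2} + \cdots

(the series converging e.g. for acyclic models, where Λ is nilpotent), so that each power of Λ collects directed paths of a given length and each entry of Σ expands as a sum over treks of the product of the edge coefficients along the trek and the variance of its common source.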

I also attended the ASA Presidential address of Stephen Stigler on the seven pillars of statistical wisdom. In connection with T.E. Lawrence’s 1927 book. (Actually, 1922.) Itself in connection with Proverbs IX:1. Unfortunately wrongly translated as seven pillars rather than seven sages.  Here are Stephen’s pillars:

  1. aggregation, which leads to gaining information by throwing away information, aka the sufficiency principle [one may wonder at the extension of this principle to non-exponential families]
  2. information accumulating at the √n rate, aka precision of statistical estimates, aka CLT confidence [quoting our friend de Moivre at the core of this discovery; see the one-line reminder after this list]
  3. likelihood as the right calibration of the amount of information brought by a dataset [including Bayes' essay]
  4. intercomparison [i.e. scaling procedures from variability within the data, sample variation], eventually leading to the bootstrap
  5. regression [linked with Darwin's evolution of species, albeit paradoxically] as conditional expectation, hence as a Bayesian tool
  6. design of experiment [enters Fisher, with his revolutionary vision of changing all factors in Latin square designs]
  7. residuals [aka goodness of fit but also ABC!]
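
As a one-line reminder of what pillar #2 amounts to, for an i.i.d. sample with variance σ²,

    \operatorname{Var}(\bar X_n) = \frac{\sigma^2}{n}
    \quad\Longrightarrow\quad
    \operatorname{sd}(\bar X_n) = \frac{\sigma}{\sqrt{n}},

so that halving the standard error of the mean requires four times as many observations.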

Maybe missing the positive impact of the arbitrariness of picking or imposing a statistical model upon an observed dataset. Or maybe not, as it is somewhat covered by #3, #4 and #7. The reliance on the reproducibility of the data could be the ground on which those pillars stand.


Filed under: Books, Mountains, pictures, Running, Statistics, Travel, University life Tagged: ABC, Abraham De Moivre, Boston, Cambridge, graphical models, Harvard University, JSM 2014, likelihood, Massachusetts, residuals, Stephen Stigler, T.E. Lawrence, trek rule theorem
Categories: Bayesian Bloggers

JSM 2014, Boston

Xian's Og - Tue, 2014-08-05 11:50


Filed under: pictures, Statistics, Travel
Categories: Bayesian Bloggers

likelihood-free inference via classification

Xian's Og - Mon, 2014-08-04 17:14

Last week, Michael Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander posted on arXiv the latest version of the paper they had presented at MCMSki 4. As indicated by its (above) title, it suggests implementing ABC based on classification tools, thus making it somewhat connected to our recent random forest paper.

The starting idea in the paper is that datasets generated from distributions with different parameters should be easier to classify than datasets generated from distributions with the same parameters, and that classification accuracy naturally induces a distance between datasets and between the parameters behind those datasets. We had followed some of the same track when we started using random forests, before realising that for our model choice setting, proceeding the entire ABC way once the random forest procedure had been constructed was counter-productive. Random forests are just too deadly efficient as model choice machines to try to compete with them through an ABC post-processing. Performances are just… Not. As. Good!
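
My reading of the classification-as-distance idea, in a minimal R sketch (not the authors’ code; obs_data and sim_data are hypothetical matrices of observations with matching columns, and I use a random forest where any classifier would do):

    library(randomForest)

    ## Discrepancy between an observed and a simulated dataset, measured by how
    ## easily a classifier tells them apart: accuracy near 0.5 means the two are
    ## indistinguishable, accuracy near 1 means the parameters are far apart.
    classif_distance <- function(obs_data, sim_data) {
      dat <- data.frame(rbind(obs_data, sim_data),
                        lab = factor(rep(c("obs", "sim"),
                                         c(nrow(obs_data), nrow(sim_data)))))
      rf <- randomForest(lab ~ ., data = dat)
      1 - rf$err.rate[rf$ntree, "OOB"]   # out-of-bag classification accuracy
    }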

A side question: I have obviously never thought about this before, but why is the naïve Bayes classification rule so called?! It never sounded very Bayesian to me to (a) use the true value of the parameter and (b) average the classification performances. Interestingly, the authors (i) show near-identical performances for other classification methods (Fig. 2) and (ii) note an exception for MA time series: when we first experimented with random forests, raw data from an MA(2) model was used to select between MA(1) and MA(2) models, and the performances of the resulting random forest were quite poor.

Now, one opposition between our two approaches is that Michael and his coauthors also include point estimation within the range of classification-based ABC inference. As we stressed in our paper, we restrict the range to classification and model choice because we do not think those machine learning tools are stable and powerful enough to perform regression and posterior probability approximation. I also see a practical weakness in the estimation scheme proposed in this new paper, namely that the Monte Carlo error gets in the way of the consistency theorem, and possibly of the simulation method itself. Another remark is that, while the authors compare the fit produced by different classification methods, there should be a way to aggregate them towards higher efficiency. Returning once more to our random forest paper, we saw improved performances each time we included a reference method, from LDA to SVMs. It would be interesting to see a (summary) variable selection version of the proposed method. A final remark is that computing time and effort do not seem to get mentioned in the paper (unless Indian jetlag confuses me more than usual). I wonder how fast the computing effort grows with the sample size to reach parametric and quadratic convergence rates.


Filed under: Books, Mountains, pictures, Statistics, Travel, University life Tagged: ABC, Chamonix, classification, MCMSki IV, random forests, summary statistics
Categories: Bayesian Bloggers

a thesis on random forests

Xian's Og - Sun, 2014-08-03 17:14

During a session of the IFCAM workshop this morning I noticed a new arXiv posting on random forests. Entitled Understanding Random Forests: From Theory to Practice, it actually corresponds to a PhD thesis written by Gilles Louppe on the topic, at the Université de Liège, Belgie/Belgium/Belgique. In this thesis, Gilles Louppe provides a rather comprehensive coverage of the random forest methodology, from specific bias-variance decompositions and convergence properties, to the historical steps towards random forests, to implementation details and recommendations, to describing how to rank (co)variates by order of importance. The last point was of particular relevance for our current work on ABC model choice with random forests, as we rely on the frequency of appearance of a given variable to label its importance. The thesis showed me this was not a great way of selecting covariates, as it does not account for correlation and could easily miss important covariates. It is a very complete, well-written and beautifully LaTeXed thesis (with fancy grey boxes and all that jazz!). As part of his thesis work, Gilles Louppe also contributed to the open source machine learning library scikit-learn. The thesis thus makes a most profitable and up-to-date entry into the topic of random forests…
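
As a toy illustration of the correlation issue mentioned above (made-up data; the R randomForest package reports Gini- and permutation-based importances rather than the frequency-of-use criterion, but the dilution effect is similar): duplicating an informative covariate splits its importance across the two copies, so a ranking by importance alone can understate its relevance.

    library(randomForest)

    set.seed(1)
    n  <- 500
    x1 <- rnorm(n)                   # informative covariate
    x2 <- x1 + rnorm(n, sd = 0.1)    # near-copy of x1
    x3 <- rnorm(n)                   # pure noise
    y  <- factor(rbinom(n, 1, plogis(2 * x1)))
    rf <- randomForest(data.frame(x1, x2, x3), y, importance = TRUE)
    importance(rf)   # x1 and x2 share the importance that x1 alone deserves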


Filed under: Books, Kids, Statistics, University life Tagged: Belgium, Liège, machine learning, PhD thesis, random forests
Categories: Bayesian Bloggers

Informative g-Priors for Logistic Regression

Timothy E. Hanson, Adam J. Branscum, Wesley O. Johnson
Categories: Bayesian Analysis

Cluster Analysis, Model Selection, and Prior Distributions on Models

George Casella, Elias Moreno and F.J. Giron
Categories: Bayesian Analysis

Toward Rational Social Decisions: A Review and Some Results

Joseph B. Kadane and Steven N. MacEachern
Categories: Bayesian Analysis