## Bayesian News Feeds

### speeding up MCMC

**J**ust before I left for Iceland, Matias Quiroz, Mattias Villani and Robert Kohn arXived a paper entitled “speeding up MCMC by efficient data subsampling”. Somewhat connected with the earlier papers by Korattikara et al., and Bardenet et al., both discussed on the ‘Og, the idea is to replace the log-likelihood by an unbiased subsampled version and to correct for the resulting bias of the exponentiation of this (Horvitz-Thompson or Hansen-Hurwitz) estimator. They ground their approach within the (currently cruising!) pseudo-marginal paradigm, even though their likelihood estimates are not completely unbiased. Since the optimal weights in the sampling step are proportional to the log-likelihood terms, they need to build a surrogate of the true likelihood, using either a Gaussian process or a spline approximation. This is all in all a very interesting contribution to the on-going debate about increasing MCMC speed when dealing with large datasets and ungainly likelihood functions. The proposed solution however has a major drawback in that the entire dataset must be stored at all times to ensure unbiasedness. For instance, the paper considers a bivariate probit model with a sample of 500,000 observations. Which must be available at all times. Further, unless I am confused, the subsampling step requires computing the surrogate likelihood for all observations, before running the subsampling step, another costly requirement.
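To make the subsampling idea concrete, here is a minimal R sketch of a Hansen-Hurwitz estimate of the log-likelihood, under assumptions of my own: a toy normal model, a surrogate obtained by evaluating the terms at a nearby parameter value (`theta0 = 0.4`, a stand-in for the Gaussian process or spline surrogate), and a normal-approximation correction for the exponentiation bias. None of these choices is taken from the paper.

```r
set.seed(1)
n <- 1000
x <- rnorm(n, mean = 1, sd = 1)   # toy data (assumption)
theta <- 0.5                      # parameter value where the likelihood is needed

# exact log-likelihood terms; the full vector is only used to check the estimator
loglik_terms <- dnorm(x, mean = theta, sd = 1, log = TRUE)

# cheap surrogate of the terms (assumption: terms evaluated at theta0 = 0.4)
surrogate <- dnorm(x, mean = 0.4, sd = 1, log = TRUE)
p <- abs(surrogate) / sum(abs(surrogate))   # sampling probabilities

# Hansen-Hurwitz estimator: m draws with replacement, each term weighted by 1/p
hh_estimate <- function(m) {
  u <- sample.int(n, size = m, replace = TRUE, prob = p)
  ratios <- loglik_terms[u] / p[u]
  est <- mean(ratios)             # unbiased for sum(loglik_terms)
  s2 <- var(ratios) / m           # estimated variance of est
  # exp(est) is biased upwards (Jensen); under a normal approximation,
  # subtracting s2 / 2 on the log scale compensates for that bias
  c(loglik_hat = est, var_hat = s2, corrected = est - s2 / 2)
}

reps <- t(replicate(2000, hh_estimate(m = 50)))
true_loglik <- sum(loglik_terms)
c(true = true_loglik, mean_of_estimates = mean(reps[, "loglik_hat"]))
```

Note that the sampling probabilities `p` require the surrogate at all `n` observations, which is exactly the costly requirement mentioned above.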

Filed under: Books, Mountains, pictures, Running, Statistics, Travel, University life Tagged: AISTATS 2014, Gaussian processes, Hansen-Hurwitz estimator, Horvitz-Thompson estimator, Iceland, pseudo-marginal MCMC, unbiasedness

### Jeffreys prior with improper posterior

**I**n a complete coincidence with my visit to Warwick this week, I became aware of the paper “Inference in two-piece location-scale models with Jeffreys priors” recently published in Bayesian Analysis by Francisco Rubio and Mark Steel, both from Warwick. Paper where they exhibit a closed-form Jeffreys prior for the skewed distribution

where f is a symmetric density, namely

where

only to show immediately after that this prior does not allow for a proper posterior, no matter what the sample size is. While the above skewed distribution can always be interpreted as a mixture, being a weighted sum of two terms, it is not strictly speaking a mixture, if only because the “component” can be identified from the observation (depending on which side of μ it stands). The likelihood is therefore a product of simple terms rather than a product of a sum of two terms.

**A**s a solution to this conundrum, the authors consider the alternative of the “independent Jeffreys priors”, which are made of a product of conditional Jeffreys priors, i.e., by computing the Jeffreys prior one parameter at a time with all other parameters considered to be fixed. Which differs from the reference prior, of course, but would have been my second choice as well. Despite criticisms expressed by José Bernardo in the discussion of the paper… The difficulty (in my opinion) resides in the choice (and difficulty) of the parameterisation of the model, since those priors are not parameterisation-invariant. (Xinyi Xu makes the important comment that even those priors incorporate strong if hidden information. Which relates to our earlier discussion with Kaniav Kamary on the “dangers” of prior modelling.)

**A**lthough the outcome is puzzling, I remain just slightly sceptical of the income, namely Jeffreys prior and the corresponding Fisher information: the fact that the density involves an indicator function and is thus discontinuous in the location μ at the observation x makes the likelihood function not differentiable and hence the derivation of the Fisher information not strictly valid. Since the indicator part cannot be differentiated. Not that I am seeing the Jeffreys prior as the ultimate grail for non-informative priors, far from it, but there is definitely something specific in the discontinuity in the density. (In connection with the latter point, Weiss and Suchard deliver a highly critical commentary on the non-need for reference priors and the preference given to a non-parametric Bayes primary analysis. Maybe making the point towards a greater convergence of the two perspectives, objective Bayes and non-parametric Bayes.)

**T**his paper and the ensuing discussion about the properness of the Jeffreys posterior reminded me of our earliest paper on the topic with Jean Diebolt. Where we used improper priors on location and scale parameters but prohibited allocations (in the Gibbs sampler) that would lead to fewer than two observations per component, thereby ensuring that the (truncated) posterior was well-defined. (This feature also remained in the Series B paper, submitted at the same time, namely mid-1990, but only published in 1994!) Larry Wasserman proved ten years later that this truncation led to consistent estimators, but I had not thought about it in a very long while. I still like this notion of forcing some (enough) datapoints into each component for an allocation (of the latent indicator variables) to be an acceptable Gibbs move. This is obviously not compatible with the iid representation of a mixture model, but it expresses the requirement that components all have a meaning *in terms of the data*, namely that all components contributed to generating a part of the data. This translates as a form of weak prior information on how much we trust the model and how meaningful each component is (in opposition to adding meaningless extra-components with almost zero weights or almost identical parameters).

**A**s a marginalia, the insistence in Rubio and Steel’s paper that all observations in the sample be different also reminded me of a discussion I wrote for one of the Valencia proceedings (Valencia 6 in 1998) where Mark presented a paper with Carmen Fernández on this issue of handling duplicated observations modelled by absolutely continuous distributions. (I am afraid my discussion is not worth the $250 price tag given by amazon!)

Filed under: Books, Statistics, University life Tagged: finite mixtures, improper posteriors, improper priors, Jean Diebolt, Jeffreys priors, Larry Wasserman, non-informative priors, properness, reference priors, skewed distribution, Valencia conferences

### the ice princess [book review]

**T**his week in Warwick, I read *The Ice Princess*, the first novel of Camilla Lackberg and a book I purchased in Toronto last Fall. I remember seeing the novel fairly frequently in the Paris métro a few years ago and, judging from the banner on top of my edition (“7 million books sold”), it was not only popular in Paris… I actually fail to understand why. Indeed, the plot sounds like a beginner level exercise in a creative writing class, with all possible memes of a detective story appearing together, from suicide, to adultery, to paedophilia, to rich inheritors, to domestic violence, to incompetent bosses, to small town gossip, etc., etc. The hidden story that is central to explain the murder(s) is just unbelievable, as are some of the related subplots. And the style is appalling: the two main protagonists are withholding clues and information from the reader, their love affair takes hundreds of pages to unravel, the sentences are often unnatural, or repetitive, some characters are so clichéd as to be ultimately unbelievable. Negatives just pile up so high it is laughable. And unbelievable the book got so popular. Or received prizes. Like the 2008 Grand Prix de Littérature Policière for Best International Crime Novel… (Prize which picked in other times major writers like Patricia Highsmith, Chester Himes, John Dickson Carr, Eric Ambler, Manuel Vázquez Montalbán, Tony Hillerman, P. D. James, Ian Rankin, and Arnaldur Indriðason.) Anyway, this was a very poor beginning to a highly successful series and I am glad I read *The Hidden Child* before *The Ice Princess*, as the former had more depth and a much better plot than this first novel.

Filed under: Books, Kids, Travel Tagged: Arnaldur Indriðason, Camilla Lackberg, detective stories, Eric Ambler, Ian Rankin, Manuel Vázquez Montalbán, Sweden, Tony Hillerman

### bridging the gap between machine learning and statistics

**T**oday in Warwick, I had a very nice discussion with Michael Betancourt on many statistical and computational issues but at one point in the conversation we came upon the trouble of bridging the gap between the machine learning and statistics communities. While a conference like AISTATS is certainly contributing to this, it does not reach the main bulk of the statistics community. Since, in Reykjavik, we had discussed the corresponding difficulty of people publishing a longer and “more” statistical paper in a “more” statistical journal, once the central idea was published in a machine learning conference proceeding like NIPS or AISTATS, we had this idea that creating a special fast-track in a mainstream statistics journal for a subset of those papers, using for instance a tailor-made committee in that original conference, or creating an annual survey of the top machine learning conference proceedings rewritten in a “more” statistical way (and once again selected by an ad hoc committee) would help, at not too much of a cost for inducing machine learners to make the extra-effort of switching to another style. From there, we enlarged the suggestion to enlist a sufficient number of (diverse) bloggers in each major conference towards producing quick but sufficiently informative entries on their epiphany talks (if any), possibly supported by the conference organisers or the sponsoring societies. (I am always happy to welcome any guest blogger in conferences I attend!)

Filed under: pictures, Statistics, Travel, University life Tagged: AISTATS, blogging, machine learning, NIPS, Reykjavik, statistics journals, University of Warwick

### big’MC’minar next week

**T**he next big’MC seminar in Paris will be delivered on Thursday, May 15, by

**15 h : Luke Bornn, Towards the Derandomization of Markov chain Monte Carlo**

**16 h 15 : Pierre Jacob, On exact inference and unbiased estimation**

see the seminar webpage for more details. And make sure to attend if in or near Paris! It is definitely big and MC. Most sadly (for us!), Chris Holmes will give a Smile (Statistical machine learning) seminar at the very same time a few streets away… At least, we can conveniently meet right after for a drink!

Filed under: Kids, Statistics, Travel, University life Tagged: Big'MC, Institut Henri Poincaré, Luke Bornn, Monte Carlo Statistical Methods, Panthéon, Paris, Pierre Jacob, seminar

### stopping rule impact

**H**ere is a question from my friend Shravan Vasishth about the consequences of using a stopping rule:

Psycholinguists and psychologists often adopt the following type of data-gathering procedure: The experimenter gathers n data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (n more data points). Since time and money are limited, he might decide to stop anyway at sample size, say, some multiple of n. One can play with different scenarios here. A typical n might be 10 or 15.

This approach would give us a distribution of t-values and p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses (I also track the t-statistic, in order to compare it with my stopping rule code below).

Here’s a simulation showing what happens. I wanted to ask you whether this simulation makes sense. I assume here that the experimenter gathers 10 data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (10 more data points). Since time and money are limited, he might decide to stop anyway at sample size 60. This gives us p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses:

    ## Standard:
    pvals<-NULL
    tstat_standard<-NULL
    n<-10        # sample size
    nsim<-1000   # number of simulations
    stddev<-1    # standard dev
    mn<-0        ## mean
    for(i in 1:nsim){
      samp<-rnorm(n,mean=mn,sd=stddev)
      pvals[i]<-t.test(samp)$p.value
      tstat_standard[i]<-t.test(samp)$statistic}
    ## Type I error rate: about 5% as theory says:
    table(pvals<0.05)[2]/nsim

But the situation quickly deteriorates as soon as I adopt the strategy I outlined above:

    pvals<-NULL
    tstat<-NULL
    ## how many subjects can I run?
    upper_bound<-n*6
    for(i in 1:nsim){
      ## at the outset we have no significant result:
      significant<-FALSE
      ## null hyp is going to be true,
      ## so any rejection is a mistake.
      ## take sample:
      x<-rnorm(n,mean=mn,sd=stddev)
      while(!significant & length(x)<upper_bound){
        ## if not significant:
        if(t.test(x)$p.value>0.05){
          ## get more data:
          x<-append(x,rnorm(n,mean=mn,sd=stddev))
          ## otherwise stop:
        } else {significant<-TRUE}}
      ## will be either significant or not:
      pvals[i]<-t.test(x)$p.value
      tstat[i]<-t.test(x)$statistic}

Now let’s compare the distribution of the t-statistic in the standard case vs with the above stopping rule. We get fatter tails with the above stopping rule, as shown by the histogram below.

Is this a correct way to think about the stopping rule problem?

**T**o which I replied the following:

By adopting a stopping rule on a random iid sequence, you favour values in the sequence that agree with your stopping condition, hence modify the distribution of the outcome. To take an extreme example, if you draw N(0,1) variates *until* the empirical average is between -2 and 2, the average thus produced cannot remain N(0,1/n) but must have a different distribution.

The p-value of the t-test you build from your experiment is no longer distributed as a uniform variate because of the stopping rule: the sample (x1,…,x10m) (with random size 10m, resulting from increases in the sample size by adding 10 more observations at a time) is distributed from

if 10m<60 [assuming the maximal acceptable sample size is 60] and from

otherwise. The histogram at the top of this post is the empirical distribution of the average of those observations, clearly far from a normal distribution.
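To put a number on the effect, here is a compact R sketch of the same experiment that returns the rejection rates directly. It is a rewrite under the same assumptions as the question (N(0,1) data, batches of 10, a cap at 60, a 5% level), not the code used for the post; the function name `sequential_reject` and the number of replications are my own choices.

```r
set.seed(42)

# one experiment under the null: add `batch` observations at a time,
# test after each batch, stop at the first p < alpha or at `max_n`
sequential_reject <- function(batch = 10, max_n = 60, alpha = 0.05) {
  x <- rnorm(batch)
  while (t.test(x)$p.value >= alpha && length(x) < max_n) {
    x <- c(x, rnorm(batch))
  }
  t.test(x)$p.value < alpha   # TRUE means a (false) rejection
}

nsim <- 2000
# fixed sample size: a single test on 10 observations
fixed_rate <- mean(replicate(nsim, t.test(rnorm(10))$p.value < 0.05))
# optional stopping: up to six looks at the data
sequential_rate <- mean(replicate(nsim, sequential_reject()))

c(fixed = fixed_rate, sequential = sequential_rate)
```

The first rate should stay near the nominal 5%, while the second one accumulates the chances of a false rejection over the successive looks.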

Filed under: Books, R, Statistics, University life Tagged: bias, p-values, probability theory, stopping rule

### snapshot from Kenilworth

### art brut

### sunrise over Warwickshire

Filed under: pictures, Running, Travel Tagged: countryside, Kenilworth, sunrise, University of Warwick

### efficient implementation of MCMC when using an unbiased likelihood estimator

**I** read this paper by Arnaud Doucet, Mike Pitt, George Deligiannidis and Robert Kohn, re-arXived last month, when travelling to Warwick this morning. In very pleasant weather, on both sides of the Channel. *(Little was I aware then that it was a public (“bank”) holiday in the UK and hence that the department here would be empty of people.)* Actually, Mike had already talked with me about it during my previous visit to Warwick, as the proof in the paper is making use of our vanilla Rao-Blackwellisation paper, by considering the jump kernels associated with the original kernels.

**T**he purpose of the paper is to determine the precision of (i.e., the number of terms N in) an unbiased estimation of the likelihood function in order to minimise the asymptotic variance of the corresponding Metropolis-Hastings estimate. For a given total number of simulations. While this is a very pertinent issue with pseudo-marginal and particle MCMC algorithms, I would overall deem the paper to be more theoretical than methodological in that it relies on special assumptions like a known parametric family for the distribution of the noise in the approximation of the log-likelihood and independence (of this distribution) from the parameter value. The central result of the paper is that the number of terms N should be such that the variance of the log-likelihood estimator is around 1. Definitely a manageable target. (The above assumptions are used to break the Metropolis-Hastings acceptance probability in two independent parts and to run two separate acceptance checks. Ending up with an upper bound on the asymptotic variance.)
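As an illustration of how such a target could be used in practice, here is a minimal R sketch that tunes N so that the variance of the log-likelihood estimator is close to 1. Everything in it is an assumption of mine rather than the authors' procedure: a toy latent Gaussian model, a plain Monte Carlo estimate of each marginal density, a brute-force pilot estimate of the variance at a single parameter value, and a rescaling that assumes the variance decays like 1/N.

```r
set.seed(7)
theta <- 0
# toy data (assumption): y = z + noise with z ~ N(theta, 1) and N(0, 1) noise
y <- rnorm(20, mean = theta, sd = sqrt(2))

# unbiased Monte Carlo estimate of the likelihood, returned on the log scale
loglik_hat <- function(theta, N) {
  sum(sapply(y, function(yi) {
    z <- rnorm(N, mean = theta, sd = 1)   # fresh latent draws per observation
    log(mean(dnorm(yi, mean = z, sd = 1)))
  }))
}

# variance of the log-likelihood estimator at a given N, by replication
var_loglik <- function(theta, N, reps = 200) {
  var(replicate(reps, loglik_hat(theta, N)))
}

# pilot run at N0, then rescale assuming the variance is proportional to 1/N
tune_N <- function(theta, N0 = 50, target = 1) {
  v0 <- var_loglik(theta, N0)
  max(1L, as.integer(ceiling(N0 * v0 / target)))
}

N_star <- tune_N(theta)
c(N = N_star, variance_at_N = var_loglik(theta, N_star))
```

The tuning is done at one value of θ only, which leans on the same assumption as the paper, namely that the distribution of the noise does not depend on the parameter.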

Filed under: pictures, Statistics, Travel, University life Tagged: air pictures, bank holiday, Birmingham, England, University of Warwick

### AISTATS 2014 (tee-shirt)

**I**t took me a fairly long while to realise there was a map of Iceland as a tag-cloud at the back of the AISTATS 2014 tee-shirt! As it was far too large for me, I thought about leaving it at the conference desk last week. I did bring it back for someone the proper size though and discovered the above when unfolding the tee… Nice but still not my size!

Filed under: Kids, pictures, Statistics, Travel, University life Tagged: AISTATS 2014, ash cloud, Iceland, Reykjavik, tag cloud, tee-shirt

### Le Monde puzzle [#865]

**A** Le Monde mathematical puzzle in combinatorics:

*Given a permutation σ of {1,…,5}, if σ(1)=n, the n first values of σ are inverted. If the process is iterated until σ(1)=1, does this always happen and if so what is the maximal number of iterations? Solve the same question for the set {1,…,2014}.*

**I** ran the following basic R code:

obtaining 7 as the outcome. Here is the evolution of the maximum as a function of the number of terms in the set. If we push the regression to N=2014, the predicted value is around 600,000… Running a million simulations of the above only gets me to 23,871!

**A** wee minutes of reflection lead me to conjecture that the maximum number of steps w(N) should satisfy w(N)=w(N-1)+N-2. However, the values resulting from the simulations do not grow as fast. (And, as Jean-Louis Foulley commented, it does not even work for N=6.) Monte Carlo effect or true discrepancy?
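For readers who want to experiment, here is a minimal R sketch of the process described in the puzzle. It is a reconstruction of mine, not the code used for the post: the function names are hypothetical, the exact maximum is obtained by enumerating all permutations (feasible for small N only), and the Monte Carlo version merely returns a lower bound on the maximum.

```r
# number of reversals until the first entry equals 1
count_steps <- function(sigma) {
  steps <- 0
  while (sigma[1] != 1) {
    n <- sigma[1]
    sigma[1:n] <- rev(sigma[1:n])   # invert the first n values
    steps <- steps + 1
  }
  steps
}

# all permutations of a vector, one per row (small N only)
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) {
    cbind(v[i], all_perms(v[-i]))
  }))
}

# exact maximum by exhaustive enumeration
max_steps_exact <- function(N) max(apply(all_perms(1:N), 1, count_steps))

# Monte Carlo lower bound on the maximum, for N too large to enumerate
max_steps_mc <- function(N, nsim = 1e4) {
  max(replicate(nsim, count_steps(sample(N))))
}

max_steps_exact(5)   # 7, matching the outcome reported above
```

Exhaustive enumeration at N=6 returns 10 rather than the 11 predicted by the conjectured recursion, in line with the comment above, while random permutations for large N are unlikely to hit the worst cases, which goes some way towards explaining the gap between simulation and extrapolation.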

Filed under: Books, Kids, R Tagged: Le Monde, mathematical puzzle, R

### skyndimynd frá Íslandi (#9)

Filed under: Kids, Mountains, pictures, Travel Tagged: AISTATS 2014, Esja, Iceland, puffins, Reykjanes Peninsula

### a pseudo-marginal perspective on the ABC algorithm

**M**y friends Luke Bornn, Natesh Pillai and Dawn Woodard just arXived along with Aaron Smith a short note on the convergence properties of ABC. When compared with acceptance-rejection or regular MCMC. Unsurprisingly, ABC does worse in both cases. What is central to this note is that ABC can be (re)interpreted as a pseudo-marginal method where the data comparison step acts like an unbiased estimator of the true ABC target (not of the original ABC target, mind!). From there, it is mostly an application of Christophe Andrieu’s and Matti Vihola’s results in this setup. The authors also argue that using a single pseudo-data simulation per parameter value is the optimal strategy (as compared with using several), when considering asymptotic variance. This makes sense in terms of simulating in a larger dimensional space but what of the cost of producing those pseudo-datasets against the cost of producing a new parameter? There are a few (rare) cases where the datasets are much cheaper to produce.
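The pseudo-marginal reading of ABC can be sketched in a few lines of R. This is a toy illustration under my own assumptions (a normal mean with a N(0,10²) prior, the sample mean as summary, a fixed observed summary `s_obs = 1`, a tolerance of 0.1, a random walk proposal), not the setting of the note. The average of M indicators is an unbiased estimate of the ABC likelihood, and M=1 gives back standard ABC-MCMC.

```r
set.seed(3)
n <- 20       # size of each (pseudo-)dataset
s_obs <- 1    # observed summary, the sample mean (toy value, assumption)
eps <- 0.1    # ABC tolerance

# average of M indicators: unbiased for P(|s - s_obs| < eps | theta),
# i.e. for the ABC likelihood
abc_lik_hat <- function(theta, M) {
  # the mean of n N(theta, 1) draws is N(theta, 1/n), so the summaries of
  # M pseudo-datasets can be simulated directly
  s_sim <- rnorm(M, mean = theta, sd = 1 / sqrt(n))
  mean(abs(s_sim - s_obs) < eps)
}

log_prior <- function(theta) dnorm(theta, mean = 0, sd = 10, log = TRUE)

abc_pm_mcmc <- function(niter, M, step = 0.5) {
  theta <- s_obs
  repeat {                      # start from a state with a positive estimate
    L <- abc_lik_hat(theta, M)
    if (L > 0) break
  }
  chain <- numeric(niter)
  for (i in seq_len(niter)) {
    prop <- theta + rnorm(1, sd = step)
    L_prop <- abc_lik_hat(prop, M)
    # pseudo-marginal step: the estimate L attached to the current state
    # is recycled, never refreshed
    ratio <- exp(log_prior(prop) - log_prior(theta)) * L_prop / L
    if (runif(1) < ratio) {
      theta <- prop
      L <- L_prop
    }
    chain[i] <- theta
  }
  chain
}

chain1 <- abc_pm_mcmc(5000, M = 1)     # standard ABC-MCMC
chain10 <- abc_pm_mcmc(5000, M = 10)   # ten pseudo-datasets per proposal
c(mean_M1 = mean(chain1), mean_M10 = mean(chain10))
```

Both chains target the same ABC posterior; what changes with M is the cost per iteration against the stickiness of the chain, which is the trade-off questioned above.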

Filed under: Mountains, pictures, Statistics, University life Tagged: ABC, ABC-MCMC, acceptance rate, Alps, asymptotics, Chamonix, MCMSki IV, pseudo-data, ranking

### The Republic of Thieves [book review]

**A**t last! The third volume in Scott Lynch’s *Gentlemen Bastards* series has appeared! After several years of despairing ever seeing the sequel to *The Lies of Locke Lamora* and of *Red Seas under Red Skies*, *The Republic of Thieves* eventually appeared. The author thus managed to get over his chronic depression to produce a book on par with the previous two volumes… Judging from the many reviews found on the Web, reception ranges from disappointed to ecstatic. I do think this volume is very good, if below the initial *The Lies of Locke Lamora* in terms of freshness and plot. There is consistency in terms of the series, some explanations are provided wrt earlier obscure points, new obscure points are created in preparation for the next volumes, and the main characters broaden and grow in depth and complexity. Mostly.

**T**he book *The Republic of Thieves* is much more innovative than its predecessor from a purely literary viewpoint, with story told within story, with on top of this a constant feedback to the origins of the *Gentlemen Bastards* upper-scale thieves band. The inclusion of a real play whose title is the same as the title of the book is a great idea, albeit not exactly new (from *Cyrano de Bergerac* to *The Wheel of Time* to *The Name of the Wind*), as it gives more coherence to the overall plot. The *Gentlemen Bastards* as depicted along those books are indeed primarily fabulous actors and they manage their heists mostly by clever acting, rather than force and violence. (Covers hence miss the point completely by using weapons and blood.) It thus makes sense that they had had training with an acting troupe… Now, the weakest point in the book is the relationship between the two central characters, Locke Lamora and Sabetha Belacoros. This is rather unfortunate as there are a lot of moments and a lot of pages and a lot of dialogues centred on this relationship! Lynch seems unable to strike the right balance and Locke remains an awkward pre-teen whose apologies infuriate Sabetha at every corner… After the third occurrence of this repeated duo, it gets quickly annoying. The couple only seems to grow up at the very end of the book. At last! Apart from this weakness, the plot is predictable at one level, which sounds like the primary level… *(spoiler?!)* until a much deeper one is revealed, once again in the final pages of the book which, even more than in the previous ones, turn all perspectives upside-down and desperately beg for the next book to appear. Hopefully in less than six years…

Filed under: Books, Kids Tagged: depression, gentlemen bastards, Scott Lynch, the lies of Locke Lamora, the republic of thieves

### Reykjavik street art

Filed under: Kids, pictures, Running, Travel Tagged: AISTATS 2014, Iceland, murals, Reykjavik, street art

### RSS statistical analytics challenge 2014

**G**reat news! The RSS is setting a data analysis challenge this year, sponsored by the Young Statisticians Section and Research Section of the Royal Statistical Society. Details are available on the wordpress website of the Challenge. Registration is open and the Challenge goes live on Tuesday 6 May 2014 for an exciting 6-week competition. (A wee bit of an unfortunate timing for those of us considering submitting a paper to NIPS!) Truly terrific, I have been looking for this kind of event to happen for many years (without finding the momentum to set it rolling…) and hope it will generate a lot of exciting activity *and* replicas in other societies.

Filed under: Kids, R, Statistics, University life, Wines Tagged: data challenge, NIPS, Royal Statistical Society, RSS, Wordpress

### art brut

### skyndimynd frá Íslandi (#8)

Filed under: Kids, Mountains, pictures, Travel Tagged: AISTATS 2014, Eyjafjallajökull, Iceland, red roof houses, Skógafoss, volcanoes, waterfalls

### controlled thermodynamic integral for Bayesian model comparison [reply]

***C**hris Oates wrote the following reply to my Icelandic comments on his paper with Theodore Papamarkou and Mark Girolami, a reply that is detailed enough to deserve a post of its own:*

Thank you Christian for your discussion of our work on the Og, and also for your helpful thoughts in the early days of this project! It might be interesting to speculate on some aspects of this procedure:

(i) Quadrature error is present in all estimates of evidence that are based on thermodynamic integration. It remains unknown how to exactly compute the optimal (variance minimising) temperature ladder “on-the-fly”; indeed this may be impossible, since the optimum is defined via a boundary value problem rather than an initial value problem. Other proposals for approximating this optimum are compatible with control variates (e.g. Grosse et al, NIPS 2013, Friel and Wyse, 2014). In empirical experiments we have found that the second order quadrature rule proposed by Friel and Wyse 2014 leads to substantially reduced bias, regardless of the specific choice of ladder.

(ii) Our experiments considered first and second degree polynomials as ZV control variates. In fact, intuition specifically motivates the use of second degree polynomials: Let us presume a linear expansion of the log-likelihood in θ. Then the implied score function is constant, not depending on θ. The quadratic ZV control variates are, in effect, obtained by multiplying the score function by θ. Thus control variates can be chosen to perfectly correlate with the log-likelihood, leading to zero-variance estimators. Of course, there is an empirical question of whether higher-order polynomials are useful when this Taylor approximation is inappropriate, but they would require the estimation of many more coefficients and in practice may be less stable.

(iii) We require that the control variates are stored along the chain and that their sample covariance is computed after the MCMC has terminated. For the specific examples in the paper such additional computation is a negligible fraction of the total computational cost, so that we did not provide specific timings. When non-differential-geometric MCMC is used to obtain samples, or when the score is unavailable in closed-form and must be estimated, the computational cost of the procedure would necessarily increase.

For the wide class of statistical models with tractable likelihoods, employed in almost all areas of statistical application, the CTI we propose should provide state-of-the-art estimation performance with negligible increase in computational costs.
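As a side illustration of points (ii) and (iii), here is a minimal R sketch of a first-degree zero-variance control variate computed after the run. The assumptions are mine and far simpler than the controlled thermodynamic integral of the paper: iid N(0,1) draws stand in for MCMC output, the score is available in closed form, and the helper `zv_adjust` is a hypothetical name. When the integrand is linear in θ the adjusted values are constant, while a non-linear integrand only sees its variance reduced.

```r
set.seed(11)
theta <- rnorm(5000)   # stand-in for MCMC output from a N(0,1) target (assumption)
score <- -theta        # gradient of the log target density at each draw
z <- -score / 2        # first-degree ZV control variate, zero mean under the target

# regression-adjusted values f + a * z, with the coefficient a estimated
# from the stored draws once the run is over
zv_adjust <- function(f, z) {
  a <- -cov(f, z) / var(z)
  f + a * z
}

f_linear <- -3 + 2 * theta         # integrand linear in theta
f_cubic <- theta + 0.1 * theta^3   # integrand that is not linear in theta

rbind(
  linear = c(raw = var(f_linear), zv = var(zv_adjust(f_linear, z))),
  cubic = c(raw = var(f_cubic), zv = var(zv_adjust(f_cubic, z)))
)
```

The only extra work is one covariance and one variance over the stored draws, which fits the remark that the additional computation is negligible when the score comes for free.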

Filed under: Books, pictures, Running, Statistics, University life Tagged: advanced Monte Carlo methods, arXiv, control variate, Iceland, MCMC algorithms, Monte Carlo Statistical Methods, path sampling, Pima Indians, pMCMC, quadrature rule, Reykjavik, Riemann manifold, thermodynamic integration