## Bayesian Bloggers

### eliminating an important obstacle to creative thinking: statistics…

*“We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking.”*

**A**bout a month ago, David Trafimow and Michael Marks, the current editors of the journal *Basic and Applied Social Psychology*, published an editorial banning all null hypothesis significance testing procedures (acronym-ed into the ugly NHSTP, which sounds like a particularly nasty venereal disease!) from papers published by the journal. My first reaction was “Great! This will bring more substance to the papers by preventing significance fishing and undisclosed multiple testing! Power to the statisticians!” However, after reading the said editorial, I realised it was inspired by a nihilistic anti-statistical stance, backed by an apparent lack of understanding of the nature of statistical inference, rather than by a call for saner and safer statistical practice. The editors most clearly state that inferential statistical procedures are no longer needed to publish in the journal, only “strong descriptive statistics”. Maybe to keep in tune with the “Basic” in the name of the journal!

*“In the NHSTP, the problem is in traversing the distance from the probability of the finding, given the null hypothesis, to the probability of the null hypothesis, given the finding. Regarding confidence intervals, the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval.”*

The above quote could be a motivation for a Bayesian approach to the testing problem, a revolutionary stance for journal editors!, but it only illustrates that the editors wish for a procedure that would eliminate the uncertainty inherent to statistical inference, i.e., to decision making under… erm, uncertainty: *“The state of the art remains uncertain.”* To fail to separate significance from certainty is fairly appalling from an epistemological perspective and should be a case for impeachment, were any such thing to exist for a journal board. This means the editors cannot distinguish data from parameter and model from reality! Even more fundamentally, to bar statistical procedures from being used in a scientific study is nothing short of reactionary. While encouraging the inclusion of data is a step forward, restricting the validation or invalidation of hypotheses to gazing at descriptive statistics is many steps backward and completely jeopardizes the academic reputation of the journal, whose editorial may end up being its last quoted paper. Is deconstruction now reaching psychology journals?! To quote from a critic of this approach, “Thus, the general weaknesses of the deconstructive enterprise become self-justifying. With such an approach I am indeed not sympathetic.” (Searle, 1983).
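The gap the earlier quote points at, between the probability of the finding given the null and the probability of the null given the finding, is nothing but Bayes' theorem at work; a minimal numerical sketch, with all numbers invented for illustration (none come from the editorial):

```python
# Illustrative only: invented numbers, not taken from the BASP editorial.
# P(D|H0): probability of data this extreme under the null (the p-value-like quantity).
p_data_given_h0 = 0.05
# P(D|H1): probability of the same data under a specific alternative.
p_data_given_h1 = 0.30
# Prior probability of the null.
prior_h0 = 0.5

# Bayes' theorem: P(H0|D) = P(D|H0)P(H0) / [P(D|H0)P(H0) + P(D|H1)P(H1)]
posterior_h0 = (p_data_given_h0 * prior_h0) / (
    p_data_given_h0 * prior_h0 + p_data_given_h1 * (1 - prior_h0))

print(round(posterior_h0, 3))  # 0.143: quite different from the 0.05 one might naively read off
```

The point is only that the two probabilities differ, and that traversing from one to the other requires a prior and an alternative, which is precisely the Bayesian resolution the editors decline to adopt.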

*“The usual problem with Bayesian procedures is that they depend on some sort of Laplacian assumption to generate numbers where none exist (…) With respect to Bayesian procedures, we reserve the right to make case-by-case judgments, and thus Bayesian procedures are neither required nor banned from BASP.”*

The section on Bayesian approaches tries to be sympathetic to the Bayesian paradigm but again reflects the poor understanding of the authors. By “Laplacian assumption”, they mean Laplace’s Principle of Indifference, i.e., the use of uniform priors, which has not been seriously considered as a sound principle since the mid-1930’s. Except maybe in recent papers of Trafimow. I also love the notion of “generat[ing] numbers where none exist”, as if the prior distribution had to be grounded in some physical reality! Although it is meaningless, it has some poetic value… (Plus, bringing Popper and Fisher to the rescue sounds like shooting Bayes himself in the foot.) At least, the fact that the editors will consider Bayesian papers on a case-by-case basis indicates they may engage in a subjective Bayesian analysis of each paper rather than using an automated p-value against the 100% rejection bound!

*[Note: this entry was suggested by Alexandra Schmidt, current ISBA President, towards an incoming column on this decision of Basic and Applied Social Psychology for the ISBA Bulletin.]*

Filed under: Books, Kids, Statistics, University life Tagged: Basic and Applied Social Psychology, Bayesian hypothesis testing, confidence intervals, editor, ISBA, ISBA Bulletin, Karl Popper, NHSTP, null hypothesis, p-values, Pierre Simon de Laplace, Principle of Indifference, Thomas Bayes, xkcd

### Edmond Malinvaud (1923-2015)

**T**he statistician, econometrician, macro- and micro-economist, Edmond Malinvaud died on Saturday, March 7. He had been director of my alma mater ENSAE (1962–1966), directeur de la Prévision at the Finance Department (1972–1974), director of INSEE (1974–1987), and Professeur at Collège de France (1988–1993). While primarily an economist, with his theories of disequilibrium and unemployment, reflected in his famous book Théorie macro-économique (1981) that he taught us at ENSAE, he was also instrumental in shaping the French econometrics school, see his equally famous Statistical Methods of Econometrics (1970), and in the reorganisation of INSEE as the post-war State census and economic planning tool. He was also an honorary Fellow of the Royal Statistical Society and the 1981 president of the International Statistical Institute. Edmond Malinvaud studied under Maurice Allais, Nobel Prize in economics in 1988, and was himself considered as a potential Nobel for several years. My personal memories of him at ENSAE and CREST are of a very clear teacher and of a kind and considerate man, with the reserve and style of a now-bygone era…

Filed under: Books, Kids, Statistics, University life Tagged: Collège de France, CREST, disequilibrium, econometrics, Edmond Malinvaud, ENSAE, INSEE, macroeconomics, Maurice Allais

### ABC of simulation estimation with auxiliary statistics

*“In the ABC literature, an estimator that uses a general kernel is known as a noisy ABC estimator.”*

**A**nother arXival relating M-estimation econometrics techniques with ABC. Written by Jean-Jacques Forneron and Serena Ng from the Department of Economics at Columbia University, the paper tries to draw links between indirect inference and ABC, following the tracks of Drovandi and Pettitt [not quoted there], and proposes a *reverse* ABC sampler that proceeds by

- given a realisation of the randomness, ε, creating a *one-to-one* transform of the parameter θ that corresponds to a realisation of the summary statistic;
- determining the value of the parameter θ that minimises the distance between this summary statistic and the observed summary statistic;
- weighting the above value of the parameter θ by π(θ) J(θ), where J is the Jacobian of the one-to-one transform.
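As a reading aid, here is how I understand these three steps on a toy normal-mean model where the summary is the sample mean, so that θ ↦ θ + ε/√n is one-to-one in θ with unit Jacobian; the model, the function names, and the prior are my own illustration, not the authors' setting:

```python
import math
import random

def reverse_abc(s_obs, n, prior_logpdf, n_draws=2000, rng=None):
    """Sketch of a reverse sampler on a normal-mean toy model.

    For a fixed randomness realisation eps, s(theta, eps) = theta + eps/sqrt(n)
    is one-to-one in theta with Jacobian 1, and the distance to s_obs is
    minimised (at zero) by theta = s_obs - eps/sqrt(n).
    """
    rng = rng or random.Random(0)
    draws = []
    for _ in range(n_draws):
        eps = rng.gauss(0.0, 1.0)               # randomness realisation
        theta = s_obs - eps / math.sqrt(n)      # distance-minimising parameter
        weight = math.exp(prior_logpdf(theta))  # pi(theta) * |Jacobian| (= 1 here)
        draws.append((theta, weight))
    return draws

# N(0,1) prior on theta, purely illustrative
draws = reverse_abc(s_obs=1.2, n=50, prior_logpdf=lambda t: -0.5 * t * t)
post_mean = sum(t * w for t, w in draws) / sum(w for _, w in draws)
# post_mean should approach the conjugate posterior mean n*s_obs/(n+1), about 1.18
```

In this toy case the minimum distance is exactly zero, which is the only situation where I can see the weighted sample matching the posterior; the self-normalised weighted mean then recovers the conjugate answer.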

I have difficulty seeing why this sequence produces a weighted sample associated with the posterior. Unless perhaps when the minimum of the distance is zero, in which case this amounts to some inversion of the summary statistic (function). And even then, the role of the random bit ε is unclear, since there is no rejection. The inversion of the summary statistic seems hard to promote in practice, since the transform of the parameter θ into a (random) summary is most likely highly complex.

*“The posterior mean of θ constructed from the reverse sampler is the same as the posterior mean of θ computed under the original ABC sampler.”*

The authors also state (p.16) that the estimators derived by their reverse method are the same as those of the original ABC approach, but this only holds asymptotically in the sample size. And I am not even sure of this weaker statement, as the tolerance does not seem to play a role then. And also because the authors later oppose ABC to their reverse sampler, as the latter produces iid draws from the posterior (p.25).

*“The prior can be potentially used to further reduce bias, which is a feature of the ABC.”*

As an aside, while the paper reviews extensively the literature on minimum distance estimators (called M-estimators in the statistics literature) and on ABC, the first quote misses the meaning of noisy ABC, which consists of a randomised version of ABC where the observed summary statistic is itself randomised at the same level as the simulated statistics. And the last quote does not sound right either, as the bias reduction should be seen as a feature of the Bayesian approach rather than of the ABC algorithm. The paper also attributes the paternity of ABC to Don Rubin’s 1984 paper, “who suggested that computational methods can be used to estimate the posterior distribution of interest even when a model is analytically intractable” (pp.7-8). This is incorrect in that Rubin uses ABC to explain the nature of Bayesian reasoning, but does not in the least address computational issues.
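For contrast, a minimal sketch of noisy ABC in that sense, with a uniform kernel and a toy model; the function names, the tolerance, and the model are all my own illustration:

```python
import random

def noisy_abc(s_obs, simulate_summary, prior_sample, eps, n_iter=20000, rng=None):
    """Toy noisy ABC with a uniform kernel: the observed summary is itself
    randomised, once, at the same scale eps as the acceptance tolerance."""
    rng = rng or random.Random(1)
    s_noisy = s_obs + rng.uniform(-eps, eps)   # randomise the observation
    accepted = []
    for _ in range(n_iter):
        theta = prior_sample(rng)
        if abs(simulate_summary(theta, rng) - s_noisy) <= eps:
            accepted.append(theta)
    return accepted

# toy model: theta ~ U(-1,1) prior, summary = theta + N(0, 0.1) noise
accepted = noisy_abc(
    s_obs=0.5, eps=0.2,
    simulate_summary=lambda t, rng: t + rng.gauss(0.0, 0.1),
    prior_sample=lambda rng: rng.uniform(-1.0, 1.0))
```

The one-line difference with plain ABC is the perturbation of `s_obs`; the point of the construction is that the resulting algorithm is exact with respect to the correspondingly noisified model.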

Filed under: Statistics, University life Tagged: ABC, Columbia University, consistency, indirect inference, noisy ABC

### Professor position at ENSAE, on the Paris Saclay campus

**T**here is an opening at the Statistics School ENSAE for a Statistics associate or full professor position, starting in September 2015. Currently located on the South-West boundary of Paris, the school is soon to move to the mega-campus of Paris Saclay, near École Polytechnique, along with a dozen other schools. See this description of the position. The deadline is very close, March 23!

Filed under: Statistics Tagged: academic position, École Polytechnique, CREST, ENSAE, France, INSEE, Malakoff, Paris, Paris-Saclay campus

### mixtures of mixtures

**A**nd yet another arXival of a paper on mixtures! This one is written by Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün, from the Johannes Kepler University Linz and the Wirtschaftsuniversität Wien, which I visited last September. The exact title is *Identifying mixtures of mixtures using Bayesian estimation*.

So, what *is* a mixture of mixtures, if not a mixture?! Or if not *only* a mixture. The upper mixture level is associated with clusters, while the lower mixture level is used for modelling the distribution within a given cluster. Because each cluster needs to be real enough, the components of its mixture are assumed to be heavily overlapping. The paper thus spends a large amount of space on detailing the construction of the associated hierarchical prior, which in particular implies defining through the prior what a cluster means. The paper also connects with the overfitting mixture idea of Rousseau and Mengersen (2011, Series B). At the cluster level, the Dirichlet hyperparameter is chosen to be very small, 0.001, which empties superfluous clusters but sounds rather arbitrary (which is the reason why we did not go for such small values in our testing/mixture modelling). By contrast, the mixture weights have a hyperparameter staying (far) away from zero. The MCMC implementation is based on a standard Gibbs sampler and the outcome is analysed and sorted by estimating the “true” number of clusters as the MAP and by selecting MCMC simulations conditional on that value. From there, clusters are identified via the point process representation of a mixture posterior, using a standard k-means algorithm.
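To see why a very small Dirichlet hyperparameter empties superfluous components, one can simply simulate symmetric Dirichlet weights; a quick sketch of my own, using α=0.01 rather than the paper's 0.001 since naive Gamma simulation underflows for even smaller values:

```python
import random

def sym_dirichlet(alpha, k, rng):
    """Symmetric Dirichlet draw via normalised Gamma variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(42)
k = 10
sparse = sym_dirichlet(0.01, k, rng)  # tiny hyperparameter: mass concentrates
spread = sym_dirichlet(1.0, k, rng)   # hyperparameter away from zero: mass spreads out

# with a tiny alpha, typically almost all the mass sits on one or two components,
# so superfluous components are essentially emptied a priori
```

The two regimes mirror the paper's two levels: near-zero hyperparameters at the cluster level to kill extra clusters, and hyperparameters far from zero at the within-cluster level to keep the overlapping components alive.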

The remainder of the paper illustrates the approach on simulated and real datasets. Recovering in those small dimension setups the number of clusters used in the simulation or found in other studies. As noted in the conclusion, using solely a Gibbs sampler with such a large number of components is rather perilous since it may get stuck close to suboptimal configurations. Especially with very small Dirichlet hyperparameters.

Filed under: pictures, Statistics, University life Tagged: arXiv, Austria, clustering, k-mean clustering algorithm, Linkz, map, MCMC, mixture, overfitting, Wien

### Assyrian art

Filed under: pictures Tagged: Ashurbanipal, Assyrian art, British Museum, Iraq, Mesopotamia, Nineveh

### Le Monde puzzle [#902]

**A**nother arithmetics Le Monde mathematical puzzle:

*From the set of the integers between 1 and 15, is it possible to partition it in such a way that the product of the terms in the first set is equal to the sum of the members of the second set? **Can this be generalised to an arbitrary set {1,2,…,n}?** What happens if instead we only consider the odd integers in those sets?*

I used brute force, looking at random for a solution:

```r
pb <- txtProgressBar(min = 0, max = 100, style = 3)
for (N in 5:100){
  sol=FALSE
  while (!sol){
    # pick a subset size at random, favouring middling values
    k=sample(1:N,1,prob=(1:N)*(N-(1:N)))
    pro=sample(1:N,k)
    # product of the chosen subset versus sum of its complement
    sol=(prod(pro)==sum((1:N)[-pro]))
  }
  setTxtProgressBar(pb, N)
}
close(pb)
```

While it took a while to run the R code, it eventually got out of the loop, meaning there was at least one solution for every n between 5 and 100. (It does not work for n=1,2,3,4, for obvious reasons.) For instance, when n=15, the integers in the product part are either {3,5,7}, {1,7,14}, or {1,9,11}. Jean-Louis Fouley sent me an explanation: when n is odd, n=2p+1, one solution is (1,p,2p), while when n is even, n=2p, one solution is (1,p-1,2p).
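Fouley's closed forms are easy to check, since the sum of the complement is n(n+1)/2 minus the sum of the chosen triple; a quick verification sketch (helper name is mine):

```python
def check_partition(n, triple):
    """Check that the product of `triple` equals the sum of the
    remaining integers in {1,...,n}."""
    assert len(set(triple)) == 3 and all(1 <= t <= n for t in triple)
    a, b, c = triple
    rest_sum = n * (n + 1) // 2 - (a + b + c)   # sum of the complement
    return a * b * c == rest_sum

for n in range(5, 101):
    p = n // 2
    if n % 2:                                   # n = 2p+1 odd: (1, p, 2p)
        assert check_partition(n, (1, p, 2 * p))
    else:                                       # n = 2p even: (1, p-1, 2p)
        assert check_partition(n, (1, p - 1, 2 * p))
print("all n in 5..100 verified")
```

For odd n=2p+1 the product is 2p² and the complement sums to (2p+1)(p+1)-3p-1 = 2p²; for even n=2p both sides equal 2p²-2p, which is Fouley's argument in one line each.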

A side remark on the R code: thanks to a Cross Validated question by Paulo Marques, on which I thought I had commented on this blog, I learned about the progress bar function in R, *setTxtProgressBar()*, which makes running R code with loops much nicer!

For the second question, I just adapted the R code to exclude even integers:

```r
while (!sol){
  k=1+trunc(sample(1:N,1)/2)
  pro=sample(seq(1,N,by=2),k)
  cum=(1:N)[-pro]
  sol=(prod(pro)==sum(cum[cum%%2==1]))
}
```

and found a solution for n=15, namely {1,3,15} versus {5,7,9,11,13}. However, there does not seem to be a solution for every n: I found solutions for n=15, 21, 23, 31, 39, 41, 47, 49, 55, 59, 63, 71, 75, 79, 87, 95…

Filed under: Books, Kids, Statistics, University life Tagged: Chib's approximation, Le Monde, mathematical puzzle, mixture estimation, progress bar, R, txtProgressBar

### Domaine de Mortiès [in the New York Times]

*“I’m not sure how we found Domaine de Mortiès, an organic winery at the foothills of Pic St. Loup, but it was the kind of unplanned, delightful discovery our previous trips to Montpellier never allowed.”*

**L**ast year, I had the opportunity to visit and sample (!) from Domaine de Mortiès, an organic Pic Saint-Loup vineyard and winemaker. I have not yet opened the bottle of *Jamais Content* I bought then. Today I spotted in The New York Times a travel article on A visit to the in-laws in Montpellier that takes the author to Domaine de Mortiès, Pic Saint-Loup, Saint-Guilhem-du-Désert and other nice places, away from the overcrowded centre of town and the rather bland beach-town of Carnon, where she usually stays when visiting. And where we almost finished our *Bayesian Essentials with R*! To quote from the article, “Montpellier, France’s eighth-largest city, is blessed with a Mediterranean sun and a beautiful, walkable historic centre, a tourist destination in its own right, but because it is my husband’s home city, a trip there never felt like a vacation to me.” And when the author mentions the owner of Domaine de Mortiès, she states that “Mme. Moustiés looked about as enthused as a teenager working the checkout at Rite Aid”, which is not how I remember her from last year. Anyway, it is fun to see that visitors from New York City can unexpectedly come upon this excellent vineyard!

Filed under: Mountains, Travel, Wines Tagged: carignan, Carnon, Domaine Mortiès, French wines, grenache, Languedoc wines, Méditerranée, Montpellier, mourvèdre, New York city, Pic Saint Loup, Syrah, The New York Times, vineyard

### mixture models with a prior on the number of components

*“From a Bayesian perspective, perhaps the most natural approach is to treat the number of components like any other unknown parameter and put a prior on it.”*

**A**nother mixture paper on arXiv! Indeed, Jeffrey Miller and Matthew Harrison recently arXived a paper on estimating the number of components in a mixture model, comparing the parametric and the non-parametric Dirichlet prior approaches, since priors can be chosen to bring the two into agreement. This is an obviously interesting issue, as the two are often opposed in modelling debates. A graph in the paper shows a crystal-clear agreement between finite component mixture modelling and Dirichlet process modelling. The same happens for classification. However, Dirichlet process priors do not return an estimate of the number of components, which may be considered a drawback if one considers this an identifiable quantity in a mixture model… But the paper stresses that the number of estimated clusters under Dirichlet process modelling tends to be larger than the number of components in the finite case, hence that Dirichlet process mixture modelling is not consistent in that respect, producing spurious extra clusters…
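The proliferation of clusters can already be seen on the prior alone: under a Chinese restaurant process with concentration α, the number of occupied tables grows like α log n. A small simulation sketch of my own, not taken from the paper:

```python
import random

def crp_tables(n, alpha, rng):
    """Seat n customers by the Chinese restaurant process and return table sizes."""
    tables = []
    for i in range(n):
        # new table with prob alpha/(i+alpha), else join table j w.p. size_j/(i+alpha)
        r = rng.uniform(0.0, i + alpha)
        if r < alpha:
            tables.append(1)
        else:
            r -= alpha
            for j, size in enumerate(tables):
                if r < size:
                    tables[j] += 1
                    break
                r -= size
    return tables

rng = random.Random(1)
sizes = crp_tables(2000, alpha=1.0, rng=rng)
# the expected number of tables is roughly alpha*log(n), about 8 for n=2000,
# and keeps growing with n, including many tiny tables
```

This unbounded, slowly growing number of occupied tables is the prior-side intuition for the extra small clusters appearing in Dirichlet process posteriors, even when the data come from finitely many components.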

In the parametric modelling, the authors assume the same scale is used in all Dirichlet priors, that is, for all values of k, the number of components. Which means an incoherence when marginalising from k to (k-p) components. Mild incoherence, in fact, as the parameters of the different models do not have to share the same priors. And, as shown by Proposition 3.3 in the paper, this does not prevent coherence in the marginal distribution of the latent variables. The authors also draw a comparison between the distribution of the partition in the finite mixture case and the Chinese restaurant process associated with the partition in the infinite case. A further analogy is that the finite case allows for a stick-breaking representation. A noteworthy difference between the two modellings is the size of the partitions: homogeneous partitions in the finite case versus extreme partitions in the infinite case.

An interesting entry into the connections between “regular” mixture modelling and Dirichlet mixture models. Maybe not ultimately surprising given the past studies by Peter Green and Sylvia Richardson of both approaches (1997 in Series B and 2001 in JASA).

Filed under: Books, Statistics, University life Tagged: Bayesian asymptotics, Bayesian non-parametrics, Chinese restaurant process, consistency, Dirichlet mixture priors, Dirichlet process, mixtures, reversible jump

### snapshot from Gibbet Hill

Filed under: pictures, Travel, University life Tagged: England, Gibbet Hill, sculpture, sunset, University of Warwick

### accelerating Metropolis-Hastings algorithms by delayed acceptance

**M**arco Banterle, Clara Grazian, Anthony Lee, and myself just arXived our paper “Accelerating Metropolis-Hastings algorithms by delayed acceptance“, which is a major revision and upgrade of our “Delayed acceptance with prefetching” paper of last June, a paper that we submitted at the last minute to NIPS but which did not get accepted. The difference with this earlier version is the inclusion of convergence results, in particular that, while the original Metropolis-Hastings algorithm dominates the delayed version in Peskun ordering, the latter can improve upon the original for an appropriate choice of the early-stage acceptance step. We thus included a new section on optimising the design of the delayed step, by picking the optimal scaling à la Roberts, Gelman and Gilks (1997) in the first step and by proposing a ranking of the factors in the Metropolis-Hastings acceptance ratio that speeds up the algorithm. The algorithm thus becomes adaptive. Compared with the earlier version, we have not pursued the second thread of prefetching as much, simply mentioning that prefetching and delayed acceptance could be merged. We have also included a section on the alternative suggested by Philip Nutzman on the ’Og of using a growing ratio rather than individual terms, the advantage being that the probability of acceptance stabilises as the number of terms grows, with the drawback that expensive terms are not always computed last. In addition to our logistic and mixture examples, we also study in this version the MALA algorithm, since we can postpone computing the ratio of the proposals until the second step. The gain observed in one experiment is of the order of a ten-fold increase in efficiency.
By comparison, and in answer to one comment on Andrew’s blog, we did not cover the HMC algorithm, since the preliminary acceptance step would require the construction of a proxy to the acceptance ratio, in order to avoid computing a costly number of derivatives in the discretised Hamiltonian integration.
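The core two-stage device can be sketched in a few lines: factor the Metropolis-Hastings ratio as r = r1 × r2 and accept in two stages, so that the expensive factor is only evaluated on proposals surviving the cheap one. This is a generic sketch under my own toy two-way split, not the paper's notation or its optimised design:

```python
import math
import random

def delayed_acceptance_step(x, log_prior, cheap_loglik, expensive_loglik,
                            proposal_sd, rng):
    """One delayed-acceptance MH step with a symmetric random-walk proposal.

    The full ratio factors as r = r1 * r2; since min(1,r1)*min(1,r2) satisfies
    detailed balance for the same target, the chain remains valid while the
    expensive factor is skipped on early rejections.
    """
    y = x + rng.gauss(0.0, proposal_sd)
    # Stage 1: cheap factor, tested first
    log_r1 = (log_prior(y) + cheap_loglik(y)) - (log_prior(x) + cheap_loglik(x))
    if math.log(rng.random()) >= min(0.0, log_r1):
        return x                      # early rejection: expensive part never computed
    # Stage 2: remaining factor, so that r1 * r2 is the full MH ratio
    log_r2 = expensive_loglik(y) - expensive_loglik(x)
    if math.log(rng.random()) < min(0.0, log_r2):
        return y
    return x

# toy target N(0,1), with its log-density split into two equal halves
rng = random.Random(7)
chain = [0.0]
for _ in range(20000):
    chain.append(delayed_acceptance_step(
        chain[-1],
        log_prior=lambda t: 0.0,
        cheap_loglik=lambda t: -0.25 * t * t,     # "cheap" half of the log-density
        expensive_loglik=lambda t: -0.25 * t * t, # stands in for the costly half
        proposal_sd=2.0, rng=rng))
```

The Peskun-ordering point shows up here too: each factor can reject on its own, so the two-stage acceptance is never larger than the one-stage one, and the gain has to come from the saved evaluations of the expensive factor.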

Filed under: Books, Statistics, University life Tagged: Andrew Gelman, Hamiltonian Monte Carlo, MALA, Metropolis-Hastings algorithm, Montréal, NIPS, Peskun ordering, prefetching, University of Warwick

### Overfitting Bayesian mixture models with an unknown number of components

**D**uring my Czech vacations, Zoé van Havre, Nicole White, Judith Rousseau, and Kerrie Mengersen posted on arXiv a paper on overfitting mixture models to estimate the number of components. This is directly related with Judith and Kerrie’s 2011 paper and with Zoé’s PhD topic. The paper also returns to the vexing (?) issue of label switching! I very much like the paper, and not only because the authors are good friends!, but also because it brings a solution to an approach I briefly attempted with Marie-Anne Gruet in the early 1990’s, just before finding out about the reversible jump MCMC algorithm of Peter Green at a workshop in Luminy and considering we were not going to “beat the competition”! Hence not publishing the output of our over-fitted Gibbs samplers that were nicely emptying extra components… It also brings a rebuke of a later assertion of mine at an ICMS workshop on mixtures, where I defended the notion that over-fitted mixtures could not be detected, a notion that was severely disputed by David MacKay…

What is so fantastic in Rousseau and Mengersen (2011) is that a simple constraint on the Dirichlet prior on the mixture weights suffices to guarantee that asymptotically superfluous components will empty out and signal they are truly superfluous! The authors here cumulate the over-fitted mixture with a tempering strategy, which seems somewhat redundant, the number of extra components being a sort of temperature, but eliminates the need for fragile RJMCMC steps. Label switching is obviously even more of an issue with a larger number of components and identifying empty components seems to require a lack of label switching for some components to remain empty!
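For the record, the Rousseau and Mengersen (2011) constraint, as I recall it (stated from memory, so check the paper for the exact regularity conditions): with d the dimension of the component-specific parameter and a Dirichlet D(e₁,…,e_K) prior on the weights of a mixture over-fitted to K components,

```latex
\max_{k} e_k \;<\; \frac{d}{2}
\quad\Longrightarrow\quad
\sum_{k \in \mathcal{S}} p_k \;\xrightarrow[n\to\infty]{\mathbb{P}}\; 0,
```

where 𝒮 indexes the superfluous components; conversely, hyperparameters above d/2 lead the posterior to duplicate existing components rather than empty the extra ones.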

When reading through the paper, I came upon the condition that *only* the priors on the weights are allowed to vary between temperatures. Distinguishing the weights from the other parameters does make perfect sense, as some representations of a mixture work without those weights. Still, I feel a bit uncertain about the fixed-prior constraint, even though I can see the rationale in not allowing for complete freedom in picking those priors. More fundamentally, I am less and less happy with independent identical or exchangeable priors on the components.

Our own recent experience with almost zero weights mixtures (and with Judith, Kaniav, and Kerrie) suggests not using solely a Gibbs sampler there as it shows poor mixing. And even poorer label switching. The current paper does not seem to meet the same difficulties, maybe thanks to (prior) tempering.

The paper proposes a strategy called *Zswitch* to resolve label switching, which amounts to identifying a MAP for each possible number of components and a subsequent relabelling. Even though I do not entirely understand the way the permutation is constructed. I wonder in particular at the cost of the relabelling.

Filed under: Statistics Tagged: component of a mixture, Czech Republic, Gibbs sampling, label switching, Luminy, mixture estimation, Peter Green, reversible jump, unknown number of components

### Is Jeffreys’ prior unique?

*“A striking characterisation showing the central importance of Fisher’s information in a differential framework is due to Cencov (1972), who shows that it is the only invariant Riemannian metric under symmetry conditions.” *N. Polson, PhD Thesis, University of Nottingham, 1988

**F**ollowing a discussion on Cross Validated, I wondered whether or not the affirmation that Jeffreys’ prior is *the only prior construction rule that remains invariant* under arbitrary (if smooth enough) reparameterisation actually holds. In the discussion, Paulo Marques mentioned Nikolaj Nikolaevič Čencov’s book, *Statistical Decision Rules and Optimal Inference*, a Russian book from 1972, of which I had not heard previously and which seems too theoretical [from Paulo’s comments] to explain why this rule would be the sole one. As I kept looking for Čencov’s references on the Web, I found Nick Polson’s thesis and the above quote. So maybe Nick could tell us more!

However, my uncertainty about the uniqueness of Jeffreys’ rule stems from the fact that, if I decide on a favourite or reference parametrisation—as Jeffreys indirectly does when selecting the parametrisation associated with a constant Fisher information—and derive the prior from the sampling distribution for this parametrisation, I have produced a parametrisation-invariant principle. Possibly silly and uninteresting from a Bayesian viewpoint, but nonetheless invariant.
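For completeness, the invariance of Jeffreys' rule is only the Jacobian rule for densities: since the Fisher information transforms as a metric under a smooth reparametrisation φ = h(θ),

```latex
I(\phi) \;=\;
\left(\frac{\partial\theta}{\partial\phi}\right)^{\!\top}
I(\theta)\,
\left(\frac{\partial\theta}{\partial\phi}\right)
\quad\Longrightarrow\quad
\sqrt{\det I(\phi)} \;=\; \sqrt{\det I(\theta)}\,
\left|\det \frac{\partial\theta}{\partial\phi}\right|,
```

so π_J(θ) ∝ √det I(θ) transforms exactly as a density should. But any rule defined through a fixed reference parametrisation and pushed forward by the same Jacobian rule is invariant in this sense as well, which is precisely the reservation about uniqueness expressed above.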

Filed under: Books, Statistics, University life Tagged: cross validated, Harold Jeffreys, Jeffreys priors, NIck Polson, Nikolaj Nikolaevič Čencov, Russian mathematicians

### market static

*[Heard in the local market, while queuing for cheese:]*

– You took too much!

– Maybe, but remember your sister is staying for two days.

– My sister…, as usual, she will take a big serving and leave half of it!

– Yes, but she will make sure to finish the bottle of wine!

Filed under: Kids, Travel Tagged: farmers' market, métro static

### trans-dimensional nested sampling and a few planets

**T**his morning, in the train to Dauphine (a train that was even more delayed than usual!), I read a recent arXival of Brendon Brewer and Courtney Donovan. Entitled Fast Bayesian inference for exoplanet discovery in radial velocity data, the paper suggests associating Matthew Stephens’ (2000) birth-and-death MCMC approach with nested sampling to infer the number N of exoplanets in an exoplanetary system. The paper is somewhat sparse in its description of the suggested approach, but states that the birth-and-death moves involve adding a planet with parameters simulated from the prior and removing a planet at random, both being accepted under a likelihood constraint associated with nested sampling. I actually wonder if this is the birth-and-death version of Peter Green’s (1995) RJMCMC rather than the continuous-time birth-and-death process version of Matthew’s…

*“The traditional approach to inferring N also contradicts fundamental ideas in Bayesian computation. Imagine we are trying to compute the posterior distribution for a parameter a in the presence of a nuisance parameter b. This is usually solved by exploring the joint posterior for a and b, and then only looking at the generated values of a. Nobody would suggest the wasteful alternative of using a discrete grid of possible a values and doing an entire Nested Sampling run for each, to get the marginal likelihood as a function of a.”*

This criticism is receivable when there is a huge number of possible values of N, even though I see no fundamental contradiction with my ideas about Bayesian computation. However, it is more debatable when there are only a few possible values for N, given that the exploration of the augmented space by an RJMCMC algorithm is often very inefficient, in particular when the proposed parameters are generated from the prior. All the more when nested sampling is involved and simulations are run under the likelihood constraint! In the astronomy examples given in the paper, N never exceeds 15… Furthermore, by merging all N’s together, it is unclear how the evidences associated with the various values of N can be computed. At least, those are not reported in the paper.

The paper also omits to provide the likelihood function, so I do not completely understand where “label switching” occurs therein. My first impression is that this is not a mixture model. However, if the observed signal (from an exoplanetary system) is the sum of N signals corresponding to N planets, this makes more sense.

Filed under: Books, Statistics, Travel, University life Tagged: birth-and-death process, Chamonix, exoplanet, label switching, métro, nested sampling, Paris, RER B, reversible jump, Université Paris Dauphine

### ice-climbing Niagara Falls

**I** had missed the news that a frozen portion of the Niagara Falls had been ice-climbed, by Will Gadd, on Jan. 27. This is obviously quite impressive given the weird and dangerous nature of the ice there, which is mostly frozen foam from the nearby waterfall. (I once climbed an easy route on such ice at the Chutes Montmorency, near Québec City, and it felt quite strange…) He even had a special ice hook designed for that climb, as he did not trust the usual ice screws. Will Gadd has however climbed much more difficult routes, like Helmcken Falls in British Columbia, which may be the hardest mixed route in the world!

Filed under: Mountains, pictures Tagged: British Columbia, Canada, Helmcken Falls, ice climbing, Niagara Falls, Niagara-on-the-Lake, USA

### Ubuntu issues

**I**t may be that weekends are the wrong time to tamper with computer OS… Last Sunday, I noticed my Bluetooth icon had a “turn off” option and since I only use Bluetooth for my remote keyboard and mouse when in Warwick, I turned it off, thinking I would turn it on again next week. This alas led to a series of problems, maybe as a coincidence since I also updated the Kubuntu 14.04 system over the weekend.

- I cannot turn Bluetooth on again! My keyboard and mouse are no longer recognised or detected. No Bluetooth adapter is found by the system settings. Similarly, *sudo modprobe bluetooth* shows nothing. I have installed a new interface called Blueman, but to no avail. The fix suggested on forums, to run *rfkill unblock bluetooth*, does not work either… Actually, *rfkill list all* only returns the wireless device. Which is working fine.
- My webcam vanished as well. It was working fine before the weekend.
- Accessing some webpages, including all New York Times articles, now takes forever on Firefox! Somewhat less so on Chrome.

Is this a curse of sorts?!

As an aside, I also found this week that I cannot update Adobe Reader from version 9 to version 11, as Adobe does not support Linux versions any more… Another bummer, if one wants to stick with Acrobat.

**Update [03/02]**

Thanks to Ingmar and Thomas, I got both my problems solved! The Bluetooth restarted after I shut down my *unplugged* computer, in connection with a USB over-current protection. And Thomas figured out my keyboard had a key to turn the webcam off and on, a key that I had pressed when trying to restart the Bluetooth device. Et voilà!

Filed under: Kids, Linux Tagged: Bluetooth, Kubuntu, Linux, Ubuntu 14.04

### je suis Avijit Roy

**আমরা শোকাহত**

**কিন্তু আমরা অপরাজিত**

[“We mourn but we are not defeated”]

Filed under: Uncategorized Tagged: atheism, Bangladesh, blogging, fanaticism, fascism, Mukto-Mona

### Unbiased Bayes for Big Data: Path of partial posteriors [a reply from the authors]

*[Here is a reply by Heiko Strathmann to my post of yesterday. Along with the slides of a talk in Oxford mentioned in the discussion.]*

Thanks for putting this up, and thanks for the discussion. Christian, as already exchanged via email, here are some answers to the points you make.

First of all, we don’t claim a free lunch — and are honest about the limitations of the method (see negative examples). Rather, we make the point that we *can* achieve computational savings in certain situations — essentially exploiting redundancy (what Michael called “tall” data in his note on subsampling & HMC) leading to fast convergence of posterior statistics.

Dan is of course correct in noticing that if the posterior statistic does not converge nicely (i.e., all data counts), then the truncation time is “mammoth”. It is also correct that it might be questionable to aim for an unbiased Bayesian method in the presence of such redundancies. However, these are the two extreme perspectives on the topic. The message that we want to get across is that there is a trade-off between these extremes. In particular, the GP examples illustrate this nicely, as we are able to reduce MSE in a regime where posterior statistics have *not* yet stabilised, see e.g. figure 6.

*“And the following paragraph is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation.”*

To clarify, the paragraph refers to the *additional* convergence issues induced by alternative Markov transition kernels of mini-batch-based full posterior sampling methods by Welling, Bardenet, Dougal & co. For example, Firefly MC’s mixing time is increased by a factor of 1/q where q*N is the mini-batch size. Mixing of stochastic gradient Langevin gets worse over time. This is *not* true for our scheme as we can use standard transition kernels. It is still essential for the partial posterior Markov chains to converge (*if* MCMC is used). However, as this is a well studied problem, we omit the topic in our paper and refer to standard tools for diagnosis. All this is independent of the debiasing device.

**About MCMC convergence.**

Yesterday in Oxford, Pierre Jacob pointed out that if MCMC is used for estimating partial posterior statistics, the overall result is *not* unbiased. We had a nice discussion about how this bias could be addressed via a two-stage debiasing procedure: debiasing the MC estimates as described in the “Unbiased Monte Carlo” paper by Agapiou et al., and then plugging those into the path estimators — though it is not (yet) so clear how (and whether) this would work in our case.

In the current version of the paper, we do not address the bias present due to MCMC. We have a paragraph on this in section 3.2. Rather, we start from a premise that full posterior MCMC samples are a gold standard. Furthermore, the framework we study is not necessarily linked to MCMC – it could be that the posterior expectation is available in closed form, but simply costly in N. In this case, we can still unbiasedly estimate this posterior expectation – see GP regression.
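The closed-form case can be illustrated with a minimal sketch. Below, a hypothetical conjugate Gaussian model (x_i ~ N(θ,1) with prior θ ~ N(0,1)) stands in for a posterior expectation that is available in closed form but costly in N, and a telescoping estimator with a geometric stopping time recovers the full-data posterior mean unbiasedly. The doubling batch schedule, the geometric stopping law, and all tuning constants are illustrative assumptions, not the paper's design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (a stand-in, not the paper's examples):
# x_i ~ N(theta, 1) with prior theta ~ N(0, 1), so the posterior mean
# given the first n observations is sum(x_1..x_n) / (n + 1) in closed form.
N = 10_000
data = rng.normal(1.0, 1.0, size=N)

def phi(n):
    # partial posterior expectation of theta based on the first n points
    return data[:n].sum() / (n + 1)

def debiased_replicate(p=0.5, base=64):
    # One draw of the telescoping estimator
    #   sum_{t=1}^T (phi_t - phi_{t-1}) / P(T >= t),   phi_0 = 0,
    # with geometric T and batch sizes doubling until N is reached
    # (an illustrative schedule only).
    t_max = int(np.ceil(np.log2(N / base))) + 1
    T = min(rng.geometric(p), t_max)  # differences vanish past t_max
    est, prev = 0.0, 0.0
    for t in range(1, T + 1):
        cur = phi(min(base * 2 ** (t - 1), N))
        est += (cur - prev) / (1 - p) ** (t - 1)  # weight 1 / P(T >= t)
        prev = cur
    return est

R = 5_000
ests = np.array([debiased_replicate() for _ in range(R)])
print(ests.mean(), phi(N))  # the average of replicates matches the full-data value
```

Conditional on the data, the expectation of one replicate over T is exactly the full-data posterior mean, so averaging the R replicates only leaves Monte Carlo noise over the stopping times.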

*“The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).”*

It is true that the choice is crucial in order to control the variance. However, provided that the partial posterior expectations converge at a rate n^{-β}, with n the size of a mini-batch, the computational complexity can be reduced to N^{1-α} (α<β) without the variance exploding. There is a trade-off: the faster the posterior expectations converge, the more computation can be saved; β is in general unknown, but can be roughly estimated with the “direct approach”, as we describe in the appendix.
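As a rough illustration of how β might be eyeballed empirically (a plausible sketch only, not necessarily the appendix's "direct approach"), one can regress the log-RMSE of partial posterior means against log mini-batch size on a toy conjugate Gaussian model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (an assumption for illustration): x_i ~ N(0, 1), theta ~ N(0, 1),
# so the posterior mean on n points is sum(x_1..x_n)/(n + 1); it converges to
# the full-data value at some rate n^{-beta} to be estimated.
N = 100_000
reps = 200
sizes = 2 ** np.arange(6, 14)  # mini-batch sizes n = 64 ... 8192
sq_err = np.zeros(len(sizes))

for _ in range(reps):
    csum = np.cumsum(rng.normal(0.0, 1.0, size=N))
    full = csum[-1] / (N + 1)             # full-data posterior mean
    part = csum[sizes - 1] / (sizes + 1)  # partial posterior means
    sq_err += (part - full) ** 2

rmse = np.sqrt(sq_err / reps)
# the slope of log-RMSE against log-n is roughly -beta
beta_hat = -np.polyfit(np.log(sizes), np.log(rmse), 1)[0]
print(beta_hat)  # in the vicinity of 1/2 for this posterior-mean statistic
```

For this statistic the Monte Carlo rate β ≈ 1/2 applies, so the variance condition only leaves room for a modest sub-linear exponent α < 1/2.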

**About the “direct approach”**

It is true that for certain classes of models and φ functionals, the direct averaging of expectations for increasing data sizes yields good results (see the log-normal example), and we state this. However, the GP regression experiments show that the direct averaging gives a larger MSE than with debiasing applied. This is exactly the trade-off mentioned earlier.

I also wonder what people think about the comparison to stochastic variational inference (GP for Big Data), as this hasn’t appeared in discussions yet. It is the comparison to “non-unbiased” schemes that Christian and Dan asked for.

Filed under: Statistics, University life Tagged: arXiv, bias vs. variance, big data, convergence assessment, de-biasing, Firefly MC, MCMC, Monte Carlo Statistical Methods, telescoping estimator, unbiased estimation

### Unbiased Bayes for Big Data: Path of partial posteriors

*“Data complexity is sub-linear in N, no bias is introduced, variance is finite.”*

**H**eiko Strathmann, Dino Sejdinovic and Mark Girolami have arXived a few weeks ago a paper on the use of a telescoping estimator to achieve an unbiased estimator of a Bayes estimator relying on the entire dataset, while using only a small proportion of the dataset. The idea is that a sequence of estimators φt converging to the full-posterior expectation can be turned into an unbiased estimator by a random stopping rule T: the telescoping sum

$$\hat\varphi=\sum_{t=1}^{T}\frac{\varphi_t-\varphi_{t-1}}{\mathbb{P}(T\ge t)}\,,\qquad \varphi_0=0,$$

is indeed unbiased. In a “Big Data” framework, the components φt are MCMC versions of posterior expectations based on a proportion αt of the data, and the stopping rule cannot exceed the index at which αt=1. The authors further propose to replicate this unbiased estimator R times on R parallel processors. They further claim a reduction of the expected computing cost to

$$\mathcal{O}(N^{1-\alpha})\,,\qquad 0<\alpha<\beta,$$

with N the dataset size and β the convergence rate of the partial posterior expectations, which means that a sub-linear cost can be achieved. However, the gain in computing time means higher variance than for the full MCMC solution:
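As a back-of-the-envelope check of the sub-linear claim: for one replicate, the expected number of data-point evaluations is the sum over t of P(T≥t)·nt. Under an assumed geometric stopping rule with doubling batch sizes capped at N (an illustrative schedule, not the paper's exact design), this expected cost grows only logarithmically in N:

```python
# Expected cost of one replicate: sum_t P(T >= t) * n_t, for a geometric
# stopping time T and batch sizes n_t = base * 2^(t-1) capped at the full
# data size N (illustrative tuning constants, not the paper's).
def expected_cost(N, p=0.5, base=64):
    cost, t = 0.0, 1
    while base * 2 ** (t - 1) < N:
        cost += (1 - p) ** (t - 1) * base * 2 ** (t - 1)
        t += 1
    return cost + (1 - p) ** (t - 1) * N  # final, full-data level

for N in (10**4, 10**6, 10**8):
    print(N, round(expected_cost(N)))
```

With p=1/2 the geometric survival probabilities exactly cancel the doubling batch sizes, so each level contributes a constant cost and the total is of order log N, far below the full-data cost N (at the price, as the quote below stresses, of extra variance).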

*“It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (…), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compare approaches in that region.”*

I first got a “free lunch” impression when reading the paper, namely it sounded like using a random stopping rule was enough to overcome bias and large-data jams. This is not the message of the paper, but I remain both intrigued by the possibilities the unbiasedness offers *and* bemused by the claims therein, for several reasons:

- the above estimator requires computing T MCMC (partial) estimators φt in parallel. All of those estimators have to be associated with Markov chains in a stationary regime, and they are all associated with independent chains. While addressing the convergence of a single chain, the paper does not truly cover the *simultaneous* convergence assessment of a group of T parallel MCMC sequences. And the paragraph below is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation. In fact, further discussion with the authors (!) led me to understand this relates to the existing alternatives for handling large data, like Firefly Monte Carlo: convergence to the stationary distribution remains essential (and somewhat problematic) for all the partial estimators.

*“If a Markov chain is, in line with above considerations, used for computing partial posterior expectations*

Categories: Bayesian Bloggers