## Bayesian News Feeds

### Adaptive Priors Based on Splines with Random Knots

**Eduard Belitser**,

**Paulo Serra**.

**Source:** Bayesian Analysis, Volume 9, Number 4, 859--882.

**Abstract:**

Splines are useful building blocks when constructing priors on nonparametric models indexed by functions. Recently it has been established in the literature that hierarchical adaptive priors based on splines with a random number of equally spaced knots and random coefficients in the B-spline basis corresponding to those knots lead, under some conditions, to optimal posterior contraction rates, over certain smoothness functional classes. In this paper we extend these results for when the location of the knots is also endowed with a prior. This has already been a common practice in Markov chain Monte Carlo applications, but a theoretical basis in terms of adaptive contraction rates was missing. Under some mild assumptions, we establish a result that provides sufficient conditions for adaptive contraction rates in a range of models, over certain functional classes of smoothness up to the order of the splines that are used. We also present some numerical results illustrating how such a prior adapts to inhomogeneous variability (smoothness) of the function in the context of nonparametric regression.

### Stratified Graphical Models - Context-Specific Independence in Graphical Models

**Henrik Nyman**,

**Johan Pensar**,

**Timo Koski**,

**Jukka Corander**.

**Source:** Bayesian Analysis, Volume 9, Number 4, 883--908.

**Abstract:**

Theory of graphical models has matured over more than three decades to provide the backbone for several classes of models that are used in a myriad of applications such as genetic mapping of diseases, credit risk evaluation, reliability and computer security. Despite their generic applicability and wide adoption, the constraints imposed by undirected graphical models and Bayesian networks have also been recognized to be unnecessarily stringent under certain circumstances. This observation has led to the proposal of several generalizations that aim at more relaxed constraints by which the models can impose local or context-specific dependence structures. Here we consider an additional class of such models, termed stratified graphical models. We develop a method for Bayesian learning of these models by deriving an analytical expression for the marginal likelihood of data under a specific subclass of decomposable stratified models. A non-reversible Markov chain Monte Carlo approach is further used to identify models that are highly supported by the posterior distribution over the model space. Our method is illustrated and compared with ordinary graphical models through application to several real and synthetic datasets.

### The Evidentiary Credible Region

**David Shalloway**.

**Source:** Bayesian Analysis, Volume 9, Number 4, 909--922.

**Abstract:**

Many disparate definitions of Bayesian credible intervals and regions are in use, which can lead to ambiguous presentation of results. It is particularly unsatisfactory when intervals are specified that do not match the one-sided character of the evidence. We suggest that a sensible resolution is to use the parameterization-independent region that maximizes the information gain between the initial prior and posterior distributions, as assessed by their Kullback-Leibler divergence, subject to the constraint on included posterior probability. This turns out to be equivalent to the relative surprise region previously defined by Evans (1997), and thus provides information theoretic support for its use. We also show that this region is the constrained optimizer over the posterior measure of any strictly monotonic function of the likelihood, which explains its many optimal properties, and that it is guaranteed to be consistent with the sidedness of the evidence. Because all of its equivalent derivations depend on the evidence as well as on the posterior distribution, we suggest that it be called the evidentiary credible region.

### Hellinger Distance and Non-informative Priors

**Arkady Shemyakin**.

**Source:** Bayesian Analysis, Volume 9, Number 4, 923--938.

**Abstract:**

This paper introduces an extension of the Jeffreys’ rule to the construction of objective priors for non-regular parametric families. A new class of priors based on Hellinger information is introduced as Hellinger priors. The main results establish the relationship of Hellinger priors to the Jeffreys’ rule priors in the regular case, and to the reference and probability matching priors for the non-regular class introduced by Ghosal and Samanta. These priors are also studied for some non-regular examples outside of this class. Their behavior proves to be similar to that of the reference priors considered by Berger, Bernardo, and Sun, however some differences are observed. For the multi-parameter case, a combination of Hellinger priors and reference priors is suggested and some examples are considered.

### Equivalence between the Posterior Distribution of the Likelihood Ratio and a p-value in an Invariant Frame

**Isabelle Smith**,

**André Ferrari**.

**Source:** Bayesian Analysis, Volume 9, Number 4, 939--962.

**Abstract:**

The Posterior distribution of the Likelihood Ratio (PLR) is proposed by Dempster in 1973 for significance testing in the simple vs. composite hypothesis case. In this hypothesis test case, classical frequentist and Bayesian hypothesis tests are irreconcilable, as emphasized by Lindley’s paradox, Berger & Selke in 1987 and many others. However, Dempster shows that the PLR (with inner threshold 1) is equal to the frequentist p-value in the simple Gaussian case. In 1997, Aitkin extends this result by adding a nuisance parameter and showing its asymptotic validity under more general distributions. Here we extend the reconciliation between the PLR and a frequentist p-value for a finite sample, through a framework analogous to the Stein’s theorem frame in which a credible (Bayesian) domain is equal to a confidence (frequentist) domain.

### A Stochastic Variational Framework for Fitting and Diagnosing Generalized Linear Mixed Models

**Linda S. L. Tan**,

**David J. Nott**.

**Source:** Bayesian Analysis, Volume 9, Number 4, 963--1004.

**Abstract:**

In stochastic variational inference, the variational Bayes objective function is optimized using stochastic gradient approximation, where gradients computed on small random subsets of data are used to approximate the true gradient over the whole data set. This enables complex models to be fit to large data sets as data can be processed in mini-batches. In this article, we extend stochastic variational inference for conjugate-exponential models to nonconjugate models and present a stochastic nonconjugate variational message passing algorithm for fitting generalized linear mixed models that is scalable to large data sets. In addition, we show that diagnostics for prior-likelihood conflict, which are useful for Bayesian model criticism, can be obtained from nonconjugate variational message passing automatically, as an alternative to simulation-based Markov chain Monte Carlo methods. Finally, we demonstrate that for moderate-sized data sets, convergence can be accelerated by using the stochastic version of nonconjugate variational message passing in the initial stage of optimization before switching to the standard version.

### Bayesian Analysis, Volume 9, Number 4 (2014)

Contents:

**Jesse Windle**, **Carlos M. Carvalho**. A Tractable State-Space Model for Symmetric Positive-Definite Matrices. 759--792.

**Roberto Casarin**. Comment on Article by Windle and Carvalho. 793--804.

**Catherine Scipione Forbes**. Comment on Article by Windle and Carvalho. 805--808.

**Enrique ter Horst**, **German Molina**. Comment on Article by Windle and Carvalho. 809--818.

**Jesse Windle**, **Carlos M. Carvalho**. Rejoinder. 819--822.

**Asael Fabian Martínez**, **Ramsés H. Mena**. On a Nonparametric Change Point Detection Model in Markovian Regimes. 823--858.

**Eduard Belitser**, **Paulo Serra**. Adaptive Priors Based on Splines with Random Knots. 859--882.

**Henrik Nyman**, **Johan Pensar**, **Timo Koski**, **Jukka Corander**. Stratified Graphical Models - Context-Specific Independence in Graphical Models. 883--908.

**David Shalloway**. The Evidentiary Credible Region. 909--922.

**Arkady Shemyakin**. Hellinger Distance and Non-informative Priors. 923--938.

**Isabelle Smith**, **André Ferrari**. Equivalence between the Posterior Distribution of the Likelihood Ratio and a p-value in an Invariant Frame. 939--962.

**Linda S. L. Tan**, **David J. Nott**. A Stochastic Variational Framework for Fitting and Diagnosing Generalized Linear Mixed Models. 963--1004.

### some LaTeX tricks

Here are a few LaTeX tricks I learned or rediscovered when working on several papers the past week:

- I am always forgetting how to make aligned equations with a single equation number, so I dug up this solution on the TeX forum of stackexchange: namely, use the equation environment with an aligned environment inside, or else the split environment. But it does not always work…
- Another frustrating black hole is how to deal with integral signs that do not adapt to the integrand. Too bad we cannot use \left\int, really! Another stackexchange question led me to the bigints package. Not perfect though.
- Pierre Pudlo also showed me the commands \graphicspath{{dir1}{dir2}} and \DeclareGraphicsExtensions{.pdf,.png,.jpg} to avoid coding the entire path to each image and to put an order on the extension type, respectively. The second one is fairly handy when working on drafts. The first one does not seem to work with symbolic links, though…
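For the record, the first and third tricks can be combined in a minimal example (the directory and image names below are made up for illustration):

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{graphicx}
% look for images in these directories, trying extensions in this order
% (hypothetical paths, for illustration only)
\graphicspath{{figures/}{plots/}}
\DeclareGraphicsExtensions{.pdf,.png,.jpg}
\begin{document}
% an aligned derivation sharing a single equation number
\begin{equation}
\begin{aligned}
(x+1)^2 &= x^2 + 2x + 1 \\
        &\ge 4x
\end{aligned}
\end{equation}
% \includegraphics{myplot} now resolves figures/myplot.pdf, plots/myplot.png, etc.
\includegraphics{myplot}
\end{document}
```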

Filed under: Books, Kids, Statistics, University life Tagged: bigint, graphical extension, LaTeX, mathematical equations, StackExchange

### not converging to London for an [extra]ordinary Read Paper

**O**n December 10, I will alas not travel to London to attend the Read Paper on sequential quasi-Monte Carlo presented by Mathieu Gerber and Nicolas Chopin to the Royal Statistical Society, as I fly instead to Montréal for the NIPS workshops… I am quite sorry to miss this event, as this is a major paper bringing quasi-Monte Carlo methods into mainstream statistics. I will most certainly write a discussion, and remind Og’s readers that contributed discussions (of up to 800 words) are welcome from everyone, the deadline for submission being January 02.

Filed under: Books, Kids, pictures, Statistics, Travel, University life Tagged: discussion paper, London, MCQMC, Nicolas Chopin, NIPS 2014, Read paper, Royal Statistical Society, sequential Monte Carlo

### limbo IPA

Filed under: pictures, Travel, Wines Tagged: IPA, beer, Limbo IPA, Long Trail Brewing Company, Maine, microbrewery, USA, Vermont

### Bayesian evidence and model selection

**A**nother arXived paper with a topic close to my interests, posted by Knuth et al. today, namely Bayesian model selection. However, after reading the paper in Gainesville, I am rather uncertain about its prospects, besides providing an entry to the issue (for physicists?). Indeed, the description of (Bayesian) evidence concentrates on rough approximations, in a physics perspective, with a notion of Occam’s factor that measures the divergence to the maximum likelihood. (As usual when reading the physics literature, I am uncertain as to why one should always settle for approximations.) The numerical part mentions the tools of importance sampling, Laplace approximations, path sampling and nested sampling. The main part of the paper consists in applying those tools to signal processing models. One of them is a mixture example where nested sampling is used to evaluate the most likely number of components, using uniform priors over unspecified hypercubes. In an example about the photometric signal from an exoplanet, two models are distinguished by evidences of 37,764 and 37,765, with another one at 37,748. It seems to me that this very proximity simply prevents the comparison of those models, even without accounting for the Monte Carlo variability, and does not suffice to conclude about a scientific theory (“effectively characterize exoplanetary systems”). Which leads to my current thinking, already expressed on this blog, that Bayes factors and posterior probabilities should be replaced with an alternative that includes uncertainty about the very Bayes factor (or evidence).
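To make this concrete: if those figures are log-evidences, as I assume (physicists usually report them on that scale), a gap of a single unit corresponds to a Bayes factor of only e ≈ 2.7, hardly decisive even before Monte Carlo variability enters the picture:

```python
import math

# log-evidences as quoted in the paper discussed above
logZ_a, logZ_b, logZ_c = 37765.0, 37764.0, 37748.0

bf_ab = math.exp(logZ_a - logZ_b)  # about 2.72: the two top models are indistinguishable
bf_ac = math.exp(logZ_a - logZ_c)  # about 2.4e7: a genuine separation
```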

Filed under: Statistics

### differences between Bayes factors and normalised maximum likelihood

**A** recent arXival by Heck, Wagenmakers and Morey attracted my attention: Three Qualitative Differences Between Bayes Factors and Normalized Maximum Likelihood, as it provides an analysis of the differences between Bayesian analysis and Rissanen’s Optimal Estimation of Parameters, which I reviewed a while ago. As detailed in this review, I had difficulties with considering the normalised likelihood

$$\hat{p}(x)=\frac{\sup_\theta f(x\mid\theta)}{\sum_y \sup_\theta f(y\mid\theta)}$$

as the relevant quantity. One reason being that the distribution does not make experimental sense: for instance, how can one simulate from this distribution? [I mean, when considering only the original distribution.] Working with the simple binomial B(n,θ) model, the authors show that the quantity corresponding to the posterior probability may be constant for most of the data values, produces a different upper bound and hence a different penalty for model complexity, and may differ in conclusion for some observations. Which means that the apparent proximity between using a Jeffreys prior and Rissanen’s alternative does not go all the way. While it is a short note, only focussed on producing an illustration in the binomial case, I find it interesting that researchers investigate the Bayesian nature (vs. artifice!) of this approach…
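For concreteness, here is a minimal computation of the normalised maximum likelihood distribution in the binomial model (my own sketch in Python, not the authors’ code):

```python
from math import comb

def nml_binomial(n):
    """Normalised maximum likelihood over outcomes x = 0..n of Binomial(n, theta)."""
    def maxlik(x):
        th = x / n  # the MLE of theta for outcome x
        return comb(n, x) * th**x * (1 - th)**(n - x)
    z = sum(maxlik(y) for y in range(n + 1))  # Rissanen's normalising constant
    return [maxlik(x) / z for x in range(n + 1)]

p = nml_binomial(10)  # a genuine distribution over {0, ..., 10}
```

Note that, unlike a posterior predictive, this distribution is defined through a maximisation over θ for each possible outcome, which is one reason simulating from it has no clear experimental meaning.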

Filed under: Books, Kids, Statistics, University life Tagged: Bayes factors, binomial distribution, model complexity, normalised maximum likelihood, Rissanen

### importance sampling schemes for evidence approximation [revised]

**A**fter a rather intense period of new simulations and versions, Juong Een (Kate) Lee and I have now resubmitted our paper on (some) importance sampling schemes for evidence approximation in mixture models to Bayesian Analysis. There is no fundamental change in the new version, but rather a more detailed description of what those importance schemes mean in practice. The original idea in the paper is to improve upon the Rao-Blackwellisation solution proposed by Berkoff et al. (2002), and later by Marin et al. (2005), to avoid the impact of label switching on Chib’s formula. The Rao-Blackwellisation consists in averaging over all permutations of the labels, while the improvement relies on the elimination of useless permutations, namely those that produce a negligible conditional density in Chib’s (candidate’s) formula. While the improvement implies truncating the overall sum and hence induces a potential bias (which was the concern of one referee), the determination of the irrelevant permutations after relabelling next to a single mode does not appear to cause any bias, while reducing the computational overload. Referees also made us aware of many recent proposals that lead to different evidence approximations, albeit not directly related with our purpose. (One was Rodrigues and Walker, 2014, discussed and commented in a recent post.)

Filed under: Statistics, University life Tagged: Andrew Gelman, candidate approximation, Chib's approximation, evidence, finite mixtures, label switching, permutations, Rao-Blackwellisation

### a probabilistic proof to a quasi-Monte Carlo lemma

As I was reading in the Paris métro a new textbook on Quasi-Monte Carlo methods, *Introduction to Quasi-Monte Carlo Integration and Applications*, written by Gunther Leobacher and Friedrich Pillichshammer, I came upon the lemma that, given two sequences on (0,1) such that, for all i’s,

and the geometric bound made me wonder if there was an easy probabilistic proof to this inequality, rather than the algebraic proof contained in the book. Unsurprisingly, there is one, based on associating with each pair (u,v) a pair of independent events (A,B) such that, for all i’s,

and representing

Obviously, there is no visible consequence to this remark, but it was a good way to switch off the métro hassle for a while! (The book is under review and the review will hopefully be posted on the ‘Og as soon as it is completed.)
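Since the formulas did not survive here, let me restate what I believe the lemma and the probabilistic argument to be (my reconstruction; the exact statement in the book may differ): given u_i, v_i in (0,1) with |u_i − v_i| ≤ ε for all i,

```latex
\[
\Bigl|\prod_{i=1}^n u_i-\prod_{i=1}^n v_i\Bigr|\le 1-(1-\varepsilon)^n .
\]
% Proof sketch: take pairs of events (A_i,B_i), independent across i, with
% P(A_i)=u_i, P(B_i)=v_i and P(A_i\cap B_i)=\min(u_i,v_i)\ge 1-\varepsilon,
% so that, representing the products as probabilities of intersections,
\[
\Bigl|P\Bigl(\bigcap_i A_i\Bigr)-P\Bigl(\bigcap_i B_i\Bigr)\Bigr|
\le 1-P\Bigl(\bigcap_i (A_i\cap B_i)\Bigr)
=1-\prod_i \min(u_i,v_i)\le 1-(1-\varepsilon)^n .
\]
```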

Filed under: Books, Statistics, Travel, University life Tagged: book review, inequalities, métro, Paris, probability basics, quasi-Monte Carlo methods

### snapshot from UF campus (#2)

Filed under: pictures, Running, Travel, University life Tagged: Florida, Gainesville, Griffin-Floyd Hall, Spanish moss, University of Florida

### Le Monde puzzle [#887bis]

**A**s mentioned in the previous post, an alternative consists in finding the permutation of {1,…,N} by “adding” squares left and right until the permutation is complete or no solution is available. While this sounds like the dual of the initial solution, it brings a considerable improvement in computing time, as shown below. I thus redefined the construction of the solution by initialising the permutation at random (it could start at 1 just as well)

and then completing only with possible neighbours, left

```r
out=outer(perfect-perm[t],friends,"==")
if (max(out)==1){
  t=t+1
  perm[t]=sample(rep(perfect[apply(out,1,max)==1],2),1)-perm[t-1]
  friends=friends[friends!=perm[t]]}
```

or right

```r
out=outer(perfect-perm[1],friends,"==")
if (max(out)==1){
  t=t+1
  perf=sample(rep(perfect[apply(out,1,max)==1],2),1)-perm[1]
  perm[1:t]=c(perf,perm[1:(t-1)])
  friends=friends[friends!=perf]}
```

(If you wonder about why the *rep* in the *sample* step, it is a trick I just found to avoid the insufferable feature that sample(n,1) is equivalent to sample(1:n,1)! It costs basically nothing but bypasses reprogramming sample() each time I use it… I am very glad I found this trick!) The gain in computing time is amazing:

An unrelated point is that a more interesting (?) alternative problem consists in adding a toroidal constraint, namely to have the first and the last entries in the permutation to also sum up to a perfect square. Is it at all possible?

Filed under: Kids, R, Statistics, University life Tagged: Le Monde, mathematical puzzle, sample

### snapshot from UF campus

Filed under: pictures, Running, Travel, University life Tagged: Challis Lecture, Florida, Gainesville, University of Florida, USA

### Le Monde puzzle [#887]

**A** simple combinatorics Le Monde mathematical puzzle:

*N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?*

Indeed, from an R programming point of view, all I have to do is to go over all possible permutations of {1,2,..,N} until one works, or until I have exhausted all possible permutations for a given N. However, 25! is of order 10²⁵, a wee bit too large for that… Instead, I resorted once again to brute-force simulation, by first introducing the possible neighbours of each integer

```r
perfect=(1:trunc(sqrt(2*N)))^2
friends=NULL
le=1:N
for (perm in 1:N){
  amis=perfect[perfect>perm]-perm
  amis=amis[amis<N]
  le[perm]=length(amis)
  friends=c(friends,list(amis))}
```

and then proceeding to construct the permutation one integer at a time, by picking from its remaining potential neighbours until there is none left or the sequence is complete

```r
orderin=0*(1:N)
t=1
orderin[1]=sample((1:N),1)
for (perm in 1:N){
  friends[[perm]]=friends[[perm]][friends[[perm]]!=orderin[1]]
  le[perm]=length(friends[[perm]])}
while (t<N){
  if (length(friends[[orderin[t]]])==0) break()
  if (length(friends[[orderin[t]]])>1){
    orderin[t+1]=sample(friends[[orderin[t]]],1)}else{
    orderin[t+1]=friends[[orderin[t]]]}
  for (perm in 1:N){
    friends[[perm]]=friends[[perm]][friends[[perm]]!=orderin[t+1]]
    le[perm]=length(friends[[perm]])}
  t=t+1}
```

and then repeating this attempt until a full sequence is produced or a certain number of failed attempts has been reached. I gained in efficiency by proposing a second completion on the left of the first integer once a break occurs:

```r
while (t<N){
  if (length(friends[[orderin[1]]])==0) break()
  orderin[2:(t+1)]=orderin[1:t]
  if (length(friends[[orderin[2]]])>1){
    orderin[1]=sample(friends[[orderin[2]]],1)}else{
    orderin[1]=friends[[orderin[2]]]}
  for (perm in 1:N){
    friends[[perm]]=friends[[perm]][friends[[perm]]!=orderin[1]]
    le[perm]=length(friends[[perm]])}
  t=t+1}
```

(An alternative would have been to complete left and right by squared numbers taken at random…) The result of running this program showed there exist permutations with the above property for N=15,16,17,23,25,26,…,77. Here is the solution for N=49:

25 39 10 26 38 43 21 4 32 49 15 34 30 6 3 22 42 7 9 27 37 12 13 23 41 40 24 1 8 28 36 45 19 17 47 2 14 11 5 44 20 29 35 46 18 31 33 16 48

As an aside, the authors of the Le Monde puzzle claimed (in the Tuesday, Nov. 12, edition) that there was no solution for N=23, while this sequence

22 3 1 8 17 19 6 10 15 21 4 12 13 23 2 14 11 5 20 16 9 7 18

sounds fine enough to me… I wonder more generally about the general principle behind the existence of such sequences. It seems quite probable that they exist for all N>24. (The published solution does not shed any light on this issue, so I assume the authors have no mathematical analysis to offer.)
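Since the claim is easy to check mechanically, here is a small verification script (my own, in Python for brevity) confirming that both the N=49 solution above and the disputed N=23 sequence satisfy the perfect-square condition:

```python
from math import isqrt

def valid(seq, n):
    """Check seq is a permutation of 1..n whose consecutive pairs sum to squares."""
    ok_perm = sorted(seq) == list(range(1, n + 1))
    ok_sums = all(isqrt(a + b) ** 2 == a + b for a, b in zip(seq, seq[1:]))
    return ok_perm and ok_sums

sol49 = [25,39,10,26,38,43,21,4,32,49,15,34,30,6,3,22,42,7,9,27,37,12,13,
         23,41,40,24,1,8,28,36,45,19,17,47,2,14,11,5,44,20,29,35,46,18,31,33,16,48]
sol23 = [22,3,1,8,17,19,6,10,15,21,4,12,13,23,2,14,11,5,20,16,9,7,18]

print(valid(sol49, 49), valid(sol23, 23))  # True True
```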

Filed under: Books, Kids, R, Statistics Tagged: Le Monde, mathematical puzzle, perfect square, R

### that the median cannot be a sufficient statistic

**W**hen reading an entry on The Chemical Statistician suggesting that a sample median could often be a sufficient statistic, it attracted my attention as I had never thought a median could be sufficient. After thinking a wee bit more about it, and even posting a question on Cross Validated without getting an immediate answer, I came to the conclusion that medians (and other quantiles) cannot be sufficient statistics for arbitrary (large enough) sample sizes (a condition that excludes the obvious cases of one or two observations, where the sample median equals the sample mean).

In the case when the support of the distribution does not depend on the unknown parameter θ, we can invoke the Darmois-Pitman-Koopman theorem, namely that the density of the observations is necessarily of the exponential family form

$$f(x\mid\theta)=h(x)\,\exp\{\eta(\theta)T(x)-\psi(\theta)\},$$

to conclude that, if the natural sufficient statistic

$$S=\sum_{i=1}^n T(x_i)$$

is minimal sufficient, then the median is a function of S, which is impossible since modifying an extreme among the *n>2* observations modifies S but not the median.
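The key step in this argument, that perturbing an extreme observation changes any sum-shaped statistic S while leaving the median untouched, is immediate to check numerically:

```python
from statistics import median

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = x[:-1] + [500.0]   # move the largest observation far out

# the natural sufficient statistic (a sum over observations) changes...
assert sum(x) != sum(y)
# ...while the sample median does not
assert median(x) == median(y)
```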

In the other case, when the support does depend on the unknown parameter θ, we can consider the case when

$$f(x\mid\theta)\propto g(x)\,\mathbb{I}_{A_\theta}(x),$$

where the set $A_\theta$ indexed by θ is the support of f. In that case, the factorisation theorem implies that

$$\prod_{i=1}^n \mathbb{I}_{A_\theta}(x_i)$$

is a 0-1 function of the sample median. Adding a further observation y⁰ which does not modify the median then leads to a contradiction, since it may fall inside or outside the support set.

Incidentally, as an aside, when looking for examples, I played with the distribution

which has θ as its theoretical median, though not its mean. In this example, not only is the sample median not sufficient (the only sufficient statistic is the order statistic, and rightly so, since the support is fixed and the distribution is not an exponential family), but the MLE also differs from the sample median. Here is an example with n=30 observations, the sienna bar being the sample median:

Filed under: Kids, Statistics, University life Tagged: blogging, cross validated, exponential families, mean vs. median, Pitman-Koopman theorem, sufficiency

### about the strong likelihood principle

**D**eborah Mayo arXived a Statistical Science paper a few days ago, along with discussions by Jan Bjørnstad, Phil Dawid, Don Fraser, Michael Evans, Jan Hanning, R. Martin and C. Liu. I am very glad that this discussion paper came out, and that it came out in Statistical Science, although I am rather surprised to find no discussion by Jim Berger or Robert Wolpert, and even though I still cannot entirely follow the deductive argument in the rejection of Birnbaum’s proof, just as in the earlier version in Error & Inference. But I somehow do not feel like entering again into a new debate about this critique of Birnbaum’s derivation. (Even though statements like the claim that the SLP “would preclude the use of sampling distributions” (p.227) call for contradiction.)

*“It is the imprecision in Birnbaum’s formulation that leads to a faulty impression of exactly what is proved.” M. Evans*

Indeed, at this stage, I fear that [for me] a more relevant issue is whether or not the debate matters at all… At a logical cum foundational [and maybe cum historical] level, it makes perfect sense to uncover which, if any, of the myriad versions of Birnbaum’s likelihood Principle holds. [Although trying to uncover Birnbaum's motives and positions over time may not be so relevant.] I think the paper and the discussions acknowledge that *some* version of the weak conditionality Principle does not imply *some* version of the strong likelihood Principle, while other logical implications remain true. At a methodological level, I am much less sure it matters. Each time I taught this notion, I got blank stares and incomprehension from my students, to the point that I have now stopped teaching the likelihood Principle in class altogether. And most of my co-authors do not seem to care very much about it. At a purely mathematical level, I wonder if there even is ground for a debate, since the notions involved can be defined in various imprecise ways, as pointed out by Michael Evans above and in his discussion. At a statistical level, sufficiency is eventually a strange notion, in that it seems to make plenty of sense until one realises there is no interesting sufficiency outside exponential families. Just as there are very few parameter transforms for which unbiased estimators can be found. So I also spend very little time teaching, and even less worrying about, sufficiency. (As it happens, I taught the notion this morning!) At another and presumably more significant statistical level, what matters is information, e.g., conditioning means adding information (i.e., about which experiment has been used). While complex settings may prohibit the use of the entire information provided by the data, at a formal level there is no argument for not using the entire information, i.e. conditioning upon the entire data. (At a computational level, this is no longer true, witness ABC and similar limited-information techniques. By the way, ABC demonstrates, if needed, why sampling distributions matter so much to Bayesian analysis.)

*“Non-subjective Bayesians who (…) have to live with some violations of the likelihood principle (…) since their prior probability distributions are influenced by the sampling distribution.” D. Mayo (p.229)*

In the end, the fact that the prior may depend on the form of the sampling distribution, and hence does violate the likelihood Principle, does not worry me so much. In most models I consider, the parameters are endogenous to those sampling distributions and do not live an ethereal existence independently from the model: they are substantiated and calibrated by the model itself, which makes the discussion about the LP rather vacuous. See, e.g., the coefficients of a linear model. In complex models, or with large datasets, it is even impossible to handle the whole data or the whole model, and proxies have to be used instead, making worries about the structure of the (original) likelihood moot. I think we have now reached a stage of statistical inference where models are no longer accepted as ideal truth and where approximation is the hard reality, imposed by the massive amounts of data relentlessly calling for immediate processing. Hence, the self-validation or invalidation of such approximations in terms of predictive performance becomes the relevant issue. Provided we can face the challenge at all…

Filed under: Books, Statistics, University life Tagged: ABC, ABC model choice, Alan Birnbaum, Bayesian Analysis, conditioning, sufficient statistics, The Likelihood Principle, weak conditionality principle