Bayesian News Feeds
Here are the two pairs of beautiful skis offered by Blossom for the Richard Tweedie ski race on Wednesday! Provided the snow cover holds until then on the race track. So far, the chances are very good, according to the ski school organisers. Confirmation this afternoon! It will definitely take place: registration tomorrow morning (Jan. 7, closing at half past noon) and meeting for the race at the top of the Parsa ski-lift (reached via the Brévent cable-car) on the stade (stadium) from 1pm onwards.
Filed under: Mountains, Running, University life Tagged: Blossom skis, Chamonix-Mont-Blanc, ESF, MCMSki IV, Richard Tweedie
MCMSki IV is about to start! While further participants may still register (registration is still open!), there are currently 223 registered participants, not counting accompanying persons. I do hope most of them managed to reach the town of Chamonix-Mont-Blanc despite the foul weather on the East Coast. Unfortunately, four speakers (so far) cannot make it: Yuguo Chen (Urbana-Champaign), David Hunter (Penn State), Georgios Karagiannis (Toronto), and Liam Paninski (New York). Nial Friel will replace David Hunter and give a talk on Noisy MCMC.
First, the posters for tonight's session (authors A to K) should be put up today (before dinner) on the boards at the end of the main lecture theatre, and removed tonight as well. Check my wordpress blog for the abstracts. (When I mentioned there was no deadline for sending abstracts, I did not expect to receive one as late as last Friday!)
Second, I remind potential skiers that the most manageable option is to ski the Brévent area, uphill from the conference centre. There is even a small rental shop facing the cable-car station (make sure to phone +33450535264 to check they still have skis available), which also rents storage lockers…
Filed under: Mountains, R, Statistics, University life Tagged: Chamonix, doodle, France, Geneva, ISBA conference, MCMC, MCMSki IV, Mont Blanc, Monte Carlo Statistical Methods, posters, shuttle, simulation, ski
[To keep with tradition, here are my daughter's comments on the second Hobbit movie.]
I think this second instalment is just as good as the first part of The Hobbit. The biggest mistake of the movie is the part with the dragon: he spits fire when he wants, and when he could kill some dwarves he doesn’t do it! He is said to be very smart, but the dwarves manage to deceive him easily. The part with the elves is not really better; the language imagined by Tolkien is fabulous, but I expected more surprises in this universe. The fact that the dwarves get out so easily is also incredible! And the fight with the orcs is unrealistic too… The part in the forest is well made, the spiders seem real, and the intervention of Bilbo is superb. The man who can change himself into a bear is a great idea, well realised, but he does not act in a logical way: he runs after the dwarves and two seconds later he lets them sleep in his house. The landscapes at the beginning are awesome, which makes a great opening. But the music is disappointing, because there are very few songs and they are not as entertaining as the hobbit theme or the song of the Lonely Mountain in the first instalment; still, the last song pushes the level up, too bad it comes at the end. The actors play quite well, and the new characters are really well made, like the fisherman Bard, who contributes to a good section of the movie that feeds our curiosity. He is intriguing and his story unravels one step at a time. The Master of Laketown is also typical, and his character is easily understood. The female elf is not a glamorous girl as in caricatural American movies, a good feature because it is a change from other films. Tauriel plays in an interesting way but seems a little naïve at times. We do not quite understand her feelings towards Legolas and the dwarf Kili.
Filed under: Books, Kids Tagged: Bilbo, elves, Far Over the Misty Mountains Cold, Smaug, songs, Tolkien
Filed under: Kids, Mountains Tagged: Alps, cable car, Chamonix-Mont-Blanc, Les Drus, Les Houches, MCMSki IV, ski
Xiao-Li Meng asked this question in his latest XL column, to which Andrew replied faster than I did, and in the same mood as mine. I had taken part in a recent discussion on this topic within the IMS Council, namely whether or not the IMS should join other organisations like the ASA in funding and supporting this potential prize. My initial reaction was one of surprise that we could consider mimicking/hijacking the Nobel for our field. First, I dislike the whole spirit of most prizes, from the personalisation to the media frenzy and distortion, to the notion that we could rank discoveries and research careers within a whole field, or separate what is clearly due to a single individual from what is due to a team of researchers.
Being clueless about those fields, I will not get into a discussion of who should have gotten a Nobel Prize in medicine, physics, or chemistry, and who should not have. But there are certainly many worthy competitors to the actual winners. And this is not the point: I do not see how any of this fights the decline in science students in most of the Western world. That is, how a teenager would be more enticed to undertake maths or physics studies because she saw a couple of old guys wearing weird clothes getting a medal and a cheque in Sweden. I have no actual data, but could Xiao-Li give me a quantitative assessment of the claim that Nobel Prizes “attract future talent”? Chemistry departments keep closing for lack of a sufficient number of students, and (pure) maths and physics departments are threatened with the same fate… Even the Fields Medal, which has at least the appeal of being awarded to younger researchers, does not seem to fit Xiao-Li’s argument. (To take a specific example: the recent Fields medallist Cédric Villani is a great communicator and took advantage of his medal to promote maths throughout France, in conferences, in the media, and by launching all kinds of initiatives. I still remain sceptical about the overall impact on recruiting young blood into maths programs [again with no data to back up my feeling].) I will say even less about the Nobel prizes for literature and peace, as there clearly is a political agenda in the nominations. (And selecting Sartre for the Nobel Prize in literature definitely discredited it. At least for me.)
“…the media and public have given much more attention to the Fields Medal than to the COPSS Award, even though the former has hardly been about direct or even indirect impact on everyday life.” XL
Well, I do not see this other point of Xiao-Li’s. Nobel prizes are not prestigious for their impact on society, as most people do not understand at all what the rewarded research (career) is about. The most extreme example is the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel: on the one hand, Xiao-Li is right in pointing out that this is a very successful post-Alfred creation of a “Nobel Prize”. On the other hand, the fact that some years see two competing theories win simultaneously leads me to consider that this prize gives priority to theoretical constructs over any impact on the World’s economy. Obviously, this statement is a bit like shooting our field in the foot, since the only statisticians who got a Nobel Prize are econometricians and game-theorists! Nonetheless, it also shows that the happy few statisticians who entered the Nobel Olympus did not bring a bonus to the field… I thus remain my usual pessimistic self about the impact of a whatever-company Prize in Statistical Sciences in Memory of Alfred Nobel.
Another remark is the contrast between the COPSS Award, which remains completely ignored by the media (despite a wealth of great recipients with achievements in various domains), and the Fields Medal (which is not ignored). This has been a curse of Statistics, discussed at large, namely the difficulty of separating what is maths and what is outside maths within the field. The Fields Medal is clearly very unlikely to go to a statistician, even a highly theoretical statistician, as there will always be “sexier” maths results, i.e., corpora of work that will be seen as higher maths than, say, the invention of the Lasso or the creation of generalised linear models. So there is no hope of creating an alternative Fields Medal with the same shine. Just like the Nobel Prize.
Other issues I could have mentioned, were it not for the length of the current rant, are the creation of rewards for solving a specific problem (as found in machine learning), for involving multidisciplinary and multi-country research teams, and for reaching new orders of magnitude in processing large data problems.
Filed under: Kids, Statistics, University life Tagged: ASA, COPSS Award, Fields medal, IMS, Nobel Prize
We are a few days from the start; here are the latest items of information for the participants:
The shuttle transfer on January 5th from Geneva Airport to Chamonix lasts 1 hour 30 minutes. On arrival at the airport, follow the “Swiss Exit”. After customs, the bus driver (holding a sign “MCMC’Ski Chamonix”) will be waiting for you at the Meeting Point in the Arrival Hall. The driver will arrive 10 minutes before the meeting time and will check each participant against his or her list. There may be delays in case of poor weather. The bus will drop you in front of or close to your hotel. If you miss the bus you initially booked, you can take the next one. If you miss the last transfer, taking a taxi will be the only solution (warning: about 250 Euros!!!)
Registration will start on Monday January 6th at 8am; the conference will start at 8.45am. The conference will take place at the Majestic Congress Center, located at 241 Allée du Majestic, in downtown Chamonix. There are signs all over town directing to the Majestic Congrès. (No skiing equipment, i.e., skis, boots, or boards, is allowed inside the building.) Speakers are advised to check with their session chair in advance about loading their talk.
The Richard Tweedie ski race should take place on Wednesday at 1pm, weather and snow permitting. There will be a sign-up list at the registration desk. (The cost is 10€ per person and does not include lift passes or equipment.) Thanks to Antonietta Mira, there will be two pairs of skis to be won!
Filed under: Mountains, R, Statistics, University life Tagged: Chamonix, doodle, France, Geneva, ISBA conference, MCMC, MCMSki IV, Mont Blanc, Monte Carlo Statistical Methods, posters, Richard Tweedie, shuttle, simulation, ski
Almost immediately after I published my comments on his paper with David Dunson, Xiangyu Wang sent a long comment that I think is worth a post of its own (especially given that I am now busy skiing and enjoying Chamonix!). So here it is:
Thanks for the thoughtful comments. I did not realise that Neiswanger et al. also proposed a similar trick to the one we used for the rejection sampler to avoid the combinatorial problem. Thank you for pointing that out.
Regarding criticism 3 on tail degeneration, we did not mean to fire at non-parametric estimation issues, but rather at the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product mainly depends on the tails of the two densities (the overlapping area); if there are more than two densities, the impact is even more significant. As a result, it may be unwise to use the product equation directly, as the most distant sub-posteriors could potentially be very far away from each other, with most of the sub-posterior draws lying outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue: as shown in equation 5, there is a common part multiplying each sub-posterior, which brings them close together.)
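This overlap problem is easy to see numerically. In the following illustrative sketch (all means and variances are made up, not taken from the paper), the product of three Gaussian sub-posteriors with drifting means concentrates on a narrow region that draws from the extreme sub-posteriors rarely visit:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# three hypothetical Gaussian sub-posteriors with drifting means
grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]
means = [-2.0, 0.0, 2.0]
subs = [normal_pdf(grid, m, 1.0) for m in means]

# unnormalised product of the sub-posterior densities, renormalised on the grid
prod = np.prod(subs, axis=0)
prod /= (prod * dx).sum()

# the product is a N(0, 1/3) density: much tighter than any sub-posterior,
# so draws from the extreme sub-posteriors mostly fall where it has no mass
prod_mean = (grid * prod * dx).sum()
prod_var = ((grid - prod_mean) ** 2 * prod * dx).sum()
print(round(prod_mean, 3), round(prod_var, 3))
```

The further the sub-posterior means drift apart, the thinner the overlap region, which is the tail-dependence issue described above.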
Point 4 states the problem caused by averaging. The approximated density following Neiswanger et al. (2013) is a mixture of Gaussians whose component means are averages of the sub-posterior draws. Therefore, if the sub-posteriors stick to different modes (assuming the true posterior is multimodal), the approximated density is likely to mess up the modes and produce some fake modes (e.g., averages of the true modes; we provide an example in simulation 3).
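The fake-mode effect is easy to reproduce. In this made-up sketch (numbers are illustrative, not from the paper), two sub-posteriors locked onto opposite modes of a bimodal posterior are averaged, and the averaged draws concentrate at a point where the true posterior has essentially no mass:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# hypothetical scenario: the true posterior is bimodal, with modes at -3 and +3,
# and each of the two sub-posteriors has locked onto a different mode
draws_left = rng.normal(-3.0, 0.5, n)   # sub-posterior stuck at the left mode
draws_right = rng.normal(+3.0, 0.5, n)  # sub-posterior stuck at the right mode

# simple averaging of paired sub-posterior draws
averaged = 0.5 * (draws_left + draws_right)

# the averaged draws pile up around 0, the average of the two modes,
# a region where the true bimodal posterior has essentially no mass
print(round(averaged.mean(), 2), round(averaged.std(), 2))
```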
Sorry for the vague description of the refining method (4.2). The idea is kinda dull: we start from an initial approximation to θ and then do a one-step Gibbs update to obtain a new θ; we call this procedure ‘refining’, as we believe such a process brings the original approximation closer to the true posterior distribution.
The first (4.1) and second (4.2) algorithms may indeed seem odd to call ‘parallel’, since they are both modified from the Gibbs sampler described in (4) and (5). The reason we propose these two algorithms is to overcome two problems. The first is the curse of dimensionality, and the second is the case where the subset inferences are not very accurate (small subset effective sample size), which might be a common scenario for logistic regression (with many parameters) even with a huge data set. First, algorithms (4.1) and (4.2) both start from some initial approximation and attempt to improve it into a better approximation, thus avoiding the dimensionality issue. Second, in our simulation 1, we tried to degrade the performance of simple averaging by worsening the sub-posterior performance (allocating a smaller amount of data to each subset), and the non-parametric method then fails to approximate the combined density as well. However, algorithms 4.1 and 4.2 still work in this case.
I have some problems with the logistic regression example provided in Neiswanger et al. (2013). As shown in the paper, under the authors’ setting (not fully specified in the paper), though the non-parametric method is better than simple averaging, the approximation error of simple averaging is small enough for practical use (I also have some problems with their error evaluation method), so why should we still bother with a much more complicated method?
Actually, I am adding a new algorithm to the Weierstrass rejection sampler, which will render it thoroughly free from the curse of dimensionality in p. The new scheme is applicable to the non-parametric method of Neiswanger et al. (2013) as well. It should appear soon in the second version of the draft.
Filed under: Books, Statistics, University life Tagged: big data, Chamonix, Duke University, kernel density estimator, large dimensions, likelihood-free methods, MCMC, O-Bayes 2013, parallel processing, ski, snow, untractable normalizing constant, Xiangyu Wang
During O’Bayes 2013, Xiangyu Wang and David Dunson arXived a paper (with the above title) that David then presented on the 19th. The setting is quite similar to the recently discussed embarrassingly parallel paper of Neiswanger et al., in that Xiangyu and David start from the same product representation of the target (posterior). Namely,
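The displayed formula did not survive extraction; presumably it is the usual product decomposition over m data subsets, with the fractionated-prior convention of Neiswanger et al. (reconstructed here from context):

```latex
\[
  \pi(\theta \mid x) \;\propto\; \prod_{i=1}^{m} \pi_i(\theta \mid x_i),
  \qquad
  \pi_i(\theta \mid x_i) \;\propto\; \pi(\theta)^{1/m}\, p(x_i \mid \theta),
\]
```

so that each sub-posterior only involves the data on its own subset.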
However, they criticise the choice made by Neiswanger et al. to use MCMC approximations of each component of the product, for the following reasons:
- Curse of dimensionality in the number of parameters p
- Curse of dimensionality in the number of subsets m
- Tail degeneration
- Support inconsistency and mode misspecification
While I agree on point 1 (although there may be other ways than kernel estimation to mix samples from the terms in the product, terms Neiswanger et al. called the subposteriors), which is also a drawback of the current method, point 2 is not such a clear-cut drawback. I first had the same reaction about the T^m explosion in the number of terms in a product of m sums of T terms, but Neiswanger et al. use a clever trick to avoid the combinatorial explosion, namely to operate one mixture at a time. And having non-manageable targets is not such an issue in the post-MCMC era, is it?! Point 3 is formally correct, in that the kernel tail behaviour induces the kernel estimate’s tail behaviour, most likely disconnected from the true target’s tail behaviour, but this feature is true for any non-parametric estimate, I believe, even for the Weierstraß transform, and hence maybe not so relevant in practice. In fact, my operational intuition is that, by lifting the tails up, simulation from the subposteriors should help in visiting the tails of the true target. (Proofs of convergence and bounds rely on L¹ norms, which pay little attention to tails.) At last, point 4 does not seem to be life-threatening: assuming the true target can be computed up to a normalising constant, the value of the target at every simulated parameter could be computed, eliminating those outside the support of the product and highlighting modal regions. (The paper does mention the improvement brought by computing those target values.)
The Weierstraß transform of a density f is the convolution of f with an arbitrary kernel K. The authors propose to simulate from the product of the Weierstraß transforms, using a multi-tiered Gibbs sampler. This way, the parameter is only simulated once, and from a controlled kernel, while the random effects from the convolution are associated with the respective subposteriors. While the method requires coordination between the parallel threads, each component of the target is computed separately on its own thread. (I do not get the refinement in Section 4.2, about how “we can regard [θ] as a hyper-parameter, which allows to draw samples of [random effects] hierarchically”: once a cycle is completed, all threads must wait for the common θ to be updated.) Maybe the clearest perspective on the Weierstraß transform is the rejection sampling version (Section 4.3), where simulations from the subposteriors are merged into a normal proposal on θ, to be accepted with a probability depending on the subposterior simulations.
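To make the transform itself concrete, here is a purely illustrative numerical check (target, kernel, and bandwidth are made up): convolving a density with a Gaussian kernel of bandwidth h yields another density whose variance is inflated by h², which is the tail-lifting effect mentioned above.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(-15, 15, 6001)
dx = grid[1] - grid[0]

f = normal_pdf(grid, 0.0, 1.0)       # target density f
h = 1.5                              # illustrative kernel bandwidth
kernel = normal_pdf(grid, 0.0, h)    # Gaussian kernel K_h

# Weierstrass transform: (W_h f)(x) = integral of K_h(x - mu) f(mu) dmu
Wf = np.convolve(f, kernel, mode="same") * dx

mass = (Wf * dx).sum()               # still (numerically) a probability density
var = (grid ** 2 * Wf * dx).sum()    # variance inflated from 1 to 1 + h^2
print(round(mass, 3), round(var, 2))
```

The heavier tails of W_h f relative to f are what make the transformed subposteriors easier to overlap and recombine.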
There is something puzzling and vaguely familiar about the Weierstraß transform, something I cannot really put my finger on. It reminds me of the (great) convolution trick Jacques Neveu used to prove the Central Limit Theorem in his École Polytechnique notes. But it also reminds me of West and Harrison’s notion of a time-evolving parameter (a notion that took [me] a while to sink in…). Definitely interesting!
Filed under: Books, Statistics, University life Tagged: big data, Duke University, kernel density estimator, large dimensions, likelihood-free methods, MCMC, O-Bayes 2013, parallel processing, untractable normalizing constant
The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.
Here’s an excerpt:
The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 250,000 times in 2013. If it were an exhibit at the Louvre Museum, it would take about 11 days for that many people to see it.
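The arithmetic behind the comparison checks out, as a quick back-of-the-envelope computation shows:

```python
# WordPress's Louvre comparison, redone by hand
louvre_visitors_per_year = 8_500_000
blog_views_2013 = 250_000

louvre_visitors_per_day = louvre_visitors_per_year / 365
days_needed = blog_views_2013 / louvre_visitors_per_day
print(round(louvre_visitors_per_day), round(days_needed, 1))  # about 11 days
```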
Filed under: Uncategorized Tagged: 2013, Wordpress
I am now in Chamonix, a week before MCMSki IV. The town is packed with tourists from all over Europe, English being the dominant language. There is not much snow so far, even though some runs reach town. (I did ski in nearby Les Houches today and the red runs were either icy or very thin, with stones and grass showing here and there…) Since the town is quite pricey compared with other French ski resorts, especially for rentals, let me point out the Technique Extrême store, with very low rates, located a few blocks from the conference centre. I am currently trying to see whether or not we could get special rates for the daily passes. I checked whether ski passes could be sold at the registration desk, but the only solution was for me to buy them in advance, so I gave up on that. If you arrive on Sunday, my advice is to buy the pass at one of the cable-car stations (which close around 5pm). You can also buy it online and pick it up later from a cable-car station booth…
Filed under: Mountains, R, Statistics, University life Tagged: calblecar, Chamonix, France, MCMC, MCMSki IV, Mont Blanc, Monte Carlo Statistical Methods, posters, simulation, ski, skipass, Technique Extrême
In yet another permutation of the original title (!), Andrew Gelman posted the answer Val Johnson sent him after our (submitted) letter to PNAS. As Val did not send me a copy (although Andrew did!), I will not reproduce it here and rather refer interested readers to Andrew’s blog… In addition to Andrew’s (sensible) points, here are a few idle (post-X’mas and pre-skiing) reflections:
- “evidence against a false null hypothesis accrues exponentially fast” makes me wonder in which metric this exponential rate (in γ?) occurs;
- that “most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater than 25)” is difficult to accept as an argument, since there is no trace of a decision-theoretic argument in the whole paper;
- Val rejects our minimaxity argument on the basis that “[UMPBTs] do not involve minimization of maximum loss”, but the prior corresponding to those tests minimises the integrated probability of not rejecting at threshold level γ, a loss function integrated against parameter and observation, in other words a Bayes risk… Point masses or spike priors are clear characteristics of minimax priors. Furthermore, the additional argument that “in most applications, however, a unique loss function/prior distribution combination does not exist” has been used by many to refute the Bayesian perspective, and makes me wonder what arguments are left for using a (pseudo-)Bayesian approach;
- the next paragraph is pure tautology: the fact that “no other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold” is a paraphrase of the definition of UMPBTs, not an argument. Nor do I see why we should solely “worry about false negatives”, since minimising those would lead to a point mass on the null (or, more seriously, should not lead to the minimax-like selection of the prior under the alternative).
Filed under: Statistics, University life Tagged: Bayes factors, Bayesian tests, evidence, False positive, minima, PNAS, statistical significance, UMPBTs, uniformly most powerful tests, Valen Johnson
Source: Bayesian Anal., Volume 8, Number 4, 741--758.
This article examines the convergence properties of a Bayesian model selection procedure based on a non-local prior density in ultrahigh-dimensional settings. The performance of the model selection procedure is also compared to popular penalized likelihood methods. Coupling diagnostics are used to bound the total variation distance between iterates in a Markov chain Monte Carlo (MCMC) algorithm and the posterior distribution on the model space. In several simulation scenarios in which the number of observations exceeds 100, rapid convergence and high accuracy of the Bayesian procedure are demonstrated. Conversely, the coupling diagnostics are successful in diagnosing lack of convergence in several scenarios for which the number of observations is less than 100. The accuracy of the Bayesian model selection procedure in identifying high probability models is shown to be comparable to commonly used penalized likelihood methods, including extensions of smoothly clipped absolute deviations (SCAD) and least absolute shrinkage and selection operator (LASSO) procedures.
Source: Bayesian Anal., Volume 8, Number 4, 759--780.
Histone modifications (HMs) play important roles in transcription through post-translational modifications. Combinations of HMs, known as chromatin signatures, encode specific messages for gene regulation. We therefore expect that inference on possible clustering of HMs and an annotation of genomic locations on the basis of such clustering can contribute new insights about the functions of regulatory elements and their relationships to combinations of HMs. We propose a nonparametric Bayesian local clustering Poisson model (NoB-LCP) to facilitate posterior inference on two-dimensional clustering of HMs and genomic locations. The NoB-LCP clusters HMs into HM sets and lets each HM set define its own clustering of genomic locations. Furthermore, it probabilistically excludes HMs and genomic locations that are irrelevant to clustering. By doing so, the proposed model effectively identifies important sets of HMs and groups regulatory elements with similar functionality based on HM patterns.
Source: Bayesian Anal., Volume 8, Number 4, 781--800.
We study a Bayesian model where we have made specific requests about the parameter values to be estimated. The aim is to find the parameter of a parametric family which minimizes a distance to the data generating density and then to estimate the discrepancy using nonparametric methods. We illustrate how coherent updating can proceed given that the standard Bayesian posterior from an unidentifiable model is inappropriate. Our updating is performed using Markov Chain Monte Carlo methods and in particular a novel method for dealing with intractable normalizing constants is required. Illustrations using synthetic data are provided.
Source: Bayesian Anal., Volume 8, Number 4, 801--836.
The problem of inferring a clustering of a data set has been the subject of much research in Bayesian analysis, and there currently exists a solid mathematical foundation for Bayesian approaches to clustering. In particular, the class of probability distributions over partitions of a data set has been characterized in a number of ways, including via exchangeable partition probability functions (EPPFs) and the Kingman paintbox. Here, we develop a generalization of the clustering problem, called feature allocation, where we allow each data point to belong to an arbitrary, non-negative integer number of groups, now called features or topics. We define and study an “exchangeable feature probability function” (EFPF)—analogous to the EPPF in the clustering setting—for certain types of feature models. Moreover, we introduce a “feature paintbox” characterization—analogous to the Kingman paintbox for clustering—of the class of exchangeable feature models. We provide a further characterization of the subclass of feature allocations that have EFPF representations.
Source: Bayesian Anal., Volume 8, Number 4, 837--882.
We propose a general algorithm for approximating nonstandard Bayesian posterior distributions. The algorithm minimizes the Kullback-Leibler divergence of an approximating distribution to the intractable posterior distribution. Our method can be used to approximate any posterior distribution, provided that it is given in closed form up to the proportionality constant. The approximation can be any distribution in the exponential family or any mixture of such distributions, which means that it can be made arbitrarily precise. Several examples illustrate the speed and accuracy of our approximation method in practice.
Source: Bayesian Anal., Volume 8, Number 4, 883--908.
It is sometimes preferable to conduct statistical analyses based on the combination of several models rather than on the selection of a single model, thus taking into account the uncertainty about the true model. Models are usually combined using constant weights that do not distinguish between different regions of the covariate space. However, a procedure that performs well in a given situation may not do so in another situation. In this paper, we propose the concept of local Bayes factors, where we calculate the Bayes factors by restricting the models to regions of the covariate space. The covariate space is split in such a way that the relative model efficiencies of the various Bayesian models are about the same in the same region while differing in different regions. An algorithm for clustered Bayes averaging is then proposed for model combination, where local Bayes factors are used to guide the weighting of the Bayesian models. Simulations and real data studies show that clustered Bayesian averaging results in better predictive performance compared to a single Bayesian model or Bayesian model averaging where models are combined using the same weights over the entire covariate space.