Source author record

Florian Frommlet

Florian Frommlet appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation math.ST Methodology Statistics Theory Applications Machine Learning math.LO

Catalog footprint

What is connected

8works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A subsampling approach for Bayesian model selection

It is common practice to use Laplace approximations to compute marginal likelihoods in Bayesian versions of generalised linear models (GLM). Marginal likelihoods combined with model priors are then used in different search algorithms to compute the posterior marginal probabilities of models and individual covariates. This allows performing Bayesian model selection and model averaging. For large sample sizes, even the Laplace approximation becomes computationally challenging because the optimisation routine involved needs to evaluate the likelihood on the full set of data in multiple iterations. As a consequence, the algorithm is not scalable for large datasets. To address this problem, we suggest using a version of a popular batch stochastic gradient descent (BSGD) algorithm for estimating the marginal likelihood of a GLM by subsampling from the data. We further combine the algorithm with Markov chain Monte Carlo (MCMC) based methods for Bayesian model selection and provide some theoretical results on the convergence of the estimates. Finally, we report results from experiments illustrating the performance of the proposed algorithm.

preprint2020arXiv

A novel algorithmic approach to Bayesian Logic Regression

Logic regression was developed more than a decade ago as a tool to construct predictors from Boolean combinations of binary covariates. It has been mainly used to model epistatic effects in genetic association studies, which is very appealing due to the intuitive interpretation of logic expressions to describe the interaction between genetic variations. Nevertheless logic regression has (partly due to computational challenges) remained less well known than other approaches to epistatic association mapping. Here we will adapt an advanced evolutionary algorithm called GMJMCMC (Genetically modified Mode Jumping Markov Chain Monte Carlo) to perform Bayesian model selection in the space of logic regression models. After describing the algorithmic details of GMJMCMC we perform a comprehensive simulation study that illustrates its performance given logic regression terms of various complexity. Specifically GMJMCMC is shown to be able to identify three-way and even four-way interactions with relatively large power, a level of complexity which has not been achieved by previous implementations of logic regression. We apply GMJMCMC to reanalyze QTL mapping data for Recombinant Inbred Lines in \textit{Arabidopsis thaliana} and from a backcross population in \textit{Drosophila} where we identify several interesting epistatic effects. The method is implemented in an R package which is available on github.

preprint2020arXiv

Rejoinder for the discussion of the paper "A novel algorithmic approach to Bayesian Logic Regression"

In this rejoinder we summarize the comments, questions and remarks on the paper "A novel algorithmic approach to Bayesian Logic Regression" from the discussants. We then respond to those comments, questions and remarks, provide several extensions of the original model and give a tutorial on our R-package EMJMCMC (http://aliaksah.github.io/EMJMCMC2016/)

preprint2015arXiv

An adaptive Ridge procedure for L0 regularization

Penalized selection criteria like AIC or BIC are among the most popular methods for variable selection. Their theoretical properties have been studied intensively and are well understood, but making use of them in case of high-dimensional data is difficult due to the non-convex optimization problem induced by L0 penalties. An elegant solution to this problem is provided by the multi-step adaptive lasso, where iteratively weighted lasso problems are solved, whose weights are updated in such a way that the procedure converges towards selection with L0 penalties. In this paper we introduce an adaptive ridge procedure (AR) which mimics the adaptive lasso, but is based on weighted Ridge problems. After introducing AR its theoretical properties are studied in the particular case of orthogonal linear regression. For the non-orthogonal case extensive simulations are performed to assess the performance of AR. In case of Poisson regression and logistic regression it is illustrated how the iterative procedure of AR can be combined with iterative maximization procedures. The paper ends with an efficient implementation of AR in the context of least-squares segmentation.

preprint2014arXiv

Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion

The prevailing method of analyzing GWAS data is still to test each marker individually, although from a statistical point of view it is quite obvious that in case of complex traits such single marker tests are not ideal. Recently several model selection approaches for GWAS have been suggested, most of them based on LASSO-type procedures. Here we will discuss an alternative model selection approach which is based on a modification of the Bayesian Information Criterion (mBIC2) which was previously shown to have certain asymptotic optimality properties in terms of minimizing the misclassification error. Heuristic search strategies are introduced which attempt to find the model which minimizes mBIC2, and which are efficient enough to allow the analysis of GWAS data. Our approach is implemented in a software package called MOSGWA. Its performance in case control GWAS is compared with the two algorithms HLASSO and GWASelect, as well as with single marker tests, where we performed a simulation study based on real SNP data from the POPRES sample. Our results show that MOSGWA performs slightly better than HLASSO, whereas according to our simulations GWASelect does not control the type I error when used to automatically determine the number of important SNPs. We also reanalyze the GWAS data from the Wellcome Trust Case-Control Consortium (WTCCC) and compare the findings of the different procedures.

preprint2012arXiv

Asymptotic Bayes-optimality under sparsity of some multiple testing procedures

Within a Bayesian decision theoretic framework we investigate some asymptotic optimality properties of a large class of multiple testing rules. A parametric setup is considered, in which observations come from a normal scale mixture model and the total loss is assumed to be the sum of losses for individual tests. Our model can be used for testing point null hypotheses, as well as to distinguish large signals from a multitude of very small effects. A rule is defined to be asymptotically Bayes optimal under sparsity (ABOS), if within our chosen asymptotic framework the ratio of its Bayes risk and that of the Bayes oracle (a rule which minimizes the Bayes risk) converges to one. Our main interest is in the asymptotic scheme where the proportion p of "true" alternatives converges to zero. We fully characterize the class of fixed threshold multiple testing rules which are ABOS, and hence derive conditions for the asymptotic optimality of rules controlling the Bayesian False Discovery Rate (BFDR). We finally provide conditions under which the popular Benjamini-Hochberg (BH) and Bonferroni procedures are ABOS and show that for a wide class of sparsity levels, the threshold of the former can be approximated by a nonrandom threshold.

preprint2011arXiv

Asymptotic Bayes optimality under sparsity for generally distributed effect sizes under the alternative

Recent results concerning asymptotic Bayes-optimality under sparsity (ABOS) of multiple testing procedures are extended to fairly generally distributed effect sizes under the alternative. An asymptotic framework is considered where both the number of tests m and the sample size m go to infinity, while the fraction p of true alternatives converges to zero. It is shown that under mild restrictions on the loss function nontrivial asymptotic inference is possible only if n increases to infinity at least at the rate of log m. Based on this assumption precise conditions are given under which the Bonferroni correction with nominal Family Wise Error Rate (FWER) level alpha and the Benjamini- Hochberg procedure (BH) at FDR level alpha are asymptotically optimal. When n is proportional to log m then alpha can remain fixed, whereas when n increases to infinity at a quicker rate, then alpha has to converge to zero roughly like n^(-1/2). Under these conditions the Bonferroni correction is ABOS in case of extreme sparsity, while BH adapts well to the unknown level of sparsity. In the second part of this article these optimality results are carried over to model selection in the context of multiple regression with orthogonal regressors. Several modifications of Bayesian Information Criterion are considered, controlling either FWER or FDR, and conditions are provided under which these selection criteria are ABOS. Finally the performance of these criteria is examined in a brief simulation study.

preprint2010arXiv

A model selection approach to genome wide association studies

For the vast majority of genome wide association studies (GWAS) published so far, statistical analysis was performed by testing markers individually. In this article we present some elementary statistical considerations which clearly show that in case of complex traits the approach based on multiple regression or generalized linear models is preferable to multiple testing. We introduce a model selection approach to GWAS based on modifications of Bayesian Information Criterion (BIC) and develop some simple search strategies to deal with the huge number of potential models. Comprehensive simulations based on real SNP data confirm that model selection has larger power than multiple testing to detect causal SNPs in complex models. On the other hand multiple testing has substantial problems with proper ranking of causal SNPs and tends to detect a certain number of false positive SNPs, which are not linked to any of the causal mutations. We show that this behavior is typical in GWAS for complex traits and can be explained by an aggregated influence of many small random sample correlations between genotypes of a SNP under investigation and other causal SNPs. We believe that our findings at least partially explain problems with low power and nonreplicability of results in many real data GWAS. Finally, we discuss the advantages of our model selection approach in the context of real data analysis, where we consider publicly available gene expression data as traits for individuals from the HapMap project.