Topic overview

Computation

1468 works · 3673 researchers · 0 institutions

Topic snapshot

What this area looks like now

1468 works
3673 authors
0 experts visible
0 communities

Papers in this area

24 featured work(s)

preprint · 2020 · arXiv

ParaDRAM: A Cross-Language Toolbox for Parallel High-Performance Delayed-Rejection Adaptive Metropolis Markov Chain Monte Carlo Simulations

We present ParaDRAM, a high-performance Parallel Delayed-Rejection Adaptive Metropolis Markov Chain Monte Carlo software for optimization, sampling, and integration of mathematical objective functions encountered in scientific inference. ParaDRAM is currently accessible from several popular programming languages, including C/C++, Fortran, MATLAB, and Python, and is part of the ParaMonte open-source project with the following principal design goals: 1. full automation of Monte Carlo simulations, 2. interoperability of the core library with as many programming languages as possible, thus providing a unified Application Programming Interface and Monte Carlo simulation environment across all programming languages, 3. high performance, 4. parallelizability and scalability of simulations from personal laptops to supercomputers, 5. virtually zero dependence on external libraries, 6. fully deterministic reproducibility of simulations, 7. automatic comprehensive reporting and post-processing of the simulation results. We present and discuss several novel techniques implemented in ParaDRAM to automatically and dynamically ensure the good-mixing and the diminishing-adaptation of the resulting pseud

preprint · 2020 · arXiv

Individual-level Modeling of COVID-19 Epidemic Risk

The ongoing COVID-19 pandemic calls for a multi-faceted public health response comprising complementary interventions to control the spread of the disease while vaccines and therapies are developed. Many of these interventions need to be informed by epidemic risk predictions given available data, including symptoms, contact patterns, and environmental factors. Here we propose a novel probabilistic formalism based on Individual-Level Models (ILMs) that offers rigorous formulas for the probability of infection of individuals, which can be parameterised via Maximum Likelihood Estimation (MLE) applied to compartmental models defined at the population level. We describe an approach where individual data collected in real time is integrated with overall case counts to update a predictor of the susceptibility to infection as a function of individual risk factors.

preprint · 2020 · arXiv

Comparison of non-parametric global envelopes

We present a simulation study to compare different non-parametric global envelopes that are refinements of the rank envelope proposed by Myllymäki et al. (2017, Global envelope tests for spatial processes, J. R. Statist. Soc. B 79, 381-404, doi: 10.1111/rssb.12172). The global envelopes are constructed for a set of functions or vectors. For a large number of vectors, all the refinements lead to the same outcome as the global rank envelope. For smaller numbers of vectors the refinement plays a role, and different refinements are sensitive to different types of extremeness of a vector among the set of vectors. The performance of the different alternatives is compared in a simulation study with respect to the number of available vectors, the dimensionality of the vectors, the amount of dependence between the vector elements and the expected type of extremeness.

preprint · 2020 · arXiv

Low-complexity Architecture for AR(1) Inference

In this Letter, we propose a low-complexity estimator for the correlation coefficient based on the signed $\operatorname{AR}(1)$ process. The introduced approximation is suitable for implementation in low-power hardware architectures. Monte Carlo simulations reveal that the proposed estimator performs comparably to the competing methods in the literature, with a maximum error on the order of $10^{-2}$. Moreover, the hardware implementation of the introduced method presents considerable advantages in several relevant metrics, offering more than 95% reduction in dynamic power and doubling the maximum operating frequency when compared to the reference method.
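To make the idea of estimating a correlation coefficient from signs concrete, here is a minimal Python sketch. It is not the Letter's low-complexity approximation: it assumes a stationary Gaussian AR(1) process and uses the classical arcsine law, under which the probability that two consecutive samples share a sign is 1/2 + arcsin(ρ)/π. The function names `simulate_ar1` and `sign_based_correlation` are hypothetical.

```python
import math
import random


def simulate_ar1(rho, n, seed=0):
    """Simulate a stationary, unit-variance Gaussian AR(1) series of length n."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    scale = math.sqrt(1.0 - rho * rho)  # keeps the marginal variance at 1
    series = [x]
    for _ in range(n - 1):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
        series.append(x)
    return series


def sign_based_correlation(series):
    """Estimate the lag-1 correlation using only the signs of the samples.

    Assumption (not taken from the Letter): Gaussian data, so that
    P(same sign) = 1/2 + arcsin(rho) / pi, which inverts to
    rho = sin(pi * (P(same sign) - 1/2)).
    """
    signs = [1 if v >= 0 else -1 for v in series]
    agree = sum(1 for a, b in zip(signs, signs[1:]) if a == b)
    p_agree = agree / (len(signs) - 1)
    return math.sin(math.pi * (p_agree - 0.5))


if __name__ == "__main__":
    data = simulate_ar1(rho=0.6, n=100_000)
    print(sign_based_correlation(data))
```

Only comparisons and a counter are needed before the final sine evaluation, which suggests why sign-based estimators are attractive for low-power hardware; how the Letter approximates that last step is not shown here.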

preprint · 2020 · arXiv

Robust Mean Estimation in High Dimensions via $\ell_0$ Minimization

We study the robust mean estimation problem in high dimensions, where an $\alpha<0.5$ fraction of the data points can be arbitrarily corrupted. Motivated by compressive sensing, we formulate the robust mean estimation problem as the minimization of the $\ell_0$-'norm' of the outlier indicator vector, under second moment constraints on the inlier data points. We prove that the global minimum of this objective is order optimal for the robust mean estimation problem, and we propose a general framework for minimizing the objective. We further leverage the $\ell_1$ and $\ell_p$ $(0<p<1)$ minimization techniques in compressive sensing to provide computationally tractable solutions to the $\ell_0$ minimization problem. Both synthetic and real data experiments demonstrate that the proposed algorithms significantly outperform state-of-the-art robust mean estimation methods.

preprint · 2020 · arXiv

Nested sampling cross-checks using order statistics

Nested sampling (NS) is an invaluable tool in data analysis in modern astrophysics, cosmology, gravitational wave astronomy and particle physics. We identify a previously unused property of NS related to order statistics: the insertion indexes of new live points into the existing live points should be uniformly distributed. This observation enabled us to create a novel cross-check of single NS runs. The tests can detect when an NS run failed to sample new live points from the constrained prior, and can detect plateaus in the likelihood function, which break an assumption of NS and thus lead to unreliable results. We applied our cross-check to NS runs on toy functions with known analytic results in 2 to 50 dimensions, showing that our approach can detect problematic runs on a variety of likelihoods, settings and dimensions. As an example of a realistic application, we cross-checked NS runs performed in the context of cosmological model selection. Since the cross-check is simple, we recommend that it become a mandatory test for every applicable NS run.
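The uniformity property described above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the toy problem (prior x ~ U(0, 1), log-likelihood -x), the `shrink` parameter that mimics a faulty sampler, and the use of a plain Kolmogorov-Smirnov-style distance instead of a calibrated test.

```python
import bisect
import random


def insertion_indexes(n_live, n_iter, shrink=1.0, seed=0):
    """Run a toy NS loop and return the insertion index of each new live point.

    Toy problem (hypothetical): prior x ~ U(0, 1) and logL(x) = -x, so the
    constrained prior {logL > logL_min} is exactly x ~ U(0, x_max).
    shrink < 1 mimics a faulty sampler that explores only part of it.
    """
    rng = random.Random(seed)
    live = sorted(-rng.random() for _ in range(n_live))  # ascending logL
    indexes = []
    for _ in range(n_iter):
        logl_min = live.pop(0)  # discard the worst live point
        x_max = -logl_min
        new_logl = -rng.uniform(0.0, shrink * x_max)
        # Rank of the new point among the n_live - 1 survivors: 0 .. n_live - 1.
        idx = bisect.bisect_left(live, new_logl)
        indexes.append(idx)
        live.insert(idx, new_logl)
    return indexes


def distance_to_uniform(indexes, n_live):
    """Largest gap between the empirical CDF of the insertion indexes and the
    CDF of the discrete uniform distribution on 0 .. n_live - 1."""
    counts = [0] * n_live
    for i in indexes:
        counts[i] += 1
    total = len(indexes)
    cumulative, worst = 0, 0.0
    for k in range(n_live):
        cumulative += counts[k]
        worst = max(worst, abs(cumulative / total - (k + 1) / n_live))
    return worst


if __name__ == "__main__":
    good = distance_to_uniform(insertion_indexes(50, 5000), 50)
    bad = distance_to_uniform(insertion_indexes(50, 5000, shrink=0.5), 50)
    print(good, bad)
```

With a correct sampler the distance stays close to zero, while the faulty sampler concentrates the insertion indexes at one end and produces a much larger distance. A real cross-check would convert the distance into a p-value.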

preprint · 2020 · arXiv

Logistic regression models for aggregated data

Logistic regression models are a popular and effective method to predict the probability of categorical response data. However, inference for these models can become computationally prohibitive for large datasets. Here we adapt ideas from symbolic data analysis to summarise the collection of predictor variables into histogram form, and perform inference on this summary dataset. We develop ideas based on composite likelihoods to derive an efficient one-versus-rest approximate composite likelihood model for histogram-based random variables, constructed from low-dimensional marginal histograms obtained from the full histogram. We demonstrate that this procedure can achieve classification rates comparable to those of the standard full data multinomial analysis and of state-of-the-art subsampling algorithms for logistic regression, but at a substantially lower computational cost. Performance is explored through simulated examples, and analyses of large supersymmetry and satellite crop classification datasets.

preprint · 2020 · arXiv

A table of short-period Tausworthe generators for Markov chain quasi-Monte Carlo

We consider the problem of estimating expectations by using Markov chain Monte Carlo methods and improving the accuracy by replacing IID uniform random points with quasi-Monte Carlo (QMC) points. Recently, it has been shown that Markov chain QMC remains consistent when the driving sequences are completely uniformly distributed (CUD). However, the definition of CUD sequences is not constructive, so an implementation method using short-period Tausworthe generators (i.e., linear feedback shift register generators over the two-element field) that approximate CUD sequences has been proposed. In this paper, we conduct an exhaustive search of short-period Tausworthe generators for Markov chain QMC in terms of the $t$-value, which is a criterion of uniformity widely used in the study of QMC methods. We provide a parameter table of Tausworthe generators and show the effectiveness in numerical examples using Gibbs sampling.
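As a rough illustration of what a short-period Tausworthe generator is, the following Python sketch builds a linear feedback shift register over the two-element field and reads uniform numbers off its bit stream. The parameters (the primitive trinomial x^5 + x^2 + 1, step size 5, 5 output bits) are chosen only for illustration and are not taken from the paper's table; the function names are hypothetical.

```python
def tausworthe_bits(p, q, seed_bits, n_bits):
    """Bit stream a_n = a_{n-p} XOR a_{n-p+q} over GF(2).

    This is the recurrence with characteristic trinomial x^p + x^q + 1.
    seed_bits must contain p bits, not all zero.
    """
    bits = list(seed_bits)
    if len(bits) != p or not any(bits):
        raise ValueError("seed must contain p bits, not all zero")
    while len(bits) < n_bits:
        bits.append(bits[-p] ^ bits[-p + q])
    return bits[:n_bits]


def tausworthe_uniforms(p, q, seed_bits, step, width, count):
    """Read `count` numbers in [0, 1): the i-th output uses the `width` bits
    starting at position i * step as its binary fraction."""
    bits = tausworthe_bits(p, q, seed_bits, step * (count - 1) + width)
    return [
        sum(bits[i * step + j] * 2.0 ** -(j + 1) for j in range(width))
        for i in range(count)
    ]


if __name__ == "__main__":
    # Illustrative parameters: x^5 + x^2 + 1 gives a bit period of 2^5 - 1 = 31.
    points = tausworthe_uniforms(5, 2, [1, 0, 0, 0, 0], step=5, width=5, count=31)
    print(sorted(points))
```

Because the step size 5 is coprime to the period 31, one full period visits every nonzero 5-bit pattern exactly once. Using such small, fully traversed point sets is what the Markov chain QMC setting calls for, and the paper's search ranks candidate generators by their $t$-value, which this sketch does not compute.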

preprint · 2020 · arXiv

Prediction of Hilbertian autoregressive processes : a Recurrent Neural Network approach

The autoregressive Hilbertian model (ARH) was introduced in the early 90's by Denis Bosq. It was the subject of a vast literature and gave birth to numerous extensions. The model generalizes the classical multidimensional autoregressive model, widely used in Time Series Analysis. It was successfully applied in numerous fields such as finance, industry, and biology. We propose here to compare the classical prediction methodology based on the estimation of the autocorrelation operator with a neural network learning approach. The latter is based on a popular version of Recurrent Neural Networks: the Long Short Term Memory networks. The comparison is carried out through simulations and real datasets.

preprint · 2020 · arXiv

An Algorithm for Distributed Bayesian Inference in Generalized Linear Models

Monte Carlo algorithms, such as Markov chain Monte Carlo (MCMC) and Hamiltonian Monte Carlo (HMC), are routinely used for Bayesian inference in generalized linear models; however, these algorithms are prohibitively slow in massive data settings because they require multiple passes through the full data in every iteration. Addressing this problem, we develop a scalable extension of these algorithms using the divide-and-conquer (D&C) technique that divides the data into a sufficiently large number of subsets, draws parameters in parallel on the subsets using a "powered" likelihood, and produces Monte Carlo draws of the parameter by combining parameter draws obtained from each subset. These combined parameter draws play the role of draws from the original sampling algorithm. Our main contributions are two-fold. First, we demonstrate through diverse simulated and real data analyses that our distributed algorithm is comparable to the current state-of-the-art D&C algorithm in terms of statistical accuracy and computational efficiency. Second, providing theoretical support for our empirical observations, we identify regularity assumptions under which the proposed algorithm leads to

preprint · 2020 · arXiv

agtboost: Adaptive and Automatic Gradient Tree Boosting Computations

agtboost is an R package implementing fast gradient tree boosting computations in a manner similar to other established frameworks such as xgboost and LightGBM, but with significant decreases in computation time and required mathematical and technical knowledge. The package automatically takes care of split/no-split decisions and selects the number of trees in the gradient tree boosting ensemble, i.e., agtboost adapts the complexity of the ensemble automatically to the information in the data. All of this is done during a single training run, which is made possible by utilizing developments in information theory for tree algorithms (arXiv:2008.05926v1 [stat.ME]). agtboost also comes with a feature importance function that eliminates the common practice of inserting noise features. Further, a useful model validation function performs the Kolmogorov-Smirnov test on the learned distribution.

preprint · 2020 · arXiv

Poisson-Tweedie mixed-effects model: a flexible approach for the analysis of longitudinal RNA-seq data

We present a new modelling approach for longitudinal count data that is motivated by the increasing availability of longitudinal RNA-sequencing experiments. The distribution of RNA-seq counts typically exhibits overdispersion, zero-inflation and heavy tails; moreover, in longitudinal designs repeated measurements from the same subject are typically (positively) correlated. We propose a generalized linear mixed model based on the Poisson-Tweedie distribution that can flexibly handle each of the aforementioned features of longitudinal overdispersed counts. We develop a computational approach to accurately evaluate the likelihood of the proposed model and to perform maximum likelihood estimation. Our approach is implemented in the R package ptmixed, which can be freely downloaded from CRAN. We assess the performance of ptmixed on simulated data and we present an application to a dataset with longitudinal RNA-sequencing measurements from healthy and dystrophic mice. The applicability of the Poisson-Tweedie mixed-effects model is not restricted to longitudinal RNA-seq data, but it extends to any scenario where non-independent measurements of a discrete overdispersed response variable ar

preprint · 2020 · arXiv

ScoreDrivenModels.jl: a Julia Package for Generalized Autoregressive Score Models

Score-driven models, also known as generalized autoregressive score models, represent a class of observation-driven time series models. They possess powerful properties, such as the ability to model different conditional distributions and to consider time-varying parameters within a flexible framework. In this paper, we present ScoreDrivenModels.jl, an open-source Julia package for modeling, forecasting, and simulating time series using the framework of score-driven models. The package is flexible with respect to model definition, allowing the user to specify the lag structure and which parameters are time-varying or constant. It is also possible to consider several distributions, including Beta, Exponential, Gamma, Lognormal, Normal, Poisson, Student's t, and Weibull. The provided interface is flexible, allowing interested users to implement any desired distribution and parametrization.

preprint · 2020 · arXiv

diproperm: An R Package for the DiProPerm Test

High-dimensional low sample size (HDLSS) data sets emerge frequently in many biomedical applications. A common task for analyzing HDLSS data is to assign data to the correct class using a classifier. Classifiers which use two labels and a linear combination of features are known as binary linear classifiers. The direction-projection-permutation (DiProPerm) test was developed for testing the difference of two high-dimensional distributions induced by a binary linear classifier. This paper discusses the key components of the DiProPerm test, introduces the diproperm R package, and demonstrates the package on a real-world data set.

preprint · 2020 · arXiv

Likelihood-based inference for modelling packet transit from thinned flow summaries

The substantial growth of network traffic speed and volume presents practical challenges to network data analysis. Packet thinning and flow aggregation protocols such as NetFlow reduce the size of datasets by providing structured data summaries, but conversely this impedes statistical inference. Methods which aim to model patterns of traffic propagation typically do not incorporate the packet thinning and summarisation process into the analysis, and are often simplistic, e.g. method-of-moments. As a result, they can be of limited practical use. We introduce a likelihood-based analysis which fully incorporates packet thinning and NetFlow summarisation into the analysis. As a result, inferences can be made for models on the level of individual packets while only observing thinned flow summary information. We establish consistency of the resulting maximum likelihood estimator, derive bounds on the volume of traffic which should be observed to achieve required levels of estimator accuracy, and identify an ideal family of models. The robust performance of the estimator is examined through simulated analyses and an application on a publicly available trace dataset containing over 36m pac

preprint · 2020 · arXiv

Fast Graphlet Transform of Sparse Graphs

We introduce the computational problem of graphlet transform of a sparse large graph. Graphlets are fundamental topology elements of all graphs/networks. They can be used as coding elements to encode graph-topological information at multiple granularity levels for classifying vertices on the same graph/network as well as for making differentiation or connection across different networks. Network/graph analysis using graphlets has growing applications. We recognize the universality and increased encoding capacity in using multiple graphlets, we address the arising computational complexity issues, and we present a fast method for exact graphlet transform. The fast graphlet transform establishes a few remarkable records at once in high computational efficiency, low memory consumption, and ready translation to high-performance program and implementation. It is intended to enable and advance network/graph analysis with graphlets, and to introduce the relatively new analysis apparatus to graph theory, high-performance graph computation, and broader applications.

preprint · 2020 · arXiv

hIPPYlib: An Extensible Software Framework for Large-Scale Inverse Problems Governed by PDEs; Part I: Deterministic Inversion and Linearized Bayesian Inference

We present an extensible software framework, hIPPYlib, for solution of large-scale deterministic and Bayesian inverse problems governed by partial differential equations (PDEs) with infinite-dimensional parameter fields (which are high-dimensional after discretization). hIPPYlib overcomes the prohibitive nature of Bayesian inversion for this class of problems by implementing state-of-the-art scalable algorithms for PDE-based inverse problems that exploit the structure of the underlying operators, notably the Hessian of the log-posterior. The key property of the algorithms implemented in hIPPYlib is that the solution of the deterministic and linearized Bayesian inverse problem is computed at a cost, measured in linearized forward PDE solves, that is independent of the parameter dimension. The mean of the posterior is approximated by the MAP point, which is found by minimizing the negative log-posterior. This deterministic nonlinear least-squares optimization problem is solved with an inexact matrix-free Newton-CG method. The posterior covariance is approximated by the inverse of the Hessian of the negative log posterior evaluated at the MAP point. This Gaussian approximation is exac

preprint · 2020 · arXiv

Environmental contours as Voronoi cells

Environmental contours are widely used as a basis for the design of structures exposed to environmental loads. The basic idea of the method is to decouple the environmental description from the structural response. This is done by establishing an envelope of environmental conditions, such that any structure tolerating loads on this envelope will have a failure probability smaller than a prescribed value. Specifically, given an $n$-dimensional random variable $\mathbf{X}$ and a target probability of failure $p_{e}$, an environmental contour is the boundary of a set $\mathcal{B} \subset \mathbb{R}^{n}$ with the following property: For any failure set $\mathcal{F} \subset \mathbb{R}^{n}$, if $\mathcal{F}$ does not intersect the interior of $\mathcal{B}$, then the probability of failure, $P(\mathbf{X} \in \mathcal{F})$, is bounded above by $p_{e}$. As is common for many real-world applications, we work under the assumption that failure sets are convex. In this paper, we show that such environmental contours may be regarded as boundaries of Voronoi cells. This geometric interpretation leads to new theoretical insights and suggests a simple novel construction algorithm that guarantees the desi

preprint · 2020 · arXiv

Adaptive Path Sampling in Metastable Posterior Distributions

The normalizing constant plays an important role in Bayesian computation, and there is a large literature on methods for computing or approximating normalizing constants that cannot be evaluated in closed form. When the normalizing constant varies by orders of magnitude, methods based on importance sampling can require many rounds of tuning. We present an improved approach using adaptive path sampling, iteratively reducing gaps between the base and target. Using this adaptive strategy, we develop two metastable sampling schemes. They are automated in Stan and require little tuning. For a multimodal posterior density, we equip simulated tempering with a continuous temperature. For a funnel-shaped entropic barrier, we adaptively increase mass in bottleneck regions to form an implicit divide-and-conquer. Both approaches empirically perform better than existing methods for sampling from metastable distributions, with higher accuracy and computational efficiency.

preprint · 2020 · arXiv

Near-Linear Time Local Polynomial Nonparametric Estimation with Box Kernels

Local polynomial regression (Fan and Gijbels 1996) is an important class of methods for nonparametric density estimation and regression problems. However, a straightforward implementation of local polynomial regression has quadratic time complexity, which hinders its applicability in large-scale data analysis. In this paper, we significantly accelerate the computation of local polynomial estimates by novel applications of multi-dimensional binary indexed trees (Fenwick 1994). Both the time and space complexity of our proposed algorithm are nearly linear in the number of input data points. Simulation results confirm the efficiency and effectiveness of our proposed approach.
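A heavily simplified Python sketch can show why a binary indexed tree helps with box kernels. It assumes one-dimensional inputs and a degree-zero (local constant) fit, so it is far narrower than the multi-dimensional local polynomial algorithm of the paper; the class and function names are hypothetical.

```python
import bisect


class FenwickTree:
    """Binary indexed tree supporting point updates and prefix sums in O(log n)."""

    def __init__(self, size):
        self.size = size
        self.tree = [0.0] * (size + 1)

    def add(self, index, value):
        """Add `value` to the item at 0-based position `index`."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += value
            i += i & -i

    def prefix_sum(self, count):
        """Return the sum of the first `count` items."""
        total = 0.0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total


def box_kernel_regression(xs, ys, queries, h):
    """Local constant estimate with a box kernel: the mean of y_i over |x_i - q| <= h.

    Assumption for illustration: 1-D inputs and degree 0 only. Returns None
    for a query whose window contains no data points.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    sorted_x = [xs[i] for i in order]
    sums = FenwickTree(len(xs))
    for rank, i in enumerate(order):
        sums.add(rank, ys[i])
    estimates = []
    for q in queries:
        lo = bisect.bisect_left(sorted_x, q - h)
        hi = bisect.bisect_right(sorted_x, q + h)
        if hi == lo:
            estimates.append(None)
        else:
            window_sum = sums.prefix_sum(hi) - sums.prefix_sum(lo)
            estimates.append(window_sum / (hi - lo))
    return estimates


if __name__ == "__main__":
    print(box_kernel_regression([3.0, 0.0, 4.0, 1.0, 2.0],
                                [7.0, 1.0, 9.0, 3.0, 5.0],
                                queries=[2.0], h=1.0))
```

Each query costs O(log n) after an O(n log n) setup, in contrast to the O(n) per query of a direct scan. In one static dimension a plain prefix-sum array would do the same job; the tree becomes useful when points are inserted incrementally, as in sweep-style extensions to more dimensions.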

preprint · 2020 · arXiv

A Survey of Singular Value Decomposition Methods for Distributed Tall/Skinny Data

The Singular Value Decomposition (SVD) is one of the most important matrix factorizations, enjoying a wide variety of applications across numerous domains. In statistics and data analysis, common applications of the SVD include Principal Components Analysis (PCA) and linear regression. Usually these applications arise on data that has far more rows than columns, so-called "tall/skinny" matrices. In the big data analytics context, this may take the form of hundreds of millions to billions of rows with only a few hundred columns. There is a need, therefore, for fast, accurate, and scalable tall/skinny SVD implementations which can fully utilize modern computing resources. To that end, we present a survey of three different algorithms for computing the SVD for these kinds of tall/skinny data layouts using MPI for communication. We contextualize these with common big data analytics techniques, principally PCA. Finally, we present both CPU and GPU timing results from the Summit supercomputer, and discuss possible alternative approaches.

preprint · 2020 · arXiv

Marginally-calibrated deep distributional regression

Deep neural network (DNN) regression models are widely used in applications requiring state-of-the-art predictive accuracy. However, until recently there has been little work on accurate uncertainty quantification for predictions from such models. We add to this literature by outlining an approach to constructing predictive distributions that are 'marginally calibrated'. This is where the long run average of the predictive distributions of the response variable matches the observed empirical margin. Our approach considers a DNN regression with a conditionally Gaussian prior for the final layer weights, from which an implicit copula process on the feature space is extracted. This copula process is combined with a non-parametrically estimated marginal distribution for the response. The end result is a scalable distributional DNN regression method with marginally calibrated predictions, and our work complements existing methods for probability calibration. The approach is first illustrated using two applications of dense layer feed-forward neural networks. However, our main motivating applications are in likelihood-free inference, where distributional deep regression is used to es

preprint · 2020 · arXiv

JointAI: Joint Analysis and Imputation of Incomplete Data in R

Missing data occur in many types of studies and typically complicate the analysis. Multiple imputation, either using joint modelling or the more flexible fully conditional specification approach, is popular and works well in standard settings. In settings involving non-linear associations or interactions, however, incompatibility of the imputation model with the analysis model is an issue often resulting in bias. Similarly, complex outcomes such as longitudinal or survival outcomes cannot be adequately handled by standard implementations. In this paper, we introduce the R package JointAI, which utilizes the Bayesian framework to perform simultaneous analysis and imputation in regression models with incomplete covariates. Using a fully Bayesian joint modelling approach it overcomes the issue of uncongeniality while retaining the attractive flexibility of fully conditional specification multiple imputation by specifying the joint distribution of analysis and imputation models as a sequence of univariate models that can be adapted to the type of variable. JointAI provides functions for Bayesian inference with generalized linear and generalized linear mixed models and extensions thereo

preprint · 2020 · arXiv

Improving ERGM Starting Values Using Simulated Annealing

Much of the theory of estimation for exponential family models, which include exponential-family random graph models (ERGMs) as a special case, is well-established and maximum likelihood estimates in particular enjoy many desirable properties. However, in the case of many ERGMs, direct calculation of MLEs is impossible and therefore methods for approximating MLEs and/or alternative estimation methods must be employed. Many MLE approximation methods require alternative estimates as starting points. We discuss one class of such alternatives here. The MLE satisfies the so-called "likelihood principle," unlike the MPLE. This means that different networks may have different MPLEs even if they have the same sufficient statistics. We exploit this fact here to search for improved starting values for approximation-based MLE methods. The method we propose has shown its merit in producing an MLE for a network dataset and model that had defied estimation using all other known methods.

People in this topic

12 visible researcher(s)