Topic overview

5119 works, 10552 researchers, 0 institutions

Topic snapshot

What this area looks like now

5119 works
10552 authors
0 experts visible
0 communities

Papers in this area

24 featured works

preprint, 2020, arXiv

Nonparametric Trend Estimation in Functional Time Series with Application to Annual Mortality Rates

Here, we address the problem of trend estimation for functional time series. Existing contributions either deal with detecting a functional trend or assume a simple model. They consider neither the estimation of a general functional trend nor the analysis of functional time series with a functional trend component. Similarly to univariate time series, we propose an alternative methodology to analyze functional time series, taking into account a functional trend component. We propose to estimate the functional trend by using a tensor product surface that is easy to implement and interpret and that allows control of the smoothness properties of the estimator. Through a Monte Carlo study, we simulate different scenarios of functional processes to show that our estimator accurately identifies the functional trend component. We also show that the dependency structure of the estimated stationary time series component is not significantly affected by the approximation error of the functional trend component. We apply our methodology to annual mortality rates in France.

preprint, 2020, arXiv

Functional Data Analysis with Causation in Observational Studies: Covariate Balancing Functional Propensity Score for Functional Treatments

Functional data analysis, which handles data arising from curves, surfaces, volumes, manifolds and beyond in a variety of scientific fields, has been a rapidly developing area in modern statistics and data science in recent decades. The effect of a functional variable on an outcome is an essential theme in functional data analysis, but a majority of related studies are restricted to correlational effects rather than causal effects. This paper makes the first attempt to study the causal effect of a functional variable as a treatment in observational studies. Despite the lack of a probability density function for the functional treatment, the propensity score is properly defined in terms of a multivariate substitute. Two covariate balancing methods are proposed to estimate the propensity score, which minimize the correlation between the treatment and covariates. The appealing performance of the proposed method in both covariate balance and causal effect estimation is demonstrated by a simulation study. The proposed method is applied to study the causal effect of body shape on human visceral adipose tissue.

preprint, 2020, arXiv

Numerical computation of triangular complex spherical designs with small mesh ratio

This paper provides triangular spherical designs for the complex unit sphere $Ω^d$ by exploiting the natural correspondence between the complex unit sphere in $d$ dimensions and the real unit sphere in $2d-1$. The existence of triangular and square complex spherical $t$-designs with the optimal order number of points is established. A variational characterization of triangular complex designs is provided, with particular emphasis on numerical computation of efficient triangular complex designs with good geometric properties as measured by their mesh ratio. We give numerical examples of triangular spherical $t$-designs on complex unit spheres of dimension $d=2$ to $6$.

preprint, 2020, arXiv

On the Identifiability of Latent Class Models for Multiple-Systems Estimation

Latent class models have recently become popular for multiple-systems estimation in human rights applications. However, it is currently unknown when a given family of latent class models is identifiable in this context. We provide necessary and sufficient conditions on the number of latent classes needed for a family of latent class models to be identifiable. Along the way we provide a mechanism for verifying identifiability in a class of multiple-systems estimation models that allow for individual heterogeneity.

preprint, 2020, arXiv

Calibration Scoring Rules for Practical Prediction Training

In situations where forecasters are scored on the quality of their probabilistic predictions, it is standard to use 'proper' scoring rules to perform such scoring. These rules are desirable because they give forecasters no incentive to lie about their probabilistic beliefs. However, in the real-world context of creating a training program designed to help people improve calibration through prediction practice, there are a variety of desirable traits for scoring rules that go beyond properness. These may potentially have a substantial impact on the user experience, usability of the program, or efficiency of learning. The space of proper scoring rules is too broad, in the sense that most proper scoring rules lack these other desirable properties. On the other hand, the space of proper scoring rules is potentially also too narrow, in the sense that we may sometimes choose to give up properness when it conflicts with other properties that are even more desirable from the point of view of usability and effective training. We introduce a class of scoring rules that we call 'Practical' scoring rules, designed to be intuitive to users in the context of 'right' vs. 'wrong' p
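
As a small illustration of what properness means, the sketch below uses the Brier (quadratic) score, a standard proper scoring rule. It is not one of the 'Practical' rules the abstract introduces. The function names and the grid search are illustrative assumptions; the point is only that the expected loss is smallest when the reported probability equals the forecaster's actual belief.

```python
def expected_brier_loss(report: float, belief: float) -> float:
    """Expected Brier loss of reporting `report` for a binary event
    that the forecaster believes occurs with probability `belief`."""
    # With probability `belief` the event occurs (loss (1 - report)^2),
    # otherwise it does not occur (loss report^2).
    return belief * (1.0 - report) ** 2 + (1.0 - belief) * report ** 2


def best_report(belief: float, grid_size: int = 101) -> float:
    """Report on an evenly spaced grid in [0, 1] that minimizes expected loss."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda q: expected_brier_loss(q, belief))


if __name__ == "__main__":
    for belief in (0.1, 0.5, 0.7):
        print(belief, best_report(belief))
```

Under these assumptions the minimizing report coincides with the belief, which is the "no incentive to lie" property the abstract refers to.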

preprint, 2020, arXiv

Optimizing tail risks using an importance sampling based extrapolation for heavy-tailed objectives

Motivated by the prominence of Conditional Value-at-Risk (CVaR) as a measure for tail risk in settings affected by uncertainty, we develop a new formula for approximating CVaR based optimization objectives and their gradients from limited samples. A key difficulty that limits the widespread practical use of these optimization formulations is the large amount of data required by the state-of-the-art sample average approximation schemes to approximate the CVaR objective with high fidelity. Unlike the state-of-the-art sample average approximations which require impractically large amounts of data in tail probability regions, the proposed approximation scheme exploits the self-similarity of heavy-tailed distributions to extrapolate data from suitable lower quantiles. The resulting approximations are shown to be statistically consistent and are amenable to optimization by means of conventional gradient descent. The approximation is guided by means of a systematic importance-sampling scheme whose asymptotic variance reduction properties are rigorously examined. Numerical experiments demonstrate the superiority of the proposed approximations and the ease of implementation points to the v
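
For orientation, here is a minimal sketch of the plain sample average approximation of CVaR that the abstract uses as its point of comparison. It is not the proposed extrapolation scheme. It assumes the Rockafellar-Uryasev representation CVaR_beta = min_c { c + E[(L - c)^+] / (1 - beta) }; the function name `cvar_saa` is hypothetical.

```python
def cvar_saa(losses, beta):
    """Sample average approximation of CVaR at level `beta` (0 < beta < 1).

    Minimizes c + mean((L - c)^+) / (1 - beta) over c. The objective is convex
    and piecewise linear with kinks at the sample points, so it is enough to
    evaluate it at each observed loss.
    """
    n = len(losses)
    tail_weight = (1.0 - beta) * n

    def objective(c):
        return c + sum(max(x - c, 0.0) for x in losses) / tail_weight

    return min(objective(c) for c in losses)


if __name__ == "__main__":
    sample = list(range(1, 101))   # losses 1, 2, ..., 100
    print(cvar_saa(sample, 0.9))   # close to the mean of the 10 largest losses
```

Only about a fraction 1 - beta of the sample contributes to the tail average, which is why this baseline needs large samples when beta is close to 1.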

preprint, 2020, arXiv

Asynchronous Online Testing of Multiple Hypotheses

We consider the problem of asynchronous online testing, aimed at providing control of the false discovery rate (FDR) during a continual stream of data collection and testing, where each test may be a sequential test that can start and stop at arbitrary times. This setting increasingly characterizes real-world applications in science and industry, where teams of researchers across large organizations may conduct tests of hypotheses in a decentralized manner. The overlap in time and space also tends to induce dependencies among test statistics, a challenge for classical methodology, which either assumes (overly optimistically) independence or (overly pessimistically) arbitrary dependence between test statistics. We present a general framework that addresses both of these issues via a unified computational abstraction that we refer to as "conflict sets." We show how this framework yields algorithms with formal FDR guarantees under a more intermediate, local notion of dependence. We illustrate our algorithms in simulations by comparing to existing algorithms for online FDR control.

preprint, 2020, arXiv

Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data

We investigate whether generating synthetic data can be a viable strategy for providing access to detailed geocoding information for external researchers, without compromising the confidentiality of the units included in the database. Our work was motivated by a recent project at the Institute for Employment Research (IAB) in Germany that linked exact geocodes to the Integrated Employment Biographies (IEB), a large administrative database containing several million records. We evaluate the performance of three synthesizers regarding the trade-off between preserving analytical validity and limiting disclosure risks: One synthesizer employs Dirichlet Process mixtures of products of multinomials (DPMPM), while the other two use different versions of Classification and Regression Trees (CART). In terms of preserving analytical validity, our proposed synthesis strategy for geocodes based on categorical CART models outperforms the other two. If the risks of the synthetic data generated by the categorical CART synthesizer are deemed too high, we demonstrate that synthesizing additional variables is the preferred strategy to address the risk-utility trade-off in practice, compared to limit

preprint, 2020, arXiv

Focused Bayesian Prediction

We propose a new method for conducting Bayesian prediction that delivers accurate predictions without correctly specifying the unknown true data generating process. A prior is defined over a class of plausible predictive models. After observing data, we update the prior to a posterior over these models, via a criterion that captures a user-specified measure of predictive accuracy. Under regularity, this update yields posterior concentration onto the element of the predictive class that maximizes the expectation of the accuracy measure. In a series of simulation experiments and empirical examples we find notable gains in predictive accuracy relative to conventional likelihood-based prediction.

preprint, 2020, arXiv

On the Monotonicity of a Nondifferentially Mismeasured Binary Confounder

Suppose that we are interested in the average causal effect of a binary treatment on an outcome when this relationship is confounded by a binary confounder. Suppose that the confounder is unobserved but a nondifferential proxy of it is observed. We show that, under a certain monotonicity assumption that is empirically verifiable, adjusting for the proxy produces a measure of the effect that is between the unadjusted and the true measures.

preprint, 2020, arXiv

Comparison of non-parametric global envelopes

This paper presents a simulation study to compare different non-parametric global envelopes that are refinements of the rank envelope proposed by Myllymäki et al. (2017, Global envelope tests for spatial processes, J. R. Statist. Soc. B 79, 381-404, doi: 10.1111/rssb.12172). The global envelopes are constructed for a set of functions or vectors. For a large number of vectors, all the refinements lead to the same outcome as the global rank envelope. For smaller numbers of vectors the refinement plays a role, and different refinements are sensitive to different types of extremeness of a vector among the set of vectors. The performance of the different alternatives is compared in a simulation study with respect to the number of available vectors, the dimensionality of the vectors, the amount of dependence between the vector elements and the expected type of extremeness.

preprint, 2020, arXiv

Functional Registration and Local Variations: Identifiability, Rank, and Tuning

We develop theory and methodology for the problem of nonparametric registration of functional data that have been subjected to random deformation (warping) of their time scale. The separation of this phase variation ("horizontal" variation) from the amplitude variation ("vertical" variation) is crucial in order to properly conduct further analyses, which otherwise can be severely distorted. We determine precise nonparametric conditions under which the two forms of variation are identifiable. These show that the identifiability delicately depends on the underlying rank. By means of several counterexamples, we demonstrate that our conditions are sharp if one wishes a genuinely nonparametric setup; and in doing so we caution that popular remedies such as structural assumptions or roughness penalties can easily fail. We then propose a nonparametric registration method based on a "local variation measure", the main element in elucidating identifiability. A key advantage of the method is that it is free of any tuning or penalisation parameters regulating the amount of alignment, thus circumventing the problem of over/under-registration often encountered in practic

preprint, 2020, arXiv

Goodness-of-fit tests for functional linear models based on integrated projections

Functional linear models are one of the most fundamental tools to assess the relation between two random variables of a functional or scalar nature. This contribution proposes a goodness-of-fit test for the functional linear model with functional response that neatly adapts to functional/scalar responses/predictors. In particular, the new goodness-of-fit test extends a previous proposal for scalar response. The test statistic is based on a convenient regularized estimator, is easy to compute, and is calibrated through an efficient bootstrap resampling. A graphical diagnostic tool, useful to visualize the deviations from the model, is introduced and illustrated with a novel data application. The R package goffda implements the proposed methods and allows for the reproducibility of the data application.

preprint, 2020, arXiv

Identifying stochastic governing equations from data of the most probable transition trajectories

Extracting governing stochastic differential equation models from elusive data is crucial to understand and forecast dynamics for complex systems. We devise a method to extract the drift term and estimate the diffusion coefficient of a governing stochastic dynamical system, from its time-series data of the most probable transition trajectory. By the Onsager-Machlup theory, the most probable transition trajectory satisfies the corresponding Euler-Lagrange equation, which is a second order deterministic ordinary differential equation involving the drift term and diffusion coefficient. We first estimate the coefficients of the Euler-Lagrange equation based on the data of the most probable trajectory, and then we calculate the drift and diffusion coefficients of the governing stochastic dynamical system. These two steps involve sparse regression and optimization. Finally, we illustrate our method with an example and some discussions.
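
To make the Euler-Lagrange step concrete, the following worked derivation covers the simplest case only: a scalar system with constant diffusion coefficient. That restriction, and the symbols $f$, $\sigma$ and $x(t)$, are assumptions made here for illustration; the abstract does not state which class of systems the paper treats.

```latex
% Assumed scalar SDE with drift f and constant diffusion coefficient \sigma:
%   dX_t = f(X_t)\,dt + \sigma\,dB_t .
% Onsager--Machlup Lagrangian (up to a constant factor):
\[
  L(x,\dot{x}) \;=\; \frac{\bigl(\dot{x}-f(x)\bigr)^{2}}{2\sigma^{2}} \;+\; \frac{1}{2}\,f'(x).
\]
% Euler--Lagrange equation  d/dt (\partial L/\partial\dot{x}) = \partial L/\partial x :
\[
  \frac{\ddot{x}-f'(x)\,\dot{x}}{\sigma^{2}}
  \;=\; -\frac{\bigl(\dot{x}-f(x)\bigr)\,f'(x)}{\sigma^{2}} \;+\; \frac{1}{2}\,f''(x),
\]
% The terms in f'(x)\dot{x} cancel, leaving a second-order deterministic ODE:
\[
  \ddot{x} \;=\; f(x)\,f'(x) \;+\; \frac{\sigma^{2}}{2}\,f''(x).
\]
```

In this special case the right-hand side depends on both $f$ and $\sigma$, so estimating its coefficients from an observed most probable trajectory constrains the drift and the diffusion coefficient together.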

preprint, 2020, arXiv

Low-complexity Architecture for AR(1) Inference

In this Letter, we propose a low-complexity estimator for the correlation coefficient based on the signed $\operatorname{AR}(1)$ process. The introduced approximation is suitable for implementation in low-power hardware architectures. Monte Carlo simulations reveal that the proposed estimator performs comparably to the competing methods in the literature, with maximum error on the order of $10^{-2}$. However, the hardware implementation of the introduced method presents considerable advantages in several relevant metrics, offering more than 95% reduction in dynamic power and doubling the maximum operating frequency when compared to the reference method.
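
The idea of estimating a correlation coefficient from signs alone can be sketched with the classical arcsine law for zero-mean Gaussian pairs, E[sgn(X) sgn(Y)] = (2/pi) arcsin(rho). This is an assumption-laden illustration, not the low-complexity approximation proposed in the Letter; the function names, the Gaussian innovations and the simulation settings are all hypothetical.

```python
import math
import random


def simulate_ar1(rho, n, seed=0):
    """Simulate a stationary, zero-mean, unit-variance Gaussian AR(1) series."""
    rng = random.Random(seed)
    innovation_sd = math.sqrt(1.0 - rho * rho)
    x = rng.gauss(0.0, 1.0)  # draw the first value from the stationary law
    series = []
    for _ in range(n):
        series.append(x)
        x = rho * x + innovation_sd * rng.gauss(0.0, 1.0)
    return series


def sign_based_rho(series):
    """Estimate the lag-1 correlation using only the signs of the series."""
    signs = [1 if x >= 0 else -1 for x in series]
    # Average product of consecutive signs estimates (2/pi) * arcsin(rho).
    sign_corr = sum(a * b for a, b in zip(signs, signs[1:])) / (len(signs) - 1)
    # Invert the arcsine law.
    return math.sin(0.5 * math.pi * sign_corr)


if __name__ == "__main__":
    data = simulate_ar1(rho=0.6, n=100_000, seed=1)
    print(sign_based_rho(data))
```

The sign-product average needs only one-bit comparisons and a counter; the final sine evaluation is the costly part, which a hardware-oriented approximation would presumably aim to simplify.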

preprint, 2020, arXiv

Inference for multiple heterogeneous networks with a common invariant subspace

The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge (COSIE) multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The COSIE model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices - the multiple adjacency spectral embedding (MASE) - leads, in a COSIE model, to simultaneous consistent estimation of underlying pa

preprint, 2020, arXiv

A Pilot Design for Observational Studies: Using Abundant Data Thoughtfully

Observational studies often benefit from an abundance of observational units. This can lead to studies that -- while challenged by issues of internal validity -- have inferences derived from sample sizes substantially larger than randomized controlled trials. But is the information provided by an observational unit best used in the analysis phase? We propose the use of 'pilot design,' in which observations are expended in the design phase of the study, and the post-treatment information from these observations is used to improve study design. In modern observational studies, which are data rich but control poor, pilot designs can be used to gain information about the structure of post-treatment variation. This information can then be used to improve instrumental variable designs, propensity score matching, doubly-robust estimation, and other observational study designs. We illustrate one version of a pilot design, which aims to reduce within-set heterogeneity and improve performance in sensitivity analyses. This version of a pilot design expends observational units during the design phase to fit a prognostic model, avoiding concerns of overfitting. Additionally, it enables the

preprint, 2020, arXiv

Estimation of the number of irregular foreigners in Poland using non-linear count regression models

Population size estimation requires access to unit-level data in order to correctly apply capture-recapture methods. Unfortunately, for reasons of confidentiality, access to such data may be limited. To overcome this issue we apply and extend the hierarchical Poisson-Gamma model proposed by Zhang (2008), which initially was used to estimate the number of irregular foreigners in Norway. The model is an alternative to the current capture-recapture approach as it does not require linking multiple sources and is solely based on aggregated administrative data that include (1) the number of apprehended irregular foreigners, (2) the number of foreigners who faced criminal charges and (3) the number of foreigners registered in the central population register. The model explicitly assumes a relationship between the unauthorized and registered population, which is motivated by the interconnection between these two groups. This makes the estimation conditionally dependent on the size of the regular population, provides an interpretation by analogy with the registered population and makes the estimated parameter more stable over time. In this paper, we modify the original idea to allow for covariates and

preprint, 2020, arXiv

Unified Rules of Renewable Weighted Sums for Various Online Updating Estimations

This paper establishes unified frameworks of renewable weighted sums (RWS) for various online updating estimations in models with streaming data sets. The newly defined RWS lays the foundation of online updating likelihood, online updating loss function, online updating estimating equation and so on. The idea of RWS is intuitive and heuristic, and the algorithm is computationally simple. This paper chooses a nonparametric model as an exemplary setting. The RWS applies to various types of nonparametric estimators, which include but are not limited to nonparametric likelihood, quasi-likelihood and least squares. Furthermore, the method and the theory can be extended to models with both a parameter and a nonparametric function. The estimation consistency and asymptotic normality of the proposed renewable estimator are established, and the oracle property is obtained. Moreover, these properties are always satisfied, without any constraint on the number of data batches, which means that the new method is adaptive to the situation where streaming data sets arrive perpetually. The behavior of the method is further illustrated by various numerical examples from simulation experiments a

preprint, 2020, arXiv

Testing for equality between conditional copulas given discretized conditioning events

Several procedures have been recently proposed to test the simplifying assumption for conditional copulas. Instead of considering pointwise conditioning events, we study the constancy of the conditional dependence structure when some covariates belong to general Borel conditioning subsets. Several test statistics based on the equality of conditional Kendall's tau are introduced, and we derive their asymptotic distributions under the null. When such conditioning events are not fixed ex ante, we propose a data-driven procedure to recursively build such relevant subsets. It is based on decision trees that maximize the differences between the conditional Kendall's taus corresponding to the leaves of the trees. The performance of such tests is illustrated in a simulation experiment. Moreover, a study of the conditional dependence between financial stock returns is carried out, given some clustering of their past values. The last application deals with the conditional dependence between coverage amounts in an insurance dataset.

preprint, 2020, arXiv

VAR estimators using binary measurements

In this paper, two novel algorithms to estimate a Gaussian Vector Autoregressive (VAR) model from 1-bit measurements are introduced. They are based on the Yule-Walker scheme modified to account for quantisation. The scalar case has been studied before. The main difficulty when going from the scalar to the vector case is how to estimate the ratios of the variances of pairwise components of the VAR model. The first method overcomes this difficulty by requiring the quantisation to be non-symmetric: each component of the VAR model output is replaced by a binary "zero" or a binary "one" depending on whether its value is greater than a strictly positive threshold. Different components of the VAR model can have different thresholds. As the choice of these thresholds has a strong influence on the performance, this first method is best suited for applications where the variance of each time series is approximately known prior to choosing the corresponding threshold. The second method relies instead on symmetric quantisations of not only each component of the VAR model but also on the pairwise differences of the components. These additional measurements are equivalent to a ra

preprint, 2020, arXiv

The polar-generalized normal distribution

This paper introduces an extension to the normal distribution through the polar method to capture bimodality and asymmetry, which are often observed characteristics of empirical data. The latter two features are entirely controlled by a separate scalar parameter. Explicit expressions for the cumulative distribution function, the density function and the moments are derived. The stochastic representation of the distribution facilitates implementing Bayesian estimation via Markov chain Monte Carlo methods. Some real-life data as well as simulated data are analyzed to illustrate the flexibility of the distribution for modeling asymmetric bimodality.

preprint, 2020, arXiv

Efficient Detection Of Infected Individuals using Two Stage Testing

Group testing is an efficient method for testing a large population to detect infected individuals. In this paper, we consider an efficient adaptive two stage group testing scheme. Using a straightforward analysis, we characterize the efficiency of several two stage group testing algorithms. We determine how to pick the parameters of the tests optimally for three schemes with different types of randomization, and show that the performance of two stage testing depends on the type of randomization employed. Seemingly similar randomization procedures lead to different expected numbers of tests needed to detect all infected individuals; we determine what kinds of randomization are necessary to achieve optimal performance. We further show that in the optimal setting, our testing scheme is robust to errors in the input parameters.
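
A rough feel for why pool size matters in two stage testing comes from the classical Dorfman scheme: pool k samples, test the pool once, and retest each member only if the pool is positive. The sketch below is that textbook scheme, not any of the three randomized schemes analyzed in the paper. It assumes independent infections with prevalence `p` and error-free tests; the function names are hypothetical.

```python
def expected_tests_per_person(p, k):
    """Expected number of tests per individual under Dorfman pooling.

    One pooled test is shared by k people; with probability 1 - (1 - p)^k
    the pool is positive and every member is retested individually.
    """
    if k == 1:
        return 1.0  # no pooling: one test each
    return 1.0 / k + 1.0 - (1.0 - p) ** k


def best_pool_size(p, max_k=100):
    """Pool size in 1..max_k that minimizes the expected tests per person."""
    return min(range(1, max_k + 1), key=lambda k: expected_tests_per_person(p, k))


if __name__ == "__main__":
    for prevalence in (0.001, 0.01, 0.05):
        k = best_pool_size(prevalence)
        print(prevalence, k, round(expected_tests_per_person(prevalence, k), 4))
```

At 1% prevalence this simple scheme already needs roughly a fifth of the tests that individual testing would.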

preprint, 2020, arXiv

Block-wise Minimization-Majorization algorithm for Huber's criterion: sparse learning and applications

Huber's criterion can be used for robust joint estimation of regression and scale parameters in the linear model. Huber's (Huber, 1981) motivation for introducing the criterion stemmed from non-convexity of the joint maximum likelihood objective function as well as non-robustness (unbounded influence function) of the associated ML-estimate of scale. In this paper, we illustrate how the original algorithm proposed by Huber can be set within the block-wise minimization majorization framework. In addition, we propose novel data-adaptive step sizes for both the location and scale, which further improve the convergence. We then illustrate how Huber's criterion can be used for sparse learning of underdetermined linear models using the iterative hard thresholding approach. We illustrate the usefulness of the algorithms in an image denoising application and simulation studies.
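
As a hedged sketch of the kind of objective involved, the code below evaluates Huber's loss and a joint location/scale criterion of the form n * alpha * sigma + sigma * sum(rho_c(r_i / sigma)). The normalization of rho_c and the choice of the constant `alpha` vary between references, so both are assumptions here; the paper's exact definitions, its minimization-majorization updates and its step sizes are not reproduced.

```python
def huber_rho(x, c=1.345):
    """Huber's loss: quadratic for |x| <= c, linear beyond (assumed 1/2 scaling)."""
    ax = abs(x)
    if ax <= c:
        return 0.5 * x * x
    return c * ax - 0.5 * c * c


def huber_criterion(residuals, sigma, alpha, c=1.345):
    """Joint criterion in the residuals and the scale `sigma` > 0.

    `alpha` is a fixed constant (typically chosen for consistency of the
    scale estimate at the Gaussian model); it is treated as given here.
    """
    n = len(residuals)
    return n * alpha * sigma + sigma * sum(huber_rho(r / sigma, c) for r in residuals)


if __name__ == "__main__":
    res = [0.2, -0.1, 0.05, 8.0]   # one gross outlier
    for s in (0.5, 1.0, 2.0):
        print(s, round(huber_criterion(res, s, alpha=0.5), 4))
```

Because the loss grows only linearly beyond `c`, the outlier contributes far less than it would under a squared-error objective.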

People in this topic

12 visible researchers