Source author record

Jan Hannig

Jan Hannig appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology; Machine Learning; math.PR; math.ST; Quantitative Methods; Statistics Theory; Applications; Computation; Computational Engineering, Finance, and Science; eess.IV; Genomics; Information Theory; math.AP; math.CO; math.IT; math.NA; math.OC; Multiagent Systems; Networking and Internet Architecture

Catalog footprint

What is connected

14 works

19 topics

4 close collaborators


Preprint, 2022, arXiv

A New String Edit Distance and Applications

String edit distances have been used for decades in applications ranging from spelling correction and web search suggestions to DNA analysis. Most string edit distances are variations of the Levenshtein distance and consider only single-character edits. In forensic applications polymorphic genetic markers such as short tandem repeats (STRs) are used. At these repetitive motifs the DNA copying errors consist of more than just single base differences. More often the phenomenon of "stutter" is observed, where the number of repeated units differs (by whole units) from the template. To adapt the Levenshtein distance to be suitable for forensic applications where DNA sequence similarity is of interest, a generalized string edit distance is defined that accommodates the addition or deletion of whole motifs in addition to single-nucleotide edits. A dynamic programming implementation is developed for computing this distance between sequences. The novelty of this algorithm is in handling the complex interactions that arise between multiple- and single-character edits. Forensic examples illustrate the purpose and use of the Restricted Forensic Levenshtein (RFL) distance measure, but applications extend to sequence alignment and string similarity in other biological areas, as well as dynamic programming algorithms more broadly.
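For orientation, the single-character baseline that the abstract starts from can be sketched as follows. This is the classic Levenshtein dynamic program, not the RFL distance itself: it has no whole-motif edits, and the function name and example strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else 1)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# Two STR-like sequences that differ by one whole GATA repeat unit.
template = "GATA" * 5
stutter = "GATA" * 4
print(levenshtein(template, stutter))  # 4: one lost motif counts as four single-base deletions
```

Under this baseline the loss of one repeat unit is charged as four separate edits, which is the kind of mismatch with stutter behaviour that motivates treating a whole-motif addition or deletion as an edit in its own right.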

Preprint, 2022, arXiv

Demystifying Inferential Models: A Fiducial Perspective

Inferential models have recently gained in popularity for valid uncertainty quantification. In this paper, we investigate inferential models by exploring relationships between inferential models, fiducial inference, and confidence curves. In short, we argue that from a certain point of view, inferential models can be viewed as fiducial distribution based confidence curves. We show that all probabilistic uncertainty quantification of inferential models is based on a collection of sets we name principal sets and principal assertions.

Preprint, 2020, arXiv

Deep Fiducial Inference

Since the mid-2000s, there has been a resurrection of interest in modern modifications of fiducial inference. To date, the main computational tool to extract a generalized fiducial distribution is Markov chain Monte Carlo (MCMC). We propose an alternative way of computing a generalized fiducial distribution that could be used in complex situations. In particular, to overcome the difficulty when the unnormalized fiducial density (needed for MCMC) is intractable, we design a fiducial autoencoder (FAE). The fitted autoencoder is used to generate generalized fiducial samples of the unknown parameters. To increase accuracy, we then apply an approximate fiducial computation (AFC) algorithm, rejecting samples that, when plugged into a decoder, do not replicate the observed data well enough. Our numerical experiments show the effectiveness of our FAE-based inverse solution and the excellent coverage performance of the AFC-corrected FAE solution.

Preprint, 2020, arXiv

Joint and individual analysis of breast cancer histologic images and genomic covariates

A key challenge in modern data analysis is understanding connections between complex and differing modalities of data. For example, two of the main approaches to the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genetics. While histopathology is the gold standard for diagnostics and there have been many recent breakthroughs in genetics, there is little overlap between these two fields. We aim to bridge this gap by developing methods based on Angle-based Joint and Individual Variation Explained (AJIVE) to directly explore similarities and differences between these two modalities. Our approach exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction to address some of the challenges presented by statistical analysis of histopathology image data. CNNs raise issues of interpretability that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g. PCA or AJIVE) applied to CNN features. Our results provide many interpretable connections and contrasts between histopathology and genetics.

Preprint, 2020, arXiv

Subspace Clustering through Sub-Clusters

The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out of sample points. The key idea is that this random subset borrows information across the entire data set and that the problem of clustering points can be replaced with the more efficient and robust problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate we outperform other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.

Preprint, 2016, arXiv

A Note on Automatic Data Transformation

Modern data analysis frequently involves variables with highly non-Gaussian marginal distributions. However, commonly used analysis methods are most effective with roughly Gaussian data. This paper introduces an automatic transformation that improves the closeness of distributions to normality. For each variable, a new family of parametrizations of the shifted logarithm transformation is proposed, which is unique in treating the data as real-valued, and in allowing transformation for both left and right skewness within the single family. This also allows an automatic selection of the parameter value (which is crucial for high dimensional data with many variables to transform) by minimizing the Anderson-Darling test statistic of the transformed data. An application to image features extracted from melanoma microscopy slides demonstrates the utility of the proposed transformation in addressing data with excessive skewness, heteroscedasticity and influential observations.
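The selection rule described above (choose the shift that minimizes the Anderson-Darling statistic) can be illustrated with a small sketch. The parametrization below, a shift of the minimum plus `beta` times the range, is a hypothetical stand-in for the paper's family and handles right skew only; the grid of `beta` values and all function names are assumptions.

```python
import math
import random
import statistics


def anderson_darling_normal(xs):
    """Anderson-Darling statistic against a normal fit with estimated mean and sd."""
    n = len(xs)
    mu = statistics.fmean(xs)
    sd = statistics.stdev(xs)
    z = sorted((x - mu) / sd for x in xs)
    # Standard normal CDF, clipped so the logarithms below stay finite.
    cdf = [min(max(0.5 * (1.0 + math.erf(v / math.sqrt(2.0))), 1e-12), 1.0 - 1e-12)
           for v in z]
    s = sum((2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
            for i in range(1, n + 1))
    return -n - s / n


def shifted_log(xs, beta):
    """Hypothetical parametrization: log of the data shifted by beta times its range."""
    lo, hi = min(xs), max(xs)
    return [math.log(x - lo + beta * (hi - lo)) for x in xs]


def best_shift(xs, betas=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Grid search for the beta whose transformed data looks most normal."""
    return min(betas, key=lambda b: anderson_darling_normal(shifted_log(xs, b)))


# Right-skewed toy data (assumed for illustration, not from the paper).
rng = random.Random(0)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
beta = best_shift(skewed)
raw_stat = anderson_darling_normal(skewed)
new_stat = anderson_darling_normal(shifted_log(skewed, beta))
print(beta, raw_stat, new_stat)
```

Because the parameter is chosen by a scalar criterion, the same search can be run independently on each variable, which is what makes the approach practical when many variables need transforming.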

Preprint, 2016, arXiv

Higher order asymptotics of Generalized Fiducial Distribution

Generalized Fiducial Inference (GFI) is motivated by R.A. Fisher's approach of obtaining posterior-like distributions when there is no prior information available for the unknown parameter. Without the use of Bayes' theorem, GFI proposes a distribution on the parameter space using a technique called increasing precision asymptotics \cite{hannig2013generalized}. In this article we analyzed the regularity conditions under which the Generalized Fiducial Distribution (GFD) will be first and second order exact in a frequentist sense. We used a modification of an ingenious technique named the "shrinkage method" \cite{bickel1990decomposition}, which has been extensively used in the probability matching prior context, to find the higher order expansion of the frequentist coverage of fiducial quantiles. We identified when the higher order terms of the one-sided coverage of fiducial quantiles vanish and derived a workable recipe for obtaining such GFDs. These ideas are demonstrated on several examples.

Preprint, 2016, arXiv

Non-iterative Joint and Individual Variation Explained

Integrative analysis of disparate data blocks measured on a common set of experimental subjects is one major challenge in modern data analysis. This data structure naturally motivates the simultaneous exploration of the joint and individual variation within each data block, resulting in new insights. For instance, there is a strong desire to integrate the multiple genomic data sets in The Cancer Genome Atlas (TCGA) to characterize the common and also the unique aspects of cancer genetics and cell biology for each source. In this paper we introduce Non-iterative Joint and Individual Variation Explained (Non-iterative JIVE), capturing both joint and individual variation within each data block. This is a major improvement over earlier approaches to this challenge in terms of a new conceptual understanding, much better adaptation to data heterogeneity and a fast linear algebra computation. Important mathematical contributions are the use of score subspaces as the principal descriptors of variation structure and the use of perturbation theory as the guide for variation segmentation. This leads to a method which is robust against the heterogeneity among data blocks without a need for normalization. An application to TCGA data reveals different behaviors of each type of signal in characterizing tumor subtypes. An application to a mortality data set reveals interesting historical lessons.

Preprint, 2015, arXiv

Source detection algorithms for dynamic contaminants based on the analysis of a hydrodynamic limit

In this work we propose and numerically analyze an algorithm for detection of a contaminant source using a dynamic sensor network. The algorithm is motivated using a global probabilistic optimization problem and is based on the analysis of the hydrodynamic limit of a discrete time evolution equation on the lattice under a suitable scaling of time and space. Numerical results illustrating the effectiveness of the algorithm are presented.

Preprint, 2014, arXiv

Discussion of "On the Birnbaum Argument for the Strong Likelihood Principle"

In this discussion we demonstrate that fiducial distributions provide a natural example of an inference paradigm that does not obey the Strong Likelihood Principle while still satisfying the Weak Conditionality Principle. [arXiv:1302.7021]

Preprint, 2014, arXiv

The importance sampling technique for understanding rare events in Erdős-Rényi random graphs

In dense Erdős-Rényi random graphs, we are interested in the events where large numbers of a given subgraph occur. The mean behavior of subgraph counts is known, and only recently were the related large deviations results discovered. Consequently, it is natural to ask, can one develop efficient numerical schemes to estimate the probability of an Erdős-Rényi graph containing an excessively large number of a fixed given subgraph? Using the large deviation principle we study an importance sampling scheme as a method to numerically compute the small probabilities of large triangle counts occurring within Erdős-Rényi graphs. We show that the exponential tilt suggested directly by the large deviation principle does not always yield an optimal scheme. The exponential tilt used in the importance sampling scheme comes from a generalized class of exponential random graphs. Asymptotic optimality, a measure of the efficiency of the importance sampling scheme, is achieved by a special choice of the parameters in the exponential random graph that makes it indistinguishable from an Erdős-Rényi graph conditioned to have many triangles in the large network limit. We show how this choice can be made for the conditioned Erdős-Rényi graphs both in the replica symmetric phase as well as in parts of the replica breaking phase to yield asymptotically optimal numerical schemes to estimate this rare event probability.
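As a rough illustration of importance sampling for large triangle counts, the sketch below draws graphs from a denser Erdős-Rényi model and reweights them by the likelihood ratio. This is a simpler proposal than the exponential random graph tilt studied in the paper and makes no claim of asymptotic optimality; the graph size, edge probabilities, threshold and function names are all assumptions chosen for a quick run.

```python
import itertools
import math
import random


def sample_graph(n, q, rng):
    """Draw an Erdős-Rényi graph G(n, q); return its adjacency matrix and edge count."""
    adj = [[False] * n for _ in range(n)]
    edges = 0
    for i, j in itertools.combinations(range(n), 2):
        if rng.random() < q:
            adj[i][j] = adj[j][i] = True
            edges += 1
    return adj, edges


def count_triangles(adj):
    """Brute-force triangle count, adequate for the small n used here."""
    n = len(adj)
    return sum(1 for i, j, k in itertools.combinations(range(n), 3)
               if adj[i][j] and adj[j][k] and adj[i][k])


def tilted_estimate(n, p, q, threshold, samples, seed=0):
    """Estimate P(triangles >= threshold) under G(n, p) by sampling from G(n, q)."""
    rng = random.Random(seed)
    pairs = n * (n - 1) // 2
    total = 0.0
    for _ in range(samples):
        adj, edges = sample_graph(n, q, rng)
        if count_triangles(adj) >= threshold:
            # Likelihood ratio of G(n, p) to G(n, q) for a graph with this many edges.
            log_weight = (edges * math.log(p / q)
                          + (pairs - edges) * math.log((1 - p) / (1 - q)))
            total += math.exp(log_weight)
    return total / samples


# Illustrative values only: sparse target model, denser proposal.
estimate = tilted_estimate(n=8, p=0.2, q=0.5, threshold=10, samples=500)
print(estimate)
```

Setting `q` equal to `p` reduces this to plain Monte Carlo, where most samples miss the rare event entirely. How to choose the tilt so that the estimator stays efficient as the graph grows is the question the abstract addresses.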

Preprint, 2013, arXiv

Generalized Fiducial Inference for Ultrahigh Dimensional Regression

In recent years the ultrahigh dimensional linear regression problem has attracted enormous attention from the research community. Under the sparsity assumption most of the published work is devoted to the selection and estimation of the significant predictor variables. This paper studies a different but fundamentally important aspect of this problem: uncertainty quantification for parameter estimates and model choices. To be more specific, this paper proposes methods for deriving a probability density function on the set of all possible models, and also for constructing confidence intervals for the corresponding parameters. These proposed methods are developed using the generalized fiducial methodology, which is a variant of Fisher's controversial fiducial idea. Theoretical properties of the proposed methods are studied, and in particular it is shown that statistical inference based on the proposed methods will have an exact asymptotic frequentist property. In terms of empirical performance, the proposed methods are tested by simulation experiments and an application to a real data set. Lastly, this work can also be seen as an interesting and successful application of Fisher's fiducial idea to an important and contemporary problem. To the best of the authors' knowledge, this is the first time that the fiducial idea is being applied to a so-called "large p small n" problem.

Preprint, 2012, arXiv

Generalized fiducial inference for normal linear mixed models

While linear mixed modeling methods are foundational concepts introduced in any statistical education, adequate general methods for interval estimation involving models with more than a few variance components are lacking, especially in the unbalanced setting. Generalized fiducial inference provides a possible framework that accommodates this absence of methodology. Under the fabric of generalized fiducial inference along with sequential Monte Carlo methods, we present an approach for interval estimation for both balanced and unbalanced Gaussian linear mixed models. We compare the proposed method to classical and Bayesian results in the literature in a simulation study of two-fold nested models and two-factor crossed designs with an interaction term. The proposed method is found to be competitive or better when evaluated based on frequentist criteria of empirical coverage and average length of confidence intervals for small sample sizes. A MATLAB implementation of the proposed algorithm is available from the authors.

Preprint, 2011, arXiv

Continuum Limits of Markov Chains with Application to Network Modeling

In this paper we investigate the continuum limits of a class of Markov chains. The investigation of such limits is motivated by the desire to model very large networks. We show that under some conditions, a sequence of Markov chains converges in some sense to the solution of a partial differential equation. Based on such convergence we approximate Markov chains modeling networks with a large number of components by partial differential equations. While traditional Monte Carlo simulation for very large networks is practically infeasible, partial differential equations can be solved with reasonable computational overhead using well-established mathematical tools.
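A toy instance of such a continuum limit, assuming a simple symmetric random walk on a ring rather than the network models of the paper: with lattice spacing dx = 1/N and time step dx², the rescaled distribution of the walk approaches the solution of the heat equation u_t = (1/2) u_xx. The initial profile 1 + cos(2πx) is chosen because the exact PDE solution is then known in closed form; every name below is hypothetical.

```python
import math


def step(p):
    """One step of the symmetric walk on a ring: half the mass moves each way."""
    n = len(p)
    return [0.5 * p[(k - 1) % n] + 0.5 * p[(k + 1) % n] for k in range(n)]


def chain_density(n_sites, t):
    """Density of the walk at time t, using dx = 1/n_sites and dt = dx**2."""
    dx = 1.0 / n_sites
    # Probability mass per site for the initial density 1 + cos(2*pi*x).
    p = [(1.0 + math.cos(2.0 * math.pi * k * dx)) * dx for k in range(n_sites)]
    for _ in range(round(t * n_sites ** 2)):
        p = step(p)
    return [mass / dx for mass in p]


def pde_density(n_sites, t):
    """Exact solution of u_t = 0.5 * u_xx on the unit ring for the same initial data."""
    decay = math.exp(-2.0 * math.pi ** 2 * t)
    return [1.0 + decay * math.cos(2.0 * math.pi * k / n_sites) for k in range(n_sites)]


def max_gap(n_sites, t):
    """Largest pointwise difference between the chain and the PDE solution."""
    return max(abs(c - u) for c, u in zip(chain_density(n_sites, t), pde_density(n_sites, t)))


for n_sites in (25, 50, 100):
    print(n_sites, max_gap(n_sites, 0.05))
```

The gap shrinks as the lattice is refined, while the cost of simulating the chain grows with the square of the number of sites per unit of time. That trade-off is the practical reason for replacing a very large chain by its limiting equation.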
