Researcher profile

Jan Hannig

Jan Hannig contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2022arXiv

A New String Edit Distance and Applications

String edit distances have been used for decades in applications ranging from spelling correction and web search suggestions to DNA analysis. Most string edit distances are variations of the Levenshtein distance and consider only single-character edits. In forensic applications polymorphic genetic markers such as short tandem repeats (STRs) are used. At these repetitive motifs the DNA copying errors consist of more than just single base differences. More often the phenomenon of ``stutter'' is observed, where the number of repeated units differs (by whole units) from the template. To adapt the Levenshtein distance to be suitable for forensic applications where DNA sequence similarity is of interest, a generalized string edit distance is defined that accommodates the addition or deletion of whole motifs in addition to single-nucleotide edits. A dynamic programming implementation is developed for computing this distance between sequences. The novelty of this algorithm is in handling the complex interactions that arise between multiple- and single-character edits. Forensic examples illustrate the purpose and use of the Restricted Forensic Levenshtein (RFL) distance measure, but applications extend to sequence alignment and string similarity in other biological areas, as well as dynamic programming algorithms more broadly.

preprint2022arXiv

Demystifying Inferential Models: A Fiducial Perspective

Inferential models have recently gained in popularity for valid uncertainty quantification. In this paper, we investigate inferential models by exploring relationships between inferential models, fiducial inference, and confidence curves. In short, we argue that from a certain point of view, inferential models can be viewed as fiducial distribution based confidence curves. We show that all probabilistic uncertainty quantification of inferential models is based on a collection of sets we name principle sets and principle assertions.

preprint2020arXiv

Deep Fiducial Inference

Since the mid-2000s, there has been a resurrection of interest in modern modifications of fiducial inference. To date, the main computational tool to extract a generalized fiducial distribution is Markov chain Monte Carlo (MCMC). We propose an alternative way of computing a generalized fiducial distribution that could be used in complex situations. In particular, to overcome the difficulty when the unnormalized fiducial density (needed for MCMC), we design a fiducial autoencoder (FAE). The fitted autoencoder is used to generate generalized fiducial samples of the unknown parameters. To increase accuracy, we then apply an approximate fiducial computation (AFC) algorithm, by rejecting samples that when plugged into a decoder do not replicate the observed data well enough. Our numerical experiments show the effectiveness of our FAE-based inverse solution and the excellent coverage performance of the AFC corrected FAE solution.

preprint2020arXiv

Joint and individual analysis of breast cancer histologic images and genomic covariates

A key challenge in modern data analysis is understanding connections between complex and differing modalities of data. For example, two of the main approaches to the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genetics. While histopathology is the gold standard for diagnostics and there have been many recent breakthroughs in genetics, there is little overlap between these two fields. We aim to bridge this gap by developing methods based on Angle-based Joint and Individual Variation Explained (AJIVE) to directly explore similarities and differences between these two modalities. Our approach exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction to address some of the challenges presented by statistical analysis of histopathology image data. CNNs raise issues of interpretability that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g. PCA or AJIVE) applied to CNN features. Our results provide many interpretable connections and contrasts between histopathology and genetics.

preprint2020arXiv

Subspace Clustering through Sub-Clusters

The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out of sample points. The key idea is that this random subset borrows information across the entire data set and that the problem of clustering points can be replaced with the more efficient and robust problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate we outperform other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.