Source author record

Wojciech Szpankowski

Wojciech Szpankowski appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Information Theory, math.IT, Data Structures and Algorithms, Machine Learning, math.PR, Social and Information Networks, Applications, Artificial Intelligence

Catalog footprint

What is connected

17 works

8 topics

4 close collaborators


preprint · 2021 · arXiv

Low Complexity Approximate Bayesian Logistic Regression for Sparse Online Learning

Theoretical results show that Bayesian methods can achieve lower bounds on regret for online logistic regression. In practice, however, such techniques may not be feasible, especially for very large feature sets. Various approximations that, for huge sparse feature sets, diminish the theoretical advantages, must be used. Often, they apply stochastic gradient methods with hyper-parameters that must be tuned on some surrogate loss, defeating theoretical advantages of Bayesian methods. The surrogate loss, defined to approximate the mixture, requires techniques such as Monte Carlo sampling, increasing computations per example. We propose low complexity analytical approximations for sparse online logistic and probit regressions. Unlike variational inference and other methods, our methods use analytical closed forms, substantially lowering computations. Unlike dense solutions, such as Gaussian mixtures, our methods allow for sparse problems with huge feature sets without increasing complexity. With the analytical closed forms, there is also no need for applying stochastic gradient methods on surrogate losses, and for tuning and balancing learning and regularization hyper-parameters. Empirical results top the performance of the more computationally involved methods. Like such methods, our methods still reveal per feature and per example uncertainty measures.

preprint · 2021 · arXiv

On Agnostic PAC Learning using $\mathcal{L}_2$-polynomial Regression and Fourier-based Algorithms

We develop a framework using Hilbert spaces as a proxy to analyze PAC learning problems with structural properties. We consider a joint Hilbert space incorporating the relation between the true label and the predictor under a joint distribution $D$. We demonstrate that agnostic PAC learning with 0-1 loss is equivalent to an optimization in the Hilbert space domain. With our model, we revisit the PAC learning problem using methods based on least-squares such as $\mathcal{L}_2$ polynomial regression and Linial's low-degree algorithm. We study learning with respect to several hypothesis classes such as half-spaces and polynomial-approximated classes (i.e., functions approximated by a fixed-degree polynomial). We prove that (under some distributional assumptions) such methods obtain generalization error up to $2opt$ with $opt$ being the optimal error of the class. Hence, we show the tightest bound on generalization error when $opt\leq 0.2$.

preprint · 2021 · arXiv

On maximum-likelihood estimation in the all-or-nothing regime

We study the problem of estimating a rank-1 additive deformation of a Gaussian tensor according to the \emph{maximum-likelihood estimator} (MLE). The analysis is carried out in the sparse setting, where the underlying signal has a support that scales sublinearly with the total number of dimensions. We show that for Bernoulli distributed signals, the MLE undergoes an \emph{all-or-nothing} (AoN) phase transition, already established for the minimum mean-square-error estimator (MMSE) in the same problem. The result follows from two main technical points: (i) the connection established between the MLE and the MMSE, using the first and second-moment methods in the constrained signal space, (ii) a recovery regime for the MMSE stricter than the simple error vanishing characterization given in the standard AoN, that is here proved as a general result.

preprint · 2021 · arXiv

Sequential Universal Modeling for Non-Binary Sequences with Constrained Distributions

Sequential probability assignment and universal compression go hand in hand. We propose sequential probability assignment for non-binary (and large alphabet) sequences with empirical distributions whose parameters are known to be bounded within a limited interval. Sequential probability assignment algorithms are essential in many applications that require fast and accurate estimation of the maximizing sequence probability. These applications include learning, regression, channel estimation and decoding, prediction, and universal compression. On the other hand, constrained distributions introduce interesting theoretical twists that must be overcome in order to present efficient sequential algorithms. Here, we focus on universal compression for memoryless sources, and present precise analysis for the maximal minimax and the average minimax for constrained distributions. We show that our sequential algorithm based on a modified Krichevsky-Trofimov (KT) estimator is asymptotically optimal up to $O(1)$ for both maximal and average redundancies. This paper follows and addresses the challenge presented in \cite{stw08} that suggested "results for the binary case lay the foundation to studying larger alphabets".
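For orientation, the sketch below shows the standard (unmodified) Krichevsky-Trofimov estimator for a finite alphabet, which assigns the next symbol $a$ the probability $(n_a + 1/2)/(t + k/2)$ after $t$ symbols with $n_a$ occurrences of $a$ and alphabet size $k$. It is only a baseline illustration: the modified estimator for constrained distributions studied in the paper is not reproduced here, and the function names `kt_sequence_probability` and `kt_code_length_bits` are hypothetical.

```python
import math
from collections import Counter
from fractions import Fraction


def kt_sequence_probability(sequence, alphabet):
    """Probability the standard KT estimator assigns to `sequence`.

    Assumption: this is the textbook add-1/2 rule, not the paper's
    modified estimator for constrained distributions.
    """
    k = len(alphabet)
    counts = Counter()
    prob = Fraction(1)
    for t, symbol in enumerate(sequence):
        if symbol not in alphabet:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        # (n_a + 1/2) / (t + k/2), written with integers to stay exact.
        prob *= Fraction(2 * counts[symbol] + 1, 2 * t + k)
        counts[symbol] += 1
    return prob


def kt_code_length_bits(sequence, alphabet):
    """Ideal code length, in bits, of `sequence` under the KT assignment."""
    return -math.log2(kt_sequence_probability(sequence, alphabet))
```

Because the assignment is sequential, the probabilities of all sequences of a given length sum to one, so the ideal code length is `-log2` of the assigned probability.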

preprint · 2021 · arXiv

Statistical and computational thresholds for the planted $k$-densest sub-hypergraph problem

In this work, we consider the problem of recovering a planted $k$-densest sub-hypergraph on $d$-uniform hypergraphs. This fundamental problem appears in different contexts, e.g., community detection, average-case complexity, and neuroscience applications as a structural variant of tensor-PCA problem. We provide tight \emph{information-theoretic} upper and lower bounds for the exact recovery threshold by the maximum-likelihood estimator, as well as \emph{algorithmic} bounds based on approximate message passing algorithms. The problem exhibits a typical statistical-to-computational gap observed in analogous sparse settings that widens with increasing sparsity of the problem. The bounds show that the signal structure impacts the location of the statistical and computational phase transition that the existing bounds for the tensor-PCA model do not capture. This effect is due to the generic planted signal prior that this latter model addresses.

preprint · 2020 · arXiv

Hidden Words Statistics for Large Patterns

We study here the so-called subsequence pattern matching, also known as hidden pattern matching, in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the number of occurrences of $w$ as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications from intrusion detection, to trace reconstruction, to deletion channel, and to DNA-based storage systems. In all of these applications, the pattern $w$ is of variable length. To the best of our knowledge this problem was only tackled for a fixed length $m=O(1)$ [Flajolet, Szpankowski and Vallée, 2006]. In our main result we prove that for $m=o(n^{1/3})$ the number of subsequence occurrences is normally distributed. In addition, we show that under some constraints on the structure of $w$ the asymptotic normality can be extended to $m=o(\sqrt{n})$. For a special pattern $w$ consisting of the same symbol, we indicate that for $m=o(n)$ the distribution of the number of subsequences is either asymptotically normal or asymptotically log normal. We conjecture that this dichotomy is true for all patterns. We use Hoeffding's projection method for $U$-statistics to prove our findings.
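The quantity studied above, the number of occurrences of $w$ as a subsequence of a text, can be computed for a concrete text with a standard dynamic program. The sketch below is illustrative only and is not taken from the paper; the function name `count_subsequence_occurrences` is hypothetical.

```python
def count_subsequence_occurrences(text, pattern):
    """Count the ways `pattern` occurs in `text` as a subsequence,
    i.e., at increasing but not necessarily consecutive positions."""
    m = len(pattern)
    # ways[j] = number of ways to match the first j pattern symbols
    # using the text scanned so far; the empty prefix matches once.
    ways = [1] + [0] * m
    for ch in text:
        # Go right to left so each text symbol extends a match at most once.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[m]
```

The running time is proportional to $nm$ for a text of length $n$ and a pattern of length $m$, and Python integers keep the count exact even when it grows very large.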

preprint · 2020 · arXiv

Randomized Linear Algebra Approaches to Estimate the Von Neumann Entropy of Density Matrices

The von Neumann entropy, named after John von Neumann, is an extension of the classical concept of entropy to the field of quantum mechanics. From a numerical perspective, von Neumann entropy can be computed simply by computing all eigenvalues of a density matrix, an operation that could be prohibitively expensive for large-scale density matrices. We present and analyze three randomized algorithms to approximate von Neumann entropy of real density matrices: our algorithms leverage recent developments in the Randomized Numerical Linear Algebra (RandNLA) literature, such as randomized trace estimators, provable bounds for the power method, and the use of random projections to approximate the eigenvalues of a matrix. All three algorithms come with provable accuracy guarantees and our experimental evaluations support our theoretical findings showing considerable speedup with small loss in accuracy.
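For reference, the eigenvalue-based computation mentioned above rests on the usual definition of the von Neumann entropy of a density matrix $\rho$ with eigenvalues $\lambda_1, \dots, \lambda_n$. The block below states that standard definition, with the convention $0 \log 0 = 0$; the base of the logarithm is a matter of convention and is an assumption here, as is the use of `\operatorname` from `amsmath`.

```latex
\[
  S(\rho) \;=\; -\operatorname{Tr}(\rho \log \rho)
          \;=\; -\sum_{i=1}^{n} \lambda_i \log \lambda_i ,
  \qquad \lambda_i \ge 0, \quad \sum_{i=1}^{n} \lambda_i = 1 .
\]
```

Written this way, the entropy is the trace of a matrix function of $\rho$, which is the kind of quantity that randomized trace estimators are designed to approximate.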

preprint · 2020 · arXiv

Temporal Ordered Clustering in Dynamic Networks: Unsupervised and Semi-supervised Learning Algorithms

In temporal ordered clustering, given a single snapshot of a dynamic network in which nodes arrive at distinct time instants, we aim at partitioning its nodes into $K$ ordered clusters $\mathcal{C}_1 \prec \cdots \prec \mathcal{C}_K$ such that for $i<j$, nodes in cluster $\mathcal{C}_i$ arrived before nodes in cluster $\mathcal{C}_j$, with $K$ being a data-driven parameter and not known upfront. Such a problem is of considerable significance in many applications ranging from tracking the expansion of fake news to mapping the spread of information. We first formulate our problem for a general dynamic graph, and propose an integer programming framework that finds the optimal clustering, represented as a strict partial order set, achieving the best precision (i.e., fraction of successfully ordered node pairs) for a fixed density (i.e., fraction of comparable node pairs). We then develop a sequential importance procedure and design unsupervised and semi-supervised algorithms to find temporal ordered clusters that efficiently approximate the optimal solution. To illustrate the techniques, we apply our methods to the vertex copying (duplication-divergence) model which exhibits some edge-case challenges in inferring the clusters as compared to other network models. Finally, we validate the performance of the proposed algorithms on synthetic and real-world networks.

preprint · 2020 · arXiv

Toward Universal Testing of Dynamic Network Models

Numerous networks in the real world change over time, in the sense that nodes and edges enter and leave the networks. Various dynamic random graph models have been proposed to explain the macroscopic properties of these systems and to provide a foundation for statistical inferences and predictions. It is of interest to have a rigorous way to determine how well these models match observed networks. We thus ask the following goodness of fit question: given a sequence of observations/snapshots of a growing random graph, along with a candidate model M, can we determine whether the snapshots came from M or from some arbitrary alternative model that is well-separated from M in some natural metric? We formulate this problem precisely and boil it down to goodness of fit testing for graph-valued, infinite-state Markov processes and exhibit and analyze a universal test based on non-stationary sampling for a natural class of models.

preprint · 2018 · arXiv

The Trade-off between Privacy and Fidelity via Ehrhart Theory

As an increasing amount of data is gathered nowadays and stored in databases (DBs), the question arises of how to protect the privacy of individual records in a DB even while providing accurate answers to queries on the DB. Differential Privacy (DP) has gained acceptance as a framework to quantify vulnerability of algorithms to privacy breaches. We consider the problem of how to sanitize an entire DB via a DP mechanism, on which unlimited further querying is performed. While protecting privacy, it is important that the sanitized DB still provide accurate responses to queries. The central contribution of this work is to characterize the amount of information preserved in an optimal DP DB sanitizing mechanism (DSM). We precisely characterize the utility-privacy trade-off of mechanisms that sanitize DBs in the asymptotic regime of large DBs. We study this in an information-theoretic framework by modeling a generic distribution on the data, and a measure of fidelity between the histograms of the original and sanitized DBs. We consider the popular $\mathbb{L}_{1}$-distortion metric that leads to the formulation as a linear program (LP). This optimization problem is prohibitive in complexity with the number of constraints growing exponentially in the parameters of the problem. Leveraging tools from discrete geometry, analytic combinatorics, and duality theorems of optimization, we fully characterize the optimal solution in terms of a power series whose coefficients are the number of integer points on a multidimensional convex polytope studied by Ehrhart in 1967. Employing Ehrhart theory, we determine a simple closed form computable expression for the asymptotic growth of the optimal privacy-fidelity trade-off to infinite precision. At the heart of the findings is a deep connection between the minimum expected distortion and the Ehrhart series of an integral convex polytope.

preprint · 2016 · arXiv

Asymmetric Rényi Problem and PATRICIA Tries

In 1960, Rényi asked for the number of random queries necessary to recover a hidden bijective labeling of n distinct objects. In each query one selects a random subset of labels and asks, what is the set of objects that have these labels? We consider here an asymmetric version of the problem in which in every query an object is chosen with probability p > 1/2 and we ignore "inconclusive" queries. We study the number of queries needed to recover the labeling in its entirety (the height), to recover at least one single element (the fillup level), and to recover a randomly chosen element (the typical depth). This problem exhibits several remarkable behaviors: the depth D_n converges in probability but not almost surely and while it satisfies the central limit theorem its local limit theorem doesn't hold; the height H_n and the fillup level F_n exhibit phase transitions with respect to p in the second term. To obtain these results, we take a unified approach via the analysis of the external profile, defined at level k as the number of elements recovered by the kth query. We first establish new precise asymptotic results for the average and variance, and a central limit law, for the external profile in the regime where it grows polynomially with n. We then extend the external profile results to the boundaries of the central region, leading to the solution of our problem for the height and fillup level. As a bonus, our analysis implies novel results for analogous parameters of random PATRICIA tries.

preprint · 2016 · arXiv

Average Size of a Suffix Tree for Markov Sources

We study a suffix tree built from a sequence generated by a Markovian source. Such sources are more realistic probabilistic models for text generation, data compression, molecular applications, and so forth. We prove that the average size of such a suffix tree is asymptotically equivalent to the average size of a trie built over $n$ independent sequences from the same Markovian source. This equivalence is only known for memoryless sources. We then derive a formula for the size of a trie under Markovian model to complete the analysis for suffix trees. We accomplish our goal by applying some novel techniques of analytic combinatorics on words also known as analytic pattern matching.

preprint · 2016 · arXiv

Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers

Nanopore sequencers are emerging as promising new platforms for high-throughput sequencing. As with other technologies, sequencer errors pose a major challenge for their effective use. In this paper, we present a novel information theoretic analysis of the impact of insertion-deletion (indel) errors in nanopore sequencers. In particular, we consider the following problems: (i) for given indel error characteristics and rate, what is the probability of accurate reconstruction as a function of sequence length; (ii) what is the number of 'typical' sequences within the distortion bound induced by indel errors; (iii) using replicated extrusion (the process of passing a DNA strand through the nanopore), what is the number of replicas needed to reduce the distortion bound so that only one typical sequence exists within the distortion bound. Our results provide a number of important insights: (i) the maximum length of a sequence that can be accurately reconstructed in the presence of indel and substitution errors is relatively small; (ii) the number of typical sequences within the distortion bound is large; and (iii) replicated extrusion is an effective technique for unique reconstruction. In particular, we show that the number of replicas is a slow function (logarithmic) of sequence length -- implying that through replicated extrusion, we can sequence large reads using nanopore sequencers. Our model considers indel and substitution errors separately. In this sense, it can be viewed as providing (tight) bounds on reconstruction lengths and repetitions for accurate reconstruction when the two error modes are considered in a single model.

preprint · 2015 · arXiv

A Limit Theorem for Radix Sort and Tries with Markovian Input

Tries are among the most versatile and widely used data structures on words. In particular, they are used in fundamental sorting algorithms such as radix sort which we study in this paper. While the performance of radix sort and tries under a realistic probabilistic model for the generation of words is of significant importance, its analysis, even for simplest memoryless sources, has proved difficult. In this paper we consider a more realistic model where words are generated by a Markov source. By a novel use of the contraction method combined with moment transfer techniques we prove a central limit theorem for the complexity of radix sort and for the external path length in a trie. This is the first application of the contraction method to the analysis of algorithms and data structures with Markovian inputs; it relies on the use of systems of stochastic recurrences combined with a product version of the Zolotarev metric.

preprint · 2012 · arXiv

Average redundancy of the Shannon code for Markov sources

It is known that for memoryless sources, the average and maximal redundancy of fixed-to-variable length codes, such as the Shannon and Huffman codes, exhibit two modes of behavior for long blocks. It either converges to a limit or it has an oscillatory pattern, depending on the irrationality or rationality, respectively, of certain parameters that depend on the source. In this paper, we extend these findings, concerning the Shannon code, to the case of a Markov source, which is considerably more involved. While this dichotomy, of convergent vs. oscillatory behavior, is well known in other contexts (including renewal theory, ergodic theory, local limit theorems and large deviations of discrete distributions), in information theory (e.g., in redundancy analysis) it was recognized relatively recently. To the best of our knowledge, no results of this type were reported thus far for Markov sources. We provide a precise characterization of the convergent vs. oscillatory behavior of the Shannon code redundancy for a class of irreducible, periodic and aperiodic, Markov sources. These findings are obtained by analytic methods, such as Fourier/Fejer series analysis and spectral analysis of matrices.
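As a numerical companion to the memoryless case recalled above, the sketch below computes the average redundancy of the Shannon code, which assigns a block of probability $P$ a codeword of length $\lceil \log_2(1/P) \rceil$, for blocks of length $n$ from a binary memoryless source. It does not cover the Markov sources analyzed in the paper, and the function name `shannon_average_redundancy` is hypothetical.

```python
import math


def shannon_average_redundancy(p, n):
    """Average redundancy (bits per block) of the Shannon code for
    length-n blocks of a binary memoryless source with P(symbol) = p.

    Assumption: memoryless source only; floating-point rounding may
    matter when a block's log-probability is extremely close to an integer.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    entropy = -(p * math.log2(p) + q * math.log2(q))
    expected_length = 0.0
    for k in range(n + 1):
        # All blocks with k symbols of probability p share one probability.
        block_prob = p**k * q ** (n - k)
        length = math.ceil(-(k * math.log2(p) + (n - k) * math.log2(q)))
        expected_length += math.comb(n, k) * block_prob * length
    return expected_length - n * entropy
```

Evaluating the function over a range of `n` for a fixed `p` is one way to observe numerically whether the redundancy settles down or keeps oscillating as the block length grows.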

preprint · 2012 · arXiv

Towards More Realistic Probabilistic Models for Data Structures: The External Path Length in Tries under the Markov Model

Tries are among the most versatile and widely used data structures on words. They are pertinent to the (internal) structure of (stored) words and several splitting procedures used in diverse contexts ranging from document taxonomy to IP addresses lookup, from data compression (i.e., Lempel-Ziv'77 scheme) to dynamic hashing, from partial-match queries to speech recognition, from leader election algorithms to distributed hashing tables and graph compression. While the performance of tries under a realistic probabilistic model is of significant importance, its analysis, even for simplest memoryless sources, has proved difficult. Rigorous findings about inherently complex parameters were rarely analyzed (with a few notable exceptions) under more realistic models of string generations. In this paper we meet these challenges: By a novel use of the contraction method combined with analytic techniques we prove a central limit theorem for the external path length of a trie under a general Markov source. In particular, our results apply to the Lempel-Ziv'77 code. We envision that the methods described here will have further applications to other trie parameters and data structures.
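To make the central parameter concrete, the sketch below computes the external path length of a trie, meaning the sum of the depths of the leaves that store the words, for words drawn from a simple two-state Markov source. It is an illustration under assumptions and not the paper's method: the names `external_path_length` and `markov_word`, the symmetric transition rule governed by `stay_prob`, and the word length of 64 are all hypothetical choices.

```python
import random
from collections import defaultdict


def external_path_length(words, depth=0):
    """Sum of leaf depths in the trie built from `words`.

    A word is stored in a leaf as soon as its prefix separates it from
    every other word. Words must be distinct and prefix-free.
    """
    if len(words) <= 1:
        return depth * len(words)
    buckets = defaultdict(list)
    for w in words:
        if depth >= len(w):
            raise ValueError("words must be distinct and prefix-free")
        buckets[w[depth]].append(w)
    return sum(external_path_length(b, depth + 1) for b in buckets.values())


def markov_word(length, stay_prob, rng):
    """Binary word from a two-state Markov chain that repeats the
    previous symbol with probability `stay_prob` (illustrative model)."""
    symbol = rng.choice("01")
    out = [symbol]
    for _ in range(length - 1):
        if rng.random() >= stay_prob:
            symbol = "1" if symbol == "0" else "0"
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    # A set removes accidental duplicates; equal lengths make it prefix-free.
    words = list({markov_word(64, 0.7, rng) for _ in range(100)})
    print(len(words), external_path_length(words))
```

Repeating the experiment over many independent samples gives an empirical distribution of the external path length for this toy source.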

preprint · 2011 · arXiv

Deinterleaving Finite Memory Processes via Penalized Maximum Likelihood

We study the problem of deinterleaving a set of finite-memory (Markov) processes over disjoint finite alphabets, which have been randomly interleaved by a finite-memory switch. The deinterleaver has access to a sample of the resulting interleaved process, but no knowledge of the number or structure of the component Markov processes, or of the switch. We study conditions for uniqueness of the interleaved representation of a process, showing that certain switch configurations, as well as memoryless component processes, can cause ambiguities in the representation. We show that a deinterleaving scheme based on minimizing a penalized maximum-likelihood cost function is strongly consistent, in the sense of reconstructing, almost surely as the observed sequence length tends to infinity, a set of component and switch Markov processes compatible with the original interleaved process. Furthermore, under certain conditions on the structure of the switch (including the special case of a memoryless switch), we show that the scheme recovers \emph{all} possible interleaved representations of the original process. Experimental results are presented demonstrating that the proposed scheme performs well in practice, even for relatively short input samples.
