Researcher profile

Narayana Santhanam

Narayana Santhanam contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 17 - Baseline
4works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2014arXiv

On redundancy of memoryless sources over countable alphabets

The minimum average number of bits need to describe a random variable is its entropy, assuming knowledge of the underlying statistics On the other hand, universal compression supposes that the distribution of the random variable, while unknown, belongs to a known set $\cal P$ of distributions. Such universal descriptions for the random variable are agnostic to the identity of the distribution in $\cal P$. But because they are not matched exactly to the underlying distribution of the random variable, the average number of bits they use is higher, and the excess over the entropy used is the "redundancy". This formulation is fundamental to problems not just in compression, but also estimation and prediction and has a wide variety of applications from language modeling to insurance. In this paper, we study the redundancy of universal encodings of strings generated by independent identically distributed (iid) sampling from a set $\cal P$ of distributions over a countable support. We first show that if describing a single sample from $\cal P$ incurs finite redundancy, then $\cal P$ is tight but that the converse does not always hold. If a single sample can be described with finite worst-case-regret (a more stringent formulation than redundancy above), then it is known that length-$n$ iid samples only incurs a diminishing (in $n$) redundancy per symbol as $n$ increases. However, we show it is possible that a collection $\cal P$ incurs finite redundancy, yet description of length-$n$ iid samples incurs a constant redundancy per symbol encoded. We then show a sufficient condition on $\cal P$ such that length-$n$ iid samples will incur diminishing redundancy per symbol encoded.

preprint2013arXiv

Agnostic insurability of model classes

Motivated by problems in insurance, our task is to predict finite upper bounds on a future draw from an unknown distribution $p$ over the set of natural numbers. We can only use past observations generated independently and identically distributed according to $p$. While $p$ is unknown, it is known to belong to a given collection ${\cal P}$ of probability distributions on the natural numbers. The support of the distributions $p \in {\cal P}$ may be unbounded, and the prediction game goes on for \emph{infinitely} many draws. We are allowed to make observations without predicting upper bounds for some time. But we must, with probability 1, start and then continue to predict upper bounds after a finite time irrespective of which $p \in {\cal P}$ governs the data. If it is possible, without knowledge of $p$ and for any prescribed confidence however close to 1, to come up with a sequence of upper bounds that is never violated over an infinite time window with confidence at least as big as prescribed, we say the model class ${\cal P}$ is \emph{insurable}. We completely characterize the insurability of any class ${\cal P}$ of distributions over natural numbers by means of a condition on how the neighborhoods of distributions in ${\cal P}$ should be, one that is both necessary and sufficient.

preprint2012arXiv

On Modeling Profiles instead of Values

We consider the problem of estimating the distribution underlying an observed sample of data. Instead of maximum likelihood, which maximizes the probability of the ob served values, we propose a different estimate, the high-profile distribution, which maximizes the probability of the observed profile the number of symbols appearing any given number of times. We determine the high-profile distribution of several data samples, establish some of its general properties, and show that when the number of distinct symbols observed is small compared to the data size, the high-profile and maximum-likelihood distributions are roughly the same, but when the number of symbols is large, the distributions differ, and high-profile better explains the data.

preprint2012arXiv

Optimal Lempel-Ziv based lossy compression for memoryless data: how to make the right mistakes

Compression refers to encoding data using bits, so that the representation uses as few bits as possible. Compression could be lossless: i.e. encoded data can be recovered exactly from its representation) or lossy where the data is compressed more than the lossless case, but can still be recovered to within prespecified distortion metric. In this paper, we prove the optimality of Codelet Parsing, a quasi-linear time algorithm for lossy compression of sequences of bits that are independently and identically distributed (\iid) and Hamming distortion. Codelet Parsing extends the lossless Lempel Ziv algorithm to the lossy case---a task that has been a focus of the source coding literature for better part of two decades now. Given \iid sequences $\x$, the expected length of the shortest lossy representation such that $\x$ can be reconstructed to within distortion $\dist$ is given by the rate distortion function, $\rd$. We prove the optimality of the Codelet Parsing algorithm for lossy compression of memoryless bit sequences. It splits the input sequence naturally into phrases, representing each phrase by a codelet, a potentially distorted phrase of the same length. The codelets in the lossy representation of a length-$n$ string ${\x}$ have length roughly $(\log n)/\rd$, and like the lossless Lempel Ziv algorithm, Codelet Parsing constructs codebooks logarithmic in the sequence length.