Source author record

Aslan Tchamkerten

Aslan Tchamkerten appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Information Theory, math.IT, Machine Learning, math.ST, Statistics Theory, Computation and Language, Data Structures and Algorithms, stat.OT

Catalog footprint

What is connected

17 works

8 topics

4 close collaborators


preprint, 2026, arXiv

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.
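The diagnostic mentioned at the end of this abstract (how much source context a fixed token window reliably contains) can be sketched as follows. This is a minimal illustration, not the paper's procedure: the function names, the choice of a quantile as the notion of "reliably", and the example token lengths are all assumptions made here.

```python
from typing import List


def source_span_per_window(token_lengths: List[int], window: int) -> List[int]:
    """For every run of `window` consecutive tokens, return the number of
    source symbols (e.g. characters or bytes) those tokens cover."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(token_lengths) < window:
        return []
    running = sum(token_lengths[:window])
    spans = [running]
    # Slide the window one token at a time, updating the sum incrementally.
    for i in range(window, len(token_lengths)):
        running += token_lengths[i] - token_lengths[i - window]
        spans.append(running)
    return spans


def reliable_span(spans: List[int], coverage: float = 0.95) -> int:
    """Return a span s such that roughly `coverage` of the windows cover
    at least s source symbols (a low quantile of the span distribution)."""
    if not spans:
        raise ValueError("no windows to measure")
    ordered = sorted(spans)
    idx = int((1.0 - coverage) * len(ordered))
    return ordered[idx]


# Hypothetical token lengths (in source symbols) from some tokenizer.
lengths = [3, 1, 4, 1, 5, 9, 2, 6]
subword_spans = source_span_per_window(lengths, window=3)
# The same text fragmented into single-symbol units: every token has length 1.
fragmented_spans = source_span_per_window([1] * len(lengths), window=3)
print(subword_spans, reliable_span(subword_spans))
print(fragmented_spans, reliable_span(fragmented_spans))
```

With the same window of three units, the fragmented representation always covers three source symbols, while the grouped representation covers more, which is the asymmetry the abstract describes.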

preprint, 2021, arXiv

Approximating Probability Distributions by ReLU Networks

How many neurons are needed to approximate a target probability distribution using a neural network with a given input distribution and approximation error? This paper examines this question for the case when the input distribution is uniform, and the target distribution belongs to the class of histogram distributions. We obtain a new upper bound on the number of required neurons, which is strictly better than previously existing upper bounds. The key ingredient in this improvement is an efficient construction of the neural nets representing piecewise linear functions. We also obtain a lower bound on the minimum number of neurons needed to approximate the histogram distributions.
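To make the setting concrete, the sketch below maps a uniform input to a histogram distribution with a one-hidden-layer ReLU network. It uses the textbook inverse-CDF construction with one ReLU unit per bin, assuming equal-width bins on [0, 1] and strictly positive bin probabilities; it is not the more efficient construction the abstract refers to, and the names are hypothetical.

```python
import random
from itertools import accumulate
from typing import Callable, List, Sequence


def relu(x: float) -> float:
    return max(x, 0.0)


def histogram_generator(probs: Sequence[float]) -> Callable[[float], float]:
    """Build G(u) = sum_k w_k * relu(u - t_k), the piecewise linear inverse
    CDF of a histogram distribution with equal-width bins on [0, 1]."""
    if any(p <= 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be strictly positive and sum to 1")
    k = len(probs)
    # Slope of the inverse CDF on piece i: bin width divided by bin mass.
    slopes: List[float] = [(1.0 / k) / p for p in probs]
    # Piece i starts where the cumulative probability of earlier bins ends.
    knots: List[float] = [0.0] + list(accumulate(probs))[:-1]
    # Each ReLU unit contributes the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, k)]

    def generator(u: float) -> float:
        return sum(w * relu(u - t) for w, t in zip(weights, knots))

    return generator


probs = [0.5, 0.25, 0.25]
g = histogram_generator(probs)
rng = random.Random(0)
samples = [g(rng.random()) for _ in range(100_000)]
# Empirical mass of each bin; should be close to `probs`.
counts = [0] * len(probs)
for s in samples:
    counts[min(int(s * len(probs)), len(probs) - 1)] += 1
print([c / len(samples) for c in counts])
```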

preprint, 2021, arXiv

Error-Correction for Sparse Support Recovery Algorithms

Consider the compressed sensing setup where the support $s^*$ of an $m$-sparse $d$-dimensional signal $x$ is to be recovered from $n$ linear measurements with a given algorithm. Suppose that the measurements are such that the algorithm does not guarantee perfect support recovery and that true features may be missed. Can they efficiently be retrieved? This paper addresses this question through a simple error-correction module referred to as LiRE. LiRE takes as input an estimate $s_{in}$ of the true support $s^*$, and outputs a refined support estimate $s_{out}$. In the noiseless measurement setup, sufficient conditions are established under which LiRE is guaranteed to recover the entire support, that is $s_{out}$ contains $s^*$. These conditions imply, for instance, that in the high-dimensional regime LiRE can correct a sublinear in $m$ number of errors made by Orthogonal Matching Pursuit (OMP). The computational complexity of LiRE is $O(mnd)$. Experimental results with random Gaussian design matrices show that LiRE substantially reduces the number of measurements needed for perfect support recovery via Compressive Sampling Matching Pursuit, Basis Pursuit (BP), and OMP. Interestingly, adding LiRE to OMP yields a support recovery procedure that is more accurate and significantly faster than BP. This observation carries over in the noisy measurement setup. Finally, as a standalone support recovery algorithm with a random initialization, experiments show that LiRE's reconstruction performance lies between OMP and BP. These results suggest that LiRE may be used generically, on top of any suboptimal baseline support recovery algorithm, to improve support recovery or to operate with a smaller number of measurements, at the cost of a relatively small computational overhead. Alternatively, LiRE may be used as a standalone support recovery algorithm that is competitive with respect to OMP.

preprint, 2020, arXiv

$O(\log \log n)$ Worst-Case Local Decoding and Update Efficiency for Data Compression

This paper addresses the problem of data compression with local decoding and local update. A compression scheme has worst-case local decoding $d_{wc}$ if any bit of the raw file can be recovered by probing at most $d_{wc}$ bits of the compressed sequence, and has update efficiency of $u_{wc}$ if a single bit of the raw file can be updated by modifying at most $u_{wc}$ bits of the compressed sequence. This article provides an entropy-achieving compression scheme for memoryless sources that simultaneously achieves $O(\log\log n)$ local decoding and update efficiency. Key to this achievability result is a novel succinct data structure for sparse sequences which allows efficient local decoding and local update. Under general assumptions on the local decoder and update algorithms, a converse result shows that $d_{wc}$ and $u_{wc}$ must grow as $Ω(\log\log n)$.

preprint, 2015, arXiv

Distributed Function Computation Over a Rooted Directed Tree

This paper establishes the rate region for a class of source coding function computation setups where sources of information are available at the nodes of a tree and where a function of these sources must be computed at the root. The rate region holds for any function as long as the sources' joint distribution satisfies a certain Markov criterion. This criterion is met, in particular, when the sources are independent. This result recovers the rate regions of several function computation setups. These include the point-to-point communication setting with arbitrary sources, the noiseless multiple access network with "conditionally independent sources," and the cascade network with Markovian sources.

preprint, 2015, arXiv

On Cooperation in Multi-Terminal Computation and Rate Distortion

A receiver wants to compute a function of two correlated sources separately observed by two transmitters. One of the transmitters may send a possibly private message to the other transmitter in a cooperation phase before both transmitters communicate to the receiver. For this network configuration this paper investigates both a function computation setup, wherein the receiver wants to compute a given function of the sources exactly, and a rate distortion setup, wherein the receiver wants to compute a given function within some distortion. For the function computation setup, a general inner bound to the rate region is established and shown to be tight in a number of cases: partially invertible functions, full cooperation between transmitters, one-round point-to-point communication, two-round point-to-point communication, and the cascade setup where the transmitters and the receiver are aligned. In particular it is shown that the ratio of the total number of transmitted bits without cooperation and the total number of transmitted bits with cooperation can be arbitrarily large. Furthermore, one bit of cooperation suffices to arbitrarily reduce the amount of information both transmitters need to convey to the receiver. For the rate distortion version, an inner bound to the rate region is exhibited which always includes, and sometimes strictly, the convex hull of Kaspi-Berger's related inner bounds. The strict inclusion is shown via two examples.

preprint, 2014, arXiv

Lattice Codes for the Binary Deletion Channel

The construction of deletion codes for the Levenshtein metric is reduced to the construction of codes over the integers for the Manhattan metric by run length coding. The latter codes are constructed by expurgation of translates of lattices. These lattices, in turn, are obtained from Construction A applied to binary codes and $\mathbb{Z}_4$-codes. A lower bound on the size of our codes for the Manhattan distance is obtained through generalized theta series of the corresponding lattices.
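The reduction by run length coding can be illustrated on a small example. The sketch below only shows the simplest case, a single deletion inside a run of length at least two, where the run-length vector moves by Manhattan distance one; deletions that remove an entire run change the number of runs and are not handled here. All function names are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def run_lengths(bits: str) -> List[int]:
    """Run length coding: '0011100' -> [2, 3, 2]."""
    return [len(list(group)) for _, group in groupby(bits)]


def delete_at(bits: str, index: int) -> str:
    """Return `bits` with the symbol at position `index` removed."""
    return bits[:index] + bits[index + 1:]


def manhattan(a: Sequence[int], b: Sequence[int]) -> int:
    """Manhattan (L1) distance between two integer vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


word = "0011100"
received = delete_at(word, 3)  # delete one symbol from the middle run
print(run_lengths(word), run_lengths(received))
print(manhattan(run_lengths(word), run_lengths(received)))
```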

preprint, 2013, arXiv

Energy and Sampling Constrained Asynchronous Communication

The minimum energy, and, more generally, the minimum cost, to transmit one bit of information has been recently derived for bursty communication when information is available infrequently at random times at the transmitter. This result assumes that the receiver is always in the listening mode and samples all channel outputs until it makes a decision. If the receiver is constrained to sample only a fraction $f>0$ of the channel outputs, what is the cost penalty due to sparse output sampling? Remarkably, there is no penalty: regardless of $f>0$ the asynchronous capacity per unit cost is the same as under full sampling, i.e., when $f=1$. There is not even a penalty in terms of decoding delay, defined as the elapsed time between when information is available and when it is decoded. This latter result relies on the possibility to sample adaptively; the next sample can be chosen as a function of past samples. Under non-adaptive sampling, it is possible to achieve the full sampling asynchronous capacity per unit cost, but the decoding delay gets multiplied by $1/f$. Therefore adaptive sampling strategies are of particular interest in the very sparse sampling regime.

preprint, 2012, arXiv

Asynchronous Capacity per Unit Cost

The capacity per unit cost, or equivalently minimum cost to transmit one bit, is a well-studied quantity. It has been studied under the assumption of full synchrony between the transmitter and the receiver. In many applications, such as sensor networks, transmissions are very bursty, with small amounts of bits arriving infrequently at random times. In such scenarios, the cost of acquiring synchronization is significant and one is interested in the fundamental limits on communication without assuming a priori synchronization. In this paper, we show that the minimum cost to transmit $B$ bits of information asynchronously is $(B + \bar{H})k_{\text{sync}}$, where $k_{\text{sync}}$ is the synchronous minimum cost per bit and $\bar{H}$ is a measure of timing uncertainty equal to the entropy for most reasonable arrival time distributions. This result holds when the transmitter can stay idle at no cost and is a particular case of a general result which holds for arbitrary cost functions.
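A short numerical reading of the cost formula may help. The sketch below assumes that the timing-uncertainty term equals the Shannon entropy (in bits) of the arrival-time distribution, which the abstract says holds for most reasonable distributions; the uniform arrival window, the message size, and the unit value of the synchronous cost per bit are illustrative assumptions.

```python
import math
from typing import Sequence


def entropy_bits(pmf: Sequence[float]) -> float:
    """Shannon entropy of a probability mass function, in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def async_min_cost(num_bits: float, arrival_pmf: Sequence[float], k_sync: float) -> float:
    """Evaluate (B + H) * k_sync, taking H to be the entropy of the arrival
    time (an assumption; the abstract's quantity is a more general measure)."""
    return (num_bits + entropy_bits(arrival_pmf)) * k_sync


# Hypothetical scenario: 100 information bits, arrival time uniform over
# 1024 slots, and a synchronous cost of 1 unit per bit.
slots = 1024
uniform_arrival = [1.0 / slots] * slots
print(entropy_bits(uniform_arrival))               # timing uncertainty in bits
print(async_min_cost(100, uniform_arrival, 1.0))   # total asynchronous cost
```

In this scenario the timing uncertainty adds the equivalent of ten extra bits to the cost of the hundred information bits.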

preprint, 2012, arXiv

Asynchronous Communication: Capacity Bounds and Suboptimality of Training

Several aspects of the problem of asynchronous point-to-point communication without feedback are developed when the source is highly intermittent. In the system model of interest, the codeword is transmitted at a random time within a prescribed window whose length corresponds to the level of asynchronism between the transmitter and the receiver. The decoder operates sequentially and communication rate is defined as the ratio between the message size and the elapsed time between when transmission commences and when the decoder makes a decision. For such systems, general upper and lower bounds on capacity as a function of the level of asynchronism are established, and are shown to coincide in some nontrivial cases. From these bounds, several properties of this asynchronous capacity are derived. In addition, the performance of training-based schemes is investigated. It is shown that such schemes, which implement synchronization and information transmission on separate degrees of freedom in the encoding, cannot achieve the asynchronous capacity in general, and that the penalty is particularly significant in the high-rate regime.

preprint, 2012, arXiv

Estimating a Random Walk First-Passage Time from Noisy or Delayed Observations

A random walk (or a Wiener process), possibly with drift, is observed in a noisy or delayed fashion. The problem considered in this paper is to estimate the first time $τ$ the random walk reaches a given level. Specifically, the $p$-moment ($p\geq 1$) optimization problem $\inf_η \mathbb{E}|η-τ|^p$ is investigated where the infimum is taken over the set of stopping times that are defined on the observation process. When there is no drift, optimal stopping rules are characterized for both types of observations. When there is a drift, upper and lower bounds on $\inf_η \mathbb{E}|η-τ|^p$ are established for both types of observations. The bounds are tight in the large-level regime for noisy observations and in the large-level-large-delay regime for delayed observations. Notably, for noisy observations there exists an asymptotically optimal stopping rule that is a function of a single observation. Simulation results are provided that corroborate the validity of the results for non-asymptotic settings.

preprint, 2012, arXiv

On Computing a Function of Correlated Sources

A receiver wants to compute a function f of two correlated sources X and Y and side information Z. What is the minimum number of bits that needs to be communicated by each transmitter? In this paper, we derive inner and outer bounds to the rate region of this problem which coincide in the cases where f is partially invertible and where the sources are independent given the side information. These rate regions point to an important difference with the single source case. Whereas for the latter it is sufficient to consider independent sets of some suitable characteristic graph, for multiple sources such a restriction is suboptimal and multisets are necessary.

preprint, 2012, arXiv

On the Capacity of the One-Bit Deletion and Duplication Channel

The one-bit deletion and duplication channel is investigated. An input to this channel consists of a block of bits which experiences either a deletion, or a duplication, or remains unchanged. For this channel a capacity expression is obtained in a certain asymptotic regime where the deletion and duplication probabilities tend to zero. As a corollary, we obtain an asymptotic expression for the capacity of the segmented deletion and duplication channel where the input now consists of several blocks and each block independently experiences either a deletion, or a duplication, or remains unchanged.

preprint, 2012, arXiv

Tracking a Random Walk First-Passage Time Through Noisy Observations

Given a Gaussian random walk (or a Wiener process), possibly with drift, observed through noise, we consider the problem of estimating its first-passage time $τ_\ell$ of a given level $\ell$ with a stopping time $η$ defined over the noisy observation process. Main results are upper and lower bounds on the minimum mean absolute deviation $\inf_η \mathbb{E}|η-τ_\ell|$ which become tight as $\ell\to\infty$. Interestingly, in this regime the estimation error does not get smaller if we allow $η$ to be an arbitrary function of the entire observation process, not necessarily a stopping time. In the particular case where there is no drift, we show that it is impossible to track $τ_\ell$: $\inf_η \mathbb{E}|η-τ_\ell|^p=\infty$ for any $\ell>0$ and $p\geq1/2$.
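The tracking problem can be explored with a small Monte Carlo experiment. The sketch below assumes unit-variance Gaussian steps, additive Gaussian observation noise, and a naive stopping rule that stops the first time the noisy observation reaches the level. None of these choices comes from the paper, and the naive rule is not claimed to be optimal; the code only estimates the mean absolute deviation between the naive stopping time and the true first-passage time.

```python
import random
from typing import Tuple


def simulate_once(level: float, drift: float, noise_std: float,
                  rng: random.Random, max_steps: int = 10_000) -> Tuple[int, int]:
    """Simulate one path and return (tau, eta): the first-passage time of the
    hidden walk and the first time the noisy observation reaches `level`."""
    x = 0.0
    tau = None
    eta = None
    for step in range(1, max_steps + 1):
        x += drift + rng.gauss(0.0, 1.0)       # hidden Gaussian random walk
        y = x + rng.gauss(0.0, noise_std)      # assumed additive noise model
        if tau is None and x >= level:
            tau = step
        if eta is None and y >= level:
            eta = step
        if tau is not None and eta is not None:
            break
    # Paths that never cross within max_steps are truncated at max_steps.
    return (tau if tau is not None else max_steps,
            eta if eta is not None else max_steps)


def mean_abs_deviation(level: float, drift: float, noise_std: float,
                       trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E|eta - tau| for the naive stopping rule."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        tau, eta = simulate_once(level, drift, noise_std, rng)
        total += abs(eta - tau)
    return total / trials


print(mean_abs_deviation(level=20.0, drift=1.0, noise_std=0.0, trials=200))
print(mean_abs_deviation(level=20.0, drift=1.0, noise_std=2.0, trials=200))
```

With zero noise the observation equals the walk, so the naive rule recovers the first-passage time exactly; with noise the deviation becomes positive.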

preprint, 2010, arXiv

On Bounded Weight Codes

The maximum size of a binary code is studied as a function of its length $N$, minimum distance $D$, and minimum codeword weight $W$. This function $B(N,D,W)$ is first characterized in terms of its exponential growth rate in the limit as $N$ tends to infinity for fixed $d=D/N$ and $w=W/N$. The exponential growth rate of $B(N,D,W)$ is shown to be equal to the exponential growth rate of $A(N,D)$ for $w \le 1/2$, and equal to the exponential growth rate of $A(N,D,W)$ for $1/2 < w \le 1$. Second, analytic and numerical upper bounds on $B(N,D,W)$ are derived using the semidefinite programming (SDP) method. These bounds yield a non-asymptotic improvement of the second Johnson bound and are tight for certain values of the parameters.
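For very small parameters the quantity $B(N,D,W)$ can be computed by exhaustive search, which makes the definition concrete. The sketch below assumes that "minimum codeword weight $W$" means every codeword has Hamming weight at least $W$; it is a brute-force enumeration that is only practical for tiny $N$ and is unrelated to the SDP bounds in the abstract. The function name is hypothetical.

```python
from typing import List


def weight(x: int) -> int:
    """Hamming weight of the binary word encoded by the integer x."""
    return bin(x).count("1")


def max_code_size(n: int, d: int, w: int) -> int:
    """Largest binary code of length n whose codewords all have weight >= w
    and whose pairwise Hamming distances are all >= d (exhaustive search)."""
    words = [x for x in range(2 ** n) if weight(x) >= w]

    def extend(candidates: List[int]) -> int:
        # Size of the largest code that can be built from `candidates`,
        # all of which are pairwise compatible with the words chosen so far.
        best = 0
        for i, c in enumerate(candidates):
            rest = [x for x in candidates[i + 1:] if weight(x ^ c) >= d]
            best = max(best, 1 + extend(rest))
        return best

    return extend(words)


# With w = 0 the weight constraint is vacuous.
print(max_code_size(4, 2, 0))
# Requiring weight at least 2 removes the zero word and the weight-one words.
print(max_code_size(4, 2, 2))
```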

preprint, 2009, arXiv

Training-Based Schemes are Suboptimal for High Rate Asynchronous Communication

We consider asynchronous point-to-point communication. Building on a recently developed model, we show that training-based schemes, i.e., communication strategies that separate synchronization from information transmission, perform suboptimally at high rate.

preprint, 2008, arXiv

Tracking Stopping Times Through Noisy Observations

A novel quickest detection setting is proposed which is a generalization of the well-known Bayesian change-point detection model. Suppose $\{(X_i,Y_i)\}_{i\geq 1}$ is a sequence of pairs of random variables, and that $S$ is a stopping time with respect to $\{X_i\}_{i\geq 1}$. The problem is to find a stopping time $T$ with respect to $\{Y_i\}_{i\geq 1}$ that optimally tracks $S$, in the sense that $T$ minimizes the expected reaction delay $E(T-S)^+$, while keeping the false-alarm probability $P(T<S)$ below a given threshold $α\in [0,1]$. This problem formulation applies in several areas, such as in communication, detection, forecasting, and quality control. Our results relate to the situation where the $X_i$'s and $Y_i$'s take values in finite alphabets and where $S$ is bounded by some positive integer $κ$. By using elementary methods based on the analysis of the tree structure of stopping times, we exhibit an algorithm that computes the optimal average reaction delays for all $α\in [0,1]$, and constructs the associated optimal stopping times $T$. Under certain conditions on $\{(X_i,Y_i)\}_{i\geq 1}$ and $S$, the algorithm running time is polynomial in $κ$.
