Source author record

Ilan Shomorony

Ilan Shomorony appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Information Theory, math.IT, Genomics, Applications, Machine Learning

Catalog footprint

What is connected

18 works

5 topics

4 close collaborators

preprint, 2026, arXiv

On the Capacity of Noisy Frequency-based Channels

We investigate the capacity of noisy frequency-based channels, motivated by DNA data storage in the short-molecule regime, where information is encoded in the frequency of item types rather than their order. The channel output is a histogram formed by random sampling of items, followed by noisy item identification. While the capacity of the noiseless frequency-based channel has been previously addressed, the effect of identification noise has not been fully characterized. We present a converse bound on the channel capacity that follows from stochastic degradation and the data processing inequality. We then establish an achievable bound, which is based on a Poissonization of the multinomial sampling process, and an analysis of the resulting vector Poisson channel with inter-symbol interference. This analysis refines concentration inequalities for the information density used in the Feinstein bound, and explicitly characterizes an additive loss in the mutual information due to identification noise. We apply our results to a DNA storage channel in the short-molecule regime, and quantify the resulting loss in the scaling of the total number of reliably stored bits.
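The Poissonization step mentioned in the abstract rests on a standard fact about multinomial sampling, stated here as background rather than as the paper's exact formulation. The symbols $N$, $\lambda$, $k$, $p_i$ and $H_i$ are illustrative names not taken from the abstract: $N$ is the number of sampled items, $p_i$ the sampling probability of item type $i$, and $H_i$ the histogram count of type $i$.

```latex
\[
N \sim \mathrm{Poisson}(\lambda), \qquad
(H_1,\dots,H_k) \mid N \sim \mathrm{Multinomial}(N;\, p_1,\dots,p_k)
\;\Longrightarrow\;
H_i \sim \mathrm{Poisson}(\lambda p_i) \ \text{independently for } i = 1,\dots,k.
\]
```

Randomizing the number of samples in this way makes the histogram entries independent, which is typically what makes a vector Poisson channel analysis tractable.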

preprint, 2022, arXiv

Coded Shotgun Sequencing

Most DNA sequencing technologies are based on the shotgun paradigm: many short reads are obtained from random unknown locations in the DNA sequence. A fundamental question, studied in arXiv:1203.6233, is what read length and coverage depth (i.e., the total number of reads) are needed to guarantee reliable sequence reconstruction. Motivated by DNA-based storage, we study the coded version of this problem; i.e., the scenario where the DNA molecule being sequenced is a codeword from a predefined codebook. Our main result is an exact characterization of the capacity of the resulting shotgun sequencing channel as a function of the read length and coverage depth. In particular, our results imply that, while in the uncoded case, $O(n)$ reads of length greater than $2\log{n}$ are needed for reliable reconstruction of a length-$n$ binary sequence, in the coded case, only $O(n/\log{n})$ reads of length greater than $\log{n}$ are needed for the capacity to be arbitrarily close to $1$.
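The shotgun paradigm described above can be sketched as a small simulation. This is a minimal illustration, not the paper's channel model: the function name `shotgun_reads`, the uniform choice of start positions, and the circular wrap-around at the end of the sequence are all assumptions made for the example.

```python
import random


def shotgun_reads(seq, num_reads, read_len, rng):
    """Return `num_reads` reads of length `read_len` from random positions.

    Assumptions for this sketch: start positions are uniform over the
    sequence, and reads wrap around circularly at the end of `seq`.
    The start positions are discarded, since the sequencer does not
    reveal where each read came from.
    """
    n = len(seq)
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(n)
        reads.append([seq[(start + j) % n] for j in range(read_len)])
    return reads


if __name__ == "__main__":
    rng = random.Random(0)
    sequence = [rng.randint(0, 1) for _ in range(64)]
    for read in shotgun_reads(sequence, num_reads=5, read_len=8, rng=rng):
        print("".join(map(str, read)))
```

In the coded setting studied in the abstract, `sequence` would be a codeword chosen from a codebook rather than an arbitrary string.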

preprint, 2022, arXiv

Reassembly Codes for the Chop-and-Shuffle Channel

We study the problem of retrieving data from a channel that breaks the input sequence into a set of unordered fragments of random lengths, which we refer to as the chop-and-shuffle channel. The length of each fragment follows a geometric distribution. We propose nested Varshamov-Tenengolts (VT) codes to recover the data. We evaluate the error rate and the complexity of our scheme numerically. Our results show that the decoding error decreases as the input length increases, and our method has a significantly lower complexity than the baseline brute-force approach. We also propose a new construction for VT codes, quantify the maximum number of the required parity bits, and show that our approach requires fewer parity bits compared to known results.
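For background, the classical single Varshamov-Tenengolts code can be sketched as below. This is the textbook construction with Levenshtein's single-deletion decoder, not the nested construction proposed in the paper; the function names are illustrative.

```python
from itertools import product


def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions 1..n."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)


def vt_codebook(n, a=0):
    """All length-n binary words whose VT syndrome equals a (brute force)."""
    return [list(x) for x in product([0, 1], repeat=n) if vt_syndrome(x) == a]


def vt_decode_single_deletion(received, n, a=0):
    """Recover a codeword of VT_a(n) from a copy with exactly one bit deleted."""
    if len(received) != n - 1:
        raise ValueError("expected exactly one deletion")
    weight = sum(received)
    checksum = sum(i * b for i, b in enumerate(received, start=1))
    deficiency = (a - checksum) % (n + 1)
    if deficiency <= weight:
        # A 0 was deleted: reinsert it with `deficiency` ones to its right.
        pos, ones_right = len(received), 0
        while ones_right < deficiency:
            pos -= 1
            ones_right += received[pos]
        return received[:pos] + [0] + received[pos:]
    # A 1 was deleted: reinsert it with (deficiency - weight - 1) zeros to its left.
    zeros_needed = deficiency - weight - 1
    pos, zeros_left = 0, 0
    while zeros_left < zeros_needed:
        zeros_left += 1 - received[pos]
        pos += 1
    return received[:pos] + [1] + received[pos:]


if __name__ == "__main__":
    codeword = vt_codebook(6)[3]
    damaged = codeword[:2] + codeword[3:]  # delete the third bit
    print(codeword, damaged, vt_decode_single_deletion(damaged, 6))
```

The decoder only needs the length and the syndrome of the original word, which is why VT-style checks are a natural building block when fragment boundaries are unknown.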

preprint, 2021, arXiv

Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment

Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. State-of-the-art approaches to speed up this task use hashing to identify short segments (k-mers) that are shared by pairs of reads, which can then be used to estimate alignment scores. However, when the number of reads is large, accurately estimating alignment scores for all pairs is still very costly. Moreover, in practice, one is only interested in identifying pairs of reads with large alignment scores. In this work, we propose a new approach to pairwise alignment estimation based on two key new ingredients. The first ingredient is to cast the problem of pairwise alignment estimation under a general framework of rank-one crowdsourcing models, where the workers' responses correspond to k-mer hash collisions. These models can be accurately solved via a spectral decomposition of the response matrix. The second ingredient is to utilise a multi-armed bandit algorithm to adaptively refine this spectral estimator only for read pairs that are likely to have large alignments. The resulting algorithm iteratively performs a spectral decomposition of the response matrix for adaptively chosen subsets of the read pairs.

preprint, 2020, arXiv

Communicating over the Torn-Paper Channel

We consider the problem of communicating over a channel that randomly "tears" the message block into small pieces of different sizes and shuffles them. For the binary torn-paper channel with block length $n$ and pieces of length ${\rm Geometric}(p_n)$, we characterize the capacity as $C = e^{-\alpha}$, where $\alpha = \lim_{n\to\infty} p_n \log n$. Our results show that the case of ${\rm Geometric}(p_n)$-length fragments and the case of deterministic length-$(1/p_n)$ fragments are qualitatively different and, surprisingly, the capacity of the former is larger. Intuitively, this is due to the fact that, in the random fragments case, large fragments are sometimes observed, which boosts the capacity.
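The channel described above can be simulated with a few lines of code. This is a rough sketch under stated assumptions: piece lengths are drawn from a geometric distribution on {1, 2, ...}, the final piece is truncated at the end of the block, and the function names are illustrative. The capacity helper simply evaluates the closed form $e^{-\alpha}$ quoted in the abstract for a given $\alpha$.

```python
import math
import random


def torn_paper_channel(bits, p, rng):
    """Tear `bits` into Geometric(p)-length pieces and shuffle them.

    Assumption for this sketch: the last piece is cut short when it
    would run past the end of the block.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    pieces, start = [], 0
    while start < len(bits):
        length = 1
        while rng.random() >= p:  # continue with probability 1 - p
            length += 1
        pieces.append(bits[start:start + length])
        start += length
    rng.shuffle(pieces)  # the receiver sees the pieces in random order
    return pieces


def torn_paper_capacity(alpha):
    """Evaluate the capacity expression C = exp(-alpha) from the abstract."""
    return math.exp(-alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    message = [rng.randint(0, 1) for _ in range(40)]
    for piece in torn_paper_channel(message, p=0.2, rng=rng):
        print("".join(map(str, piece)))
    print(torn_paper_capacity(0.5))
```

Since the mean piece length is $1/p$, smaller values of `p` give longer fragments on average, consistent with the capacity growing as $\alpha$ shrinks.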

preprint, 2020, arXiv

DNA-Based Storage: Models and Fundamental Limits

Due to its longevity and enormous information density, DNA is an attractive medium for archival storage. In this work, we study the fundamental limits and trade-offs of DNA-based storage systems by introducing a new channel model, which we call the noisy shuffling-sampling channel. Motivated by current technological constraints on DNA synthesis and sequencing, this model captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules; (2) the molecules are corrupted by noise during synthesis and sequencing and (3) the data is read by randomly sampling from the DNA pool. We provide capacity results for this channel under specific noise and sampling assumptions and show that, in many scenarios, a simple index-based coding scheme is optimal.
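An index-based coding scheme of the kind mentioned above can be sketched as follows. This is a toy, noise-free illustration rather than the scheme analyzed in the paper: the function names, the binary index prefix, and sampling with replacement are assumptions, and no error correction is applied to the index or the payload.

```python
import random


def encode_indexed(data, payload_len):
    """Split `data` into molecules, each prefixed with its binary index."""
    if len(data) % payload_len != 0:
        raise ValueError("data length must be a multiple of payload_len")
    num = len(data) // payload_len
    index_bits = max(1, (num - 1).bit_length())
    molecules = []
    for i in range(num):
        index = [(i >> k) & 1 for k in reversed(range(index_bits))]
        molecules.append(index + data[i * payload_len:(i + 1) * payload_len])
    return molecules, index_bits


def sample_pool(molecules, num_draws, rng):
    """Read the pool by drawing molecules uniformly with replacement."""
    return [list(rng.choice(molecules)) for _ in range(num_draws)]


def decode_indexed(reads, index_bits, num_molecules):
    """Place each read at its index; unseen molecules stay None (erasures)."""
    recovered = [None] * num_molecules
    for read in reads:
        i = 0
        for b in read[:index_bits]:
            i = (i << 1) | b
        if i < num_molecules:
            recovered[i] = read[index_bits:]
    return recovered


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.randint(0, 1) for _ in range(32)]
    molecules, index_bits = encode_indexed(data, payload_len=4)
    reads = sample_pool(molecules, num_draws=12, rng=rng)
    print(decode_indexed(reads, index_bits, len(molecules)))
```

The index restores the order lost by shuffling, and molecules that are never sampled show up as erasures that an outer code would have to fill in.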

preprint, 2016, arXiv

Partial DNA Assembly: A Rate-Distortion Perspective

Earlier formulations of the DNA assembly problem were all in the context of perfect assembly; i.e., given a set of reads from a long genome sequence, is it possible to perfectly reconstruct the original sequence? In practice, however, it is very often the case that the read data is not sufficiently rich to permit unambiguous reconstruction of the original sequence. While a natural generalization of the perfect assembly formulation to these cases would be to consider a rate-distortion framework, partial assemblies are usually represented in terms of an assembly graph, making the definition of a distortion measure challenging. In this work, we introduce a distortion function for assembly graphs that can be understood as the logarithm of the number of Eulerian cycles in the assembly graph, each of which corresponds to a candidate assembly that could have generated the observed reads. We also introduce an algorithm for the construction of an assembly graph and analyze its performance on real genomes.

preprint, 2015, arXiv

Do Read Errors Matter for Genome Assembly?

While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read length and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, we make progress on this problem by establishing a critical read length, as a function of the genome and the error rate, above which perfect assembly is guaranteed. For several real genomes, including those from the GAGE dataset, we verify that this critical read length is not significantly greater than the read length required for perfect assembly from reads without errors.

preprint, 2015, arXiv

Informational Bottlenecks in Two-Unicast Wireless Networks with Delayed CSIT

We study the impact of delayed channel state information at the transmitters (CSIT) in two-unicast wireless networks with a layered topology and arbitrary connectivity. We introduce a technique to obtain outer bounds to the degrees-of-freedom (DoF) region through the new graph-theoretic notion of bottleneck nodes. Such nodes act as informational bottlenecks only under the assumption of delayed CSIT, and imply asymmetric DoF bounds of the form $mD_1 + D_2 \leq m$. Combining this outer-bound technique with new achievability schemes, we characterize the sum DoF of a class of two-unicast wireless networks, which shows that, unlike in the case of instantaneous CSIT, the DoF of two-unicast networks with delayed CSIT can take an infinite set of values.

preprint, 2014, arXiv

A Generalized Cut-Set Bound for Deterministic Multi-Flow Networks and its Applications

We present a new outer bound for the sum capacity of general multi-unicast deterministic networks. Intuitively, this bound can be understood as applying the cut-set bound to concatenated copies of the original network with a special restriction on the allowed transmit signal distributions. We first study applications to finite-field networks, where we obtain a general outer-bound expression in terms of ranks of the transfer matrices. We then show that, even though our outer bound is for deterministic networks, a recent result relating the capacity of AWGN KxKxK networks and the capacity of a deterministic counterpart allows us to establish an outer bound to the DoF of KxKxK wireless networks with general connectivity. This bound is tight in the case of the "adjacent-cell interference" topology, and yields graph-theoretic necessary and sufficient conditions for K DoF to be achievable in general topologies.

preprint, 2014, arXiv

Sampling Large Data on Graphs

We consider the problem of sampling from data defined on the nodes of a weighted graph, where the edge weights capture the data correlation structure. As shown recently, using spectral graph theory one can define a cut-off frequency for the bandlimited graph signals that can be reconstructed from a given set of samples (i.e., graph nodes). In this work, we show how this cut-off frequency can be computed exactly. Using this characterization, we provide efficient algorithms for finding the subset of nodes of a given size with the largest cut-off frequency and for finding the smallest subset of nodes with a given cut-off frequency. In addition, we study the performance of random uniform sampling when compared to the centralized optimal sampling provided by the proposed algorithms.

preprint, 2013, arXiv

Degrees of Freedom of Two-Hop Wireless Networks: "Everyone Gets the Entire Cake"

We show that fully connected two-hop wireless networks with K sources, K relays and K destinations have K degrees of freedom both in the case of time-varying channel coefficients and in the case of constant channel coefficients (in which case the result holds for almost all values of constant channel coefficients). Our main contribution is a new achievability scheme which we call Aligned Network Diagonalization. This scheme allows the data streams transmitted by the sources to undergo a diagonal linear transformation from the sources to the destinations, thus being received free of interference by their intended destination. In addition, we extend our scheme to multi-hop networks with fully connected hops, and multi-hop networks with MIMO nodes, for which the degrees of freedom are also fully characterized.

preprint, 2013, arXiv

Network Compression: Worst-Case Analysis

We study the problem of communicating a distributed correlated memoryless source over a memoryless network, from source nodes to destination nodes, under quadratic distortion constraints. We establish the following two complementary results: (a) for an arbitrary memoryless network, among all distributed memoryless sources of a given correlation, Gaussian sources are least compressible, that is, they admit the smallest set of achievable distortion tuples, and (b) for any memoryless source to be communicated over a memoryless additive-noise network, among all noise processes of a given correlation, Gaussian noise admits the smallest achievable set of distortion tuples. We establish these results constructively by showing how schemes for the corresponding Gaussian problems can be applied to achieve similar performance for (source or noise) distributions that are not necessarily Gaussian but have the same covariance.

preprint, 2013, arXiv

On Min-Cut Algorithms for Half-Duplex Relay Networks

Computing the cut-set bound in half-duplex relay networks is a challenging optimization problem, since it requires finding the cut-set optimal half-duplex schedule. This subproblem in general involves an exponential number of variables, since the number of ways to assign each node to either transmitter or receiver mode is exponential in the number of nodes. We present a general technique that takes advantage of specific structures in the topology of a given network and allows us to reduce the complexity of computing the half-duplex schedule that maximizes the cut-set bound (with i.i.d. input distribution). In certain classes of network topologies, our approach yields polynomial time algorithms. We use simulations to show running time improvements over alternative methods and compare the performance of various half-duplex scheduling approaches in different SNR regimes.

preprint, 2013, arXiv

Worst-Case Additive Noise in Wireless Networks

A classical result in Information Theory states that the Gaussian noise is the worst-case additive noise in point-to-point channels, meaning that, for a fixed noise variance, the Gaussian noise minimizes the capacity of an additive noise channel. In this paper, we significantly generalize this result and show that the Gaussian noise is also the worst-case additive noise in wireless networks with additive noises that are independent from the transmit signals. More specifically, we show that, if we fix the noise variance at each node, then the capacity region with Gaussian noises is a subset of the capacity region with any other set of noise distributions. We prove this result by showing that a coding scheme that achieves a given set of rates on a network with Gaussian additive noises can be used to construct a coding scheme that achieves the same set of rates on a network that has the same topology and traffic demands, but with non-Gaussian additive noises.

preprint, 2012, arXiv

Diamond Networks with Bursty Traffic: Bounds on the Minimum Energy-Per-Bit

When data traffic in a wireless network is bursty, small amounts of data sporadically become available for transmission, at times that are unknown at the receivers, and an extra amount of energy must be spent at the transmitters to overcome this lack of synchronization between the network nodes. In practice, pre-defined header sequences are used with the purpose of synchronizing the different network nodes. However, in networks where relays must be used for communication, the overhead required for synchronizing the entire network may be very significant. In this work, we study the fundamental limits of energy-efficient communication in an asynchronous diamond network with two relays. We formalize the notion of relay synchronization by saying that a relay is synchronized if the conditional entropy of the arrival time of the source message given the received signals at the relay is small. We show that the minimum energy-per-bit for bursty traffic in diamond networks is achieved with a coding scheme where each relay is either synchronized or not used at all. A consequence of this result is the derivation of a lower bound to the minimum energy-per-bit for bursty communication in diamond networks. This bound allows us to show that schemes that perform the tasks of synchronization and communication separately (i.e., with synchronization signals preceding the communication block) can achieve the minimum energy-per-bit to within a constant fraction that ranges from 2 in the synchronous case to 1 in the highly asynchronous regime.

preprint, 2012, arXiv

Two-Unicast Wireless Networks: Characterizing the Degrees-of-Freedom

We consider two-source two-destination (i.e., two-unicast) multi-hop wireless networks that have a layered structure with arbitrary connectivity. We show that, if the channel gains are chosen independently according to continuous distributions, then, with probability 1, two-unicast layered Gaussian networks can only have 1, 3/2 or 2 sum degrees-of-freedom (unless both source-destination pairs are disconnected, in which case no degrees-of-freedom can be achieved). We provide sufficient and necessary conditions for each case based on network connectivity and a new notion of source-destination paths with manageable interference. Our achievability scheme is based on forwarding the received signals at all nodes, except for a small fraction of them in at most two key layers. Hence, we effectively create a "condensed network" that has at most four layers (including the sources layer and the destinations layer). We design the transmission strategies based on the structure of this condensed network. The converse results are obtained by developing information-theoretic inequalities that capture the structures of the network connectivity. Finally, we extend this result and characterize the full degrees-of-freedom region of two-unicast layered wireless networks.

preprint, 2012, arXiv

Worst-Case Source for Distributed Compression with Quadratic Distortion

We consider the k-encoder source coding problem with a quadratic distortion measure. We show that among all source distributions with a given covariance matrix K, the jointly Gaussian source requires the highest rates in order to meet a given set of distortion constraints.
