Source author record

Farzad Farnoud

Farzad Farnoud appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Information Theory, math.IT, Genomics, Data Structures and Algorithms, Discrete Mathematics, math.CO, Computation and Language, Computer Science and Game Theory, Formal Languages and Automata Theory, Quantitative Methods, Social and Information Networks

Catalog footprint

What is connected

19 works

11 topics

4 close collaborators

preprint · 2022 · arXiv

Data Deduplication with Random Substitutions

Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time efficient and are thus widely used in large scale storage systems. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. More precisely, each symbol in a repeated string is substituted with a given edit probability. Deduplication algorithms in both the fixed-length scheme and the variable-length scheme are studied. The fixed-length deduplication algorithm is shown to be unsuitable for the proposed source model as it does not take into account the edit probability. Two modifications are proposed and shown to perform within a constant factor of optimal when the source model parameters are known. We also study the conventional variable-length deduplication algorithm and show that as source entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string, leading to high compression ratios.
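As a rough illustration of the fixed-length scheme mentioned in this abstract, the sketch below splits a string into chunks of a fixed length and stores each distinct chunk once, replacing repeats with indices. The function names, the chunk length, and the output encoding are illustrative assumptions and not the algorithm analyzed in the paper. The second call shows how a single substitution defeats exact chunk matching.

```python
def fixed_length_dedup(data: str, chunk_len: int):
    """Split data into fixed-length chunks and store each distinct chunk once.

    Returns (dictionary, refs): dictionary lists the distinct chunks in order
    of first appearance; refs holds, for every chunk of the input, the index
    of its dictionary entry. (Hypothetical encoding, for illustration only.)
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    index_of = {}
    dictionary = []
    refs = []
    for start in range(0, len(data), chunk_len):
        chunk = data[start:start + chunk_len]
        if chunk not in index_of:
            index_of[chunk] = len(dictionary)
            dictionary.append(chunk)
        refs.append(index_of[chunk])
    return dictionary, refs


def rebuild(dictionary, refs) -> str:
    """Reconstruct the original string from the dictionary and the indices."""
    return "".join(dictionary[i] for i in refs)


# Exact repeats collapse to one stored chunk.
exact_dict, exact_refs = fixed_length_dedup("ACGTACGTACGT", 4)

# One substituted symbol (G -> T in the second chunk) forces a new entry.
noisy_dict, noisy_refs = fixed_length_dedup("ACGTACTTACGT", 4)
```

With exact repeats the dictionary holds a single chunk, while the substituted copy adds a second entry even though it differs in only one symbol.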

preprint · 2022 · arXiv

Low-redundancy codes for correcting multiple short-duplication and edit errors

Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most $p$ edits, where a short duplication generates a copy of a substring with length $\leq 3$ and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to $p$ edits (in addition to duplications) at the additional cost of roughly $8p(\log_q n)(1+o(1))$ symbols of redundancy, thus achieving the same asymptotic rate, where $q\ge 4$ is the alphabet size and $p$ is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when $p$ is a constant with respect to the code length.

preprint · 2020 · arXiv

Coding for Optimized Writing Rate in DNA Storage

A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is intended to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the number of information bits per synthesis time unit, namely, the writing rate. Additionally, the encoding scheme studied here takes into account the existence of multiple copies of the DNA sequence, which are independently distorted. Finally, quantizers for various run-length distributions are designed.

preprint · 2020 · arXiv

Error-correcting Codes for Noisy Duplication Channels

Because of its high data density and longevity, DNA is emerging as a promising candidate for satisfying increasing data storage needs. Compared to conventional storage media, however, data stored in DNA is subject to a wider range of errors resulting from various processes involved in the data storage pipeline. In this paper, we consider correcting duplication errors for both exact and noisy tandem duplications of a given length k. An exact duplication inserts a copy of a substring of length k of the sequence immediately after that substring, e.g., ACGT to ACGACGT, where k = 3, while a noisy duplication inserts a copy suffering from substitution noise, e.g., ACGT to ACGATGT. Specifically, we design codes that can correct any number of exact duplication errors and one noisy duplication error, where in the noisy duplication case the copy is at Hamming distance 1 from the original. Our constructions rely upon recovering the duplication root of the stored codeword. We characterize the ways in which duplication errors manifest in the root of affected sequences and design efficient codes for correcting these error patterns. We show that the proposed construction is asymptotically optimal, in the sense that it has the same asymptotic rate as optimal codes correcting exact duplications only.
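The two error types described in this abstract can be sketched as string operations. The function names and parameters below are hypothetical and only model the channel; they are not the paper's code construction. The example calls reproduce the abstract's own strings for k = 3.

```python
def tandem_duplicate(seq: str, start: int, k: int) -> str:
    """Insert an exact copy of seq[start:start+k] right after that substring."""
    if k <= 0 or start < 0 or start + k > len(seq):
        raise ValueError("substring out of range")
    block = seq[start:start + k]
    return seq[:start + k] + block + seq[start + k:]


def noisy_tandem_duplicate(seq: str, start: int, k: int,
                           error_pos: int, new_symbol: str) -> str:
    """Insert a copy of seq[start:start+k] with one substituted symbol.

    The inserted copy is at Hamming distance 1 from the original block:
    position error_pos of the copy is replaced by new_symbol.
    """
    if k <= 0 or start < 0 or start + k > len(seq):
        raise ValueError("substring out of range")
    block = seq[start:start + k]
    if not 0 <= error_pos < k or block[error_pos] == new_symbol:
        raise ValueError("substitution must change exactly one symbol")
    copy = block[:error_pos] + new_symbol + block[error_pos + 1:]
    return seq[:start + k] + copy + seq[start + k:]


exact = tandem_duplicate("ACGT", 0, 3)                 # copies "ACG"
noisy = noisy_tandem_duplicate("ACGT", 0, 3, 1, "T")   # copy becomes "ATG"
```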

preprint · 2020 · arXiv

Single-Error Detection and Correction for Duplication and Substitution Channels

Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitutions is unrestricted. Both error-detecting and error-correcting codes are constructed, which can handle correctly any number of tandem duplications of a fixed length $k$, and at most a single substitution occurring at any time during the mutation process.

preprint · 2016 · arXiv

Duplication Distance to the Root for Binary Sequences

We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\mathbf{x} = \mathbf{a}\mathbf{b}\mathbf{c} \to \mathbf{y} = \mathbf{a}\mathbf{b}\mathbf{b}\mathbf{c}$, where $\mathbf{x}$ and $\mathbf{y}$ are sequences and $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length $n$ by $f(n)$, we prove that $f(n)=\Theta(n)$. For the case of approximate duplication, where a $\beta$-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in $n$ to logarithmic at $\beta=1/2$. We also study the duplication distance to the root for sequences with a given root and for special classes of sequences, namely, the de Bruijn sequences, the Thue-Morse sequence, and the Fibonacci words. The problem is motivated by genomic tandem duplication mutations and the smallest number of tandem duplication events required to generate a given biological sequence.

preprint · 2016 · arXiv

Duplication-Correcting Codes for Data Storage in the DNA of Living Organisms

The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we provide error-correcting codes for errors caused by tandem duplications, which create a copy of a block of the sequence and insert it in a tandem manner, i.e., next to the original. In particular, we present two families of codes for correcting errors due to tandem duplications of a fixed length; the first family can correct any number of errors while the second corrects a bounded number of errors. We also study codes for correcting tandem duplications of length up to a given constant $k$, where we are primarily focused on the cases of $k=2,3$. Finally, we provide a full classification of the sets of lengths allowed in tandem duplication that result in a unique root for all sequences.
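The notion of a root can be illustrated for a single fixed duplication length. The sketch below repeatedly removes the second half of any adjacent pair of identical length-k blocks until none remains. The function name is hypothetical, and the sketch assumes that for one fixed length the result does not depend on the order of removals; it is an illustration, not the decoder from the paper.

```python
def duplication_root(seq: str, k: int) -> str:
    """Undo tandem duplications of length k until no such repeat remains."""
    if k <= 0:
        raise ValueError("k must be positive")
    i = 0
    while i + 2 * k <= len(seq):
        if seq[i:i + k] == seq[i + k:i + 2 * k]:
            # Drop the second copy, then rescan from the start, because the
            # removal can create a new repeat earlier in the string.
            seq = seq[:i + k] + seq[i + 2 * k:]
            i = 0
        else:
            i += 1
    return seq


root_single = duplication_root("ACGACGT", 3)       # one duplication of "ACG"
root_double = duplication_root("ACGACGACGT", 3)    # two duplications of "ACG"
root_len2 = duplication_root("AGTCTGTGC", 2)       # one duplication of "TG"
```

Rescanning from the start after every removal keeps the logic simple at the cost of quadratic time, which is adequate for short illustrative strings.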

preprint · 2015 · arXiv

Capacity and Expressiveness of Genomic Tandem Duplication

The majority of the human genome consists of repeated sequences. An important type of repeated sequence common in the human genome is the tandem repeat, in which identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat, which may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibility of generating a large number of sequences from a seed, i.e., a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system's generating power. Our results include exact capacity values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the expressiveness of a tandem duplication system as the capability of expressing arbitrary substrings. We then completely characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. In particular, based on a celebrated result by Axel Thue from 1906, presenting a construction for ternary square-free sequences, we show that for alphabets of size 4 or larger, bounded tandem duplication systems, regardless of the seed and the bound on duplication length, are not fully expressive, i.e., they cannot generate all strings even as substrings of other strings. Note that the alphabet of size 4 is of particular interest as it pertains to the genomic alphabet. Building on this result, we also show that these systems do not have full capacity. In general, our results illustrate that duplication lengths play a more significant role than the seed in generating a large number of sequences for these systems.

preprint · 2014 · arXiv

Computing Similarity Distances Between Rankings

We address the problem of computing distances between rankings that take into account similarities between candidates. The need for evaluating such distances is governed by applications as diverse as rank aggregation, bioinformatics, social sciences and data storage. The problem may be summarized as follows: Given two rankings and a positive cost function on transpositions that depends on the similarity of the candidates involved, find a smallest cost sequence of transpositions that converts one ranking into another. Our focus is on costs that may be described via special metric-tree structures and on complete rankings modeled as permutations. The presented results include a quadratic-time algorithm for finding a minimum cost decomposition for simple cycles, and a quadratic-time, $4/3$-approximation algorithm for permutations that contain multiple cycles. The proposed methods rely on investigating a newly introduced balancing property of cycles embedded in trees, cycle-merging methods, and shortest path optimization techniques.

preprint · 2014 · arXiv

Rate-Distortion for Ranking with Incomplete Information

We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kendall Tau metric we provide bounds for small, medium, and large distortion regimes, while for the Chebyshev metric we present bounds that are valid for all distortions and are especially accurate for small distortions. In addition, for the Chebyshev metric, we provide a construction for covering codes.
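For readers unfamiliar with the two metrics named in this abstract, the sketch below computes both under their standard definitions. It assumes each permutation is given as a rank vector, where `p[i]` is the rank assigned to item `i`; that convention and the function names are choices made here for illustration. It does not touch the rate-distortion bounds themselves.

```python
from itertools import combinations


def kendall_tau(p, q) -> int:
    """Number of item pairs that rank vectors p and q order differently."""
    if len(p) != len(q):
        raise ValueError("rank vectors must have the same length")
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )


def chebyshev(p, q) -> int:
    """Largest rank difference |p[i] - q[i]| over all items."""
    if len(p) != len(q):
        raise ValueError("rank vectors must have the same length")
    return max(abs(a - b) for a, b in zip(p, q))


identity = [1, 2, 3, 4]
swapped = [2, 1, 4, 3]     # two adjacent swaps away from the identity
reversed_ = [4, 3, 2, 1]   # every pair is out of order

d_kendall = kendall_tau(identity, swapped)
d_chebyshev = chebyshev(identity, swapped)
```

The pairwise loop is quadratic in the number of items; a merge-sort based inversion count would be faster but is longer to write.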

preprint · 2014 · arXiv

The Capacity of String-Replication Systems

It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple replication rules, including those resembling genomic replication processes. In other words, our goal is to find out the capacity, or the expressive power, of these string-replication systems. Our results include exact capacities, and bounds on the capacities, of four fundamental string-replication systems.

preprint · 2013 · arXiv

Error-Correction in Flash Memories via Codes in the Ulam Metric

We consider rank modulation codes for flash memories that allow for handling arbitrary charge-drop errors. Unlike classical rank modulation codes used for correcting errors that manifest themselves as swaps of two adjacently ranked elements, the proposed translocation rank codes account for more general forms of errors that arise in storage systems. Translocations represent a natural extension of the notion of adjacent transpositions and as such may be analyzed using related concepts in combinatorics and rank modulation coding. Our results include derivation of the asymptotic capacity of translocation rank codes, construction techniques for asymptotically good codes, as well as simple decoding methods for one class of constructed codes. As part of our exposition, we also highlight the close connections between the new code family and permutations with short common subsequences, deletion and insertion error-correcting codes for permutations, and permutation codes in the Hamming distance.
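The Ulam metric in the title, and its link to common subsequences, can be made concrete with a short sketch. It uses the standard identity that the Ulam distance between two permutations of the same items equals their length minus the length of a longest common subsequence, computed here through a longest increasing subsequence. The function name is an assumption, and the code illustrates the metric only, not the code constructions.

```python
from bisect import bisect_left


def ulam_distance(p, q) -> int:
    """Length of p minus the longest common subsequence of p and q.

    p and q must be orderings of the same distinct items.
    """
    if len(p) != len(q) or set(p) != set(q):
        raise ValueError("p and q must be permutations of the same items")
    position_in_q = {item: i for i, item in enumerate(q)}
    mapped = [position_in_q[item] for item in p]
    # Patience-style longest increasing subsequence on the mapped positions.
    tails = []
    for x in mapped:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(p) - len(tails)


# Moving item 1 from the front to the fourth position is one translocation.
one_move = ulam_distance([1, 2, 3, 4, 5], [2, 3, 4, 1, 5])
```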

preprint · 2013 · arXiv

MetaPar: Metagenomic Sequence Assembly via Iterative Reclassification

We introduce a parallel algorithmic architecture for metagenomic sequence assembly, termed MetaPar, which allows for significant reductions in assembly time and consequently enables the processing of large genomic datasets on computers with low memory usage. The gist of the approach is to iteratively perform read (re)classification based on phylogenetic marker genes and assembler outputs generated from random subsets of metagenomic reads. Once a sufficiently accurate classification within genera is performed, de novo metagenomic assemblers (such as Velvet or IDBA-UD) or reference based assemblers may be used for contig construction. We analyze the performance of MetaPar on synthetic data consisting of 15 randomly chosen species from the NCBI database through the effective gap and effective coverage metrics.

preprint · 2012 · arXiv

A General Framework for Distributed Vote Aggregation

We present a general model for opinion dynamics in a social network together with several possibilities for object selections at times when the agents are communicating. We study the limiting behavior of such a dynamics and show that this dynamics almost surely converges. We consider some special implications of the convergence result for gossip and top-$k$ selective gossip models. In particular, we provide an answer to the open problem of the convergence property of the top-$k$ selective gossip model, and show that the convergence holds in a much more general setting. Moreover, we propose an extension of the gossip and top-$k$ selective gossip models and provide some results for their limiting behavior.

preprint · 2012 · arXiv

A Novel Distance-Based Approach to Constrained Rank Aggregation

We consider a classical problem in choice theory -- vote aggregation -- using novel distance measures between permutations that arise in several practical applications. The distance measures are derived through an axiomatic approach, taking into account various issues arising in voting with side constraints. The side constraints of interest include non-uniform relevance of the top and the bottom of rankings (or equivalently, eliminating negative outliers in votes) and similarities between candidates (or equivalently, introducing diversity in the voting process). The proposed distance functions may be seen as weighted versions of the Kendall $\tau$ distance and weighted versions of the Cayley distance. In addition to proposing the distance measures and providing the theoretical underpinnings for their applications, we also consider algorithmic aspects associated with distance-based aggregation processes. We focus on two methods. One method is based on approximating weighted distance measures by a generalized version of Spearman's footrule distance, and it has provable constant approximation guarantees. The second class of algorithms is based on a non-uniform Markov chain method inspired by PageRank, for which currently only heuristic guarantees are known. We illustrate the performance of the proposed algorithms for a number of distance measures for which the optimal solution may be easily computed.
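The idea of approximating a Kendall-type distance by a footrule-type distance has a classical unweighted counterpart, the Diaconis-Graham inequality K <= F <= 2K, which the sketch below checks exhaustively for small sizes. This covers only the plain, unweighted distances; the weighted and generalized versions proposed in the paper are not reproduced here, and the function names and rank-vector convention are assumptions.

```python
from itertools import combinations, permutations


def footrule(p, q) -> int:
    """Spearman's footrule: sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kendall_tau(p, q) -> int:
    """Number of item pairs that rank vectors p and q order differently."""
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )


def footrule_bounds_hold(n: int) -> bool:
    """Check K <= F <= 2K for every pair of rank vectors on n items."""
    ranks = list(permutations(range(1, n + 1)))
    return all(
        kendall_tau(p, q) <= footrule(p, q) <= 2 * kendall_tau(p, q)
        for p in ranks
        for q in ranks
    )


f_example = footrule((1, 2, 3, 4), (2, 1, 4, 3))
k_example = kendall_tau((1, 2, 3, 4), (2, 1, 4, 3))
```

Because the footrule is computable in linear time, a bound of this kind turns it into a factor-2 stand-in for the Kendall distance in the unweighted setting.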

preprint · 2012 · arXiv

Alternating Markov Chains for Distribution Estimation in the Presence of Errors

We consider a class of small-sample distribution estimators over noisy channels. Our estimators are designed for repetition channels, and rely on properties of the runs of the observed sequences. These runs are modeled via a special type of Markov chains, termed alternating Markov chains. We show that alternating chains have redundancy that scales sub-linearly with the lengths of the sequences, and describe how to use a distribution estimator for alternating chains for the purpose of distribution estimation over repetition channels.

preprint · 2012 · arXiv

Nonuniform Vote Aggregation Algorithms

We consider the problem of non-uniform vote aggregation, and in particular, the algorithmic aspects associated with the aggregation process. For a novel class of weighted distance measures on votes, we present two different aggregation methods. The first algorithm is based on approximating the weighted distance measure by Spearman's footrule distance, with provable constant approximation guarantees. The second algorithm is based on a non-uniform Markov chain method inspired by PageRank, for which currently only heuristic guarantees are known. We illustrate the performance of the proposed algorithms on a number of distance measures for which the optimal solution may be easily computed.

preprint · 2012 · arXiv

Novel Distance Measures for Vote Aggregation

We consider the problem of rank aggregation based on new distance measures derived through axiomatic approaches and based on score-based methods. In the first scenario, we derive novel distance measures that allow for discriminating between the ranking process of highest and lowest ranked elements in the list. These distance functions represent weighted versions of Kendall's tau measure and may be computed efficiently in polynomial time. Furthermore, we describe how such axiomatic approaches may be extended to the study of score-based aggregation and present the first analysis of distributed vote aggregation over networks.

preprint · 2010 · arXiv

Sorting of Permutations by Cost-Constrained Transpositions

We address the problem of finding the minimum decomposition of a permutation in terms of transpositions with non-uniform cost. For arbitrary non-negative cost functions, we describe polynomial-time, constant-approximation decomposition algorithms. For metric-path costs, we describe exact polynomial-time decomposition algorithms. Our algorithms represent a combination of Viterbi-type algorithms and graph-search techniques for minimizing the cost of individual transpositions, and dynamic programming algorithms for finding minimum cost cycle decompositions. The presented algorithms have applications in information theory, bioinformatics, and algebra.
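As a baseline for the non-uniform problem described in this abstract, the sketch below handles the uniform case in which every transposition costs 1. There the minimum number of transpositions needed to sort a permutation is its length minus its number of cycles. The function names and the 0-based convention (`perm[i]` is the value at position `i`) are assumptions, and the cost-aware algorithms of the paper are not implemented.

```python
def cycles(perm):
    """Cycle decomposition of perm, where perm[i] is the image of i."""
    seen = [False] * len(perm)
    result = []
    for start in range(len(perm)):
        if not seen[start]:
            cycle = []
            i = start
            while not seen[i]:
                seen[i] = True
                cycle.append(i)
                i = perm[i]
            result.append(cycle)
    return result


def min_transpositions(perm) -> int:
    """Minimum number of unit-cost transpositions that sort perm."""
    return len(perm) - len(cycles(perm))


def sorting_swaps(perm):
    """Return a shortest list of position swaps turning perm into the identity."""
    perm = list(perm)
    swaps = []
    for i in range(len(perm)):
        while perm[i] != i:
            j = perm[i]
            # Send value j to position j; this splits one cycle into two.
            perm[i], perm[j] = perm[j], perm[i]
            swaps.append((i, j))
    return swaps


example = [1, 2, 0, 4, 3]   # one 3-cycle and one 2-cycle
```

With non-uniform costs the number of swaps is no longer the right objective, which is where the cost-constrained decompositions studied in the paper come in.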
