Source author record

Olgica Milenkovic

Olgica Milenkovic appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Catalog footprint

What is connected

56 works

20 topics

4 close collaborators

preprint, 2026, arXiv

Spatially-Coupled Network RNA Velocities: A Control-Theoretic Perspective

RNA velocity is an important model that combines cellular spliced and unspliced RNA counts to infer dynamical properties of various regulatory functions. Despite its wide applicability and many variants used in practice, the model has not been adequately designed to directly account for both intracellular gene regulatory network interactions and spatial intercellular communications. Here, we propose a new RNA velocity approach that jointly and directly captures two new network structures: an intracellular gene regulatory network (GRN) and an intercellular interaction network that captures interactions between (neighboring) cells, with relevance to spatial transcriptomics. We theoretically analyze this two-level network system through the lens of control and consensus theory. In particular, we investigate network equilibria, stability, cellular network consensus, and optimal control approaches for targeted drug intervention.

preprint, 2022, arXiv

Balanced and Swap-Robust Trades for Dynamical Distributed Storage

Trades, introduced by Hedayat, are two sets of blocks of elements which may be exchanged (traded) without altering the counts of certain subcollections of elements within their constituent blocks. They are of importance in applications where certain combinations of elements dynamically become prohibited from being placed in the same group of elements, since in this case one can trade the offending blocks with allowed ones. This is particularly the case in distributed storage systems, where due to privacy and other constraints, data of some groups of users cannot be stored together on the same server. We introduce a new class of balanced trades, important for access balancing of servers, and perturbation resilient balanced trades, important for studying the stability of server access frequencies with respect to changes in data popularity. The constructions and bounds on our new trade schemes rely on specialized selections of defining sets in minimal trades and number-theoretic analyses.

preprint, 2022, arXiv

HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering

The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is "more" tree-like, when evaluated in terms of Gromov's $\delta$-hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, $\ell_p$ norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is $125.94\%$.
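As a rough illustration of what "tree-like" means in the abstract above, the sketch below computes Gromov's $\delta$ from a distance matrix using the standard four-point condition. It is not the HyperAid method; the function name `four_point_delta` and the toy matrices are assumptions made for this example.

```python
from itertools import combinations


def four_point_delta(dist):
    """Four-point Gromov delta of a finite metric given as a square matrix.

    For every quadruple of points, the three pairwise distance sums are
    sorted; delta is half the largest gap between the two biggest sums.
    A tree metric gives delta == 0.
    """
    n = len(dist)
    delta = 0.0
    for x, y, z, w in combinations(range(n), 4):
        sums = sorted((
            dist[x][y] + dist[z][w],
            dist[x][z] + dist[y][w],
            dist[x][w] + dist[y][z],
        ))
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta


# Shortest-path metric of a path on 4 vertices (a tree).
path_metric = [[abs(i - j) for j in range(4)] for i in range(4)]

# Shortest-path metric of a 4-cycle with unit edges (not a tree).
cycle_metric = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
```

The path metric yields a delta of zero, while the 4-cycle yields a strictly positive value, which is the sense in which a smaller delta indicates data closer to a tree.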

preprint, 2022, arXiv

Linear Classifiers in Product Space Forms

Embedding methods for product spaces are powerful techniques for low-distortion and low-dimensional representation of complex data structures. Here, we address the new problem of linear classification in product space forms -- products of Euclidean, spherical, and hyperbolic spaces. First, we describe novel formulations for linear classifiers on a Riemannian manifold using geodesics and Riemannian metrics which generalize straight lines and inner products in vector spaces. Second, we prove that linear classifiers in $d$-dimensional space forms of any curvature have the same expressive power, i.e., they can shatter exactly $d+1$ points. Third, we formalize linear classifiers in product space forms, describe the first known perceptron and support vector machine classifiers for such spaces and establish rigorous convergence results for perceptrons. Moreover, we prove that the Vapnik-Chervonenkis dimension of linear classifiers in a product space form of dimension $d$ is \emph{at least} $d+1$. We support our theoretical findings with simulations on several datasets, including synthetic data, image data, and single-cell RNA sequencing (scRNA-seq) data. The results show that classification in low-dimensional product space forms for scRNA-seq data offers, on average, a performance improvement of $\sim15\%$ when compared to that in Euclidean spaces of the same dimension.

preprint, 2022, arXiv

Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction

Learning on graphs has attracted significant attention in the learning community due to numerous real-world applications. In particular, graph neural networks (GNNs), which take numerical node features and graph structure as inputs, have been shown to achieve state-of-the-art performance on various graph-related learning tasks. Recent works exploring the correlation between numerical node features and graph structure via self-supervised learning have paved the way for further performance improvements of GNNs. However, methods used for extracting numerical node features from raw data are still graph-agnostic within standard GNN pipelines. This practice is sub-optimal as it prevents one from fully utilizing potential correlations between graph topology and node attributes. To mitigate this issue, we propose a new self-supervised learning framework, Graph Information Aided Node feature exTraction (GIANT). GIANT makes use of the eXtreme Multi-label Classification (XMC) formalism, which is crucial for fine-tuning the language model based on graph information, and scales to large datasets. We also provide a theoretical analysis that justifies the use of XMC over link prediction and motivates integrating XR-Transformers, a powerful method for solving XMC problems, into the GIANT framework. We demonstrate the superior performance of GIANT over the standard GNN pipeline on Open Graph Benchmark datasets: For example, we improve the accuracy of the top-ranked method GAMLP from $68.25\%$ to $69.67\%$, SGC from $63.29\%$ to $66.10\%$ and MLP from $47.24\%$ to $61.10\%$ on the ogbn-papers100M dataset by leveraging GIANT.

preprint, 2022, arXiv

Provably Accurate and Scalable Linear Classifiers in Hyperbolic Spaces

Many high-dimensional practical data sets have hierarchical structures induced by graphs or time series. Such data sets are hard to process in Euclidean spaces and one often seeks low-dimensional embeddings in other space forms to perform the required learning tasks. For hierarchical data, the space of choice is a hyperbolic space because it guarantees low-distortion embeddings for tree-like structures. The geometry of hyperbolic spaces has properties not encountered in Euclidean spaces that pose challenges when trying to rigorously analyze algorithmic solutions. We propose a unified framework for learning scalable and simple hyperbolic linear classifiers with provable performance guarantees. The gist of our approach is to focus on Poincaré ball models and formulate the classification problems using tangent space formalisms. Our results include a new hyperbolic perceptron algorithm as well as an efficient and highly accurate convex optimization setup for hyperbolic support vector machine classifiers. Furthermore, we adapt our approach to accommodate second-order perceptrons, where data is preprocessed based on second-order information (correlation) to accelerate convergence, and strategic perceptrons, where potentially manipulated data arrives in an online manner and decisions are made sequentially. The excellent performance of the Poincaré second-order and strategic perceptrons shows that the proposed framework can be extended to general machine learning problems in hyperbolic spaces. Our experimental results, pertaining to synthetic, single-cell RNA-seq expression measurements, CIFAR10, Fashion-MNIST and mini-ImageNet, establish that all algorithms provably converge and have complexity comparable to those of their Euclidean counterparts. Accompanying codes can be found at: https://github.com/thupchnsky/PoincareLinearClassification.

preprint, 2022, arXiv

You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks

Hypergraphs are used to model higher-order interactions amongst agents and there exist many practically relevant instances of hypergraph datasets. To enable efficient processing of hypergraph-structured data, several hypergraph neural network platforms have been proposed for learning hypergraph properties and structure, with a special focus on node classification. However, almost all existing methods use heuristic propagation rules and offer suboptimal performance on many datasets. We propose AllSet, a new hypergraph neural network paradigm that represents a highly general framework for (hyper)graph neural networks and for the first time implements hypergraph neural network layers as compositions of two multiset functions that can be efficiently learned for each task and each dataset. Furthermore, AllSet draws on new connections between hypergraph neural networks and recent advances in deep learning of multiset functions. In particular, the proposed architecture utilizes Deep Sets and Set Transformer architectures that allow for significant modeling flexibility and offer high expressive power. To evaluate the performance of AllSet, we conduct the most extensive experiments to date involving ten known benchmarking datasets and three newly curated datasets that represent significant challenges for hypergraph node classification. The results demonstrate that AllSet has the unique ability to consistently either match or outperform all other hypergraph neural networks across the tested datasets.

preprint, 2021, arXiv

Image processing in DNA

The main obstacles for the practical deployment of DNA-based data storage platforms are the prohibitively high cost of synthetic DNA and the large number of errors introduced during synthesis. In particular, synthetic DNA products contain both individual oligo (fragment) symbol errors as well as missing DNA oligo errors, with rates that exceed those of modern storage systems by orders of magnitude. These errors can be corrected either through the use of a large number of redundant oligos or through cycles of writing, reading, and rewriting of information that eliminate the errors. Both approaches add to the overall storage cost and are hence undesirable. Here we propose the first method for storing quantized images in DNA that uses signal processing and machine learning techniques to deal with error and cost issues without resorting to the use of redundant oligos or rewriting. Our methods rely on decoupling the RGB channels of images, performing specialized quantization and compression on the individual color channels, and using new discoloration detection and image inpainting techniques. We demonstrate the performance of our approach experimentally on a collection of movie posters stored in DNA.

preprint, 2021, arXiv

Semiquantitative Group Testing in at Most Two Rounds

Semiquantitative group testing (SQGT) is a pooling method in which the test outcomes represent bounded intervals for the number of defectives. Alternatively, it may be viewed as an adder channel with quantized outputs. SQGT represents a natural choice for Covid-19 group testing as it allows for a straightforward interpretation of the cycle threshold values produced by polymerase chain reactions (PCR). Prior work on SQGT did not address the need for adaptive testing with a small number of rounds as required in practice. We propose conceptually simple methods for 2-round and nonadaptive SQGT that significantly improve upon existing schemes by using ideas on nonbinary measurement matrices based on expander graphs and list-disjunct matrices.

preprint, 2020, arXiv

Group Testing with Runlength Constraints for Topological Molecular Storage

Motivated by applications in topological DNA-based data storage, we introduce and study a novel setting of Non-Adaptive Group Testing (NAGT) with runlength constraints on the columns of the test matrix, in the sense that any two 1's must be separated by a run of at least d 0's. We describe and analyze a probabilistic construction of a runlength-constrained scheme in the zero-error and vanishing error settings, and show that the number of tests required by this construction is optimal up to logarithmic factors in the runlength constraint d and the number of defectives k in both cases. Surprisingly, our results show that runlength-constrained NAGT is not more demanding than unconstrained NAGT when d=O(k), and that for almost all choices of d and k it is not more demanding than NAGT with a column Hamming weight constraint only. Towards obtaining runlength-constrained Quantitative NAGT (QNAGT) schemes with good parameters, we also provide lower bounds for this setting and a nearly optimal probabilistic construction of a QNAGT scheme with a column Hamming weight constraint.
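The runlength constraint described above can be checked mechanically. The following is a minimal sketch, assuming test-matrix columns are given as lists of 0/1 integers; the helper name `satisfies_runlength` is hypothetical and not taken from the paper.

```python
def satisfies_runlength(column, d):
    """Return True if any two 1's in `column` are separated by at least d 0's."""
    last_one = None  # index of the most recent 1 seen so far
    for i, bit in enumerate(column):
        if bit == 1:
            # The number of 0's between consecutive 1's is i - last_one - 1.
            if last_one is not None and i - last_one - 1 < d:
                return False
            last_one = i
    return True


column = [1, 0, 0, 1, 0, 0, 0, 1]  # gaps of 2 and 3 zeros between 1's
```

Since only consecutive 1's need to be compared, the check runs in a single pass over the column.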

preprint, 2020, arXiv

Mass Error-Correction Codes for Polymer-Based Data Storage

We consider the problem of correcting mass readout errors in information encoded in binary polymer strings. Our work builds on results for string reconstruction problems using composition multisets [Acharya et al., 2015] and the unique string reconstruction framework proposed in [Pattabiraman et al., 2019]. Binary polymer-based data storage systems [Laure et al., 2016] operate by designing two molecules of significantly different masses to represent the symbols $\{0,1\}$ and perform readouts through noisy tandem mass spectrometry. Tandem mass spectrometers fragment the strings to be read into shorter substrings and only report their masses, often with errors due to imprecise ionization. Modeling the fragmentation process output in terms of composition multisets allows for designing asymptotically optimal codes capable of unique reconstruction and the correction of a single mass error [Pattabiraman et al., 2019] through the use of derivatives of Catalan paths. Nevertheless, no solutions for multiple-mass error-corrections are currently known. Our work addresses this issue by describing the first multiple-error correction codes that use the polynomial factorization approach for the Turnpike problem [Skiena et al., 1990] and the related factorization described in [Acharya et al., 2015]. Adding Reed-Solomon type coding redundancy into the corresponding polynomials allows for correcting $t$ mass errors in polynomial time using $t^2\, \log\,k$ redundant bits, where $k$ is the information string length. The redundancy can be improved to $\log\,k + t$. However, no decoding algorithm for this scheme that runs in time polynomial in both $t$ and $n$ is currently known, where $n$ is the length of the coded string.

preprint, 2020, arXiv

MaxMinSum Steiner Systems for Access-Balancing in Distributed Storage

Many code families such as low-density parity-check codes, fractional repetition codes, batch codes and private information retrieval codes with low storage overhead rely on the use of combinatorial block designs or derivatives thereof. In the context of distributed storage applications, one is often faced with system design issues that impose additional constraints on the coding schemes, and therefore on the underlying block designs. Here, we address one such problem, pertaining to server access frequency balancing, by introducing a new form of Steiner systems, termed MaxMinSum Steiner systems. MaxMinSum Steiner systems are characterized by the property that the minimum value of the sum of points (elements) within a block is maximized, or that the minimum sum of block indices containing some fixed point is maximized. We show that proper relabelings of points in the Bose and Skolem constructions for Steiner triple systems lead to optimal MaxMin values for the sums of interest; for the duals of the designs, we exhibit block labelings that are within a 3/4 multiplicative factor from the optimum.

preprint, 2020, arXiv

Repairing Reed-Solomon Codes via Subspace Polynomials

We propose new repair schemes for Reed-Solomon codes that use subspace polynomials and hence generalize previous works in the literature that employ trace polynomials. The Reed-Solomon codes are over $\mathbb{F}_{q^\ell}$ and have redundancy $r = n-k \geq q^m$, $1\leq m\leq \ell$, where $n$ and $k$ are the code length and dimension, respectively. In particular, for one erasure, we show that our schemes can achieve optimal repair bandwidths whenever $n=q^\ell$ and $r = q^m$, for all $1 \leq m \leq \ell$. For two erasures, our schemes use the same bandwidth per erasure as the single erasure schemes in three cases: when $\ell/m$ is a power of $q$; when $\ell=q^a$ and $m=q^b-1>1$ ($a \geq b \geq 1$); and when $m\geq \ell/2$, $\ell$ is even and $q$ is a power of two.

preprint, 2020, arXiv

Repairing Reed-Solomon Codes With Multiple Erasures

Despite their exceptional error-correcting properties, Reed-Solomon codes have been overlooked in distributed storage applications due to the common belief that they have poor repair bandwidth: A naive repair approach would require the whole file to be reconstructed in order to recover a single erased codeword symbol. In a recent work, Guruswami and Wootters (STOC'16) proposed a single-erasure repair method for Reed-Solomon codes that achieves the optimal repair bandwidth amongst all linear encoding schemes. Their key idea is to recover the erased symbol by collecting a sufficiently large number of its traces, each of which can be constructed from a number of traces of other symbols. We extend the trace collection technique to cope with two and three erasures.

preprint, 2020, arXiv

Support Estimation with Sampling Artifacts and Errors

The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, physics and biology. Most of the existing work in this domain has focused on settings that assume perfectly accurate sampling approaches, which is seldom true in practical data science. Here we introduce the first known approach to support estimation in the presence of sampling artifacts and errors where each sample is assumed to arise from a Poisson repeat channel which simultaneously captures repetitions and deletions of samples. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of so-called Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated using discretized semi-infinite programming methods. The estimation approach is tested on synthetic and textual data, as well as on GISAID data collected to address a new problem in computational biology: mutational support estimation in genes of the SARS-CoV-2 virus. In the latter setting, the Poisson channel captures the fact that many individuals are tested multiple times for the presence of viral RNA, thereby leading to repeated samples, while other individuals' results are not recorded due to test errors. For all experiments performed, we observed significant improvements of our integrated methods compared to those obtained through adequate modifications of state-of-the-art noiseless support estimation methods.

preprint, 2016, arXiv

A new correlation clustering method for cancer mutation analysis

Cancer genomes exhibit a large number of different alterations that affect many genes in a diverse manner. It is widely believed that these alterations follow combinatorial patterns that have a strong connection with the underlying molecular interaction networks and functional pathways. A better understanding of the generative mechanisms behind the mutation rules and their influence on gene communities is of great importance for the process of driver mutation discovery and for identification of network modules related to cancer development and progression. We developed a new method for cancer mutation pattern analysis based on a constrained form of correlation clustering. Correlation clustering is an agnostic learning method that can be used for general community detection problems in which the number of communities or their structure is not known beforehand. The resulting algorithm, named $C^3$, leverages mutual exclusivity of mutations, patient coverage, and driver network concentration principles; it accepts as its input a user-determined combination of heterogeneous patient data, such as that available from TCGA (including mutation, copy number, and gene expression information), and creates a large number of clusters containing mutually exclusive mutated genes in a particular type of cancer. The cluster sizes may be required to obey some useful soft size constraints, without impacting the computational complexity of the algorithm. To test $C^3$, we performed a detailed analysis on TCGA breast cancer and glioblastoma data and showed that our algorithm outperforms the state-of-the-art CoMEt method in terms of discovering mutually exclusive gene modules and identifying driver genes. Our $C^3$ method represents a unique tool for efficient and reliable identification of mutation patterns and driver pathways in large-scale cancer genomics studies.

preprint, 2016, arXiv

Asymmetric Lee Distance Codes for DNA-Based Storage

We consider a new family of codes, termed asymmetric Lee distance codes, that arise in the design and implementation of DNA-based storage systems and systems with parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes; furthermore, symbol confusability is dictated by their underlying binary representation. Our contributions are two-fold. First, we demonstrate that the new distance represents a linear combination of the Lee and Hamming distance and derive upper bounds on the size of the codes under this metric based on linear programming techniques. Second, we propose a number of code constructions which imply lower bounds.

preprint, 2016, arXiv

Balanced Permutation Codes

Motivated by charge balancing constraints for rank modulation schemes, we introduce the notion of balanced permutations and derive the capacity of balanced permutation codes. We also describe simple interleaving methods for permutation code constructions and show that they approach capacity.

preprint, 2016, arXiv

Correlation Clustering and Biclustering with Locally Bounded Errors

We consider a generalized version of the correlation clustering problem, defined as follows. Given a complete graph $G$ whose edges are labeled with $+$ or $-$, we wish to partition the graph into clusters while trying to avoid errors: $+$ edges between clusters or $-$ edges within clusters. Classically, one seeks to minimize the total number of such errors. We introduce a new framework that allows the objective to be a more general function of the number of errors at each vertex (for example, we may wish to minimize the number of errors at the worst vertex) and provide a rounding algorithm which converts "fractional clusterings" into discrete clusterings while causing only a constant-factor blowup in the number of errors at each vertex. This rounding algorithm yields constant-factor approximation algorithms for the discrete problem under a wide variety of objective functions.
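To make the per-vertex error notion above concrete, the sketch below counts, for a given clustering, the disagreements incident to each vertex of a complete signed graph. It only evaluates a clustering and does not implement the rounding algorithm; the function name `vertex_errors` and the input encoding (a list of `+` edges, with every other pair treated as `-`) are assumptions for this example.

```python
def vertex_errors(n, plus_edges, cluster_of):
    """Count clustering errors at each vertex of a complete signed graph.

    n          -- number of vertices, labeled 0..n-1
    plus_edges -- iterable of pairs (u, v) labeled '+'; all other pairs are '-'
    cluster_of -- list mapping each vertex to its cluster id

    An error is a '+' edge between clusters or a '-' edge within a cluster.
    """
    plus = {frozenset(edge) for edge in plus_edges}
    errors = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            same_cluster = cluster_of[u] == cluster_of[v]
            is_plus = frozenset((u, v)) in plus
            if is_plus != same_cluster:
                errors[u] += 1
                errors[v] += 1
    return errors


# A signed path 0 + 1 + 2 + 3 (all other pairs '-'), split into two clusters.
errs = vertex_errors(4, [(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1])
total_errors = sum(errs) // 2  # classical objective: each error is counted at both endpoints
worst_vertex = max(errs)       # example of a local objective mentioned in the abstract
```

The classical objective corresponds to `total_errors`, whereas the generalized framework allows objectives such as `worst_vertex` that depend on the whole vector `errs`.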

preprint, 2016, arXiv

Latent Network Features and Overlapping Community Discovery via Boolean Intersection Representations

We propose a new latent Boolean feature model for complex networks that captures different types of node interactions and network communities. The model is based on a new concept in graph theory, termed the Boolean intersection representation of a graph, which generalizes the notion of an intersection representation. We mostly focus on one form of Boolean intersection, termed cointersection, and describe how to use this representation to deduce node feature sets and their communities. We derive several general bounds on the minimum number of features used in cointersection representations and discuss graph families for which exact cointersection characterizations are possible. Our results also include algorithms for finding optimal and approximate cointersection representations of a graph.

preprint, 2016, arXiv

Weakly Mutually Uncorrelated Codes

We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based storage systems and synchronization protocols. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. In addition, WMU sequences used in DNA-based storage systems are required to have balanced compositions of symbols and to be at large mutual Hamming distance from each other. We present a number of constructions for balanced, error-correcting WMU codes using Dyck paths, Knuth's balancing principle, prefix synchronized and cyclic codes.
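The defining overlap property above can be tested by brute force. The sketch below assumes equal-length sequences and an explicit threshold `kappa` standing in for "sufficiently long"; both the parameter name and the function `is_wmu` are hypothetical, and the check ignores the additional balancing and Hamming distance requirements.

```python
def is_wmu(seqs, kappa):
    """Check that no proper suffix of length >= kappa of any sequence
    equals a prefix of the same or another sequence.

    Assumes all sequences in `seqs` are distinct and have the same length.
    """
    n = len(seqs[0])
    for a in seqs:
        for b in seqs:
            # Only proper overlaps (length < n) are considered.
            for length in range(kappa, n):
                if a[-length:] == b[:length]:
                    return False
    return True


good = ["AAGT", "CCGT"]   # no suffix of length >= 1 matches any prefix
bad = ["ACAC"]            # suffix "AC" equals prefix "AC"
```

Raising `kappa` weakens the constraint: `bad` fails for `kappa = 2` but passes for `kappa = 3`, since its only length-3 suffix `CAC` differs from the prefix `ACA`.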

preprint, 2015, arXiv

A Perspective on Future Research Directions in Information Theory

Information theory is rapidly approaching its 70th birthday. What are promising future directions for research in information theory? Where will information theory be having the most impact in 10-20 years? What new and emerging areas are ripe for the most impact, of the sort that information theory has had on the telecommunications industry over the last 60 years? How should the IEEE Information Theory Society promote high-risk new research directions and broaden the reach of information theory, while continuing to be true to its ideals and insisting on the intellectual rigor that makes its breakthroughs so powerful? These are some of the questions that an ad hoc committee (composed of the present authors) explored over the past two years. We have discussed and debated these questions, and solicited detailed inputs from experts in fields including genomics, biology, economics, and neuroscience. This report is the result of these discussions.

preprint, 2015, arXiv

A Rewritable, Random-Access DNA-Based Storage System

We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh density archival and rewritable storage applications.

preprint, 2015, arXiv

Code Construction and Decoding Algorithms for Semi-Quantitative Group Testing with Nonuniform Thresholds

We analyze a new group testing scheme, termed semi-quantitative group testing, which may be viewed as a concatenation of an adder channel and a discrete quantizer. Our focus is on non-uniform quantizers with arbitrary thresholds. For the most general semi-quantitative group testing model, we define three new families of sequences capturing the constraints on the code design imposed by the choice of the thresholds. The sequences represent extensions and generalizations of $B_h$ and certain types of super-increasing and lexicographically ordered sequences, and they lead to code structures amenable for efficient recursive decoding. We describe the decoding methods and provide an accompanying computational complexity and performance analysis.
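The channel model in the abstract above (an adder channel followed by a quantizer with arbitrary thresholds) can be simulated in a few lines. This is a minimal sketch of the forward model only, not of the code constructions or decoders; the function name `sqgt_outcomes` and the convention that outcome `i` means the defective count lies between the `i`-th and `(i+1)`-th thresholds are assumptions.

```python
from bisect import bisect_right


def sqgt_outcomes(test_matrix, defectives, thresholds):
    """Simulate semi-quantitative group testing outcomes.

    test_matrix -- list of 0/1 rows, one row per test (pool)
    defectives  -- set of column indices of defective items
    thresholds  -- sorted list of quantizer thresholds (may be non-uniform)

    Each test first counts the defectives in its pool (adder channel),
    then reports only the index of the threshold interval containing
    that count (quantizer).
    """
    outcomes = []
    for row in test_matrix:
        count = sum(row[j] for j in defectives)
        outcomes.append(bisect_right(thresholds, count))
    return outcomes


tests = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
# With thresholds [1, 3]: count 0 -> outcome 0, counts 1-2 -> 1, counts >= 3 -> 2.
observed = sqgt_outcomes(tests, {0, 1, 3}, [1, 3])
```

Choosing `thresholds = [1]` reduces the model to classical binary group testing, where each outcome only indicates whether the pool contains at least one defective.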

preprint, 2015, arXiv

Codes for DNA Sequence Profiles

We consider the problem of storing and retrieving information from synthetic DNA media. The mathematical basis of the problem is the construction and design of sequences that may be discriminated based on their collection of substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first investigated in the noiseless setting under the name of "Markov type" analysis. Here, we explain the connection between the reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes.

preprint, 2015, arXiv

Codes for DNA Storage Channels

We consider the problem of assembling a sequence based on a collection of its substrings observed through a noisy channel. The mathematical basis of the problem is the construction and design of sequences that may be discriminated based on a collection of their substrings observed through a noisy channel. We explain the connection between the sequence reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes.

preprint, 2015, arXiv

Correlation Clustering with Constrained Cluster Sizes and Extended Weights Bounds

We consider the problem of correlation clustering on graphs with constraints on both the cluster sizes and the positive and negative weights of edges. Our contributions are twofold: First, we introduce the problem of correlation clustering with bounded cluster sizes. Second, we extend the regime of weight values for which the clustering may be performed with constant approximation guarantees in polynomial time and apply the results to the bounded cluster size problem.

preprint, 2015, arXiv

DNA-Based Storage: Trends and Methods

We provide an overview of current approaches to DNA-based storage system design and accompanying synthesis, sequencing and editing methods. We also introduce and analyze a suite of new constrained coding schemes for both archival and random access DNA storage channels. The mathematical basis of our work is the construction and design of sequences over discrete alphabets that avoid pre-specified address patterns, have balanced base content, and exhibit other relevant substring constraints. These schemes adapt the stored signals to the DNA medium and thereby reduce the inherent error-rate of the system.

preprint2014arXiv

Computing Similarity Distances Between Rankings

We address the problem of computing distances between rankings that take into account similarities between candidates. The need for evaluating such distances is governed by applications as diverse as rank aggregation, bioinformatics, social sciences and data storage. The problem may be summarized as follows: Given two rankings and a positive cost function on transpositions that depends on the similarity of the candidates involved, find a smallest cost sequence of transpositions that converts one ranking into another. Our focus is on costs that may be described via special metric-tree structures and on complete rankings modeled as permutations. The presented results include a quadratic-time algorithm for finding a minimum cost decomposition for simple cycles, and a quadratic-time, $4/3$-approximation algorithm for permutations that contain multiple cycles. The proposed methods rely on investigating a newly introduced balancing property of cycles embedded in trees, cycle-merging methods, and shortest path optimization techniques.
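
For very small rankings, the minimum-cost transposition problem stated above can be checked by exhaustive search. The following is a minimal sketch of such a baseline, not the quadratic-time algorithm from the abstract: it runs Dijkstra's algorithm over all permutations, so it is only feasible for a handful of candidates. The function name and the example cost (distance between candidate labels on a path, which is one simple tree metric) are illustrative assumptions.

```python
import heapq
from itertools import combinations


def min_cost_transposition_distance(source, target, cost):
    """Cheapest sequence of transpositions turning source into target.

    cost(a, b) is the positive cost of swapping candidates a and b.
    Exhaustive Dijkstra search over permutations: small inputs only.
    """
    source, target = tuple(source), tuple(target)
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, perm = heapq.heappop(heap)
        if perm == target:
            return dist
        if dist > best[perm]:
            continue  # stale heap entry
        for i, j in combinations(range(len(perm)), 2):
            nxt = list(perm)
            nxt[i], nxt[j] = nxt[j], nxt[i]
            nxt = tuple(nxt)
            new_dist = dist + cost(perm[i], perm[j])
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(heap, (new_dist, nxt))
    raise ValueError("target is not a rearrangement of source")


def path_cost(a, b):
    # Hypothetical similarity cost: candidates sit on a path, cost is their distance.
    return abs(a - b)
```

With unit costs the search returns the ordinary transposition (Cayley) distance, which gives a quick sanity check before trying similarity-dependent costs such as `path_cost`.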

preprint2014arXiv

String Reconstruction from Substring Compositions

Motivated by mass-spectrometry protein sequencing, we consider a simply-stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime, can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with given substring compositions. The lower bounds are derived by combinatorial arguments and the upper bounds by algebraic considerations that precisely characterize the set of strings with the same substring compositions in terms of the factorization of bivariate polynomials. The problem can be viewed as a combinatorial simplification of the turnpike problem, and its solution may shed light on this long-standing problem as well. Using well-known results on transience of multi-dimensional random walks, we also provide a reconstruction algorithm that reconstructs random strings over alphabets of size $\ge 4$ in optimal near-quadratic time.
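
The object being reconstructed from can be made concrete with a short brute-force sketch. It computes the multiset of substring compositions (the symbol counts of every contiguous substring, with order discarded) and groups all strings of a given length by that multiset. The function names and the default binary alphabet are assumptions made for illustration; the paper's algebraic characterization is not reproduced here.

```python
from collections import Counter, defaultdict
from itertools import product


def composition_multiset(s):
    """Multiset of compositions of all contiguous substrings of s.

    A composition is represented by the substring with its symbols sorted.
    """
    comps = Counter()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            comps["".join(sorted(s[i:j]))] += 1
    return comps


def confusable_classes(n, alphabet="01"):
    """Group all length-n strings that share the same composition multiset."""
    classes = defaultdict(list)
    for symbols in product(alphabet, repeat=n):
        s = "".join(symbols)
        classes[frozenset(composition_multiset(s).items())].append(s)
    return list(classes.values())


def uniquely_reconstructable(n, alphabet="01"):
    """True if every class holds only a string and (possibly) its reversal."""
    return all(
        set(c) <= {c[0], c[0][::-1]} for c in confusable_classes(n, alphabet)
    )
```

A string and its reversal always share the same multiset, which is why uniqueness can only hold up to reversal. Calling `uniquely_reconstructable(n)` for small `n` is an inexpensive way to explore the length conditions stated in the abstract.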

preprint2014arXiv

Synchronizing Edits in Distributed Storage Networks

We consider the problem of synchronizing data in distributed storage networks under an edit model that includes deletions and insertions. We present two modifications of MDS, regenerating and locally repairable codes that allow updates in the parity-check values to be performed with one round of communication at low bit rates and using small storage overhead. Our main contributions are novel protocols for synchronizing both hot and semi-static data and protocols for data deduplication applications, based on intermediary permutation, Vandermonde and Cauchy matrix coding.

preprint2014arXiv

Synchronizing Rankings via Interactive Communication

We consider the problem of exact synchronization of two rankings at remote locations connected by a two-way channel. Such synchronization problems arise when items in the data are distinguishable, as is the case for playlists, tasklists, crowdvotes and recommender system rankings. Our model accounts for different constraints on the communication throughput of the forward and feedback links, resulting in different anchoring, syndrome and checksum computation strategies. Edits are assumed to take the form of deletions, insertions, block deletions/insertions, translocations and transpositions. The protocols developed under the given model are order-optimal with respect to genie-aided lower bounds.

preprint2013arXiv

Error-Correction in Flash Memories via Codes in the Ulam Metric

We consider rank modulation codes for flash memories that allow for handling arbitrary charge-drop errors. Unlike classical rank modulation codes used for correcting errors that manifest themselves as swaps of two adjacently ranked elements, the proposed \emph{translocation rank codes} account for more general forms of errors that arise in storage systems. Translocations represent a natural extension of the notion of adjacent transpositions and as such may be analyzed using related concepts in combinatorics and rank modulation coding. Our results include derivation of the asymptotic capacity of translocation rank codes, construction techniques for asymptotically good codes, as well as simple decoding methods for one class of constructed codes. As part of our exposition, we also highlight the close connections between the new code family and permutations with short common subsequences, deletion and insertion error-correcting codes for permutations, and permutation codes in the Hamming distance.
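
The link to common subsequences mentioned above can be illustrated with the standard definition of the Ulam distance, which the abstract does not spell out and is assumed here: for two permutations of the same n elements, it is n minus the length of their longest common subsequence. The sketch below relabels one permutation by positions in the other, which turns the problem into a longest increasing subsequence computation. The function name is illustrative.

```python
from bisect import bisect_left


def ulam_distance(pi, sigma):
    """n minus the longest common subsequence length of two permutations."""
    position = {value: idx for idx, value in enumerate(sigma)}
    # After relabeling, a common subsequence is an increasing subsequence.
    relabeled = [position[value] for value in pi]
    tails = []  # tails[k]: smallest last element of an increasing run of length k + 1
    for x in relabeled:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(pi) - len(tails)
```

Under this definition a single translocation (removing one element and reinserting it elsewhere) changes the distance by at most one, and an adjacent transposition is the special case where the element moves by one position.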

preprint2013arXiv

MCUIUC -- A New Framework for Metagenomic Read Compression

Metagenomics is an emerging field of molecular biology concerned with analyzing the genomes of environmental samples comprising many diverse organisms. Given the nature of metagenomic data, one usually has to sequence the genomic material of all organisms in a batch, leading to a mix of reads coming from different DNA sequences. In deep high-throughput sequencing experiments, the volume of the raw reads is extremely high, frequently exceeding 600 Gb. With an ever increasing demand for storing such reads for future studies, the issue of efficient metagenomic compression becomes of paramount importance. We present the first known approach to metagenome read compression, termed MCUIUC (Metagenomic Compression at UIUC). The gist of the proposed algorithm is to perform classification of reads based on unique organism identifiers, followed by reference-based alignment of reads for individually identified organisms, and metagenomic assembly of unclassified reads. Once assembly and classification are completed, lossless reference-based compression is performed via positional encoding. We evaluate the performance of the algorithm on moderate-sized synthetic metagenomic samples involving 15 randomly selected organisms and describe future directions for improving the proposed compression method.

preprint2013arXiv

MetaPar: Metagenomic Sequence Assembly via Iterative Reclassification

We introduce a parallel algorithmic architecture for metagenomic sequence assembly, termed MetaPar, which allows for significant reductions in assembly time and consequently enables the processing of large genomic datasets on computers with low memory usage. The gist of the approach is to iteratively perform read (re)classification based on phylogenetic marker genes and assembler outputs generated from random subsets of metagenomic reads. Once a sufficiently accurate classification within genera is performed, de novo metagenomic assemblers (such as Velvet or IDBA-UD) or reference based assemblers may be used for contig construction. We analyze the performance of MetaPar on synthetic data consisting of 15 randomly chosen species from the NCBI database through the effective gap and effective coverage metrics.

preprint2013arXiv

Semi-Quantitative Group Testing: A Unifying Framework for Group Testing with Applications in Genotyping

We propose a novel group testing method, termed semi-quantitative group testing, motivated by a class of problems arising in genome screening experiments. Semi-quantitative group testing (SQGT) is a (possibly) non-binary pooling scheme that may be viewed as a concatenation of an adder channel and an integer-valued quantizer. In its full generality, SQGT may be viewed as a unifying framework for group testing, in the sense that most group testing models are special instances of SQGT. For the new testing scheme, we define the notion of SQ-disjunct and SQ-separable codes, representing generalizations of classical disjunct and separable codes. We describe several combinatorial and probabilistic constructions for such codes. While for most of these constructions we assume that the number of defectives is much smaller than the total number of test subjects, we also consider the case in which there is no restriction on the number of defectives and they may be as large as the total number of subjects. For the codes constructed in this paper, we describe a number of efficient decoding algorithms. In addition, we describe a belief propagation decoder for sparse SQGT codes for which no other efficient decoder is currently known. Finally, we define the notion of capacity of SQGT and evaluate it for some special choices of parameters using information theoretic methods.
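
The "adder channel followed by a quantizer" view can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's formal model: the function name, the representation of the test matrix as a list of rows, and the convention that an outcome is the index of the threshold interval containing the pooled sum are all choices made here.

```python
from bisect import bisect_right


def sqgt_outcomes(test_matrix, defectives, thresholds):
    """Semi-quantitative test outcomes for a set of defective subjects.

    test_matrix[t][j] is the (possibly non-binary) amount of subject j in test t.
    thresholds is a sorted list; the outcome is the number of thresholds that the
    pooled sum meets or exceeds.
    """
    outcomes = []
    for row in test_matrix:
        pooled = sum(row[j] for j in defectives)           # adder channel
        outcomes.append(bisect_right(thresholds, pooled))  # integer-valued quantizer
    return outcomes
```

With `thresholds=[1]` each outcome only says whether a test contains at least one defective, which recovers classical binary group testing; finer threshold lists move the model toward a fully quantitative adder channel.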

preprint2012arXiv

A General Framework for Distributed Vote Aggregation

We present a general model for opinion dynamics in a social network together with several possibilities for object selections at times when the agents are communicating. We study the limiting behavior of such dynamics and show that they converge almost surely. We consider some special implications of the convergence result for gossip and top-$k$ selective gossip models. In particular, we provide an answer to the open problem of the convergence property of the top-$k$ selective gossip model, and show that the convergence holds in a much more general setting. Moreover, we propose an extension of the gossip and top-$k$ selective gossip models and provide some results for their limiting behavior.

preprint2012arXiv

A Novel Distance-Based Approach to Constrained Rank Aggregation

We consider a classical problem in choice theory -- vote aggregation -- using novel distance measures between permutations that arise in several practical applications. The distance measures are derived through an axiomatic approach, taking into account various issues arising in voting with side constraints. The side constraints of interest include non-uniform relevance of the top and the bottom of rankings (or equivalently, eliminating negative outliers in votes) and similarities between candidates (or equivalently, introducing diversity in the voting process). The proposed distance functions may be seen as weighted versions of the Kendall $\tau$ distance and weighted versions of the Cayley distance. In addition to proposing the distance measures and providing the theoretical underpinnings for their applications, we also consider algorithmic aspects associated with distance-based aggregation processes. We focus on two methods. One method is based on approximating weighted distance measures by a generalized version of Spearman's footrule distance, and it has provable constant approximation guarantees. The second class of algorithms is based on a non-uniform Markov chain method inspired by PageRank, for which currently only heuristic guarantees are known. We illustrate the performance of the proposed algorithms for a number of distance measures for which the optimal solution may be easily computed.
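
The unweighted versions of the two distances named above are easy to compute, and their classical relationship shows why a footrule-type distance can stand in for a Kendall-type distance at a constant-factor loss. The sketch below covers only the standard, unweighted Kendall tau and Spearman footrule distances; the weighted variants proposed in the paper are not reproduced, and the function names are illustrative.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of candidate pairs that the two rankings order differently."""
    pos_pi = {c: r for r, c in enumerate(pi)}
    pos_sigma = {c: r for r, c in enumerate(sigma)}
    return sum(
        (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
        for a, b in combinations(pi, 2)
    )


def spearman_footrule(pi, sigma):
    """Sum over candidates of the absolute difference in rank."""
    pos_pi = {c: r for r, c in enumerate(pi)}
    pos_sigma = {c: r for r, c in enumerate(sigma)}
    return sum(abs(pos_pi[c] - pos_sigma[c]) for c in pi)


def footrule_within_factor_two(n):
    """Check K <= F <= 2K (Diaconis-Graham) for all rankings of n candidates."""
    identity = tuple(range(n))
    return all(
        kendall_tau(p, identity)
        <= spearman_footrule(p, identity)
        <= 2 * kendall_tau(p, identity)
        for p in permutations(range(n))
    )
```

Because the footrule decomposes into a sum over candidates, aggregating under it reduces to an assignment problem, which is the usual reason it is preferred as a computational proxy.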

preprint2012arXiv

Alternating Markov Chains for Distribution Estimation in the Presence of Errors

We consider a class of small-sample distribution estimators over noisy channels. Our estimators are designed for repetition channels, and rely on properties of the runs of the observed sequences. These runs are modeled via a special type of Markov chains, termed alternating Markov chains. We show that alternating chains have redundancy that scales sub-linearly with the lengths of the sequences, and describe how to use a distribution estimator for alternating chains for the purpose of distribution estimation over repetition channels.

preprint2012arXiv

Causal Compressive Sensing for Gene Network Inference

We propose a novel framework for studying causal inference of gene interactions using a combination of compressive sensing and Granger causality techniques. The gist of the approach is to discover sparse linear dependencies between time series of gene expressions via a Granger-type elimination method. The method is tested on the Gardner dataset for the SOS network in E. coli, for which both known and unknown causal relationships are discovered.

preprint2012arXiv

Hybrid Noncoherent Network Coding

We describe a novel extension of subspace codes for noncoherent networks, suitable for use when the network is viewed as a communication system that introduces both dimension and symbol errors. We show that when symbol erasures occur in a significantly large number of different basis vectors transmitted through the network and when the min-cut of the network is much smaller than the length of the transmitted codewords, the new family of codes outperforms their subspace code counterparts. For the proposed coding scheme, termed hybrid network coding, we derive two upper bounds on the size of the codes. These bounds represent a variation of the Singleton and of the sphere-packing bound. We show that a simple concatenated scheme that represents a combination of subspace codes and Reed-Solomon codes is asymptotically optimal with respect to the Singleton bound. Finally, we describe two efficient decoding algorithms for concatenated subspace codes that in certain cases have smaller complexity than subspace decoders.

preprint2012arXiv

Nonuniform Vote Aggregation Algorithms

We consider the problem of non-uniform vote aggregation, and in particular, the algorithmic aspects associated with the aggregation process. For a novel class of weighted distance measures on votes, we present two different aggregation methods. The first algorithm is based on approximating the weighted distance measure by Spearman's footrule distance, with provable constant approximation guarantees. The second algorithm is based on a non-uniform Markov chain method inspired by PageRank, for which currently only heuristic guarantees are known. We illustrate the performance of the proposed algorithms on a number of distance measures for which the optimal solution may be easily computed.

preprint2012arXiv

Novel Distance Measures for Vote Aggregation

We consider the problem of rank aggregation based on new distance measures derived through axiomatic approaches and based on score-based methods. In the first scenario, we derive novel distance measures that allow for discriminating between the ranking process of highest and lowest ranked elements in the list. These distance functions represent weighted versions of Kendall's tau measure and may be computed efficiently in polynomial time. Furthermore, we describe how such axiomatic approaches may be extended to the study of score-based aggregation and present the first analysis of distributed vote aggregation over networks.

preprint2012arXiv

Semi-Quantitative Group Testing

We consider a novel group testing procedure, termed semi-quantitative group testing, motivated by a class of problems arising in genome sequence processing. Semi-quantitative group testing (SQGT) is a non-binary pooling scheme that may be viewed as a combination of an adder model followed by a quantizer. For the new testing scheme we define the capacity and evaluate the capacity for some special choices of parameters using information theoretic methods. We also define a new class of disjunct codes suitable for SQGT, termed SQ-disjunct codes. We also provide both explicit and probabilistic code construction methods for SQGT with simple decoding algorithms.

preprint2011arXiv

Information Theoretic Bounds for Tensor Rank Minimization over Finite Fields

We consider the problem of noiseless and noisy low-rank tensor completion from a set of random linear measurements. In our derivations, we assume that the entries of the tensor belong to a finite field of arbitrary size and that reconstruction is based on a rank minimization framework. The derived results show that the smallest number of measurements needed for exact reconstruction is upper bounded by the product of the rank, the order and the dimension of a cubic tensor. Furthermore, this condition is also sufficient for unique minimization. Similar bounds hold for the noisy rank minimization scenario, except for a scaling function that depends on the channel error probability.

preprint2011arXiv

Structured sublinear compressive sensing via belief propagation

Compressive sensing (CS) is a sampling technique designed for reducing the complexity of sparse data acquisition. One of the major obstacles for practical deployment of CS techniques is the signal reconstruction time and the high storage cost of random sensing matrices. We propose a new structured compressive sensing scheme, based on codes of graphs, that allows for a joint design of structured sensing matrices and logarithmic-complexity reconstruction algorithms. The compressive sensing matrices can be shown to offer asymptotically optimal performance when used in combination with Orthogonal Matching Pursuit (OMP) methods. For more elaborate greedy reconstruction schemes, we propose a new family of list decoding belief propagation algorithms, as well as reinforced- and multiple-basis belief propagation algorithms. Our simulation results indicate that reinforced BP CS schemes offer very good complexity-performance tradeoffs for very sparse signal vectors.

preprint2011arXiv

Symmetric Group Testing and Superimposed Codes

We describe a generalization of the group testing problem termed symmetric group testing. Unlike in classical binary group testing, the roles played by the input symbols zero and one are "symmetric" while the outputs are drawn from a ternary alphabet. Using an information-theoretic approach, we derive sufficient and necessary conditions for the number of tests required for noise-free and noisy reconstructions. Furthermore, we extend the notion of disjunct (zero-false-drop) and separable (uniquely decipherable) codes to the case of symmetric group testing. For the new family of codes, we derive bounds on their size based on probabilistic methods, and provide construction methods based on coding theoretic ideas.

preprint2010arXiv

A Geometric Approach to Low-Rank Matrix Completion

The low-rank matrix completion problem can be succinctly stated as follows: given a subset of the entries of a matrix, find a low-rank matrix consistent with the observations. While several low-complexity algorithms for matrix completion have been proposed so far, it remains an open problem to devise search procedures with provable performance guarantees for a broad class of matrix models. The standard approach to the problem, which involves the minimization of an objective function defined using the Frobenius metric, has inherent difficulties: the objective function is not continuous and the solution set is not closed. To address this problem, we consider an optimization procedure that searches for a column (or row) space that is geometrically consistent with the partial observations. The geometric objective function is continuous everywhere and the solution set is the closure of the solution set of the Frobenius metric. We also preclude the existence of local minimizers, and hence establish strong performance guarantees, for special completion scenarios, which do not require matrix incoherence or large matrix size.

preprint2010arXiv

SET: an algorithm for consistent matrix completion

A new algorithm, termed subspace evolution and transfer (SET), is proposed for solving the consistent matrix completion problem. In this setting, one is given a subset of the entries of a low-rank matrix, and asked to find one low-rank matrix consistent with the given observations. We show that this problem can be solved by searching for a column space that matches the observations. The corresponding algorithm consists of two parts -- subspace evolution and subspace transfer. In the evolution part, we use a line search procedure to refine the column space. However, line search is not guaranteed to converge, as there may exist barriers along the search path that prevent the algorithm from reaching a global optimum. To address this problem, in the transfer part, we design mechanisms to detect barriers and transfer the estimated column space from one side of the barrier to the other. The SET algorithm exhibits excellent empirical performance for very low-rank matrices.

preprint2010arXiv

Sorting of Permutations by Cost-Constrained Transpositions

We address the problem of finding the minimum decomposition of a permutation in terms of transpositions with non-uniform cost. For arbitrary non-negative cost functions, we describe polynomial-time, constant-approximation decomposition algorithms. For metric-path costs, we describe exact polynomial-time decomposition algorithms. Our algorithms represent a combination of Viterbi-type algorithms and graph-search techniques for minimizing the cost of individual transpositions, and dynamic programming algorithms for finding minimum cost cycle decompositions. The presented algorithms have applications in information theory, bioinformatics, and algebra.
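
As a point of reference for the cost-constrained problem, the uniform-cost special case has a simple exact solution: follow each cycle of the permutation and fix one element per swap. The sketch below implements only that textbook case, not the Viterbi-type or dynamic programming algorithms of the abstract. It assumes the permutation is given as a rearrangement of 0..n-1, and the function names are illustrative.

```python
def transposition_decomposition(perm):
    """Position swaps that sort perm, a rearrangement of 0..n-1.

    Each swap places one value at its home position, so a cycle of length k
    costs k - 1 swaps and the total is n minus the number of cycles.
    """
    perm = list(perm)
    swaps = []
    for i in range(len(perm)):
        while perm[i] != i:
            j = perm[i]
            perm[i], perm[j] = perm[j], perm[i]  # value j is now at position j
            swaps.append((i, j))
    return swaps


def apply_swaps(perm, swaps):
    """Apply a list of position swaps to a copy of perm."""
    perm = list(perm)
    for i, j in swaps:
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```

When swaps have different costs this decomposition is generally no longer cheapest, since a longer sequence of inexpensive swaps can beat a short sequence of expensive ones. That gap is what the non-uniform setting addresses.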

preprint2010arXiv

Subspace Evolution and Transfer (SET) for Low-Rank Matrix Completion

We describe a new algorithm, termed subspace evolution and transfer (SET), for solving low-rank matrix completion problems. The algorithm takes as its input a subset of entries of a low-rank matrix, and outputs one low-rank matrix consistent with the given observations. The completion task is accomplished by searching for a column space on the Grassmann manifold that matches the incomplete observations. The SET algorithm consists of two parts -- subspace evolution and subspace transfer. In the evolution part, we use a gradient descent method on the Grassmann manifold to refine our estimate of the column space. Since the gradient descent algorithm is not guaranteed to converge, due to the existence of barriers along the search path, we design a new mechanism for detecting barriers and transferring the estimated column space across the barriers. This mechanism constitutes the core of the transfer step of the algorithm. The SET algorithm exhibits excellent empirical performance for both high and low sampling rate regimes.

preprint2009arXiv

Multiple-Bases Belief-Propagation Decoding of High-Density Cyclic Codes

We introduce a new method for decoding short and moderate length linear block codes with dense parity-check matrix representations of cyclic form, termed multiple-bases belief-propagation (MBBP). The proposed iterative scheme makes use of the fact that a code has many structurally diverse parity-check matrices, capable of detecting different error patterns. We show that this inherent code property leads to decoding algorithms with significantly better performance when compared to standard BP decoding. Furthermore, we describe how to choose sets of parity-check matrices of cyclic form amenable for multiple-bases decoding, based on analytical studies performed for the binary erasure channel. For several cyclic and extended cyclic codes, the MBBP decoding performance can be shown to closely follow that of maximum-likelihood decoders.

preprint2008arXiv

Permutation Decoding and the Stopping Redundancy Hierarchy of Cyclic and Extended Cyclic Codes

We introduce the notion of the stopping redundancy hierarchy of a linear block code as a measure of the trade-off between performance and complexity of iterative decoding for the binary erasure channel. We derive lower and upper bounds for the stopping redundancy hierarchy via Lovász's Local Lemma and Bonferroni-type inequalities, and specialize them for codes with cyclic parity-check matrices. Based on the observed properties of parity-check matrices with good stopping redundancy characteristics, we develop a novel decoding technique, termed automorphism group decoding, that combines iterative message passing and permutation decoding. We also present bounds on the smallest number of permutations of an automorphism group decoder needed to correct any set of erasures up to a prescribed size. Simulation results demonstrate that for a large number of algebraic codes, the performance of the new decoding method is close to that of maximum likelihood decoding.
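
The role of the parity-check matrix in iterative erasure decoding can be seen in a small sketch. It uses the standard definition of a stopping set, which is assumed here rather than taken from the abstract: a non-empty set of columns such that no row of the parity-check matrix has exactly one 1 inside the set. The function names and the list-of-rows matrix representation are illustrative.

```python
def is_stopping_set(H, columns):
    """True if no row of H has exactly one 1 within the non-empty column set."""
    columns = list(columns)
    if not columns:
        return False
    return all(sum(row[c] for c in columns) != 1 for row in H)


def peel_erasures(H, erased):
    """Iterative erasure decoding on the binary erasure channel.

    Repeatedly finds a check involving exactly one erased position and
    resolves it. Returns the positions that remain erased; an empty set
    means decoding succeeded.
    """
    erased = set(erased)
    progress = True
    while erased and progress:
        progress = False
        for row in H:
            hits = [c for c in erased if row[c]]
            if len(hits) == 1:
                erased.remove(hits[0])
                progress = True
    return erased


# One parity-check matrix of the [7,4] Hamming code, used as a toy example.
HAMMING_H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
```

The decoder stalls exactly when the remaining erasures form a stopping set. Since stopping sets depend on the chosen rows and not only on the code, adding redundant rows can remove small stopping sets, and the stopping redundancy measures how many rows are needed.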

preprint2008arXiv

The Trapping Redundancy of Linear Block Codes

We generalize the notion of the stopping redundancy in order to study the smallest size of a trapping set in Tanner graphs of linear block codes. In this context, we introduce the notion of the trapping redundancy of a code, which quantifies the relationship between the number of redundant rows in any parity-check matrix of a given code and the size of its smallest trapping set. Trapping sets with certain parameter sizes are known to cause error-floors in the performance curves of iterative belief propagation decoders, and it is therefore important to identify decoding matrices that avoid such sets. Bounds on the trapping redundancy are obtained using probabilistic and constructive methods, and the analysis covers both general and elementary trapping sets. Numerical values for these bounds are computed for the [2640,1320] Margulis code and the class of projective geometry codes, and compared with some new code-specific trapping set size estimates.

preprint2006arXiv

Shortened Array Codes of Large Girth

One approach to designing structured low-density parity-check (LDPC) codes with large girth is to shorten codes with small girth in such a manner that the deleted columns of the parity-check matrix contain all the variables involved in short cycles. This approach is especially effective if the parity-check matrix of a code is a matrix composed of blocks of circulant permutation matrices, as is the case for the class of codes known as array codes. We show how to shorten array codes by deleting certain columns of their parity-check matrices so as to increase their girth. The shortening approach is based on the observation that for array codes, and in fact for a slightly more general class of LDPC codes, the cycles in the corresponding Tanner graph are governed by certain homogeneous linear equations with integer coefficients. Consequently, we can selectively eliminate cycles from an array code by only retaining those columns from the parity-check matrix of the original code that are indexed by integer sequences that do not contain solutions to the equations governing those cycles. We provide Ramsey-theoretic estimates for the maximum number of columns that can be retained from the original parity-check matrix with the property that the sequence of their indices avoids solutions to various types of cycle-governing equations. This translates to estimates of the rate penalty incurred in shortening a code to eliminate cycles. Simulation results show that for the codes considered, shortening them to increase the girth can lead to significant gains in signal-to-noise ratio in the case of communication over an additive white Gaussian noise channel.
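
The block structure described above can be sketched with a small construction. The code assumes the usual array-code layout, which the abstract does not state explicitly: for a prime q and r block rows, block (i, j) is the q x q cyclic shift matrix raised to the power i*j mod q. It then tests for the shortest possible cycles in the Tanner graph, using the fact that a 4-cycle exists exactly when two columns share ones in two rows. The function names and the optional `kept_columns` argument, which stands in for shortening, are illustrative; the search for longer cycles and the Ramsey-theoretic column selection are not reproduced.

```python
from itertools import combinations


def array_code_parity_check(q, r):
    """r x q array of q x q circulant permutation blocks, block (i, j) = P^(i*j mod q)."""
    H = [[0] * (q * q) for _ in range(r * q)]
    for i in range(r):
        for j in range(q):
            shift = (i * j) % q
            for a in range(q):
                H[i * q + a][j * q + (a + shift) % q] = 1
    return H


def has_four_cycle(H, kept_columns=None):
    """True if two of the kept columns have ones in at least two common rows."""
    cols = range(len(H[0])) if kept_columns is None else kept_columns
    supports = [{r for r, row in enumerate(H) if row[c]} for c in cols]
    return any(len(s & t) >= 2 for s, t in combinations(supports, 2))
```

For prime q and r at most q, a 4-cycle would need (i1 - i2)(j1 - j2) to vanish modulo q for distinct block rows and block columns, which cannot happen, so the check reports none. Shortening targets longer cycles in the same way: a cycle exists only if the retained column indices solve a corresponding linear equation, so columns are chosen to avoid such solutions.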

preprint2005arXiv

DNA Codes that Avoid Secondary Structures

In this paper, we consider the problem of designing DNA sequences (codewords) for DNA storage systems and DNA computing that are unlikely to fold back onto themselves to form undesirable secondary structures. The paper addresses both the issue of enumerating the sequences with such properties and the problem of practical code construction.

56 published item(s)

Spatially-Coupled Network RNA Velocities: A Control-Theoretic Perspective

Balanced and Swap-Robust Trades for Dynamical Distributed Storage

HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering

Linear Classifiers in Product Space Forms

Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction

Provably Accurate and Scalable Linear Classifiers in Hyperbolic Spaces

You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks

Image processing in DNA

Semiquantitative Group Testing in at Most Two Rounds

Group Testing with Runlength Constraints for Topological Molecular Storage

Mass Error-Correction Codes for Polymer-Based Data Storage

MaxMinSum Steiner Systems for Access-Balancing in Distributed Storage

Repairing Reed-Solomon Codes via Subspace Polynomials

Repairing Reed-Solomon Codes With Multiple Erasures

Support Estimation with Sampling Artifacts and Errors

A new correlation clustering method for cancer mutation analysis

Asymmetric Lee Distance Codes for DNA-Based Storage

Balanced Permutation Codes

Correlation Clustering and Biclustering with Locally Bounded Errors

Latent Network Features and Overlapping Community Discovery via Boolean Intersection Representations

Weakly Mutually Uncorrelated Codes

A Perspective on Future Research Directions in Information Theory

A Rewritable, Random-Access DNA-Based Storage System

Code Construction and Decoding Algorithms for Semi-Quantitative Group Testing with Nonuniform Thresholds

Codes for DNA Sequence Profiles

Codes for DNA Storage Channels

Correlation Clustering with Constrained Cluster Sizes and Extended Weights Bounds

DNA-Based Storage: Trends and Methods

Computing Similarity Distances Between Rankings

String Reconstruction from Substring Compositions

Synchronizing Edits in Distributed Storage Networks

Synchronizing Rankings via Interactive Communication

Error-Correction in Flash Memories via Codes in the Ulam Metric

MCUIUC -- A New Framework for Metagenomic Read Compression

MetaPar: Metagenomic Sequence Assembly via Iterative Reclassification

Semi-Quantitative Group Testing: A Unifying Framework for Group Testing with Applications in Genotyping

A General Framework for Distributed Vote Aggregation

A Novel Distance-Based Approach to Constrained Rank Aggregation

Alternating Markov Chains for Distribution Estimation in the Presence of Errors

Causal Compressive Sensing for Gene Network Inference

Hybrid Noncoherent Network Coding

Nonuniform Vote Aggregation Algorithms

Novel Distance Measures for Vote Aggregation

Semi-Quantitative Group Testing

Information Theoretic Bounds for Tensor Rank Minimization over Finite Fields

Structured sublinear compressive sensing via belief propagation

Symmetric Group Testing and Superimposed Codes

A Geometric Approach to Low-Rank Matrix Completion

SET: an algorithm for consistent matrix completion

Sorting of Permutations by Cost-Constrained Transpositions

Subspace Evolution and Transfer (SET) for Low-Rank Matrix Completion

Multiple-Bases Belief-Propagation Decoding of High-Density Cyclic Codes

Permutation Decoding and the Stopping Redundancy Hierarchy of Cyclic and Extended Cyclic Codes

The Trapping Redundancy of Linear Block Codes

Shortened Array Codes of Large Girth

DNA Codes that Avoid Secondary Structures