Source author record

Alexandru I. Tomescu

Alexandru I. Tomescu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms; Computational Complexity; Discrete Mathematics; Genomics; math.CO; math.OC; Quantitative Methods; Computational Engineering, Finance, and Science; Computer Science and Game Theory; Logic in Computer Science; Populations and Evolution

Catalog footprint

What is connected

18 works

11 topics

4 close collaborators


preprint, 2023, arXiv

Minimum Flow Decomposition in Graphs with Cycles using Integer Linear Programming

Minimum flow decomposition (MFD) -- the problem of finding a minimum set of weighted source-to-sink paths that perfectly decomposes a flow -- is a classical problem in Computer Science, and variants of it are powerful models in different fields such as Bioinformatics and Transportation. Even on acyclic graphs, the problem is NP-hard, and most practical solutions have been via heuristics or approximations. While there is an extensive body of research on acyclic graphs, currently, there is no exact solution on graphs with cycles. In this paper, we present the first ILP formulation for three natural variants of the MFD problem in graphs with cycles, asking for a decomposition consisting only of weighted source-to-sink paths or cycles, trails, and walks, respectively. On three datasets of increasing levels of complexity from both Bioinformatics and Transportation, our approaches solve any instance in under 10 minutes. Our implementations are freely available at github.com/algbio/MFD-ILP.

preprint, 2022, arXiv

Algorithms and Complexity on Indexing Founder Graphs

We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower bounds cover even deterministic graphs with binary alphabet, but there naturally exist also graph classes that are easy to index: e.g. Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform-based indexing scheme. However, it is NP-complete to recognize if a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019). We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments (MSAs). Elastic degenerate strings (Bernardini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses, repeat-free and semi-repeat-free graphs, that are easy to index. We give a linear time algorithm to construct a repeat-free non-elastic founder graph from a gapless MSA, and (parameterized) near-linear time algorithms to construct semi-repeat-free (repeat-free, respectively) elastic founder graphs from general MSAs. Finally, we show that repeat-free elastic founder graphs admit a reduction to Wheeler graphs in polynomial time.

preprint, 2022, arXiv

Fast, Flexible, and Exact Minimum Flow Decompositions via ILP

Minimum flow decomposition (MFD) (the problem of finding a minimum set of paths that perfectly decomposes a flow) is a classical problem in Computer Science, and variants of it are powerful models in multiassembly problems in Bioinformatics (e.g. RNA assembly). However, because this problem and its variants are NP-hard, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Many RNA assemblers also use integer linear programming (ILP) formulations of such practical variants, with the major limitation that they need to encode all the potentially exponentially many solution paths. Moreover, the only exact solver for MFD does not scale to large instances and cannot be efficiently generalized to practical MFD variants. In this work, we provide the first practical ILP formulation for MFD (and thus the first fast and exact solver for MFD), based on encoding all of the exponentially many solution paths using only a quadratic number of variables. On both simulated and real flow graphs, our approach solves any instance in under 13 seconds. We also show that our ILP formulation can be easily and efficiently adapted for many practical variants, such as incorporating longer or paired-end reads or minimizing flow errors. We hope that our results can remove the current tradeoff between the complexity of a multiassembly model and its tractability and can lie at the core of future practical RNA assembly tools.

preprint, 2022, arXiv

Optimizing Safe Flow Decompositions in DAGs

Network flow is one of the most studied combinatorial optimization problems, with innumerable applications. Any flow on a directed acyclic graph $G$ having $n$ vertices and $m$ edges can be decomposed into a set of $O(m)$ paths. In some applications, each solution (decomposition) corresponds to some particular data that generated the original flow. Given the possibility of multiple optimal solutions, no optimization criterion ensures the identification of the correct decomposition. Hence, flow decomposition was recently studied [RECOMB22] in the Safe and Complete framework, particularly for RNA Assembly. That work presented a characterization of the safe paths, resulting in an $O(mn+out_R)$ time algorithm to compute all safe paths, where $out_R$ is the size of the raw output reporting each safe path explicitly. It also showed that $out_R$ can be $Ω(mn^2)$ in the worst case but $O(m)$ in the best case. Hence, it further presented an algorithm to report a concise representation of the output $out_C$ in $O(mn+out_C)$ time, where $out_C$ can be $Ω(mn)$ in the worst case but $O(m)$ in the best case. In this work, we study how different safe paths interact, resulting in optimal output-sensitive algorithms requiring $O(m+out_R)$ and $O(m+out_C)$ time for computing the existing representations of the safe paths. Further, we propose a new characterization of the safe paths resulting in the optimal representation of safe paths $out_O$, which can be $Ω(mn)$ in the worst case but requires optimal $O(1)$ space for every safe path reported, with a near-optimal computation algorithm. Overall, we further develop the theory of safe and complete solutions for the flow decomposition problem, giving an optimal algorithm for the explicit representation, and a near-optimal algorithm for the optimal representation of the safe paths.

preprint, 2022, arXiv

Safety and Completeness in Flow Decompositions for RNA Assembly

Decomposing a network flow into weighted paths has numerous applications. Some applications require any decomposition that is optimal w.r.t. some property such as number of paths, robustness, or length. Many bioinformatic applications require a specific decomposition where the paths correspond to some underlying data that generated the flow. For real inputs, no optimization criterion is guaranteed to uniquely identify the correct decomposition. Therefore, we propose to report safe paths, i.e., subpaths of at least one path in every flow decomposition. Ma, Zheng, and Kingsford [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, i.e., non-identifiability. Later [RECOMB 2021], they gave a quadratic-time algorithm based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe. We give the first local characterization of safe paths for flow decompositions in directed acyclic graphs (DAGs), leading to a practical algorithm for finding the complete set of safe paths. We evaluated our algorithms against the trivial safe algorithms (unitigs, extended unitigs) and the popularly used heuristic (greedy-width) for flow decomposition on RNA transcript datasets. Despite maintaining perfect precision, our algorithm reports significantly higher coverage ($\approx 50\%$ more) than the trivial safe algorithms. The greedy-width algorithm, though reporting better coverage, has significantly lower precision on complex graphs. Overall, our algorithm outperforms (by $\approx 20\%$) greedy-width on a unified metric (F-Score) when the dataset has a significant number of complex graphs. Moreover, it has superior time ($3-5\times$) and space efficiency ($1.2-2.2\times$), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.
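For readers unfamiliar with the greedy-width heuristic named in the abstract above, here is a minimal sketch of it. It is not the authors' safe-path algorithm. It assumes a valid integer flow on a DAG given as a dictionary from edges to flow values; the function name `greedy_width` and the input layout are illustrative choices, not taken from the paper.

```python
def greedy_width(flow, s, t):
    """Decompose a DAG flow into weighted s-t paths, widest path first.

    flow: dict mapping edge (u, v) -> non-negative integer flow value.
    Returns a list of (weight, path) pairs. Heuristic: the result is a
    valid decomposition but not necessarily one with the fewest paths.
    """
    remaining = dict(flow)
    nodes = {x for edge in flow for x in edge}
    succ = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v in flow:
        succ[u].append(v)
        indeg[v] += 1

    # Topological order (Kahn's algorithm); the list grows while we scan it.
    order = [v for v in nodes if indeg[v] == 0]
    for u in order:
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    paths = []
    while True:
        # width[v] = largest bottleneck over all s-v paths in the residual flow.
        width = {v: 0 for v in nodes}
        width[s] = float("inf")
        pred = {}
        for u in order:
            if width[u] == 0:
                continue
            for v in succ[u]:
                w = min(width[u], remaining[(u, v)])
                if w > width[v]:
                    width[v] = w
                    pred[v] = u
        if width[t] == 0:
            break
        # Recover the widest path and subtract its bottleneck from the flow.
        path = [t]
        while path[-1] != s:
            path.append(pred[path[-1]])
        path.reverse()
        for u, v in zip(path, path[1:]):
            remaining[(u, v)] -= width[t]
        paths.append((width[t], path))
    return paths
```

Each round saturates at least one edge, so the loop performs at most as many rounds as there are edges.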

preprint, 2021, arXiv

The Labeled Direct Product Optimally Solves String Problems on Graphs

Suffix trees are an important data structure at the core of optimal solutions to many fundamental string problems, such as exact pattern matching, longest common substring, matching statistics, and longest repeated substring. Recent lines of research focused on extending some of these problems to vertex-labeled graphs, although using ad-hoc approaches which in some cases do not generalize to all input graphs. In the absence of a ubiquitous tool like the suffix tree for labeled graphs, we introduce the labeled direct product of two graphs as a general tool for obtaining optimal algorithms: we obtain conceptually simpler algorithms for the quadratic problems of string matching (SMLG) and longest common substring (LCSP) in labeled graphs. Our algorithms are also more efficient, since they run in time linear in the size of the labeled product graph, which may be smaller than quadratic for some inputs, and their run-time is predictable, because the size of the labeled direct product graph can be precomputed efficiently. We also solve LCSP on graphs containing cycles, which was left as an open problem by Shimohira et al. in 2011. To show the power of the labeled product graph, we also apply it to solve the matching statistics (MSP) and the longest repeated string (LRSP) problems in labeled graphs. Moreover, we show that our (worst-case quadratic) algorithms are also optimal, conditioned on the Orthogonal Vectors Hypothesis. Finally, we complete the complexity picture around LRSP by studying it on undirected graphs.
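As a rough illustration of the product idea, the sketch below builds a product of two node-labeled graphs and uses it to find the longest common substring when the product is acyclic. The definition used here (pair up nodes with equal labels, and connect two pairs when both coordinates are connected) is an assumption based on the abstract's description; the paper's formal definition may differ in details, and the function names are hypothetical.

```python
from functools import lru_cache


def labeled_direct_product(labels1, edges1, labels2, edges2):
    """Pair nodes with equal labels; pairs are adjacent if both coordinates are.

    labels1, labels2: dict node -> label. edges1, edges2: iterables of (u, v).
    """
    nodes = {(u, v) for u in labels1 for v in labels2
             if labels1[u] == labels2[v]}
    edges = {((u, v), (x, y))
             for (u, x) in edges1 for (v, y) in edges2
             if (u, v) in nodes and (x, y) in nodes}
    return nodes, edges


def longest_common_substring_length(nodes, edges):
    """Longest path, counted in nodes, of an ACYCLIC product graph."""
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)

    @lru_cache(maxsize=None)
    def longest_from(n):
        return 1 + max((longest_from(m) for m in succ[n]), default=0)

    return max((longest_from(n) for n in nodes), default=0)


# Two path graphs spelling "ABCD" and "XBCY"; their common substring is "BC".
g1_labels = {0: "A", 1: "B", 2: "C", 3: "D"}
g2_labels = {0: "X", 1: "B", 2: "C", 3: "Y"}
path_edges = [(0, 1), (1, 2), (2, 3)]
product_nodes, product_edges = labeled_direct_product(
    g1_labels, path_edges, g2_labels, path_edges)
```

The work done is proportional to the size of the product graph, which is the quantity the abstract says can be precomputed to predict the running time.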

preprint, 2020, arXiv

Computing all $s$-$t$ bridges and articulation points simplified

Given a directed graph $G$ and a pair of nodes $s$ and $t$, an $s$-$t$ bridge of $G$ is an edge whose removal breaks all $s$-$t$ paths of $G$. Similarly, an $s$-$t$ articulation point of $G$ is a node whose removal breaks all $s$-$t$ paths of $G$. Computing the sequence of all $s$-$t$ bridges of $G$ (as well as the $s$-$t$ articulation points) is a basic graph problem, solvable in linear time using the classical min-cut algorithm. When dealing with cuts of unit size ($s$-$t$ bridges), this algorithm can be simplified to a single graph traversal from $s$ to $t$ avoiding an arbitrary $s$-$t$ path, which is interrupted at the $s$-$t$ bridges. Further, the corresponding proof is also simplified, making it independent of the theory of network flows.
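To make the definition of an $s$-$t$ bridge concrete, the following sketch applies it directly. This is a naive quadratic-time check, not the linear-time traversal the abstract describes: since every bridge must lie on any single $s$-$t$ path, it finds one path and tests each of its edges by deletion. It assumes a simple directed graph (no parallel edges) with `s != t`; the function names are illustrative.

```python
from collections import deque


def find_path(adj, s, t, banned=None):
    """BFS from s to t, ignoring the edge `banned`.

    adj: dict node -> list of out-neighbors.
    Returns the path as a list of edges, or None if t is unreachable.
    """
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for v in adj.get(u, ()):
            if (u, v) != banned and v not in parent:
                parent[v] = u
                queue.append(v)
    if t not in parent:
        return None
    path = []
    v = t
    while parent[v] is not None:
        path.append((parent[v], v))
        v = parent[v]
    return path[::-1]


def st_bridges(adj, s, t):
    """All s-t bridges, in the order they appear from s to t."""
    path = find_path(adj, s, t)
    if path is None:
        return []
    # An edge of the path is a bridge iff deleting it disconnects t from s.
    return [e for e in path if find_path(adj, s, t, banned=e) is None]


example = {"s": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["t"]}
```

In `example`, the two parallel routes through `b` and `c` protect the middle edges, so only the first and last edges are bridges.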

preprint, 2020, arXiv

Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails

We consider the following string matching problem on a node-labeled graph $G=(V,E)$: given a pattern string $P$, decide whether there exists a path in $G$ whose concatenation of node labels equals $P$. This is a basic primitive in various problems in bioinformatics, graph databases, or networks. The hardness results of Backurs and Indyk (FOCS 2016) imply that this problem cannot be solved in better than $O(|E||P|)$ time, under the Orthogonal Vectors Hypothesis (OVH), and this holds even under various restrictions on the graph (Equi et al., ICALP 2019). In this paper we consider its offline version, namely the one in which we are allowed to index the graph in order to support time-efficient string matching queries. Indeed, it was tantalizing in the string matching community to believe that sub-quadratic time queries can be achieved, e.g. at the cost of a high-degree polynomial-time indexing. We disprove this belief, showing that, under OVH, no polynomial-time index can support querying $P$ in time $O(|E|^δ|P|^β)$, with either $δ< 1$ or $β< 1$. We prove this tight bound employing a known self-reducibility technique, e.g. from the field of dynamic algorithms, which translates conditional lower bounds for an online problem to its offline version. As a side-contribution, we formalize this technique with the notion of linear independent-components reduction, allowing for a simple proof of our result. As another illustration of our technique, we also translate the quadratic conditional lower bound of Backurs and Indyk (STOC 2015) for the problem of matching a query string inside a text, under edit distance. We obtain an analogous tight quadratic lower bound for its offline version, improving the recent result of Cohen-Addad, Feuilloley and Starikovskaya (SODA 2019), but with a slightly different boundary condition.
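The $O(|E||P|)$ upper bound that these lower bounds match can be reached with a simple frontier-propagation algorithm for the online problem. The sketch below assumes each node carries a single-character label, which is a simplification for illustration; the function name and input layout are hypothetical.

```python
def pattern_occurs(labels, edges, pattern):
    """Decide whether some path in the graph spells `pattern`.

    labels: dict node -> one-character label. edges: list of (u, v) pairs.
    Runs in O(|E| * |P|) time: one pass over the edges per pattern character.
    """
    if not pattern:
        return True
    # Nodes at which a path spelling pattern[:1] can end.
    frontier = {v for v, c in labels.items() if c == pattern[0]}
    for ch in pattern[1:]:
        # Extend every partial match by one edge into a node labeled `ch`.
        frontier = {v for (u, v) in edges if u in frontier and labels[v] == ch}
        if not frontier:
            return False
    return bool(frontier)


# A two-node cycle A -> B -> A: paths may revisit nodes.
cycle_labels = {0: "A", 1: "B"}
cycle_edges = [(0, 1), (1, 0)]
```

The indexing result in the abstract says that, under OVH, no polynomial-time preprocessing of the graph can make queries polynomially faster than this in either $|E|$ or $|P|$.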

preprint, 2020, arXiv

Linear Time Construction of Indexable Founder Block Graphs

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SARS-CoV-2 strains are reported. An MSA of size $410\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels are 12 and 34968, respectively. The index on the graph takes only $3\%$ of the size of the MSA.
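The way a segmentation induces a founder block graph can be sketched in a few lines. This example only builds the graph for a segmentation that is given as input; it does not choose the segmentation or check the segment repeat-free property, which is where the paper's contribution lies. It assumes a gapless MSA with equal-length rows, and the function name and the `(block index, segment)` node encoding are illustrative.

```python
def founder_block_graph(msa, boundaries):
    """Build the block graph induced by cutting every MSA row at `boundaries`.

    msa: list of equal-length strings (gapless rows).
    boundaries: sorted cut positions, starting at 0 and ending at the row length.
    Nodes are (block index, segment) pairs; an edge joins two segments that
    appear consecutively in at least one row.
    """
    blocks = list(zip(boundaries, boundaries[1:]))
    nodes, edges = set(), set()
    for row in msa:
        segments = [(i, row[a:b]) for i, (a, b) in enumerate(blocks)]
        nodes.update(segments)
        edges.update(zip(segments, segments[1:]))
    return nodes, edges


rows = ["ACGT", "ACTT", "GCGT"]
graph_nodes, graph_edges = founder_block_graph(rows, [0, 2, 4])
```

Identical segments within a block collapse into one node, which is the source of the compaction the abstract reports.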

preprint, 2016, arXiv

Complexity and algorithms for finding a perfect phylogeny from mixed tumor samples

Recently, Hajirasouliha and Raphael (WABI 2014) proposed a model for deconvoluting mixed tumor samples measured from a collection of high-throughput sequencing reads. This is related to understanding tumor evolution and critical cancer mutations. In short, their formulation asks to split each row of a binary matrix so that the resulting matrix corresponds to a perfect phylogeny and has the minimum number of rows among all matrices with this property. In this paper we disprove several claims about this problem, including an NP-hardness proof of it. However, we show that the problem is indeed NP-hard, by providing a different proof. We also prove NP-completeness of a variant of this problem proposed in the same paper. On the positive side, we propose an efficient (though not necessarily optimal) heuristic algorithm based on coloring co-comparability graphs, and a polynomial time algorithm for solving the problem optimally on matrix instances in which no column is contained in both columns of a pair of conflicting columns. Implementations of these algorithms are freely available at https://github.com/alexandrutomescu/MixedPerfectPhylogeny
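The notion of conflicting columns can be illustrated with the standard test from the perfect phylogeny literature. The sketch assumes the usual definition, in which two columns conflict when some rows show all three of the patterns (1,1), (1,0) and (0,1) on them; the paper's exact definition should be checked against its text, and the function name is hypothetical.

```python
from itertools import combinations


def conflicting_pairs(matrix):
    """Return all pairs (i, j), i < j, of conflicting columns.

    matrix: list of rows, each a list of 0/1 values of equal length.
    Under the assumed definition, a binary matrix admits a perfect
    phylogeny (rooted at the all-zero row) iff this list is empty.
    """
    ncols = len(matrix[0]) if matrix else 0
    conflicts = []
    for i, j in combinations(range(ncols), 2):
        seen = {(row[i], row[j]) for row in matrix}
        if {(1, 1), (1, 0), (0, 1)} <= seen:
            conflicts.append((i, j))
    return conflicts
```

The row-splitting problem in the abstract asks how to break up the rows of a matrix with conflicts so that the resulting matrix has none, using as few rows as possible.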

preprint, 2016, arXiv

Safe and complete contig assembly via omnitigs

Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph $G$ (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from $G$ as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.

preprint, 2014, arXiv

Enumeration of the adjunctive hierarchy of hereditarily finite sets

Hereditarily finite sets (sets which are finite and have only hereditarily finite sets as members) are basic mathematical and computational objects, and also stand at the basis of some programming languages. This raises the need for efficient representation of such sets, for example by numbers. In 2008, Kirby proposed an adjunctive hierarchy of hereditarily finite sets, based on the fact that they can also be seen as built up from the empty set by repeated adjunction, that is, by the addition of a new single element drawn from the already existing sets to an already existing set. Determining the cardinality $a_n$ of each level of this hierarchy, a problem crucial in establishing whether the natural adjunctive hierarchy leads to an efficient encoding by numbers, was left open. In this paper we solve this problem. Our results can be generalized to hereditarily finite sets with atoms, or can be further refined by imposing restrictions on rank, on cardinality, or on the maximum level from where the new adjoined element can be drawn. We also show that $a_n$ satisfies the asymptotic formula $a_n = C^{2^n} + O(C^{2^{n-1}})$, for a constant $C \approx 1.3399$, which is too fast an asymptotic growth for practical purposes. We thus propose a very natural variant of the adjunctive hierarchy, whose asymptotic behavior we prove to be $Θ(2^n)$. To our knowledge, this is the first result of this kind.
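The adjunction operation can be explored by brute force for the first few levels. The sketch below assumes that level 0 contains only the empty set and that each next level adds every set of the form x ∪ {y} with x and y taken from the current level; this reading follows the abstract's description, but the paper's indexing of $a_n$ may differ. The function name is hypothetical.

```python
def adjunctive_level_sizes(max_level):
    """Sizes of levels 0..max_level of the assumed adjunctive hierarchy.

    Sets are represented as nested frozensets. The growth is doubly
    exponential, so only very small values of max_level are feasible.
    """
    level = {frozenset()}
    sizes = [len(level)]
    for _ in range(max_level):
        # Adjoin one existing set y, as a new element, to an existing set x.
        level = level | {x | {y} for x in level for y in level}
        sizes.append(len(level))
    return sizes
```

Under these assumptions the first five sizes are 1, 2, 4, 12 and 112, which is consistent with the doubly exponential growth stated in the abstract.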

preprint, 2014, arXiv

Motif matching using gapped patterns

We present new algorithms for the problem of multiple string matching of gapped patterns, where a gapped pattern is a sequence of strings such that there is a gap of fixed length between each two consecutive strings. The problem has applications in the discovery of transcription factor binding sites in DNA sequences when using generalized versions of the Position Weight Matrix model to describe transcription factor specificities. In these models a motif can be matched as a set of gapped patterns with unit-length keywords. The existing algorithms for matching a set of gapped patterns are worst-case efficient but not practical, or vice versa, in this particular case. The novel algorithms that we present are based on dynamic programming and bit-parallelism, and lie in a middle-ground among the existing algorithms. In fact, their time complexity is close to the best existing bound and, yet, they are also practical. We also provide experimental results which show that the presented algorithms are fast in practice, and preferable if all the strings in the patterns have unit-length.

preprint, 2013, arXiv

A Novel Combinatorial Method for Estimating Transcript Expression with RNA-Seq: Bounding the Number of Paths

RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. The problem is usually modeled by a weighted splicing graph whose nodes stand for exons and whose edges stand for split alignments to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the coverages of the graph under various fitness functions, such as least sum of squares. In (Tomescu et al. RECOMB-seq 2013) we showed that under general fitness functions, if we allow a polynomially bounded number of paths in an optimal solution, this problem can be solved in polynomial time by a reduction to a min-cost flow program. In this paper we further refine this problem by asking for a bounded number k of paths that optimally explain the splicing graph. This problem becomes NP-hard in the strong sense, but we give a fast combinatorial algorithm based on dynamic programming for it. In order to obtain a practical tool, we implement three optimizations and heuristics, which achieve better performance on real data, and similar or better performance on simulated data, than state-of-the-art tools Cufflinks, IsoLasso and SLIDE. Our tool, called Traph, is available at http://www.cs.helsinki.fi/gsa/traph/

preprint, 2013, arXiv

Combinatorial decomposition approaches for efficient counting and random generation FPTASes

Given a combinatorial decomposition for a counting problem, we resort to the simple scheme of approximating large numbers by floating-point representations in order to obtain efficient Fully Polynomial Time Approximation Schemes (FPTASes) for it. The number of bits employed for the exponent and the mantissa will depend on the error parameter $0 < \varepsilon \leq 1$ and on the characteristics of the problem. Accordingly, we propose the first FPTASes with $1 \pm \varepsilon$ relative error for counting and generating uniformly at random a labeled DAG with a given number of vertices. This is accomplished starting from a classical recurrence for counting DAGs, whose values we approximate by floating-point numbers. After extending these results to other families of DAGs, we show how the same approach works also with problems where we are given a compact representation of a combinatorial ensemble and we are asked to count and sample elements from it. We employ here the floating-point approximation method to transform the classic pseudo-polynomial algorithm for counting 0/1 Knapsack solutions into a very simple FPTAS with $1 - \varepsilon$ relative error. Its complexity improves upon the recent result (Štefankovič et al., SIAM J. Comput., 2012), and, when $\varepsilon^{-1} = Ω(n)$, also upon the best-known randomized algorithm (Dyer, STOC, 2003). To show the versatility of this technique, we also apply it to a recent generalization of the problem of counting 0/1 Knapsack solutions in an arc-weighted DAG, obtaining a faster and simpler FPTAS than the existing one.
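The classic pseudo-polynomial algorithm for counting 0/1 Knapsack solutions, which the abstract takes as its starting point, can be sketched as follows. This version keeps exact integer counts; the floating-point approximation that turns it into an FPTAS is the paper's contribution and is not reproduced here. It assumes non-negative integer weights, and the function name is illustrative.

```python
def count_knapsack_solutions(weights, capacity):
    """Count subsets of `weights` whose total weight is at most `capacity`.

    counts[w] holds the number of subsets seen so far with total weight
    exactly w. Time O(n * capacity), which is pseudo-polynomial.
    """
    counts = [0] * (capacity + 1)
    counts[0] = 1  # the empty subset
    for w in weights:
        # Scan downwards so that each item is used at most once.
        for total in range(capacity, w - 1, -1):
            counts[total] += counts[total - w]
    return sum(counts)
```

The entries of `counts` can grow exponentially in the number of items, which is why replacing them by short floating-point approximations, as the abstract proposes, saves time.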

preprint, 2013, arXiv

Indexes for Jumbled Pattern Matching in Strings, Trees and Graphs

We consider how to index strings, trees and graphs for jumbled pattern matching when we are asked to return a match if one exists. For example, we show how, given a tree containing two colours, we can build a quadratic-space index with which we can find a match in time proportional to the size of the match. We also show how we need only linear space if we are content with approximate matches.
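For background, the simplest case of jumbled pattern matching is a binary string, where a well-known linear-space index answers existence queries. The sketch below shows that string case only; it does not return a match position and is not the tree or graph index of the paper. It relies on the fact that, for a fixed window length, the achievable numbers of 1s form a contiguous range, because sliding a window by one position changes the count by at most one. Function names are hypothetical.

```python
def build_binary_index(s):
    """For each window length, store the min and max number of '1's.

    s: string over the alphabet {'0', '1'}. Construction here is the naive
    quadratic-time one; the index itself uses linear space.
    """
    n = len(s)
    prefix = [0]
    for ch in s:
        prefix.append(prefix[-1] + (ch == "1"))
    lo = [0] * (n + 1)
    hi = [0] * (n + 1)
    for length in range(1, n + 1):
        ones = [prefix[i + length] - prefix[i] for i in range(n - length + 1)]
        lo[length], hi[length] = min(ones), max(ones)
    return lo, hi


def has_jumbled_match(index, zeros, ones):
    """Is there a substring with exactly `zeros` 0s and `ones` 1s?"""
    lo, hi = index
    length = zeros + ones
    return 0 < length < len(lo) and lo[length] <= ones <= hi[length]


index = build_binary_index("0110")
```

A query then costs constant time, at the price of only answering yes or no.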

preprint, 2012, arXiv

Graph Operations on Parity Games and Polynomial-Time Algorithms

Parity games are games that are played on directed graphs whose vertices are labeled by natural numbers, called priorities. The players push a token along the edges of the digraph. The winner is determined by the parity of the greatest priority occurring infinitely often in this infinite play. A motivation for studying parity games comes from the area of formal verification of systems by model checking. Deciding the winner in a parity game is polynomial time equivalent to the model checking problem of the modal mu-calculus. Another strong motivation lies in the fact that the exact complexity of solving parity games is a long-standing open problem, the currently best known algorithm being subexponential. It is known that the problem is in the complexity classes UP and coUP. In this paper we identify restricted classes of digraphs where the problem is solvable in polynomial time, following an approach from structural graph theory. We consider three standard graph operations: the join of two graphs, repeated pasting along vertices, and the addition of a vertex. Given a class C of digraphs on which we can solve parity games in polynomial time, we show that the same holds for the class obtained from C by applying once any of these three operations to its elements. These results provide, in particular, polynomial time algorithms for parity games whose underlying graph is an orientation of a complete graph, a complete bipartite graph, a block graph, or a block-cactus graph. These are classes where the problem was not known to be efficiently solvable. Previous results concerning restricted classes of parity games which are solvable in polynomial time include classes of bounded tree-width, bounded DAG-width, and bounded clique-width. We also prove that recognising the winning regions of a parity game is not easier than computing them from scratch.
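The winning condition can be made concrete for the simplest kind of infinite play, one that follows a finite prefix and then repeats a cycle forever. The sketch below only evaluates such a play; it does not solve a parity game. It assumes the common convention that the player called "even" wins when the greatest priority seen infinitely often is even, which the abstract does not spell out. The function name is hypothetical.

```python
def winner_of_lasso_play(priority, prefix, cycle):
    """Winner of the infinite play prefix + cycle + cycle + ...

    priority: dict vertex -> natural number.
    Only vertices on the cycle occur infinitely often, so the prefix
    cannot affect the outcome.
    """
    if not cycle:
        raise ValueError("an infinite play needs a non-empty cycle")
    top = max(priority[v] for v in cycle)
    return "even" if top % 2 == 0 else "odd"


priorities = {"a": 3, "b": 2, "c": 1}
```

For example, a play that passes through `a` once and then loops on `b` and `c` is decided by the priority 2 of `b`, even though the prefix contains the larger priority 3.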

preprint, 2012, arXiv

Set graphs. II. Complexity of set graph recognition and similar problems

A graph $G$ is said to be a 'set graph' if it admits an acyclic orientation that is also 'extensional', in the sense that the out-neighborhoods of its vertices are pairwise distinct. Equivalently, a set graph is the underlying graph of the digraph representation of a hereditarily finite set. In this paper, we continue the study of set graphs and related topics, focusing on computational complexity aspects. We prove that set graph recognition is NP-complete, even when the input is restricted to bipartite graphs with exactly two leaves. The problem remains NP-complete if, in addition, we require that the extensional acyclic orientation be also 'slim', that is, that the digraph obtained by removing any arc from it is not extensional. We also show that the counting variants of the above problems are #P-complete, and prove similar complexity results for problems related to a generalization of extensional acyclic digraphs, the so-called 'hyper-extensional digraphs', which were proposed by Aczel to describe hypersets. Our proofs are based on reductions from variants of the Hamiltonian Path problem. We also consider a variant of the well-known notion of a separating code in a digraph, the so-called 'open-out-separating code', and show that it is NP-complete to determine whether an input extensional acyclic digraph contains an open-out-separating code of given size.
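The definition of a set graph translates directly into a brute-force recognizer that tries every orientation. Since recognition is NP-complete, as the abstract states, this exponential-time sketch is only meant for tiny graphs. It assumes a simple undirected graph given as a vertex list and an edge list; the function names are illustrative.

```python
from itertools import product


def is_acyclic(nodes, arcs):
    """Kahn-style check that the digraph (nodes, arcs) has no directed cycle."""
    out = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v in arcs:
        out[u].append(v)
        indeg[v] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return removed == len(nodes)


def is_extensional(nodes, arcs):
    """True iff the out-neighborhoods of all vertices are pairwise distinct."""
    out = {v: set() for v in nodes}
    for u, v in arcs:
        out[u].add(v)
    neighborhoods = [frozenset(out[v]) for v in nodes]
    return len(set(neighborhoods)) == len(neighborhoods)


def is_set_graph(nodes, edges):
    """Try all 2^m orientations; accept if one is acyclic and extensional."""
    edges = list(edges)
    for flips in product([False, True], repeat=len(edges)):
        arcs = [(v, u) if flip else (u, v)
                for (u, v), flip in zip(edges, flips)]
        if is_acyclic(nodes, arcs) and is_extensional(nodes, arcs):
            return True
    return False


triangle = ([0, 1, 2], [(0, 1), (0, 2), (1, 2)])
claw = (["c", "x", "y", "z"], [("c", "x"), ("c", "y"), ("c", "z")])
```

The triangle is a set graph, whereas the claw is not: each of its three leaves can only have out-neighborhood {c} or the empty set, so two leaves must coincide.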
