Researcher profile

Alexandru I. Tomescu

Alexandru I. Tomescu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2023arXiv

Minimum Flow Decomposition in Graphs with Cycles using Integer Linear Programming

Minimum flow decomposition (MFD) -- the problem of finding a minimum set of weighted source-to-sink paths that perfectly decomposes a flow -- is a classical problem in Computer Science, and variants of it are powerful models in different fields such as Bioinformatics and Transportation. Even on acyclic graphs, the problem is NP-hard, and most practical solutions have been via heuristics or approximations. While there is an extensive body of research on acyclic graphs, currently, there is no \emph{exact} solution on graphs with cycles. In this paper, we present the first ILP formulation for three natural variants of the MFD problem in graphs with cycles, asking for a decomposition consisting only of weighted source-to-sink paths or cycles, trails, and walks, respectively. On three datasets of increasing levels of complexity from both Bioinformatics and Transportation, our approaches solve any instance in under 10 minutes. Our implementations are freely available at github.com/algbio/MFD-ILP.

preprint2022arXiv

Algorithms and Complexity on Indexing Founder Graphs

We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower-bounds cover even deterministic graphs with binary alphabet, but there naturally exist also graph classes that are easy to index: E.g. Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform -based indexing scheme. However, it is NP-complete to recognize if a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019). We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments (MSAs). Elastic degenerate strings (Bernadini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses, repeat-free and semi-repeat-free graphs, that are easy to index. We give a linear time algorithm to construct a repeat-free non-elastic founder graph from a gapless MSA, and (parameterized) near-linear time algorithms to construct semi-repeat-free (repeat-free, respectively) elastic founder graphs from general MSAs. Finally, we show that repeat-free elastic founder graphs admit a reduction to Wheeler graphs in polynomial time.

preprint2022arXiv

Fast, Flexible, and Exact Minimum Flow Decompositions via ILP

Minimum flow decomposition (MFD) (the problem of finding a minimum set of paths that perfectly decomposes a flow) is a classical problem in Computer Science, and variants of it are powerful models in multiassembly problems in Bioinformatics (e.g. RNA assembly). However, because this problem and its variants are NP-hard, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not mini-mal or do not perfectly decompose the flow. Many RNA assemblers also use integer linear programming(ILP) formulations of such practical variants, having the major limitation they need to encode all the potentially exponentially many solution paths. Moreover, the only exact solver for MFD does not scale to large instances and cannot be efficiently generalized to practical MFD variants. In this work, we provide the first practical ILP formulation for MFD (and thus the first fast and exact solver for MFD), based on encoding all of the exponentially many solution paths using only a quadratic number of variables. On both simulated and real flow graphs, our approach solves any instance in under 13 seconds. We also show that our ILP formulation can be easily and efficiently adapted for many practical variants, such as incorporating longer or paired-end reads or minimizing flow errors. We hope that our results can remove the current tradeoff between the complexity of a multi assembly model and its tractability and can lie at the core of future practical RNA assembly tools.

preprint2022arXiv

Optimizing Safe Flow Decompositions in DAGs

Network flow is one of the most studied combinatorial optimization problems having innumerable applications. Any flow on a directed acyclic graph $G$ having $n$ vertices and $m$ edges can be decomposed into a set of $O(m)$ paths. In some applications, each solution (decomposition) corresponds to some particular data that generated the original flow. Given the possibility of multiple optimal solutions, no optimization criterion ensures the identification of the correct decomposition. Hence, recently flow decomposition was studied [RECOMB22] in the Safe and Complete framework, particularly for RNA Assembly. They presented a characterization of the safe paths, resulting in an $O(mn+out_R)$ time algorithm to compute all safe paths, where $out_R$ is the size of the raw output reporting each safe path explicitly. They also showed that $out_R$ can be $Ω(mn^2)$ in the worst case but $O(m)$ in the best case. Hence, they further presented an algorithm to report a concise representation of the output $out_C$ in $O(mn+out_C)$ time, where $out_C$ can be $Ω(mn)$ in the worst case but $O(m)$ in the best case. In this work, we study how different safe paths interact, resulting in optimal output-sensitive algorithms requiring $O(m+out_R)$ and $O(m+out_C)$ time for computing the existing representations of the safe paths. Further, we propose a new characterization of the safe paths resulting in the {\em optimal} representation of safe paths $out_O$, which can be $Ω(mn)$ in the worst case but requires optimal $O(1)$ space for every safe path reported, with a near-optimal computation algorithm. Overall we further develop the theory of safe and complete solutions for the flow decomposition problem, giving an optimal algorithm for the explicit representation, and a near-optimal algorithm for the optimal representation of the safe paths

preprint2022arXiv

Safety and Completeness in Flow Decompositions for RNA Assembly

Decomposing a network flow into weighted paths has numerous applications. Some applications require any decomposition that is optimal w.r.t. some property such as number of paths, robustness, or length. Many bioinformatic applications require a specific decomposition where the paths correspond to some underlying data that generated the flow. For real inputs, no optimization criteria guarantees to uniquely identify the correct decomposition. Therefore, we propose to report safe paths, i.e., subpaths of at least one path in every flow decomposition. Ma, Zheng, and Kingsford [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, i.e., non-identifiability. Later [RECOMB 2021], they gave a quadratic-time algorithm based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe. We give the first local characterization of safe paths for flow decompositions in directed acyclic graphs (DAGs), leading to a practical algorithm for finding the complete set of safe paths. We evaluated our algorithms against the trivial safe algorithms (unitigs, extended unitigs) and the popularly used heuristic (greedy-width) for flow decomposition on RNA transcripts datasets. Despite maintaining perfect precision our algorithm reports significantly higher coverage ($\approx 50\%$ more) than trivial safe algorithms. The greedy-width algorithm though reporting a better coverage, has significantly lower precision on complex graphs. Overall, our algorithm outperforms (by $\approx 20\%$) greedy-width on a unified metric (F-Score) when the dataset has significant number of complex graphs. Moreover, it has superior time ($3-5\times$) and space efficiency ($1.2-2.2\times$), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.

preprint2021arXiv

The Labeled Direct Product Optimally Solves String Problems on Graphs

Suffix trees are an important data structure at the core of optimal solutions to many fundamental string problems, such as exact pattern matching, longest common substring, matching statistics, and longest repeated substring. Recent lines of research focused on extending some of these problems to vertex-labeled graphs, although using ad-hoc approaches which in some cases do not generalize to all input graphs. In the absence of a ubiquitous tool like the suffix tree for labeled graphs, we introduce the labeled direct product of two graphs as a general tool for obtaining optimal algorithms: we obtain conceptually simpler algorithms for the quadratic problems of string matching (SMLG) and longest common substring (LCSP) in labeled graphs. Our algorithms are also more efficient, since they run in time linear in the size of the labeled product graph, which may be smaller than quadratic for some inputs, and their run-time is predictable, because the size of the labeled direct product graph can be precomputed efficiently. We also solve LCSP on graphs containing cycles, which was left as an open problem by Shimohira et al. in 2011. To show the power of the labeled product graph, we also apply it to solve the matching statistics (MSP) and the longest repeated string (LRSP) problems in labeled graphs. Moreover, we show that our (worst-case quadratic) algorithms are also optimal, conditioned on the Orthogonal Vectors Hypothesis. Finally, we complete the complexity picture around LRSP by studying it on undirected graphs.

preprint2020arXiv

Computing all $s$-$t$ bridges and articulation points simplified

Given a directed graph $G$ and a pair of nodes $s$ and $t$, an $s$-$t$ bridge of $G$ is an edge whose removal breaks all $s$-$t$ paths of $G$. Similarly, an $s$-$t$ articulation point of $G$ is a node whose removal breaks all $s$-$t$ paths of $G$. Computing the sequence of all $s$-$t$ bridges of $G$ (as well as the $s$-$t$ articulation points) is a basic graph problem, solvable in linear time using the classical min-cut algorithm. When dealing with cuts of unit size ($s$-$t$ bridges) this algorithm can be simplified to a single graph traversal from $s$ to $t$ avoiding an arbitrary $s$-$t$ path, which is interrupted at the $s$-$t$ bridges. Further, the corresponding proof is also simplified making it independent of the theory of network flows.

preprint2020arXiv

Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails

We consider the following string matching problem on a node-labeled graph $G=(V,E)$: given a pattern string $P$, decide whether there exists a path in $G$ whose concatenation of node labels equals $P$. This is a basic primitive in various problems in bioinformatics, graph databases, or networks. The hardness results of Backurs and Indyk (FOCS 2016) imply that this problem cannot be solved in better than $O(|E||P|)$ time, under the Orthogonal Vectors Hypothesis (OVH), and this holds even under various restrictions on the graph (Equi et al., ICALP 2019). In this paper we consider its offline version, namely the one in which we are allowed to index the graph in order to support time-efficient string matching queries. Indeed, it was tantalizing in the string matching community to believe that sub-quadratic time queries can be achieved, e.g. at the cost of a high-degree polynomial-time indexing. We disprove this belief, showing that, under OVH, no polynomial-time index can support querying $P$ in time $O(|E|^δ|P|^β)$, with either $δ< 1$ or $β< 1$. We prove this tight bound employing a known self-reducibility technique, e.g. from the field of dynamic algorithms, which translates conditional lower bounds for an online problem to its offline version. As a side-contribution, we formalize this technique with the notion of linear independent-components reduction, allowing for a simple proof of our result. As another illustration of our technique, we also translate the quadratic conditional lower bound of Backurs and Indyk (STOC 2015) for the problem of matching a query string inside a text, under edit distance. We obtain an analogous tight quadratic lower bound for its offline version, improving the recent result of Cohen-Addad, Feuilloley and Starikovskaya (SODA 2019), but with a slightly different boundary condition.

preprint2020arXiv

Linear Time Construction of Indexable Founder Block Graphs

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SAR-CoV-2 strains are reported. An MSA of size $410\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only $3\%$ of the size of the MSA.