Source author record

Lucia Williams

Lucia Williams appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Genomics · Data Structures and Algorithms · math.CO · math.OC · Computational Geometry

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators

preprint · 2023 · arXiv

Minimum Flow Decomposition in Graphs with Cycles using Integer Linear Programming

Minimum flow decomposition (MFD), the problem of finding a minimum set of weighted source-to-sink paths that perfectly decomposes a flow, is a classical problem in Computer Science, and variants of it are powerful models in different fields such as Bioinformatics and Transportation. Even on acyclic graphs, the problem is NP-hard, and most practical solutions have been via heuristics or approximations. While there is an extensive body of research on acyclic graphs, currently, there is no exact solution on graphs with cycles. In this paper, we present the first ILP formulation for three natural variants of the MFD problem in graphs with cycles, asking for a decomposition consisting only of weighted source-to-sink paths or cycles, trails, and walks, respectively. On three datasets of increasing levels of complexity from both Bioinformatics and Transportation, our approaches solve any instance in under 10 minutes. Our implementations are freely available at github.com/algbio/MFD-ILP.
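To make "perfectly decomposes a flow" concrete, here is a minimal sketch (not taken from the paper or its repository) that checks whether a set of weighted paths sums exactly to a given flow on every edge. The function name `decomposes`, the edge-dictionary representation, and the toy graph are all assumptions made for illustration.

```python
from collections import defaultdict

def decomposes(flow, weighted_paths):
    """Return True if the weighted paths add up exactly to `flow` on every edge.

    flow:           dict mapping (u, v) edges to flow values (hypothetical format)
    weighted_paths: list of (weight, [node, node, ...]) pairs
    """
    total = defaultdict(int)
    for weight, path in weighted_paths:
        # Each consecutive pair of nodes on the path is one edge it uses.
        for edge in zip(path, path[1:]):
            total[edge] += weight
    edges = set(flow) | set(total)
    return all(flow.get(e, 0) == total.get(e, 0) for e in edges)

# Toy flow of value 6 from "s" to "t" (illustrative data only).
flow = {("s", "a"): 4, ("s", "b"): 2, ("a", "t"): 3, ("a", "b"): 1, ("b", "t"): 3}

good = [(3, ["s", "a", "t"]), (2, ["s", "b", "t"]), (1, ["s", "a", "b", "t"])]
bad = [(4, ["s", "a", "t"]), (2, ["s", "b", "t"])]

print(decomposes(flow, good))  # True
print(decomposes(flow, bad))   # False
```

Because the check only walks consecutive node pairs, it applies equally to paths, trails, and walks. Finding a decomposition with the fewest paths is the hard part that the ILP formulations address; this sketch only verifies a candidate.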

preprint · 2022 · arXiv

Fast, Flexible, and Exact Minimum Flow Decompositions via ILP

Minimum flow decomposition (MFD) (the problem of finding a minimum set of paths that perfectly decomposes a flow) is a classical problem in Computer Science, and variants of it are powerful models in multiassembly problems in Bioinformatics (e.g., RNA assembly). However, because this problem and its variants are NP-hard, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Many RNA assemblers also use integer linear programming (ILP) formulations of such practical variants, with the major limitation that they need to encode all of the potentially exponentially many solution paths. Moreover, the only exact solver for MFD does not scale to large instances and cannot be efficiently generalized to practical MFD variants. In this work, we provide the first practical ILP formulation for MFD (and thus the first fast and exact solver for MFD), based on encoding all of the exponentially many solution paths using only a quadratic number of variables. On both simulated and real flow graphs, our approach solves any instance in under 13 seconds. We also show that our ILP formulation can be easily and efficiently adapted for many practical variants, such as incorporating longer or paired-end reads or minimizing flow errors. We hope that our results can remove the current tradeoff between the complexity of a multiassembly model and its tractability and can lie at the core of future practical RNA assembly tools.

preprint · 2022 · arXiv

Safety and Completeness in Flow Decompositions for RNA Assembly

Decomposing a network flow into weighted paths has numerous applications. Some applications require any decomposition that is optimal w.r.t. some property such as number of paths, robustness, or length. Many bioinformatic applications require a specific decomposition where the paths correspond to some underlying data that generated the flow. For real inputs, no optimization criterion is guaranteed to uniquely identify the correct decomposition. Therefore, we propose to report safe paths, i.e., subpaths of at least one path in every flow decomposition. Ma, Zheng, and Kingsford [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, i.e., non-identifiability. Later [RECOMB 2021], they gave a quadratic-time algorithm based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe. We give the first local characterization of safe paths for flow decompositions in directed acyclic graphs (DAGs), leading to a practical algorithm for finding the complete set of safe paths. We evaluated our algorithms against the trivial safe algorithms (unitigs, extended unitigs) and the popularly used heuristic (greedy-width) for flow decomposition on RNA transcript datasets. Despite maintaining perfect precision, our algorithm reports significantly higher coverage ($\approx 50\%$ more) than trivial safe algorithms. The greedy-width algorithm, though reporting better coverage, has significantly lower precision on complex graphs. Overall, our algorithm outperforms greedy-width (by $\approx 20\%$) on a unified metric (F-Score) when the dataset has a significant number of complex graphs. Moreover, it has superior time ($3-5\times$) and space efficiency ($1.2-2.2\times$), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.
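For readers unfamiliar with the greedy-width heuristic used as a baseline above, the following is a minimal sketch of the idea as it is commonly described: repeatedly extract the source-to-sink path with the largest bottleneck and subtract that bottleneck from the flow. It is not the authors' implementation; the function names `widest_path` and `greedy_width`, the edge-dictionary input, and the toy graph are assumptions, and the input is assumed to be a DAG with integer flow values that satisfy conservation.

```python
from collections import defaultdict

def widest_path(flow, source, sink):
    """Return (width, path) for the source-to-sink path with the largest bottleneck."""
    succ, indeg = defaultdict(list), defaultdict(int)
    for (u, v), f in flow.items():
        if f > 0:                      # only edges with remaining flow count
            succ[u].append(v)
            indeg[v] += 1
    best = {source: (float("inf"), None)}   # node -> (width, predecessor)
    # Kahn's algorithm: process nodes in topological order (input must be a DAG).
    ready = [u for u in list(succ) if indeg[u] == 0]
    while ready:
        u = ready.pop()
        for v in succ[u]:
            if u in best:
                width = min(best[u][0], flow[(u, v)])
                if v not in best or width > best[v][0]:
                    best[v] = (width, u)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if sink not in best:
        return 0, []
    path = [sink]
    while path[-1] != source:          # follow predecessors back to the source
        path.append(best[path[-1]][1])
    return best[sink][0], path[::-1]

def greedy_width(flow, source, sink):
    """Greedily peel off widest paths until no flow remains."""
    remaining = dict(flow)             # leave the caller's flow untouched
    paths = []
    while True:
        width, path = widest_path(remaining, source, sink)
        if width == 0:
            break
        for edge in zip(path, path[1:]):
            remaining[edge] -= width
        paths.append((width, path))
    return paths

# Toy flow of value 6 from "s" to "t" (illustrative data only).
flow = {("s", "a"): 4, ("s", "b"): 2, ("a", "t"): 3, ("a", "b"): 1, ("b", "t"): 3}
result = greedy_width(flow, "s", "t")
print(result)
# [(3, ['s', 'a', 't']), (2, ['s', 'b', 't']), (1, ['s', 'a', 'b', 't'])]
```

Each round removes all remaining flow from at least one edge, so the loop terminates. The heuristic always returns some decomposition, but nothing guarantees that its paths match the transcripts that generated the flow, which is the precision issue the abstract refers to.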

preprint · 2020 · arXiv

Reconstructing Embedded Graphs from Persistence Diagrams

The persistence diagram (PD) is an increasingly popular topological descriptor. By encoding the size and prominence of topological features at varying scales, the PD provides important geometric and topological information about a space. Recent work has shown that well-chosen (finite) sets of PDs can differentiate between geometric simplicial complexes, providing a method for representing complex shapes using a finite set of descriptors. A related inverse problem is the following: given a set of PDs (or an oracle we can query for persistence diagrams), what is the underlying geometric simplicial complex? In this paper, we present an algorithm for reconstructing embedded graphs in $\mathbb{R}^d$ (plane graphs in $\mathbb{R}^2$) with $n$ vertices from $n^2 - n + d + 1$ directional (augmented) PDs. Additionally, we empirically validate the correctness and time-complexity of our algorithm in $\mathbb{R}^2$ on randomly generated plane graphs using our implementation, and explain the numerical limitations of implementing our algorithm.
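As a quick worked instance of the count stated above (the values $n = 10$ and $d = 2$ are chosen here purely for illustration), a plane graph with ten vertices would need

$$n^2 - n + d + 1 = 10^2 - 10 + 2 + 1 = 93$$

directional (augmented) PDs, so the number of diagrams grows quadratically in the number of vertices and only additively in the dimension.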