Source author record

Veli Mäkinen

Veli Mäkinen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms; Computational Complexity; Computational Engineering, Finance, and Science; Quantitative Methods

Catalog footprint: 12 works, 4 topics, 4 close collaborators

Preprint, 2022, arXiv

Algorithms and Complexity on Indexing Founder Graphs

We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower bounds cover even deterministic graphs with binary alphabet, but there naturally exist also graph classes that are easy to index: e.g., Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform-based indexing scheme. However, it is NP-complete to recognize whether a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019). We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments (MSAs). Elastic degenerate strings (Bernardini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses, repeat-free and semi-repeat-free graphs, that are easy to index. We give a linear time algorithm to construct a repeat-free non-elastic founder graph from a gapless MSA, and (parameterized) near-linear time algorithms to construct semi-repeat-free (repeat-free, respectively) elastic founder graphs from general MSAs. Finally, we show that repeat-free elastic founder graphs admit a reduction to Wheeler graphs in polynomial time.

Preprint, 2020, arXiv

Chaining with overlaps revisited

Chaining algorithms aim to form a semi-global alignment of two sequences based on a set of anchoring local alignments as input. Depending on the optimization criteria and the exact definition of a chain, there are several $O(n \log n)$ time algorithms to solve this problem optimally, where $n$ is the number of input anchors. In this paper, we focus on a formulation allowing the anchors to overlap in a chain. This formulation was studied by Shibuya and Kurochkin (WABI 2003), but their algorithm comes with no proof of correctness. We revisit and modify their algorithm to consider a strict definition of precedence relation on anchors, adding the derivation required to establish the correctness of the resulting algorithm, which runs in $O(n \log^2 n)$ time on anchors formed by exact matches. With the more relaxed definition of precedence relation considered by Shibuya and Kurochkin, or when anchors are non-nested such as matches of uniform length ($k$-mers), the algorithm takes $O(n \log n)$ time. We also establish a connection between chaining with overlaps and the widely studied longest common subsequence (LCS) problem.

Preprint, 2020, arXiv

Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails

We consider the following string matching problem on a node-labeled graph $G=(V,E)$: given a pattern string $P$, decide whether there exists a path in $G$ whose concatenation of node labels equals $P$. This is a basic primitive in various problems in bioinformatics, graph databases, or networks. The hardness results of Backurs and Indyk (FOCS 2016) imply that this problem cannot be solved in better than $O(|E||P|)$ time, under the Orthogonal Vectors Hypothesis (OVH), and this holds even under various restrictions on the graph (Equi et al., ICALP 2019). In this paper we consider its offline version, namely the one in which we are allowed to index the graph in order to support time-efficient string matching queries. Indeed, it was tantalizing in the string matching community to believe that sub-quadratic time queries can be achieved, e.g. at the cost of a high-degree polynomial-time indexing. We disprove this belief, showing that, under OVH, no polynomial-time index can support querying $P$ in time $O(|E|^{\delta}|P|^{\beta})$, with either $\delta < 1$ or $\beta < 1$. We prove this tight bound employing a known self-reducibility technique, e.g. from the field of dynamic algorithms, which translates conditional lower bounds for an online problem to its offline version. As a side-contribution, we formalize this technique with the notion of linear independent-components reduction, allowing for a simple proof of our result. As another illustration of our technique, we also translate the quadratic conditional lower bound of Backurs and Indyk (STOC 2015) for the problem of matching a query string inside a text, under edit distance. We obtain an analogous tight quadratic lower bound for its offline version, improving the recent result of Cohen-Addad, Feuilloley and Starikovskaya (SODA 2019), but with a slightly different boundary condition.

Preprint, 2020, arXiv

Linear Time Construction of Indexable Founder Block Graphs

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SARS-CoV-2 strains are reported. An MSA of size $410\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels are 12 and 34968, respectively. The index on the graph takes only $3\%$ of the size of the MSA.
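As a rough illustration of how a segmentation of an MSA induces a founder block graph, the following Python sketch builds the nodes and edges for a gapless MSA and a given list of segment boundaries. It is a naive construction under stated assumptions, not the linear time algorithm of the abstract: the function name `founder_block_graph`, the `boundaries` parameter, and the rule that an edge joins two labels appearing consecutively in some row are all choices made for this example.

```python
from typing import List, Set, Tuple

Node = Tuple[int, str]  # (block index, founder segment label)


def founder_block_graph(
    msa: List[str], boundaries: List[int]
) -> Tuple[Set[Node], Set[Tuple[Node, Node]]]:
    """Build a block graph from a gapless MSA and segment cut positions.

    `boundaries` lists the cut positions in increasing order, so
    [0, 2, 4] means the two segments [0, 2) and [2, 4).
    Assumption for this sketch: two nodes in consecutive blocks are
    joined by an edge when some MSA row spells both labels in a row.
    """
    segments = list(zip(boundaries, boundaries[1:]))
    nodes: Set[Node] = set()
    edges: Set[Tuple[Node, Node]] = set()
    for row in msa:
        prev = None
        for block, (lo, hi) in enumerate(segments):
            node = (block, row[lo:hi])  # distinct labels collapse into one node
            nodes.add(node)
            if prev is not None:
                edges.add((prev, node))
            prev = node
    return nodes, edges


msa = ["ACGT", "ACTT", "GCGT"]
nodes, edges = founder_block_graph(msa, [0, 2, 4])
print(sorted(nodes))
print(sorted(edges))
```

Every row of the MSA is a path in the resulting graph by construction; checking whether the graph is segment repeat-free is a separate step that this sketch does not attempt.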

Preprint, 2020, arXiv

On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree

Exact pattern matching in labeled graphs is the problem of searching paths of a graph $G=(V,E)$ that spell the same string as the pattern $P[1..m]$. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks, where the nodes of some paths must match a sequence of labels or types. We describe a simple conditional lower bound that, for any constant $\varepsilon>0$, an $O(|E|^{1 - \varepsilon} \, m)$-time or an $O(|E| \, m^{1 - \varepsilon})$-time algorithm for exact pattern matching on graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. The result holds even if restricted to undirected graphs of maximum degree three or directed acyclic graphs of maximum sum of indegree and outdegree three. Although a conditional lower bound of this kind can be somehow derived from previous results (Backurs and Indyk, FOCS'16), we give a direct reduction from SETH for dissemination purposes, as the result might interest researchers from several areas, such as computational biology, graph databases, and graph mining, as mentioned before. Indeed, as approximate pattern matching on graphs can be solved in $O(|E|\,m)$ time, exact and approximate matching are thus equally hard (quadratic time) on graphs under the SETH assumption. In comparison, the same problems restricted to strings have linear time vs quadratic time solutions, respectively, where the latter ones have a matching SETH lower bound on computing the edit distance of two strings (Backurs and Indyk, STOC'15).
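To make the quadratic upper bound concrete, here is a minimal Python sketch of exact pattern matching on a directed, node-labeled graph. It assumes one character per node and allows a path to revisit nodes; the function name `graph_matches` and the input encoding (a label dictionary plus an edge list) are choices made for this example only. Each pattern character triggers at most one scan over the outgoing edges of the current frontier, which gives roughly $O(|V| + |E|\,m)$ time.

```python
from typing import Dict, Iterable, List, Tuple


def graph_matches(
    labels: Dict[int, str], edges: Iterable[Tuple[int, int]], pattern: str
) -> bool:
    """Return True if some path in the graph spells `pattern`.

    Assumptions of this sketch: the graph is directed, every node
    carries a single-character label, and paths may repeat nodes.
    """
    if not pattern:
        return True
    out: Dict[int, List[int]] = {v: [] for v in labels}
    for u, v in edges:
        out[u].append(v)
    # frontier = nodes at which a path spelling the current prefix can end
    frontier = {v for v, c in labels.items() if c == pattern[0]}
    for ch in pattern[1:]:
        frontier = {w for v in frontier for w in out[v] if labels[w] == ch}
        if not frontier:
            return False
    return bool(frontier)


labels = {0: "a", 1: "b", 2: "a", 3: "c"}
edges = [(0, 1), (1, 2), (2, 3), (1, 0)]
print(graph_matches(labels, edges, "abac"))  # path 0 -> 1 -> 2 -> 3
print(graph_matches(labels, edges, "bb"))
```

The lower bound in the abstract says that, under SETH, no algorithm can improve substantially on this kind of edge-by-pattern scan.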

Preprint, 2020, arXiv

Tailoring r-index for metagenomics

A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics the reference collection consists of a set of species with each species represented by its highly similar strains. It has been recently shown that accurate read assignment can be achieved with $k$-mer hashing-based pseudoalignment: A read is assigned to species A if each of its $k$-mer hits to the reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an $r$-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as concatenation of the set of species, as well as for each species. Given $t$ species whose concatenation length is $n$, and whose Burrows-Wheeler transform contains $r$ runs, our first solution, based on a grammar-compressed document array with precomputed queries at nonterminal symbols, reports the frequencies for the ${\tt ndoc}$ distinct documents in which the pattern of length $m$ occurs in ${\cal O}(m + \log(n){\tt ndoc})$ time. Our second solution is also based on a grammar-compressed document array, but enhanced with bitvectors, and reports the frequencies in ${\cal O}(m + ((t/w)\log n + \log(n/r)){\tt ndoc})$ time, over a machine with wordsize $w$. Our third solution, based on the interleaved LCP array, answers the same query in ${\cal O}(m + \log(n/r){\tt ndoc})$ time. We implemented our solutions and tested them on real-world and synthetic datasets. The results show that all the solutions are fast on highly-repetitive data, and the size overhead introduced by the indexes is comparable with the size of the $r$-index.
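The pseudoalignment rule quoted above can be sketched with a plain hash table in Python. This is only an illustration of the assignment rule, not of the $r$-index-based solutions in the abstract; the function names `build_kmer_index` and `pseudoalign`, and the choice to ignore $k$-mers that have no hit at all, are assumptions made for this example.

```python
from typing import Dict, List, Optional, Set


def build_kmer_index(strains: Dict[str, List[str]], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of species that contain it in some strain."""
    index: Dict[str, Set[str]] = {}
    for species, sequences in strains.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                index.setdefault(seq[i:i + k], set()).add(species)
    return index


def pseudoalign(read: str, index: Dict[str, Set[str]], k: int) -> Optional[str]:
    """Assign `read` to a species if every k-mer hit falls on that species only.

    Assumption of this sketch: k-mers absent from the index are skipped,
    and a read without any hit is left unassigned (None).
    """
    candidate: Optional[str] = None
    for i in range(len(read) - k + 1):
        hit = index.get(read[i:i + k])
        if hit is None:
            continue  # no hit in the reference collection
        if len(hit) != 1:
            return None  # k-mer shared between species
        (species,) = hit
        if candidate is None:
            candidate = species
        elif candidate != species:
            return None  # hits disagree on the species
    return candidate


strains = {"A": ["ACGTAC", "ACGTTC"], "B": ["GGGTTT"]}
index = build_kmer_index(strains, k=3)
print(pseudoalign("ACGTA", index, k=3))
print(pseudoalign("CGTT", index, k=3))
```

A hash table of all $k$-mers fixes $k$ in advance and can be large; the solutions described in the abstract instead answer document listing with frequencies queries on a compressed index.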

Preprint, 2016, arXiv

Fully Dynamic de Bruijn Graphs

We present a space- and time-efficient fully dynamic implementation of de Bruijn graphs, which can also support fixed-length jumbled pattern matching.

Preprint, 2016, arXiv

Linear-time string indexing and analysis in small space

The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space also lies at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T\in \{1,\ldots,\sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n\log\sigma)$ bits of space, where $\sigma\leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\log\sigma)$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\log\sigma)$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\log\sigma)$ bits of space, took $O(n\log{\log\sigma})$ time for the first two structures, and $O(n\log^{\varepsilon} n)$ time for the third, where $\varepsilon$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
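For readers unfamiliar with the transform itself, the following Python sketch computes the BWT by sorting all rotations and inverts it with the LF-mapping. It is deliberately naive (sorting rotations costs far more than the $O(n)$ time and $O(n\log\sigma)$ bits claimed in the abstract) and is meant only to show what is being built. The function names and the `$` sentinel, which is assumed to be smaller than every character of the text, are choices made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    if any(c <= sentinel for c in text):
        raise ValueError("sentinel must be smaller than every text character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str) -> str:
    """Invert the BWT with the LF-mapping (sentinel assumed smallest)."""
    n = len(last)
    # A stable sort of positions by character yields the first column;
    # the j-th entry is the position in `last` of the j-th first-column char.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for first_pos, last_pos in enumerate(order):
        lf[last_pos] = first_pos
    out = []
    row = 0  # row 0 is the rotation that starts with the sentinel
    for _ in range(n - 1):
        out.append(last[row])  # character preceding the current suffix
        row = lf[row]
    return "".join(reversed(out))


transformed = bwt("banana")
print(transformed)
print(inverse_bwt(transformed))
```

The LF-mapping used in `inverse_bwt` is the same primitive that BWT-based indexes such as the FM-index rely on for backward search.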

Preprint, 2015, arXiv

Interval scheduling maximizing minimum coverage

In the classical interval scheduling type of problems, a set of $n$ jobs, characterized by their start and end time, need to be executed by a set of machines, under various constraints. In this paper we study a new variant in which the jobs need to be assigned to at most $k$ identical machines, such that the minimum number of machines that are busy at the same time is maximized. This is relevant in the context of genome sequencing and haplotyping, specifically when a set of DNA reads aligned to a genome needs to be pruned so that no more than $k$ reads overlap, while maintaining as much read coverage as possible across the entire genome. We show that the problem can be solved in time $\min\left(O(n^2\log k / \log n),O(nk\log k)\right)$ by using max-flows. We also give an $O(n\log n)$-time approximation algorithm with approximation ratio $\rho=\frac{k}{\lfloor k/2 \rfloor}$.
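The objective in this problem can be evaluated with a simple sweep over interval endpoints. The Python sketch below does not solve the optimization problem; it only computes, for a chosen subset of reads, the minimum and maximum coverage over a genome range, so that one can check the constraint (maximum at most $k$) and read off the quantity being maximized (the minimum). The function name `coverage_extremes` and the half-open interval convention are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def coverage_extremes(
    intervals: List[Tuple[int, int]], start: int, end: int
) -> Tuple[int, int]:
    """Return (min, max) number of intervals covering points of [start, end).

    Assumptions of this sketch: intervals are half-open [s, e),
    coordinates are integers, and start < end.
    """
    events: Dict[int, int] = {}
    for s, e in intervals:
        s, e = max(s, start), min(e, end)  # clip to the queried range
        if s < e:
            events[s] = events.get(s, 0) + 1
            events[e] = events.get(e, 0) - 1
    depth, pos = 0, start
    lo, hi = None, 0
    for x in sorted(events):
        if x > pos:  # the stretch [pos, x) is covered by `depth` intervals
            lo = depth if lo is None else min(lo, depth)
            hi = max(hi, depth)
        depth += events[x]
        pos = x
    if pos < end:  # uncovered tail (or no intervals at all)
        lo = depth if lo is None else min(lo, depth)
        hi = max(hi, depth)
    return (lo if lo is not None else 0), hi


reads = [(0, 5), (2, 8), (4, 10)]
print(coverage_extremes(reads, 0, 10))
```

A pruned read set is feasible for a given $k$ when the second value is at most $k$, and among feasible sets the problem asks for one that makes the first value as large as possible.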

Preprint, 2013, arXiv

A Novel Combinatorial Method for Estimating Transcript Expression with RNA-Seq: Bounding the Number of Paths

RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. The problem is usually modeled by a weighted splicing graph whose nodes stand for exons and whose edges stand for split alignments to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the coverages of the graph under various fitness functions, such as least sum of squares. In (Tomescu et al. RECOMB-seq 2013) we showed that under general fitness functions, if we allow a polynomially bounded number of paths in an optimal solution, this problem can be solved in polynomial time by a reduction to a min-cost flow program. In this paper we further refine this problem by asking for a bounded number $k$ of paths that optimally explain the splicing graph. This problem becomes NP-hard in the strong sense, but we give a fast combinatorial algorithm based on dynamic programming for it. In order to obtain a practical tool, we implement three optimizations and heuristics, which achieve better performance on real data, and similar or better performance on simulated data, than state-of-the-art tools Cufflinks, IsoLasso and SLIDE. Our tool, called Traph, is available at http://www.cs.helsinki.fi/gsa/traph/

Preprint, 2011, arXiv

Indexing Finite Language Representation of Population Genotypes

With the recent advances in DNA sequencing, it is now possible to have complete genomes of individuals sequenced and assembled. This rich and focused genotype information can be used for different population-wide studies, now for the first time directly at the whole-genome level. We propose a way to index population genotype information together with the complete genome sequence, so that one can use the index to efficiently align a given sequence to the genome with all plausible genotype recombinations taken into account. This is achieved through converting a multiple alignment of individual genomes into a finite automaton recognizing all strings that can be read from the alignment by switching the sequence at any time. The finite automaton is indexed with an extension of the Burrows-Wheeler transform to allow pattern search inside the plausible recombinant sequences. The size of the index stays limited, because of the high similarity of individual genomes. The index finds applications in variation calling and in primer design. In a variation calling experiment, we found about 1.0% of matches to novel recombinants just with exact matching, and up to 2.4% with approximate matching.
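The language recognized by such an automaton is easy to illustrate in the special case of a gapless alignment: if the sequence may be switched at any column, a pattern occurs at column $j$ exactly when each of its characters appears somewhere in the corresponding column. The Python sketch below scans the alignment directly under that gapless assumption; it uses no index and is not the automaton or BWT extension of the abstract. The function name `recombinant_occurrences` is made up for this example.

```python
from typing import List


def recombinant_occurrences(alignment: List[str], pattern: str) -> List[int]:
    """Columns at which `pattern` can be read while switching rows freely.

    Assumption of this sketch: the alignment is gapless, so all rows
    have equal length and every column holds one character per row.
    """
    columns = [set(col) for col in zip(*alignment)]
    m = len(pattern)
    return [
        j
        for j in range(len(columns) - m + 1)
        if all(pattern[i] in columns[j + i] for i in range(m))
    ]


alignment = ["ACGT", "AGGT", "ACGA"]
# "GGA" occurs in no single row, but it is a recombinant of rows 2 and 3.
print(recombinant_occurrences(alignment, "GGA"))
print(recombinant_occurrences(alignment, "CC"))
```

The scan takes time proportional to the alignment length times the pattern length per query, which is the cost an index over the automaton is meant to avoid.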

Preprint, 2011, arXiv

Pattern Kits

Suppose we have just performed searches in a self-index for two patterns $A$ and $B$ and now we want to search for their concatenation $A B$; how can we best make use of our previous computations? In this paper we consider this problem and, more generally, how we can store a dynamic library of patterns that we can easily manipulate in interesting ways. We give a space- and time-efficient data structure for this problem that is compatible with many of the best self-indexes.

