Source author record

Johannes Fischer

Johannes Fischer appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Artificial Intelligence Computation and Language Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

13works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2021arXiv

Linear Time Runs over General Ordered Alphabets

A run in a string is a maximal periodic substring. For example, the string $\texttt{bananatree}$ contains the runs $\texttt{anana} = (\texttt{an})^{3/2}$ and $\texttt{ee} = \texttt{e}^2$. There are less than $n$ runs in any length-$n$ string, and computing all runs for a string over a linearly-sortable alphabet takes $\mathcal{O}(n)$ time (Bannai et al., SODA 2015). Kosolobov conjectured that there also exists a linear time runs algorithm for general ordered alphabets (Inf. Process. Lett. 2016). The conjecture was almost proven by Crochemore et al., who presented an $\mathcal{O}(nα(n))$ time algorithm (where $α(n)$ is the extremely slowly growing inverse Ackermann function). We show how to achieve $\mathcal{O}(n)$ time by exploiting combinatorial properties of the Lyndon array, thus proving Kosolobov's conjecture.

preprint2020arXiv

LCP-Aware Parallel String Sorting

When lexicographically sorting strings, it is not always necessary to inspect all symbols. For example, the lexicographical rank of "europar" amongst the strings "eureka", "eurasia", and "excells" only depends on its so called relevant prefix "euro". The distinguishing prefix size $D$ of a set of strings is the number of symbols that actually need to be inspected to establish the lexicographical ordering of all strings. Efficient string sorters should be $D$-aware, i.e. their complexity should depend on $D$ rather than on the total number $N$ of all symbols in all strings. While there are many $D$-aware sorters in the sequential setting, there appear to be no such results in the PRAM model. We propose a framework yielding a $D$-aware modification of any existing PRAM string sorter. The derived algorithms are work-optimal with respect to their original counterpart: If the original algorithm requires $O(w(N))$ work, the derived one requires $O(w(D))$ work. The execution time increases only by a small factor that is logarithmic in the length of the longest relevant prefix. Our framework universally works for deterministic and randomized algorithms in all variations of the PRAM model, such that future improvements in ($D$-unaware) parallel string sorting will directly result in improvements in $D$-aware parallel string sorting.

preprint2020arXiv

Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability

With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.

preprint2016arXiv

Engineering a Distributed Full-Text Index

We present a distributed full-text index for big data applications in a distributed environment. Our index can answer different types of pattern matching queries (existential, counting and enumeration). We perform experiments on inputs up to 100 GiB using up to 512 processors, and compare our index with the distributed suffix array by Arroyuelo et al. [Parall. Comput. 40(9): 471--495, 2014]. The result is that our index answers counting queries up to 5.5 times faster than the distributed suffix array, while using about the same space. We also provide a succinct variant of our index that uses only one third of the memory compared with our non-succinct variant, at the expense of only 20% slower query times.

preprint2016arXiv

On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching

We present parallel algorithms for exact and approximate pattern matching with suffix arrays, using a CREW-PRAM with $p$ processors. Given a static text of length $n$, we first show how to compute the suffix array interval of a given pattern of length $m$ in $O(\frac{m}{p}+ \lg p + \lg\lg p\cdot\lg\lg n)$ time for $p \le m$. For approximate pattern matching with $k$ differences or mismatches, we show how to compute all occurrences of a given pattern in $O(\frac{m^kσ^k}{p}\max\left(k,\lg\lg n\right)\!+\!(1+\frac{m}{p}) \lg p\cdot \lg\lg n + \text{occ})$ time, where $σ$ is the size of the alphabet and $p \le σ^k m^k$. The workhorse of our algorithms is a data structure for merging suffix array intervals quickly: Given the suffix array intervals for two patterns $P$ and $P'$, we present a data structure for computing the interval of $PP'$ in $O(\lg\lg n)$ sequential time, or in $O(1+\lg_p\lg n)$ parallel time. All our data structures are of size $O(n)$ bits (in addition to the suffix array).

preprint2015arXiv

Approximating LZ77 via Small-Space Multiple-Pattern Matching

We generalize Karp-Rabin string matching to handle multiple patterns in $\mathcal{O}(n \log n + m)$ time and $\mathcal{O}(s)$ space, where $n$ is the length of the text and $m$ is the total length of the $s$ patterns, returning correct answers with high probability. As a prime application of our algorithm, we show how to approximate the LZ77 parse of a string of length $n$. If the optimal parse consists of $z$ phrases, using only $\mathcal{O}(z)$ working space we can return a parse consisting of at most $(1+\varepsilon)z$ phrases in $\mathcal{O}(\varepsilon^{-1}n\log n)$ time, for any $\varepsilon\in (0,1]$. As previous quasilinear-time algorithms for LZ77 use $Ω(n/\textrm{polylog }n)$ space, but $z$ can be exponentially small in $n$, these improvements in space are substantial.

preprint2015arXiv

Lempel Ziv Computation In Small Space (LZ-CISS)

For both the Lempel Ziv 77- and 78-factorization we propose algorithms generating the respective factorization using $(1+ε) n \lg n + O(n)$ bits (for any positive constant $ε\le 1$) working space (including the space for the output) for any text of size \$n\$ over an integer alphabet in $O(n / ε^{2})$ time.

preprint2013arXiv

Alphabet-Dependent String Searching with Wexponential Search Trees

It is widely assumed that $O(m+\lg σ)$ is the best one can do for finding a pattern of length $m$ in a compacted trie storing strings over an alphabet of size $σ$, if one insists on linear-size data structures and deterministic worst-case running times [Cole et al., ICALP'06]. In this article, we first show that a rather straightforward combination of well-known ideas yields $O(m+\lg\lg σ)$ deterministic worst-case searching time for static tries. Then we move on to dynamic tries, where we achieve a worst-case bound of $O(m+\frac{\lg^{2}\lgσ}{\lg\lg\lgσ})$ per query or update, which should again be compared to the previously known $O(m+\lgσ)$ deterministic worst-case bounds [Cole et al., ICALP'06], and to the alphabet \emph{in}dependent $O(m+\sqrt{\lg n/\lg\lg n})$ deterministic worst-case bounds [Andersson and Thorup, SODA'01], where $n$ is the number of nodes in the trie. The basis of our update procedure is a weighted variant of exponential search trees which, while simple, might be of independent interest. As one particular application, the above bounds (static and dynamic) apply to suffix trees. There, an update corresponds to pre- or appending a letter to the text, and an additional goal is to do the updates quicker than rematching entire suffixes. We show how to do this in $O(\lg\lg n + \frac{\lg^{2}\lgσ}{\lg\lg\lgσ})$ time, which improves the previously known $O(\lg n)$ bound [Amir et al., SPIRE'05].

preprint2013arXiv

LZ-Compressed String Dictionaries

We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data compression algorithm. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often outperforming the existing alternatives, especially on dictionaries containing many repeated substrings. Our query times remain competitive.

preprint2011arXiv

Combined Data Structure for Previous- and Next-Smaller-Values

Let $A$ be a static array storing $n$ elements from a totally ordered set. We present a data structure of optimal size at most $n\log_2(3+2\sqrt{2})+o(n)$ bits that allows us to answer the following queries on $A$ in constant time, without accessing $A$: (1) previous smaller value queries, where given an index $i$, we wish to find the first index to the left of $i$ where $A$ is strictly smaller than at $i$, and (2) next smaller value queries, which search to the right of $i$. As an additional bonus, our data structure also allows to answer a third kind of query: given indices $i<j$, find the position of the minimum in $A[i..j]$. Our data structure has direct consequences for the space-efficient storage of suffix trees.

preprint2011arXiv

Inducing the LCP-Array

We show how to modify the linear-time construction algorithm for suffix arrays based on induced sorting (Nong et al., DCC'09) such that it computes the array of longest common prefixes (LCP-array) as well. Practical tests show that this outperforms recent LCP-array construction algorithms (Gog and Ohlebusch, ALENEX'11).

preprint2010arXiv

LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations

LRM-Trees are an elegant way to partition a sequence of values into sorted consecutive blocks, and to express the relative position of the first element of each block within a previous block. They were used to encode ordinal trees and to index integer arrays in order to support range minimum queries on them. We describe how they yield many other convenient results in a variety of areas, from data structures to algorithms: some compressed succinct indices for range minimum queries; a new adaptive sorting algorithm; and a compressed succinct data structure for permutations supporting direct and indirect application in time all the shortest as the permutation is compressible.

preprint2010arXiv

Wee LCP

We prove that longest common prefix (LCP) information can be stored in much less space than previously known. More precisely, we show that in the presence of the text and the suffix array, o(n) additional bits are sufficient to answer LCP-queries asymptotically in the same time that is needed to retrieve an entry from the suffix array. This yields the smallest compressed suffix tree with sub-logarithmic navigation time.

Johannes Fischer

What is connected

Connect this record

See the researcher in context

Building this map preview

13 published item(s)

Linear Time Runs over General Ordered Alphabets

LCP-Aware Parallel String Sorting

Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability

Engineering a Distributed Full-Text Index

On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching

Approximating LZ77 via Small-Space Multiple-Pattern Matching

Lempel Ziv Computation In Small Space (LZ-CISS)

Alphabet-Dependent String Searching with Wexponential Search Trees

LZ-Compressed String Dictionaries

Combined Data Structure for Previous- and Next-Smaller-Values

Inducing the LCP-Array

LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations

Wee LCP