Source author record

Jakub Radoszewski

Jakub Radoszewski appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Discrete Mathematics Formal Languages and Automata Theory math.CO

Catalog footprint

What is connected

21works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A note on the maximum number of $k$-powers in a finite word

A \emph{power} is a word of the form $\underbrace{uu...u}_{k \; \text{times}}$, where $u$ is a word and $k$ is a positive integer; the power is also called a {\em $k$-power} and $k$ is its {\em exponent}. We prove that for any $k \ge 2$, the maximum number of different non-empty $k$-power factors in a word of length $n$ is between $\frac{n}{k-1}-Θ(\sqrt{n})$ and $\frac{n-1}{k-1}$. We also show that the maximum number of different non-empty power factors of exponent at least 2 in a length-$n$ word is at most $n-1$. Both upper bounds generalize the recent upper bound of $n-1$ on the maximum number of different square factors in a length-$n$ word by Brlek and Li (2022).

preprint2020arXiv

Approximating longest common substring with $k$ mismatches: Theory and practice

In the problem of the longest common substring with $k$ mismatches we are given two strings $X, Y$ and must find the maximal length $\ell$ such that there is a length-$\ell$ substring of $X$ and a length-$\ell$ substring of $Y$ that differ in at most $k$ positions. The length $\ell$ can be used as a robust measure of similarity between $X, Y$. In this work, we develop new approximation algorithms for computing $\ell$ that are significantly more efficient that previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound.

preprint2020arXiv

Circular Pattern Matching with $k$ Mismatches

The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic rotation of $P$. This is the circular pattern matching with $k$ mismatches ($k$-CPM) problem. A multitude of papers have been devoted to solving this problem but, to the best of our knowledge, only average-case upper bounds are known. In this paper, we present the first non-trivial worst-case upper bounds for the $k$-CPM problem. Specifically, we show an $O(nk)$-time algorithm and an $O(n+\frac{n}{m}\,k^4)$-time algorithm. The latter algorithm applies in an extended way a technique that was very recently developed for the $k$-mismatch problem [Bringmann et al., SODA 2019]. A preliminary version of this work appeared at FCT 2019. In this version we improve the time complexity of the main algorithm from $O(n+\frac{n}{m}\,k^5)$ to $O(n+\frac{n}{m}\,k^4)$.

preprint2020arXiv

Counting Distinct Patterns in Internal Dictionary Matching

We consider the problem of preprocessing a text $T$ of length $n$ and a dictionary $\mathcal{D}$ in order to be able to efficiently answer queries $CountDistinct(i,j)$, that is, given $i$ and $j$ return the number of patterns from $\mathcal{D}$ that occur in the fragment $T[i \mathinner{.\,.} j]$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, the dictionary takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total length, which could be $Θ(n\cdot d)$. An $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $\mathcal{O}(\log n)$-approximately in $\tilde{\mathcal{O}}(1)$ time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $2$-approximately in $\tilde{\mathcal{O}}(1)$ time. Using range queries, for any $m$, we give an $\tilde{\mathcal{O}}(\min(nd/m,n^2/m^2)+d)$-size data structure that answers $CountDistinct(i,j)$ queries exactly in $\tilde{\mathcal{O}}(m)$ time. We also consider the special case when the dictionary consists of all square factors of the string. We design an $\mathcal{O}(n \log^2 n)$-size data structure that allows us to count distinct squares in a text fragment $T[i \mathinner{.\,.} j]$ in $\mathcal{O}(\log n)$ time.

preprint2020arXiv

Efficient Representation and Counting of Antipower Factors in Words

A $k$-antipower (for $k \ge 2$) is a concatenation of $k$ pairwise distinct words of the same length. The study of fragments of a word being antipowers was initiated by Fici et al. (ICALP 2016) and first algorithms for computing such fragments were presented by Badkobeh et al. (Inf. Process. Lett., 2018). We address two open problems posed by Badkobeh et al. We propose efficient algorithms for counting and reporting fragments of a word which are $k$-antipowers. They work in $\mathcal{O}(nk \log k)$ time and $\mathcal{O}(nk \log k + C)$ time, respectively, where $C$ is the number of reported fragments. For $k=o(\sqrt{n/\log n})$, this improves the time complexity of $\mathcal{O}(n^2/k)$ of the solution by Badkobeh et al. We also show that the number of different $k$-antipower factors of a word of length $n$ can be computed in $\mathcal{O}(nk^4 \log k \log n)$ time. Our main algorithmic tools are runs and gapped repeats. Finally we present an improved data structure that checks, for a given fragment of a word and an integer $k$, if the fragment is a $k$-antipower. This is a full and extended version of a paper from LATA 2019. In particular, all results about counting different antipowers factors are completely new compared with the LATA proceedings version.

preprint2020arXiv

Internal Quasiperiod Queries

Internal pattern matching requires one to answer queries about factors of a given string. Many results are known on answering internal period queries, asking for the periods of a given factor. In this paper we investigate (for the first time) internal queries asking for covers (also known as quasiperiods) of a given factor. We propose a data structure that answers such queries in $O(\log n \log \log n)$ time for the shortest cover and in $O(\log n (\log \log n)^2)$ time for a representation of all the covers, after $O(n \log n)$ time and space preprocessing.

preprint2020arXiv

k-Approximate Quasiperiodicity under Hamming and Edit Distance

Quasiperiodicity in strings was introduced almost 30 years ago as an extension of string periodicity. The basic notions of quasiperiodicity are cover and seed. A cover of a text $T$ is a string whose occurrences in $T$ cover all positions of $T$. A seed of text $T$ is a cover of a superstring of $T$. In various applications exact quasiperiodicity is still not sufficient due to the presence of errors. We consider approximate notions of quasiperiodicity, for which we allow approximate occurrences in $T$ with a small Hamming, Levenshtein or weighted edit distance. In previous work Sip et al. (2002) and Christodoulakis et al. (2005) showed that computing approximate covers and seeds, respectively, under weighted edit distance is NP-hard. They, therefore, considered restricted approximate covers and seeds which need to be factors of the original string $T$ and presented polynomial-time algorithms for computing them. Further algorithms, considering approximate occurrences with Hamming distance bounded by $k$, were given in several contributions by Guth et al. They also studied relaxed approximate quasiperiods that do not need to cover all positions of $T$. In case of large data the exponents in polynomial time complexity play a crucial role. We present more efficient algorithms for computing restricted approximate covers and seeds. In particular, we improve upon the complexities of many of the aforementioned algorithms, also for relaxed quasiperiods. Our solutions are especially efficient if the number (or total cost) of allowed errors is bounded. We also show NP-hardness of computing non-restricted approximate covers and seeds under Hamming distance. Approximate covers were studied in three recent contributions at CPM over the last three years. However, these works consider a different definition of an approximate cover of $T$.

preprint2020arXiv

The Number of Repetitions in 2D-Strings

The notions of periodicity and repetitions in strings, and hence these of runs and squares, naturally extend to two-dimensional strings. We consider two types of repetitions in 2D-strings: 2D-runs and quartics (quartics are a 2D-version of squares in standard strings). Amir et al. introduced 2D-runs, showed that there are $O(n^3)$ of them in an $n \times n$ 2D-string and presented a simple construction giving a lower bound of $Ω(n^2)$ for their number (TCS 2020). We make a significant step towards closing the gap between these bounds by showing that the number of 2D-runs in an $n \times n$ 2D-string is $O(n^2 \log^2 n)$. In particular, our bound implies that the $O(n^2\log n + \textsf{output})$ run-time of the algorithm of Amir et al. for computing 2D-runs is also $O(n^2 \log^2 n)$. We expect this result to allow for exploiting 2D-runs algorithmically in the area of 2D pattern matching. A quartic is a 2D-string composed of $2 \times 2$ identical blocks (2D-strings) that was introduced by Apostolico and Brimkov (TCS 2000), where by quartics they meant only primitively rooted quartics, i.e. built of a primitive block. Here our notion of quartics is more general and analogous to that of squares in 1D-strings. Apostolico and Brimkov showed that there are $O(n^2 \log^2 n)$ occurrences of primitively rooted quartics in an $n \times n$ 2D-string and that this bound is attainable. Consequently the number of distinct primitively rooted quartics is $O(n^2 \log^2 n)$. Here, we prove that the number of distinct general quartics is also $O(n^2 \log^2 n)$. This extends the rich combinatorial study of the number of distinct squares in a 1D-string, that was initiated by Fraenkel and Simpson (J. Comb. Theory A 1998), to two dimensions. Finally, we show some algorithmic applications of 2D-runs. (Abstract shortened due to arXiv requirements.)

preprint2016arXiv

Efficient Index for Weighted Sequences

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor.

preprint2016arXiv

Maximum Number of Distinct and Nonequivalent Nonstandard Squares in a Word

The combinatorics of squares in a word depends on how the equivalence of halves of the square is defined. We consider Abelian squares, parameterized squares, and order-preserving squares. The word $uv$ is an Abelian (parameterized, order-preserving) square if $u$ and $v$ are equivalent in the Abelian (parameterized, order-preserving) sense. The maximum number of ordinary squares in a word is known to be asymptotically linear, but the exact bound is still investigated. We present several results on the maximum number of distinct squares for nonstandard subword equivalence relations. Let $\mathit{SQ}_{\mathrm{Abel}}(n,σ)$ and $\mathit{SQ}'_{\mathrm{Abel}}(n,σ)$ denote the maximum number of Abelian squares in a word of length $n$ over an alphabet of size $σ$, which are distinct as words and which are nonequivalent in the Abelian sense, respectively. For $σ\ge 2$ we prove that $\mathit{SQ}_{\mathrm{Abel}}(n,σ)=Θ(n^2)$, $\mathit{SQ}'_{\mathrm{Abel}}(n,σ)=Ω(n^{3/2})$ and $\mathit{SQ}'_{\mathrm{Abel}}(n,σ) = O(n^{11/6})$. We also give linear bounds for parameterized and order-preserving squares for alphabets of constant size: $\mathit{SQ}_{\mathrm{param}}(n,O(1))=Θ(n)$, $\mathit{SQ}_{\mathrm{op}}(n,O(1))=Θ(n)$. The upper bounds have quadratic dependence on the alphabet size for order-preserving squares and exponential dependence for parameterized squares. As a side result we construct infinite words over the smallest alphabet which avoid nontrivial order-preserving squares and nontrivial parameterized cubes (nontrivial parameterized squares cannot be avoided in an infinite word).

preprint2016arXiv

Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et.\ al.\ (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (\emph{Inf.\ Process.\ Lett.}, 2016), who presented an $O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This result was improved by Gawrychowski et.\ al.\ (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special \emph{non-crossing} property of LCE queries asked in the runs computation. We show that any $n$ such non-crossing queries can be answered on-line in $O(n α(n))$ time, which yields an $O(n α(n))$-time algorithm for computing runs.

preprint2016arXiv

Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically-provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterized by the number of strings that match one of the sequences. As our basic approach, a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem is used. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms conditioned on the optimality of the original algorithm for the knapsack problem.

preprint2015arXiv

On the Greedy Algorithm for the Shortest Common Superstring Problem with Reversals

We study a variation of the classical Shortest Common Superstring (SCS) problem in which a shortest superstring of a finite set of strings $S$ is sought containing as a factor every string of $S$ or its reversal. We call this problem Shortest Common Superstring with Reversals (SCS-R). This problem has been introduced by Jiang et al., who designed a greedy-like algorithm with length approximation ratio $4$. In this paper, we show that a natural adaptation of the classical greedy algorithm for SCS has (optimal) compression ratio $\frac12$, i.e., the sum of the overlaps in the output string is at least half the sum of the overlaps in an optimal solution. We also provide a linear-time implementation of our algorithm.

preprint2014arXiv

Covering Problems for Partial Words and for Indeterminate Strings

We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't care symbol. We prove that indeterminate string covering problem and partial word covering problem are NP-complete for binary alphabet and show that both problems are fixed-parameter tractable with respect to $k$, the number of non-solid symbols. For the indeterminate string covering problem we obtain a $2^{O(k \log k)} + n k^{O(1)}$-time algorithm. For the partial word covering problem we obtain a $2^{O(\sqrt{k}\log k)} + nk^{O(1)}$-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no $2^{o(\sqrt{k})} n^{O(1)}$-time solution exists for either problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice.

preprint2014arXiv

On the String Consensus Problem and the Manhattan Sequence Consensus Problem

In the Manhattan Sequence Consensus problem (MSC problem) we are given $k$ integer sequences, each of length $l$, and we are to find an integer sequence $x$ of length $l$ (called a consensus sequence), such that the maximum Manhattan distance of $x$ from each of the input sequences is minimized. For binary sequences Manhattan distance coincides with Hamming distance, hence in this case the string consensus problem (also called string center problem or closest string problem) is a special case of MSC. Our main result is a practically efficient $O(l)$-time algorithm solving MSC for $k\le 5$ sequences. Practicality of our algorithms has been verified experimentally. It improves upon the quadratic algorithm by Amir et al.\ (SPIRE 2012) for string consensus problem for $k=5$ binary strings. Similarly as in Amir's algorithm we use a column-based framework. We replace the implied general integer linear programming by its easy special cases, due to combinatorial properties of the MSC for $k\le 5$. We also show that for a general parameter $k$ any instance can be reduced in linear time to a kernel of size $k!$, so the problem is fixed-parameter tractable. Nevertheless, for $k\ge 4$ this is still too large for any naive solution to be feasible in practice.

preprint2013arXiv

Fast Algorithm for Partial Covers in Words

A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $α$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $α$ positions in $w$. We develop a data structure of $O(n)$ size (where $n=|w|$) that can be constructed in $O(n\log n)$ time which we apply to compute all shortest $α$-partial covers for a given $α$. We also employ it for an $O(n\log n)$-time algorithm computing a shortest $α$-partial cover for each $α=1,2,\ldots,n$.

preprint2013arXiv

Order-Preserving Suffix Trees and Their Algorithmic Applications

Recently Kubica et al. (Inf. Process. Let., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabet and an extension of this result to pattern matching with multiple patterns. We make one step forward in the analysis and give an $O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix trees in the order-preserving setting. We show a number of applications of order-preserving suffix trees to identify patterns and repetitions in time series.

preprint2012arXiv

A Note on Efficient Computation of All Abelian Periods in a String

We derive a simple efficient algorithm for Abelian periods knowing all Abelian squares in a string. An efficient algorithm for the latter problem was given by Cummings and Smyth in 1997. By the way we show an alternative algorithm for Abelian squares. We also obtain a linear time algorithm finding all `long' Abelian periods. The aim of the paper is a (new) reduction of the problem of all Abelian periods to that of (already solved) all Abelian squares which provides new insight into both connected problems.

preprint2011arXiv

Efficient Seeds Computation Revisited

The notion of the cover is a generalization of a period of a string, and there are linear time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity, it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, no linear time algorithm is known. In the paper we give linear time algorithms for some of its versions --- computing shortest left-seed array, longest left-seed array and checking for seeds of a given length. The algorithm for the last problem is used to compute the seed array of a string (i.e., the shortest seeds for all the prefixes of the string) in $O(n^2)$ time. We describe also a simpler alternative algorithm computing efficiently the shortest seeds. As a by-product we obtain an $O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at least $m$ and finding the corresponding seed. We also correct some important details missing in the previously known shortest-seed algorithm (Iliopoulos et al., 1996).

preprint2010arXiv

On the maximal sum of exponents of runs in a string

A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is better than the best previously known proven bound of $5.6n$ by Crochemore & Ilie (2008). The lower bound of $2.035n$, obtained using a family of binary words, contradicts the conjecture of Kolpakov & Kucherov (1999) that the maximal sum of exponents of runs in a string of length $n$ is smaller than $2n$

preprint2009arXiv

On the maximal number of cubic subwords in a string

We investigate the problem of the maximum number of cubic subwords (of the form $www$) in a given word. We also consider square subwords (of the form $ww$). The problem of the maximum number of squares in a word is not well understood. Several new results related to this problem are produced in the paper. We consider two simple problems related to the maximum number of subwords which are squares or which are highly repetitive; then we provide a nontrivial estimation for the number of cubes. We show that the maximum number of squares $xx$ such that $x$ is not a primitive word (nonprimitive squares) in a word of length $n$ is exactly $\lfloor \frac{n}{2}\rfloor - 1$, and the maximum number of subwords of the form $x^k$, for $k\ge 3$, is exactly $n-2$. In particular, the maximum number of cubes in a word is not greater than $n-2$ either. Using very technical properties of occurrences of cubes, we improve this bound significantly. We show that the maximum number of cubes in a word of length $n$ is between $(1/2)n$ and $(4/5)n$. (In particular, we improve the lower bound from the conference version of the paper.)

Jakub Radoszewski

What is connected

Connect this record

See the researcher in context

Building this map preview

21 published item(s)

A note on the maximum number of $k$-powers in a finite word

Approximating longest common substring with $k$ mismatches: Theory and practice

Circular Pattern Matching with $k$ Mismatches

Counting Distinct Patterns in Internal Dictionary Matching

Efficient Representation and Counting of Antipower Factors in Words

Internal Quasiperiod Queries

k-Approximate Quasiperiodicity under Hamming and Edit Distance

The Number of Repetitions in 2D-Strings

Efficient Index for Weighted Sequences

Maximum Number of Distinct and Nonequivalent Nonstandard Squares in a Word

Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

On the Greedy Algorithm for the Shortest Common Superstring Problem with Reversals

Covering Problems for Partial Words and for Indeterminate Strings

On the String Consensus Problem and the Manhattan Sequence Consensus Problem

Fast Algorithm for Partial Covers in Words

Order-Preserving Suffix Trees and Their Algorithmic Applications

A Note on Efficient Computation of All Abelian Periods in a String

Efficient Seeds Computation Revisited

On the maximal sum of exponents of runs in a string

On the maximal number of cubic subwords in a string