Source author record

Marinella Sciortino

Marinella Sciortino appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Formal Languages and Automata Theory Discrete Mathematics math.CO Genomics

Catalog footprint

What is connected

10works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Computing matching statistics on Wheeler DFAs

Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.

preprint2022arXiv

Logarithmic equal-letter runs for BWT of purely morphic words

In this paper we study the number $r_{bwt}$ of equal-letter runs produced by the Burrows-Wheeler transform ($BWT$) when it is applied to purely morphic finite words, which are words generated by iterating prolongable morphisms. Such a parameter $r_{bwt}$ is very significant since it provides a measure of the performances of the $BWT$, in terms of both compressibility and indexing. In particular, we prove that, when $BWT$ is applied to any purely morphic finite word on a binary alphabet, $r_{bwt}$ is $\mathcal{O}(\log n)$, where $n$ is the length of the word. Moreover, we prove that $r_{bwt}$ is $Θ(\log n)$ for the binary words generated by a large class of prolongable binary morphisms. These bounds are proved by providing some new structural properties of the \emph{bispecial circular factors} of such words.

preprint2022arXiv

String Attractors and Infinite Words

The notion of string attractor has been introduced in [Kempa and Prezza, 2018] in the context of Data Compression and it represents a set of positions of a finite word in which all of its factors can be "attracted". The smallest size $γ^*$ of a string attractor for a finite word is a lower bound for several repetitiveness measures associated with the most common compression schemes, including BWT-based and LZ-based compressors. The combinatorial properties of the measure $γ^*$ have been studied in [Mantaci et al., 2021]. Very recently, a complexity measure, called string attractor profile function, has been introduced for infinite words, by evaluating $γ^*$ on each prefix. Such a measure has been studied for automatic sequences and linearly recurrent infinite words [Schaeffer and Shallit, 2021]. In this paper, we study the relationship between such a complexity measure and other well-known combinatorial notions related to repetitiveness in the context of infinite words, such as the factor complexity and the recurrence. Furthermore, we introduce new string attractor-based complexity measures, in which the structure and the distribution of positions in a string attractor of the prefixes of infinite words are considered. We show that such measures provide a finer classification of some infinite families of words.

preprint2022arXiv

Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught {\em before} the BWT. We first use the PBWT's close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for {\em positional search} on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that {\em prefix search} (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to {\em substring search} in that string (listing all the starting positions of occurrences of the pattern in the string). Storing naïvely a PBWT of the suffixes of a string is space-{\em inefficient} but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same -- and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT.

preprint2019arXiv

Inducing the Lyndon Array

In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in $O(n)$ time using $σ+ O(1)$ words of working space, where $n$ is the length of the text and $σ$ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use $O(n)$ words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not only space-efficient but also fast in practice.

preprint2016arXiv

Cyclic Complexity of Words

We introduce and study a complexity function on words $c_x(n),$ called \emph{cyclic complexity}, which counts the number of conjugacy classes of factors of length $n$ of an infinite word $x.$ We extend the well-known Morse-Hedlund theorem to the setting of cyclic complexity by showing that a word is ultimately periodic if and only if it has bounded cyclic complexity. Unlike most complexity functions, cyclic complexity distinguishes between Sturmian words of different slopes. We prove that if $x$ is a Sturmian word and $y$ is a word having the same cyclic complexity of $x,$ then up to renaming letters, $x$ and $y$ have the same set of factors. In particular, $y$ is also Sturmian of slope equal to that of $x.$ Since $c_x(n)=1$ for some $n\geq 1$ implies $x$ is periodic, it is natural to consider the quantity $\liminf_{n\rightarrow \infty} c_x(n).$ We show that if $x$ is a Sturmian word, then $\liminf_{n\rightarrow \infty} c_x(n)=2.$ We prove however that this is not a characterization of Sturmian words by exhibiting a restricted class of Toeplitz words, including the period-doubling word, which also verify this same condition on the limit infimum. In contrast we show that, for the Thue-Morse word $t$, $\liminf_{n\rightarrow \infty} c_t(n)=+\infty.$

preprint2016arXiv

Lightweight LCP Construction for Very Large Collections of Strings

The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings having any length. The computation is realized by performing disk data accesses only via sequential scans, and the total disk space usage never needs more than twice the output size, excluding the disk space required for the input. Moreover, extLCP allows to compute also the suffix array of the strings of the collection, without any other further data structure is needed. Finally, we test our algorithm on real data and compare our results with another tool capable to work in external memory on large collections of strings.

preprint2014arXiv

Universal Lyndon Words

A word $w$ over an alphabet $Σ$ is a Lyndon word if there exists an order defined on $Σ$ for which $w$ is lexicographically smaller than all of its conjugates (other than itself). We introduce and study \emph{universal Lyndon words}, which are words over an $n$-letter alphabet that have length $n!$ and such that all the conjugates are Lyndon words. We show that universal Lyndon words exist for every $n$ and exhibit combinatorial and structural properties of these words. We then define particular prefix codes, which we call Hamiltonian lex-codes, and show that every Hamiltonian lex-code is in bijection with the set of the shortest unrepeated prefixes of the conjugates of a universal Lyndon word. This allows us to give an algorithm for constructing all the universal Lyndon words.

preprint2013arXiv

Lightweight LCP Construction for Next-Generation Sequencing Datasets

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and BWT of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole genome sequencing experiments.

preprint2013arXiv

Sorting suffixes of a text via its Lyndon Factorization

The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. They are used for instance in the constructions of the Burrows-Wheeler transform and the suffix array, widely used in several fields of Computer Science. For this reason, several recent researches have been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can be adapted to different implementative scenarios.

Marinella Sciortino

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Computing matching statistics on Wheeler DFAs

Logarithmic equal-letter runs for BWT of purely morphic words

String Attractors and Infinite Words

Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

Inducing the Lyndon Array

Cyclic Complexity of Words

Lightweight LCP Construction for Very Large Collections of Strings

Universal Lyndon Words

Lightweight LCP Construction for Next-Generation Sequencing Datasets

Sorting suffixes of a text via its Lyndon Factorization