Researcher profile

Maxime Crochemore

Maxime Crochemore's work listed here covers string algorithms, text indexing, data compression and combinatorics on words.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
10 works
0 followers
6 topics
4 close collaborators

Published work

10 published items

Preprint, 2022, arXiv

Checking whether a word is Hamming-isometric in linear time

A finite word $f$ is Hamming-isometric if for any two words $u$ and $v$ of the same length avoiding $f$, $u$ can be transformed into $v$ by changing one by one all the letters on which $u$ differs from $v$, in such a way that all of the new words obtained in this process also avoid $f$. Words which are not Hamming-isometric have been characterized as words having a border with two mismatches. We derive from this characterization a linear-time algorithm to check whether a word is Hamming-isometric. It is based on pattern matching algorithms with $k$ mismatches. Lee-isometric words over a four-letter alphabet have been characterized as words having a border with two Lee-errors. We derive from this characterization a linear-time algorithm to check whether a word over an alphabet of size four is Lee-isometric.
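The notion of a border with mismatches used above can be illustrated with a small sketch. It is a naive quadratic scan, not the linear-time algorithm of the paper, and it assumes that "a border with two mismatches" means a prefix and a suffix of the same length differing in exactly two positions. The function names are hypothetical.

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def borders_with_mismatches(word: str, k: int) -> list:
    """Lengths m (0 < m < len(word)) such that the prefix and the suffix
    of length m differ in exactly k positions.

    Naive O(n^2) scan for illustration only (assumption: "exactly k").
    """
    n = len(word)
    return [m for m in range(1, n) if hamming(word[:m], word[n - m:]) == k]


# Ordinary borders are the case k = 0.
print(borders_with_mismatches("abab", 0))
# Candidate lengths for a border with two mismatches.
print(borders_with_mismatches("1100", 2))
```

Under the stated assumption, a non-empty result for k = 2 is the condition that the characterization refers to.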

Preprint, 2020, arXiv

Internal Quasiperiod Queries

Internal pattern matching requires one to answer queries about factors of a given string. Many results are known on answering internal period queries, asking for the periods of a given factor. In this paper we investigate (for the first time) internal queries asking for covers (also known as quasiperiods) of a given factor. We propose a data structure that answers such queries in $O(\log n \log \log n)$ time for the shortest cover and in $O(\log n (\log \log n)^2)$ time for a representation of all the covers, after $O(n \log n)$ time and space preprocessing.

Preprint, 2013, arXiv

Order-Preserving Suffix Trees and Their Algorithmic Applications

Recently Kubica et al. (Inf. Process. Lett., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for a polynomially-bounded alphabet and an extension of this result to pattern matching with multiple patterns. We make one step forward in the analysis and give an $O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix trees in the order-preserving setting. We show a number of applications of order-preserving suffix trees to identify patterns and repetitions in time series.
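To make the idea of "shape" concrete, here is a minimal sketch of order-preserving matching by brute force. It is not the linear-time algorithm or the suffix tree construction mentioned in the abstract. The helper names are hypothetical, and treating equal values as having equal rank is an assumption (the cited papers may restrict to distinct values).

```python
def shape(seq):
    """Replace each value by its rank among the distinct values of seq.

    Two sequences have the same shape when their rank tuples are equal.
    Assumption: equal values receive equal ranks.
    """
    ranks = {value: rank for rank, value in enumerate(sorted(set(seq)))}
    return tuple(ranks[value] for value in seq)


def order_preserving_matches(text, pattern):
    """Start positions of windows of text with the same shape as pattern.

    Naive scan that recomputes the shape of every window.
    """
    m = len(pattern)
    target = shape(pattern)
    return [i for i in range(len(text) - m + 1)
            if shape(text[i:i + m]) == target]


# The pattern (1, 3, 2) reads "low, high, middle".
print(order_preserving_matches([10, 22, 15, 30, 20, 18, 27], [1, 3, 2]))
```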

Preprint, 2013, arXiv

Suffix Tree of Alignment: An Efficient Index for Similar Data

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and $B$. It has $|A|+|B|$ leaves and can be constructed in $O(|A|+|B|)$ time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of $A$ and $B$. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the lengths of some common parts of $A$ and $B$. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern $P$ in $O(|P|+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and $B$. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires $O(|A| + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths of other common substrings of $A$ and $B$. When the suffix tree of $A$ is already given, it requires $O(l_d + l_1 + l_2)$ time.

Preprint, 2012, arXiv

Fewest repetitions in infinite binary words

A square is the concatenation of a nonempty word with itself. A word has period p if its letters at distance p match. The exponent of a nonempty word is the quotient of its length over its smallest period. In this article we give a proof of the fact that there exists an infinite binary word which contains finitely many squares and simultaneously avoids words of exponent larger than 7/3. Our infinite word contains 12 squares, which is the smallest possible number of squares to get the property, and 2 factors of exponent 7/3. These are the only factors of exponent larger than 2. The value 7/3 introduces what we call the finite-repetition threshold of the binary alphabet. We conjecture it is 7/4 for the ternary alphabet, like its repetitive threshold.
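The three definitions above (square, period, exponent) translate directly into code. The sketch below is a naive illustration with hypothetical function names. It does not reproduce the construction of the infinite word from the paper.

```python
from fractions import Fraction


def smallest_period(word: str) -> int:
    """Smallest p >= 1 such that letters at distance p always match."""
    n = len(word)
    for p in range(1, n + 1):
        if all(word[i] == word[i + p] for i in range(n - p)):
            return p
    raise ValueError("the exponent is only defined for nonempty words")


def exponent(word: str) -> Fraction:
    """Length of the word divided by its smallest period."""
    return Fraction(len(word), smallest_period(word))


def is_square(word: str) -> bool:
    """True if word is a nonempty word concatenated with itself."""
    half, odd = divmod(len(word), 2)
    return half > 0 and odd == 0 and word[:half] == word[half:]


print(exponent("0110110"))  # length 7, smallest period 3
print(is_square("0101"))
```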

Preprint, 2012, arXiv

Note on the Greedy Parsing Optimality for Dictionary-Based Text Compression

Dynamic dictionary-based compression schemes have been the most widely used data compression schemes in everyday practice since they appeared in the foundational paper of Ziv and Lempel in 1977, commonly referred to as LZ77. Their work is the basis of Deflate, gZip, WinZip, 7Zip and many other compression tools. All of those compression schemes use variants of the greedy approach to parse the text into dictionary phrases. Greedy parsing optimality was proved by Cohn et al. (1996) for fixed-length codes and unbounded dictionaries. The optimality of the greedy parsing was never proved for bounded-size dictionaries, which all of those schemes actually require. We define the suffix-closed property for dynamic dictionaries and we show that any LZ77-based dictionary, including the bounded variants, satisfies this property. Under this condition we prove the optimality of the greedy parsing as a variant of the proof by Cohn et al.
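As an illustration of what greedy parsing means in an LZ77-style setting, the sketch below takes, at each position, the longest phrase that already occurs earlier in the text. It is a naive quadratic toy and not the code of any of the tools named above. The `window` parameter (a hypothetical way to model a bounded dictionary), the phrase representation, and the choice to allow a copy to overlap the current position are all assumptions.

```python
def greedy_parse(text: str, window=None) -> list:
    """Greedy LZ77-style parse.

    Each phrase is either a single literal character or a pair
    (offset, length) copying from `offset` positions back.
    `window` (assumption) limits how far back a copy may start.
    """
    phrases, i, n = [], 0, len(text)
    while i < n:
        best_len, best_off = 0, 0
        start = 0 if window is None else max(0, i - window)
        for j in range(start, i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            # ">=" keeps the rightmost (smallest-offset) longest match.
            if length > 0 and length >= best_len:
                best_len, best_off = length, i - j
        if best_len == 0:
            phrases.append(text[i])
            i += 1
        else:
            phrases.append((best_off, best_len))
            i += best_len
    return phrases


def decode(phrases: list) -> str:
    """Rebuild the text from the phrases produced by greedy_parse."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            offset, length = phrase
            for _ in range(length):
                out.append(out[-offset])
    return "".join(out)


print(greedy_parse("abababab"))
```

Decoding the phrase list gives back the input, which is a quick sanity check on any parse.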

Preprint, 2012, arXiv

Quasiperiodicities in Fibonacci strings

We consider the problem of finding quasiperiodicities in a Fibonacci string. A factor u of a string y is a cover of y if every letter of y falls within some occurrence of u in y. A string v is a seed of y if it is a cover of a superstring of y. A left seed of a string y is a prefix of y that is a cover of a superstring of y. Similarly a right seed of a string y is a suffix of y that is a cover of a superstring of y. In this paper, we present some interesting results regarding quasiperiodicities in Fibonacci strings: we identify all covers, left/right seeds and seeds of a Fibonacci string and all covers of a circular Fibonacci string.
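The definition of a cover can be checked directly by brute force. The sketch below uses hypothetical function names and a naive scan. The example string "abaababa" is assumed to be a Fibonacci string under the convention a, ab, aba, abaab, abaababa, which may be indexed differently in the paper.

```python
def occurrences(u: str, y: str) -> list:
    """Start positions of all occurrences of u in y, in increasing order."""
    return [i for i in range(len(y) - len(u) + 1) if y.startswith(u, i)]


def is_cover(u: str, y: str) -> bool:
    """True if every letter of y lies inside some occurrence of u in y."""
    if not u or not y:
        return False
    covered_up_to = 0  # y[:covered_up_to] is covered so far
    for i in occurrences(u, y):
        if i > covered_up_to:
            return False  # a gap that no occurrence reaches
        covered_up_to = i + len(u)
    return covered_up_to == len(y)


def shortest_cover(y: str) -> str:
    """Shortest cover of y; a cover must be a prefix of y, so only prefixes are tried."""
    for m in range(1, len(y) + 1):
        if is_cover(y[:m], y):
            return y[:m]
    return y


print(is_cover("aba", "abaababa"))
print(shortest_cover("abaababa"))
```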

Preprint, 2012, arXiv

The Rightmost Equal-Cost Position Problem

LZ77-based compression schemes compress the input text by replacing factors in the text with an encoded reference to a previous occurrence formed by the couple (length, offset). For a given factor, the smaller the offset, the smaller the resulting compression ratio. This is optimally achieved by using the rightmost occurrence of a factor in the previous text. Given a cost function, for instance the minimum number of bits used to represent an integer, we define the Rightmost Equal-Cost Position (REP) problem as the problem of finding one of the occurrences of a factor whose cost is equal to the cost of the rightmost one. We present the Multi-Layer Suffix Tree data structure that, for a text of length n, at any time i, provides REP(LPF) in constant time, where LPF is the longest previous factor, i.e. the greedy phrase; a reference to the list of REP({set of prefixes of LPF}) in constant time; and REP(p) in time O(|p| log log n) for any given pattern p.
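The REP definition can be made concrete with a brute-force sketch. It is not the Multi-Layer Suffix Tree and makes several assumptions: the cost of an occurrence is the bit length of its offset, an occurrence only needs to start before position i (it may overlap it), and the function name is hypothetical.

```python
def equal_cost_positions(text: str, i: int, length: int) -> list:
    """Start positions j < i of occurrences of text[i:i+length] whose
    offset i - j costs as many bits as the offset of the rightmost one.

    Any element of the result is a valid answer to REP under the
    assumptions stated above. Returns [] if there is no earlier occurrence.
    """
    factor = text[i:i + length]
    starts = [j for j in range(i) if text.startswith(factor, j)]
    if not starts:
        return []
    rightmost_cost = (i - starts[-1]).bit_length()
    return [j for j in starts if (i - j).bit_length() == rightmost_cost]


# "a" at position 5 also occurs at positions 2 and 3 (offsets 3 and 2,
# both two bits long), so either position has the same cost.
print(equal_cost_positions("bbaaba", 5, 1))
```

The point of the relaxation is that a data structure only has to return some occurrence in the same cost class as the rightmost one, not the rightmost one itself.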

Preprint, 2011, arXiv

Efficient Seeds Computation Revisited

The notion of the cover is a generalization of a period of a string, and there are linear-time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity: it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, and no linear-time algorithm is known. In this paper we give linear-time algorithms for some of its versions --- computing the shortest left-seed array, the longest left-seed array and checking for seeds of a given length. The algorithm for the last problem is used to compute the seed array of a string (i.e., the shortest seeds for all the prefixes of the string) in $O(n^2)$ time. We also describe a simpler alternative algorithm computing efficiently the shortest seeds. As a by-product we obtain an $O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at least $m$ and finding the corresponding seed. We also correct some important details missing in the previously known shortest-seed algorithm (Iliopoulos et al., 1996).

Preprint, 2011, arXiv

Finite-Repetition threshold for infinite ternary words

The exponent of a word is the ratio of its length over its smallest period. The repetitive threshold r(a) of an a-letter alphabet is the smallest rational number for which there exists an infinite word whose finite factors have exponent at most r(a). This notion was introduced in 1972 by Dejean, who gave the exact values of r(a) for every alphabet size a, as was eventually proved in 2009. The finite-repetition threshold for an a-letter alphabet refines the above notion. It is the smallest rational number FRt(a) for which there exists an infinite word whose finite factors have exponent at most FRt(a) and that contains a finite number of factors with exponent r(a). It is known from Shallit (2008) that FRt(2)=7/3. With each finite-repetition threshold is associated the smallest number of r(a)-exponent factors that can be found in the corresponding infinite word. It has been proved by Badkobeh and Crochemore (2010) that this number is 12 for infinite binary words whose maximal exponent is 7/3. We show that FRt(3)=r(3)=7/4 and that the bound is achieved with an infinite word containing only two 7/4-exponent words, the smallest number. Based on extensive experiments we conjecture that FRt(4)=r(4)=7/5. The question remains open for alphabets with more than four letters. Keywords: combinatorics on words, repetition, repeat, word powers, word exponent, repetition threshold, pattern avoidability, word morphisms.