Optimal Computation of Avoided Words
The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $std(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word $w$ of length $k>2$ is a $ρ$-avoided word in $x$ if $std(w) \leq ρ$, for a given threshold $ρ< 0$. Notice that such a word may be completely absent from $x$. Hence computing all such words na\"ıvely can be a very time-consuming procedure, in particular for large $k$. In this article, we propose an $O(n)$-time and $O(n)$-space algorithm to compute all $ρ$-avoided words of length $k$ in a given sequence $x$ of length $n$ over a fixed-sized alphabet. We also present a time-optimal $O(σn)$-time and $O(σn)$-space algorithm to compute all $ρ$-avoided words (of any length) in a sequence of length $n$ over an alphabet of size $σ$. Furthermore, we provide a tight asymptotic upper bound for the number of $ρ$-avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.