Source author record

Yann Ponty

Yann Ponty appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Quantitative Methods Data Structures and Algorithms Discrete Mathematics Formal Languages and Automata Theory Biomolecules Computation and Language math.CO Other Computer Science Populations and Evolution

Catalog footprint

What is connected

17works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2016arXiv

Combinatorial RNA Design Designability and Structure-Approximating Algorithm in Watson-Crick and Nussinov-Jacobson Energy Models

We consider the Combinatorial RNA Design problem, a minimal instance of RNA design where one must produce an RNA sequence that adopts a given secondary structure as its minimal free-energy structure. We consider two free-energy models where the contributions of base pairs are additive and independent: the purely combinatorial Watson-Crick model, which only allows equally-contributing A -- U and C -- G base pairs, and the real-valued Nussinov-Jacobson model, which associates arbitrary energies to A -- U, C -- G and G -- U base pairs. We first provide a complete characterization of designable structures using restricted alphabets and, in the four-letter alphabet, provide a complete characterization for designable structures without unpaired bases. When unpaired bases are allowed, we characterize extensive classes of (non-)designable structures, and prove the closure of the set of designable structures under the stutter operation. Membership of a given structure to any of the classes can be tested in $Θ$(n) time, including the generation of a solution sequence for positive instances. Finally, we consider a structure-approximating relaxation of the design, and provide a $Θ$(n) algorithm which, given a structure S that avoids two trivially non-designable motifs, transforms S into a designable structure constructively by adding at most one base-pair to each of its stems.

preprint2016arXiv

Counting, generating and sampling tree alignments

Pairwise ordered tree alignment are combinatorial objects that appear in RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis.In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a context-free grammar by mean of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees $S$ and $T$. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the Gibbs-Boltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal RNA secondary structures alignments.

preprint2016arXiv

Taming Reluctant Random Walks in the Positive Quadrant

A lattice walk model is said to be reluctant if the defining step set has a strong drift towards the boundaries. We describe efficient random generation strategies for these walks.

preprint2015arXiv

Assessing the robustness of parsimonious predictions for gene neighborhoods from reconciled phylogenies

The availability of a large number of assembled genomes opens the way to study the evolution of syntenic character within a phylogenetic context. The DeCo algorithm, recently introduced by B{é}rard et al. allows the computation of parsimonious evolutionary scenarios for gene adjacencies, from pairs of reconciled gene trees. Following the approach pioneered by Sturmfels and Pachter, we describe how to modify the DeCo dynamic programming algorithm to identify classes of cost schemes that generates similar parsimonious evolutionary scenarios for gene adjacencies, as well as the robustness to changes to the cost scheme of evolutionary events of the presence or absence of specific ancestral gene adjacencies. We apply our method to six thousands mammalian gene families, and show that computing the robustness to changes to cost schemes provides new and interesting insights on the evolution of gene adjacencies and the DeCo model.

preprint2015arXiv

Combinatorial RNA Design: Designability and Structure-Approximating Algorithm

In this work, we consider the Combinatorial RNA Design problem, a minimal instance of the RNA design problem which aims at finding a sequence that admits a given target as its unique base pair maximizing structure. We provide complete characterizations for the structures that can be designed using restricted alphabets. Under a classic four-letter alphabet, we provide a complete characterization of designable structures without unpaired bases. When unpaired bases are allowed, we provide partial characterizations for classes of designable/undesignable structures, and show that the class of designable structures is closed under the stutter operation. Membership of a given structure to any of the classes can be tested in linear time and, for positive instances, a solution can be found in linear time. Finally, we consider a structure-approximating version of the problem that allows to extend bands (helices) and, assuming that the input structure avoids two motifs, we provide a linear-time algorithm that produces a designable structure with at most twice more base pairs than the input structure.

preprint2013arXiv

Flexible RNA design under structure and sequence constraints using formal languages

The problem of RNA secondary structure design (also called inverse folding) is the following: given a target secondary structure, one aims to create a sequence that folds into, or is compatible with, a given structure. In several practical applications in biology, additional constraints must be taken into account, such as the presence/absence of regulatory motifs, either at a specific location or anywhere in the sequence. In this study, we investigate the design of RNA sequences from their targeted secondary structure, given these additional sequence constraints. To this purpose, we develop a general framework based on concepts of language theory, namely context-free grammars and finite automata. We efficiently combine a comprehensive set of constraints into a unifying context-free grammar of moderate size. From there, we use generic generic algorithms to perform a (weighted) random generation, or an exhaustive enumeration, of candidate sequences. The resulting method, whose complexity scales linearly with the length of the RNA, was implemented as a standalone program. The resulting software was embedded into a publicly available dedicated web server. The applicability demonstrated of the method on a concrete case study dedicated to Exon Splicing Enhancers, in which our approach was successfully used in the design of \emph{in vitro} experiments.

preprint2013arXiv

Using structural and evolutionary information to detect and correct pyrosequencing errors in non-coding RNAs

Analysis of the sequence-structure relationship in RNA molecules are essential to evolutionary studies but also to concrete applications such as error-correction methodologies in sequencing technologies. The prohibitive sizes of the mutational and conformational landscapes combined with the volume of data to proceed require efficient algorithms to compute sequence-structure properties. More specifically, here we aim to calculate which mutations increase the most the likelihood of a sequence to a given structure and RNA family. In this paper, we introduce RNApyro, an efficient linear-time and space inside-outside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking base pair energies to novel isostericity scales, and apply our techniques to correct point-wise errors in 5s and 16s rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS error-correction pipeline.

preprint2012arXiv

Non-redundant random generation algorithms for weighted context-free languages

We address the non-redundant random generation of $k$ words of length $n$ in a context-free language. Additionally, we want to avoid a predefined set of words. We study a rejection-based approach, whose worst-case time complexity is shown to grow exponentially with $k$ for some specifications and in the limit case of a coupon collector. We propose two algorithms respectively based on the recursive method and on an unranking approach. We show how careful implementations of these algorithms allow for a non-redundant generation of $k$ words of length $n$ in $\mathcal{O}(k\cdot n\cdot \log{n})$ arithmetic operations, after a precomputation of $Θ(n)$ numbers. The overall complexity is therefore dominated by the generation of $k$ words, and the non-redundancy comes at a negligible cost.

preprint2012arXiv

Rule-weighted and terminal-weighted context-free grammars have identical expressivity

Two formalisms, both based on context-free grammars, have recently been proposed as a basis for a non-uniform random generation of combinatorial objects. The former, introduced by Denise et al, associates weights with letters, while the latter, recently explored by Weinberg et al in the context of random generation, associates weights to transitions. In this short note, we use a simple modification of the Greibach Normal Form transformation algorithm, due to Blum and Koch, to show the equivalent expressivities, in term of their induced distributions, of these two formalisms.

preprint2012arXiv

The weighted words collector

Motivated by applications in bioinformatics, we consider the word collector problem, i.e. the expected number of calls to a random weighted generator of words of length $n$ before the full collection is obtained. The originality of this instance of the non-uniform coupon collector lies in the, potentially large, multiplicity of the words/coupons of a given probability/composition. We obtain a general theorem that gives an asymptotic equivalent for the expected waiting time of a general version of the Coupon Collector. This theorem is especially well-suited for classes of coupons featuring high multiplicities. Its application to a given language essentially necessitates some knowledge on the number of words of a given composition/probability. We illustrate the application of our theorem, in a step-by-step fashion, on three exemplary languages, revealing asymptotic regimes in $Θ(μ(n)\cdot n)$ and $Θ(μ(n)\cdot \log n)$, where $μ(n)$ is the sum of weights over words of length $n$.

preprint2012arXiv

Tree decomposition and parameterized algorithms for RNA structure-sequence alignment including tertiary interactions and pseudoknots

We present a general setting for structure-sequence comparison in a large class of RNA structures that unifies and generalizes a number of recent works on specific families on structures. Our approach is based on tree decomposition of structures and gives rises to a general parameterized algorithm, where the exponential part of the complexity depends on the family of structures. For each of the previously studied families, our algorithm has the same complexity as the specific algorithm that had been given before.

preprint2011arXiv

A Combinatorial Framework for Designing (Pseudoknotted) RNA Algorithms

We extend an hypergraph representation, introduced by Finkelstein and Roytberg, to unify dynamic programming algorithms in the context of RNA folding with pseudoknots. Classic applications of RNA dynamic programming energy minimization, partition function, base-pair probabilities...) are reformulated within this framework, giving rise to very simple algorithms. This reformulation allows one to conceptually detach the conformation space/energy model -- captured by the hypergraph model -- from the specific application, assuming unambiguity of the decomposition. To ensure the latter property, we propose a new combinatorial methodology based on generating functions. We extend the set of generic applications by proposing an exact algorithm for extracting generalized moments in weighted distribution, generalizing a prior contribution by Miklos and al. Finally, we illustrate our full-fledged programme on three exemplary conformation spaces (secondary structures, Akutsu's simple type pseudoknots and kissing hairpins). This readily gives sets of algorithms that are either novel or have complexity comparable to classic implementations for minimization and Boltzmann ensemble applications of dynamic programming.

preprint2010arXiv

Controlled non uniform random generation of decomposable structures

Consider a class of decomposable combinatorial structures, using different types of atoms $\Atoms = \{\At_1,\ldots ,\At_{|{\Atoms}|}\}$. We address the random generation of such structures with respect to a size $n$ and a targeted distribution in $k$ of its \emph{distinguished} atoms. We consider two variations on this problem. In the first alternative, the targeted distribution is given by $k$ real numbers $\TargFreq_1, \ldots, \TargFreq_k$ such that $0 < \TargFreq_i < 1$ for all $i$ and $\TargFreq_1+\cdots+\TargFreq_k \leq 1$. We aim to generate random structures among the whole set of structures of a given size $n$, in such a way that the {\em expected} frequency of any distinguished atom $\At_i$ equals $\TargFreq_i$. We address this problem by weighting the atoms with a $k$-tuple $\Weights$ of real-valued weights, inducing a weighted distribution over the set of structures of size $n$. We first adapt the classical recursive random generation scheme into an algorithm taking $\bigO{n^{1+o(1)}+mn\log{n}}$ arithmetic operations to draw $m$ structures from the $\Weights$-weighted distribution. Secondly, we address the analytical computation of weights such that the targeted frequencies are achieved asymptotically, i. e. for large values of $n$. We derive systems of functional equations whose resolution gives an explicit relationship between $\Weights$ and $\TargFreq_1, \ldots, \TargFreq_k$. Lastly, we give an algorithm in $\bigO{k n^4}$ for the inverse problem, {\it i.e.} computing the frequencies associated with a given $k$-tuple $\Weights$ of weights, and an optimized version in $\bigO{k n^2}$ in the case of context-free languages. This allows for a heuristic resolution of the weights/frequencies relationship suitable for complex specifications. In the second alternative, the targeted distribution is given by a $k$ natural numbers $n_1, \ldots, n_k$ such that $n_1+\cdots+n_k+r=n$ where $r \geq 0$ is the number of undistinguished atoms. The structures must be generated uniformly among the set of structures of size $n$ that contain {\em exactly} $n_i$ atoms $\At_i$ ($1 \leq i \leq k$). We give a $\bigO{r^2\prod_{i=1}^k n_i^2 +m n k \log n}$ algorithm for generating $m$ structures, which simplifies into a $\bigO{r\prod_{i=1}^k n_i +m n}$ for regular specifications.

preprint2010arXiv

Multi-dimensional Boltzmann Sampling of Languages

This paper addresses the uniform random generation of words from a context-free language (over an alphabet of size $k$), while constraining every letter to a targeted frequency of occurrence. Our approach consists in a multidimensional extension of Boltzmann samplers \cite{Duchon2004}. We show that, under mostly \emph{strong-connectivity} hypotheses, our samplers return a word of size in $[(1-\varepsilon)n, (1+\varepsilon)n]$ and exact frequency in $\mathcal{O}(n^{1+k/2})$ expected time. Moreover, if we accept tolerance intervals of width in $Ω(\sqrt{n})$ for the number of occurrences of each letters, our samplers perform an approximate-size generation of words in expected $\mathcal{O}(n)$ time. We illustrate these techniques on the generation of Tetris tessellations with uniform statistics in the different types of tetraminoes.

preprint2010arXiv

Non-redundant random generation from weighted context-free languages

We address the non-redundant random generation of k words of length n from a context-free language. Additionally, we want to avoid a predefined set of words. We study the limits of a rejection-based approach, whose time complexity is shown to grow exponentially in k in some cases. We propose an alternative recursive algorithm, whose careful implementation allows for a non-redundant generation of k words of size n in O(kn log n) arithmetic operations after the precomputation of O(n) numbers. The overall complexity is therefore dominated by the generation of k words, and the non-redundancy comes at a negligible cost.

preprint2010arXiv

Weighted random generation of context-free languages: Analysis of collisions in random urn occupancy models

The present work analyzes the redundancy of sets of combinatorial objects produced by a weighted random generation algorithm proposed by Denise et al. This scheme associates weights to the terminals symbols of a weighted context-free grammar, extends this weight definition multiplicatively on words, and draws words of length $n$ with probability proportional their weight. We investigate the level of redundancy within a sample of $k$ word, the proportion of the total probability covered by $k$ words (coverage), the time (number of generations) of the first collision, and the time of the full collection. For these four questions, we use an analytic urn analogy to derive asymptotic estimates and/or polynomially computable exact forms. We illustrate these tools by an analysis of an RNA secondary structure statistical sampling algorithm introduced by Ding et al.

preprint2006arXiv

Estimating seed sensitivity on homogeneous alignments

We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.

Yann Ponty

What is connected

Connect this record

See the researcher in context

Building this map preview

17 published item(s)

Combinatorial RNA Design Designability and Structure-Approximating Algorithm in Watson-Crick and Nussinov-Jacobson Energy Models

Counting, generating and sampling tree alignments

Taming Reluctant Random Walks in the Positive Quadrant

Assessing the robustness of parsimonious predictions for gene neighborhoods from reconciled phylogenies

Combinatorial RNA Design: Designability and Structure-Approximating Algorithm

Flexible RNA design under structure and sequence constraints using formal languages

Using structural and evolutionary information to detect and correct pyrosequencing errors in non-coding RNAs

Non-redundant random generation algorithms for weighted context-free languages

Rule-weighted and terminal-weighted context-free grammars have identical expressivity

The weighted words collector

Tree decomposition and parameterized algorithms for RNA structure-sequence alignment including tertiary interactions and pseudoknots

A Combinatorial Framework for Designing (Pseudoknotted) RNA Algorithms

Controlled non uniform random generation of decomposable structures

Multi-dimensional Boltzmann Sampling of Languages

Non-redundant random generation from weighted context-free languages

Weighted random generation of context-free languages: Analysis of collisions in random urn occupancy models

Estimating seed sensitivity on homogeneous alignments