Source author record

Florencia Leonardi

Florencia Leonardi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

math.ST, Methodology, Statistics Theory, math.PR, Applications, Computation, Machine Learning

Catalog footprint

What is connected

9 works

7 topics

4 close collaborators

preprint, 2022, arXiv

Structure recovery for partially observed discrete Markov random fields on graphs under not necessarily positive distributions

We propose a penalized pseudo-likelihood criterion to estimate the graph of conditional dependencies in a discrete Markov random field that can be partially observed. We prove the convergence of the estimator in the case of a finite or countably infinite set of nodes. In the finite case, the underlying graph can be recovered with probability one, while in the countably infinite case, we can recover any finite sub-graph with probability one by allowing the candidate neighborhoods to grow as a function $o(\log n)$, with $n$ the sample size. Our method requires minimal assumptions on the probability distribution, and contrary to other approaches in the literature, the usual positivity condition is not needed. We evaluate the performance of the estimator on simulated data, and we apply the methodology to a real dataset of stock index markets in different countries.

preprint, 2016, arXiv

Computationally efficient change point detection for high-dimensional regression

Large-scale sequential data is often exposed to some degree of inhomogeneity in the form of sudden changes in the parameters of the data-generating process. We consider the problem of detecting such structural changes in a high-dimensional regression setting. We propose a joint estimator of the number and the locations of the change points and of the parameters in the corresponding segments. The estimator can be computed using dynamic programming or, as we emphasize here, it can be approximated using a binary search algorithm with $O(n \log(n) \mathrm{Lasso}(n))$ computational operations while still enjoying essentially the same theoretical properties; here $\mathrm{Lasso}(n)$ denotes the computational cost of computing the Lasso for sample size $n$. We establish oracle inequalities for the estimator as well as for its binary search approximation, covering also the case with a large (asymptotically growing) number of change points. We evaluate the performance of the proposed estimation algorithms on simulated data and apply the methodology to real data.
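The recursive splitting idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' estimator: the Lasso fit on each segment is replaced here by a plain segment mean, and the names `sse`, `binary_segmentation`, `penalty` and `min_len` are hypothetical, chosen only for illustration. The naive cost evaluation below is quadratic per segment and does not attempt to match the complexity stated in the abstract.

```python
def sse(xs):
    """Sum of squared errors of xs around its own mean (the segment cost)."""
    if not xs:
        return 0.0
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def binary_segmentation(xs, lo=0, hi=None, penalty=1.0, min_len=2):
    """Return sorted change point indices found by recursive splitting.

    Assumption: a split is accepted only if it lowers the cost of the
    segment xs[lo:hi] by more than `penalty` (an illustrative threshold).
    """
    if hi is None:
        hi = len(xs)
    if hi - lo < 2 * min_len:
        return []
    base = sse(xs[lo:hi])
    best_gain, best_t = 0.0, None
    # Try every admissible split point and keep the one with the largest gain.
    for t in range(lo + min_len, hi - min_len + 1):
        gain = base - sse(xs[lo:t]) - sse(xs[t:hi])
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_t is None or best_gain <= penalty:
        return []
    # Recurse on the two halves and merge the results in order.
    left = binary_segmentation(xs, lo, best_t, penalty, min_len)
    right = binary_segmentation(xs, best_t, hi, penalty, min_len)
    return left + [best_t] + right


signal = [0.0] * 20 + [5.0] * 20 + [1.0] * 20
print(binary_segmentation(signal))
```

On this noise-free toy signal the two level shifts are placed at indices 20 and 40. With noisy data the result depends on the choice of `penalty`.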

preprint, 2015, arXiv

A model selection approach for multiple sequence segmentation and dimensionality reduction

In this paper we consider the problem of segmenting $n$ aligned random sequences of equal length $m$ into a finite number of independent blocks. We propose to use a penalized maximum likelihood criterion to infer simultaneously the number of points of independence as well as the position of each one of these points. We show how to compute the estimator efficiently by means of a dynamic programming algorithm with time complexity $O(m^2n)$. We also propose another algorithm, called the hierarchical algorithm, that provides an approximation to the estimator when the sample size increases and runs in time $O(mn)$. Our main theoretical result is the proof of almost sure consistency of the estimator and the convergence of the hierarchical algorithm when the sample size $n$ grows to infinity. We illustrate the convergence of these algorithms through some simulation examples and we apply the method to a real protein sequence alignment of Ebola virus.
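A dynamic program of the kind described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the penalty is assumed to be BIC-style, `c * (alphabet_size ** block_length - 1) * log(n)`, and the names `block_score`, `segment` and `c` are hypothetical. The block counts are recomputed from scratch for each candidate block, so this naive version does not reach the $O(m^2n)$ running time stated above.

```python
import math
from collections import Counter


def block_score(rows, i, j, alphabet_size, c=1.0):
    """Penalized log-likelihood of treating columns i..j-1 as one block."""
    n = len(rows)
    # Empirical distribution of the words observed in columns i..j-1.
    counts = Counter(tuple(r[i:j]) for r in rows)
    loglik = sum(k * math.log(k / n) for k in counts.values())
    # Assumed penalty: free parameters of the block times log(n).
    n_params = alphabet_size ** (j - i) - 1
    return loglik - c * n_params * math.log(n)


def segment(rows, alphabet_size, c=1.0):
    """Return the blocks (start, end) maximizing the summed block scores."""
    m = len(rows[0])
    best = [0.0] + [-math.inf] * m  # best[j]: optimal score for columns 0..j-1
    prev = [0] * (m + 1)            # prev[j]: start of the last block ending at j
    for j in range(1, m + 1):
        for i in range(j):
            s = best[i] + block_score(rows, i, j, alphabet_size, c)
            if s > best[j]:
                best[j], prev[j] = s, i
    # Backtrack from the last column to recover the blocks.
    blocks, j = [], m
    while j > 0:
        blocks.append((prev[j], j))
        j = prev[j]
    return blocks[::-1]


# Toy data: columns 0-1 are copies of one bit, columns 2-3 copies of another,
# and the two bits are empirically independent.
rows = [(k % 2, k % 2, (k // 2) % 2, (k // 2) % 2) for k in range(100)]
print(segment(rows, alphabet_size=2))
```

On this toy alignment the sketch separates the columns into the blocks `(0, 2)` and `(2, 4)`, since merging them adds parameters without improving the likelihood and splitting further loses the dependence within each pair.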

preprint, 2015, arXiv

A test of hypotheses for random graph distributions built from EEG data

The theory of random graphs has been applied in recent years to model neural interactions in the brain. While the probabilistic properties of random graphs have been extensively studied in the literature, the development of statistical inference methods for this class of objects has received less attention. In this work we propose a non-parametric hypothesis test of whether two samples of random graphs originated from the same probability distribution. We show how to compute the test statistic efficiently and we study its performance on simulated data. We apply the test to compare graphs of brain functional network interactions built from electroencephalographic (EEG) data collected during the visualization of point light displays depicting human locomotion.

preprint, 2015, arXiv

Nonparametric statistical inference for the context tree of a stationary ergodic process

We consider the problem of estimating the context tree of a stationary ergodic process with finite alphabet without imposing additional conditions on the process. As a starting point we introduce a Hamming metric in the space of irreducible context trees and we use the properties of the weak topology in the space of ergodic stationary processes to prove that if the Hamming metric is unbounded, there exist no consistent estimators for the context tree. Even in the bounded case we show that there exist no two-sided confidence bounds. However we prove that one-sided inference is possible in this general setting and we construct a consistent estimator that is a lower bound for the context tree of the process with an explicit formula for the coverage probability. We develop an efficient algorithm to compute the lower bound and we apply the method to test a linguistic hypothesis about the context tree of codified written texts in European Portuguese.

preprint, 2014, arXiv

Loss of memory of hidden Markov models and Lyapunov exponents

In this paper we prove that the asymptotic rate of exponential loss of memory of a finite state hidden Markov model is bounded above by the difference of the first two Lyapunov exponents of a certain product of matrices. We also show that this bound is in fact realized, namely for almost all realizations of the observed process we can find symbols where the asymptotic exponential rate of loss of memory attains the difference of the first two Lyapunov exponents. These results are derived in particular for the observed process and for the filter; that is, for the distribution of the hidden state conditioned on the observed sequence. We also prove similar results in total variation.

preprint, 2013, arXiv

Finding the basic neighborhood in variable range Markov random fields: application in SNP association studies

SNP (single nucleotide polymorphism) genotyping platforms are of great value for gene mapping of complex diseases. Nowadays, the high density of these molecular markers enables studies of dependence patterns between loci over the genome, allowing a simultaneous inference of dependence structure and disease association. In this paper we propose a method based on the theory of variable range Markov random fields to estimate the extent of dependence among SNPs allowing variable windows along the genome. The advantage of this method is that it allows the simultaneous prediction of dependence and independence regions among SNPs, without restricting a priori the range of dependence. We introduce an estimator based on the idea of penalized maximum likelihood to find the conditional dependence neighborhood of each SNP in the sample and we prove its consistency. We apply our method to autosomal SNP genotypic data with unknown phase in the context of case-control association studies. By examining rheumatoid arthritis data from the Genetic Analysis Workshop 16 (GAW16), we show the utility of the Markov model under variable range dependence.

preprint, 2012, arXiv

Context tree selection and linguistic rhythm retrieval from written texts

The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry out this approach, we compare texts from European and Brazilian Portuguese. These texts are first encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion, which is a constant-free procedure for model selection. As a by-product, this provides a solution for the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also carry out a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.

preprint, 2011, arXiv

Context Tree Selection: A Unifying View

The present paper investigates non-asymptotic properties of two popular procedures of context tree (or Variable Length Markov Chains) estimation: Rissanen's algorithm Context and the Penalized Maximum Likelihood criterion. First showing how they are related, we prove finite horizon bounds for the probability of over- and under-estimation. Concerning overestimation, no boundedness or loss-of-memory conditions are required: the proof relies on new deviation inequalities for empirical probabilities of independent interest. The underestimation properties rely on loss-of-memory and separation conditions of the process. These results improve and generalize the bounds obtained previously. Context tree models have been introduced by Rissanen as a parsimonious generalization of Markov models. Since then, they have been widely used in applied probability and statistics.
