Researcher profile

Matthieu Marbac

Matthieu Marbac contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
6 works
0 followers
3 topics
4 close collaborators

Published work

6 published items

Preprint, 2021, arXiv

Detecting spatial clusters in functional data: new scan statistic approaches

We have developed two scan statistics for detecting clusters of functional data indexed in space. The first method is based on an adaptation of a functional analysis of variance and the second one is based on a distribution-free spatial scan statistic for univariate data. In a simulation study, the distribution-free method always performed better than a nonparametric functional scan statistic, and the adaptation of the anova also performed better for data with a normal or a quasi-normal distribution. Our methods can detect smaller spatial clusters than the nonparametric method. Lastly, we used our scan statistics for functional data to search for spatial clusters of abnormal unemployment rates in France over the period 1998-2013 (divided into quarters).

Preprint, 2015, arXiv

A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data

We introduce a new family of one-factor distributions for high-dimensional binary data. The model provides an explicit probability for each event, thus avoiding the numeric approximations often made by existing methods. Model interpretation is easy since each variable is described by two continuous parameters (corresponding to its marginal probability and to its strength of dependency with the other variables) and by one binary parameter (defining whether the dependencies are positive or negative). An extension of this new model is proposed by assuming that the variables are split into independent blocks which follow the new one-factor distribution. Parameter estimation is performed by the inference margin procedure where the second step is achieved by an expectation-maximization algorithm. Model selection is carried out by a deterministic approach which strongly reduces the number of competing models. This approach uses a hierarchical ascendant classification of the variables based on the empirical version of Cramér's V for selecting a narrow subset of models. The consistency of this procedure is shown. The new model is evaluated on numerical experiments and on a real data set. The procedure is implemented in the R package MvBinary available on CRAN.
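
For readers unfamiliar with the dependency measure mentioned in this abstract, the following is a minimal sketch of the empirical Cramér's V between two categorical (here binary) variables, using the standard textbook definition. The function name `cramers_v` is hypothetical and the code is not taken from the MvBinary package. It only computes the pairwise measure and does not reproduce the hierarchical classification or the model selection step.

```python
from math import sqrt


def cramers_v(x, y):
    """Empirical Cramér's V between two equal-length sequences of labels.

    Standard definition: V = sqrt(chi2 / (n * min(r - 1, c - 1))),
    where chi2 is Pearson's chi-squared statistic of the r x c
    contingency table built from the paired observations.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    rows, cols = sorted(set(x)), sorted(set(y))
    # A constant variable carries no dependency information.
    if len(rows) < 2 or len(cols) < 2:
        return 0.0

    # Contingency table stored as a dict of observed counts.
    counts = {}
    for a, b in zip(x, y):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    row_tot = {r: sum(counts.get((r, c), 0) for c in cols) for r in rows}
    col_tot = {c: sum(counts.get((r, c), 0) for r in rows) for c in cols}

    # Pearson chi-squared: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / n
            observed = counts.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * min(len(rows) - 1, len(cols) - 1)))
```

For two binary variables, V equals the absolute value of the phi coefficient, so it measures the strength of the dependency but not its sign.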

Preprint, 2015, arXiv

Bayesian model selection in logistic regression for the detection of adverse drug reactions

Motivation: Spontaneous adverse event reports have a high potential for detecting adverse drug reactions. However, due to their dimension, exploring such databases requires statistical methods. In this context, disproportionality measures are used. However, by projecting the data onto contingency tables, these methods become sensitive to the problem of co-prescriptions and masking effects. Recently, logistic regressions have been used with a Lasso type penalty to perform the detection of associations between drugs and adverse events. However, the choice of the penalty value is open to criticism while it strongly influences the results. Results: In this paper, we propose to use a logistic regression whose sparsity is viewed as a model selection challenge. Since the model space is huge, a Metropolis-Hastings algorithm carries out the model selection by maximizing the BIC criterion. Thus, we avoid the calibration of penalty or threshold. During our application on the French pharmacovigilance database, the proposed method is compared to well established approaches on a reference data set, and obtains better rates of positive and negative controls. However, many signals are not detected by the proposed method. So, we conclude that this method should be used in parallel to existing measures in pharmacovigilance.
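
As a reminder of the criterion named in this abstract, one common form of the BIC is written below in the "larger is better" convention, which matches the abstract's wording of maximizing it. The symbols are assumptions introduced here for illustration and may differ from the paper's notation: $m$ is a candidate model (a subset of drugs kept in the regression), $\hat{\theta}_m$ its maximum likelihood estimate, $\nu_m$ its number of free parameters, and $n$ the number of reports.

```latex
\mathrm{BIC}(m) = \log L(\hat{\theta}_m ; \text{data}) - \frac{\nu_m}{2} \log n
```

Under this convention, adding a covariate is only worthwhile when the gain in log-likelihood exceeds $\frac{1}{2}\log n$ per extra parameter, which is why no penalty value has to be calibrated by hand.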

Preprint, 2015, arXiv

Model-based clustering of Gaussian copulas for mixed data

Clustering mixed data is a challenging problem. In a probabilistic framework, the main difficulty is due to a shortage of conventional distributions for such data. In this paper, we propose to cluster mixed data with a Gaussian copula mixture model, since copulas, and in particular the Gaussian ones, are powerful tools for easily modelling the distribution of multivariate variables. Indeed, considering a mix of continuous, integer and ordinal variables (thus all having a cumulative distribution function), this copula mixture model defines intra-component dependencies similar to those of a Gaussian mixture, and so with the classical meaning of correlation. Simultaneously, it preserves standard margins associated with continuous, integer and ordered features, namely the Gaussian, the Poisson and the ordered multinomial distributions. As an interesting by-product, the proposed mixture model generalizes many well-known ones and also provides visualization tools based on the parameters. At a practical level, Bayesian inference is retained and it is achieved with a Metropolis-within-Gibbs sampler. Experiments on simulated and real data sets finally illustrate the expected advantages of the proposed model for mixed data: flexible and meaningful parametrization combined with visualization features.

Preprint, 2014, arXiv

Finite mixture model of conditional dependencies modes to cluster categorical data

We propose a parsimonious extension of the classical latent class model to cluster categorical data by relaxing the class conditional independence assumption. Under this new mixture model, named the Conditional Modes Model, variables are grouped into conditionally independent blocks. The corresponding block distribution is a parsimonious multinomial distribution where the few free parameters correspond to the most likely modality crossings, while the remaining probability mass is uniformly spread over the other modality crossings. Thus, the proposed model brings out the intra-class dependency between variables and summarizes each class by a few characteristic modality crossings. Model selection is performed via a Metropolis-within-Gibbs sampler to overcome the computational intractability of the block structure search. As this approach involves the computation of the integrated complete-data likelihood, we propose a new method (exact for the continuous parameters and approximated for the discrete ones) which avoids the biases of the BIC criterion pointed out by our experiments. Finally, the parameters are only estimated for the best model via an EM algorithm. The characteristics of the new model are illustrated on simulated data and on two biological data sets. These results strengthen the idea that this simple model reduces the biases induced by the conditional independence assumption and gives meaningful parameters. Both applications were performed with the R package CoModes.
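
The block distribution described in this abstract can be illustrated with a small sketch: a few modality crossings (the modes) receive their own probabilities, and the leftover mass is shared equally among all other crossings. This is only an illustration under stated assumptions. The function name `block_pmf` and its arguments are hypothetical and are not taken from the CoModes package, and the sketch does not enforce any ordering constraint between mode probabilities and the uniform remainder.

```python
from itertools import product


def block_pmf(levels, modes):
    """Probability mass function of one block of categorical variables.

    levels: number of modalities of each variable in the block, e.g. [2, 3].
    modes:  dict mapping a modality crossing (tuple of level indices) to
            its free probability; all other crossings share the rest.
    """
    crossings = list(product(*(range(m) for m in levels)))
    unknown = [c for c in modes if c not in crossings]
    if unknown:
        raise ValueError(f"modes outside the block support: {unknown}")

    mode_mass = sum(modes.values())
    if mode_mass > 1.0 + 1e-12:
        raise ValueError("mode probabilities must sum to at most 1")

    n_other = len(crossings) - len(modes)
    if n_other == 0:
        # Every crossing is a mode: the modes must carry all the mass.
        if abs(mode_mass - 1.0) > 1e-12:
            raise ValueError("mode probabilities must sum to 1")
        rest = 0.0
    else:
        # Remaining mass spread uniformly over the non-mode crossings.
        rest = (1.0 - mode_mass) / n_other

    return {c: modes.get(c, rest) for c in crossings}


# Example: a block of two variables with 2 and 3 modalities (6 crossings),
# two of which are modes; the other four share the remaining 0.2 equally.
pmf = block_pmf([2, 3], {(0, 0): 0.5, (1, 2): 0.3})
```

With two modes, this block needs two free parameters, whereas a full multinomial over the six crossings would need five, which is the sense in which the distribution is parsimonious.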

Preprint, 2014, arXiv

Model-based clustering for conditionally correlated categorical data

An extension of the latent class model is presented for clustering categorical data by relaxing the classical "class conditional independence assumption" of variables. This model consists in grouping the variables into inter-independent and intra-dependent blocks, in order to consider the main intra-class correlations. The dependency between variables grouped inside the same block of a class is taken into account by mixing two extreme distributions, which are respectively the independence and the maximum dependency. When the variables are dependent given the class, this approach is expected to reduce the biases of the latent class model. Indeed, it produces a meaningful dependency model with only a few additional parameters. The parameters are estimated, by maximum likelihood, by means of an EM algorithm. Moreover, a Gibbs sampler is used for model selection in order to overcome the computational intractability of the combinatorial problems involved by the block structure search. Two applications on medical and biological data sets show the relevance of this new model. The results strengthen the view that this model is meaningful and that it reduces the biases induced by the conditional independence assumption of the latent class model.