Source author record

Emmanuel Bacry

Emmanuel Bacry appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning; q-fin.ST; q-fin.TR; Methodology; cond-mat.stat-mech; math.PR; Applications; cs.CY; Distributed, Parallel, and Cluster Computing; math.DS; math.ST; physics.data-an; physics.geo-ph; Statistics Theory

Catalog footprint

What is connected

16 works

14 topics

4 close collaborators

preprint, 2026, arXiv

From rough to multifractal multidimensional volatility: A multidimensional Log S-fBM model

We introduce the multivariate Log S-fBM model (mLog S-fBM), extending the univariate framework proposed by Wu et al. to the multidimensional setting. We define the multidimensional Stationary fractional Brownian motion (mS-fBM), characterized by marginals following S-fBM dynamics and a specific cross-covariance structure. It is parametrized by a correlation scale $T$, marginal-specific intermittency parameters and Hurst exponents, as well as their multidimensional counterparts: the co-intermittency matrix and the co-Hurst matrix. The mLog S-fBM is constructed by modeling volatility components as exponentials of the mS-fBM, preserving the dependence structure of the Gaussian core. We demonstrate that the model is well-defined for any co-Hurst matrix with entries in $[0, \frac{1}{2}[$, supporting vanishing co-Hurst parameters to bridge rough volatility and multifractal regimes. We generalize the small intermittency approximation technique to the multivariate setting to develop an efficient Generalized Method of Moments calibration procedure, estimating cross-covariance parameters for pairs of marginals. We validate it on synthetic data and apply it to S&P 500 market data, modeling stock return fluctuations. Diagonal estimates of the stock Hurst matrix, corresponding to single-stock log-volatility Hurst exponents, are close to 0, indicating multifractal behavior, while co-Hurst off-diagonal entries are close to the Hurst exponent of the S&P 500 index ($H \approx 0.12$), and co-intermittency off-diagonal entries align with univariate intermittency estimates.

preprint, 2022, arXiv

From Rough to Multifractal volatility: the log S-fBM model

We introduce a family of random measures $M_{H,T} (d t)$, namely log S-fBM, such that, for $H>0$, $M_{H,T}(d t) = e^{ω_{H,T}(t)} d t$ where $ω_{H,T}(t)$ is a Gaussian process that can be considered as a stationary version of an $H$-fractional Brownian motion. Moreover, when $H \to 0$, one has $M_{H,T}(d t) \rightarrow {\widetilde M}_{T}(d t)$ (in the weak sense) where ${\widetilde M}_{T}(d t)$ is the celebrated log-normal multifractal random measure (MRM). Thus, this model allows us to consider, within the same framework, the two popular classes of multifractal ($H = 0$) and rough volatility ($0<H < 1/2$) models. The main properties of the log S-fBM are discussed and their estimation issues are addressed. We notably show that the direct estimation of $H$ from the scaling properties of $\ln(M_{H,T}([t, t+τ]))$, at fixed $τ$, can strongly overestimate $H$. We propose a better GMM estimation method which is shown to be valid in the high-frequency asymptotic regime. When applied to a large set of empirical volatility data, we observe that stock indices have values around $H=0.1$ while individual stocks are characterized by values of $H$ that can be very close to $0$ and thus well described by a MRM. We also bring evidence that, unlike the log-volatility variance $ν^2$, whose estimation appears to be unreliable (though it is widely used in the rough volatility literature), the estimation of the so-called "intermittency coefficient" $λ^2$, which is the product of $ν^2$ and the Hurst exponent $H$, appears to be far more reliable, leading to values that seem to be universal across all individual stocks and across all stock indices, respectively.

preprint, 2020, arXiv

SCALPEL3: a scalable open-source library for healthcare claims databases

This article introduces SCALPEL3, a scalable open-source framework for studies involving Large Observational Databases (LODs). Its design eases medical observational studies thanks to abstractions allowing concept extraction, high-level cohort manipulation, and production of data formats compatible with machine learning libraries. SCALPEL3 has successfully been used on the SNDS database (see Tuppin et al. (2017)), a very large healthcare claims database that handles the reimbursement of almost all French citizens. SCALPEL3 focuses on scalability, easy interactive analysis and helpers for data flow analysis to accelerate studies performed on LODs. It consists of three open-source libraries based on Apache Spark. SCALPEL-Flattening allows denormalization of the LOD (only SNDS for now) by joining tables sequentially into one big table. SCALPEL-Extraction provides fast concept extraction from a big table such as the one produced by SCALPEL-Flattening. Finally, SCALPEL-Analysis allows interactive cohort manipulations, monitoring statistics of cohort flows and building datasets to be used with machine learning libraries. The first two provide a Scala API while the last one provides a Python API that can be used in an interactive environment. Our code is available on GitHub. SCALPEL3 made it possible to extract complex concepts for studies such as Morel et al. (2017), or for studies with 14.5 million patients observed over three years (corresponding to more than 15 billion healthcare events and roughly 15 terabytes of data), in less than 49 minutes on a small 15-node HDFS cluster. SCALPEL3 provides sharp interactive control of data processing through legible code, which helps to build fully reproducible studies, leading to improved maintainability and auditability of studies performed on LODs.

preprint, 2020, arXiv

Sparse and low-rank multivariate Hawkes processes

We consider the problem of unveiling the implicit network structure of node interactions (such as user interactions in a social network), based only on high-frequency timestamps. Our inference is based on the minimization of the least-squares loss associated with a multivariate Hawkes model, penalized by the $\ell_1$ and trace norms of the interaction tensor. We provide a first theoretical analysis for this problem that includes sparsity- and low-rank-inducing penalizations. This result involves a new data-driven concentration inequality for matrix martingales in continuous time with observable variance, a result of independent interest with a broad range of possible applications, since it extends to matrix martingales earlier results restricted to the scalar case. A consequence of our analysis is the construction of sharply tuned $\ell_1$ and trace-norm penalizations, which lead to a data-driven scaling of the variability of information available for each user. Numerical experiments illustrate the significant improvements achieved by the use of such data-driven penalizations.

preprint, 2020, arXiv

ZiMM: a deep learning model for long term and blurry relapses with non-clinical claims data

This paper considers the problems of modeling and predicting a long-term and "blurry" relapse that occurs after a medical act, such as a surgery. The relapse is observed only indirectly, in a "blurry" fashion, through longitudinal prescriptions of drugs over a long period of time after the medical act. We introduce a new model, called ZiMM (Zero-inflated Mixture of Multinomial distributions), in order to capture long-term and blurry relapses. On top of it, we build an end-to-end deep-learning architecture called ZiMM Encoder-Decoder (ZiMM ED) that can learn from the complex, irregular, highly heterogeneous and sparse patterns of health events that are observed through a claims-only database. ZiMM ED is applied to a "non-clinical" claims database that contains only timestamped reimbursement codes for drug purchases, medical procedures and hospital diagnoses, the only available clinical feature being the age of the patient. This setting is more challenging than a setting where bedside clinical signals are available. Our motivation for using such a non-clinical claims database is its population-wide exhaustiveness, compared to clinical electronic health records coming from a single hospital or a small set of hospitals. Indeed, we consider a dataset containing the claims of almost all French citizens who had surgery for prostatic problems, with a history between 1.5 and 5 years. We consider a long-term (18 months) relapse (urination problems still occur despite surgery), which is blurry since it is observed only through the reimbursement of a specific set of drugs for urination problems. Our experiments show that ZiMM ED improves on several baselines, including non-deep-learning and deep-learning approaches, and that it allows working on such a dataset with minimal preprocessing work.

preprint, 2016, arXiv

Concentration for matrix martingales in continuous time and microscopic activity of social networks

This paper gives new concentration inequalities for the spectral norm of a wide class of matrix martingales in continuous time. These results extend previously established Freedman and Bernstein inequalities for series of random matrices to the class of continuous time processes. Our analysis relies on a new supermartingale property of the trace exponential proved within the framework of stochastic calculus. We also provide several examples illustrating that our results make it easy to recover several previously obtained sharp bounds for discrete time matrix martingales.

preprint, 2016, arXiv

SGD with Variance Reduction beyond Empirical Risk Minimization

We introduce a doubly stochastic proximal gradient algorithm for optimizing a finite average of smooth convex functions, whose gradients depend on numerically expensive expectations. Our main motivation is the acceleration of the optimization of the regularized Cox partial-likelihood (the core model used in survival analysis), but our algorithm can be used in different settings as well. The proposed algorithm is doubly stochastic in the sense that gradient steps are done using stochastic gradient descent (SGD) with variance reduction, where the inner expectations are approximated by a Markov chain Monte Carlo (MCMC) algorithm. We derive conditions on the number of MCMC iterations guaranteeing convergence, and obtain a linear rate of convergence under strong convexity and a sublinear rate without this assumption. We show that our algorithm improves on the state-of-the-art solver for regularized Cox partial-likelihood on several datasets from survival analysis.
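One ingredient named in this abstract, SGD with variance reduction, can be illustrated with a plain SVRG loop on a least-squares toy problem. This is a minimal sketch under stated assumptions, not the paper's doubly stochastic algorithm: it has no MCMC inner approximation, no proximal step and no Cox partial likelihood. The function name `svrg_least_squares`, the step size and the epoch count are illustrative choices, not taken from the source.

```python
import random


def svrg_least_squares(X, y, step=0.1, epochs=60, seed=0):
    """Minimise (1 / 2n) * sum_i (x_i . w - y_i)^2 with plain SVRG.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def grad_i(w_vec, i):
        # Gradient of the i-th term 0.5 * (x_i . w - y_i)^2.
        residual = sum(xj * wj for xj, wj in zip(X[i], w_vec)) - y[i]
        return [residual * xj for xj in X[i]]

    for _ in range(epochs):
        snapshot = list(w)
        # Full gradient at the snapshot, computed once per epoch.
        full = [0.0] * d
        for i in range(n):
            g = grad_i(snapshot, i)
            full = [f + gj / n for f, gj in zip(full, g)]
        # Inner loop: n cheap stochastic steps per epoch.
        for _ in range(n):
            i = rng.randrange(n)
            g_w = grad_i(w, i)
            g_s = grad_i(snapshot, i)
            # Variance-reduced estimate: unbiased for the full gradient at w,
            # with variance that shrinks as w and the snapshot approach
            # the minimiser.
            w = [wj - step * (a - b + f)
                 for wj, a, b, f in zip(w, g_w, g_s, full)]
    return w
```

On a small noiseless design such as `X = [[1, 0], [0, 1], [1, 1], [1, -1]]` with targets generated from `w = [2, -1]`, the returned vector should be close to `[2, -1]`.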

preprint, 2015, arXiv

Hawkes processes in finance

In this paper we propose an overview of the recent academic literature devoted to the applications of Hawkes processes in finance. Hawkes processes constitute a particular class of multivariate point processes that has become very popular in empirical high frequency finance over the last decade. After a reminder of the main definitions and properties that characterize Hawkes processes, we review their main empirical applications to address many different problems in high frequency finance. Because of their great flexibility and versatility, we show that they have been successfully applied to issues as diverse as estimating the volatility at the level of transaction data, estimating market stability, accounting for systemic risk contagion, devising optimal execution strategies or capturing the dynamics of the full order book.
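For readers unfamiliar with the point processes this survey covers, the sketch below simulates a univariate Hawkes process with an exponential kernel using Ogata's thinning method. The exponential kernel, the parameter names `mu`, `alpha`, `beta` and the function `simulate_hawkes` are assumptions made for illustration; the survey itself deals with multivariate processes and more general kernels.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on [0, horizon] for the intensity
    lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)).

    Hypothetical helper for illustration only; requires mu > 0.
    """
    if alpha >= beta:
        # alpha / beta is the branching ratio; it must stay below 1
        # for the process to be stationary.
        raise ValueError("need alpha / beta < 1 for a stationary process")
    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity at time t
    while True:
        # Between events the intensity only decays, so its current value
        # is a valid upper bound until the next event.
        lam_bar = mu + excitation
        wait = rng.expovariate(lam_bar)
        if t + wait > horizon:
            break
        t += wait
        excitation *= math.exp(-beta * wait)
        # Accept the candidate with probability lambda(t) / lam_bar.
        if rng.random() * lam_bar <= mu + excitation:
            events.append(t)
            excitation += alpha
    return events
```

With `alpha = 0` the loop reduces to a homogeneous Poisson process of rate `mu`; increasing `alpha / beta` towards 1 produces increasingly clustered event times.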

preprint, 2015, arXiv

Intermittent process analysis with scattering moments

Scattering moments provide nonparametric models of random processes with stationary increments. They are expected values of random variables computed with a nonexpansive operator, obtained by iteratively applying wavelet transforms and modulus nonlinearities, which preserves the variance. First- and second-order scattering moments are shown to characterize intermittency and self-similarity properties of multiscale processes. Scattering moments of Poisson processes, fractional Brownian motions, Lévy processes and multifractal random walks are shown to have characteristic decay. The Generalized Method of Simulated Moments is applied to scattering moments to estimate data generating models. Numerical applications are shown on financial time-series and on energy dissipation of turbulent flows.

preprint, 2015, arXiv

Mean-field inference of Hawkes point processes

We propose a fast and efficient estimation method that is able to accurately recover the parameters of a d-dimensional Hawkes point-process from a set of observations. We exploit a mean-field approximation that is valid when the fluctuations of the stochastic intensity are small. We show that this is notably the case in situations when interactions are sufficiently weak, when the dimension of the system is high or when the fluctuations are self-averaging due to the large number of past events they involve. In such a regime the estimation of a Hawkes process can be mapped onto a least-squares problem for which we provide an analytic solution. Though this estimator is biased, we show that its precision can be comparable to that of the Maximum Likelihood Estimator while its computation speed is considerably improved. We give a theoretical control on the accuracy of our new approach and illustrate its efficiency using synthetic datasets, in order to assess the statistical estimation error of the parameters.

preprint, 2015, arXiv

Second order statistics characterization of Hawkes processes and non-parametric estimation

We show that the jumps correlation matrix of a multivariate Hawkes process is related to the Hawkes kernel matrix through a system of Wiener-Hopf integral equations. A Wiener-Hopf argument allows one to prove that this system (in which the kernel matrix is the unknown) possesses a unique causal solution and consequently that the second-order properties fully characterize a Hawkes process. The numerical inversion of this system of integral equations allows us to propose a fast and efficient method, whose main principles were initially sketched in [Bacry and Muzy, 2013], to perform a non-parametric estimation of the Hawkes kernel matrix. In this paper, we perform a systematic study of this non-parametric estimation procedure in the general framework of marked Hawkes processes. We describe this procedure precisely, step by step. We discuss the estimation error and explain how the values for the main parameters should be chosen. Various numerical examples are given in order to illustrate the broad possibilities of this estimation procedure, ranging from 1-dimensional (power-law or non-positive kernels) up to 3-dimensional (circular dependence) processes. A comparison to other non-parametric estimation procedures is made. Applications to high frequency trading events in financial markets and to earthquake occurrence dynamics are finally considered.

preprint, 2014, arXiv

Estimation of slowly decreasing Hawkes kernels: Application to high frequency order book modelling

We present a modified version of the non-parametric Hawkes kernel estimation procedure studied in arXiv:1401.0903 that is adapted to slowly decreasing kernels. We show on numerical simulations involving a reasonable number of events that this method allows us to estimate faithfully a power-law decreasing kernel over at least 6 decades. We then propose an 8-dimensional Hawkes model for all events associated with the first level of some asset order book. Applying our estimation procedure to this model allows us to uncover the main properties of the coupled dynamics of trade, limit and cancel orders in relationship with the mid-price variations.

preprint, 2014, arXiv

Linear processes in high-dimension: phase space and critical properties

In this work we investigate the generic properties of a stochastic linear model in the regime of high-dimensionality. We consider in particular the Vector AutoRegressive model (VAR) and the multivariate Hawkes process. We analyze both deterministic and random versions of these models, showing the existence of a stable and an unstable phase. We find that along the transition region separating the two regimes, the correlations of the process decay slowly, and we characterize the conditions under which these slow correlations are expected to become power-laws. We check our findings with numerical simulations showing remarkable agreement with our predictions. We finally argue that real systems with a strong degree of self-interaction are naturally characterized by this type of slow relaxation of the correlations.
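The stable and unstable phases mentioned in this abstract are commonly characterised by a spectral radius: a VAR(1) model is stable when the spectral radius of its coefficient matrix is below 1, and a linear multivariate Hawkes process with non-negative kernels is stationary when the spectral radius of the matrix of kernel integrals is below 1. The sketch below estimates that radius by power iteration. It assumes an entrywise non-negative matrix whose dominant eigenvalue is unique in modulus; the names `spectral_radius` and `is_stable` are hypothetical and not taken from the paper.

```python
def spectral_radius(matrix, iterations=200):
    """Estimate the spectral radius of an entrywise non-negative
    square matrix (list of rows) by power iteration.

    Hypothetical helper for illustration only.
    """
    n = len(matrix)
    v = [1.0] * n
    radius = 0.0
    for _ in range(iterations):
        # One matrix-vector product.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)
        if norm == 0.0:
            # The iterate was mapped to zero (e.g. a nilpotent matrix).
            return 0.0
        # With v normalised to max-norm 1, the norm of the product
        # converges to the dominant eigenvalue.
        radius = norm
        v = [x / norm for x in w]
    return radius


def is_stable(matrix):
    """True when the estimated spectral radius is strictly below 1."""
    return spectral_radius(matrix) < 1.0
```

For example, `is_stable([[0.2, 0.3], [0.1, 0.4]])` falls in the stable phase (eigenvalues 0.5 and 0.1), whereas `[[0.9, 0.5], [0.5, 0.9]]` has dominant eigenvalue 1.4 and falls in the unstable one.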

preprint, 2014, arXiv

Market impacts and the life cycle of investors orders

In this paper, we use a database of around 400,000 metaorders issued by investors and electronically traded on European markets in 2010 in order to study market impact at different scales. At the intraday scale, we confirm a square root temporary impact in the daily participation, and we shed light on a duration factor in $1/T^γ$ with $γ\simeq 0.25$. Including this factor in the fits reinforces the square root shape of impact. We observe a power-law for the transient impact with an exponent between $0.5$ (for long metaorders) and $0.8$ (for shorter ones). Moreover, we show that the market does not anticipate the size of the metaorders. The intraday decay seems to exhibit two regimes (though hard to identify precisely): a "slow" regime right after the execution of the metaorder followed by a faster one. At the daily time scale, we show that price moves after a metaorder can be split between realizations of expected returns that have triggered the investing decision and an idiosyncratic impact that slowly decays to zero. Moreover, we propose a class of toy models based on Hawkes processes (the Hawkes Impact Models, HIM) to illustrate our reasoning. We show how the Impulsive-HIM model, despite its simplicity, embeds appealing features like transience and decay of impact. The latter is governed by a parameter $C$ having a macroscopic interpretation: the ratio of the contrarian reaction (i.e. impact decay) to the "herding" reaction (i.e. impact amplification).

preprint, 2012, arXiv

Scaling limits for Hawkes processes and application to financial statistics

We prove a law of large numbers and a functional central limit theorem for multivariate Hawkes processes observed over a time interval $[0,T]$ in the limit $T \rightarrow \infty$. We further exhibit the asymptotic behaviour of the covariation of the increments of the components of a multivariate Hawkes process, when the observations are imposed by a discrete scheme with mesh $Δ$ over $[0,T]$ up to some further time shift $τ$. The behaviour of this functional depends on the relative size of $Δ$ and $τ$ with respect to $T$ and enables us to give a full account of the second-order structure. As an application, we develop our results in the context of financial statistics. We introduced in a previous work a microscopic stochastic model for the variations of a multivariate financial asset, based on Hawkes processes and confined to live on a tick grid. We derive and characterise the exact macroscopic diffusion limit of this model and show in particular its ability to reproduce important empirical stylised facts such as the Epps effect and the lead-lag effect. Moreover, our approach enables us to track these effects across scales in rigorous mathematical terms.

preprint, 2010, arXiv

The nature of price returns during periods of high market activity

By studying all the trades and best bids/asks of ultra high frequency snapshots recorded from the order books of a basket of 10 futures assets, we bring qualitative empirical evidence that the impact of a single trade depends on the intertrade time lags. We find that when the trading rate becomes faster, the return variance per trade or the impact, as measured by the price variation in the direction of the trade, strongly increases. We provide evidence that these properties persist at coarser time scales. We also show that the spread value is an increasing function of the activity. This suggests that order books are more likely to be empty when the trading rate is high.
