Source author record

Daniel Jones

Daniel Jones appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Cryptography and Security Machine Learning Computation and Language Molecular Networks q-fin.ST

Catalog footprint

What is connected

4works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Bayesian Estimation of Differential Privacy

Algorithms such as Differentially Private SGD enable training machine learning models with formal privacy guarantees. However, there is a discrepancy between the protection that such algorithms guarantee in theory and the protection they afford in practice. An emerging strand of work empirically estimates the protection afforded by differentially private training as a confidence interval for the privacy budget $\varepsilon$ spent on training a model. Existing approaches derive confidence intervals for $\varepsilon$ from confidence intervals for the false positive and false negative rates of membership inference attacks. Unfortunately, obtaining narrow high-confidence intervals for $ε$ using this method requires an impractically large sample size and training as many models as samples. We propose a novel Bayesian method that greatly reduces sample size, and adapt and validate a heuristic to draw more than one sample per trained model. Our Bayesian method exploits the hypothesis testing interpretation of differential privacy to obtain a posterior for $\varepsilon$ (not just a confidence interval) from the joint posterior of the false positive and false negative rates of membership inference attacks. For the same sample size and confidence, we derive confidence intervals for $\varepsilon$ around 40% narrower than prior work. The heuristic, which we adapt from label-only DP, can be used to further reduce the number of trained models needed to get enough samples by up to 2 orders of magnitude.

preprint2021arXiv

Training Data Leakage Analysis in Language Models

Recent advances in neural network based language models lead to successful deployments of such models, improving user experience in various applications. It has been demonstrated that strong performance of language models comes along with the ability to memorize rare training samples, which poses serious privacy threats in case the model is trained on confidential user content. In this work, we introduce a methodology that investigates identifying the user content in the training data that could be leaked under a strong and realistic threat model. We propose two metrics to quantify user-level data leakage by measuring a model's ability to produce unique sentence fragments within training data. Our metrics further enable comparing different models trained on the same data in terms of privacy. We demonstrate our approach through extensive numerical studies on both RNN and Transformer based models. We further illustrate how the proposed metrics can be utilized to investigate the efficacy of mitigations like differentially private training or API hardening.

preprint2014arXiv

Modelling of dependence in high-dimensional financial time series by cluster-derived canonical vines

We extend existing models in the financial literature by introducing a cluster-derived canonical vine (CDCV) copula model for capturing high dimensional dependence between financial time series. This model utilises a simplified market-sector vine copula framework similar to those introduced by Heinen and Valdesogo (2008) and Brechmann and Czado (2013), which can be applied by conditioning asset time series on a market-sector hierarchy of indexes. While this has been shown by the aforementioned authors to control the excessive parameterisation of vine copulas in high dimensions, their models have relied on the provision of externally sourced market and sector indexes, limiting their wider applicability due to the imposition of restrictions on the number and composition of such sectors. By implementing the CDCV model, we demonstrate that such reliance on external indexes is redundant as we can achieve equivalent or improved performance by deriving a hierarchy of indexes directly from a clustering of the asset time series, thus abstracting the modelling process from the underlying data.

preprint2010arXiv

Effect of promoter architecture on the cell-to-cell variability in gene expression

According to recent experimental evidence, the architecture of a promoter, defined as the number, strength and regulatory role of the operators that control the promoter, plays a major role in determining the level of cell-to-cell variability in gene expression. These quantitative experiments call for a corresponding modeling effort that addresses the question of how changes in promoter architecture affect noise in gene expression in a systematic rather than case-by-case fashion. In this article, we make such a systematic investigation, based on a simple microscopic model of gene regulation that incorporates stochastic effects. In particular, we show how operator strength and operator multiplicity affect this variability. We examine different modes of transcription factor binding to complex promoters (cooperative, independent, simultaneous) and how each of these affects the level of variability in transcription product from cell-to-cell. We propose that direct comparison between in vivo single-cell experiments and theoretical predictions for the moments of the probability distribution of mRNA number per cell can discriminate between different kinetic models of gene regulation.