Researcher profile

Boris Beranger

Boris Beranger contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified
Verification L1
Unclaimed author
5 works
0 followers
6 topics
4 close collaborators

Published work

5 published items

Preprint, 2020, arXiv

Composite likelihood methods for histogram-valued random variables

Symbolic data analysis has been proposed as a technique for summarising large and complex datasets into a much smaller and tractable number of distributions -- such as random rectangles or histograms -- each describing a portion of the larger dataset. Recent work has developed likelihood-based methods that permit fitting models for the underlying data while only observing the distributional summaries. However, while powerful, when working with random histograms this approach rapidly becomes computationally intractable as the dimension of the underlying data increases. We introduce a composite-likelihood variation of this likelihood-based approach for the analysis of random histograms in $K$ dimensions, through the construction of lower-dimensional marginal histograms. The performance of this approach is examined through simulated and real data analysis of max-stable models for spatial extremes using millions of observed datapoints in more than $K=100$ dimensions. Large computational savings are available compared to existing model fitting approaches.

Preprint, 2020, arXiv

Likelihood-based inference for modelling packet transit from thinned flow summaries

The substantial growth of network traffic speed and volume presents practical challenges to network data analysis. Packet thinning and flow aggregation protocols such as NetFlow reduce the size of datasets by providing structured data summaries, but this in turn impedes statistical inference. Methods that aim to model patterns of traffic propagation typically do not incorporate the packet thinning and summarisation process into the analysis, and are often simplistic (e.g. method-of-moments). As a result, they can be of limited practical use. We introduce a likelihood-based analysis which fully incorporates packet thinning and NetFlow summarisation. As a result, inferences can be made for models on the level of individual packets while only observing thinned flow summary information. We establish consistency of the resulting maximum likelihood estimator, derive bounds on the volume of traffic which should be observed to achieve required levels of estimator accuracy, and identify an ideal family of models. The robust performance of the estimator is examined through simulated analyses and an application on a publicly available trace dataset containing over 36 million packets over a 1-minute period.

Preprint, 2020, arXiv

Logistic regression models for aggregated data

Logistic regression models are a popular and effective method to predict the probability of categorical response data. However, inference for these models can become computationally prohibitive for large datasets. Here we adapt ideas from symbolic data analysis to summarise the collection of predictor variables into histogram form, and perform inference on this summary dataset. We develop ideas based on composite likelihoods to derive an efficient one-versus-rest approximate composite likelihood model for histogram-based random variables, constructed from low-dimensional marginal histograms obtained from the full histogram. We demonstrate that this procedure can achieve classification rates comparable to those of the standard full-data multinomial analysis and of state-of-the-art subsampling algorithms for logistic regression, but at a substantially lower computational cost. Performance is explored through simulated examples, and analyses of large supersymmetry and satellite crop classification datasets.

Preprint, 2020, arXiv

New models for symbolic data analysis

Symbolic data analysis (SDA) is an emerging area of statistics concerned with understanding and modelling data that take distributional form (i.e. symbols), such as random lists, intervals and histograms. It was developed under the premise that the statistical unit of interest is the symbol, and that inference is required at this level. Here we consider a different perspective, which opens a new research direction in the field of SDA. We assume that, as with a standard statistical analysis, inference is required at the level of the individual data points. However, the individual-level data are aggregated into symbols (group-based distributional-valued summaries) prior to the analysis. In this way, large and complex datasets can be reduced to a smaller number of distributional summaries that may be analysed more efficiently than the original dataset. As such, we develop SDA techniques as a new approach for the analysis of big data. In particular, we introduce a new general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying measurement-level data, while only observing the distributional summaries. This approach opens the door for new classes of symbol design and construction, in addition to developing SDA as a viable tool to enable and improve upon classical data analyses, particularly for very large and complex datasets. We illustrate this new direction for SDA research through several real and simulated data analyses.
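
The idea of fitting a model for individual-level data while only observing a histogram can be illustrated with the simplest textbook case: a univariate histogram whose bin counts are treated as multinomial, with bin probabilities given by differences of the model CDF. The sketch below is not the construction from the paper; the Normal model, the function names, the bin edges, and the counts are all illustrative assumptions.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution; handles x = +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def histogram_log_likelihood(edges, counts, mu, sigma):
    """Multinomial log-likelihood of bin counts, up to an additive constant.

    edges  : increasing bin boundaries; assumed to cover the whole real
             line (first edge -inf, last edge +inf)
    counts : number of observations falling in each bin
    """
    total = 0.0
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        if n == 0:
            continue  # empty bins contribute nothing
        p = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        if p <= 0.0:
            return float("-inf")  # model gives no mass to an occupied bin
        total += n * math.log(p)
    return total

# Hypothetical histogram: 1000 points summarised into four bins.
edges = [float("-inf"), -1.0, 0.0, 1.0, float("inf")]
counts = [159, 341, 341, 159]

# Coarse grid search over the mean, with sigma held fixed at 1 (an assumption).
grid = [i / 10 for i in range(-20, 21)]
best_mu = max(grid, key=lambda m: histogram_log_likelihood(edges, counts, m, 1.0))
print(best_mu)
```

Only the bin edges and counts enter the likelihood, so the cost of each evaluation depends on the number of bins rather than on the number of underlying observations.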

Preprint, 2017, arXiv

Exploratory data analysis for moderate extreme values using non-parametric kernel methods

In many settings it is critical to accurately model the extreme tail behaviour of a random process. Non-parametric density estimation methods are commonly implemented as exploratory data analysis techniques for this purpose as they possess excellent visualisation properties, and can naturally avoid the model specification biases implied by using parametric estimators. In particular, kernel-based estimators place minimal assumptions on the data, and provide improved visualisation over scatterplots and histograms. However, kernel density estimators are known to perform poorly when estimating extreme tail behaviour, which is important when interest is in process behaviour above some large threshold, and they can over-emphasise bumps in the density for heavy-tailed data. In this article we develop a transformation kernel density estimator, and demonstrate that its mean integrated squared error (MISE) efficiency is equivalent to that of standard, non-tail focused kernel density estimators. Estimator performance is illustrated in numerical studies, and in an expanded analysis of the ability of well-known global climate models to reproduce observed temperature extremes in Sydney, Australia.
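
For readers unfamiliar with the general idea of a transformation kernel density estimator, here is a minimal generic sketch: transform positive data with a logarithm, fit a Gaussian kernel estimate on the transformed scale, and map it back with the change-of-variables factor 1/x. This is a standard textbook construction and not the estimator developed in the article; the log transform, the rule-of-thumb bandwidth, the function names, and the sample are all assumptions made for illustration.

```python
import math
import statistics

def gaussian_kde(points, h):
    """Return a Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * h * math.sqrt(2.0 * math.pi))

    def density(y):
        return norm * sum(math.exp(-0.5 * ((y - p) / h) ** 2) for p in points)

    return density

def log_transform_kde(data):
    """Density estimate for positive data via a KDE on the log scale."""
    logs = [math.log(x) for x in data]
    # Rule-of-thumb bandwidth on the transformed scale (an assumption).
    h = 1.06 * statistics.stdev(logs) * len(logs) ** (-0.2)
    kde_log = gaussian_kde(logs, h)

    def density(x):
        if x <= 0.0:
            return 0.0  # the estimate has no mass on non-positive values
        # Change of variables: f_X(x) = f_Y(log x) / x
        return kde_log(math.log(x)) / x

    return density

# Hypothetical heavy-tailed sample: Pareto(alpha = 2) quantiles on a regular grid.
n = 200
sample = [(1.0 - (i - 0.5) / n) ** (-0.5) for i in range(1, n + 1)]
density = log_transform_kde(sample)
print(density(1.5), density(10.0))
```

Smoothing on the log scale applies a bandwidth that effectively widens with x on the original scale, which is one reason transformation estimators are used for heavy-tailed data.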