Source author record

Boris Beranger

Boris Beranger appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, Computation, Machine Learning, Applications, math.ST, Statistics Theory

Catalog footprint

What is connected

7 works

6 topics

4 close collaborators


Preprint, 2020, arXiv

Composite likelihood methods for histogram-valued random variables

Symbolic data analysis has been proposed as a technique for summarising large and complex datasets into a much smaller and tractable number of distributions (such as random rectangles or histograms), each describing a portion of the larger dataset. Recent work has developed likelihood-based methods that permit fitting models for the underlying data while only observing the distributional summaries. However, while powerful, this approach rapidly becomes computationally intractable for random histograms as the dimension of the underlying data increases. We introduce a composite-likelihood variation of this likelihood-based approach for the analysis of random histograms in $K$ dimensions, through the construction of lower-dimensional marginal histograms. The performance of this approach is examined through simulated and real data analyses of max-stable models for spatial extremes using millions of observed datapoints in more than $K=100$ dimensions. Large computational savings are available compared to existing model fitting approaches.
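To make the idea of lower-dimensional marginal histograms concrete, here is a minimal sketch that bins $K$-dimensional data into all bivariate marginal histograms. It is an illustration only, not the authors' implementation: the choice of pairs (rather than some other marginal dimension), the equal-width bins on a shared range, and the names `bin_index` and `pairwise_marginal_histograms` are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
import random

def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to a bin index 0..n_bins-1 (hi goes in the last bin)."""
    if not lo <= value <= hi:
        raise ValueError("value outside histogram range")
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)

def pairwise_marginal_histograms(data, lo, hi, n_bins):
    """Return {(i, j): Counter{(bin_i, bin_j): count}} for every pair i < j."""
    K = len(data[0])
    hists = {pair: Counter() for pair in combinations(range(K), 2)}
    for row in data:
        idx = [bin_index(x, lo, hi, n_bins) for x in row]
        for i, j in hists:
            hists[(i, j)][(idx[i], idx[j])] += 1
    return hists

# Hypothetical toy data: N points in K dimensions, B bins per axis.
random.seed(0)
K, N, B = 5, 1000, 4
data = [[random.random() for _ in range(K)] for _ in range(N)]
hists = pairwise_marginal_histograms(data, 0.0, 1.0, B)

# Cell counts: one full K-dimensional histogram versus all bivariate marginals.
full_cells = B ** K
marginal_cells = len(hists) * B * B
```

The full histogram needs `B ** K` cells, which grows exponentially in `K`, while the set of bivariate marginals needs only `K * (K - 1) / 2 * B ** 2` cells. A composite likelihood would then combine one likelihood term per marginal histogram.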

Preprint, 2020, arXiv

Likelihood-based inference for modelling packet transit from thinned flow summaries

The substantial growth of network traffic speed and volume presents practical challenges to network data analysis. Packet thinning and flow aggregation protocols such as NetFlow reduce the size of datasets by providing structured data summaries, but this in turn impedes statistical inference. Methods which aim to model patterns of traffic propagation typically do not incorporate the packet thinning and summarisation process into the analysis, and are often simplistic, e.g. method-of-moments. As a result, they can be of limited practical use. We introduce a likelihood-based analysis which fully incorporates packet thinning and NetFlow summarisation. As a result, inferences can be made for models on the level of individual packets while only observing thinned flow summary information. We establish consistency of the resulting maximum likelihood estimator, derive bounds on the volume of traffic which should be observed to achieve required levels of estimator accuracy, and identify an ideal family of models. The robust performance of the estimator is examined through simulated analyses and an application to a publicly available trace dataset containing over 36 million packets over a 1-minute period.

Preprint, 2020, arXiv

Logistic regression models for aggregated data

Logistic regression models are a popular and effective method to predict the probability of categorical response data. However, inference for these models can become computationally prohibitive for large datasets. Here we adapt ideas from symbolic data analysis to summarise the collection of predictor variables into histogram form, and perform inference on this summary dataset. We develop ideas based on composite likelihoods to derive an efficient one-versus-rest approximate composite likelihood model for histogram-based random variables, constructed from low-dimensional marginal histograms obtained from the full histogram. We demonstrate that this procedure can achieve classification rates comparable to those of the standard full data multinomial analysis and of state-of-the-art subsampling algorithms for logistic regression, but at a substantially lower computational cost. Performance is explored through simulated examples, and analyses of large supersymmetry and satellite crop classification datasets.

Preprint, 2020, arXiv

New models for symbolic data analysis

Symbolic data analysis (SDA) is an emerging area of statistics concerned with understanding and modelling data that takes distributional form (i.e. symbols), such as random lists, intervals and histograms. It was developed under the premise that the statistical unit of interest is the symbol, and that inference is required at this level. Here we consider a different perspective, which opens a new research direction in the field of SDA. We assume that, as with a standard statistical analysis, inference is required at the level of individual-level data. However, the individual-level data are aggregated into symbols (group-based distributional-valued summaries) prior to the analysis. In this way, large and complex datasets can be reduced to a smaller number of distributional summaries that may be analysed more efficiently than the original dataset. As such, we develop SDA techniques as a new approach for the analysis of big data. In particular, we introduce a new general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying measurement-level data, while only observing the distributional summaries. This approach opens the door for new classes of symbol design and construction, in addition to developing SDA as a viable tool to enable and improve upon classical data analyses, particularly for very large and complex datasets. We illustrate this new direction for SDA research through several real and simulated data analyses.
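As a rough illustration of fitting a model for the underlying data while only observing a histogram summary, the sketch below uses the classical grouped-data (multinomial) likelihood for a univariate histogram with fixed bins: each bin contributes its count times the log of the model probability of that bin. This is a simple stand-in chosen for the example and is not claimed to be the paper's general construction. The normal model, the grid search, the bin edges, and the names `normal_cdf` and `histogram_loglik` are assumptions.

```python
import bisect
import math
import random

def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma^2) distribution; accepts +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def histogram_loglik(edges, counts, mu, sigma):
    """Grouped-data log-likelihood: sum over bins of n_b * log P(bin b | mu, sigma)."""
    total = 0.0
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        p = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        total += n * math.log(max(p, 1e-300))  # clamp to avoid log(0)
    return total

# Simulate individual-level data, then keep only the histogram summary.
random.seed(1)
raw = [random.gauss(2.0, 1.5) for _ in range(20000)]
edges = [-math.inf] + [0.5 * k for k in range(-6, 21)] + [math.inf]
counts = [0] * (len(edges) - 1)
for x in raw:
    counts[bisect.bisect_right(edges, x) - 1] += 1

# Crude grid search over (mu, sigma) using the bin counts only.
grid = ((1.0 + 0.05 * i, 1.0 + 0.05 * j) for i in range(41) for j in range(21))
mu_hat, sigma_hat = max(grid, key=lambda p: histogram_loglik(edges, counts, *p))
```

After the binning step the raw observations are never used again: the estimates depend only on `edges` and `counts`, which is why the cost of each likelihood evaluation scales with the number of bins rather than the number of observations.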

Preprint, 2017, arXiv

Exploratory data analysis for moderate extreme values using non-parametric kernel methods

In many settings it is critical to accurately model the extreme tail behaviour of a random process. Non-parametric density estimation methods are commonly implemented as exploratory data analysis techniques for this purpose as they possess excellent visualisation properties, and can naturally avoid the model specification biases implied by using parametric estimators. In particular, kernel-based estimators place minimal assumptions on the data, and provide improved visualisation over scatterplots and histograms. However, kernel density estimators are known to perform poorly when estimating extreme tail behaviour, which is important when interest is in process behaviour above some large threshold, and they can over-emphasise bumps in the density for heavy-tailed data. In this article we develop a transformation kernel density estimator, and demonstrate that its mean integrated squared error (MISE) efficiency is equivalent to that of standard, non-tail focused kernel density estimators. Estimator performance is illustrated in numerical studies, and in an expanded analysis of the ability of well-known global climate models to reproduce observed temperature extremes in Sydney, Australia.
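The general idea behind a transformation kernel density estimator can be sketched as follows: transform the data, apply a standard kernel estimator on the transformed scale, and map the density back with the change-of-variables formula. The abstract does not state which transformation the authors use, so the log transform below, the Gaussian kernel, the normal-reference bandwidth rule, and the names `gaussian_kde` and `log_transform_kde` are assumptions made purely for illustration.

```python
import math
import random
import statistics

def gaussian_kde(sample, h):
    """Return a standard Gaussian kernel density estimate with bandwidth h."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    def density(y):
        return norm * sum(math.exp(-0.5 * ((y - s) / h) ** 2) for s in sample)
    return density

def log_transform_kde(sample):
    """KDE for positive data: estimate on the log scale, then back-transform."""
    logs = [math.log(x) for x in sample]
    # Normal-reference rule of thumb for the bandwidth (an assumption here).
    h = 1.06 * statistics.stdev(logs) * len(logs) ** (-0.2)
    f_y = gaussian_kde(logs, h)
    def density(x):
        # Change of variables: f_X(x) = f_Y(log x) / x for x > 0.
        return f_y(math.log(x)) / x if x > 0 else 0.0
    return density

# Hypothetical heavy-tailed sample (Pareto with shape 3).
random.seed(2)
sample = [random.paretovariate(3.0) for _ in range(500)]
f_hat = log_transform_kde(sample)

# Check that the back-transformed estimate integrates to about 1,
# using a Riemann sum on the log scale (dx = exp(y) dy).
dy = 0.02
total_mass = sum(
    f_hat(math.exp(-2.0 + k * dy)) * math.exp(-2.0 + k * dy) * dy
    for k in range(500)
)
```

Because the smoothing happens on the log scale, the effective kernel width on the original scale grows with `x`, which spreads mass more smoothly over sparse tail observations than a fixed-bandwidth estimator would.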

Preprint, 2016, arXiv

Models for extremal dependence derived from skew-symmetric families

Skew-symmetric families of distributions such as the skew-normal and skew-$t$ represent supersets of the normal and $t$ distributions, and they exhibit richer classes of extremal behaviour. By defining a non-stationary skew-normal process, which allows the easy handling of positive definite, non-stationary covariance functions, we derive a new family of max-stable processes: the extremal-skew-$t$ process. This process is a superset of non-stationary processes that include the stationary extremal-$t$ processes. We provide the spectral representation and the resulting angular densities of the extremal-skew-$t$ process, and illustrate its practical implementation (Includes Supporting Information).

Preprint, 2015, arXiv

Extreme Dependence Models

Extreme values of real phenomena are events that occur with low frequency, but can have a large impact on real life. These are, in many practical problems, high-dimensional by nature (e.g. Tawn, 1990; Coles and Tawn, 1991). Studying these events is of fundamental importance. For this purpose, probabilistic models and statistical methods are in high demand. There are several approaches to modelling multivariate extremes, as described in Falk et al. (2011), and they are linked to some extent. We describe an approach for deriving multivariate extreme value models and we illustrate the main features of some flexible extremal dependence models. We compare them by showing their utility with a real data application, in particular analyzing the extremal dependence among several pollutants recorded in the city of Leeds, UK.
