Source author record

Martin Jullum

Martin Jullum appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher; unclaimed source record

Topics: Machine Learning, Applications, Methodology, econ.EM

Catalog footprint

What is connected

8 works

4 topics

4 close collaborators


Preprint, 2022, arXiv

Exabel's Factor Model

Factor models have become a common and valued tool for understanding the risks associated with an investing strategy. In this report we describe Exabel's factor model, we quantify the fraction of the variability of the returns explained by the different factors, and we show some examples of annual returns of portfolios with different factor exposure.

Preprint, 2022, arXiv

Performance evaluation of volatility estimation methods for Exabel

Quantifying both historic and future volatility is key in portfolio risk management. This note presents and compares strategies for volatility estimation in an estimation universe consisting of 28 629 unique companies from February 2010 to April 2021, with 858 different portfolios. The estimation methods are compared in terms of how they rank the volatility of the different subsets of portfolios. The overall best performing approach estimates volatility from direct entity returns using a GARCH model for variance estimation.

Preprint, 2022, arXiv

Statistical embedding: Beyond principal components

There has been intense recent activity in embedding of very high-dimensional and nonlinear data structures, much of it in the data science and machine learning literature. We survey this activity in four parts. In the first part we cover nonlinear methods such as principal curves, multidimensional scaling, local linear methods, ISOMAP, graph-based methods and diffusion mapping, kernel-based methods and random projections. The second part is concerned with topological embedding methods, in particular mapping topological properties into persistence diagrams and the Mapper algorithm. Another type of data set with tremendous growth is very high-dimensional network data. The task considered in part three is how to embed such data in a vector space of moderate dimension to make the data amenable to traditional techniques such as cluster and classification techniques. Arguably this is the part where the contrast between algorithmic machine learning methods and statistical modeling, the so-called stochastic block modeling, is at its greatest. In the paper, we discuss the pros and cons of the two approaches. The final part of the survey deals with embedding in $\mathbb{R}^2$, i.e. visualization. Three methods are presented: $t$-SNE, UMAP and LargeVis, based on methods in parts one, two and three, respectively. The methods are illustrated and compared on two simulated data sets: one consisting of a triplet of noisy Ranunculoid curves, and one consisting of networks of increasing complexity generated with stochastic block models and with two types of nodes.

Preprint, 2022, arXiv

Using Shapley Values and Variational Autoencoders to Explain Predictive Models with Dependent Mixed Features

Shapley values are today extensively used as a model-agnostic explanation framework to explain complex predictive machine learning models. Shapley values have desirable theoretical properties and a sound mathematical foundation in the field of cooperative game theory. Precise Shapley value estimates for dependent data rely on accurate modeling of the dependencies between all feature combinations. In this paper, we use a variational autoencoder with arbitrary conditioning (VAEAC) to model all feature dependencies simultaneously. We demonstrate through comprehensive simulation studies that our VAEAC approach to Shapley value estimation outperforms the state-of-the-art methods for a wide range of settings for both continuous and mixed dependent features. For high-dimensional settings, our VAEAC approach with a non-uniform masking scheme significantly outperforms competing methods. Finally, we apply our VAEAC approach to estimate Shapley value explanations for the Abalone data set from the UCI Machine Learning Repository.

Preprint, 2021, arXiv

Explaining predictive models using Shapley values and non-parametric vine copulas

The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. If the features are in reality dependent, this may lead to incorrect explanations. Hence, there have recently been attempts at appropriately modelling/estimating the dependence between the features. Although the proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions able to characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.

Preprint, 2020, arXiv

Explaining individual predictions when features are dependent: More accurate approximations to Shapley values

Explaining complex or seemingly simple machine learning models is an important practical problem. We want to explain individual predictions from a complex machine learning model by learning simple, interpretable explanations. Shapley values are a game-theoretic concept that can be used for this purpose. The Shapley value framework has a series of desirable theoretical properties, and can in principle handle any predictive model. Kernel SHAP is a computationally efficient approximation to Shapley values in higher dimensions. Like several other existing methods, this approach assumes that the features are independent, which may give very wrong explanations. This is the case even if a simple linear model is used for predictions. In this paper, we extend the Kernel SHAP method to handle dependent features. We provide several examples of linear and non-linear models with various degrees of feature dependence, where our method gives more accurate approximations to the true Shapley values. We also propose a method for aggregating individual Shapley values, such that the prediction can be explained by groups of dependent variables.
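To make the independence assumption mentioned in this abstract concrete, here is a minimal sketch of exact Shapley values computed by enumerating all feature coalitions. It is not the paper's method and not Kernel SHAP: it uses the marginal (independence-based) value function that the abstract criticizes. The function names, the toy linear model, and the background data are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def marginal_value(predict, x, background, subset):
    """v(S): mean prediction with features in `subset` fixed to x and the
    remaining features taken from background rows. Replacing the missing
    features this way is exactly the independence assumption."""
    total = 0.0
    for row in background:
        z = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += predict(z)
    return total / len(background)


def shapley_values(predict, x, background):
    """Exact Shapley values by brute-force enumeration (cost grows as 2^m)."""
    m = len(x)
    phi = [0.0] * m
    for j in range(m):
        others = [k for k in range(m) if k != j]
        for size in range(m):
            # Shapley weight for a coalition of this size that excludes j.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for combo in combinations(others, size):
                s = set(combo)
                gain = (marginal_value(predict, x, background, s | {j})
                        - marginal_value(predict, x, background, s))
                phi[j] += weight * gain
    return phi


def linear_model(z):
    # Hypothetical toy model, used only for illustration.
    return 1.0 + 2.0 * z[0] - 3.0 * z[1] + 0.5 * z[2]


# Hypothetical background data; column means are 1, 1 and 2.
background = [[0.0, 1.0, 2.0], [1.0, 0.0, 4.0], [2.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
x = [3.0, 1.0, 2.0]
phi = shapley_values(linear_model, x, background)
print(phi)
```

For a linear model under this value function, each Shapley value reduces to the coefficient times the feature's deviation from its background mean, so here the result should be close to [4, 0, 0]. The second and third features get zero credit regardless of how strongly they are correlated with the first, which is the kind of behaviour a dependence-aware value function is meant to correct.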

Preprint, 2020, arXiv

Explaining predictive models with mixed features using Shapley values and conditional inference trees

It is becoming increasingly important to explain complex, black-box machine learning models. Although there is an expanding literature on this topic, Shapley values stand out as a sound method to explain predictions from any type of machine learning model. The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. This methodology was then extended to explain dependent features with an underlying continuous distribution. In this paper, we propose a method to explain mixed (i.e. continuous, discrete, ordinal, and categorical) dependent features by modeling the dependence structure of the features using conditional inference trees. We demonstrate our proposed method against the current industry standards in various simulation studies and find that our method often outperforms the other approaches. Finally, we apply our method to a real financial data set used in the 2018 FICO Explainable Machine Learning Challenge and show how our explanations compare to the FICO challenge Recognition Award winning team.

Preprint, 2019, arXiv

Estimating seal pup production in the Greenland Sea using Bayesian hierarchical modeling

The Greenland Sea is an important breeding ground for harp and hooded seals. Estimates of the annual seal pup production are critical factors in the abundance estimation needed for management of the species. These estimates are usually based on counts from aerial photographic surveys. However, only a minor part of the whelping region can be photographed, due to its large extent. To estimate the total seal pup production, we propose a Bayesian hierarchical modeling approach motivated by viewing the seal pup appearances as a realization of a log-Gaussian Cox process using covariate information from satellite imagery as a proxy for ice thickness. For inference, we utilize the stochastic partial differential equation (SPDE) module of the integrated nested Laplace approximation (INLA) framework. In a case study using survey data from 2012, we compare our results with existing methodology in a comprehensive cross-validation study. The results of the study indicate that our method improves local estimation performance, and that the increased prediction uncertainty of our method is required to obtain calibrated count predictions. This suggests that the sampling density of the survey design may not be sufficient to obtain reliable estimates of the seal pup production.
