Source author record

Jonas Wallin

Jonas Wallin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Applications · Computation · Methodology · math.ST · Statistics Theory · Machine Learning · math.PR · physics.soc-ph · Populations and Evolution · Quantitative Methods

Catalog footprint

What is connected

12 works

10 topics

4 close collaborators


preprint · 2026 · arXiv

Scalable Ultra-High-Dimensional Quantile Regression with Genomic Applications

Modern datasets arising from social media, genomics, and biomedical informatics are often heterogeneous and (ultra) high-dimensional, creating substantial challenges for conventional modeling techniques. Quantile regression (QR) not only offers a flexible way to capture heterogeneous effects across the conditional distribution of an outcome, but also naturally produces prediction intervals that help quantify uncertainty in future predictions. However, classical QR methods can face serious memory and computational constraints in large-scale settings. These limitations motivate the use of parallel computing to maintain tractability. While extensive work has examined sample-splitting strategies in settings where the number of observations $n$ greatly exceeds the number of features $p$, the equally important (ultra) high-dimensional regime ($p \gg n$) has been comparatively underexplored. To address this gap, we introduce a feature-splitting proximal point algorithm, FS-QRPPA, for penalized QR in the high-dimensional regime. Leveraging recent developments in variational analysis, we establish a Q-linear convergence rate for FS-QRPPA and demonstrate its superior scalability in large-scale genomic applications from the UK Biobank relative to existing methods. Moreover, FS-QRPPA yields more accurate coefficient estimates and better coverage for prediction intervals than current approaches. We provide a parallel implementation in the R package fsQRPPA, making penalized QR tractable on large-scale datasets.
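As background for the abstract above, the following is a minimal sketch of the check (pinball) loss that quantile regression minimizes. It is not the FS-QRPPA algorithm and does not use the fsQRPPA package. The function names and the toy data are hypothetical, chosen only to show that minimizing the loss over a constant recovers a sample quantile.

```python
import numpy as np

def pinball_loss(residual, tau):
    """Check loss rho_tau(r): tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    residual = np.asarray(residual, dtype=float)
    return residual * np.where(residual < 0, tau - 1.0, tau)

def quantile_by_loss(y, tau):
    """Return the sample point that minimizes the average check loss."""
    y = np.asarray(y, dtype=float)
    losses = [pinball_loss(y - c, tau).mean() for c in y]
    return y[int(np.argmin(losses))]

# Toy data with one extreme value (an assumption for illustration only).
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
median_fit = quantile_by_loss(y, 0.5)   # central level
upper_fit = quantile_by_loss(y, 0.9)    # upper level
```

Fitting two levels, such as 0.05 and 0.95, is what yields the prediction intervals the abstract mentions. In a regression setting the constant is replaced by a linear predictor and a penalty is added.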

preprint · 2022 · arXiv

Local scale invariance and robustness of proper scoring rules

Averages of proper scoring rules are often used to rank probabilistic forecasts. In many cases, the individual terms in these averages are based on observations and forecasts from different distributions. We show that some of the most popular proper scoring rules, such as the continuous ranked probability score (CRPS), give more importance to observations with large uncertainty, which can lead to unintuitive rankings. To describe this issue, we define the concept of local scale invariance for scoring rules. A new class of generalized proper kernel scoring rules is derived, and as a member of this class we propose the scaled CRPS (SCRPS). This new proper scoring rule is locally scale invariant and therefore works in the case of varying uncertainty. Like CRPS, it can be computed from ensemble forecast output and does not require the ability to evaluate densities of forecasts. We further define robustness of scoring rules, show why this is also an important concept for average scores, and derive new proper scoring rules that are robust against outliers. The theoretical findings are illustrated in three different applications from spatial statistics, stochastic volatility models, and regression for count data.
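The scale dependence described above can be seen with the standard ensemble estimate of the CRPS, $E|X - y| - \tfrac{1}{2}E|X - X'|$. The sketch below covers only the ordinary CRPS, not the SCRPS or the robust rules proposed in the paper. The function name and the three-member ensemble are hypothetical.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Ensemble CRPS estimate E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    x = np.asarray(members, dtype=float)
    term_obs = np.abs(x - obs).mean()
    term_spread = np.abs(x[:, None] - x[None, :]).mean()
    return term_obs - 0.5 * term_spread

# The same forecast situation at two scales (toy values, assumed).
members = np.array([-1.0, 0.0, 1.0])
score_small = crps_ensemble(members, 0.5)
score_large = crps_ensemble(10.0 * members, 5.0)  # everything multiplied by 10
```

The second score is ten times the first even though the forecast is equally good relative to its own spread, so in an average the high-uncertainty case dominates.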

preprint · 2020 · arXiv

Generalized bounds for active subspaces

In this article, we consider scenarios in which traditional estimates for the active subspace method based on probabilistic Poincaré inequalities are not valid due to unbounded Poincaré constants. Consequently, we propose a framework for deriving generalized estimates that allows one to control the trade-off between the size of the Poincaré constant and a weaker order of the final error bound. In particular, we investigate independent, exponentially distributed random variables in dimension two or larger and give explicit expressions for the corresponding Poincaré constants, showing their dependence on the dimension of the problem. Finally, we suggest directions for future work aimed at extending the class of distributions to which the active subspace method applies, which we regard as an opportunity to broaden its usability.

preprint · 2020 · arXiv

Nowcasting Covid-19 statistics reported with delay: a case-study of Sweden

The new coronavirus disease, COVID-2019, is rapidly spreading through the world. The availability of unbiased, timely statistics on trends in disease events is key to effective responses. But due to reporting delays, the most recently reported numbers frequently underestimate the total number of infections, hospitalizations and deaths, creating an illusion of a downward trend. Here we describe a statistical methodology for predicting true daily quantities and their uncertainty, estimated using historical reporting delays. The methodology takes into account the observed distribution pattern of the lag. It is derived from the removal method, a well-established estimation framework in the field of ecology.
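To illustrate the reporting-delay problem described above, here is a minimal sketch that scales each day's partial count by the fraction of events expected to have been reported so far. It is a plain inverse-probability correction under an assumed, known delay distribution. It is not the removal-method model of the paper and gives no uncertainty estimate. All names and numbers are hypothetical.

```python
import numpy as np

def nowcast_totals(reported_so_far, days_since_event, delay_pmf):
    """Divide partial counts by the cumulative probability of being reported by now."""
    cdf = np.cumsum(delay_pmf)
    # Days older than the longest delay are treated as fully reported.
    idx = np.minimum(days_since_event, len(cdf) - 1)
    return np.asarray(reported_so_far, dtype=float) / cdf[idx]

# Assumed delay distribution: a quarter of events is reported on each of days 0..3.
delay_pmf = np.array([0.25, 0.25, 0.25, 0.25])
reported = np.array([40, 30, 20, 10])   # counts known today, oldest day first
days_since = np.array([3, 2, 1, 0])     # age of each day in days
estimates = nowcast_totals(reported, days_since, delay_pmf)
```

In this toy case the raw counts fall from 40 to 10, while the corrected totals are flat, which is the illusory downward trend the abstract warns about.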

preprint · 2019 · arXiv

Multivariate type G Matérn stochastic partial differential equation random fields

For many applications with multivariate data, random field models capturing departures from Gaussianity within realisations are appropriate. For this reason, we formulate a new class of multivariate non-Gaussian models based on systems of stochastic partial differential equations with additive type G noise whose marginal covariance functions are of Matérn type. We consider four increasingly flexible constructions of the noise, where the first two are similar to existing copula-based models. In contrast to these, the latter two constructions can model non-Gaussian spatial data without replicates. Computationally efficient methods for likelihood-based parameter estimation and probabilistic prediction are proposed, and the flexibility of the suggested models is illustrated by numerical examples and two statistical applications.

preprint · 2016 · arXiv

Efficient adaptive MCMC through precision estimation

A novel adaptive Markov chain Monte Carlo algorithm is presented. The algorithm utilizes sparsity in the partial correlation structure of a density to efficiently estimate the covariance matrix through the Cholesky factor of the precision matrix. The algorithm also utilizes the sparsity to sample efficiently from both MALA and Metropolis-Hastings random walk proposals. Further, an algorithm that estimates the partial correlation structure of a density is proposed. Combining this with the Cholesky factor estimation algorithm results in an efficient black-box AMCMC method that can be used for general densities with unknown dependency structure. The method is compared with regular empirical covariance adaptation for two examples. In both examples, the proposed method's covariance estimates converge faster to the true covariance matrix and the computational cost for each iteration is lower.
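The role of the precision matrix's Cholesky factor in drawing proposals can be sketched as follows. This shows only the generic identity that if $Q = LL^\top$ and $L^\top x = z$ with standard normal $z$, then $x$ has covariance $Q^{-1}$. It is not the adaptive estimation scheme of the paper. The function name and the tridiagonal example are hypothetical, and the dense solver stands in for the sparse triangular solve one would use in practice.

```python
import numpy as np

def sample_from_precision(Q, n_samples, rng):
    """Draw N(0, Q^{-1}) samples using the Cholesky factor of the precision Q."""
    L = np.linalg.cholesky(Q)  # Q = L @ L.T with L lower triangular
    z = rng.standard_normal((Q.shape[0], n_samples))
    # Solving L.T x = z gives Cov(x) = L^{-T} L^{-1} = Q^{-1}.
    return np.linalg.solve(L.T, z).T

# Tridiagonal precision: each variable is conditionally dependent only on its neighbours.
Q = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
rng = np.random.default_rng(0)
samples = sample_from_precision(Q, 20000, rng)
```

For a banded precision matrix such as this one, the Cholesky factor is banded as well, which is what keeps both factorization and sampling cheap when the partial correlation structure is sparse.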

preprint · 2016 · arXiv

Estimating the unobservable moose - converting index to population size using a Bayesian Hierarchical state space model

Indirect information on population size, like pellet counts or volunteer counts, is the main source of information in most ecological studies and applied population management situations. Often, such observations are treated as if they were actual measurements of population size. This assumption results in incorrect conclusions about a population's size and its dynamics. We propose a model with a temporally varying link, denoted countability, between indirect observations and actual population size. We show that, when the indirect measurement has high precision (for instance, many observation hours), the assumption of temporally varying countability can have a crucial effect on the estimated population dynamics. We apply the model to two local moose populations in Sweden. The estimated population dynamics are found to explain 30-50 percent of the total variability in the observation data; thus, countability accounts for most of the variation. This unreliability of the estimated dynamics has a substantial negative impact on the ability to manage populations; for example, reducing (increasing) the number of animals that need to be harvested in order to sustain the population above (below) a fixed level. Finally, the large difference in countability between the two study areas implies substantial spatial variation in countability; this variation in itself is highly worthy of study.

preprint · 2016 · arXiv

Online estimation of driving events and fatigue damage on vehicles

Driving events, such as maneuvers at slow speed and turns, are important for durability assessments of vehicle components. By counting the number of driving events, one can estimate the fatigue damage caused by the same kind of events. Through knowledge of the distribution of driving events for a group of customers, vehicle producers can tailor the design of vehicles to that group. In this article, we propose an algorithm that can be applied on board a vehicle to estimate online the expected number of driving events occurring, and thus be used to estimate the distribution of driving events for a certain group of customers. Since the driving events are not observed directly, the algorithm uses a hidden Markov model (HMM) to extract the events. The parameters of the HMM are estimated using an online EM algorithm. The introduction of the online EM is crucial for practical on-board use because the complexity of each iteration is fixed. Typically, the EM algorithm is used to find the fixed parameters that maximize the likelihood. By introducing a fixed forgetting factor in the online EM, an adaptive algorithm is obtained. This is important in practice since the driving conditions change over time and a single trip can contain different road types such as city and highway, making the assumption of fixed parameters unrealistic. Finally, we also derive a method to compute the expected damage online.

preprint · 2016 · arXiv

Spatially adaptive covariance tapering

Covariance tapering is a popular approach for reducing the computational cost of spatial prediction and parameter estimation for Gaussian process models. However, tapering can have poor performance when the process is sampled at spatially irregular locations or when non-stationary covariance models are used. This work introduces an adaptive tapering method in order to improve the performance of tapering in these problematic cases. This is achieved by introducing a computationally convenient class of compactly supported non-stationary covariance functions, combined with a new method for choosing spatially varying taper ranges. Numerical experiments are used to show that the performance of both kriging prediction and parameter estimation can be improved by allowing for spatially varying taper ranges. However, although adaptive tapering outperforms regular tapering, simply dividing the data into blocks and ignoring the dependence between the blocks is often a better method for parameter estimation.
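For readers unfamiliar with tapering, the sketch below shows the standard fixed-range version: a covariance function is multiplied elementwise by a compactly supported correlation function, so that entries beyond the taper range become exactly zero. It does not implement the spatially varying taper ranges introduced in this work. The exponential covariance, the spherical taper, the grid of locations and the range value are all assumptions chosen for illustration.

```python
import numpy as np

def exponential_cov(h, sigma2=1.0, range_par=1.0):
    """Exponential covariance as a function of distance h."""
    return sigma2 * np.exp(-h / range_par)

def spherical_taper(h, taper_range):
    """Spherical correlation function: exactly zero for h >= taper_range."""
    r = np.clip(h / taper_range, 0.0, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

# Eleven equally spaced locations on a line (toy configuration).
locs = np.linspace(0.0, 10.0, 11)
h = np.abs(locs[:, None] - locs[None, :])

# The elementwise product of two valid covariance functions is again valid.
C_tapered = exponential_cov(h) * spherical_taper(h, taper_range=3.0)
n_zero = int(np.sum(C_tapered == 0.0))
```

With a single taper range, how many neighbours fall inside the range depends on the local sampling density. For irregularly spaced data this is the weakness that motivates letting the range vary in space.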

preprint · 2016 · arXiv

Whole-brain substitute CT generation using Markov random field mixture models

Computed tomography (CT) equivalent information is needed for attenuation correction in PET imaging and for dose planning in radiotherapy. Prior work has shown that Gaussian mixture models can be used to generate a substitute CT (s-CT) image from a specific set of MRI modalities. This work introduces a more flexible class of mixture models for s-CT generation that incorporates spatial dependency in the data through a Markov random field prior on the latent field of class memberships associated with a mixture model. Furthermore, the mixture distributions are extended from Gaussian to normal inverse Gaussian (NIG), allowing heavier tails and skewness. The amount of data needed to train a model for s-CT generation is of the order of 100 million voxels. The computational efficiency of the parameter estimation and prediction methods is hence paramount, especially when spatial dependency is included in the models. A stochastic Expectation Maximization (EM) gradient algorithm is proposed in order to tackle this challenge. The advantages of the spatial model and NIG distributions are evaluated with a cross-validation study based on data from 14 patients. The study shows that the proposed model enhances the predictive quality of the s-CT images by reducing the mean absolute error by 17.9%. Also, the distribution of CT values conditioned on the MR images is better explained by the proposed model, as evaluated using continuous ranked probability scores.

preprint · 2015 · arXiv

Latent modeling of flow cytometry cell populations

Flow cytometry is a widespread single-cell measurement technology with a multitude of clinical and research applications. Interpretation of flow cytometry data is hard; the instrumentation is delicate and cannot render absolute measurements, hence samples can only be interpreted in relation to each other, while at the same time comparisons are confounded by inter-sample variation. Despite this, current automated flow cytometry data analysis methods either treat samples individually or ignore the variation by, for example, pooling the data. In this article we introduce a Bayesian hierarchical model for studying latent relations between cell populations in flow cytometry samples, thereby systematizing inter-sample variation. The model is applied to a data set containing replicated flow cytometry measurements of samples from healthy individuals, with informative priors capturing expert knowledge. It is shown that the technical variation in the inferred cell population sizes is small in comparison to the intrinsic biological variation. The large size of flow cytometry data, where a single sample can contain measurements on hundreds of thousands of cells, necessitates computationally efficient methods. To address this, we have implemented a parallel Markov chain Monte Carlo scheme for sampling the posterior distribution.

preprint · 2013 · arXiv

Non-Gaussian Matérn fields with an application to precipitation modeling

The recently proposed non-Gaussian Matérn random field models, generated through stochastic partial differential equations (SPDEs), are extended by considering the class of generalized hyperbolic processes as noise forcings. The models are also extended to the standard geostatistical setting where irregularly spaced observations are modeled using measurement errors and covariates. A maximum likelihood estimation technique based on the Monte Carlo Expectation Maximization (MCEM) algorithm is presented, and it is shown how the model can be used to make predictions at unobserved locations. Finally, an application to precipitation data over the United States for two months in 1997 is presented, and the performance of the non-Gaussian models is compared with standard Gaussian and transformed Gaussian models through cross-validation.
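As background for the SPDE construction mentioned above, the standard formulation of a Matérn field can be written as below. The notation ($\kappa$, $\alpha$, $\nu$, $\sigma^2$, $\dot{M}$) is assumed here for illustration and may differ from the paper's. With Gaussian white noise as $\dot{M}$ the solution is a Gaussian Matérn field, and the non-Gaussian models replace that forcing with non-Gaussian noise.

```latex
% SPDE whose stationary solution X has Matérn covariance (notation assumed)
(\kappa^2 - \Delta)^{\alpha/2} X(s) = \dot{M}(s), \qquad s \in \mathbb{R}^d,
\qquad \alpha = \nu + d/2,

% Matérn covariance with smoothness \nu, scale \kappa and variance \sigma^2
C(h) = \frac{\sigma^2}{2^{\nu-1}\,\Gamma(\nu)}\,
       (\kappa \lVert h \rVert)^{\nu}\, K_{\nu}(\kappa \lVert h \rVert).
```

Here $\Delta$ is the Laplacian and $K_\nu$ is the modified Bessel function of the second kind.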
