Source author record

Johannes Schmidt-Hieber

Johannes Schmidt-Hieber appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.ST Statistics Theory Methodology Machine Learning Applications Genomics Neural and Evolutionary Computing

Catalog footprint

What is connected

20works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, theoretically capable of leveraging depth, and performs reliable when trained at scale compared to ICNNs. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Throughout a series of synthetic experiments, we demonstrate that HyCNNs outperform existing ICNNs and MLPs in terms of predictive performance for convex regression and interpolation tasks. We further apply HyCNNs to learn high-dimensional optimal transport maps for synthetic examples and for single-cell RNA sequencing data, where they oftentimes outperform ICNN-based neural optimal transport methods and other baselines across a wide range of settings.

preprint2026arXiv

Spike-timing-dependent Hebbian learning as noisy gradient descent

Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.

preprint2023arXiv

Local convergence rates of the nonparametric least squares estimator with applications to transfer learning

Convergence properties of empirical risk minimizers can be conveniently expressed in terms of the associated population risk. To derive bounds for the performance of the estimator under covariate shift, however, pointwise convergence rates are required. Under weak assumptions on the design distribution, it is shown that least squares estimators (LSE) over 1-Lipschitz functions are also minimax rate optimal with respect to a weighted uniform norm, where the weighting accounts in a natural way for the non-uniformity of the design distribution. This implies that although least squares is a global criterion, the LSE adapts locally to the size of the design density. We develop a new indirect proof technique that establishes the local convergence behavior based on a carefully chosen local perturbation of the LSE. The obtained local rates are then applied to analyze the LSE for transfer learning under covariate shift.

preprint2022arXiv

Posterior contraction for deep Gaussian process priors

We study posterior contraction rates for a class of deep Gaussian process priors applied to the nonparametric regression problem under a general composition assumption on the regression function. It is shown that the contraction rates can achieve the minimax convergence rate (up to $\log n$ factors), while being adaptive to the underlying structure and smoothness of the target function. The proposed framework extends the Bayesian nonparametric theory for Gaussian process priors.

preprint2021arXiv

The Kolmogorov-Arnold representation theorem revisited

There is a longstanding debate whether the Kolmogorov-Arnold representation theorem can explain the use of more than one hidden layer in neural networks. The Kolmogorov-Arnold representation decomposes a multivariate function into an interior and an outer function and therefore has indeed a similar structure as a neural network with two hidden layers. But there are distinctive differences. One of the main obstacles is that the outer function depends on the represented function and can be wildly varying even if the represented function is smooth. We derive modifications of the Kolmogorov-Arnold representation that transfer smoothness properties of the represented function to the outer function and can be well approximated by ReLU networks. It appears that instead of two hidden layers, a more natural interpretation of the Kolmogorov-Arnold representation is that of a deep neural network where most of the layers are required to approximate the interior function.

preprint2020arXiv

Nonparametric regression using deep neural networks with ReLU activation function

Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to $\log n$-factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential network parameters exceeding the sample size. The analysis gives some insights into why multilayer feedforward neural networks perform well in practice. Interestingly, for ReLU activation function the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that for nonparametric regression, scaling the network depth with the sample size is natural. It is also shown that under the composition assumption wavelet estimators can only achieve suboptimal rates.

preprint2020arXiv

On frequentist coverage of Bayesian credible sets for estimation of the mean under constraints

Frequentist coverage of $(1-α)$-highest posterior density (HPD) credible sets is studied in a signal plus noise model under a large class of noise distributions. We consider a specific class of spike-and-slab prior distributions. Different regimes are identified and we derive closed form expressions for the $(1-α)$-HPD on each of these regimes. Similar to the earlier work by Marchand and Strawderman, it is shown that under suitable conditions, the frequentist coverage can drop to $1-3α/2.$

preprint2020arXiv

Posterior contraction rates for support boundary recovery

Given a sample of a Poisson point process with intensity $λ_f(x,y) = n \mathbf{1}(f(x) \leq y),$ we study recovery of the boundary function $f$ from a nonparametric Bayes perspective. Because of the irregularity of this model, the analysis is non-standard. We establish a general result for the posterior contraction rate with respect to the $L^1$-norm based on entropy and one-sided small probability bounds. From this, specific posterior contraction results are derived for Gaussian process priors and priors based on random wavelet series.

preprint2016arXiv

Minimax theory for a class of non-linear statistical inverse problems

We study a class of statistical inverse problems with non-linear pointwise operators motivated by concrete statistical applications. A two-step procedure is proposed, where the first step smoothes the data and inverts the non-linearity. This reduces the initial non-linear problem to a linear inverse problem with deterministic noise, which is then solved in a second step. The noise reduction step is based on wavelet thresholding and is shown to be minimax optimal (up to logarithmic factors) in a pointwise function-dependent sense. Our analysis is based on a modified notion of Hölder smoothness scales that are natural in this setting.

preprint2015arXiv

Bayesian linear regression with sparse priors

We study full Bayesian procedures for high-dimensional linear regression under sparsity constraints. The prior is a mixture of point masses at zero and continuous distributions. Under compatibility conditions on the design matrix, the posterior distribution is shown to contract at the optimal rate for recovery of the unknown sparse vector, and to give optimal prediction of the response vector. It is also shown to select the correct sparse model, or at least the coefficients that are significantly different from zero. The asymptotic shape of the posterior distribution is characterized and employed to the construction and study of credible sets for uncertainty quantification.

preprint2015arXiv

Conditions for Posterior Contraction in the Sparse Normal Means Problem

The first Bayesian results for the sparse normal means problem were proven for spike-and-slab priors. However, these priors are less convenient from a computational point of view. In the meanwhile, a large number of continuous shrinkage priors has been proposed. Many of these shrinkage priors can be written as a scale mixture of normals, which makes them particularly easy to implement. We propose general conditions on the prior on the local variance in scale mixtures of normals, such that posterior contraction at the minimax rate is assured. The conditions require tails at least as heavy as Laplace, but not too heavy, and a large amount of mass around zero relative to the tails, more so as the sparsity increases. These conditions give some general guidelines for choosing a shrinkage prior for estimation under a nearly black sparsity assumption. We verify these conditions for the class of priors considered by Ghosh and Chakrabarti (2015), which includes the horseshoe and the normal-exponential gamma priors, and for the horseshoe+, the inverse-Gaussian prior, the normal-gamma prior, and the spike-and-slab Lasso, and thus extend the number of shrinkage priors which are known to lead to posterior contraction at the minimax estimation rate.

preprint2015arXiv

On adaptive posterior concentration rates

We investigate the problem of deriving posterior concentration rates under different loss functions in nonparametric Bayes. We first provide a lower bound on posterior coverages of shrinking neighbourhoods that relates the metric or loss under which the shrinking neighbourhood is considered, and an intrinsic pre-metric linked to frequentist separation rates. In the Gaussian white noise model, we construct feasible priors based on a spike and slab procedure reminiscent of wavelet thresholding that achieve adaptive rates of contraction under $L^2$ or $L^{\infty}$ metrics when the underlying parameter belongs to a collection of Hölder balls and that moreover achieve our lower bound. We analyse the consequences in terms of asymptotic behaviour of posterior credible balls as well as frequentist minimax adaptive estimation. Our results are appended with an upper bound for the contraction rate under an arbitrary loss in a generic regular experiment. The upper bound is attained for certain sieve priors and enables to extend our results to density estimation.

preprint2015arXiv

On an estimator achieving the adaptive rate in nonparametric regression under $L^p$-loss for all $1\leq p \leq \infty$

Consider nonparametric function estimation under $L^p$-loss. The minimax rate for estimation of the regression function over a Hölder ball with smoothness index $β$ is $n^{-β/(2β+1)}$ if $1\leq p<\infty$ and $(n/\log n)^{-β/(2β+1)}$ if $p=\infty.$ There are many known procedures that either attain this rate for $p=\infty$ but are suboptimal by a $\log n$ factor in the case $p<\infty$ or the other way around. In this article, we construct an estimator that simultaneously achieves the optimal rates under $L^p$-risk for all $1\leq p\leq \infty$ without prior knowledge of $β.$ In contrast to classical wavelet thresholding methods that kill small empirical wavelet coefficients and keep large ones, it is essential for simultaneous adaptation that on each resolution level, the largest empirical wavelet coefficients are truncated. This leads to a completely different point of view on wavelet thresholding. The crucial part in the construction of the estimator is the size of the truncation level which is linked to the unknown smoothness index. Although estimation of the smoothness index is known to be a difficult task, there is a data-driven choice of the truncation level that is sufficiently precise for our purpose.

preprint2014arXiv

Asymptotic equivalence for regression under fractional noise

Consider estimation of the regression function based on a model with equidistant design and measurement errors generated from a fractional Gaussian noise process. In previous literature, this model has been heuristically linked to an experiment, where the anti-derivative of the regression function is continuously observed under additive perturbation by a fractional Brownian motion. Based on a reformulation of the problem using reproducing kernel Hilbert spaces, we derive abstract approximation conditions on function spaces under which asymptotic equivalence between these models can be established and show that the conditions are satisfied for certain Sobolev balls exceeding some minimal smoothness. Furthermore, we construct a sequence space representation and provide necessary conditions for asymptotic equivalence to hold.

preprint2014arXiv

Asymptotically efficient estimation of a scale parameter in Gaussian time series and closed-form expressions for the Fisher information

Mimicking the maximum likelihood estimator, we construct first order Cramer-Rao efficient and explicitly computable estimators for the scale parameter $σ^2$ in the model $Z_{i,n}=σn^{-β}X_i+Y_i,i=1,\ldots,n,β>0$ with independent, stationary Gaussian processes $(X_i)_{i\in\mathbb{N}}$, $(Y_i)_{i\in\mathbb{N}}$, and $(X_i)_{i\in\mathbb{N}}$ exhibits possibly long-range dependence. In a second part, closed-form expressions for the asymptotic behavior of the corresponding Fisher information are derived. Our main finding is that depending on the behavior of the spectral densities at zero, the Fisher information has asymptotically two different scaling regimes, which are separated by a sharp phase transition. The most prominent example included in our analysis is the Fisher information for the scaling factor of a high-frequency sample of fractional Brownian motion under additive noise.

preprint2013arXiv

Spot volatility estimation for high-frequency data: adaptive estimation in practice

We develop further the spot volatility estimator introduced in Hoffmann, Munk and Schmidt-Hieber (2012) from a practical point of view and make it useful for the analysis of high-frequency financial data. In a first part, we adjust the estimator substantially in order to achieve good finite sample performance and to overcome difficulties arising from violations of the additive microstructure noise model (e.g. jumps, rounding errors). These modifications are justified by simulations. The second part is devoted to investigate the behavior of volatility in response to macroeconomic events. We give evidence that the spot volatility of Euro-BUND futures is considerably higher during press conferences of the European Central Bank. As an outlook, we present an estimator for the spot covolatility of two different prices.

preprint2012arXiv

Multiscale Methods for Shape Constraints in Deconvolution: Confidence Statements for Qualitative Features

We derive multiscale statistics for deconvolution in order to detect qualitative features of the unknown density. An important example covered within this framework is to test for local monotonicity on all scales simultaneously. We investigate the moderately ill-posed setting, where the Fourier transform of the error density in the deconvolution model is of polynomial decay. For multiscale testing, we consider a calibration, motivated by the modulus of continuity of Brownian motion. We investigate the performance of our results from both the theoretical and simulation based point of view. A major consequence of our work is that the detection of qualitative features of a density in a deconvolution problem is a doable task although the minimax rates for pointwise estimation are very slow.

preprint2011arXiv

Adaptive wavelet estimation of the diffusion coefficient under additive error measurements

We study nonparametric estimation of the diffusion coefficient from discrete data, when the observations are blurred by additional noise. Such issues have been developed over the last 10 years in several application fields and in particular in high frequency financial data modelling, however mainly from a parametric and semiparametric point of view. This paper addresses the nonparametric estimation of the path of the (possibly stochastic) diffusion coefficient in a relatively general setting. By developing pre-averaging techniques combined with wavelet thresholding, we construct adaptive estimators that achieve a nearly optimal rate within a large scale of smoothness constraints of Besov type. Since the diffusion coefficient is usually genuinely random, we propose a new criterion to assess the quality of estimation; we retrieve the usual minimax theory when this approach is restricted to a deterministic diffusion coefficient. In particular, we take advantage of recent results of Reiss [33] of asymptotic equivalence between a Gaussian diffusion with additive noise and Gaussian white noise model, in order to prove a sharp lower bound.

preprint2010arXiv

Lower bounds for volatility estimation in microstructure noise models

In this paper we derive lower bounds in minimax sense for estimation of the instantaneous volatility if the diffusion type part cannot be observed directly but under some additional Gaussian noise. Three different models are considered. Our technique is based on a general inequality for Kullback-Leibler divergence of multivariate normal random variables and spectral analysis of the processes. The derived lower bounds are indeed optimal. Upper bounds can be found in Munk and Schmidt-Hieber [18]. Our major finding is that the Gaussian microstructure noise introduces an additional degree of ill-posedness for each model, respectively.

preprint2010arXiv

Nonparametric estimation of the volatility function in a high-frequency model corrupted by noise

We consider the models Y_{i,n}=\int_0^{i/n} σ(s)dW_s+τ(i/n)ε_{i,n}, and \tilde Y_{i,n}=σ(i/n)W_{i/n}+τ(i/n)ε_{i,n}, i=1,...,n, where W_t denotes a standard Brownian motion and ε_{i,n} are centered i.i.d. random variables with E(ε_{i,n}^2)=1 and finite fourth moment. Furthermore, σand τare unknown deterministic functions and W_t and (ε_{1,n},...,ε_{n,n}) are assumed to be independent processes. Based on a spectral decomposition of the covariance structures we derive series estimators for σ^2 and τ^2 and investigate their rate of convergence of the MISE in dependence of their smoothness. To this end specific basis functions and their corresponding Sobolev ellipsoids are introduced and we show that our estimators are optimal in minimax sense. Our work is motivated by microstructure noise models. Our major finding is that the microstructure noise ε_{i,n} introduces an additionally degree of ill-posedness of 1/2; irrespectively of the tail behavior of ε_{i,n}. The method is illustrated by a small numerical study.

Johannes Schmidt-Hieber

What is connected

Connect this record

See the researcher in context

Building this map preview

20 published item(s)

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

Spike-timing-dependent Hebbian learning as noisy gradient descent

Local convergence rates of the nonparametric least squares estimator with applications to transfer learning

Posterior contraction for deep Gaussian process priors

The Kolmogorov-Arnold representation theorem revisited

Nonparametric regression using deep neural networks with ReLU activation function

On frequentist coverage of Bayesian credible sets for estimation of the mean under constraints

Posterior contraction rates for support boundary recovery

Minimax theory for a class of non-linear statistical inverse problems

Bayesian linear regression with sparse priors

Conditions for Posterior Contraction in the Sparse Normal Means Problem

On adaptive posterior concentration rates

On an estimator achieving the adaptive rate in nonparametric regression under $L^p$-loss for all $1\leq p \leq \infty$

Asymptotic equivalence for regression under fractional noise

Asymptotically efficient estimation of a scale parameter in Gaussian time series and closed-form expressions for the Fisher information

Spot volatility estimation for high-frequency data: adaptive estimation in practice

Multiscale Methods for Shape Constraints in Deconvolution: Confidence Statements for Qualitative Features

Adaptive wavelet estimation of the diffusion coefficient under additive error measurements

Lower bounds for volatility estimation in microstructure noise models

Nonparametric estimation of the volatility function in a high-frequency model corrupted by noise