Source author record

Jonathan W. Siegel

Jonathan W. Siegel appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning math.OC Information Theory math.CA math.IT math.NA math.ST Numerical Analysis Statistics Theory

Catalog footprint

What is connected

7works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling

In many practical applications it is important to build symmetries into neural network architectures. Consider the important case of permutation symmetry on point clouds consisting of $n$ points in $d$ dimensions. In this case the network learns a function on a set of $n$ points in $\mathbb{R}^d$, and a natural paradigm for constructing invariant networks is Janossy pooling, which generalizes the popular Deep Sets architecture. We study the universality of this approach, in particular the important question of how large the embedding dimension must be to guarantee universality of this architecture. Specifically, using a novel technique, we prove new lower bounds on the required size of this embedding dimension. For Deep Sets, this gives the correct minimal dimension up to a constant factor for all $d > 1$. For $k$-ary Janossy pooling, we prove the first non-trivial lower bound on the required embedding dimension when $k > 1$.

preprint2022arXiv

A Priori Analysis of Stable Neural Network Solutions to Numerical PDEs

Methods for solving PDEs using neural networks have recently become a very important topic. We provide an a priori error analysis for such methods which is based on the $\mathcal{K}_1(\mathbb{D})$-norm of the solution. We show that the resulting constrained optimization problem can be efficiently solved using a greedy algorithm, which replaces stochastic gradient descent. Following this, we show that the error arising from discretizing the energy integrals is bounded both in the deterministic case, i.e. when using numerical quadrature, and also in the stochastic case, i.e. when sampling points to approximate the integrals. In the later case, we use a Rademacher complexity analysis, and in the former we use standard numerical quadrature bounds. This extends existing results to methods which use a general dictionary of functions to learn solutions to PDEs and importantly gives a consistent analysis which incorporates the optimization, approximation, and generalization aspects of the problem. In addition, the Rademacher complexity analysis is simplified and generalized, which enables application to a wide range of problems.

preprint2022arXiv

Characterization of the Variation Spaces Corresponding to Shallow Neural Networks

We study the variation space corresponding to a dictionary of functions in $L^2(Ω)$ for a bounded domain $Ω\subset \mathbb{R}^d$. Specifically, we compare the variation space, which is defined in terms of a convex hull with related notions based on integral representations. This allows us to show that three important notions relating to the approximation theory of shallow neural networks, the Barron space, the spectral Barron space, and the Radon BV space, are actually variation spaces with respect to certain natural dictionaries.

preprint2022arXiv

Extended Regularized Dual Averaging Methods for Stochastic Optimization

We introduce a new algorithm, extended regularized dual averaging (XRDA), for solving regularized stochastic optimization problems, which generalizes the regularized dual averaging (RDA) method. The main novelty of the method is that it allows a flexible control of the backward step size. For instance, the backward step size used in RDA grows without bound, while for XRDA the backward step size can be kept bounded. We demonstrate experimentally that additional control over the backward step size can significantly improve the convergence rate of the algorithm while preserving desired properties of the iterates, such as sparsity. Theoretically, we show that the XRDA method achieves the same convergence rate as RDA for general convex objectives.

preprint2022arXiv

Optimal Convergence Rates for the Orthogonal Greedy Algorithm

We analyze the orthogonal greedy algorithm when applied to dictionaries $\mathbb{D}$ whose convex hull has small entropy. We show that if the metric entropy of the convex hull of $\mathbb{D}$ decays at a rate of $O(n^{-\frac{1}{2}-α})$ for $α> 0$, then the orthogonal greedy algorithm converges at the same rate on the variation space of $\mathbb{D}$. This improves upon the well-known $O(n^{-\frac{1}{2}})$ convergence rate of the orthogonal greedy algorithm in many cases, most notably for dictionaries corresponding to shallow neural networks. These results hold under no additional assumptions on the dictionary beyond the decay rate of the entropy of its convex hull. In addition, they are robust to noise in the target function and can be extended to convergence rates on the interpolation spaces of the variation norm. We show empirically that the predicted rates are obtained for the dictionary corresponding to shallow neural networks with Heaviside activation function in two dimensions. Finally, we show that these improved rates are sharp and prove a negative result showing that the iterates generated by the orthogonal greedy algorithm cannot in general be bounded in the variation norm of $\mathbb{D}$.

preprint2021arXiv

Accelerated Optimization With Orthogonality Constraints

We develop a generalization of Nesterov's accelerated gradient descent method which is designed to deal with orthogonality constraints. To demonstrate the effectiveness of our method, we perform numerical experiments which demonstrate that the number of iterations scales with the square root of the condition number, and also compare with existing state-of-the-art quasi-Newton methods on the Stiefel manifold. Our experiments show that our method outperforms existing state-of-the-art quasi-Newton methods on some large, ill-conditioned problems.

preprint2021arXiv

Approximation Rates for Neural Networks with General Activation Functions

We prove some new results concerning the approximation rate of neural networks with general activation functions. Our first result concerns the rate of approximation of a two layer neural network with a polynomially-decaying non-sigmoidal activation function. We extend the dimension independent approximation rates previously obtained to this new class of activation functions. Our second result gives a weaker, but still dimension independent, approximation rate for a larger class of activation functions, removing the polynomial decay assumption. This result applies to any bounded, integrable activation function. Finally, we show that a stratified sampling approach can be used to improve the approximation rate for polynomially decaying activation functions under mild additional assumptions.

Jonathan W. Siegel

What is connected

Connect this record

See the researcher in context

Building this map preview

7 published item(s)

Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling

A Priori Analysis of Stable Neural Network Solutions to Numerical PDEs

Characterization of the Variation Spaces Corresponding to Shallow Neural Networks

Extended Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Convergence Rates for the Orthogonal Greedy Algorithm

Accelerated Optimization With Orthogonality Constraints

Approximation Rates for Neural Networks with General Activation Functions