Source author record

Thomas Hofmann

Thomas Hofmann appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Machine Learning, Computation and Language, Artificial Intelligence, Information Retrieval, math.OC, cond-mat.mes-hall, astro-ph.CO, cond-mat.mtrl-sci, Social and Information Networks

Catalog footprint

What is connected

21 works

9 topics

4 close collaborators

preprint | 2026 | arXiv

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model's transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.
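The idea of selecting one transition matrix per time step can be illustrated with a toy automaton. The sketch below is only an assumption-laden illustration: it runs a two-state parity automaton as a linear recurrence with dense 0/1 matrices. The names `TRANSITIONS`, `matvec` and `run_fsa` are hypothetical, and nothing here reproduces the parameterization, training, or kernels of Flash PD-SSM.

```python
# Toy illustration only: a parity automaton run as a linear recurrence
# h_t = A[x_t] @ h_{t-1}, where one transition matrix is selected per input symbol.
# The matrices and names are illustrative assumptions, not the paper's model.

IDENTITY = [[1, 0], [0, 1]]   # symbol 0: stay in the current state
SWAP = [[0, 1], [1, 0]]       # symbol 1: flip between "even" and "odd"
TRANSITIONS = {0: IDENTITY, 1: SWAP}


def matvec(matrix, vector):
    """Multiply a small dense matrix by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def run_fsa(symbols):
    """Return the final state index after reading all symbols."""
    state = [1, 0]  # one-hot start state: "even number of ones seen"
    for symbol in symbols:
        state = matvec(TRANSITIONS[symbol], state)
    return state.index(1)


parity = run_fsa([1, 0, 1, 1])  # three ones, so the automaton ends in state 1
```

Because each selected matrix is a permutation, the state stays one-hot, which is what makes the recurrence behave like an exact automaton rather than a decaying memory.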

preprint | 2026 | arXiv

On the Emergence of Induction Heads for In-Context Learning

Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that only 3 dimensions account for the emergence of an induction head. By further studying the training dynamics inside this 3-dimensional subspace, we find that the time until the emergence of an induction head follows a tight asymptotic bound that is quadratic in the input context length.
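The behaviour usually attributed to an induction head can be written as a plain lookup: find the most recent earlier occurrence of the current token and copy the token that followed it. The sketch below is a hand-written stand-in under that assumption; the function name `induction_prediction` is hypothetical, and it is neither a transformer nor the weight structure analysed in the paper.

```python
# Toy illustration of induction-head behaviour as a lookup over the context.
# This is not a transformer; it only mimics the input/output pattern.

def induction_prediction(tokens):
    """Predict the next token by copying what followed the last match of the current token."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from right to left for the same token.
    for position in range(len(tokens) - 2, -1, -1):
        if tokens[position] == current:
            return tokens[position + 1]
    return None  # the current token has not appeared before


prediction = induction_prediction(["A", "B", "C", "A"])  # copies "B"
```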

preprint | 2022 | arXiv

A Full $w$CDM Analysis of KiDS-1000 Weak Lensing Maps using Deep Learning

We present a full forward-modeled $w$CDM analysis of the KiDS-1000 weak lensing maps using graph-convolutional neural networks (GCNN). Utilizing the $\texttt{CosmoGrid}$, a novel massive simulation suite spanning six different cosmological parameters, we generate almost one million tomographic mock surveys on the sphere. Due to the large data set size and survey area, we perform a spherical analysis while limiting our map resolution to $\texttt{HEALPix}$ $n_\mathrm{side}=512$. We marginalize over systematics such as photometric redshift errors, multiplicative calibration and additive shear bias. Furthermore, we use a map-level implementation of the non-linear intrinsic alignment model along with a novel treatment of baryonic feedback to incorporate additional astrophysical nuisance parameters. We also perform a spherical power spectrum analysis for comparison. The constraints of the cosmological parameters are generated using a likelihood-free inference method called Gaussian Process Approximate Bayesian Computation (GPABC). Finally, we check that our pipeline is robust against choices of the simulation parameters. We find constraints on the degeneracy parameter of $S_8 \equiv \sigma_8\sqrt{\Omega_M/0.3} = 0.78^{+0.06}_{-0.06}$ for our power spectrum analysis and $S_8 = 0.79^{+0.05}_{-0.05}$ for our GCNN analysis, improving the former by 16%. This is consistent with earlier analyses of the 2-point function, albeit slightly higher. Baryonic corrections generally broaden the constraints on the degeneracy parameter by about 10%. These results offer great prospects for full machine learning based analyses of on-going and future weak lensing surveys.
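The degeneracy parameter defined in the abstract is a one-line formula, and the quoted improvement can be checked roughly from the rounded error bars. The helper names `s8` and `relative_tightening` below are hypothetical, and the input values in the example are illustrative rather than results from the paper.

```python
import math


def s8(sigma_8, omega_m):
    """S_8 = sigma_8 * sqrt(Omega_M / 0.3), as defined in the abstract."""
    return sigma_8 * math.sqrt(omega_m / 0.3)


def relative_tightening(half_width_before, half_width_after):
    """Fractional reduction of an error bar's half-width."""
    return (half_width_before - half_width_after) / half_width_before


# Illustrative inputs (assumed, not taken from the paper): at Omega_M = 0.3, S_8 equals sigma_8.
example_s8 = s8(0.8, 0.3)

# Rounded half-widths quoted in the abstract: 0.06 (power spectrum) and 0.05 (GCNN).
tightening = relative_tightening(0.06, 0.05)
```

With the rounded half-widths the ratio comes out near one sixth, which is roughly in line with the 16% quoted in the abstract; an exact match should not be expected from rounded figures.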

preprint | 2022 | arXiv

Boosting Search Engines with Interactive Agents

This paper presents first successful steps in designing search agents that learn meta-strategies for iterative query refinement in information-seeking tasks. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that learns interactive search strategies from scratch. Our search agents obtain retrieval and answer quality performance comparable to recent neural methods, using only a traditional term-based BM25 ranking function and interpretable discrete reranking and filtering actions.
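The term-based ranking function mentioned above, BM25, is easy to sketch. The version below is a minimal implementation of the common Okapi BM25 formula with a smoothed, always-positive IDF; the parameter defaults `k1=1.5` and `b=0.75` and the function name `bm25_scores` are assumptions, and the paper's search operators and agents are not modelled here.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score tokenized documents against a tokenized query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays positive for frequent terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["neural", "search", "agents"],
    ["cooking", "pasta"],
    ["search", "engines", "rank", "documents"],
]
scores = bm25_scores(["search", "agents"], docs)
```

In this toy corpus the first document matches both query terms and is shorter than the third, so it receives the highest score, while the second document matches nothing and scores zero.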

preprint | 2022 | arXiv

Generalization Through The Lens Of Leave-One-Out Error

Despite the tremendous empirical success of deep learning models in solving various learning tasks, our theoretical understanding of their generalization ability is very limited. Classical generalization bounds based on tools such as the VC dimension or Rademacher complexity are so far unsuitable for deep models and it is doubtful that these techniques can yield tight bounds even in the most idealistic settings (Nagarajan & Kolter, 2019). In this work, we instead revisit the concept of leave-one-out (LOO) error to measure the generalization ability of deep models in the so-called kernel regime. While popular in statistics, the LOO error has been largely overlooked in the context of deep learning. By building upon the recently established connection between neural networks and kernel learning, we leverage the closed-form expression for the leave-one-out error, giving us access to an efficient proxy for the test error. We show both theoretically and empirically that the leave-one-out error is capable of capturing various phenomena in generalization theory, such as double descent, random labels or transfer learning. Our work therefore demonstrates that the leave-one-out error provides a tractable way to estimate the generalization ability of deep neural networks in the kernel regime, opening the door to potential new research directions in the field of generalization.
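The closed-form leave-one-out shortcut can be seen in the simplest linear smoother. The sketch below uses scalar ridge regression without an intercept as a stand-in (an assumption; the paper works with kernels derived from neural networks) and checks the closed form against brute-force refitting. All function names are hypothetical.

```python
def ridge_fit(xs, ys, lam):
    """Minimise sum((y - x*w)^2) + lam * w^2 for a scalar weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def loo_errors_closed_form(xs, ys, lam):
    """LOO residuals from one fit: residual_i / (1 - h_i), with h_i the leverage."""
    a = sum(x * x for x in xs) + lam
    w = ridge_fit(xs, ys, lam)
    return [(y - x * w) / (1 - x * x / a) for x, y in zip(xs, ys)]


def loo_errors_brute_force(xs, ys, lam):
    """LOO residuals obtained by refitting with each point removed."""
    errors = []
    for i in range(len(xs)):
        w_i = ridge_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        errors.append(ys[i] - xs[i] * w_i)
    return errors


xs = [0.5, 1.0, 1.5, 2.0, 3.0]
ys = [1.1, 1.9, 3.2, 3.9, 6.1]
closed = loo_errors_closed_form(xs, ys, lam=0.1)
brute = loo_errors_brute_force(xs, ys, lam=0.1)
```

Both routes give the same residuals, but the closed form needs a single fit instead of one refit per data point, which is what makes the LOO error a cheap proxy for test error.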

preprint | 2022 | arXiv

Phenomenology of Double Descent in Finite-Width Neural Networks

'Double descent' delineates the generalization behaviour of models depending on the regime they belong to: under- or over-parameterized. The current theoretical understanding behind the occurrence of this phenomenon is primarily based on linear and kernel regression models, with informal parallels to neural networks via the Neural Tangent Kernel. Therefore such analyses do not adequately capture the mechanisms behind double descent in finite-width neural networks and disregard crucial components such as the choice of the loss function. We address these shortcomings by leveraging influence functions in order to derive suitable expressions of the population loss and its lower bound, while imposing minimal assumptions on the form of the parametric model. Our derived bounds bear an intimate connection with the spectrum of the Hessian at the optimum, and importantly, exhibit a double descent behaviour at the interpolation threshold. Building on our analysis, we further investigate how the loss function affects double descent, and thus uncover interesting properties of neural networks and their Hessian spectra near the interpolation threshold.

preprint | 2021 | arXiv

Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization

Viewing optimization methods as numerical integrators for ordinary differential equations (ODEs) provides a thought-provoking modern framework for studying accelerated first-order optimizers. In this literature, acceleration is often supposed to be linked to the quality of the integrator (accuracy, energy preservation, symplecticity). In this work, we propose a novel ordinary differential equation that questions this connection: both the explicit and the semi-implicit (a.k.a. symplectic) Euler discretizations of this ODE lead to an accelerated algorithm for convex programming. Although semi-implicit methods are well-known in numerical analysis to enjoy many desirable features for the integration of physical systems, our findings show that these properties do not necessarily relate to acceleration.
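The difference between the two discretizations is easiest to see on a textbook physical system. The sketch below integrates the undamped harmonic oscillator x' = v, v' = -x, which is an assumption chosen for illustration: it is not the ODE proposed in the paper and it is not an optimizer. Function names are hypothetical.

```python
def explicit_euler(x, v, h, steps):
    """Explicit Euler: both updates use the old state."""
    for _ in range(steps):
        x, v = x + h * v, v - h * x
    return x, v


def semi_implicit_euler(x, v, h, steps):
    """Semi-implicit (symplectic) Euler: position uses the freshly updated velocity."""
    for _ in range(steps):
        v = v - h * x
        x = x + h * v
    return x, v


def energy(x, v):
    """Energy of the unit-mass, unit-stiffness oscillator."""
    return 0.5 * (x * x + v * v)


energy_explicit = energy(*explicit_euler(1.0, 0.0, 0.1, 1000))
energy_semi_implicit = energy(*semi_implicit_euler(1.0, 0.0, 0.1, 1000))
```

On this system the explicit scheme multiplies the energy by (1 + h^2) at every step and blows up, while the semi-implicit scheme keeps it close to the initial value of 0.5. These are the "desirable features for physical systems" that the abstract argues need not explain acceleration.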

preprint | 2020 | arXiv

A domain agnostic measure for monitoring and evaluating GANs

Generative Adversarial Networks (GANs) have shown remarkable results in modeling complex distributions, but their evaluation remains an unsettled issue. Evaluations are essential for: (i) relative assessment of different models and (ii) monitoring the progress of a single model throughout training. The latter cannot be determined by simply inspecting the generator and discriminator loss curves as they behave non-intuitively. We leverage the notion of duality gap from game theory to propose a measure that addresses both (i) and (ii) at a low computational cost. Extensive experiments show the effectiveness of this measure to rank different GAN models and capture the typical GAN failure scenarios, including mode collapse and non-convergent behaviours. This evaluation metric also provides meaningful monitoring on the progression of the loss during training. It highly correlates with FID on natural image datasets, and with domain specific scores for text, sound and cosmology data where FID is not directly suitable. In particular, our proposed metric requires no labels or a pretrained classifier, making it domain agnostic.
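The duality gap itself is a generic game-theoretic quantity and can be computed exactly in a small zero-sum matrix game. The sketch below assumes a row player who minimises and a column player who maximises; the function name and the rock-paper-scissors payoff matrix are illustrative, and how the worst-case terms are estimated for neural-network generators and discriminators is not described in this abstract.

```python
def duality_gap(payoff, row_strategy, col_strategy):
    """Duality gap of mixed strategies in a zero-sum matrix game.

    The row player minimises and the column player maximises row^T A col.
    The gap is zero exactly at an equilibrium and positive otherwise.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Best response of the column player against the fixed row strategy.
    worst_case_for_row = max(
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    )
    # Best response of the row player against the fixed column strategy.
    best_case_for_row = min(
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    )
    return worst_case_for_row - best_case_for_row


ROCK_PAPER_SCISSORS = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1, 0, 0]

gap_at_equilibrium = duality_gap(ROCK_PAPER_SCISSORS, uniform, uniform)
gap_far_from_equilibrium = duality_gap(ROCK_PAPER_SCISSORS, always_rock, always_rock)
```

A single scalar that vanishes only at equilibrium is what makes the gap usable both for comparing models and for monitoring one model during training.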

preprint | 2020 | arXiv

Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks

Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random matrices, it is not surprising to find that the rank of the intermediate representations in unnormalized networks collapses quickly with depth. In this work we highlight the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks. Leveraging tools from Markov chain theory, we derive a meaningful lower rank bound in deep linear networks. Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets. Finally, we conduct an extensive set of experiments on real-world data sets, which confirm that rank stability is indeed a crucial condition for training modern-day deep neural architectures.

preprint | 2020 | arXiv

BERT as a Teacher: Contextual Embeddings for Sequence-Level Reward

Measuring the quality of a generated sequence against a set of references is a central problem in many learning frameworks, be it to compute a score, to assign a reward, or to perform discrimination. Despite great advances in model architectures, metrics that scale independently of the number of references are still based on n-gram estimates. We show that the underlying operations, counting words and comparing counts, can be lifted to embedding words and comparing embeddings. An in-depth analysis of BERT embeddings shows empirically that contextual embeddings can be employed to capture the required dependencies while maintaining the necessary scalability through appropriate pruning and smoothing techniques. We cast unconditional generation as a reinforcement learning problem and show that our reward function indeed provides a more effective learning signal than n-gram reward in this challenging setting.

preprint | 2016 | arXiv

DynaNewton - Accelerating Newton's Method for Machine Learning

Newton's method is a fundamental technique in optimization with quadratic convergence within a neighborhood around the optimum. However, reaching this neighborhood is often slow and dominates the computational costs. We exploit two properties specific to empirical risk minimization problems to accelerate Newton's method, namely, subsampling training data and increasing strong convexity through regularization. We propose a novel continuation method, where we define a family of objectives over increasing sample sizes and with decreasing regularization strength. Solutions on this path are tracked such that the minimizer of the previous objective is guaranteed to be within the quadratic convergence region of the next objective to be optimized. Thereby every Newton iteration is guaranteed to achieve super-linear contractions with regard to the chosen objective, which becomes a moving target. We provide a theoretical analysis that motivates our algorithm, called DynaNewton, and characterizes its speed of convergence. Experiments on a wide range of data sets and problems consistently confirm the predicted computational savings.

preprint | 2016 | arXiv

Probabilistic Bag-Of-Hyperlinks Model for Entity Linking

Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referred to as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods.

preprint | 2016 | arXiv

Semantic Place Descriptors for Classification and Map Discovery

Urban environments develop complex, non-obvious structures that are often hard to represent in the form of maps or guides. Finding the right place to go often requires intimate familiarity with the location in question and cannot easily be deduced by visitors. In this work, we exploit large-scale samples of usage information, in the form of mobile phone traces and geo-tagged Twitter messages in order to automatically explore and annotate city maps via kernel density estimation. Our experiments are based on one year's worth of mobile phone activity collected by Nokia's Mobile Data Challenge (MDC). We show that usage information can be a strong predictor of semantic place categories, allowing us to automatically annotate maps based on the behavior of the local user base.
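Kernel density estimation over geo-located events can be sketched in a few lines. The version below assumes an isotropic Gaussian kernel on planar coordinates and a hand-picked bandwidth; the function name `gaussian_kde_2d` and the sample points are hypothetical, and the paper's features and data are not reproduced.

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Return a density function built from 2-D points with an isotropic Gaussian kernel."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            sq_dist = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density


# Illustrative activity locations clustered around the origin (made-up coordinates).
activity = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (4.0, 4.0)]
density = gaussian_kde_2d(activity, bandwidth=0.5)
busy = density(0.0, 0.0)
quiet = density(2.0, 2.0)
```

The estimated density is high near the cluster of points and low between clusters, which is the kind of smooth usage surface that can then be related to place categories.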

preprint | 2016 | arXiv

Starting Small -- Learning with Adaptive Sample Sizes

For many machine learning problems, data is abundant and it may be prohibitive to make multiple passes through the full training set. In this context, we investigate strategies for dynamically increasing the effective sample size, when using iterative methods such as stochastic gradient descent. Our interest is motivated by the rise of variance-reduced methods, which achieve linear convergence rates that scale favorably for smaller sample sizes. Exploiting this feature, we show -- theoretically and empirically -- how to obtain significant speed-ups with a novel algorithm that reaches statistical accuracy on an $n$-sample in $2n$, instead of $n \log n$ steps.

preprint | 2016 | arXiv

Variance Reduced Stochastic Gradient Descent with Neighbors

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its slow convergence can be a computational bottleneck. Variance reduction techniques such as SAG, SVRG and SAGA have been proposed to overcome this weakness, achieving linear convergence. However, these methods are either based on computations of full gradients at pivot points, or on keeping per data point corrections in memory. Therefore speed-ups relative to SGD may need a minimal number of epochs in order to materialize. This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase. As a side-product we provide a unified convergence analysis for a family of variance reduction algorithms, which we call memorization algorithms. We provide experimental results supporting our theory.
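Of the variance reduction methods named above, SAGA is the one that keeps per-data-point corrections in memory. The sketch below is a minimal SAGA loop on a scalar least-squares problem, with the step size set to 1/(3L) as an assumption; the function name and data are hypothetical, and the neighborhood-sharing variants studied in the paper are not implemented.

```python
import random


def saga_least_squares(xs, ys, steps=2000, seed=0):
    """Minimise the average of f_i(w) = 0.5 * (x_i * w - y_i)^2 with SAGA."""
    rng = random.Random(seed)
    n = len(xs)
    step_size = 1.0 / (3.0 * max(x * x for x in xs))  # 1 / (3 L), L = largest curvature
    w = 0.0
    # Memory of the last gradient seen for each data point, initialised at w = 0.
    table = [x * (x * w - y) for x, y in zip(xs, ys)]
    table_mean = sum(table) / n
    for _ in range(steps):
        i = rng.randrange(n)
        grad_i = xs[i] * (xs[i] * w - ys[i])
        # Variance-reduced direction: fresh gradient, minus stale one, plus table average.
        w -= step_size * (grad_i - table[i] + table_mean)
        table_mean += (grad_i - table[i]) / n
        table[i] = grad_i
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
w_saga = saga_least_squares(xs, ys)
w_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

The individual gradients do not vanish at the optimum for this noisy data, so plain SGD with a constant step would keep fluctuating; the stored corrections are what let the constant-step iteration settle on the least-squares solution.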

preprint | 2015 | arXiv

A Variance Reduced Stochastic Newton Method

Quasi-Newton methods are widely used in practice for convex loss minimization problems. These methods exhibit good empirical performance on a wide variety of tasks and enjoy super-linear convergence to the optimal solution. For large-scale learning problems, stochastic Quasi-Newton methods have been recently proposed. However, these typically only achieve sub-linear convergence rates and have not been shown to consistently perform well in practice since noisy Hessian approximations can exacerbate the effect of high-variance stochastic gradient estimates. In this work we propose Vite, a novel stochastic Quasi-Newton algorithm that uses an existing first-order technique to reduce this variance. Without exploiting the specific form of the approximate Hessian, we show that Vite reaches the optimum at a geometric rate with a constant step-size when dealing with smooth strongly convex functions. Empirically, we demonstrate improvements over existing stochastic Quasi-Newton and variance reduced stochastic gradient methods.

preprint | 2014 | arXiv

Chemical and Crystallographic Characterization of the Tip Apex in Scanning Probe Microscopy

The apex atom of a W scanning probe tip reveals a non-spherical charge distribution as probed by a CO molecule bonded to a Cu(111) surface [Welker et al. Science, 336, 444 (2012)]. Three high-symmetry images were observed and related to three low-index crystallographic directions of the W bcc crystal. Open questions remained, however, including the verification that the tip was indeed W-terminated, and whether this method can be easily applied to distinguish other atomic species. In this work, we investigate bulk Cu and Fe tips. In both cases we can associate our data with the fcc (Cu) and bcc (Fe) crystal structures. A model is presented, based on the partial filling of d orbitals, to relate the AFM images to the angular orientation of the tip structure.

preprint | 2014 | arXiv

Communication-Efficient Distributed Dual Coordinate Ascent

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that, compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same 0.001-accurate solution quality on average 25x as quickly.

preprint | 2013 | arXiv

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.
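The mixture decomposition and the tempered EM fit can be sketched on a tiny count matrix. The code below assumes the symmetric parameterization P(d, w) = sum over z of P(z) P(d|z) P(w|z), with the likelihood part of the E-step raised to a power `beta` (`beta = 1` is ordinary EM). The function name, the fixed `beta`, and the toy counts are illustrative assumptions; schedules for choosing `beta` are not shown.

```python
import random


def plsa_tempered_em(counts, n_topics, beta=1.0, iterations=50, seed=0):
    """Fit P(z), P(d|z), P(w|z) to a document-word count matrix with tempered EM."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_distribution(size):
        weights = [rng.random() + 0.1 for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [random_distribution(n_docs) for _ in range(n_topics)]
    p_w_z = [random_distribution(n_words) for _ in range(n_topics)]

    for _ in range(iterations):
        new_z = [0.0] * n_topics
        new_d = [[0.0] * n_docs for _ in range(n_topics)]
        new_w = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # Tempered E-step: posterior over topics with the likelihood raised to beta.
                post = [p_z[z] * (p_d_z[z][d] * p_w_z[z][w]) ** beta
                        for z in range(n_topics)]
                total = sum(post)
                for z in range(n_topics):
                    r = counts[d][w] * post[z] / total
                    new_z[z] += r
                    new_d[z][d] += r
                    new_w[z][w] += r
        # M-step: renormalise the expected counts into probabilities.
        grand_total = sum(new_z)
        for z in range(n_topics):
            p_z[z] = new_z[z] / grand_total
            p_d_z[z] = [v / new_z[z] for v in new_d[z]]
            p_w_z[z] = [v / new_z[z] for v in new_w[z]]
    return p_z, p_d_z, p_w_z


# Toy counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_z, p_d_z, p_w_z = plsa_tempered_em(counts, n_topics=2, beta=0.9)
```

Values of `beta` below 1 flatten the posteriors in the E-step, which is the mechanism the abstract relies on to counter overfitting.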

preprint | 2012 | arXiv

Exponential Families for Conditional Random Fields

In this paper we define conditional random fields in reproducing kernel Hilbert spaces and show connections to Gaussian Process classification. More specifically, we prove decomposition results for undirected graphical models and we give constructions for kernels. Finally we present efficient means of solving the optimization problem using reduced rank decompositions and we show how stationarity can be exploited efficiently in the optimization process.

preprint | 2010 | arXiv

Preparation of light-atom tips for Scanning Probe Microscopy by explosive delamination

To obtain maximal resolution in STM and AFM, the size of the protruding tip orbital has to be minimized. Beryllium as tip material is a promising candidate for enhanced resolution because a beryllium atom has just four electrons, leading to a small covalent radius of only 96 pm. Besides that, beryllium is conductive and has a high elastic modulus, which is a necessity for a stable tip apex. However, beryllium tips that are prepared ex situ are covered with a robust oxide layer, which cannot be removed by just heating the tip. Here we present a successful preparation method that combines the heating of the tip by field emission and a mild collision with a clean metal plate. That method yields a clean, oxide-free tip surface as proven by a work function of as deduced from a current-distance curve. Additionally, an STM image of the Si-(111)-(7x7) is presented to prove the single-atom termination of the beryllium tip.
