Source author record

Mingsong Yan

Mingsong Yan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.NA math.OC Artificial Intelligence Machine Learning Numerical Analysis

Catalog footprint

What is connected

2works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/ε))$ blocks and MLP width $O(\sqrt{N/ε})$ that achieves $ε$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.

preprint2022arXiv

Parameter Choices for Sparse Regularization with the $\ell_1$ Norm

We consider a regularization problem whose objective function consists of a convex fidelity term and a regularization term determined by the $\ell_1$ norm composed with a linear transform. Empirical results show that the regularization with the $\ell_1$ norm can promote sparsity of a regularized solution. It is the goal of this paper to understand theoretically the effect of the regularization parameter on the sparsity of the regularized solutions. We establish a characterization of the sparsity under the transform matrix of the solution. When the fidelity term has a special structure and the transform matrix coincides with a identity matrix, the resulting characterization can be taken as a regularization parameter choice strategy with which the regularization problem has a solution having a sparsity of a certain level. We study choices of the regularization parameter so that the regularization term alleviates the ill-posedness and promote sparsity of the resulting regularized solution. Numerical experiments demonstrate that choices of the regularization parameters can balance the sparsity of the solutions of the regularization problem and its approximation to the minimizer of the fidelity function.