Source author record

Weijie Su

Weijie Su appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Machine Learning, math.ST, Methodology, Statistics Theory, Artificial Intelligence, math.OC, Information Theory, math.IT, Computation, Computation and Language, Computer Science and Game Theory, Computer Vision, Data Structures and Algorithms, math.CA

Catalog footprint

What is connected

18 works

14 topics

4 close collaborators


preprint, 2026, arXiv

High-Dimensional Statistics: Reflections on Progress and Open Problems

Over the past two decades, the field of high-dimensional statistics has experienced substantial progress, driven largely by technological advances that have dramatically reduced the cost and effort for data collection and storage across a broad range of domains, including biology, medicine, astronomy, and the social and environmental sciences. Modern datasets are increasingly complex, often exhibiting rich dependency, heterogeneity, and other features that challenge traditional statistical methods. In response, high-dimensional statistics has evolved to address more sophisticated estimation and inference problems. This evolution has, in turn, fostered deep connections with and contributions to a wide range of research areas, including optimization, concentration of measure, random matrix theory, information theory, and theoretical computer science. Given the rapid pace of recent developments in high-dimensional statistics, our goal is to synthesize representative advances, highlight common themes and open problems, and point to important works that offer entry points into the field.

preprint, 2026, arXiv

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
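The equivariance requirement described above can be checked numerically for one of the update families the abstract names. The sketch below is an illustration, not the paper's implementation: it assumes the orthogonalized ("polar") update is the factor U V^T from the thin SVD of the gradient, and it uses a coordinate-wise sign update as a rough stand-in for Adam-like behaviour. The function names and test matrices are hypothetical.

```python
import numpy as np


def polar_update(grad):
    """Orthogonal polar factor U @ Vt of a gradient matrix (assumed update rule)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def sign_update(grad):
    """Coordinate-wise update, used here only as a non-equivariant baseline."""
    return np.sign(grad)


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))   # hypothetical gradient of a 4 x 3 weight block
P = random_orthogonal(4, rng)     # left orthogonal transformation
Q = random_orthogonal(3, rng)     # right orthogonal transformation

# Bi-orthogonal equivariance: update(P G Q^T) should equal P update(G) Q^T.
equivariance_gap = float(np.max(np.abs(
    polar_update(P @ G @ Q.T) - P @ polar_update(G) @ Q.T)))
sign_gap = float(np.max(np.abs(
    sign_update(P @ G @ Q.T) - P @ sign_update(G) @ Q.T)))
```

Under these assumptions `equivariance_gap` is at the level of floating-point rounding, while `sign_gap` is not, which is the disparity the abstract points to.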

preprint, 2026, arXiv

Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

Large language models (LLMs) are pretrained by minimizing the cross-entropy loss for next-token prediction. In this paper, we study whether this optimization strategy can induce geometric structure in the learned model weights and context embeddings. We approach this problem by analyzing a constrained layer-peeled optimization program, which serves as a mathematically tractable surrogate for LLMs by treating the output projection matrix and last-layer context embeddings as optimization variables. Our analysis of this nonconvex optimization program demonstrates that symmetries in the target next-token distributions are transferred to the global minimizers of the layer-peeled model in a precise group-theoretic sense. Specifically, we prove that when the target tokens exhibit a cyclic-shift symmetry (such as the seven days of the week or the twelve months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of both the output projections and the context embeddings form circulant geometries as well. Next, for exchangeable target distributions invariant under the symmetric group and, more generally, under two-transitive group actions, we show that the global optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries present in the input data. A key technical step is to reduce the constrained nonconvex factorized problem to an explicit logit-level convex characterization for cyclic symmetry and to a symmetry-based lower bound for permutation symmetry, together with a sharp characterization of the optimal factorization. Finally, we empirically demonstrate that open-source LLMs naturally exhibit symmetries consistent with our theoretical predictions, despite being trained without any explicit regularization promoting such geometric structure.
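For readers unfamiliar with the simplex equiangular tight frame mentioned above, the following sketch builds one and inspects its Gram matrix. It uses the standard construction sqrt(K/(K-1)) (I - 11^T/K) with unit-norm vectors; the paper's own normalization and constraints may differ, and the helper names are hypothetical.

```python
import math


def simplex_etf(k):
    """Rows of sqrt(k/(k-1)) * (I - 11^T / k): k unit vectors in R^k."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def gram(rows):
    """Matrix of pairwise inner products between the given vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


K = 7  # e.g. seven classes
G = gram(simplex_etf(K))
# Every vector has unit norm and every pair has the same inner product -1/(K-1),
# so the Gram matrix is invariant under any permutation of the K indices.
```

Because all off-diagonal Gram entries coincide, this geometry is the most symmetric arrangement of K unit vectors, which is why it appears for exchangeable targets.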

preprint, 2026, arXiv

When Does Model Collapse Occur in Structured Interactive Learning?

The proliferation of generative artificial intelligence has given rise to an interactive learning environment, where model parameters are continuously updated using not only data generated by natural processes, but also synthetic outputs produced by other models. This paradigm introduces two major challenges: (1) training data are no longer drawn exclusively from the target population, undermining a core assumption of classical statistical learning, and (2) model training processes become inherently correlated, as models interact with one another through repeated exposure to each other's synthetic outputs in a potentially complex manner. Establishing reliable statistical inference in such structured interactive learning environments therefore remains an important open problem. In particular, there is growing concern about model collapse, a phenomenon in which the performance of generative models progressively degrades as they are trained on synthetic data produced by earlier model generations. Prior work on model collapse primarily focuses on a single model trained on its own output, failing to capture model performance in multi-model interactive settings. In this work, we fill this gap by investigating the performance of generative models in an interactive learning environment with general interaction patterns. In particular, we formalize model interactions using directed graphs and show that the occurrence of model collapse depends critically on the topology of the interaction graph. We further derive an explicit necessary and sufficient condition characterizing when model collapse occurs, and establish finite-sample results for linear regression and asymptotic guarantees for general M-estimators. We support our theoretical findings through extensive numerical experiments.

preprint, 2023, arXiv

Eliciting Honest Information From Authors Using Sequential Review

In the setting of conference peer review, the conference aims to accept high-quality papers and reject low-quality papers based on noisy review scores. A recent work proposes the isotonic mechanism, which can elicit the ranking of paper qualities from an author with multiple submissions to help improve the conference's decisions. However, the isotonic mechanism relies on the assumption that the author's utility is both an increasing and a convex function with respect to the review score, which is often violated in peer review settings (e.g., when authors aim to maximize the number of accepted papers). In this paper, we propose a sequential review mechanism that can truthfully elicit the ranking information from authors while only assuming the agent's utility is increasing with respect to the true quality of her accepted papers. The key idea is to review the papers of an author in a sequence based on the provided ranking and to condition the review of the next paper on the review scores of the previous papers. Advantages of the sequential review mechanism include 1) eliciting truthful ranking information in a more realistic setting than prior work; 2) improving the quality of accepted papers, reducing the reviewing workload and increasing the average quality of papers being reviewed; 3) incentivizing authors to write fewer papers of higher quality.

preprint, 2021, arXiv

Benign Overfitting and Noisy Features

Modern machine learning often operates in the regime where the number of parameters is much higher than the number of data points, with zero training loss and yet good generalization, thereby contradicting the classical bias-variance trade-off. This benign overfitting phenomenon has recently been characterized using so-called double descent curves, where the risk undergoes another descent (in addition to the classical U-shaped learning curve when the number of parameters is small) as we increase the number of parameters beyond a certain threshold. In this paper, we examine the conditions under which benign overfitting occurs in random feature (RF) models, i.e., in a two-layer neural network with fixed first-layer weights. We adopt a new view of random features and show that benign overfitting arises due to the noise which resides in such features (the noise may already be present in the data and propagate to the features, or it may be added by the user to the features directly) and plays an important implicit regularization role in the phenomenon.

preprint, 2020, arXiv

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. It is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark. Code is released at https://github.com/jackroos/VL-BERT.

preprint, 2019, arXiv

Statistical Inference for the Population Landscape via Moment Adjusted Stochastic Gradients

Modern statistical inference tasks often require iterative optimization methods to compute the solution. Convergence analysis from an optimization viewpoint only informs us how well the solution is approximated numerically but overlooks the sampling nature of the data. In contrast, recognizing the randomness in the data, statisticians are keen to provide uncertainty quantification, or confidence, for the solution obtained using iterative optimization methods. This paper makes progress along this direction by introducing the moment-adjusted stochastic gradient descents, a new stochastic optimization method for statistical inference. We establish non-asymptotic theory that characterizes the statistical distribution for certain iterative methods with optimization guarantees. On the statistical front, the theory allows for model mis-specification, with very mild conditions on the data. For optimization, the theory is flexible for both convex and non-convex cases. Remarkably, the moment-adjusting idea motivated from "error standardization" in statistics achieves a similar effect as acceleration in first-order optimization methods used to fit generalized linear models. We also demonstrate this acceleration effect in the non-convex setting through numerical experiments.

preprint, 2016, arXiv

False Discoveries Occur Early on the Lasso Path

In regression settings where explanatory variables have very low correlations and there are relatively few effects, each of large magnitude, we expect the Lasso to find the important variables with few errors, if any. This paper shows that in a regime of linear sparsity, meaning that the fraction of variables with a non-vanishing effect tends to a constant, however small, this cannot really be the case, even when the design variables are stochastically independent. We demonstrate that true features and null features are always interspersed on the Lasso path, and that this phenomenon occurs no matter how strong the effect sizes are. We derive a sharp asymptotic trade-off between false and true positive rates or, equivalently, between measures of type I and type II errors along the Lasso path. This trade-off states that if we ever want to achieve a type II error (false negative rate) under a critical value, then anywhere on the Lasso path the type I error (false positive rate) will need to exceed a given threshold so that we can never have both errors at a low level at the same time. Our analysis uses tools from approximate message passing (AMP) theory as well as novel elements to deal with a possibly adaptive selection of the Lasso regularizing parameter.

preprint, 2016, arXiv

Group SLOPE - adaptive selection of groups of predictors

Sorted L-One Penalized Estimation (SLOPE) is a relatively new convex optimization procedure which allows for adaptive selection of regressors under sparse high dimensional designs. Here we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. Such groups can be formed by clustering strongly correlated predictors or groups of dummy variables corresponding to different levels of the same qualitative predictor. We formulate the respective convex optimization problem, gSLOPE (group SLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations. Finally, we illustrate the advantages of gSLOPE in the context of Genome Wide Association Studies. R package grpSLOPE with implementation of our method is available on CRAN.

preprint, 2015, arXiv

A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights

We derive a second-order ordinary differential equation (ODE) which is the limit of Nesterov's accelerated gradient method. This ODE exhibits approximate equivalence to Nesterov's scheme and thus can serve as a tool for analysis. We show that the continuous time ODE allows for a better understanding of Nesterov's scheme. As a byproduct, we obtain a family of schemes with similar convergence rates. The ODE interpretation also suggests restarting Nesterov's scheme leading to an algorithm, which can be rigorously proven to converge at a linear rate whenever the objective is strongly convex.
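The discrete scheme that the ODE models can be sketched in a few lines. The code below assumes the common form of Nesterov's method with momentum coefficient (k-1)/(k+2) and a fixed step s no larger than 1/L; the ODE usually cited for this limit is X'' + (3/t) X' + grad f(X) = 0 with t roughly k * sqrt(s). The quadratic test function and all names are hypothetical choices for illustration.

```python
def nesterov(grad, x0, step, iters):
    """Nesterov's accelerated gradient with momentum (k - 1) / (k + 2).

    Returns the list of iterates x_0, x_1, ..., x_iters.
    """
    x_prev = list(x0)
    y = list(x0)
    history = [list(x0)]
    for k in range(1, iters + 1):
        g = grad(y)
        # Gradient step taken from the extrapolated point y.
        x = [yi - step * gi for yi, gi in zip(y, g)]
        # Momentum (extrapolation) step.
        y = [xi + (k - 1) / (k + 2) * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
        history.append(x)
    return history


# Hypothetical ill-conditioned quadratic: f(x) = 0.5 * (x1^2 + 100 * x2^2),
# with gradient Lipschitz constant L = 100 and minimum value 0 at the origin.
def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad_f(x):
    return [x[0], 100.0 * x[1]]


x0 = [1.0, 1.0]
step = 1.0 / 100.0
hist = nesterov(grad_f, x0, step, 200)

# Known O(1/k^2) guarantee for this scheme when step <= 1/L:
# f(x_k) - f* <= 2 * ||x0 - x*||^2 / (step * (k + 1)^2).
bound = [2.0 * 2.0 / (step * (k + 1) ** 2) for k in range(len(hist))]
```

Comparing `f(hist[k])` with `bound[k]` shows the inverse-quadratic decay that the continuous-time analysis is meant to explain.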

preprint, 2015, arXiv

Communication-Efficient False Discovery Rate Control via Knockoff Aggregation

The false discovery rate (FDR), the expected fraction of spurious discoveries among all the discoveries, provides a popular statistical assessment of the reproducibility of scientific studies in various disciplines. In this work, we introduce a new method for controlling the FDR in meta-analysis of many decentralized linear models. Our method targets the scenario where many research groups, possibly a random number of them, are independently testing a common set of hypotheses and then sending summary statistics to a coordinating center in an online manner. Built on the knockoffs framework introduced by Barber and Candes (2015), our procedure starts by applying the knockoff filter to each linear model and then aggregates the summary statistics via one-shot communication in a novel way. This method gives exact FDR control non-asymptotically without any knowledge of the noise variances or making any assumption about sparsity of the signal. In certain settings, it has a communication complexity that is optimal up to a logarithmic factor.

preprint, 2015, arXiv

Familywise Error Rate Control via Knockoffs

We present a novel method for controlling the $k$-familywise error rate ($k$-FWER) in the linear regression setting using the knockoffs framework first introduced by Barber and Candès. Our procedure, which we also refer to as knockoffs, can be applied with any design matrix with at least as many observations as variables, and does not require knowing the noise variance. Unlike other multiple testing procedures which act directly on $p$-values, knockoffs is specifically tailored to linear regression and implicitly accounts for the statistical relationships between hypothesis tests of different coefficients. We prove that knockoffs controls the $k$-FWER exactly in finite samples and show in simulations that it provides superior power to alternative procedures over a range of linear regression problems. We also discuss extensions to controlling other Type I error rates such as the false exceedance rate, and use it to identify candidates for mutations conferring drug-resistance in HIV.

preprint, 2015, arXiv

Group SLOPE - adaptive selection of groups of predictors

Sorted L-One Penalized Estimation (SLOPE) is a relatively new convex optimization procedure which allows for adaptive selection of regressors under sparse high dimensional designs. Here we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. This approach is particularly useful when variables in the same group are strongly correlated and thus true predictors are difficult to distinguish from their correlated "neighbors". We formulate the respective convex optimization problem, gSLOPE (group SLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations.

preprint, 2015, arXiv

Private False Discovery Rate Control

We provide the first differentially private algorithms for controlling the false discovery rate (FDR) in multiple hypothesis testing, with essentially no loss in power under certain conditions. Our general approach is to adapt a well-known variant of the Benjamini-Hochberg procedure (BHq), making each step differentially private. This destroys the classical proof of FDR control. To prove FDR control of our method, (a) we develop a new proof of the original (non-private) BHq algorithm and its robust variants, a proof requiring only the assumption that the true null test statistics are independent, allowing for arbitrary correlations between the true nulls and false nulls. This assumption is fairly weak compared to those previously shown in the vast literature on this topic, and explains in part the empirical robustness of BHq. Then (b) we relate the FDR control properties of the differentially private version to the control properties of the non-private version. We also present a low-distortion "one-shot" differentially private primitive for "top $k$" problems, e.g., "Which are the $k$ most popular hobbies?" (which we apply to: "Which hypotheses have the $k$ most significant $p$-values?"), and use it to get a faster privacy-preserving instantiation of our general approach at little cost in accuracy. The proof of privacy for the one-shot top $k$ algorithm introduces a new technique of independent interest.
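As a point of reference for the private variant, here is a minimal sketch of the classical, non-private Benjamini-Hochberg step-up procedure that the abstract builds on. It does not implement any differential privacy mechanism, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, q):
    """Return the sorted indices of hypotheses rejected by BH at level q.

    Step-up rule: find the largest rank r such that the r-th smallest
    p-value is at most q * r / m, then reject the r smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])


# Two clearly small p-values among ten hypotheses.
rejected_a = benjamini_hochberg(
    [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216], 0.05)

# Step-up behaviour: only the largest p-value passes its own threshold,
# yet all four hypotheses are rejected.
rejected_b = benjamini_hochberg([0.04, 0.045, 0.03, 0.049], 0.05)
```

The second call illustrates why the procedure is called step-up: the decision is driven by the largest rank that passes, not by each p-value's own threshold.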

preprint, 2015, arXiv

SLOPE - Adaptive variable selection via convex optimization

We introduce a new estimator for the vector of coefficients $β$ in the linear model $y=Xβ+z$, where $X$ has dimensions $n\times p$ with $p$ possibly larger than $n$. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to \[\min_{b\in\mathbb{R}^p}\frac{1}{2}\Vert y-Xb\Vert _{\ell_2}^2+λ_1\vert b\vert _{(1)}+λ_2\vert b\vert_{(2)}+\cdots+λ_p\vert b\vert_{(p)},\] where $λ_1\geλ_2\ge\cdots\geλ_p\ge0$ and $\vert b\vert_{(1)}\ge\vert b\vert_{(2)}\ge\cdots\ge\vert b\vert_{(p)}$ are the decreasing absolute values of the entries of $b$. This is a convex program and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical $\ell_1$ procedures such as the Lasso. Here, the regularizer is a sorted $\ell_1$ norm, which penalizes the regression coefficients according to their rank: the higher the rank - that is, stronger the signal - the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300] procedure (BH) which compares more significant $p$-values with more stringent thresholds. One notable choice of the sequence $\{λ_i\}$ is given by the BH critical values $λ_{\mathrm {BH}}(i)=z(1-i\cdot q/2p)$, where $q\in(0,1)$ and $z(α)$ is the quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with $λ_{\mathrm{BH}}$ provably controls FDR at level $q$. Moreover, it also appears to have appreciable inferential properties under more general designs $X$ while having substantial power, as demonstrated in a series of experiments running on both simulated and real data.
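The objective and the BH-style weight sequence defined above translate directly into code. The sketch below evaluates the SLOPE objective and the weights z(1 - i*q/(2p)) using only the standard library; it does not solve the optimization problem, and the function names are hypothetical.

```python
from statistics import NormalDist


def bh_weights(p, q):
    """BH critical values lambda_BH(i) = z(1 - i * q / (2p)) for i = 1..p."""
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    return [z(1.0 - i * q / (2.0 * p)) for i in range(1, p + 1)]


def sorted_l1_norm(b, lam):
    """Sum of lam[i] times the (i+1)-th largest absolute entry of b."""
    mags = sorted((abs(v) for v in b), reverse=True)
    return sum(l * m for l, m in zip(lam, mags))


def slope_objective(y, X, b, lam):
    """0.5 * ||y - X b||_2^2 plus the sorted-L1 penalty."""
    resid = [yi - sum(xij * bj for xij, bj in zip(row, b))
             for yi, row in zip(y, X)]
    return 0.5 * sum(r * r for r in resid) + sorted_l1_norm(b, lam)


lam = bh_weights(5, 0.1)                               # decreasing, positive weights
penalty = sorted_l1_norm([1.0, -3.0, 2.0], [3.0, 2.0, 1.0])  # 3*3 + 2*2 + 1*1
```

Note that the largest coefficient is paired with the largest weight, which is what makes the penalty rank-dependent rather than uniform as in the Lasso.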

preprint, 2015, arXiv

SLOPE is Adaptive to Unknown Sparsity and Asymptotically Minimax

We consider high-dimensional sparse regression problems in which we observe $y = Xβ + z$, where $X$ is an $n \times p$ design matrix and $z$ is an $n$-dimensional vector of independent Gaussian errors, each with variance $σ^2$. Our focus is on the recently introduced SLOPE estimator (Bogdan et al., 2014), which regularizes the least-squares estimates with the rank-dependent penalty $\sum_{1 \le i \le p} λ_i |\hat{β}|_{(i)}$, where $|\hat{β}|_{(i)}$ is the $i$th largest magnitude of the fitted coefficients. Under Gaussian designs, where the entries of $X$ are i.i.d. $\mathcal{N}(0, 1/n)$, we show that SLOPE, with weights $λ_i$ just about equal to $σ\cdot Φ^{-1}(1-iq/(2p))$ ($Φ^{-1}(α)$ is the $α$th quantile of a standard normal and $q$ is a fixed number in $(0,1)$) achieves a squared error of estimation obeying \[ \sup_{\| β\|_0 \le k} \,\, \mathbb{P} \left(\| \hat{β}_{\text{SLOPE}} - β\|^2 > (1+ε) \, 2σ^2 k \log(p/k) \right) \longrightarrow 0 \] as the dimension $p$ increases to $\infty$, and where $ε > 0$ is an arbitrary small constant. This holds under a weak assumption on the $\ell_0$-sparsity level, namely, $k/p \rightarrow 0$ and $(k\log p)/n \rightarrow 0$, and is sharp in the sense that this is the best possible error any estimator can achieve. A remarkable feature is that SLOPE does not require any knowledge of the degree of sparsity, and yet automatically adapts to yield optimal total squared errors over a wide range of $\ell_0$-sparsity classes. We are not aware of any other estimator with this property.

preprint, 2013, arXiv

Statistical estimation and testing via the sorted L1 norm

We introduce a novel method for sparse regression and variable selection, which is inspired by modern ideas in multiple testing. Given observations from the linear model y = X beta + z, we suggest estimating the regression coefficients by means of a new estimator called SLOPE, which is the solution to minimize 0.5 ||y - Xb||_2^2 + lambda_1 |b|_(1) + lambda_2 |b|_(2) + ... + lambda_p |b|_(p); here, lambda_1 >= lambda_2 >= ... >= lambda_p >= 0 and |b|_(1) >= |b|_(2) >= ... >= |b|_(p) is the order statistic of the magnitudes of b. The regularizer is a sorted L1 norm which penalizes the regression coefficients according to their rank: the higher the rank, the larger the penalty. This is similar to the famous BHq procedure [Benjamini and Hochberg, 1995], which compares the value of a test statistic taken from a family to a critical threshold that depends on its rank in the family. SLOPE is a convex program and we demonstrate an efficient algorithm for computing the solution. We prove that for orthogonal designs with p variables, taking lambda_i = F^{-1}(1-q_i) (F is the cdf of the errors), q_i = iq/(2p), controls the false discovery rate (FDR) for variable selection. When the design matrix is nonorthogonal there are inherent limitations on the FDR level and the power which can be obtained with model selection methods based on L1-like penalties. However, whenever the columns of the design matrix are not strongly correlated, we demonstrate empirically that it is possible to select the parameters lambda_i so as to obtain FDR control at a reasonable level as long as the number of nonzero coefficients is not too large. At the same time, the procedure exhibits increased power over the lasso, which treats all coefficients equally. The paper illustrates further estimation properties of the new selection rule through comprehensive simulation studies.
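The abstract does not spell out the solution algorithm, but a common building block for sorted-L1 problems is the proximal operator of the penalty, which also gives the SLOPE solution directly when the design is orthonormal. The sketch below is one standard way to compute it, assuming a pool-adjacent-violators pass over the sorted magnitudes; it is an illustration rather than the authors' code, and the function name is hypothetical.

```python
import math


def prox_sorted_l1(y, lam):
    """Minimize 0.5 * ||y - b||^2 + sum_i lam[i] * |b|_(i) over b.

    Assumes lam is nonincreasing and nonnegative, with len(lam) == len(y).
    """
    # Indices of y ordered by decreasing magnitude.
    order = sorted(range(len(y)), key=lambda i: -abs(y[i]))
    w = [abs(y[i]) - l for i, l in zip(order, lam)]

    # Pool adjacent violators: enforce a nonincreasing fit by averaging blocks.
    blocks = []  # each block is [sum of values, count]
    for v in w:
        blocks.append([v, 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] <= blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c

    # Clip block means at zero and expand back to one value per coordinate.
    fitted = []
    for s, c in blocks:
        fitted.extend([max(s / c, 0.0)] * c)

    # Undo the sort and restore the original signs.
    out = [0.0] * len(y)
    for pos, i in enumerate(order):
        out[i] = math.copysign(fitted[pos], y[i])
    return out


soft = prox_sorted_l1([3.0, -1.0, 0.5], [1.0, 1.0, 1.0])  # equal weights
tied = prox_sorted_l1([2.0, 2.5], [2.0, 0.0])             # weights force a tie
```

With equal weights the operator reduces to ordinary soft-thresholding, as in the Lasso; with unequal weights, coordinates whose shrunken magnitudes would swap order are averaged together instead.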

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint

Fields this researcher appears in

Source provenance

Where this author record came from

arxiv, confidence 95%

external id: arxiv:2605.20151:author:3:weijie-su

Imported May 20, 2026; synced May 21, 2026

arxiv, confidence 95%

external id: arxiv:2605.18106:author:2:weijie-su

Imported May 20, 2026; synced May 21, 2026

arxiv, confidence 95%

external id: arxiv:2605.05076:author:9:weijie-su

Imported May 20, 2026; synced May 20, 2026

arxiv, confidence 95%

external id: arxiv:2605.12756:author:3:weijie-su

Imported May 20, 2026; synced May 20, 2026

3 works

Emmanuel Candes

Researcher

Emmanuel Candes contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

Malgorzata Bogdan

Researcher

Malgorzata Bogdan contributes to research discovery and scholarly infrastructure.

Open to collaborate

2 works

Damian Brzyski

Researcher

Damian Brzyski contributes to research discovery and scholarly infrastructure.

Open to collaborate

2 works

Ewout van den Berg

Researcher

Ewout van den Berg contributes to research discovery and scholarly infrastructure.

Open to collaborate
