Researcher profile

Weijie J. Su

Weijie J. Su contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
16topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2022arXiv

A Power Analysis for Model-X Knockoffs with $\ell_{p}$-Regularized Statistics

Variable selection properties of procedures utilizing penalized-likelihood estimates is a central topic in the study of high dimensional linear regression problems. Existing literature emphasizes the quality of ranking of the variables by such procedures as reflected in the receiver operating characteristic curve or in prediction performance. Specifically, recent works have harnessed modern theory of approximate message-passing (AMP) to obtain, in a particular setting, exact asymptotic predictions of the type I-type II error tradeoff for selection procedures that rely on $\ell_{p}$-regularized estimators. In practice, effective ranking by itself is often not sufficient because some calibration for Type I error is required. In this work we study theoretically the power of selection procedures that similarly rank the features by the size of an $\ell_{p}$-regularized estimator, but further use Model-X knockoffs to control the false discovery rate in the realistic situation where no prior information about the signal is available. In analyzing the power of the resulting procedure, we extend existing results in AMP theory to handle the pairing between original variables and their knockoffs. This is used to derive exact asymptotic predictions for power. We apply the general results to compare the power of the knockoffs versions of Lasso and thresholded-Lasso selection, and demonstrate that in the i.i.d. covariate setting under consideration, tuning by cross-validation on the augmented design matrix is nearly optimal. We further demonstrate how the techniques allow to analyze also the Type S error, and a corresponding notion of power, when selections are supplemented with a decision on the sign of the coefficient.

preprint2022arXiv

An Unconstrained Layer-Peeled Perspective on Neural Collapse

Neural collapse is a highly symmetric geometric pattern of neural networks that emerges during the terminal phase of training, with profound implications on the generalization performance and robustness of the trained networks. To understand how the last-layer features and classifiers exhibit this recently discovered implicit bias, in this paper, we introduce a surrogate model called the unconstrained layer-peeled model (ULPM). We prove that gradient flow on this model converges to critical points of a minimum-norm separation problem exhibiting neural collapse in its global minimizer. Moreover, we show that the ULPM with the cross-entropy loss has a benign global landscape for its loss function, which allows us to prove that all the critical points are strict saddle points except the global minimizers that exhibit the neural collapse phenomenon. Empirically, we show that our results also hold during the training of neural networks in real-world tasks when explicit regularization or weight decay is not used.

preprint2022arXiv

Characterizing the SLOPE Trade-off: A Variational Perspective and the Donoho-Tanner Limit

Sorted l1 regularization has been incorporated into many methods for solving high-dimensional statistical estimation problems, including the SLOPE estimator in linear regression. In this paper, we study how this relatively new regularization technique improves variable selection by characterizing the optimal SLOPE trade-off between the false discovery proportion (FDP) and true positive proportion (TPP) or, equivalently, between measures of type I error and power. Assuming a regime of linear sparsity and working under Gaussian random designs, we obtain an upper bound on the optimal trade-off for SLOPE, showing its capability of breaking the Donoho-Tanner power limit. To put it into perspective, this limit is the highest possible power that the Lasso, which is perhaps the most popular l1-based method, can achieve even with arbitrarily strong effect sizes. Next, we derive a tight lower bound that delineates the fundamental limit of sorted l1 regularization in optimally trading the FDP off for the TPP. Finally, we show that on any problem instance, SLOPE with a certain regularization sequence outperforms the Lasso, in the sense of having a smaller FDP, larger TPP and smaller l2 estimation risk simultaneously. Our proofs are based on a novel technique that reduces a calculus of variations problem to a class of infinite-dimensional convex optimization problems and a very recent result from approximate message passing theory.

preprint2022arXiv

FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data

Algorithmic fairness plays an important role in machine learning and imposing fairness constraints during learning is a common approach. However, many datasets are imbalanced in certain label classes (e.g. "healthy") and sensitive subgroups (e.g. "older patients"). Empirically, this imbalance leads to a lack of generalizability not only of classification, but also of fairness properties, especially in over-parameterized models. For example, fairness-aware training may ensure equalized odds (EO) on the training data, but EO is far from being satisfied on new users. In this paper, we propose a theoretically-principled, yet Flexible approach that is Imbalance-Fairness-Aware (FIFA). Specifically, FIFA encourages both classification and fairness generalization and can be flexibly combined with many existing fair learning methods with logits-based losses. While our main focus is on EO, FIFA can be directly applied to achieve equalized opportunity (EqOpt); and under certain conditions, it can also be applied to other fairness notions. We demonstrate the power of FIFA by combining it with a popular fair classification algorithm, and the resulting algorithm achieves significantly better fairness generalization on several real-world datasets.

preprint2022arXiv

ROCK: Causal Inference Principles for Reasoning about Commonsense Causality

Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although being of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies on deep language models wholeheartedly, and is potentially susceptible to confounding co-occurrences. Motivated by classical causal principles, we articulate the central question of CCR and draw parallels between human subjects in observational studies and natural languages to adopt CCR to the potential-outcomes framework, which is the first such attempt for commonsense tasks. We propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision, and balances confounding effects using temporal propensities that are analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capabilities.

preprint2022arXiv

The Price of Competition: Effect Size Heterogeneity Matters in High Dimensions

In high-dimensional sparse regression, would increasing the signal-to-noise ratio while fixing the sparsity level always lead to better model selection? For high-dimensional sparse regression problems, surprisingly, in this paper we answer this question in the negative in the regime of linear sparsity for the Lasso method, relying on a new concept we term effect size heterogeneity. Roughly speaking, a regression coefficient vector has high effect size heterogeneity if its nonzero entries have significantly different magnitudes. From the viewpoint of this new measure, we prove that the false and true positive rates achieve the optimal trade-off uniformly along the Lasso path when this measure is maximal in a certain sense, and the worst trade-off is achieved when it is minimal in the sense that all nonzero effect sizes are roughly equal. Moreover, we demonstrate that the first false selection occurs much earlier when effect size heterogeneity is minimal than when it is maximal. The underlying cause of these two phenomena is, metaphorically speaking, the ``competition'' among variables with effect sizes of the same magnitude in entering the model. Taken together, our findings suggest that effect size heterogeneity shall serve as an important complementary measure to the sparsity of regression coefficients in the analysis of high-dimensional regression problems. Our proofs use techniques from approximate message passing theory as well as a novel technique for estimating the rank of the first false variable.

preprint2022arXiv

Weighted Training for Cross-Task Learning

In this paper, we introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning based on minimizing a representation-based task distance between the source and target tasks. We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees. The effectiveness of TAWT is corroborated through extensive experiments with BERT on four sequence tagging tasks in natural language processing (NLP), including part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning, such as the choice of the source data and the impact of fine-tuning.

preprint2022arXiv

You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism

I consider a setting where reviewers offer very noisy scores for several items for the selection of high-quality ones (e.g., peer review of large conference proceedings), whereas the owner of these items knows the true underlying scores but prefers not to provide this information. To address this withholding of information, in this paper, I introduce the Isotonic Mechanism, a simple and efficient approach to improving imprecise raw scores by leveraging certain information that the owner is incentivized to provide. This mechanism takes the ranking of the items from best to worst provided by the owner as input, in addition to the raw scores provided by the reviewers. It reports the adjusted scores for the items by solving a convex optimization problem. Under certain conditions, I show that the owner's optimal strategy is to honestly report the true ranking of the items to her best knowledge in order to maximize the expected utility. Moreover, I prove that the adjusted scores provided by this owner-assisted mechanism are significantly more accurate than the raw scores provided by the reviewers. This paper concludes with several extensions of the Isotonic Mechanism and some refinements of the mechanism for practical consideration.

preprint2021arXiv

Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training

In this paper, we introduce the \textit{Layer-Peeled Model}, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts of the network. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which in part explains the recently discovered phenomenon of neural collapse \cite{papyan2020prevalence}. More importantly, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term \textit{Minority Collapse}, which fundamentally limits the performance of deep learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon is first predicted by the Layer-Peeled Model before being confirmed by our computational experiments.

preprint2021arXiv

Federated $f$-Differential Privacy

Federated learning (FL) is a training paradigm where the clients collaboratively learn models by repeatedly sharing information without compromising much on the privacy of their local sensitive data. In this paper, we introduce federated $f$-differential privacy, a new notion specifically tailored to the federated setting, based on the framework of Gaussian differential privacy. Federated $f$-differential privacy operates on record level: it provides the privacy guarantee on each individual record of one client's data against adversaries. We then propose a generic private federated learning framework {PriFedSync} that accommodates a large family of state-of-the-art FL algorithms, which provably achieves federated $f$-differential privacy. Finally, we empirically demonstrate the trade-off between privacy guarantee and prediction performance for models trained by {PriFedSync} in computer vision tasks.

preprint2020arXiv

Deep Learning with Gaussian Differential Privacy

Deep learning models are often trained on datasets that contain sensitive information such as individuals' shopping transactions, personal contacts, and medical records. An increasingly important line of work therefore has sought to train neural networks subject to privacy constraints that are specified by differential privacy or its divergence-based relaxations. These privacy definitions, however, have weaknesses in handling certain important primitives (composition and subsampling), thereby giving loose or complicated privacy analyses of training neural networks. In this paper, we consider a recently proposed privacy definition termed \textit{$f$-differential privacy} [18] for a refined privacy analysis of training neural networks. Leveraging the appealing properties of $f$-differential privacy in handling composition and subsampling, this paper derives analytically tractable expressions for the privacy guarantees of both stochastic gradient descent and Adam used in training deep neural networks, without the need of developing sophisticated techniques as [3] did. Our results demonstrate that the $f$-differential privacy framework allows for a new privacy analysis that improves on the prior analysis~[3], which in turn suggests tuning certain parameters of neural networks for a better prediction accuracy without violating the privacy budget. These theoretically derived improvements are confirmed by our experiments in a range of tasks in image classification, text classification, and recommender systems. Python code to calculate the privacy cost for these experiments is publicly available in the \texttt{TensorFlow Privacy} library.

preprint2020arXiv

On Learning Rates and Schrödinger Operators

The learning rate is perhaps the single most important parameter in the training of neural networks and, more broadly, in stochastic (nonconvex) optimization. Accordingly, there are numerous effective, but poorly understood, techniques for tuning the learning rate, including learning rate decay, which starts with a large initial learning rate that is gradually decreased. In this paper, we present a general theoretical analysis of the effect of the learning rate in stochastic gradient descent (SGD). Our analysis is based on the use of a learning-rate-dependent stochastic differential equation (lr-dependent SDE) that serves as a surrogate for SGD. For a broad class of objective functions, we establish a linear rate of convergence for this continuous-time formulation of SGD, highlighting the fundamental importance of the learning rate in SGD, and contrasting to gradient descent and stochastic gradient Langevin dynamics. Moreover, we obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Witten-Laplacian, a special case of the Schrödinger operator associated with the lr-dependent SDE. Strikingly, this expression clearly reveals the dependence of the linear convergence rate on the learning rate -- the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions, whereas it stays constant for strongly convex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.

preprint2020arXiv

Sharp Composition Bounds for Gaussian Differential Privacy via Edgeworth Expansion

Datasets containing sensitive information are often sequentially analyzed by many algorithms. This raises a fundamental question in differential privacy regarding how the overall privacy bound degrades under composition. To address this question, we introduce a family of analytical and sharp privacy bounds under composition using the Edgeworth expansion in the framework of the recently proposed f-differential privacy. In contrast to the existing composition theorems using the central limit theorem, our new privacy bounds under composition gain improved tightness by leveraging the refined approximation accuracy of the Edgeworth expansion. Our approach is easy to implement and computationally efficient for any number of compositions. The superiority of these new bounds is confirmed by an asymptotic error analysis and an application to quantifying the overall privacy guarantees of noisy stochastic gradient descent used in training private deep neural networks.

preprint2020arXiv

The Local Elasticity of Neural Networks

This paper presents a phenomenon in neural networks that we refer to as \textit{local elasticity}. Roughly speaking, a classifier is said to be locally elastic if its prediction at a feature vector $\bx'$ is \textit{not} significantly perturbed, after the classifier is updated via stochastic gradient descent at a (labeled) feature vector $\bx$ that is \textit{dissimilar} to $\bx'$ in a certain sense. This phenomenon is shown to persist for neural networks with nonlinear activation functions through extensive simulations on real-life and synthetic datasets, whereas this is not observed in linear classifiers. In addition, we offer a geometric interpretation of local elasticity using the neural tangent kernel \citep{jacot2018neural}. Building on top of local elasticity, we obtain pairwise similarity measures between feature vectors, which can be used for clustering in conjunction with $K$-means. The effectiveness of the clustering algorithm on the MNIST and CIFAR-10 datasets in turn corroborates the hypothesis of local elasticity of neural networks on real-life data. Finally, we discuss some implications of local elasticity to shed light on several intriguing aspects of deep neural networks.