Researcher profile

Lihua Lei

Lihua Lei contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
12works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

12 published item(s)

preprint2025arXiv

Compound Estimation for Binomials

Many applications involve estimating the mean of multiple binomial outcomes as a common problem -- assessing intergenerational mobility of census tracts, estimating prevalence of infectious diseases across countries, and measuring click-through rates for different demographic groups. The most standard approach is to report the plain average of each outcome. Despite simplicity, the estimates are noisy when the sample sizes or mean parameters are small. In contrast, the Empirical Bayes (EB) methods are able to boost the average accuracy by borrowing information across tasks. Nevertheless, the EB methods require a Bayesian model where the parameters are sampled from a prior distribution which, unlike the commonly-studied Gaussian case, is unidentified due to discreteness of binomial measurements. Even if the prior distribution is known, the computation is difficult when the sample sizes are heterogeneous as there is no simple joint conjugate prior for the sample size and mean parameter. In this paper, we consider the compound decision framework which treats the sample size and mean parameters as fixed quantities. We develop an approximate Stein's Unbiased Risk Estimator (SURE) for the average mean squared error given any class of estimators. For a class of machine learning-assisted linear shrinkage estimators, we establish asymptotic optimality, regret bounds, and valid inference. Unlike existing work, we work with the binomials directly without resorting to Gaussian approximations. This allows us to work with small sample sizes and/or mean parameters in both one-sample and two-sample settings. We demonstrate our approach using three datasets on firm discrimination, education outcomes, and innovation rates.

preprint2022arXiv

What Estimators Are Unbiased For Linear Models?

The recent thought-provoking paper by Hansen [2022, Econometrica] proved that the Gauss-Markov theorem continues to hold without the requirement that competing estimators are linear in the vector of outcomes. Despite the elegant proof, it was shown by the authors and other researchers that the main result in the earlier version of Hansen's paper does not extend the classic Gauss-Markov theorem because no nonlinear unbiased estimator exists under his conditions. To address the issue, Hansen [2022] added statements in the latest version with new conditions under which nonlinear unbiased estimators exist. Motivated by the lively discussion, we study a fundamental problem: what estimators are unbiased for a given class of linear models? We first review a line of highly relevant work dating back to the 1960s, which, unfortunately, have not drawn enough attention. Then, we introduce notation that allows us to restate and unify results from earlier work and Hansen [2022]. The new framework also allows us to highlight differences among previous conclusions. Lastly, we establish new representation theorems for unbiased estimators under different restrictions on the linear model, allowing the coefficients and covariance matrix to take only a finite number of values, the higher moments of the estimator and the dependent variable to exist, and the error distribution to be discrete, absolutely continuous, or dominated by another probability measure. Our results substantially generalize the claims of parallel commentaries on Hansen [2022] and a remarkable result by Koopmann [1982].

preprint2020arXiv

Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization

Adaptivity is an important yet under-studied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners to select algorithms that work broadly without tweaking the hyperparameters. In this work, blending the "geometrization" technique introduced by Lei & Jordan 2016 and the \texttt{SARAH} algorithm of Nguyen et al., 2017, we propose the Geometrized \texttt{SARAH} algorithm for non-convex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak-Łojasiewicz (PL) constant if present. In addition, it achieves the best-available convergence rate for non-PL objectives simultaneously while outperforming existing algorithms for PL objectives.

preprint2020arXiv

An Assumption-Free Exact Test For Fixed-Design Linear Models With Exchangeable Errors

We propose the Cyclic Permutation Test (CPT) to test general linear hypotheses for linear models. This test is non-randomized and valid in finite samples with exact Type I error $α$ for an arbitrary fixed design matrix and arbitrary exchangeable errors, whenever $1 / α$ is an integer and $n / p \ge 1 / α- 1$. The test involves applying the marginal rank test to $1 / α$ linear statistics of the outcome vector, where the coefficient vectors are determined by solving a linear system such that the joint distribution of the linear statistics is invariant with respect to a non-standard cyclic permutation group under the null hypothesis.The power can be further enhanced by solving a secondary non-linear travelling salesman problem, for which the genetic algorithm can find a reasonably good solution. Extensive simulation studies show that the CPT has comparable power to existing tests. When testing for a single contrast of coefficients, an exact confidence interval can be obtained by inverting the test. Furthermore, we provide a selective yet extensive literature review of the century-long efforts on this problem, highlighting the novelty of our test.

preprint2020arXiv

Conditional calibration for false discovery rate control under dependence

We introduce a new class of methods for finite-sample false discovery rate (FDR) control in multiple testing problems with dependent test statistics where the dependence is fully or partially known. Our approach separately calibrates a data-dependent p-value rejection threshold for each hypothesis, relaxing or tightening the threshold as appropriate to target exact FDR control. In addition to our general framework we propose a concrete algorithm, the dependence-adjusted Benjamini-Hochberg (dBH) procedure, which adaptively thresholds the q-value for each hypothesis. Under positive regression dependence the dBH procedure uniformly dominates the standard BH procedure, and in general it uniformly dominates the Benjamini-Yekutieli (BY) procedure (also known as BH with log correction). Simulations and real data examples illustrate power gains over competing approaches to FDR control under dependence.

preprint2020arXiv

Hierarchical community detection by recursive partitioning

The problem of community detection in networks is usually formulated as finding a single partition of the network into some "correct" number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and separating the nodes into two communities by spectral clustering repeatedly, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm's theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We apply the algorithm to a gene network based on gene co-occurrence in 1580 research papers on anemia, and identify six clusters of genes in a meaningful hierarchy. We also illustrate the algorithm on a dataset of statistics papers.

preprint2020arXiv

On the Adaptivity of Stochastic Gradient-Based Optimization

Stochastic-gradient-based optimization has been a core enabling methodology in applications to large-scale problems in machine learning and related areas. Despite the progress, the gap between theory and practice remains significant, with theoreticians pursuing mathematical optimality at a cost of obtaining specialized procedures in different regimes (e.g., modulus of strong convexity, magnitude of target accuracy, signal-to-noise ratio), and with practitioners not readily able to know which regime is appropriate to their problem, and seeking broadly applicable algorithms that are reasonably close to optimality. To bridge these perspectives it is necessary to study algorithms that are adaptive to different regimes. We present the stochastically controlled stochastic gradient (SCSG) method for composite convex finite-sum optimization problems and show that SCSG is adaptive to both strong convexity and target accuracy. The adaptivity is achieved by batch variance reduction with adaptive batch sizes and a novel technique, which we referred to as geometrization, which sets the length of each epoch as a geometric random variable. The algorithm achieves strictly better theoretical complexity than other existing adaptive algorithms, while the tuning parameters of the algorithm only depend on the smoothness parameter of the objective.

preprint2020arXiv

Overlap in Observational Studies with High-Dimensional Covariates

Estimating causal effects under exogeneity hinges on two key assumptions: unconfoundedness and overlap. Researchers often argue that unconfoundedness is more plausible when more covariates are included in the analysis. Less discussed is the fact that covariate overlap is more difficult to satisfy in this setting. In this paper, we explore the implications of overlap in observational studies with high-dimensional covariates and formalize curse-of-dimensionality argument, suggesting that these assumptions are stronger than investigators likely realize. Our key innovation is to explore how strict overlap restricts global discrepancies between the covariate distributions in the treated and control populations. Exploiting results from information theory, we derive explicit bounds on the average imbalance in covariate means under strict overlap and show that these bounds become more restrictive as the dimension grows large. We discuss how these implications interact with assumptions and procedures commonly deployed in observational causal inference, including sparsity and trimming.

preprint2020arXiv

Regression adjustment in completely randomized experiments with a diverging number of covariates

Randomized experiments have become important tools in empirical research. In a completely randomized treatment-control experiment, the simple difference in means of the outcome is unbiased for the average treatment effect, and covariate adjustment can further improve the efficiency without assuming a correctly specified outcome model. In modern applications, experimenters often have access to many covariates, motivating the need for a theory of covariate adjustment under the asymptotic regime with a diverging number of covariates. We study the asymptotic properties of covariate adjustment under the potential outcomes model and propose a bias-corrected estimator that is consistent and asymptotically normal under weaker conditions. Our theory is purely randomization-based without imposing any parametric outcome model assumptions. To prove the theoretical results, we develop novel vector and matrix concentration inequalities for sampling without replacement.

preprint2020arXiv

STAR: A general interactive framework for FDR control under structural constraints

We propose a general framework based on selectively traversed accumulation rules (STAR) for interactive multiple testing with generic structural constraints on the rejection set. It combines accumulation tests from ordered multiple testing with data-carving ideas from post-selection inference, allowing for highly flexible adaptation to generic structural information. Our procedure defines an interactive protocol for gradually pruning a candidate rejection set, beginning with the set of all hypotheses and shrinking with each step. By restricting the information at each step via a technique we call masking, our protocol enables interaction while controlling the false discovery rate (FDR) in finite samples for any data-adaptive update rule that the analyst may choose. We suggest update rules for a variety of applications with complex structural constraints, show that STAR performs well for problems ranging from convex region detection to FDR control on directed acyclic graphs, and show how to extend it to regression problems where knockoff statistics are available in lieu of $p$-values.

preprint2020arXiv

Unified $\ell_{2\rightarrow\infty}$ Eigenspace Perturbation Theory for Symmetric Random Matrices

Modern applications in statistics, computer science and network science have seen tremendous values of finer matrix spectral perturbation theory. In this paper, we derive a generic $\ell_{2\rightarrow\infty}$ eigenspace perturbation bound for symmetric random matrices, with independent or dependent entries and fairly flexible entry distributions. In particular, we apply our generic bound to binary random matrices with independent entries or with certain dependency structures, including the unnormalized Laplacian of inhomogenous random graphs and $m$-dependent matrices. Through a detailed comparison, we found that for binary random matrices with independent entries, our $\ell_{2\rightarrow\infty}$ bound is tighter than all existing bounds that we are aware of, while our condition is weaker than all but one of them in a less common regime. We apply our perturbation bounds in three problems and improve the state of the art: concentration of the spectral norm of sparse random graphs, exact recovery of communities in stochastic block models and partial consistency of divisive hierarchical clustering. Finally we discuss the extensions of our theory to random matrices with more complex dependency structures and non-binary entries, asymmetric rectangular matrices and induced perturbation theory in other metrics.

preprint2020arXiv

Variance Reduction with Sparse Gradients

Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD, these methods require at least double the number of operations per update to model parameters. To reduce the computational cost of these methods, we introduce a new sparsity operator: The random-top-k operator. Our operator reduces computational complexity by estimating gradient sparsity exhibited in a variety of applications by combining the top-k operator and the randomized coordinate descent operator. With this operator, large batch gradients offer an extra benefit beyond variance reduction: A reliable estimate of gradient sparsity. Theoretically, our algorithm is at least as good as the best algorithm (SpiderBoost), and further excels in performance whenever the random-top-k operator captures gradient sparsity. Empirically, our algorithm consistently outperforms SpiderBoost using various models on various tasks including image classification, natural language processing, and sparse matrix factorization. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration.