Source author record

Tianyi Lin

Tianyi Lin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.OC Machine Learning Computational Complexity Computation Data Structures and Algorithms

Catalog footprint

What is connected

18works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Explicit Second-Order Min-Max Optimization: Practical Algorithms and Complexity Analysis

We propose and analyze several inexact regularized Newton-type methods for finding a global saddle point of \emph{convex-concave} unconstrained min-max optimization problems. Compared to first-order methods, our understanding of second-order methods for min-max optimization is relatively limited, as obtaining global rates of convergence with second-order information can be much more involved. In this paper, we examine how second-order information is used to speed up extra-gradient methods, even under inexactness. In particular, we show that the proposed methods generate iterates that remain within a bounded set and that the averaged iterates converge to an $ε$-saddle point within $O(ε^{-2/3})$ iterations in terms of a restricted gap function. We also provide a simple routine for solving the subproblem at each iteration, requiring a single Schur decomposition and $O(\log\log(1/ε))$ calls to a linear system solver in a quasi-upper-triangular system. Thus, our method improves the existing line-search-based second-order min-max optimization methods by shaving off an $O(\log\log(1/ε))$ factor in the required number of Schur decompositions. Finally, we conduct experiments on synthetic and real data to demonstrate the efficiency of the proposed methods.

preprint2026arXiv

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods become important because their normalized layerwise updates can not only support hyperparameter transfer across model sizes but exploit input-output matrix norm geometry. At the same time, stochastic gradient noises in deep learning are often far from sub-Gaussian and may exhibit heavy tails. These crucial observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norm, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ with general norms, where the goal is to achieve an $ε$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any scale-invariant first-order method with spectral norm requires $Ω(\min\{m, n\}ε^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that a batched Scion method with spectral norm achieves the matching upper bound of $O(\min\{m, n\}ε^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}ε^{-\frac{5p-3}{2p-2}})$ when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.

preprint2023arXiv

Projection Robust Wasserstein Distance and Riemannian Optimization

Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.

preprint2022arXiv

Accelerating Adaptive Cubic Regularization of Newton's Method via Random Sampling

In this paper, we consider an unconstrained optimization model where the objective is a sum of a large number of possibly nonconvex functions, though overall the objective is assumed to be smooth and convex. Our bid to solving such model uses the framework of cubic regularization of Newton's method. As well known, the crux in cubic regularization is its utilization of the Hessian information, which may be computationally expensive for large-scale problems. To tackle this, we resort to approximating the Hessian matrix via sub-sampling. In particular, we propose to compute an approximated Hessian matrix by either \textit{uniformly}\/ or \textit{non-uniformly}\/ sub-sampling the components of the objective. Based upon such sampling strategy, we develop accelerated adaptive cubic regularization approaches and provide theoretical guarantees on global iteration complexity of $O(ε^{-1/3})$ with high probability, which matches that of the original accelerated cubic regularization methods \cite{Jiang-2017-Unified} using the \textit{full}\/ Hessian information. Interestingly, we show that in the worst case scenario our algorithm still achieves an $O\left(\log(ε^{-1})ε^{-5/6}\right)$ iteration complexity bound. The performances of the proposed methods on the regularized logistic regression problems show a clear effect of acceleration in terms of the epoch counts on several real data sets.

preprint2022arXiv

Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization

Distributionally robust supervised learning (DRSL) is emerging as a key paradigm for building reliable machine learning systems for real-world applications -- reflecting the need for classifiers and predictive models that are robust to the distribution shifts that arise from phenomena such as selection bias or nonstationarity. Existing algorithms for solving Wasserstein DRSL -- one of the most popular DRSL frameworks based around robustness to perturbations in the Wasserstein distance -- have serious limitations that limit their use in large-scale problems -- in particular they involve solving complex subproblems and they fail to make use of stochastic gradients. We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable stochastic extra-gradient algorithms which provably achieve faster convergence rates than existing approaches. We demonstrate their effectiveness on synthetic and real data when compared to existing DRSL approaches. Key to our results is the use of variance reduction and random reshuffling to accelerate stochastic min-max optimization, the analysis of which may be of independent interest.

preprint2022arXiv

Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm

We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the standard linear programming (LP) representation of the FS-WBP is \textit{not totally unimodular} when $m \geq 3$ and $n \geq 3$. This result resolves an open question pertaining to the relationship between the FS-WBP and the minimum-cost flow (MCF) problem since it proves that the FS-WBP in the standard LP form is not an MCF problem when $m \geq 3$ and $n \geq 3$. We also develop a provably fast \textit{deterministic} variant of the celebrated iterative Bregman projection (IBP) algorithm, named \textsc{FastIBP}, with a complexity bound of $\tilde{O}(mn^{7/3}\varepsilon^{-4/3})$, where $\varepsilon \in (0, 1)$ is the desired tolerance. This complexity bound is better than the best known complexity bound of $\tilde{O}(mn^2\varepsilon^{-2})$ for the IBP algorithm in terms of $\varepsilon$, and that of $\tilde{O}(mn^{5/2}\varepsilon^{-1})$ from accelerated alternating minimization algorithm or accelerated primal-dual adaptive gradient algorithm in terms of $n$. Finally, we conduct extensive experiments with both synthetic data and real images and demonstrate the favorable performance of the \textsc{FastIBP} algorithm in practice.

preprint2022arXiv

On Structured Filtering-Clustering: Global Error Bound and Optimal First-Order Algorithms

The filtering-clustering models, including trend filtering and convex clustering, have become an important source of ideas and modeling tools in machine learning and related fields. The statistical guarantee of optimal solutions in these models has been extensively studied yet the investigations on the computational aspect have remained limited. In particular, practitioners often employ the first-order algorithms in real-world applications and are impressed by their superior performance regardless of ill-conditioned structures of difference operator matrices, thus leaving open the problem of understanding the convergence property of first-order algorithms. This paper settles this open problem and contributes to the broad interplay between statistics and optimization by identifying a \textit{global error bound} condition, which is satisfied by a large class of dual filtering-clustering problems, and designing a class of \textit{generalized dual gradient ascent} algorithm, which is \textit{optimal} first-order algorithms in deterministic, finite-sum and online settings. Our results are new and help explain why the filtering-clustering models can be efficiently solved by first-order algorithms. We also provide the detailed convergence rate analysis for the proposed algorithms in different settings, shedding light on their potential to solve the filtering-clustering models efficiently. We also conduct experiments on real datasets and the numerical results demonstrate the effectiveness of our algorithms.

preprint2022arXiv

On the Efficiency of Entropic Regularized Algorithms for Optimal Transport

We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $ϕ$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrtδ\varepsilon^{-1})$ in which $δ>0$ stands for the regularity of $ϕ$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct the experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.

preprint2022arXiv

Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback

Motivated by applications to online learning in sparse estimation and Bayesian optimization, we consider the problem of online unconstrained nonsubmodular minimization with delayed costs in both full information and bandit feedback settings. In contrast to previous works on online unconstrained submodular minimization, we focus on a class of nonsubmodular functions with special structure, and prove regret guarantees for several variants of the online and approximate online bandit gradient descent algorithms in static and delayed scenarios. We derive bounds for the agent's regret in the full information and bandit feedback setting, even if the delay between choosing a decision and receiving the incurred cost is unbounded. Key to our approach is the notion of $(α, β)$-regret and the extension of the generic convex relaxation model from~\citet{El-2020-Optimal}, the analysis of which is of independent interest. We conduct and showcase several simulation studies to demonstrate the efficacy of our algorithms.

preprint2020arXiv

A Unified Adaptive Tensor Approximation Scheme to Accelerate Composite Convex Optimization

In this paper, we propose a unified two-phase scheme to accelerate any high-order regularized tensor approximation approach on the smooth part of a composite convex optimization model. The proposed scheme has the advantage of not needing to assume any prior knowledge of the Lipschitz constants for the gradient, the Hessian and/or high-order derivatives. This is achieved by tuning the parameters used in the algorithm \textit{adaptively} in its process of progression, which has been successfully incorporated in high-order nonconvex optimization (CartisGouldToint2018, Birgin-Gardenghi-Martinez-Santos-Toint-2017). In general, we show that the adaptive high-order method has an iteration bound of $O\left( 1 / ε^{1/(p+1)} \right)$ if the first $p$-th order derivative information is used in the approximation, which has the same iteration complexity as in that of the nonadaptive version in (Baes-2009, Nesterov-2018) where the Lipschitz constants are assumed to be known and the subproblems are assumed to be solved exactly. Thus, our results partially address the problem of incorporating adaptive strategies into the high-order {\it accelerated} methods raised by Nesterov in (Nesterov-2018), although our strategies cannot assure the convexity of the auxiliary problem and such adaptive strategies are already popular in high-order nonconvex optimization (CartisGouldToint2018, Birgin-Gardenghi-Martinez-Santos-Toint-2017). Our numerical experiment results show a clear effect of real acceleration displayed in the adaptive Newton's method with cubic regularization on a set of regularized logistic regression instances.

preprint2020arXiv

An ADMM-Based Interior-Point Method for Large-Scale Linear Programming

We propose a new framework to implement interior point method (IPM) to solve very large linear programs (LP). Traditional IPMs typically use Newton's method to approximately solve a subproblem that aims to minimize a log-barrier penalty function at each iteration. Due its connection to Newton's method, IPM is often classified as second-order method -- a genre that is attached with stability and accuracy at the expense of scalability. Indeed, computing a Newton step amounts to solving a large linear system, which can be efficiently implemented if the input data are reasonably-sized and/or sparse and/or well-structured. However, in case the above premises fail, then the challenge still stands on the way for a traditional IPM. To deal with this challenge, one approach is to apply the iterative procedure, such as preconditioned conjugate gradient method, to solve the linear system. Since the linear system is different each iteration, it is difficult to find good pre-conditioner to achieve the overall solution efficiency. In this paper, an alternative approach is proposed. Instead of applying Newton's method, we resort to the alternating direction method of multipliers (ADMM) to approximately minimize the log-barrier penalty function at each iteration, under the framework of primal-dual path-following for a homogeneous self-dual embedded LP model. The resulting algorithm is an ADMM-Based Interior Point Method, abbreviated as ABIP in this paper. The new method inherits stability from IPM, and scalability from ADMM. Because of its self-dual embedding structure, ABIP is set to solve any LP without requiring prior knowledge about its feasibility. We conduct extensive numerical experiments testing ABIP with large-scale LPs from NETLIB and machine learning applications. The results demonstrate that ABIP compares favorably with existing LP solvers including SDPT3, MOSEK, DSDP-CG and SCS.

preprint2020arXiv

Improved Sample Complexity for Stochastic Compositional Variance Reduced Gradient

Convex composition optimization is an emerging topic that covers a wide range of applications arising from stochastic optimal control, reinforcement learning and multi-stage stochastic programming. Existing algorithms suffer from unsatisfactory sample complexity and practical issues since they ignore the convexity structure in the algorithmic design. In this paper, we develop a new stochastic compositional variance-reduced gradient algorithm with the sample complexity of $O((m+n)\log(1/ε)+1/ε^3)$ where $m+n$ is the total number of samples. Our algorithm is near-optimal as the dependence on $m+n$ is optimal up to a logarithmic factor. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of the new algorithm.

preprint2020arXiv

New Proximal Newton-Type Methods for Convex Optimization

In this paper, we propose new proximal Newton-type methods for convex optimization problems in composite form. The applications include model predictive control (MPC) and embedded MPC. Our new methods are computationally attractive since they do not require evaluating the Hessian at each iteration while keeping fast convergence rate. More specifically, we prove the global convergence is guaranteed and the superlinear convergence is achieved in the vicinity of an optimal solution. We also develop several practical variants by incorporating quasi-Newton and inexact subproblem solving schemes and provide theoretical guarantee for them under certain conditions. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of new methods.

preprint2017arXiv

Distributed Linearized Alternating Direction Method of Multipliers for Composite Convex Consensus Optimization

Given an undirected graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ of agents $\mathcal{N}=\{1,\ldots,N\}$ connected with edges in $\mathcal{E}$, we study how to compute an optimal decision on which there is consensus among agents and that minimizes the sum of agent-specific private convex composite functions $\{Φ_i\}_{i\in\mathcal{N}}$ while respecting privacy requirements, where $Φ_i\triangleq ξ_i + f_i$ belongs to agent-$i$. Assuming only agents connected by an edge can communicate, we propose a distributed proximal gradient method DPGA for consensus optimization over both unweighted and weighted static (undirected) communication networks. In one iteration, each agent-$i$ computes the prox map of $ξ_i$ and gradient of $f_i$, and this is followed by local communication with neighboring agents. We also study its stochastic gradient variant, SDPGA, which can only access to noisy estimates of $\nabla f_i$ at each agent-$i$. This computational model abstracts a number of applications in distributed sensing, machine learning and statistical inference. We show ergodic convergence in both sub-optimality error and consensus violation for DPGA and SDPGA with rates $\mathcal{O}(1/t)$ and $\mathcal{O}(1/\sqrt{t})$, respectively.

preprint2015arXiv

An Extragradient-Based Alternating Direction Method for Convex Minimization

In this paper, we consider the problem of minimizing the sum of two convex functions subject to linear linking constraints. The classical alternating direction type methods usually assume that the two convex functions have relatively easy proximal mappings. However, many problems arising from statistics, image processing and other fields have the structure that while one of the two functions has easy proximal mapping, the other function is smoothly convex but does not have an easy proximal mapping. Therefore, the classical alternating direction methods cannot be applied. To deal with the difficulty, we propose in this paper an alternating direction method based on extragradients. Under the assumption that the smooth function has a Lipschitz continuous gradient, we prove that the proposed method returns an $ε$-optimal solution within $O(1/ε)$ iterations. We apply the proposed method to solve a new statistical model called fused logistic regression. Our numerical experiments show that the proposed method performs very well when solving the test problems. We also test the performance of the proposed method through solving the lasso problem arising from statistics and compare the result with several existing efficient solvers for this problem; the results are very encouraging indeed.

preprint2015arXiv

Iteration Complexity Analysis of Multi-Block ADMM for a Family of Convex Minimization without Strong Convexity

The alternating direction method of multipliers (ADMM) is widely used in solving structured convex optimization problems due to its superior practical performance. On the theoretical side however, a counterexample was shown in [7] indicating that the multi-block ADMM for minimizing the sum of $N$ $(N\geq 3)$ convex functions with $N$ block variables linked by linear constraints may diverge. It is therefore of great interest to investigate further sufficient conditions on the input side which can guarantee convergence for the multi-block ADMM. The existing results typically require the strong convexity on parts of the objective. In this paper, we present convergence and convergence rate results for the multi-block ADMM applied to solve certain $N$-block $(N\geq 3)$ convex minimization problems without requiring strong convexity. Specifically, we prove the following two results: (1) the multi-block ADMM returns an $ε$-optimal solution within $O(1/ε^2)$ iterations by solving an associated perturbation to the original problem; (2) the multi-block ADMM returns an $ε$-optimal solution within $O(1/ε)$ iterations when it is applied to solve a certain sharing problem, under the condition that the augmented Lagrangian function satisfies the Kurdyka-Lojasiewicz property, which essentially covers most convex optimization models except for some pathological cases.

preprint2015arXiv

On the Global Linear Convergence of the ADMM with Multi-Block Variables

The alternating direction method of multipliers (ADMM) has been widely used for solving structured convex optimization problems. In particular, the ADMM can solve convex programs that minimize the sum of $N$ convex functions with $N$-block variables linked by some linear constraints. While the convergence of the ADMM for $N=2$ was well established in the literature, it remained an open problem for a long time whether or not the ADMM for $N \ge 3$ is still convergent. Recently, it was shown in [3] that without further conditions the ADMM for $N\ge 3$ may actually fail to converge. In this paper, we show that under some easily verifiable and reasonable conditions the global linear convergence of the ADMM when $N\geq 3$ can still be assured, which is important since the ADMM is a popular method for solving large scale multi-block optimization models and is known to perform very well in practice even when $N\ge 3$. Our study aims to offer an explanation for this phenomenon.

preprint2015arXiv

On the Sublinear Convergence Rate of Multi-Block ADMM

The alternating direction method of multipliers (ADMM) is widely used in solving structured convex optimization problems. Despite of its success in practice, the convergence properties of the standard ADMM for minimizing the sum of $N$ $(N\geq 3)$ convex functions with $N$ block variables linked by linear constraints, have remained unclear for a very long time. In this paper, we present convergence and convergence rate results for the standard ADMM applied to solve $N$-block $(N\geq 3)$ convex minimization problem, under the condition that one of these functions is convex (not necessarily strongly convex) and the other $N-1$ functions are strongly convex. Specifically, in that case the ADMM is proven to converge with rate $O(1/t)$ in a certain ergodic sense, and $o(1/t)$ in non-ergodic sense, where $t$ denotes the number of iterations. As a by-product, we also provide a simple proof for the $O(1/t)$ convergence rate of two-block ADMM in terms of both objective error and constraint violation, without assuming any condition on the penalty parameter and strong convexity on the functions.

Tianyi Lin

What is connected

Connect this record

See the researcher in context

Building this map preview

18 published item(s)

Explicit Second-Order Min-Max Optimization: Practical Algorithms and Complexity Analysis

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

Projection Robust Wasserstein Distance and Riemannian Optimization

Accelerating Adaptive Cubic Regularization of Newton's Method via Random Sampling

Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization

Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm

On Structured Filtering-Clustering: Global Error Bound and Optimal First-Order Algorithms

On the Efficiency of Entropic Regularized Algorithms for Optimal Transport

Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback

A Unified Adaptive Tensor Approximation Scheme to Accelerate Composite Convex Optimization

An ADMM-Based Interior-Point Method for Large-Scale Linear Programming

Improved Sample Complexity for Stochastic Compositional Variance Reduced Gradient

New Proximal Newton-Type Methods for Convex Optimization

Distributed Linearized Alternating Direction Method of Multipliers for Composite Convex Consensus Optimization

An Extragradient-Based Alternating Direction Method for Convex Minimization

Iteration Complexity Analysis of Multi-Block ADMM for a Family of Convex Minimization without Strong Convexity

On the Global Linear Convergence of the ADMM with Multi-Block Variables

On the Sublinear Convergence Rate of Multi-Block ADMM