Source author record

Zheng Qu

Zheng Qu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


math.OC, Machine Learning, math.NA, Numerical Analysis, math.OA, Systems and Control, Computational Complexity, Data Structures and Algorithms, Information Theory, math.CA, math.DG, math.FA, math.IT, math.MG, math.PR, math.QA, Multiagent Systems, nlin.AO

Catalog footprint

What is connected

20 works

18 topics

4 close collaborators

preprint, 2022, arXiv

Dynamic N:M Fine-grained Structured Sparse Attention Mechanism

Transformers are becoming the mainstream solutions for various tasks like NLP and computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. Tremendous efforts have been made to alleviate this problem, and many of them successfully reduce the asymptotic complexity to linear. Nevertheless, most of them fail to achieve practical speedup over the original full attention under moderate sequence lengths and are unfriendly to finetuning. In this paper, we present DFSS, an attention mechanism that dynamically prunes the full attention weight matrix to an N:M fine-grained structured sparse pattern. We provide both theoretical and empirical evidence that demonstrates DFSS is a good approximation of the full attention mechanism. We propose a dedicated CUDA kernel design that completely eliminates the dynamic pruning overhead and achieves speedups under arbitrary sequence length. We evaluate the 1:2 and 2:4 sparsity under different configurations and achieve 1.27x to 1.89x speedups over the full-attention mechanism. It only takes a couple of finetuning epochs from the pretrained model to achieve accuracy on par with the full attention mechanism on tasks from various domains under different sequence lengths from 384 to 4096.
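To make the N:M pattern concrete, here is a minimal pure-Python sketch that keeps the N largest entries in every group of M consecutive entries of one row and zeroes the rest. The function name `prune_n_m`, the example row and the choice to zero pruned entries are assumptions made for illustration; this is not the paper's CUDA kernel and says nothing about how DFSS fuses pruning with the attention computation.

```python
def prune_n_m(row, n, m):
    """Keep the n largest entries in each group of m consecutive entries.

    Illustrative only: pruned entries are set to 0.0, and ties are broken
    in favour of the earlier position.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions (within the group) of the n largest values.
        order = sorted(range(m), key=lambda i: group[i], reverse=True)
        keep = set(order[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# Hypothetical row of attention weights, pruned to the 2:4 pattern.
scores = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4]
pruned = prune_n_m(scores, 2, 4)
print(pruned)
```

Every group of four entries ends up with exactly two non-zeros, which is the regularity that structured-sparse hardware paths rely on.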

preprint, 2021, arXiv

An adaptive proximal point algorithm framework and application to large-scale optimization

We investigate the proximal point algorithm (PPA) and its inexact extensions under an error bound condition, which guarantees a global linear convergence if the proximal regularization parameter is larger than the error bound condition parameter. We propose an adaptive generalized proximal point algorithm (AGPPA), which adaptively updates the proximal regularization parameters based on some implementable criteria. We show that AGPPA achieves linear convergence without any knowledge of the error bound condition parameter, and the rate only differs from the optimal one by a logarithm term. We apply AGPPA to convex minimization problems and analyze the iteration complexity bound of the resulting algorithm. Our framework and the complexity results apply to any linearly convergent inner solver and allow a hybrid with any locally fast convergent method. We illustrate the performance of AGPPA by applying it to solve large-scale linear programming (LP) problems. The resulting complexity bound has a weaker dependence on the Hoffman constant and scales with the dimension better than linearized ADMM. In numerical experiments, our algorithm demonstrates improved performance in obtaining solutions of medium accuracy on large-scale LP problems.
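As a reminder of what a single proximal point step looks like, the sketch below runs the basic (non-adaptive) PPA on the one-dimensional function f(x) = |x|, where each step x_{k+1} = argmin_x |x| + (x - x_k)^2 / (2 lam) has the closed form of soft-thresholding. The fixed parameter `lam`, the test function and the helper names are assumptions for illustration; the adaptive parameter update of AGPPA is not reproduced here.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * |x| evaluated at v."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def proximal_point(x0, lam, n_iters):
    """Basic proximal point iterations for f(x) = |x| with fixed lam."""
    x = x0
    trajectory = [x]
    for _ in range(n_iters):
        x = soft_threshold(x, lam)  # exact proximal step
        trajectory.append(x)
    return trajectory


trajectory = proximal_point(5.0, 1.0, 6)
print(trajectory)
```

Each step moves the iterate by at most `lam` toward the minimizer at 0 and then stays there, which shows how the regularization parameter controls the step length.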

preprint, 2016, arXiv

Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

Accelerated coordinate descent is widely used in optimization due to its cheap per-iteration cost and scalability to large-scale problems. Up to a primal-dual transformation, it is also the same as accelerated stochastic gradient descent, which is one of the central methods used in machine learning. In this paper, we improve the best known running time of accelerated coordinate descent by a factor of up to $\sqrt{n}$. Our improvement is based on a clean, novel non-uniform sampling that selects each coordinate with a probability proportional to the square root of its smoothness parameter. Our proof technique also deviates from the classical estimation sequence technique used in prior work. Our speed-up applies to important problems such as empirical risk minimization and solving linear systems, both in theory and in practice.
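The sampling rule described in this abstract can be written in a few lines: coordinate i is drawn with probability proportional to the square root of its smoothness parameter L_i. The function name and the example values of L_i below are assumptions for illustration; the accelerated method itself is not sketched.

```python
import math


def sqrt_smoothness_probabilities(smoothness):
    """Return p_i = sqrt(L_i) / sum_j sqrt(L_j) for positive L_i."""
    if any(L <= 0 for L in smoothness):
        raise ValueError("smoothness parameters must be positive")
    roots = [math.sqrt(L) for L in smoothness]
    total = sum(roots)
    return [r / total for r in roots]


# Hypothetical coordinate-wise smoothness parameters.
probs = sqrt_smoothness_probabilities([1.0, 4.0, 9.0])
print(probs)
```

With L = (1, 4, 9) the probabilities are 1/6, 2/6 and 3/6, so the sampling is less skewed than one proportional to L_i itself, which would give 1/14, 4/14 and 9/14.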

preprint, 2016, arXiv

Restarting accelerated gradient methods with a rough strong convexity estimate

We propose new restarting strategies for accelerated gradient and accelerated coordinate descent methods. Our main contribution is to show that the restarted method has a geometric rate of convergence for any restarting frequency, and so it allows us to benefit from restarting even when we do not know the strong convexity coefficient. The scheme can be combined with adaptive restarting, leading to the first provable convergence for adaptive restarting schemes with accelerated gradient methods. Finally, we illustrate the properties of the algorithm on a regularized logistic regression problem and on a Lasso problem.
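The idea of restarting at a fixed frequency can be sketched as follows: run a standard accelerated gradient method and, every `restart_every` iterations, reset the momentum. The test problem (a two-dimensional diagonal quadratic), the restart period and all names are assumptions chosen for illustration; this is a generic fixed-frequency restart, not the specific strategies analyzed in the paper.

```python
import math


def restarted_agd(grad, x0, lipschitz, restart_every, n_iters):
    """Accelerated gradient descent with momentum reset every restart_every steps."""
    x = list(x0)
    y = list(x0)
    t = 1.0
    for k in range(n_iters):
        if k > 0 and k % restart_every == 0:
            # Restart: forget the momentum and start again from x.
            y = list(x)
            t = 1.0
        g = grad(y)
        x_new = [yi - gi / lipschitz for yi, gi in zip(y, g)]
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_new
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x


# Hypothetical strongly convex quadratic f(x) = 0.5 * sum(c_i * x_i^2).
curvatures = [1.0, 100.0]


def objective(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))


def gradient(x):
    return [c * xi for c, xi in zip(curvatures, x)]


x_final = restarted_agd(gradient, [1.0, 1.0], lipschitz=100.0,
                        restart_every=30, n_iters=300)
print(objective(x_final))
```

The restart period of 30 was picked by hand and does not use the strong convexity constant, which is the situation the abstract addresses.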

preprint, 2015, arXiv

Coordinate Descent with Arbitrary Sampling I: Algorithms and Complexity

We study the problem of minimizing the sum of a smooth convex function and a convex block-separable regularizer and propose a new randomized coordinate descent method, which we call ALPHA. Our method at every iteration updates a random subset of coordinates, following an arbitrary distribution. No coordinate descent methods capable of handling an arbitrary sampling have been studied in the literature before for this problem. ALPHA is a remarkably flexible algorithm: in special cases, it reduces to deterministic and randomized methods such as gradient descent, coordinate descent, parallel coordinate descent and distributed coordinate descent -- both in nonaccelerated and accelerated variants. The variants with arbitrary (or importance) sampling are new. We provide a complexity analysis of ALPHA, from which we deduce as a direct corollary complexity bounds for its many variants, all matching or improving best known bounds.

preprint, 2015, arXiv

Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation

The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function.

preprint, 2015, arXiv

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

We propose a new algorithm for minimizing regularized empirical loss: Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each iteration we update a random subset of the dual variables. However, unlike existing methods such as stochastic dual coordinate ascent, SDNA is capable of utilizing all curvature information contained in the examples, which leads to striking improvements in both theory and practice - sometimes by orders of magnitude. In the special case when an L2-regularizer is used in the primal, the dual problem is a concave quadratic maximization problem plus a separable term. In this regime, SDNA in each step solves a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function; whence the name of the method. If, in addition, the loss functions are quadratic, our method can be interpreted as a novel variant of the recently introduced Iterative Hessian Sketch.

preprint, 2015, arXiv

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving regularized empirical risk minimization problems. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive methods.

preprint, 2014, arXiv

Adaptive Output Feedback based on Closed-loop Reference Models

This note presents the design and analysis of an adaptive controller for a class of linear plants in the presence of output feedback. This controller makes use of a closed-loop reference model as an observer, and guarantees global stability and asymptotic output tracking.

preprint, 2014, arXiv

Bundle-based pruning in the max-plus curse of dimensionality free method

Recently a new class of techniques termed the max-plus curse of dimensionality-free methods has been developed to solve nonlinear optimal control problems. In these methods the discretization in state space is avoided by using a max-plus basis expansion of the value function. This requires storing only the coefficients of the basis functions used for representation. However, the number of basis functions grows exponentially with respect to the number of time steps of propagation to the time horizon of the control problem. This so-called "curse of complexity" can be managed by applying a pruning procedure which selects the subset of basis functions that contribute most to the approximation of the value function. The pruning procedures described thus far in the literature rely on the solution of a sequence of high dimensional optimization problems which can become computationally expensive. In this paper we show that if the max-plus basis functions are linear and the region of interest in state space is convex, the pruning problem can be efficiently solved by the bundle method. This approach combining the bundle method and semidefinite formulations is applied to the quantum gate synthesis problem, in which the state space is the special unitary group (which is non-convex). This is based on the observation that the convexification of the unitary group leads to an exact relaxation. The results are studied and validated via examples.

preprint, 2014, arXiv

Checking the strict positivity of Kraus maps is NP-hard

Basic properties in Perron-Frobenius theory are strict positivity, primitivity and irreducibility. Whereas for nonnegative matrices, these properties are equivalent to elementary graph properties which can be checked in polynomial time, we show that for Kraus maps, the noncommutative generalization of stochastic matrices, checking strict positivity (whether the map sends the cone to its interior) is NP-hard. The proof proceeds by reducing to the latter problem the existence of a non-zero solution of a special system of bilinear equations. The complexity of irreducibility and primitivity is also discussed in the noncommutative setting.

preprint, 2014, arXiv

Dobrushin ergodicity coefficient for Markov operators on cones, and beyond

The analysis of classical consensus algorithms relies on contraction properties of adjoints of Markov operators, with respect to Hilbert's projective metric or to a related family of seminorms (Hopf's oscillation or Hilbert's seminorm). We generalize these properties to abstract consensus operators over normal cones, which include the unital completely positive maps (Kraus operators) arising in quantum information theory. In particular, we show that the contraction rate of such operators, with respect to the Hopf oscillation seminorm, is given by an analogue of Dobrushin's ergodicity coefficient. We derive from this result a characterization of the contraction rate of a non-linear flow, with respect to Hopf's oscillation seminorm and to Hilbert's projective metric.
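For readers unfamiliar with the two quantities named here, the following sketch evaluates them in the simplest finite-dimensional case, the standard positive cone of R^n: Hilbert's projective metric d(x, y) = log(max_i x_i/y_i) - log(min_i x_i/y_i) and Hopf's oscillation max_i x_i - min_i x_i. The function names and sample vectors are assumptions for illustration; the abstract's results concern general normal cones, which this sketch does not cover.

```python
import math


def hilbert_projective_metric(x, y):
    """Hilbert's projective metric between two strictly positive vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v <= 0 for v in list(x) + list(y)):
        raise ValueError("vectors must be strictly positive")
    ratios = [xi / yi for xi, yi in zip(x, y)]
    return math.log(max(ratios) / min(ratios))


def hopf_oscillation(x):
    """Hopf's oscillation seminorm on R^n: max entry minus min entry."""
    return max(x) - min(x)


# The metric is projective: rescaling a vector does not change distances.
print(hilbert_projective_metric([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(hilbert_projective_metric([1.0, 2.0], [2.0, 1.0]))
print(hopf_oscillation([3.0, 1.0, 2.0]))
```

On this cone the two are linked by d(x, y) = hopf_oscillation(log x - log y), taken entrywise.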

preprint, 2014, arXiv

Dobrushin's ergodicity coefficient for Markov operators on cones

We give a characterization of the contraction ratio of bounded linear maps in Banach space with respect to Hopf's oscillation seminorm, which is the infinitesimal distance associated to Hilbert's projective metric, in terms of the extreme points of a certain abstract "simplex". The formula is then applied to abstract Markov operators defined on arbitrary cones, which extend the row stochastic matrices acting on the standard positive cone and the completely positive unital maps acting on the cone of positive semidefinite matrices. When applying our characterization to a stochastic matrix, we recover the formula of Dobrushin's ergodicity coefficient. When applying our result to a completely positive unital map, we therefore obtain a noncommutative version of Dobrushin's ergodicity coefficient, which gives the contraction ratio of the map (representing a quantum channel or a "noncommutative Markov chain") with respect to the diameter of the spectrum. The contraction ratio of the dual operator (Kraus map) with respect to the total variation distance will be shown to be given by the same coefficient. We derive from the noncommutative Dobrushin's ergodicity coefficient an algebraic characterization of the convergence of a noncommutative consensus system or equivalently the ergodicity of a noncommutative Markov chain.
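In the classical case that the abstract recovers, Dobrushin's ergodicity coefficient of a row-stochastic matrix P is the largest total variation distance between two of its rows, delta(P) = (1/2) max_{i,j} sum_k |P_ik - P_jk|. A minimal sketch of that classical formula follows; the function name and the example matrices are assumptions for illustration, and the noncommutative version from the paper is not implemented.

```python
def dobrushin_coefficient(P):
    """Classical Dobrushin ergodicity coefficient of a row-stochastic matrix.

    Returns the maximum over row pairs of half the L1 distance between rows.
    """
    n = len(P)
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            tv = 0.5 * sum(abs(a - b) for a, b in zip(P[i], P[j]))
            worst = max(worst, tv)
    return worst


# Hypothetical 2x2 examples.
mixing = [[0.5, 0.5], [0.2, 0.8]]
identity = [[1.0, 0.0], [0.0, 1.0]]
rank_one = [[0.3, 0.7], [0.3, 0.7]]
print(dobrushin_coefficient(mixing))
print(dobrushin_coefficient(identity))
print(dobrushin_coefficient(rank_one))
```

A value below 1 means the matrix strictly contracts the oscillation max_i x_i - min_i x_i of any vector it is applied to; the identity gives 1 (no contraction) and a matrix with identical rows gives 0.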

preprint, 2014, arXiv

Fast Distributed Coordinate Descent for Non-Strongly Convex Losses

We propose an efficient distributed randomized coordinate descent method for minimizing regularized non-strongly convex loss functions. The method attains the optimal $O(1/k^2)$ convergence rate, where $k$ is the iteration counter. The core of the work is the theoretical study of stepsize parameters. We have implemented the method on Archer - the largest supercomputer in the UK - and show that the method is capable of solving a (synthetic) LASSO optimization problem with 50 billion variables.

preprint, 2014, arXiv

Randomized Dual Coordinate Ascent with Arbitrary Sampling

We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient mini-batch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for non-separable data.

preprint, 2014, arXiv

Semi-Stochastic Coordinate Descent

We propose a novel stochastic gradient method, semi-stochastic coordinate descent (S2CD), for the problem of minimizing a strongly convex function represented as the average of a large number of smooth convex functions: $f(x)=\tfrac{1}{n}\sum_i f_i(x)$. Our method first performs a deterministic step (computation of the gradient of $f$ at the starting point), followed by a large number of stochastic steps. The process is repeated a few times, with the last stochastic iterate becoming the new starting point where the deterministic step is taken. The novelty of our method is in how the stochastic steps are performed. In each such step, we pick a random function $f_i$ and a random coordinate $j$, both using nonuniform distributions, and update a single coordinate of the decision vector only, based on the computation of the $j^{th}$ partial derivative of $f_i$ at two different points. Each random step of the method constitutes an unbiased estimate of the gradient of $f$ and moreover, the squared norm of the steps goes to zero in expectation, meaning that the stochastic estimate of the gradient progressively improves. The complexity of the method is the sum of two terms: $O(n\log(1/\epsilon))$ evaluations of gradients $\nabla f_i$ and $O(\hat{\kappa}\log(1/\epsilon))$ evaluations of partial derivatives $\nabla_j f_i$, where $\hat{\kappa}$ is a novel condition number.

preprint, 2013, arXiv

Contraction of Riccati flows applied to the convergence analysis of a max-plus curse of dimensionality free method

Max-plus based methods have been recently explored for solution of first-order Hamilton-Jacobi-Bellman equations by several authors. In particular, McEneaney's curse-of-dimensionality free method applies to the equations where the Hamiltonian takes the form of a (pointwise) maximum of linear/quadratic forms. In previous works of McEneaney and Kluberg, the approximation error of the method was shown to be $O(1/(N\tau))+O(\sqrt{\tau})$ where $\tau$ is the time discretization step and $N$ is the number of iterations. Here we use a recently established contraction result for the indefinite Riccati flow in Thompson's metric to show that under different technical assumptions, still covering an important class of problems, the error is only of order $O(e^{-\alpha N\tau})+O(\tau)$. This also allows us to obtain improved estimates of the execution time and to tune the precision of the pruning procedure, which in practice is a critical element of the method.

preprint, 2013, arXiv

Squaring-Up Method In the Presence of Transmission Zeros

This paper presents a method to square up a generic MIMO system that already possesses transmission zeros. The proposed method builds on, and can therefore be incorporated into, the existing method that has been proven effective on a system without transmission zeros. It has been shown that for the generic system considered here, the squaring-up problem can be transformed into a state-feedback problem with uncontrollable modes.

preprint, 2012, arXiv

The contraction rate in Thompson metric of order-preserving flows on a cone - application to generalized Riccati equations

We give a formula for the Lipschitz constant in Thompson's part metric of any order-preserving flow on the interior of a (possibly infinite dimensional) closed convex pointed cone. This provides an explicit form of a characterization of Nussbaum concerning non order-preserving flows. As an application of this formula, we show that the flow of the generalized Riccati equation arising in stochastic linear quadratic control is a local contraction on the cone of positive definite matrices and characterize its Lipschitz constant by a matrix inequality. We also show that the same flow is no longer a contraction in other natural Finsler metrics on this cone, including the standard invariant Riemannian metric. This is motivated by a series of contraction properties concerning the standard Riccati equation, established by Bougerol, Liverani, Wojtkowski, Lawson, Lee and Lim: we show that some of these properties do, and that some others do not, carry over to the generalized Riccati equation.

preprint, 2011, arXiv

Curse of dimensionality reduction in max-plus based approximation methods: theoretical estimates and improved pruning algorithms

Max-plus based methods have been recently developed to approximate the value function of possibly high dimensional optimal control problems. A critical step of these methods consists in approximating a function by a supremum of a small number of functions (max-plus "basis functions") taken from a prescribed dictionary. We study several variants of this approximation problem, which we show to be continuous versions of the facility location and $k$-center combinatorial optimization problems, in which the connection costs arise from a Bregman distance. We give theoretical error estimates, quantifying the number of basis functions needed to reach a prescribed accuracy. We derive from our approach a refinement of the curse of dimensionality free method introduced previously by McEneaney, with a higher accuracy for a comparable computational cost.
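The approximation step mentioned in this abstract, representing a function as a supremum of a few "basis functions", can be illustrated in one dimension. The sketch below approximates f(x) = x^2 from below by the maximum of a handful of tangent lines. The target function, the choice of tangent points and the names are assumptions for illustration; the paper's dictionaries, error estimates and pruning algorithms are not reproduced.

```python
def tangent_basis(points):
    """Affine functions x -> 2*p*x - p*p, tangent to f(x) = x*x at each p.

    Each basis function is stored as a (slope, offset) pair.
    """
    return [(2.0 * p, -p * p) for p in points]


def sup_approximation(basis, x):
    """Evaluate the max-plus combination: the supremum of the basis functions."""
    return max(slope * x + offset for slope, offset in basis)


# Three hypothetical tangent points give a coarse lower approximation of x*x.
basis = tangent_basis([-1.0, 0.0, 1.0])
for x in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(x, sup_approximation(basis, x), x * x)
```

The approximation is exact at the tangent points and underestimates in between, so adding or removing basis functions trades accuracy against storage, which is the trade-off that pruning has to manage.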
