Source author record

Volkan Cevher

Volkan Cevher appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Catalog footprint

What is connected

75works

27topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

In this work, we develop proximal preconditioned gradient methods with a focus on spectral gradient methods providing a proximal extension to the Muon and Scion optimizers. We introduce a family of stochastic algorithms that can handle a wide variety of convex and nonconvex constraints and study its convergence under heavy-tailed noise, through a novel analysis tailored to the geometry of the proposed methods. We further propose a variance-reduced version, which achieves faster convergence under standard noise assumptions. Finally, we show that the polynomial iterations used in Muon are more accurately captured by a nonlinear preconditioner than by the ideal matrix sign, leading to a convergence analysis that more faithfully reflects practical implementations.

preprint2026arXiv

GRASP: Deterministic argument ranking in interaction graphs

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack--defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.

preprint2026arXiv

Optimistic Dual Averaging Unifies Modern Optimizers

We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.

preprint2026arXiv

Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

This work investigates multi-objective imitation learning: the problem of recovering policies that lie on the Pareto front given demonstrations from multiple Pareto-optimal experts in a Multi-Objective Markov Decision Process (MOMDP). Standard imitation approaches are ill-equipped for this regime, as naively aggregating conflicting expert trajectories can result in dominated policies. To address this, we introduce Multi-Output Augmented Behavioral Cloning (MA-BC), an algorithm that systematically partitions divergent expert data while pooling state-action pairs where no behavior conflict is observed. Theoretically, we prove that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner that considers each expert dataset independently. Furthermore, we establish a novel lower bound for multi-objective imitation learning, demonstrating that MA-BC is minimax optimal. Finally, we empirically validate our algorithm across diverse discrete environments and, guided by our theoretical insights, extend and evaluate MA-BC on a continuous Linear Quadratic Regulator (LQR) control task.

preprint2024arXiv

Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate

Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $d$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of ${O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$ for solving convex optimization problems. Here, $m$ represents the subspace dimension, which can be significantly smaller than $d$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the Krylov subspace associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.

preprint2023arXiv

Min-Max Optimization Made Simple: Approximating the Proximal Point Method via Contraction Maps

In this paper we present a first-order method that admits near-optimal convergence rates for convex/concave min-max problems while requiring a simple and intuitive analysis. Similarly to the seminal work of Nemirovski and the recent approach of Piliouras et al. in normal form games, our work is based on the fact that the update rule of the Proximal Point method (PP) can be approximated up to accuracy $ε$ with only $O(\log 1/ε)$ additional gradient-calls through the iterations of a contraction map. Then combining the analysis of (PP) method with an error-propagation analysis we establish that the resulting first order method, called Clairvoyant Extra Gradient, admits near-optimal time-average convergence for general domains and last-iterate convergence in the unconstrained case.

preprint2022arXiv

Adversarial Audio Synthesis with Complex-valued Polynomial Networks

Time-frequency (TF) representations in audio synthesis have been increasingly modeled with real-valued networks. However, overlooking the complex-valued nature of TF representations can result in suboptimal performance and require additional modules (e.g., for modeling the phase). To this end, we introduce complex-valued polynomial networks, called APOLLO, that integrate such complex-valued representations in a natural way. Concretely, APOLLO captures high-order correlations of the input elements using high-order tensors as scaling parameters. By leveraging standard tensor decompositions, we derive different architectures and enable modeling richer correlations. We outline such architectures and showcase their performance in audio generation across four benchmarks. As a highlight, APOLLO results in $17.5\%$ improvement over adversarial methods and $8.2\%$ over the state-of-the-art diffusion models on SC09 dataset in audio generation. Our models can encourage the systematic design of other efficient architectures on the complex field.

preprint2022arXiv

An Inexact Augmented Lagrangian Framework for Nonconvex Optimization with Nonlinear Constraints

We propose a practical inexact augmented Lagrangian method (iALM) for nonconvex problems with nonlinear constraints. We characterize the total computational complexity of our method subject to a verifiable geometric condition, which is closely related to the Polyak-Lojasiewicz and Mangasarian-Fromowitz conditions. In particular, when a first-order solver is used for the inner iterates, we prove that iALM finds a first-order stationary point with $\tilde{\mathcal{O}}(1/ε^4)$ calls to the first-order oracle. If, in addition, the problem is smooth and a second-order solver is used for the inner iterates, iALM finds a second-order stationary point with $\tilde{\mathcal{O}}(1/ε^5)$ calls to the second-order oracle, which matches the known theoretical complexity result in the literature. We also provide strong numerical evidence on large-scale machine learning problems, including the Burer-Monteiro factorization of semidefinite programs, and a novel nonconvex relaxation of the standard basis pursuit template. For these examples, we also show how to verify our geometric condition.

preprint2022arXiv

Controlling the Complexity and Lipschitz Constant improves polynomial nets

While the class of Polynomial Nets demonstrates comparable performance to neural networks (NN), it currently has neither theoretical generalization characterization nor robustness guarantees. To this end, we derive new complexity bounds for the set of Coupled CP-Decomposition (CCP) and Nested Coupled CP-decomposition (NCP) models of Polynomial Nets in terms of the $\ell_\infty$-operator-norm and the $\ell_2$-operator norm. In addition, we derive bounds on the Lipschitz constant for both models to establish a theoretical certificate for their robustness. The theoretical results enable us to propose a principled regularization scheme that we also evaluate experimentally in six datasets and show that it improves the accuracy as well as the robustness of the models to adversarial perturbations. We showcase how this regularization can be combined with adversarial training, resulting in further improvements.

preprint2022arXiv

Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization

We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. Existing CGM variants for this template either suffer from slow convergence rates, or require carefully increasing the batch size over the course of the algorithm's execution, which leads to computing full gradients. In contrast, the proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques. In applications we put special emphasis on problems with a large number of separable constraints. Such problems are prevalent among semidefinite programming (SDP) formulations arising in machine learning and theoretical computer science. We provide numerical experiments on matrix completion, unsupervised clustering, and sparsest-cut SDPs.

preprint2022arXiv

High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize

In this paper, we propose a new, simplified high probability analysis of AdaGrad for smooth, non-convex problems. More specifically, we focus on a particular accelerated gradient (AGD) template (Lan, 2020), through which we recover the original AdaGrad and its variant with averaging, and prove a convergence rate of $\mathcal O (1/ \sqrt{T})$ with high probability without the knowledge of smoothness and variance. We use a particular version of Freedman's concentration bound for martingale difference sequences (Kakade & Tewari, 2008) which enables us to achieve the best-known dependence of $\log (1 / δ)$ on the probability margin $δ$. We present our analysis in a modular way and obtain a complementary $\mathcal O (1 / T)$ convergence rate in the deterministic setting. To the best of our knowledge, this is the first high probability result for AdaGrad with a truly adaptive scheme, i.e., completely oblivious to the knowledge of smoothness and uniform variance bound, which simultaneously has best-known dependence of $\log( 1/ δ)$. We further prove noise adaptation property of AdaGrad under additional noise assumptions.

preprint2022arXiv

Kernel Conjugate Gradient Methods with Random Projections

We propose and study kernel conjugate gradient methods (KCGM) with random projections for least-squares regression over a separable Hilbert space. Considering two types of random projections generated by randomized sketches and Nyström subsampling, we prove optimal statistical results with respect to variants of norms for the algorithms under a suitable stopping rule. Particularly, our results show that if the projection dimension is proportional to the effective dimension of the problem, KCGM with randomized sketches can generalize optimally, while achieving a computational advantage. As a corollary, we derive optimal rates for classic KCGM in the well-conditioned regimes for the case that the target function may not be in the hypothesis space.

preprint2022arXiv

On the Complexity of a Practical Primal-Dual Coordinate Method

We prove complexity bounds for the primal-dual algorithm with random extrapolation and coordinate descent (PURE-CD), which has been shown to obtain good practical performance for solving convex-concave min-max problems with bilinear coupling. Our complexity bounds either match or improve the best-known results in the literature for both dense and sparse (strongly)-convex-(strongly)-concave problems.

preprint2022arXiv

On the convergence of stochastic primal-dual hybrid gradient

In this paper, we analyze the recently proposed stochastic primal-dual hybrid gradient (SPDHG) algorithm and provide new theoretical results. In particular, we prove almost sure convergence of the iterates to a solution with convexity and linear convergence with further structure, using standard step sizes independent of strong convexity or other regularity constants. In the general convex case, we also prove the $\mathcal{O}(1/k)$ convergence rate for the ergodic sequence, on expected primal-dual gap function. Our assumption for linear convergence is metric subregularity, which is satisfied for strongly convex-strongly concave problems in addition to many nonsmooth and/or nonstrongly convex problems, such as linear programs, Lasso, and support vector machines. We also provide numerical evidence showing that SPDHG with standard step sizes shows a competitive practical performance against its specialized strongly convex variant SPDHG-$μ$ and other state-of-the-art algorithms including variance reduction methods.

preprint2022arXiv

Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces

In this paper, we study regression problems over a separable Hilbert space with the square loss, covering non-parametric regression over a reproducing kernel Hilbert space. We investigate a class of spectral/regularized algorithms, including ridge regression, principal component regression, and gradient methods. We prove optimal, high-probability convergence results in terms of variants of norms for the studied algorithms, considering a capacity assumption on the hypothesis space and a general source condition on the target function. Consequently, we obtain almost sure convergence results with optimal rates. Our results improve and generalize previous results, filling a theoretical gap for the non-attainable cases.

preprint2022arXiv

Score matching enables causal discovery of nonlinear additive noise models

This paper demonstrates how to recover causal graphs from the score of the data distribution in non-linear additive (Gaussian) noise models. Using score matching algorithms as a building block, we show how to design a new generation of scalable causal discovery methods. To showcase our approach, we also propose a new efficient method for approximating the score's Jacobian, enabling to recover the causal graph. Empirically, we find that the new algorithm, called SCORE, is competitive with state-of-the-art causal discovery methods while being significantly faster.

preprint2022arXiv

The Spectral Bias of Polynomial Neural Networks

Polynomial neural networks (PNNs) have been recently shown to be particularly effective at image generation and face recognition, where high-frequency information is critical. Previous studies have revealed that neural networks demonstrate a $\textit{spectral bias}$ towards low-frequency functions, which yields faster learning of low-frequency components during training. Inspired by such studies, we conduct a spectral analysis of the Neural Tangent Kernel (NTK) of PNNs. We find that the $Π$-Net family, i.e., a recently proposed parametrization of PNNs, speeds up the learning of the higher frequencies. We verify the theoretical bias through extensive experiments. We expect our analysis to provide novel insights into designing architectures and learning frameworks by incorporating multiplicative interactions via polynomials.

preprint2021arXiv

The limits of min-max optimization algorithms: convergence to spurious non-critical sets

Compared to ordinary function minimization problems, min-max optimization algorithms encounter far greater challenges because of the existence of periodic cycles and similar phenomena. Even though some of these behaviors can be overcome in the convex-concave regime, the general case is considerably more difficult. On that account, we take an in-depth look at a comprehensive class of state-of-the art algorithms and prevalent heuristics in non-convex / non-concave problems, and we establish the following general results: a) generically, the algorithms' limit points are contained in the ICT sets of a common, mean-field system; b) the attractors of this system also attract the algorithms in question with arbitrarily high probability; and c) all algorithms avoid the system's unstable sets with probability 1. On the surface, this provides a highly optimistic outlook for min-max algorithms; however, we show that there exist spurious attractors that do not contain any stationary points of the problem under study. In this regard, our work suggests that existing min-max algorithms may be subject to inescapable convergence failures. We complement our theoretical analysis by illustrating such attractors in simple, two-dimensional, almost bilinear problems.

preprint2020arXiv

A new regret analysis for Adam-type algorithms

In this paper, we focus on a theory-practice gap for Adam and its variants (AMSgrad, AdamNC, etc.). In practice, these algorithms are used with a constant first-order moment parameter $β_{1}$ (typically between $0.9$ and $0.99$). In theory, regret guarantees for online convex optimization require a rapidly decaying $β_{1}\to0$ schedule. We show that this is an artifact of the standard analysis and propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $β_{1}$, without further assumptions. We also demonstrate the flexibility of our analysis on a wide range of different algorithms and settings.

preprint2020arXiv

A Newton Frank-Wolfe Method for Constrained Self-Concordant Minimization

We demonstrate how to scalably solve a class of constrained self-concordant minimization problems using linear minimization oracles (LMO) over the constraint set. We prove that the number of LMO calls of our method is nearly the same as that of the Frank-Wolfe method in the L-smooth case. Specifically, our Newton Frank-Wolfe method uses $\mathcal{O}(ε^{-ν})$ LMO's, where $ε$ is the desired accuracy and $ν:= 1 + o(1)$. In addition, we demonstrate how our algorithm can exploit the improved variants of the LMO-based schemes, including away-steps, to attain linear convergence rates. We also provide numerical evidence with portfolio design with the competitive ratio, D-optimal experimental design, and logistic regression with the elastic net where Newton Frank-Wolfe outperforms the state-of-the-art.

preprint2020arXiv

An Optimal-Storage Approach to Semidefinite Programming using Approximate Complementarity

This paper develops a new storage-optimal algorithm that provably solves generic semidefinite programs (SDPs) in standard form. This method is particularly effective for weakly constrained SDPs. The key idea is to formulate an approximate complementarity principle: Given an approximate solution to the dual SDP, the primal SDP has an approximate solution whose range is contained in the eigenspace with small eigenvalues of the dual slack matrix. For weakly constrained SDPs, this eigenspace has very low dimension, so this observation significantly reduces the search space for the primal solution. This result suggests an algorithmic strategy that can be implemented with minimal storage: (1) Solve the dual SDP approximately; (2) compress the primal SDP to the eigenspace with small eigenvalues of the dual slack matrix; (3) solve the compressed primal SDP. The paper also provides numerical experiments showing that this approach is successful for a range of interesting large-scale SDPs.

preprint2020arXiv

Conditional gradient methods for stochastically constrained convex minimization

We propose two novel conditional gradient-based methods for solving structured stochastic convex optimization problems with a large number of linear constraints. Instances of this template naturally arise from SDP-relaxations of combinatorial problems, which involve a number of constraints that is polynomial in the problem dimension. The most important feature of our framework is that only a subset of the constraints is processed at each iteration, thus gaining a computational advantage over prior works that require full passes. Our algorithms rely on variance reduction and smoothing used in conjunction with conditional gradient steps, and are accompanied by rigorous convergence guarantees. Preliminary numerical experiments are provided for illustrating the practical performance of the methods.

preprint2020arXiv

Convergence of adaptive algorithms for weakly convex constrained optimization

We analyze the adaptive first order algorithm AMSGrad, for solving a constrained stochastic optimization problem with a weakly convex objective. We prove the $\mathcal{\tilde O}(t^{-1/4})$ rate of convergence for the norm of the gradient of Moreau envelope, which is the standard stationarity measure for this class of problems. It matches the known rates that adaptive algorithms enjoy for the specific case of unconstrained smooth stochastic optimization. Our analysis works with mini-batch size of $1$, constant first and second order moment parameters, and possibly unbounded optimization domains. Finally, we illustrate the applications and extensions of our results to specific problems and algorithms.

preprint2020arXiv

Double-Loop Unadjusted Langevin Algorithm

A well-known first-order method for sampling from log-concave probability distributions is the Unadjusted Langevin Algorithm (ULA). This work proposes a new annealing step-size schedule for ULA, which allows to prove new convergence guarantees for sampling from a smooth log-concave distribution, which are not covered by existing state-of-the-art convergence guarantees. To establish this result, we derive a new theoretical bound that relates the Wasserstein distance to total variation distance between any two log-concave distributions that complements the reach of Talagrand T2 inequality. Moreover, applying this new step size schedule to an existing constrained sampling algorithm, we show state-of-the-art convergence rates for sampling from a constrained log-concave distribution, as well as improved dimension dependence.

preprint2020arXiv

Efficient Proximal Mapping of the 1-path-norm of Shallow Networks

We demonstrate two new important properties of the 1-path-norm of shallow neural networks. First, despite its non-smoothness and non-convexity it allows a closed form proximal operator which can be efficiently computed, allowing the use of stochastic proximal-gradient-type methods for regularized empirical risk minimization. Second, when the activation functions is differentiable, it provides an upper bound on the Lipschitz constant of the network. Such bound is tighter than the trivial layer-wise product of Lipschitz constants, motivating its use for training networks robust to adversarial perturbations. In practical experiments we illustrate the advantages of using the proximal mapping and we compare the robustness-accuracy trade-off induced by the 1-path-norm, L1-norm and layer-wise constraints on the Lipschitz constant (Parseval networks).

preprint2020arXiv

Environment Shaping in Reinforcement Learning using State Abstraction

One of the central challenges faced by a reinforcement learning (RL) agent is to effectively learn a (near-)optimal policy in environments with large state spaces having sparse and noisy feedback signals. In real-world applications, an expert with additional domain knowledge can help in speeding up the learning process via \emph{shaping the environment}, i.e., making the environment more learner-friendly. A popular paradigm in literature is \emph{potential-based reward shaping}, where the environment's reward function is augmented with additional local rewards using a potential function. However, the applicability of potential-based reward shaping is limited in settings where (i) the state space is very large, and it is challenging to compute an appropriate potential function, (ii) the feedback signals are noisy, and even with shaped rewards the agent could be trapped in local optima, and (iii) changing the rewards alone is not sufficient, and effective shaping requires changing the dynamics. We address these limitations of potential-based shaping methods and propose a novel framework of \emph{environment shaping using state abstraction}. Our key idea is to compress the environment's large state space with noisy signals to an abstracted space, and to use this abstraction in creating smoother and more effective feedback signals for the agent. We study the theoretical underpinnings of our abstraction-based environment shaping, and show that the agent's policy learnt in the shaped environment preserves near-optimal behavior in the original environment.

preprint2020arXiv

Interaction-limited Inverse Reinforcement Learning

This paper proposes an inverse reinforcement learning (IRL) framework to accelerate learning when the learner-teacher \textit{interaction} is \textit{limited} during training. Our setting is motivated by the realistic scenarios where a helpful teacher is not available or when the teacher cannot access the learning dynamics of the student. We present two different training strategies: Curriculum Inverse Reinforcement Learning (CIRL) covering the teacher's perspective, and Self-Paced Inverse Reinforcement Learning (SPIRL) focusing on the learner's perspective. Using experiments in simulations and experiments with a real robot learning a task from a human demonstrator, we show that our training strategies can allow a faster training than a random teacher for CIRL and than a batch learner for SPIRL.

preprint2020arXiv

Lipschitz constant estimation of Neural Networks via sparse polynomial optimization

We introduce LiPopt, a polynomial optimization framework for computing increasingly tighter upper bounds on the Lipschitz constant of neural networks. The underlying optimization problems boil down to either linear (LP) or semidefinite (SDP) programming. We show how to use the sparse connectivity of a network, to significantly reduce the complexity of computation. This is specially useful for convolutional as well as pruned neural networks. We conduct experiments on networks with random weights as well as networks trained on MNIST, showing that in the particular case of the $\ell_\infty$-Lipschitz constant, our approach yields superior estimates, compared to baselines available in the literature.

preprint2020arXiv

Mirrored Langevin Dynamics

We consider the problem of sampling from constrained distributions, which has posed significant challenges to both non-asymptotic analysis and algorithmic design. We propose a unified framework, which is inspired by the classical mirror descent, to derive novel first-order sampling schemes. We prove that, for a general target distribution with strongly convex potential, our framework implies the existence of a first-order algorithm achieving $\tilde{O}(ε^{-2}d)$ convergence, suggesting that the state-of-the-art $\tilde{O}(ε^{-6}d^5)$ can be vastly improved. With the important Latent Dirichlet Allocation (LDA) application in mind, we specialize our algorithm to sample from Dirichlet posteriors, and derive the first non-asymptotic $\tilde{O}(ε^{-2}d^2)$ rate for first-order sampling. We further extend our framework to the mini-batch setting and prove convergence rates when only stochastic gradients are available. Finally, we report promising experimental results for LDA on real datasets.

preprint2020arXiv

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm's rate of convergence to Hurwicz minimizers is $\mathcal{O}(1/n^{p})$ if the method is employed with a $Θ(1/n^p)$ step-size schedule. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.

preprint2020arXiv

Random extrapolation for primal-dual coordinate descent

We introduce a randomly extrapolated primal-dual coordinate descent method that adapts to sparsity of the data matrix and the favorable structures of the objective function. Our method updates only a subset of primal and dual variables with sparse data, and it uses large step sizes with dense data, retaining the benefits of the specific methods designed for each case. In addition to adapting to sparsity, our method attains fast convergence guarantees in favorable cases \textit{without any modifications}. In particular, we prove linear convergence under metric subregularity, which applies to strongly convex-strongly concave problems and piecewise linear quadratic functions. We show almost sure convergence of the sequence and optimal sublinear convergence rates for the primal-dual gap and objective values, in the general convex-concave case. Numerical evidence demonstrates the state-of-the-art empirical performance of our method in sparse and dense settings, matching and improving the existing methods.

preprint2020arXiv

Robust Maximization of Non-Submodular Objectives

We study the problem of maximizing a monotone set function subject to a cardinality constraint $k$ in the setting where some number of elements $τ$ is deleted from the returned set. The focus of this work is on the worst-case adversarial setting. While there exist constant-factor guarantees when the function is submodular, there are no guarantees for non-submodular objectives. In this work, we present a new algorithm Oblivious-Greedy and prove the first constant-factor approximation guarantees for a wider class of non-submodular objectives. The obtained theoretical bounds are the first constant-factor bounds that also hold in the linear regime, i.e. when the number of deletions $τ$ is linear in $k$. Our bounds depend on established parameters such as the submodularity ratio and some novel ones such as the inverse curvature. We bound these parameters for two important objectives including support selection and variance reduction. Finally, we numerically demonstrate the robust performance of Oblivious-Greedy for these two objectives on various datasets.

preprint2020arXiv

Scalable Learning-Based Sampling Optimization for Compressive Dynamic MRI

Compressed sensing applied to magnetic resonance imaging (MRI) allows to reduce the scanning time by enabling images to be reconstructed from highly undersampled data. In this paper, we tackle the problem of designing a sampling mask for an arbitrary reconstruction method and a limited acquisition budget. Namely, we look for an optimal probability distribution from which a mask with a fixed cardinality is drawn. We demonstrate that this problem admits a compactly supported solution, which leads to a deterministic optimal sampling mask. We then propose a stochastic greedy algorithm that (i) provides an approximate solution to this problem, and (ii) resolves the scaling issues of [1,2]. We validate its performance on in vivo dynamic MRI with retrospective undersampling, showing that our method preserves the performance of [1,2] while reducing the computational burden by a factor close to 200.

preprint2019arXiv

Optimization for Reinforcement Learning: From Single Agent to Cooperative Agents

This article reviews recent advances in multi-agent reinforcement learning algorithms for large-scale control systems and communication networks, which learn to communicate and cooperate. We provide an overview of this emerging field, with an emphasis on the decentralized setting under different coordination protocols. We highlight the evolution of reinforcement learning algorithms from single-agent to multi-agent systems, from a distributed optimization perspective, and conclude with future directions and challenges, in the hope to catalyze the growing synergy among distributed optimization, signal processing, and reinforcement learning communities.

preprint2016arXiv

A single-phase, proximal path-following framework

We propose a new proximal, path-following framework for a class of constrained convex problems. We consider settings where the nonlinear---and possibly non-smooth---objective part is endowed with a proximity operator, and the constraint set is equipped with a self-concordant barrier. Our approach relies on the following two main ideas. First, we re-parameterize the optimality condition as an auxiliary problem, such that a good initial point is available; by doing so, a family of alternative paths towards the optimum is generated. Second, we combine the proximal operator with path-following ideas to design a single-phase, proximal, path-following algorithm. Our method has several advantages. First, it allows handling non-smooth objectives via proximal operators; this avoids lifting the problem dimension in order to accommodate non-smooth components in optimization. Second, it consists of only a \emph{single phase}: While the overall convergence rate of classical path-following schemes for self-concordant objectives does not suffer from the initialization phase, proximal path-following schemes undergo slow convergence, in order to obtain a good starting point \cite{TranDinh2013e}. In this work, we show how to overcome this limitation in the proximal setting and prove that our scheme has the same $\mathcal{O}(\sqrtν\log(1/\varepsilon))$ worst-case iteration-complexity with standard approaches \cite{Nesterov2004,Nesterov1994} without requiring an initial phase, where $ν$ is the barrier parameter and $\varepsilon$ is a desired accuracy. Finally, our framework allows errors in the calculation of proximal-Newton directions, without sacrificing the worst-case iteration complexity. We demonstrate the merits of our algorithm via three numerical examples, where proximal operators play a key role.

preprint2016arXiv

An Efficient Streaming Algorithm for the Submodular Cover Problem

We initiate the study of the classical Submodular Cover (SC) problem in the data streaming model which we refer to as the Streaming Submodular Cover (SSC). We show that any single pass streaming algorithm using sublinear memory in the size of the stream will fail to provide any non-trivial approximation guarantees for SSC. Hence, we consider a relaxed version of SSC, where we only seek to find a partial cover. We design the first Efficient bicriteria Submodular Cover Streaming (ESC-Streaming) algorithm for this problem, and provide theoretical guarantees for its performance supported by numerical evidence. Our algorithm finds solutions that are competitive with the near-optimal offline greedy algorithm despite requiring only a single pass over the data stream. In our numerical experiments, we evaluate the performance of ESC-Streaming on active set selection and large-scale graph cover problems.

preprint2016arXiv

Converse Bounds for Noisy Group Testing with Arbitrary Measurement Matrices

We consider the group testing problem, in which one seeks to identify a subset of defective items within a larger set of items based on a number of noisy tests. While matching achievability and converse bounds are known in several cases of interest for i.i.d.~measurement matrices, less is known regarding converse bounds for arbitrary measurement matrices. We address this by presenting two converse bounds for arbitrary matrices and general noise models. First, we provide a strong converse bound ($\mathbb{P}[\mathrm{error}] \to 1$) that matches existing achievability bounds in several cases of interest. Second, we provide a weak converse bound ($\mathbb{P}[\mathrm{error}] \not\to 0$) that matches existing achievability bounds in greater generality.

preprint2016arXiv

Convex block-sparse linear regression with expanders -- provably

Sparse matrices are favorable objects in machine learning and optimization. When such matrices are used, in place of dense ones, the overall complexity requirements in optimization can be significantly reduced in practice, both in terms of space and run-time. Prompted by this observation, we study a convex optimization scheme for block-sparse recovery from linear measurements. To obtain linear sketches, we use expander matrices, i.e., sparse matrices containing only few non-zeros per column. Hitherto, to the best of our knowledge, such algorithmic solutions have been only studied from a non-convex perspective. Our aim here is to theoretically characterize the performance of convex approaches under such setting. Our key novelty is the expression of the recovery error in terms of the model-based norm, while assuring that solution lives in the model. To achieve this, we show that sparse model-based matrices satisfy a group version of the null-space property. Our experimental findings on synthetic and real applications support our claims for faster recovery in the convex setting -- as opposed to using dense sensing matrices, while showing a competitive recovery performance.

preprint2016arXiv

Fixed Points of Generalized Approximate Message Passing with Arbitrary Matrices

The estimation of a random vector with independent components passed through a linear transform followed by a componentwise (possibly nonlinear) output map arises in a range of applications. Approximate message passing (AMP) methods, based on Gaussian approximations of loopy belief propagation, have recently attracted considerable attention for such problems. For large random transforms, these methods exhibit fast convergence and admit precise analytic characterizations with testable conditions for optimality, even for certain non-convex problem instances. However, the behavior of AMP under general transforms is not fully understood. In this paper, we consider the generalized AMP (GAMP) algorithm and relate the method to more common optimization techniques. This analysis enables a precise characterization of the GAMP algorithm fixed-points that applies to arbitrary transforms. In particular, we show that the fixed points of the so-called max-sum GAMP algorithm for MAP estimation are critical points of a constrained maximization of the posterior density. The fixed-points of the sum-product GAMP algorithm for estimation of the posterior marginals can be interpreted as critical points of a certain free energy.

preprint2016arXiv

Frank-Wolfe Works for Non-Lipschitz Continuous Gradient Objectives: Scalable Poisson Phase Retrieval

We study a phase retrieval problem in the Poisson noise model. Motivated by the PhaseLift approach, we approximate the maximum-likelihood estimator by solving a convex program with a nuclear norm constraint. While the Frank-Wolfe algorithm, together with the Lanczos method, can efficiently deal with nuclear norm constraints, our objective function does not have a Lipschitz continuous gradient, and hence existing convergence guarantees for the Frank-Wolfe algorithm do not apply. In this paper, we show that the Frank-Wolfe algorithm works for the Poisson phase retrieval problem, and has a global convergence rate of O(1/t), where t is the iteration counter. We provide rigorous theoretical guarantee and illustrating numerical results.

preprint2016arXiv

Learning Data Triage: Linear Decoding Works for Compressive MRI

The standard approach to compressive sampling considers recovering an unknown deterministic signal with certain known structure, and designing the sub-sampling pattern and recovery algorithm based on the known structure. This approach requires looking for a good representation that reveals the signal structure, and solving a non-smooth convex minimization problem (e.g., basis pursuit). In this paper, another approach is considered: We learn a good sub-sampling pattern based on available training signals, without knowing the signal structure in advance, and reconstruct an accordingly sub-sampled signal by computationally much cheaper linear reconstruction. We provide a theoretical guarantee on the recovery error, and show via experiments on real-world MRI data the effectiveness of the proposed compressive MRI scheme.

preprint2016arXiv

Learning Non-Parametric Basis Independent Models from Point Queries via Low-Rank Methods

We consider the problem of learning multi-ridge functions of the form f(x) = g(Ax) from point evaluations of f. We assume that the function f is defined on an l_2-ball in R^d, g is twice continuously differentiable almost everywhere, and A \in R^{k \times d} is a rank k matrix, where k << d. We propose a randomized, polynomial-complexity sampling scheme for estimating such functions. Our theoretical developments leverage recent techniques from low rank matrix recovery, which enables us to derive a polynomial time estimator of the function f along with uniform approximation guarantees. We prove that our scheme can also be applied for learning functions of the form: f(x) = \sum_{i=1}^{k} g_i(a_i^T x), provided f satisfies certain smoothness conditions in a neighborhood around the origin. We also characterize the noise robustness of the scheme. Finally, we present numerical examples to illustrate the theoretical bounds in action.

preprint2016arXiv

Learning-based Compressive Subsampling

The problem of recovering a structured signal $\mathbf{x} \in \mathbb{C}^p$ from a set of dimensionality-reduced linear measurements $\mathbf{b} = \mathbf {A}\mathbf {x}$ arises in a variety of applications, such as medical imaging, spectroscopy, Fourier optics, and computerized tomography. Due to computational and storage complexity or physical constraints imposed by the problem, the measurement matrix $\mathbf{A} \in \mathbb{C}^{n \times p}$ is often of the form $\mathbf{A} = \mathbf{P}_Ω\boldsymbolΨ$ for some orthonormal basis matrix $\boldsymbolΨ\in \mathbb{C}^{p \times p}$ and subsampling operator $\mathbf{P}_Ω: \mathbb{C}^{p} \rightarrow \mathbb{C}^{n}$ that selects the rows indexed by $Ω$. This raises the fundamental question of how best to choose the index set $Ω$ in order to optimize the recovery performance. Previous approaches to addressing this question rely on non-uniform \emph{random} subsampling using application-specific knowledge of the structure of $\mathbf{x}$. In this paper, we instead take a principled learning-based approach in which a \emph{fixed} index set is chosen based on a set of training signals $\mathbf{x}_1,\dotsc,\mathbf{x}_m$. We formulate combinatorial optimization problems seeking to maximize the energy captured in these signals in an average-case or worst-case sense, and we show that these can be efficiently solved either exactly or approximately via the identification of modularity and submodularity structures. We provide both deterministic and statistical theoretical guarantees showing how the resulting measurement matrices perform on signals differing from the training signals, and we provide numerical examples showing our approach to be effective on a variety of data sets.

preprint2016arXiv

Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework

The support recovery problem consists of determining a sparse subset of a set of variables that is relevant in generating a set of observations, and arises in a diverse range of settings such as compressive sensing, and subset selection in regression, and group testing. In this paper, we take a unified approach to support recovery problems, considering general probabilistic models relating a sparse data vector to an observation vector. We study the information-theoretic limits of both exact and partial support recovery, taking a novel approach motivated by thresholding techniques in channel coding. We provide general achievability and converse bounds characterizing the trade-off between the error probability and number of measurements, and we specialize these to the linear, 1-bit, and group testing models. In several cases, our bounds not only provide matching scaling laws in the necessary and sufficient number of measurements, but also sharp thresholds with matching constant factors. Our approach has several advantages over previous approaches: For the achievability part, we obtain sharp thresholds under broader scalings of the sparsity level and other parameters (e.g., signal-to-noise ratio) compared to several previous works, and for the converse part, we not only provide conditions under which the error probability fails to vanish, but also conditions under which it tends to one.

preprint2016arXiv

On the Difficulty of Selecting Ising Models with Approximate Recovery

In this paper, we consider the problem of estimating the underlying graph associated with an Ising model given a number of independent and identically distributed samples. We adopt an \emph{approximate recovery} criterion that allows for a number of missed edges or incorrectly-included edges, in contrast with the widely-studied exact recovery problem. Our main results provide information-theoretic lower bounds on the sample complexity for graph classes imposing constraints on the number of edges, maximal degree, and other properties. We identify a broad range of scenarios where, either up to constant factors or logarithmic factors, our lower bounds match the best known lower bounds for the exact recovery criterion, several of which are known to be tight or near-tight. Hence, in these cases, approximate recovery has a similar difficulty to exact recovery in the minimax sense. Our bounds are obtained via a modification of Fano's inequality for handling the approximate recovery criterion, along with suitably-designed ensembles of graphs that can broadly be classed into two categories: (i) Those containing graphs that contain several isolated edges or cliques and are thus difficult to distinguish from the empty graph; (ii) Those containing graphs for which certain groups of nodes are highly correlated, thus making it difficult to determine precisely which edges connect them. We support our theoretical results on these ensembles with numerical experiments.

preprint2016arXiv

Partial Recovery Bounds for the Sparse Stochastic Block Model

In this paper, we study the information-theoretic limits of community detection in the symmetric two-community stochastic block model, with intra-community and inter-community edge probabilities $\frac{a}{n}$ and $\frac{b}{n}$ respectively. We consider the sparse setting, in which $a$ and $b$ do not scale with $n$, and provide upper and lower bounds on the proportion of community labels recovered on average. We provide a numerical example for which the bounds are near-matching for moderate values of $a - b$, and matching in the limit as $a-b$ grows large.

preprint2016arXiv

Time-Varying Gaussian Process Bandit Optimization

We consider the sequential Bayesian optimization problem with bandit feedback, adopting a formulation that allows for the reward function to vary with time. We model the reward function using a Gaussian process whose evolution obeys a simple Markov model. We introduce two natural extensions of the classical Gaussian process upper confidence bound (GP-UCB) algorithm. The first, R-GP-UCB, resets GP-UCB at regular intervals. The second, TV-GP-UCB, instead forgets about old data in a smooth fashion. Our main contribution comprises of novel regret bounds for these algorithms, providing an explicit characterization of the trade-off between the time horizon and the rate at which the function varies. We illustrate the performance of the algorithms on both synthetic and real data, and we find the gradual forgetting of TV-GP-UCB to perform favorably compared to the sharp resetting of R-GP-UCB. Moreover, both algorithms significantly outperform classical GP-UCB, since it treats stale and fresh data equally.

preprint2016arXiv

Truncated Variance Reduction: A Unified Approach to Bayesian Optimization and Level-Set Estimation

We present a new algorithm, truncated variance reduction (TruVaR), that treats Bayesian optimization (BO) and level-set estimation (LSE) with Gaussian processes in a unified fashion. The algorithm greedily shrinks a sum of truncated variances within a set of potential maximizers (BO) or unclassified points (LSE), which is updated based on confidence bounds. TruVaR is effective in several important settings that are typically non-trivial to incorporate into myopic algorithms, including pointwise costs and heteroscedastic noise. We provide a general theoretical guarantee for TruVaR covering these aspects, and use it to recover and strengthen existing results on BO and LSE. Moreover, we provide a new result for a setting where one can select from a number of noise levels having associated costs. We demonstrate the effectiveness of the algorithm on both synthetic and real-world data sets.

preprint2015arXiv

A Geometric View on Constrained M-Estimators

We study the estimation error of constrained M-estimators, and derive explicit upper bounds on the expected estimation error determined by the Gaussian width of the constraint set. Both of the cases where the true parameter is on the boundary of the constraint set (matched constraint), and where the true parameter is strictly in the constraint set (mismatched constraint) are considered. For both cases, we derive novel universal estimation error bounds for regression in a generalized linear model with the canonical link function. Our error bound for the mismatched constraint case is minimax optimal in terms of its dependence on the sample size, for Gaussian linear regression by the Lasso.

preprint2015arXiv

A Primal-Dual Algorithmic Framework for Constrained Convex Minimization

We present a primal-dual algorithmic framework to obtain approximate solutions to a prototypical constrained convex optimization problem, and rigorously characterize how common structural assumptions affect the numerical efficiency. Our main analysis technique provides a fresh perspective on Nesterov's excessive gap technique in a structured fashion and unifies it with smoothing and primal-dual methods. For instance, through the choices of a dual smoothing strategy and a center point, our framework subsumes decomposition algorithms, augmented Lagrangian as well as the alternating direction method-of-multipliers methods as its special cases, and provides optimal convergence rates on the primal objective residual as well as the primal feasibility gap of the iterates for all.

preprint2015arXiv

A totally unimodular view of structured sparsity

This paper describes a simple framework for structured sparse recovery based on convex optimization. We show that many structured sparsity models can be naturally represented by linear matrix inequalities on the support of the unknown parameters, where the constraint matrix has a totally unimodular (TU) structure. For such structured models, tight convex relaxations can be obtained in polynomial time via linear programming. Our modeling framework unifies the prevalent structured sparsity norms in the literature, introduces new interesting ones, and renders their tightness and tractability arguments transparent.

preprint2015arXiv

A Universal Primal-Dual Convex Optimization Framework

We propose a new primal-dual algorithmic framework for a prototypical constrained convex optimization template. The algorithmic instances of our framework are universal since they can automatically adapt to the unknown Holder continuity degree and constant within the dual formulation. They are also guaran- teed to have optimal convergence rates in the objective residual and the feasibility gap for each Holder smoothness degree. In contrast to existing primal-dual algorithms, our framework avoids the proximity operator of the objective function. We instead leverage computationally cheaper, Fenchel-type operators, which are the main workhorses of the generalized conditional gradient (GCG)-type methods. In contrast to the GCG-type methods, our framework does not require the objective function to be differentiable, and can also process additional general linear inclusion constraints, while guarantees the convergence rate on the primal problem

preprint2015arXiv

Adaptive-Rate Sparse Signal Reconstruction With Application in Compressive Background Subtraction

We propose and analyze an online algorithm for reconstructing a sequence of signals from a limited number of linear measurements. The signals are assumed sparse, with unknown support, and evolve over time according to a generic nonlinear dynamical model. Our algorithm, based on recent theoretical results for $\ell_1$-$\ell_1$ minimization, is recursive and computes the number of measurements to be taken at each time on-the-fly. As an example, we apply the algorithm to compressive video background subtraction, a problem that can be stated as follows: given a set of measurements of a sequence of images with a static background, simultaneously reconstruct each image while separating its foreground from the background. The performance of our method is illustrated on sequences of real images: we observe that it allows a dramatic reduction in the number of measurements with respect to state-of-the-art compressive background subtraction schemes.

preprint2015arXiv

Group-Sparse Model Selection: Hardness and Relaxations

Group-based sparsity models are proven instrumental in linear regression problems for recovering signals from much fewer measurements than standard compressive sensing. The main promise of these models is the recovery of "interpretable" signals through the identification of their constituent groups. In this paper, we establish a combinatorial framework for group-model selection problems and highlight the underlying tractability issues. In particular, we show that the group-model selection problem is equivalent to the well-known NP-hard weighted maximum coverage problem (WMC). Leveraging a graph-based understanding of group models, we describe group structures which enable correct model selection in polynomial time via dynamic programming. Furthermore, group structures that lead to totally unimodular constraints have tractable discrete as well as convex relaxations. We also present a generalization of the group-model that allows for within group sparsity, which can be used to model hierarchical sparsity. Finally, we study the Pareto frontier of group-sparse approximations for two tractable models, among which the tree sparsity model, and illustrate selection and computation trade-offs between our framework and the existing convex relaxations.

preprint2015arXiv

Linear Inverse Problems with Norm and Sparsity Constraints

We describe two nonconventional algorithms for linear regression, called GAME and CLASH. The salient characteristics of these approaches is that they exploit the convex $\ell_1$-ball and non-convex $\ell_0$-sparsity constraints jointly in sparse recovery. To establish the theoretical approximation guarantees of GAME and CLASH, we cover an interesting range of topics from game theory, convex and combinatorial optimization. We illustrate that these approaches lead to improved theoretical guarantees and empirical performance beyond convex and non-convex solvers alone.

preprint2015arXiv

Structured Sparsity: Discrete and Convex approaches

Compressive sensing (CS) exploits sparsity to recover sparse or compressible signals from dimensionality reducing, non-adaptive sensing mechanisms. Sparsity is also used to enhance interpretability in machine learning and statistics applications: While the ambient dimension is vast in modern data analysis problems, the relevant information therein typically resides in a much lower dimensional space. However, many solutions proposed nowadays do not leverage the true underlying structure. Recent results in CS extend the simple sparsity idea to more sophisticated {\em structured} sparsity models, which describe the interdependency between the nonzero components of a signal, allowing to increase the interpretability of the results and lead to better recovery performance. In order to better understand the impact of structured sparsity, in this chapter we analyze the connections between the discrete models and their convex relaxations, highlighting their relative advantages. We start with the general group sparse model and then elaborate on two important special cases: the dispersive and the hierarchical models. For each, we present the models in their discrete nature, discuss how to solve the ensuing discrete problems and then describe convex relaxations. We also consider more general structures as defined by set functions and present their convex proxies. Further, we discuss efficient optimization solutions for structured sparsity problems and illustrate structured sparsity in action via three applications.

preprint2014arXiv

A variational approach to stable principal component pursuit

We introduce a new convex formulation for stable principal component pursuit (SPCP) to decompose noisy signals into low-rank and sparse representations. For numerical solutions of our SPCP formulation, we first develop a convex variational framework and then accelerate it with quasi-Newton methods. We show, via synthetic and real data experiments, that our approach offers advantages over the classical SPCP formulations in scalability and practical parameter selection.

preprint2014arXiv

An Inexact Proximal Path-Following Algorithm for Constrained Convex Minimization

Many scientific and engineering applications feature nonsmooth convex minimization problems over convex sets. In this paper, we address an important instance of this broad class where we assume that the nonsmooth objective is equipped with a tractable proximity operator and that the convex constraint set affords a self-concordant barrier. We provide a new joint treatment of proximal and self-concordant barrier concepts and illustrate that such problems can be efficiently solved, without the need of lifting the problem dimensions, as in disciplined convex optimization approach. We propose an inexact path-following algorithmic framework and theoretically characterize the worst-case analytical complexity of this framework when the proximal subproblems are solved inexactly. To show the merits of our framework, we apply its instances to both synthetic and real-world applications, where it shows advantages over standard interior point methods. As a by-product, we describe how our framework can obtain points on the Pareto frontier of regularized problems with self-concordant objectives in a tuning free fashion.

preprint2014arXiv

Bilinear Generalized Approximate Message Passing

We extend the generalized approximate message passing (G-AMP) approach, originally proposed for high-dimensional generalized-linear regression in the context of compressive sensing, to the generalized-bilinear case, which enables its application to matrix completion, robust PCA, dictionary learning, and related matrix-factorization problems. In the first part of the paper, we derive our Bilinear G-AMP (BiG-AMP) algorithm as an approximation of the sum-product belief propagation algorithm in the high-dimensional limit, where central-limit theorem arguments and Taylor-series approximations apply, and under the assumption of statistically independent matrix entries with known priors. In addition, we propose an adaptive damping mechanism that aids convergence under finite problem sizes, an expectation-maximization (EM)-based method to automatically tune the parameters of the assumed priors, and two rank-selection strategies. In the second part of the paper, we discuss the specializations of EM-BiG-AMP to the problems of matrix completion, robust PCA, and dictionary learning, and present the results of an extensive empirical study comparing EM-BiG-AMP to state-of-the-art algorithms on each problem. Our numerical results, using both synthetic and real-world datasets, demonstrate that EM-BiG-AMP yields excellent reconstruction accuracy (often best in class) while maintaining competitive runtimes and avoiding the need to tune algorithmic parameters.

preprint2014arXiv

Composite Self-Concordant Minimization

We propose a variable metric framework for minimizing the sum of a self-concordant function and a possibly non-smooth convex function, endowed with an easily computable proximal operator. We theoretically establish the convergence of our framework without relying on the usual Lipschitz gradient assumption on the smooth part. An important highlight of our work is a new set of analytic step-size selection and correction procedures based on the structure of the problem. We describe concrete algorithmic instances of our framework for several interesting applications and demonstrate them numerically on both synthetic and real data.

preprint2014arXiv

Convex Optimization for Big Data

This article reviews recent advances in convex optimization algorithms for Big Data, which aim to reduce the computational, storage, and communications bottlenecks. We provide an overview of this emerging field, describe contemporary approximation techniques like first-order methods and randomization for scalability, and survey the important role of parallel and distributed computation. The new Big Data algorithms are based on surprisingly simple principles and attain staggering accelerations even on classical problems.

preprint2014arXiv

Model-based Sketching and Recovery with Expanders

Linear sketching and recovery of sparse vectors with randomly constructed sparse matrices has numerous applications in several areas, including compressive sensing, data stream computing, graph sketching, and combinatorial group testing. This paper considers the same problem with the added twist that the sparse coefficients of the unknown vector exhibit further correlations as determined by a known sparsity model. We prove that exploiting model-based sparsity in recovery provably reduces the sketch size without sacrificing recovery quality. In this context, we present the model-expander iterative hard thresholding algorithm for recovering model sparse signals from linear sketches obtained via sparse adjacency matrices of expander graphs with rigorous performance guarantees. The main computational cost of our algorithm depends on the difficulty of projecting onto the model-sparse set. For the tree and group-based sparsity models we describe in this paper, such projections can be obtained in linear time. Finally, we provide numerical experiments to illustrate the theoretical results in action.

preprint2014arXiv

Scalable sparse covariance estimation via self-concordance

We consider the class of convex minimization problems, composed of a self-concordant function, such as the $\log\det$ metric, a convex data fidelity term $h(\cdot)$ and, a regularizing -- possibly non-smooth -- function $g(\cdot)$. This type of problems have recently attracted a great deal of interest, mainly due to their omnipresence in top-notch applications. Under this \emph{locally} Lipschitz continuous gradient setting, we analyze the convergence behavior of proximal Newton schemes with the added twist of a probable presence of inexact evaluations. We prove attractive convergence rate guarantees and enhance state-of-the-art optimization schemes to accommodate such developments. Experimental results on sparse covariance estimation show the merits of our algorithm, both in terms of recovery efficiency and complexity.

preprint2014arXiv

Sparsistency of $\ell_1$-Regularized $M$-Estimators

We consider the model selection consistency or sparsistency of a broad set of $\ell_1$-regularized $M$-estimators for linear and non-linear statistical models in a unified fashion. For this purpose, we propose the local structured smoothness condition (LSSC) on the loss function. We provide a general result giving deterministic sufficient conditions for sparsistency in terms of the regularization parameter, ambient dimension, sparsity level, and number of measurements. We show that several important statistical models have $M$-estimators that indeed satisfy the LSSC, and as a result, the sparsistency guarantees for the corresponding $\ell_1$-regularized $M$-estimators can be derived as simple applications of our main theorem.

preprint2013arXiv

A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions

We propose an algorithmic framework for convex minimization problems of a composite function with two terms: a self-concordant function and a possibly nonsmooth regularization term. Our method is a new proximal Newton algorithm that features a local quadratic convergence rate. As a specific instance of our framework, we consider the sparse inverse covariance matrix estimation in graph learning problems. Via a careful dual formulation and a novel analytic step-size selection procedure, our approach for graph learning avoids Cholesky decompositions and matrix inversions in its iteration making it attractive for parallel and distributed implementations.

preprint2013arXiv

Convexity in source separation: Models, geometry, and algorithms

Source separation or demixing is the process of extracting multiple components entangled within a signal. Contemporary signal processing presents a host of difficult source separation problems, from interference cancellation to background subtraction, blind deconvolution, and even dictionary learning. Despite the recent progress in each of these applications, advances in high-throughput sensor technology place demixing algorithms under pressure to accommodate extremely high-dimensional signals, separate an ever larger number of sources, and cope with more sophisticated signal and mixing models. These difficulties are exacerbated by the need for real-time action in automated decision-making systems. Recent advances in convex optimization provide a simple framework for efficiently solving numerous difficult demixing problems. This article provides an overview of the emerging field, explains the theory that governs the underlying procedures, and surveys algorithms that solve them efficiently. We aim to equip practitioners with a toolkit for constructing their own demixing algorithms that work, as well as concrete intuition for why they work.

preprint2013arXiv

Energy-aware adaptive bi-Lipschitz embeddings

We propose a dimensionality reducing matrix design based on training data with constraints on its Frobenius norm and number of rows. Our design criteria is aimed at preserving the distances between the data points in the dimensionality reduced space as much as possible relative to their distances in original data space. This approach can be considered as a deterministic Bi-Lipschitz embedding of the data points. We introduce a scalable learning algorithm, dubbed AMUSE, and provide a rigorous estimation guarantee by leveraging game theoretic tools. We also provide a generalization characterization of our matrix based on our sample data. We use compressive sensing problems as an example application of our problem, where the Frobenius norm design constraint translates into the sensing energy.

preprint2013arXiv

Matrix Recipes for Hard Thresholding Methods

In this paper, we present and analyze a new set of low-rank recovery algorithms for linear inverse problems within the class of hard thresholding methods. We provide strategies on how to set up these algorithms via basic ingredients for different configurations to achieve complexity vs. accuracy tradeoffs. Moreover, we study acceleration schemes via memory-based techniques and randomized, $ε$-approximate matrix projections to decrease the computational costs in the recovery process. For most of the configurations, we present theoretical analysis that guarantees convergence under mild problem conditions. Simulation results demonstrate notable performance improvements as compared to state-of-the-art algorithms both in terms of reconstruction accuracy and computational complexity.

preprint2013arXiv

Randomized Low-Memory Singular Value Projection

Affine rank minimization algorithms typically rely on calculating the gradient of a data error followed by a singular value decomposition at every iteration. Because these two steps are expensive, heuristic approximations are often used to reduce computational burden. To this end, we propose a recovery scheme that merges the two steps with randomized approximations, and as a result, operates on space proportional to the degrees of freedom in the problem. We theoretically establish the estimation guarantees of the algorithm as a function of approximation tolerance. While the theoretical approximation requirements are overly pessimistic, we demonstrate that in practice the algorithm performs well on the quantum tomography recovery problem.

preprint2012arXiv

Combinatorial Selection and Least Absolute Shrinkage via the CLASH Algorithm

The least absolute shrinkage and selection operator (LASSO) for linear regression exploits the geometric interplay of the $\ell_2$-data error objective and the $\ell_1$-norm constraint to arbitrarily select sparse models. Guiding this uninformed selection process with sparsity models has been precisely the center of attention over the last decade in order to improve learning performance. To this end, we alter the selection process of LASSO to explicitly leverage combinatorial sparsity models (CSMs) via the combinatorial selection and least absolute shrinkage (CLASH) operator. We provide concrete guidelines how to leverage combinatorial constraints within CLASH, and characterize CLASH's guarantees as a function of the set restricted isometry constants of the sensing matrix. Finally, our experimental results show that CLASH can outperform both LASSO and model-based compressive sensing in sparse estimation.

preprint2012arXiv

Compressible Distributions for High-dimensional Statistics

We develop a principled way of identifying probability distributions whose independent and identically distributed (iid) realizations are compressible, i.e., can be well-approximated as sparse. We focus on Gaussian random underdetermined linear regression (GULR) problems, where compressibility is known to ensure the success of estimators exploiting sparse regularization. We prove that many distributions revolving around maximum a posteriori (MAP) interpretation of sparse regularized estimators are in fact incompressible, in the limit of large problem sizes. A highlight is the Laplace distribution and $\ell^{1}$ regularized estimators such as the Lasso and Basis Pursuit denoising. To establish this result, we identify non-trivial undersampling regions in GULR where the simple least squares solution almost surely outperforms an oracle sparse solution, when the data is generated from the Laplace distribution. We provide simple rules of thumb to characterize classes of compressible (respectively incompressible) distributions based on their second and fourth moments. Generalized Gaussians and generalized Pareto distributions serve as running examples for concreteness.

preprint2012arXiv

Matrix ALPS: Accelerated Low Rank and Sparse Matrix Reconstruction

We propose Matrix ALPS for recovering a sparse plus low-rank decomposition of a matrix given its corrupted and incomplete linear measurements. Our approach is a first-order projected gradient method over non-convex sets, and it exploits a well-known memory-based acceleration technique. We theoretically characterize the convergence properties of Matrix ALPS using the stable embedding properties of the linear measurement operator. We then numerically illustrate that our algorithm outperforms the existing convex as well as non-convex state-of-the-art algorithms in computational efficiency without sacrificing stability.

preprint2012arXiv

Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings

We tackle the multi-party speech recovery problem through modeling the acoustic of the reverberant chambers. Our approach exploits structured sparsity models to perform room modeling and speech recovery. We propose a scheme for characterizing the room acoustic from the unknown competing speech sources relying on localization of the early images of the speakers by sparse approximation of the spatial spectra of the virtual sources in a free-space model. The images are then clustered exploiting the low-rank structure of the spectro-temporal components belonging to each source. This enables us to identify the early support of the room impulse response function and its unique map to the room geometry. To further tackle the ambiguity of the reflection ratios, we propose a novel formulation of the reverberation model and estimate the absorption coefficients through a convex optimization exploiting joint sparsity model formulated upon spatio-spectral sparsity of concurrent speech representation. The acoustic parameters are then incorporated for separating individual speech signals through either structured sparse recovery or inverse filtering the acoustic channels. The experiments conducted on real data recordings demonstrate the effectiveness of the proposed approach for multi-party speech recovery and recognition.

preprint2012arXiv

Sublinear Time, Approximate Model-based Sparse Recovery For All

We describe a probabilistic, {\it sublinear} runtime, measurement-optimal system for model-based sparse recovery problems through dimensionality reducing, {\em dense} random matrices. Specifically, we obtain a linear sketch $u\in \R^M$ of a vector $\bestsignal\in \R^N$ in high-dimensions through a matrix $Φ\in \R^{M\times N}$ $(M<N)$. We assume this vector can be well approximated by $K$ non-zero coefficients (i.e., it is $K$-sparse). In addition, the nonzero coefficients of $\bestsignal$ can obey additional structure constraints such as matroid, totally unimodular, or knapsack constraints, which dub as model-based sparsity. We construct the dense measurement matrix using a probabilistic method so that it satisfies the so-called restricted isometry property in the $\ell_2$-norm. While recovery using such matrices is measurement-optimal as they require the smallest sketch sizes $\numsam= O(\sparsity \log(\dimension/\sparsity))$, the existing algorithms require superlinear runtime $Ω(N\log(N/K))$ with the exception of Porat and Strauss, which requires $O(β^5ε^{-3}K(N/K)^{1/β}), ~β\in \mathbb{Z}_{+}, $ but provides an $\ell_1/\ell_1$ approximation guarantee. In contrast, our approach features $ O\big(\max \lbrace \sketch \sparsity \log^{O(1)} \dimension, ~\sketch \sparsity^2 \log^2 (\dimension/\sparsity) \rbrace\big) $ complexity where $ L \in \mathbb{Z}_{+}$ is a design parameter, independent of $\dimension$, requires a smaller sketch size, can accommodate model sparsity, and provides a stronger $\ell_2/\ell_1$ guarantee. Our system applies to "for all" sparse signals, is robust against bounded perturbations in $u$ as well as perturbations on $\bestsignal$ itself.

preprint2009arXiv

Model-Based Compressive Sensing

Compressive sensing (CS) is an alternative to Shannon/Nyquist sampling for the acquisition of sparse or compressible signals that can be well approximated by just K << N elements from an N-dimensional basis. Instead of taking periodic samples, CS measures inner products with M < N random vectors and then recovers the signal via a sparsity-seeking optimization or greedy algorithm. Standard CS dictates that robust signal recovery is possible from M = O(K log(N/K)) measurements. It is possible to substantially decrease M without sacrificing robustness by leveraging more realistic signal models that go beyond simple sparsity and compressibility by including structural dependencies between the values and locations of the signal coefficients. This paper introduces a model-based CS theory that parallels the conventional theory and provides concrete guidelines on how to create model-based recovery algorithms with provable performance guarantees. A highlight is the introduction of a new class of structured compressible signals along with a new sufficient condition for robust structured compressible signal recovery that we dub the restricted amplification property, which is the natural counterpart to the restricted isometry property of conventional CS. Two examples integrate two relevant signal models - wavelet trees and block sparsity - into two state-of-the-art CS recovery algorithms and prove that they offer robust recovery from just M=O(K) measurements. Extensive numerical simulations demonstrate the validity and applicability of our new theory and algorithms.

Volkan Cevher

What is connected

Connect this record

See the researcher in context

Building this map preview

75 published item(s)

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

GRASP: Deterministic argument ranking in interaction graphs

Optimistic Dual Averaging Unifies Modern Optimizers

Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate

Min-Max Optimization Made Simple: Approximating the Proximal Point Method via Contraction Maps

Adversarial Audio Synthesis with Complex-valued Polynomial Networks

An Inexact Augmented Lagrangian Framework for Nonconvex Optimization with Nonlinear Constraints

Controlling the Complexity and Lipschitz Constant improves polynomial nets

Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization

High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize

Kernel Conjugate Gradient Methods with Random Projections

On the Complexity of a Practical Primal-Dual Coordinate Method

On the convergence of stochastic primal-dual hybrid gradient

Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces

Score matching enables causal discovery of nonlinear additive noise models

The Spectral Bias of Polynomial Neural Networks

The limits of min-max optimization algorithms: convergence to spurious non-critical sets

A new regret analysis for Adam-type algorithms

A Newton Frank-Wolfe Method for Constrained Self-Concordant Minimization

An Optimal-Storage Approach to Semidefinite Programming using Approximate Complementarity

Conditional gradient methods for stochastically constrained convex minimization

Convergence of adaptive algorithms for weakly convex constrained optimization

Double-Loop Unadjusted Langevin Algorithm

Efficient Proximal Mapping of the 1-path-norm of Shallow Networks

Environment Shaping in Reinforcement Learning using State Abstraction

Interaction-limited Inverse Reinforcement Learning

Lipschitz constant estimation of Neural Networks via sparse polynomial optimization

Mirrored Langevin Dynamics

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Random extrapolation for primal-dual coordinate descent

Robust Maximization of Non-Submodular Objectives

Scalable Learning-Based Sampling Optimization for Compressive Dynamic MRI

Optimization for Reinforcement Learning: From Single Agent to Cooperative Agents

A single-phase, proximal path-following framework

An Efficient Streaming Algorithm for the Submodular Cover Problem

Converse Bounds for Noisy Group Testing with Arbitrary Measurement Matrices

Convex block-sparse linear regression with expanders -- provably

Fixed Points of Generalized Approximate Message Passing with Arbitrary Matrices

Frank-Wolfe Works for Non-Lipschitz Continuous Gradient Objectives: Scalable Poisson Phase Retrieval

Learning Data Triage: Linear Decoding Works for Compressive MRI

Learning Non-Parametric Basis Independent Models from Point Queries via Low-Rank Methods

Learning-based Compressive Subsampling

Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework

On the Difficulty of Selecting Ising Models with Approximate Recovery

Partial Recovery Bounds for the Sparse Stochastic Block Model

Time-Varying Gaussian Process Bandit Optimization

Truncated Variance Reduction: A Unified Approach to Bayesian Optimization and Level-Set Estimation

A Geometric View on Constrained M-Estimators

A Primal-Dual Algorithmic Framework for Constrained Convex Minimization

A totally unimodular view of structured sparsity

A Universal Primal-Dual Convex Optimization Framework

Adaptive-Rate Sparse Signal Reconstruction With Application in Compressive Background Subtraction

Group-Sparse Model Selection: Hardness and Relaxations

Linear Inverse Problems with Norm and Sparsity Constraints

Structured Sparsity: Discrete and Convex approaches

A variational approach to stable principal component pursuit

An Inexact Proximal Path-Following Algorithm for Constrained Convex Minimization

Bilinear Generalized Approximate Message Passing

Composite Self-Concordant Minimization

Convex Optimization for Big Data

Model-based Sketching and Recovery with Expanders

Scalable sparse covariance estimation via self-concordance

Sparsistency of $\ell_1$-Regularized $M$-Estimators

A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions

Convexity in source separation: Models, geometry, and algorithms

Energy-aware adaptive bi-Lipschitz embeddings

Matrix Recipes for Hard Thresholding Methods

Randomized Low-Memory Singular Value Projection

Combinatorial Selection and Least Absolute Shrinkage via the CLASH Algorithm

Compressible Distributions for High-dimensional Statistics

Matrix ALPS: Accelerated Low Rank and Sparse Matrix Reconstruction

Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings

Sublinear Time, Approximate Model-based Sparse Recovery For All