Source author record

Samet Oymak

Samet Oymak appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Information Theory math.IT math.OC Systems and Control Data Structures and Algorithms eess.SY math.ST Statistics Theory Artificial Intelligence Computation and Language Distributed, Parallel, and Cluster Computing math.PR math.DS math.FA

Catalog footprint

What is connected

37works

15topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Latent Chain-of-Thought Improves Structured-Data Transformers

Chain-of-thought and more broadly test-time compute are known to augment the expressive capabilities of language models and have led to major innovations in reasoning. Motivated by this success, this paper explores latent chain-of-thought as well as the impact of depth and looping for time-series and tabular data. We propose a recurrent scheme in which a structured-data transformer, after an initial forward pass, compresses its query-position hidden states into feedback tokens that are appended to the input and processed again, allowing multiple rounds of latent computation before prediction. We compare CoT models against a same-depth no-CoT baseline, a deeper baseline matched to the CoT model in effective depth, and a looped transformer with weight-tied recurrence but no additional chain-of-thought tokens. Across 36 datasets in time-series forecasting and tabular prediction, latent chain-of-thought improves over the baseline on 7/9 time-series datasets (+12.63\% average gain) and 23/27 tabular datasets (+3.25\% average gain), with CoT models performing best on average in both settings. We also show that the benefit of CoT extends to pretrained foundation models: applying latent CoT to nanoTabPFN, a small open-source tabular foundation model, improves its performance above the much larger TabPFN-v2 on TabArena. Together, these results demonstrate that chain-of-thought is a useful axis for scaling test-time compute for structured data.

preprint2026arXiv

Stochastic Sparse Attention for Memory-Bound Inference

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git

preprint2026arXiv

VSPO: Vector-Steered Policy Optimization for Behavioral Control

Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in its response. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO) which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves the control along target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.

preprint2022arXiv

AutoBalance: Optimized Loss Functions for Imbalanced Data

Imbalanced datasets are commonplace in modern machine learning problems. The presence of under-represented classes or groups with sensitive attributes results in concerns about generalization and fairness. Such concerns are further exacerbated by the fact that large capacity deep nets can perfectly fit the training data and appear to achieve perfect accuracy and fairness during training, but perform poorly during test. To address these challenges, we propose AutoBalance, a bi-level optimization framework that automatically designs a training loss function to optimize a blend of accuracy and fairness-seeking objectives. Specifically, a lower-level problem trains the model weights, and an upper-level problem tunes the loss function by monitoring and optimizing the desired objective over the validation data. Our loss design enables personalized treatment for classes/groups by employing a parametric cross-entropy loss and individualized data augmentation schemes. We evaluate the benefits and performance of our approach for the application scenarios of imbalanced and group-sensitive classification. Extensive empirical evaluations demonstrate the benefits of AutoBalance over state-of-the-art approaches. Our experimental findings are complemented with theoretical insights on loss function design and the benefits of train-validation split. All code is available open-source.

preprint2022arXiv

FedNest: Federated Bilevel, Minimax, and Compositional Optimization

Standard federated optimization methods successfully apply to stochastic problems with single-level structure. However, many contemporary ML problems -- including adversarial robustness, hyperparameter tuning, and actor-critic -- fall under nested bilevel programming that subsumes minimax and compositional optimization. In this work, we propose \fedblo: A federated alternating stochastic gradient method to address general nested problems. We establish provable convergence rates for \fedblo in the presence of heterogeneous data and introduce variations for bilevel, minimax, and compositional optimization. \fedblo introduces multiple innovations including federated hypergradient computation and variance reduction to address inner-level heterogeneity. We complement our theory with experiments on hyperparameter \& hyper-representation learning and minimax optimization that demonstrate the benefits of our method in practice. Code is available at https://github.com/ucr-optml/FedNest.

preprint2022arXiv

Finite Sample Identification of Bilinear Dynamical Systems

Bilinear dynamical systems are ubiquitous in many different domains and they can also be used to approximate more general control-affine systems. This motivates the problem of learning bilinear systems from a single trajectory of the system's states and inputs. Under a mild marginal mean-square stability assumption, we identify how much data is needed to estimate the unknown bilinear system up to a desired accuracy with high probability. Our sample complexity and statistical error rates are optimal in terms of the trajectory length, the dimensionality of the system and the input size. Our proof technique relies on an application of martingale small-ball condition. This enables us to correctly capture the properties of the problem, specifically our error rates do not deteriorate with increasing instability. Finally, we show that numerical experiments are well-aligned with our theoretical results.

preprint2022arXiv

Generalization Guarantees for Neural Architecture Search with Train-Validation Split

Neural Architecture Search (NAS) is a popular method for automatically designing optimized architectures for high-performance deep learning. In this approach, it is common to use bilevel optimization where one optimizes the model weights over the training data (inner problem) and various hyperparameters such as the configuration of the architecture over the validation data (outer problem). This paper explores the statistical aspects of such problems with train-validation splits. In practice, the inner problem is often overparameterized and can easily achieve zero loss. Thus, a-priori it seems impossible to distinguish the right hyperparameters based on training loss alone which motivates a better understanding of the role of train-validation split. To this aim this work establishes the following results. (1) We show that refined properties of the validation loss such as risk and hyper-gradients are indicative of those of the true test loss. This reveals that the outer problem helps select the most generalizable model and prevent overfitting with a near-minimal validation sample size. This is established for continuous search spaces which are relevant for differentiable schemes. Extensions to transfer learning are developed in terms of the mismatch between training & validation distributions. (2) We establish generalization bounds for NAS problems with an emphasis on an activation search problem. When optimized with gradient-descent, we show that the train-validation procedure returns the best (model, architecture) pair even if all architectures can perfectly fit the training data to achieve zero error. (3) Finally, we highlight connections between NAS, multiple kernel learning, and low-rank matrix learning. The latter leads to novel insights where the solution of the outer problem can be accurately learned via efficient spectral methods to achieve near-minimal risk.

preprint2022arXiv

Non-Stationary Representation Learning in Sequential Linear Bandits

In this paper, we study representation learning for multi-task decision-making in non-stationary environments. We consider the framework of sequential linear bandits, where the agent performs a series of tasks drawn from distinct sets associated with different environments. The embeddings of tasks in each set share a low-dimensional feature extractor called representation, and representations are different across sets. We propose an online algorithm that facilitates efficient decision-making by learning and transferring non-stationary representations in an adaptive fashion. We prove that our algorithm significantly outperforms the existing ones that treat tasks independently. We also conduct experiments using both synthetic and real data to validate our theoretical insights and demonstrate the efficacy of our algorithm.

preprint2022arXiv

Representation Learning for Context-Dependent Decision-Making

Humans are capable of adjusting to changing environments flexibly and quickly. Empirical evidence has revealed that representation learning plays a crucial role in endowing humans with such a capability. Inspired by this observation, we study representation learning in the sequential decision-making scenario with contextual changes. We propose an online algorithm that is able to learn and transfer context-dependent representations and show that it significantly outperforms the existing ones that do not learn representations adaptively. As a case study, we apply our algorithm to the Wisconsin Card Sorting Task, a well-established test for the mental flexibility of humans in sequential decision-making. By comparing our algorithm with the standard Q-learning and Deep-Q learning algorithms, we demonstrate the benefits of adaptive representation learning.

preprint2022arXiv

System Identification via Nuclear Norm Regularization

This paper studies the problem of identifying low-order linear systems via Hankel nuclear norm regularization. Hankel regularization encourages the low-rankness of the Hankel matrix, which maps to the low-orderness of the system. We provide novel statistical analysis for this regularization and carefully contrast it with the unregularized ordinary least-squares (OLS) estimator. Our analysis leads to new bounds on estimating the impulse response and the Hankel matrix associated with the linear system. We first design an input excitation and show that Hankel regularization enables one to recover the system using optimal number of observations in the true system order and achieve strong statistical estimation rates. Surprisingly, we demonstrate that the input design indeed matters, by showing that intuitive choices such as i.i.d. Gaussian input leads to provably sub-optimal sample complexity. To better understand the benefits of regularization, we also revisit the OLS estimator. Besides refining existing bounds, we experimentally identify when regularized approach improves over OLS: (1) For low-order systems with slow impulse-response decay, OLS method performs poorly in terms of sample complexity, (2) Hankel matrix returned by regularization has a more clear singular value gap that ease identification of the system order, (3) Hankel regularization is less sensitive to hyperparameter choice. Finally, we establish model selection guarantees through a joint train-validation procedure where we tune the regularization parameter for near-optimal estimation.

preprint2022arXiv

Towards Sample-efficient Overparameterized Meta-learning

An overarching goal in machine learning is to build a generalizable model with few samples. To this end, overparameterization has been the subject of immense interest to explain the generalization ability of deep nets even when the size of the dataset is smaller than that of the model. While the prior literature focuses on the classical supervised setting, this paper aims to demystify overparameterization for meta-learning. Here we have a sequence of linear-regression tasks and we ask: (1) Given earlier tasks, what is the optimal linear representation of features for a new downstream task? and (2) How many samples do we need to build this representation? This work shows that surprisingly, overparameterization arises as a natural answer to these fundamental meta-learning questions. Specifically, for (1), we first show that learning the optimal representation coincides with the problem of designing a task-aware regularization to promote inductive bias. We leverage this inductive bias to explain how the downstream task actually benefits from overparameterization, in contrast to prior works on few-shot learning. For (2), we develop a theory to explain how feature covariance can implicitly help reduce the sample complexity well below the degrees of freedom and lead to small estimation error. We then integrate these findings to obtain an overall performance guarantee for our meta-learning algorithm. Numerical experiments on real and synthetic data verify our insights on overparameterized meta-learning.

preprint2021arXiv

Sample Efficient Subspace-based Representations for Nonlinear Meta-Learning

Constructing good representations is critical for learning complex tasks in a sample efficient manner. In the context of meta-learning, representations can be constructed from common patterns of previously seen tasks so that a future task can be learned quickly. While recent works show the benefit of subspace-based representations, such results are limited to linear-regression tasks. This work explores a more general class of nonlinear tasks with applications ranging from binary classification, generalized linear models and neural nets. We prove that subspace-based representations can be learned in a sample-efficient manner and provably benefit future tasks in terms of sample complexity. Numerical results verify the theoretical predictions in classification and neural-network regression tasks.

preprint2020arXiv

Exploring Weight Importance and Hessian Bias in Model Pruning

Model pruning is an essential procedure for building compact and computationally-efficient machine learning models. A key feature of a good pruning algorithm is that it accurately quantifies the relative importance of the model weights. While model pruning has a rich history, we still don't have a full grasp of the pruning mechanics even for relatively simple problems involving linear models or shallow neural nets. In this work, we provide a principled exploration of pruning by building on a natural notion of importance. For linear models, we show that this notion of importance is captured by covariance scaling which connects to the well-known Hessian-based pruning. We then derive asymptotic formulas that allow us to precisely compare the performance of different pruning methods. For neural networks, we demonstrate that the importance can be at odds with larger magnitudes and proper initialization is critical for magnitude-based pruning. Specifically, we identify settings in which weights become more important despite becoming smaller, which in turn leads to a catastrophic failure of magnitude-based pruning. Our results also elucidate that implicit regularization in the form of Hessian structure has a catalytic role in identifying the important weights, which dictate the pruning performance.

preprint2020arXiv

On the Role of Dataset Quality and Heterogeneity in Model Confidence

Safety-critical applications require machine learning models that output accurate and calibrated probabilities. While uncalibrated deep networks are known to make over-confident predictions, it is unclear how model confidence is impacted by the variations in the data, such as label noise or class size. In this paper, we investigate the role of the dataset quality by studying the impact of dataset size and the label noise on the model confidence. We theoretically explain and experimentally demonstrate that, surprisingly, label noise in the training data leads to under-confident networks, while reduced dataset size leads to over-confident models. We then study the impact of dataset heterogeneity, where data quality varies across classes, on model confidence. We demonstrate that this leads to heterogenous confidence/accuracy behavior in the test data and is poorly handled by the standard calibration algorithms. To overcome this, we propose an intuitive heterogenous calibration technique and show that the proposed approach leads to improved calibration metrics (both average and worst-case errors) on the CIFAR datasets.

preprint2020arXiv

Statistical and Algorithmic Insights for Semi-supervised Learning with Self-training

Self-training is a classical approach in semi-supervised learning which is successfully applied to a variety of machine learning problems. Self-training algorithm generates pseudo-labels for the unlabeled examples and progressively refines these pseudo-labels which hopefully coincides with the actual labels. This work provides theoretical insights into self-training algorithm with a focus on linear classifiers. We first investigate Gaussian mixture models and provide a sharp non-asymptotic finite-sample characterization of the self-training iterations. Our analysis reveals the provable benefits of rejecting samples with low confidence and demonstrates that self-training iterations gracefully improve the model accuracy even if they do get stuck in sub-optimal fixed points. We then demonstrate that regularization and class margin (i.e. separation) is provably important for the success and lack of regularization may prevent self-training from identifying the core features in the data. Finally, we discuss statistical aspects of empirical risk minimization with self-training for general distributions. We show how a purely unsupervised notion of generalization based on self-training based clustering can be formalized based on cluster margin. We then establish a connection between self-training based semi-supervision and the more general problem of learning with heterogenous data and weak supervision.

preprint2020arXiv

Unsupervised Paraphrasing via Deep Reinforcement Learning

Paraphrasing is expressing the meaning of an input sentence in different wording while maintaining fluency (i.e., grammatical and syntactical correctness). Most existing work on paraphrasing use supervised models that are limited to specific domains (e.g., image captions). Such models can neither be straightforwardly transferred to other domains nor generalize well, and creating labeled training data for new domains is expensive and laborious. The need for paraphrasing across different domains and the scarcity of labeled training data in many such domains call for exploring unsupervised paraphrase generation methods. We propose Progressive Unsupervised Paraphrasing (PUP): a novel unsupervised paraphrase generation method based on deep reinforcement learning (DRL). PUP uses a variational autoencoder (trained using a non-parallel corpus) to generate a seed paraphrase that warm-starts the DRL model. Then, PUP progressively tunes the seed paraphrase guided by our novel reward function which combines semantic adequacy, language fluency, and expression diversity measures to quantify the quality of the generated paraphrases in each iteration without needing parallel sentences. Our extensive experimental evaluation shows that PUP outperforms unsupervised state-of-the-art paraphrasing techniques in terms of both automatic metrics and user studies on four real datasets. We also show that PUP outperforms domain-adapted supervised algorithms on several datasets. Our evaluation also shows that PUP achieves a great trade-off between semantic similarity and diversity of expression.

preprint2019arXiv

Quickly Finding the Best Linear Model in High Dimensions

We study the problem of finding the best linear model that can minimize least-squares loss given a data-set. While this problem is trivial in the low dimensional regime, it becomes more interesting in high dimensions where the population minimizer is assumed to lie on a manifold such as sparse vectors. We propose projected gradient descent (PGD) algorithm to estimate the population minimizer in the finite sample regime. We establish linear convergence rate and data dependent estimation error bounds for PGD. Our contributions include: 1) The results are established for heavier tailed sub-exponential distributions besides sub-gaussian. 2) We directly analyze the empirical risk minimization and do not require a realizable model that connects input data and labels. 3) Our PGD algorithm is augmented to learn the bias terms which boosts the performance. The numerical experiments validate our theoretical results.

preprint2016arXiv

Fast and Reliable Parameter Estimation from Nonlinear Observations

In this paper we study the problem of recovering a structured but unknown parameter ${\bfθ}^*$ from $n$ nonlinear observations of the form $y_i=f(\langle {\bf{x}}_i,{\bfθ}^*\rangle)$ for $i=1,2,\ldots,n$. We develop a framework for characterizing time-data tradeoffs for a variety of parameter estimation algorithms when the nonlinear function $f$ is unknown. This framework includes many popular heuristics such as projected/proximal gradient descent and stochastic schemes. For example, we show that a projected gradient descent scheme converges at a linear rate to a reliable solution with a near minimal number of samples. We provide a sharp characterization of the convergence rate of such algorithms as a function of sample size, amount of a-prior knowledge available about the parameter and a measure of the nonlinearity of the function $f$. These results provide a precise understanding of the various tradeoffs involved between statistical and computational resources as well as a-prior side information available for such nonlinear parameter estimation problems.

preprint2016arXiv

Near-Optimal Sample Complexity Bounds for Circulant Binary Embedding

Binary embedding is the problem of mapping points from a high-dimensional space to a Hamming cube in lower dimension while preserving pairwise distances. An efficient way to accomplish this is to make use of fast embedding techniques involving Fourier transform e.g.~circulant matrices. While binary embedding has been studied extensively, theoretical results on fast binary embedding are rather limited. In this work, we build upon the recent literature to obtain significantly better dependencies on the problem parameters. A set of $N$ points in $\mathbb{R}^n$ can be properly embedded into the Hamming cube $\{\pm 1\}^k$ with $δ$ distortion, by using $k\simδ^{-3}\log N$ samples which is optimal in the number of points $N$ and compares well with the optimal distortion dependency $δ^{-2}$. Our optimal embedding result applies in the regime $\log N\lesssim n^{1/3}$. Furthermore, if the looser condition $\log N\lesssim \sqrt{n}$ holds, we show that all but an arbitrarily small fraction of the points can be optimally embedded. We believe our techniques can be useful to obtain improved guarantees for other nonlinear embedding problems.

preprint2016arXiv

Sharp Time--Data Tradeoffs for Linear Inverse Problems

In this paper we characterize sharp time-data tradeoffs for optimization problems used for solving linear inverse problems. We focus on the minimization of a least-squares objective subject to a constraint defined as the sub-level set of a penalty function. We present a unified convergence analysis of the gradient projection algorithm applied to such problems. We sharply characterize the convergence rate associated with a wide variety of random measurement ensembles in terms of the number of measurements and structural complexity of the signal with respect to the chosen penalty function. The results apply to both convex and nonconvex constraints, demonstrating that a linear convergence rate is attainable even though the least squares objective is not strongly convex in these settings. When specialized to Gaussian measurements our results show that such linear convergence occurs when the number of measurements is merely 4 times the minimal number required to recover the desired signal at all (a.k.a. the phase transition). We also achieve a slower but geometric rate of convergence precisely above the phase transition point. Extensive numerical results suggest that the derived rates exactly match the empirical performance.

preprint2015arXiv

Isometric sketching of any set via the Restricted Isometry Property

In this paper we show that for the purposes of dimensionality reduction certain class of structured random matrices behave similarly to random Gaussian matrices. This class includes several matrices for which matrix-vector multiply can be computed in log-linear time, providing efficient dimensionality reduction of general sets. In particular, we show that using such matrices any set from high dimensions can be embedded into lower dimensions with near optimal distortion. We obtain our results by connecting dimensionality reduction of any set to dimensionality reduction of sparse vectors via a chaining argument.

preprint2015arXiv

Near-Optimal Bounds for Binary Embeddings of Arbitrary Sets

We study embedding a subset $K$ of the unit sphere to the Hamming cube $\{-1,+1\}^m$. We characterize the tradeoff between distortion and sample complexity $m$ in terms of the Gaussian width $ω(K)$ of the set. For subspaces and several structured sets we show that Gaussian maps provide the optimal tradeoff $m\sim δ^{-2}ω^2(K)$, in particular for $δ$ distortion one needs $m\approxδ^{-2}{d}$ where $d$ is the subspace dimension. For general sets, we provide sharp characterizations which reduces to $m\approx{δ^{-4}}{ω^2(K)}$ after simplification. We provide improved results for local embedding of points that are in close proximity of each other which is related to locality sensitive hashing. We also discuss faster binary embedding where one takes advantage of an initial sketching procedure based on Fast Johnson-Lindenstauss Transform. Finally, we list several numerical observations and discuss open problems.

preprint2015arXiv

Parallel Correlation Clustering on Big Graphs

Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, KwikCluster in practice requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and achieve nearly linear speedups, provably. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3-approximation ratio. We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.

preprint2015arXiv

Sparse Phase Retrieval: Uniqueness Guarantees and Recovery Algorithms

The problem of signal recovery from its Fourier transform magnitude is of paramount importance in various fields of engineering and has been around for over 100 years. Due to the absence of phase information, some form of additional information is required in order to be able to uniquely identify the signal of interest. In this work, we focus our attention on discrete-time sparse signals (of length $n$). We first show that, if the DFT dimension is greater than or equal to $2n$, almost all signals with {\em aperiodic} support can be uniquely identified by their Fourier transform magnitude (up to time-shift, conjugate-flip and global phase). Then, we develop an efficient Two-stage Sparse Phase Retrieval algorithm (TSPR), which involves: (i) identifying the support, i.e., the locations of the non-zero components, of the signal using a combinatorial algorithm (ii) identifying the signal values in the support using a convex algorithm. We show that TSPR can {\em provably} recover most $O(n^{1/2-\eps})$-sparse signals (up to a time-shift, conjugate-flip and global phase). We also show that, for most $O(n^{1/4-\eps})$-sparse signals, the recovery is {\em robust} in the presence of measurement noise. Numerical experiments complement our theoretical analysis and verify the effectiveness of TSPR.

preprint2015arXiv

The Gaussian min-max theorem in the Presence of Convexity

Gaussian comparison theorems are useful tools in probability theory; they are essential ingredients in the classical proofs of many results in empirical processes and extreme value theory. More recently, they have been used extensively in the analysis of non-smooth optimization problems that arise in the recovery of structured signals from noisy linear observations. We refer to such problems as Primary Optimization (PO) problems. A prominent role in the study of the (PO) problems is played by Gordon's Gaussian min-max theorem (GMT) which provides probabilistic lower bounds on the optimal cost via a simpler Auxiliary Optimization (AO) problem. Motivated by resent work of M. Stojnic, we show that under appropriate convexity assumptions the (AO) problem allows one to tightly bound both the optimal cost, as well as the norm of the solution of the (PO). As an application, we use our result to develop a general framework to tightly characterize the performance (e.g. squared-error) of a wide class of convex optimization algorithms used in the context of noisy signal recovery.

preprint2014arXiv

Simple Error Bounds for Regularized Noisy Linear Inverse Problems

Consider estimating a structured signal $\mathbf{x}_0$ from linear, underdetermined and noisy measurements $\mathbf{y}=\mathbf{A}\mathbf{x}_0+\mathbf{z}$, via solving a variant of the lasso algorithm: $\hat{\mathbf{x}}=\arg\min_\mathbf{x}\{ \|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2+λf(\mathbf{x})\}$. Here, $f$ is a convex function aiming to promote the structure of $\mathbf{x}_0$, say $\ell_1$-norm to promote sparsity or nuclear norm to promote low-rankness. We assume that the entries of $\mathbf{A}$ are independent and normally distributed and make no assumptions on the noise vector $\mathbf{z}$, other than it being independent of $\mathbf{A}$. Under this generic setup, we derive a general, non-asymptotic and rather tight upper bound on the $\ell_2$-norm of the estimation error $\|\hat{\mathbf{x}}-\mathbf{x}_0\|_2$. Our bound is geometric in nature and obeys a simple formula; the roles of $λ$, $f$ and $\mathbf{x}_0$ are all captured by a single summary parameter $δ(λ\partial((f(\mathbf{x}_0)))$, termed the Gaussian squared distance to the scaled subdifferential. We connect our result to the literature and verify its validity through simulations.

preprint2014arXiv

Simultaneously Structured Models with Application to Sparse and Low-rank Matrices

The topic of recovery of a structured model given a small number of linear observations has been well-studied in recent years. Examples include recovering sparse or group-sparse vectors, low-rank matrices, and the sum of sparse and low-rank matrices, among others. In various applications in signal processing and machine learning, the model of interest is known to be structured in several ways at the same time, for example, a matrix that is simultaneously sparse and low-rank. Often norms that promote each individual structure are known, and allow for recovery using an order-wise optimal number of measurements (e.g., $\ell_1$ norm for sparsity, nuclear norm for matrix rank). Hence, it is reasonable to minimize a combination of such norms. We show that, surprisingly, if we use multi-objective optimization with these norms, then we can do no better, order-wise, than an algorithm that exploits only one of the present structures. This result suggests that to fully exploit the multiple structures, we need an entirely new convex relaxation, i.e. not one that is a function of the convex relaxations used for each structure. We then specialize our results to the case of sparse and low-rank matrices. We show that a nonconvex formulation of the problem can recover the model from very few measurements, which is on the order of the degrees of freedom of the matrix, whereas the convex problem obtained from a combination of the $\ell_1$ and nuclear norms requires many more measurements. This proves an order-wise gap between the performance of the convex and nonconvex recovery problems in this case. Our framework applies to arbitrary structure-inducing norms as well as to a wide range of measurement ensembles. This allows us to give performance bounds for problems such as sparse phase retrieval and low-rank tensor completion.

preprint2013arXiv

Sharp MSE Bounds for Proximal Denoising

Denoising has to do with estimating a signal $x_0$ from its noisy observations $y=x_0+z$. In this paper, we focus on the "structured denoising problem", where the signal $x_0$ possesses a certain structure and $z$ has independent normally distributed entries with mean zero and variance $σ^2$. We employ a structure-inducing convex function $f(\cdot)$ and solve $\min_x\{\frac{1}{2}\|y-x\|_2^2+σλf(x)\}$ to estimate $x_0$, for some $λ>0$. Common choices for $f(\cdot)$ include the $\ell_1$ norm for sparse vectors, the $\ell_1-\ell_2$ norm for block-sparse signals and the nuclear norm for low-rank matrices. The metric we use to evaluate the performance of an estimate $x^*$ is the normalized mean-squared-error $\text{NMSE}(σ)=\frac{\mathbb{E}\|x^*-x_0\|_2^2}{σ^2}$. We show that NMSE is maximized as $σ\rightarrow 0$ and we find the \emph{exact} worst case NMSE, which has a simple geometric interpretation: the mean-squared-distance of a standard normal vector to the $λ$-scaled subdifferential $λ\partial f(x_0)$. When $λ$ is optimally tuned to minimize the worst-case NMSE, our results can be related to the constrained denoising problem $\min_{f(x)\leq f(x_0)}\{\|y-x\|_2\}$. The paper also connects these results to the generalized LASSO problem, in which, one solves $\min_{f(x)\leq f(x_0)}\{\|y-Ax\|_2\}$ to estimate $x_0$ from noisy linear observations $y=Ax_0+z$. We show that certain properties of the LASSO problem are closely related to the denoising problem. In particular, we characterize the normalized LASSO cost and show that it exhibits a "phase transition" as a function of number of observations. Our results are significant in two ways. First, we find a simple formula for the performance of a general convex estimator. Secondly, we establish a connection between the denoising and linear inverse problems.

preprint2013arXiv

Simple Bounds for Noisy Linear Inverse Problems with Exact Side Information

This paper considers the linear inverse problem where we wish to estimate a structured signal $x$ from its corrupted observations. When the problem is ill-posed, it is natural to make use of a convex function $f(\cdot)$ that exploits the structure of the signal. For example, $\ell_1$ norm can be used for sparse signals. To carry out the estimation, we consider two well-known convex programs: 1) Second order cone program (SOCP), and, 2) Lasso. Assuming Gaussian measurements, we show that, if precise information about the value $f(x)$ or the $\ell_2$-norm of the noise is available, one can do a particularly good job at estimation. In particular, the reconstruction error becomes proportional to the "sparsity" of the signal rather than the ambient dimension of the noise vector. We connect our results to existing works and provide a discussion on the relation of our results to the standard least-squares problem. Our error bounds are non-asymptotic and sharp, they apply to arbitrary convex functions and do not assume any distribution on the noise.

preprint2013arXiv

Sparse Phase Retrieval: Convex Algorithms and Limitations

We consider the problem of recovering signals from their power spectral density. This is a classical problem referred to in literature as the phase retrieval problem, and is of paramount importance in many fields of applied sciences. In general, additional prior information about the signal is required to guarantee unique recovery as the mapping from signals to power spectral density is not one-to-one. In this paper, we assume that the underlying signals are sparse. Recently, semidefinite programming (SDP) based approaches were explored by various researchers. Simulations of these algorithms strongly suggest that signals upto $o(\sqrt{n})$ sparsity can be recovered by this technique. In this work, we develop a tractable algorithm based on reweighted $l_1$-minimization that recovers a sparse signal from its power spectral density for significantly higher sparsities, which is unprecedented. We discuss the square-root bottleneck of the existing convex algorithms and show that a $k$-sparse signal can be efficiently recovered using $O(k^2logn)$ phaseless Fourier measurements. We also show that a $k$-sparse signal can be recovered using only $O(k log n)$ phaseless measurements if we are allowed to design the measurement matrices.

preprint2013arXiv

The Squared-Error of Generalized LASSO: A Precise Analysis

We consider the problem of estimating an unknown signal $x_0$ from noisy linear observations $y = Ax_0 + z\in R^m$. In many practical instances, $x_0$ has a certain structure that can be captured by a structure inducing convex function $f(\cdot)$. For example, $\ell_1$ norm can be used to encourage a sparse solution. To estimate $x_0$ with the aid of $f(\cdot)$, we consider the well-known LASSO method and provide sharp characterization of its performance. We assume the entries of the measurement matrix $A$ and the noise vector $z$ have zero-mean normal distributions with variances $1$ and $σ^2$ respectively. For the LASSO estimator $x^*$, we attempt to calculate the Normalized Square Error (NSE) defined as $\frac{\|x^*-x_0\|_2^2}{σ^2}$ as a function of the noise level $σ$, the number of observations $m$ and the structure of the signal. We show that, the structure of the signal $x_0$ and choice of the function $f(\cdot)$ enter the error formulae through the summary parameters $D(cone)$ and $D(λ)$, which are defined as the Gaussian squared-distances to the subdifferential cone and to the $λ$-scaled subdifferential, respectively. The first LASSO estimator assumes a-priori knowledge of $f(x_0)$ and is given by $\arg\min_{x}\{{\|y-Ax\|_2}~\text{subject to}~f(x)\leq f(x_0)\}$. We prove that its worst case NSE is achieved when $σ\rightarrow 0$ and concentrates around $\frac{D(cone)}{m-D(cone)}$. Secondly, we consider $\arg\min_{x}\{\|y-Ax\|_2+λf(x)\}$, for some $λ\geq 0$. This time the NSE formula depends on the choice of $λ$ and is given by $\frac{D(λ)}{m-D(λ)}$. We then establish a mapping between this and the third estimator $\arg\min_{x}\{\frac{1}{2}\|y-Ax\|_2^2+ λf(x)\}$. Finally, for a number of important structured signal classes, we translate our abstract formulae to closed-form upper bounds on the NSE.

preprint2012arXiv

Recovering Jointly Sparse Signals via Joint Basis Pursuit

This work considers recovery of signals that are sparse over two bases. For instance, a signal might be sparse in both time and frequency, or a matrix can be low rank and sparse simultaneously. To facilitate recovery, we consider minimizing the sum of the $\ell_1$-norms that correspond to each basis, which is a tractable convex approach. We find novel optimality conditions which indicates a gain over traditional approaches where $\ell_1$ minimization is done over only one basis. Next, we analyze these optimality conditions for the particular case of time-frequency bases. Denoting sparsity in the first and second bases by $k_1,k_2$ respectively, we show that, for a general class of signals, using this approach, one requires as small as $O(\max\{k_1,k_2\}\log\log n)$ measurements for successful recovery hence overcoming the classical requirement of $Θ(\min\{k_1,k_2\}\log(\frac{n}{\min\{k_1,k_2\}}))$ for $\ell_1$ minimization when $k_1\approx k_2$. Extensive simulations show that, our analysis is approximately tight.

preprint2012arXiv

Recovery of Sparse 1-D Signals from the Magnitudes of their Fourier Transform

The problem of signal recovery from the autocorrelation, or equivalently, the magnitudes of the Fourier transform, is of paramount importance in various fields of engineering. In this work, for one-dimensional signals, we give conditions, which when satisfied, allow unique recovery from the autocorrelation with very high probability. In particular, for sparse signals, we develop two non-iterative recovery algorithms. One of them is based on combinatorial analysis, which we prove can recover signals upto sparsity $o(n^{1/3})$ with very high probability, and the other is developed using a convex optimization based framework, which numerical simulations suggest can recover signals upto sparsity $o(n^{1/2})$ with very high probability.

preprint2011arXiv

A Simplified Approach to Recovery Conditions for Low Rank Matrices

Recovering sparse vectors and low-rank matrices from noisy linear measurements has been the focus of much recent research. Various reconstruction algorithms have been studied, including $\ell_1$ and nuclear norm minimization as well as $\ell_p$ minimization with $p<1$. These algorithms are known to succeed if certain conditions on the measurement map are satisfied. Proofs of robust recovery for matrices have so far been much more involved than in the vector case. In this paper, we show how several robust classes of recovery conditions can be extended from vectors to matrices in a simple and transparent way, leading to the best known restricted isometry and nullspace conditions for matrix recovery. Our results rely on the ability to "vectorize" matrices through the use of a key singular value inequality.

preprint2011arXiv

Finding Dense Clusters via "Low Rank + Sparse" Decomposition

Finding "densely connected clusters" in a graph is in general an important and well studied problem in the literature \cite{Schaeffer}. It has various applications in pattern recognition, social networking and data mining \cite{Duda,Mishra}. Recently, Ames and Vavasis have suggested a novel method for finding cliques in a graph by using convex optimization over the adjacency matrix of the graph \cite{Ames, Ames2}. Also, there has been recent advances in decomposing a given matrix into its "low rank" and "sparse" components \cite{Candes, Chandra}. In this paper, inspired by these results, we view "densely connected clusters" as imperfect cliques, where imperfections correspond missing edges, which are relatively sparse. We analyze the problem in a probabilistic setting and aim to detect disjointly planted clusters. Our main result basically suggests that, one can find \emph{dense} clusters in a graph, as long as the clusters are sufficiently large. We conclude by discussing possible extensions and future research directions.

preprint2011arXiv

Subspace Expanders and Matrix Rank Minimization

Matrix rank minimization (RM) problems recently gained extensive attention due to numerous applications in machine learning, system identification and graphical models. In RM problem, one aims to find the matrix with the lowest rank that satisfies a set of linear constraints. The existing algorithms include nuclear norm minimization (NNM) and singular value thresholding. Thus far, most of the attention has been on i.i.d. Gaussian measurement operators. In this work, we introduce a new class of measurement operators, and a novel recovery algorithm, which is notably faster than NNM. The proposed operators are based on what we refer to as subspace expanders, which are inspired by the well known expander graphs based measurement matrices in compressed sensing. We show that given an $n\times n$ PSD matrix of rank $r$, it can be uniquely recovered from a minimal sampling of $O(nr)$ measurements using the proposed structures, and the recovery algorithm can be cast as matrix inversion after a few initial processing steps.

preprint2010arXiv

New Null Space Results and Recovery Thresholds for Matrix Rank Minimization

Nuclear norm minimization (NNM) has recently gained significant attention for its use in rank minimization problems. Similar to compressed sensing, using null space characterizations, recovery thresholds for NNM have been studied in \cite{arxiv,Recht_Xu_Hassibi}. However simulations show that the thresholds are far from optimal, especially in the low rank region. In this paper we apply the recent analysis of Stojnic for compressed sensing \cite{mihailo} to the null space conditions of NNM. The resulting thresholds are significantly better and in particular our weak threshold appears to match with simulation results. Further our curves suggest for any rank growing linearly with matrix size $n$ we need only three times of oversampling (the model complexity) for weak recovery. Similar to \cite{arxiv} we analyze the conditions for weak, sectional and strong thresholds. Additionally a separate analysis is given for special case of positive semidefinite matrices. We conclude by discussing simulation results and future research directions.

Samet Oymak

What is connected

Connect this record

See the researcher in context

Building this map preview

37 published item(s)

Latent Chain-of-Thought Improves Structured-Data Transformers

Stochastic Sparse Attention for Memory-Bound Inference

VSPO: Vector-Steered Policy Optimization for Behavioral Control

AutoBalance: Optimized Loss Functions for Imbalanced Data

FedNest: Federated Bilevel, Minimax, and Compositional Optimization

Finite Sample Identification of Bilinear Dynamical Systems

Generalization Guarantees for Neural Architecture Search with Train-Validation Split

Non-Stationary Representation Learning in Sequential Linear Bandits

Representation Learning for Context-Dependent Decision-Making

System Identification via Nuclear Norm Regularization

Towards Sample-efficient Overparameterized Meta-learning

Sample Efficient Subspace-based Representations for Nonlinear Meta-Learning

Exploring Weight Importance and Hessian Bias in Model Pruning

On the Role of Dataset Quality and Heterogeneity in Model Confidence

Statistical and Algorithmic Insights for Semi-supervised Learning with Self-training

Unsupervised Paraphrasing via Deep Reinforcement Learning

Quickly Finding the Best Linear Model in High Dimensions

Fast and Reliable Parameter Estimation from Nonlinear Observations

Near-Optimal Sample Complexity Bounds for Circulant Binary Embedding

Sharp Time--Data Tradeoffs for Linear Inverse Problems

Isometric sketching of any set via the Restricted Isometry Property

Near-Optimal Bounds for Binary Embeddings of Arbitrary Sets

Parallel Correlation Clustering on Big Graphs

Sparse Phase Retrieval: Uniqueness Guarantees and Recovery Algorithms

The Gaussian min-max theorem in the Presence of Convexity

Simple Error Bounds for Regularized Noisy Linear Inverse Problems

Simultaneously Structured Models with Application to Sparse and Low-rank Matrices

Sharp MSE Bounds for Proximal Denoising

Simple Bounds for Noisy Linear Inverse Problems with Exact Side Information

Sparse Phase Retrieval: Convex Algorithms and Limitations

The Squared-Error of Generalized LASSO: A Precise Analysis

Recovering Jointly Sparse Signals via Joint Basis Pursuit

Recovery of Sparse 1-D Signals from the Magnitudes of their Fourier Transform

A Simplified Approach to Recovery Conditions for Low Rank Matrices

Finding Dense Clusters via "Low Rank + Sparse" Decomposition

Subspace Expanders and Matrix Rank Minimization

New Null Space Results and Recovery Thresholds for Matrix Rank Minimization