Source author record

Tianbao Yang

Tianbao Yang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Machine Learning; math.OC; Artificial Intelligence; Computer Vision; Numerical Analysis; Distributed, Parallel, and Cluster Computing; physics.soc-ph; Social and Information Networks; Computational Complexity; Data Structures and Algorithms; Information Theory; math.IT; math.NA; math.ST; Robotics; Statistics Theory

Catalog footprint

What is connected

55 works

16 topics

4 close collaborators

preprint, 2026, arXiv

A Domain Incremental Continual Learning Benchmark for ICU Time Series Model Transportability

In recent years, machine learning has made significant progress in clinical outcome prediction, demonstrating increasingly accurate results. However, the substantial resources required for hospitals to train these models, such as data collection, labeling, and computational power, limit the feasibility for smaller hospitals to develop their own models. An alternative approach involves transferring a machine learning model trained by a large hospital to smaller hospitals, allowing them to fine-tune the model on their specific patient data. However, these models are often trained and validated on data from a single hospital, raising concerns about their generalizability to new data. Our research shows that there are notable differences in measurement distributions and frequencies across various regions in the United States. To address this, we propose a benchmark that tests a machine learning model's ability to transfer from a source domain to different regions across the country. This benchmark assesses a model's capacity to learn meaningful information about each new domain while retaining key features from the original domain. Using this benchmark, we frame the transfer of a machine learning model from one region to another as a domain incremental learning problem. While the task of patient outcome prediction remains the same, the input data distribution varies, necessitating a model that can effectively manage these shifts. We evaluate two popular domain incremental learning methods: data replay, which stores examples from previous data sources for fine-tuning on the current source, and Elastic Weight Consolidation (EWC), a model parameter regularization method that maintains features important for both data sources.
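The EWC regularizer named in this abstract is commonly written as a quadratic penalty weighted by a diagonal Fisher information estimate. The following is a minimal sketch under that standard formulation, not the benchmark's own code; the names `ewc_penalty`, `regularized_loss`, `fisher_diag`, and `lam` are illustrative assumptions.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    params        -- current parameters (being fine-tuned on the new domain)
    anchor_params -- parameters learned on the source domain
    fisher_diag   -- per-parameter importance (diagonal Fisher estimate)
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for theta, theta_star, fisher in zip(params, anchor_params, fisher_diag):
        total += fisher * (theta - theta_star) ** 2
    return 0.5 * lam * total


def regularized_loss(task_loss, params, anchor_params, fisher_diag, lam=1.0):
    """New-domain task loss plus the EWC penalty that discourages forgetting."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

Parameters with a large Fisher value are pulled strongly toward their source-domain values, while unimportant ones remain free to adapt to the new region.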

preprint, 2026, arXiv

AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving

Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs -- a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives -- including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that the general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs -- an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. We release all the codes and datasets in https://github.com/taco-group/AutoTrust.

preprint, 2026, arXiv

Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP

CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of recovering optimization statistics to reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage of optimizing a new contrastive loss to mitigate the penalization on false negative pairs. Our extensive experiments show that TuneCLIP consistently improves performance across model architectures and scales. Notably, it elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.

preprint, 2026, arXiv

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.

preprint, 2026, arXiv

Memory-Efficient Continual Learning with CLIP Models

Contrastive Language-Image Pretraining (CLIP) models excel at understanding image-text relationships but struggle with adapting to new data without forgetting prior knowledge. To address this, models are typically fine-tuned using both new task data and a memory buffer of past tasks. However, CLIP's contrastive loss suffers when the memory buffer is small, leading to performance degradation on previous tasks. We propose a memory-efficient, distributionally robust method that dynamically reweights losses per class during training. Our approach, tested on class incremental settings (CIFAR-100, ImageNet1K) and a domain incremental setting (DomainNet), adapts CLIP models quickly while minimizing catastrophic forgetting, even with minimal memory usage.

preprint, 2026, arXiv

Statistical Consistency and Generalization of Contrastive Representation Learning

Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is statistically consistent with optimal ranking. We further establish a calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision-language models corroborate our theoretical predictions.

preprint, 2022, arXiv

AUC Maximization in the Era of Big Data and AI: A Survey

Area under the ROC curve, a.k.a. AUC, is a measure of choice for assessing the performance of a classifier for imbalanced data. AUC maximization refers to a learning paradigm that learns a predictive model by directly maximizing its AUC score. It has been studied for more than two decades, dating back to the late 90s, and a huge amount of work has been devoted to AUC maximization since then. Recently, stochastic AUC maximization for big data and deep AUC maximization for deep learning have received increasing attention and yielded dramatic impact for solving real-world problems. However, to the best of our knowledge, there is no comprehensive survey of related works for AUC maximization. This paper aims to address the gap by reviewing the literature in the past two decades. We not only give a holistic view of the literature but also present detailed explanations and comparisons of different papers from formulations to algorithms and theoretical guarantees. We also identify and discuss remaining and emerging issues for deep AUC maximization, and provide suggestions on topics for future work.
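To make the AUC objective concrete, the sketch below computes the empirical AUC as the fraction of correctly ordered positive-negative score pairs, together with a pairwise square surrogate of the kind often used when the pair count must be optimized with gradients. It is an illustration only, not code from the survey; the function names and the `margin` parameter are assumptions.

```python
def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


def pairwise_square_loss(pos_scores, neg_scores, margin=1.0):
    """Smooth surrogate: mean of (margin - (p - n))^2 over all pairs."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += (margin - (p - n)) ** 2
    return total / (len(pos_scores) * len(neg_scores))
```

Because the exact pair count is piecewise constant in the scores, a differentiable surrogate such as the square loss is what a gradient-based method would minimize in its place.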

preprint, 2022, arXiv

Benchmarking Deep AUROC Optimization: Loss Functions and Algorithmic Choices

The area under the ROC curve (AUROC) has been vigorously applied for imbalanced classification and moreover combined with deep learning techniques. However, there is no existing work that provides sound information for peers to choose appropriate deep AUROC maximization techniques. In this work, we fill this gap from three aspects. (i) We benchmark a variety of loss functions with different algorithmic choices for the deep AUROC optimization problem. We study the loss functions in two categories, pairwise loss and composite loss, which together include a total of 10 loss functions. Interestingly, we find that composite loss, as an innovative loss function class, shows more competitive performance than pairwise loss from both training convergence and testing generalization perspectives. Nevertheless, data with more corrupted labels favors a pairwise symmetric loss. (ii) Moreover, we benchmark and highlight the essential algorithmic choices such as positive sampling rate, regularization, normalization/activation, and optimizers. Key findings include: a higher positive sampling rate is likely to be beneficial for deep AUROC maximization; different datasets favor different regularization weights; appropriate normalization techniques, such as sigmoid and $\ell_2$ score normalization, could improve model performance. (iii) For the optimization aspect, we benchmark SGD-type, Momentum-type, and Adam-type optimizers for both pairwise and composite loss. Our findings show that although the Adam-type method is more competitive from the training perspective, it does not outperform others from the testing perspective.

preprint, 2022, arXiv

GraphFM: Improving Large-Scale GNN Training via Feature Momentum

Training of graph neural networks (GNNs) for large-scale node classification is challenging. A key difficulty lies in obtaining accurate hidden node representations while avoiding the neighborhood explosion problem. Here, we propose a new technique, named feature momentum (FM), that uses a momentum step to incorporate historical embeddings when updating feature representations. We develop two specific algorithms, known as GraphFM-IB and GraphFM-OB, that consider in-batch and out-of-batch data, respectively. GraphFM-IB applies FM to in-batch sampled data, while GraphFM-OB applies FM to out-of-batch data that lie in the 1-hop neighborhood of in-batch data. We provide a convergence analysis for GraphFM-IB and some theoretical insight for GraphFM-OB. Empirically, we observe that GraphFM-IB can effectively alleviate the neighborhood explosion problem of existing methods. In addition, GraphFM-OB achieves promising performance on multiple large-scale graph datasets.
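One plausible reading of a "momentum step" over historical embeddings is an exponential moving average between the stored embedding and a freshly computed one. The sketch below illustrates that idea only; the exact update used by GraphFM-IB and GraphFM-OB may differ, and `feature_momentum_update` and `beta` are hypothetical names.

```python
def feature_momentum_update(history, fresh, beta=0.1):
    """Blend a stored node embedding with a freshly computed estimate.

    Returns (1 - beta) * history + beta * fresh, elementwise. A small beta
    keeps the result close to the historical value and damps sampling noise.
    """
    if len(history) != len(fresh):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [(1.0 - beta) * h + beta * f for h, f in zip(history, fresh)]
```

With `beta=1.0` the history is discarded and only the fresh mini-batch estimate is used; with `beta=0.0` the stored embedding is never refreshed.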

preprint, 2022, arXiv

Momentum Accelerates the Convergence of Stochastic AUPRC Maximization

In this paper, we study stochastic optimization of areas under precision-recall curves (AUPRC), which is widely used for combating imbalanced classification tasks. Although a few methods have been proposed for maximizing AUPRC, stochastic optimization of AUPRC with convergence guarantee remains an undeveloped territory. A state-of-the-art complexity is $O(1/ε^5)$ for finding an $ε$-stationary solution. In this paper, we further improve the stochastic optimization of AUPRC by (i) developing novel stochastic momentum methods with a better iteration complexity of $O(1/ε^4)$ for finding an $ε$-stationary solution; and (ii) designing a novel family of stochastic adaptive methods with the same iteration complexity, which enjoy faster convergence in practice. To this end, we propose two innovative techniques that are critical for improving the convergence: (i) the biased estimators for tracking individual ranking scores are updated in a randomized coordinate-wise manner; and (ii) a momentum update is used on top of the stochastic gradient estimator for tracking the gradient of the objective. The novel analysis of Adam-style updates is also one main contribution. Extensive experiments on various data sets demonstrate the effectiveness of the proposed algorithms. Of independent interest, the proposed stochastic momentum and adaptive algorithms are also applicable to a class of two-level stochastic dependent compositional optimization problems.

preprint, 2022, arXiv

Multi-block-Single-probe Variance Reduced Estimator for Coupled Compositional Optimization

Variance reduction techniques such as SPIDER/SARAH/STORM have been extensively studied to improve the convergence rates of stochastic non-convex optimization, which usually maintain and update a sequence of estimators for a single function across iterations. What if we need to track multiple functional mappings across iterations but only with access to stochastic samples of $\mathcal{O}(1)$ functional mappings at each iteration? There is an important application in solving an emerging family of coupled compositional optimization problems in the form of $\sum_{i=1}^m f_i(g_i(\mathbf{w}))$, where $g_i$ is accessible through a stochastic oracle. The key issue is to track and estimate a sequence of $\mathbf g(\mathbf{w})=(g_1(\mathbf{w}), \ldots, g_m(\mathbf{w}))$ across iterations, where $\mathbf g(\mathbf{w})$ has $m$ blocks and it is only allowed to probe $\mathcal{O}(1)$ blocks to attain their stochastic values and Jacobians. To improve the complexity for solving these problems, we propose a novel stochastic method named Multi-block-Single-probe Variance Reduced (MSVR) estimator to track the sequence of $\mathbf g(\mathbf{w})$. It is inspired by STORM but introduces a customized error correction term to alleviate the noise not only in stochastic samples for the selected blocks but also in those blocks that are not sampled. With the help of the MSVR estimator, we develop several algorithms for solving the aforementioned compositional problems with improved complexities across a spectrum of settings with non-convex/convex/strongly convex/Polyak-Łojasiewicz (PL) objectives. Our results improve upon prior ones in several aspects, including the order of sample complexities and dependence on the strong convexity parameter. Empirical studies on multi-task deep AUC maximization demonstrate the better performance of using the new estimator.

preprint, 2020, arXiv

A Data Efficient and Feasible Level Set Method for Stochastic Convex Optimization with Expectation Constraints

Stochastic convex optimization problems with expectation constraints (SOECs) are encountered in statistics and machine learning, business, and engineering. In data-rich environments, the SOEC objective and constraints contain expectations defined with respect to large datasets. Therefore, efficient algorithms for solving such SOECs need to limit the fraction of data points that they use, which we refer to as algorithmic data complexity. Recent stochastic first order methods exhibit low data complexity when handling SOECs but guarantee near-feasibility and near-optimality only at convergence. These methods may thus return highly infeasible solutions when heuristically terminated, as is often the case, due to theoretical convergence criteria being highly conservative. This issue limits the use of first order methods in several applications where the SOEC constraints encode implementation requirements. We design a stochastic feasible level set method (SFLS) for SOECs that has low data complexity and emphasizes feasibility before convergence. Specifically, our level-set method solves a root-finding problem by calling a novel first order oracle that computes a stochastic upper bound on the level-set function by extending mirror descent and online validation techniques. We establish that SFLS maintains a high-probability feasible solution at each root-finding iteration and exhibits favorable iteration complexity compared to state-of-the-art deterministic feasible level set and stochastic subgradient methods. Numerical experiments on three diverse applications validate the low data complexity of SFLS relative to the former approach and highlight how SFLS finds feasible solutions with small optimality gaps significantly faster than the latter method.

preprint, 2020, arXiv

A Simple and Effective Framework for Pairwise Deep Metric Learning

Deep metric learning (DML) has received much attention in deep learning due to its wide applications in computer vision. Previous studies have focused on designing complicated losses and hard example mining methods, which are mostly heuristic and lack theoretical understanding. In this paper, we cast DML as a simple pairwise binary classification problem that classifies a pair of examples as similar or dissimilar. It identifies the most critical issue in this problem: imbalanced data pairs. To tackle this issue, we propose a simple and effective framework to sample pairs in a batch of data for updating the model. The key to this framework is to define a robust loss for all pairs over a mini-batch of data, which is formulated by distributionally robust optimization. The flexibility in constructing the uncertainty decision set of the dual variable allows us to recover state-of-the-art complicated losses and also to induce novel variants. Empirical studies on several benchmark data sets demonstrate that our simple and effective method outperforms the state-of-the-art results. Codes are available at: https://github.com/qiqi-helloworld/A-Simple-and-Effective-Framework-for-Pairewise-Distance-Metric-Learning

preprint, 2020, arXiv

Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition

In this paper, a new theory is developed for first-order stochastic convex optimization, showing that the global convergence rate is sufficiently quantified by a local growth rate of the objective function in a neighborhood of the optimal solutions. In particular, if the objective function $F(\mathbf w)$ in the $ε$-sublevel set grows as fast as $\|\mathbf w - \mathbf w_*\|_2^{1/θ}$, where $\mathbf w_*$ represents the closest optimal solution to $\mathbf w$ and $θ\in(0,1]$ quantifies the local growth rate, the iteration complexity of first-order stochastic optimization for achieving an $ε$-optimal solution can be $\widetilde O(1/ε^{2(1-θ)})$, which is optimal at most up to a logarithmic factor. To achieve the faster global convergence, we develop two different accelerated stochastic subgradient methods by iteratively solving the original problem approximately in a local region around a historical solution with the size of the local region gradually decreasing as the solution approaches the optimal set. Besides the theoretical improvements, this work also includes new contributions towards making the proposed algorithms practical: (i) we present practical variants of accelerated stochastic subgradient methods that can run without the knowledge of multiplicative growth constant and even the growth rate $θ$; (ii) we consider a broad family of problems in machine learning to demonstrate that the proposed algorithms enjoy faster convergence than traditional stochastic subgradient method. We also characterize the complexity of the proposed algorithms for ensuring the gradient is small without the smoothness assumption.

preprint, 2020, arXiv

Minimizing Dynamic Regret and Adaptive Regret Simultaneously

Regret minimization is treated as the golden rule in the traditional study of online learning. However, regret minimization algorithms tend to converge to the static optimum, thus being suboptimal for changing environments. To address this limitation, new performance measures, including dynamic regret and adaptive regret, have been proposed to guide the design of online algorithms. The former aims to minimize the global regret with respect to a sequence of changing comparators, and the latter attempts to minimize every local regret with respect to a fixed comparator. Existing algorithms for dynamic regret and adaptive regret are developed independently, and only target one performance measure. In this paper, we bridge this gap by proposing novel online algorithms that are able to minimize the dynamic regret and adaptive regret simultaneously. In fact, our theoretical guarantee is even stronger in the sense that one algorithm is able to minimize the dynamic regret over any interval.

preprint, 2020, arXiv

Nearly Optimal Robust Method for Convex Compositional Problems with Heavy-Tailed Noise

In this paper, we propose robust stochastic algorithms for solving convex compositional problems of the form $f(\mathbb{E}_ξ g(\cdot; ξ)) + r(\cdot)$ by establishing sub-Gaussian confidence bounds under weak assumptions about the tails of noise distribution, i.e., heavy-tailed noise with bounded second-order moments. One can achieve this goal by using an existing boosting strategy that boosts a low probability convergence result into a high probability result. However, piecing together existing results for solving compositional problems suffers from several drawbacks: (i) the boosting technique requires strong convexity of the objective; (ii) it requires a separate algorithm to handle non-smooth $r$; (iii) it also suffers from an additional polylogarithmic factor of the condition number. To address these issues, we directly develop a single-trial stochastic algorithm for minimizing optimal strongly convex compositional objectives, which has a nearly optimal high probability convergence result matching the lower bound of stochastic strongly convex optimization up to a logarithmic factor. To the best of our knowledge, this is the first work that establishes nearly optimal sub-Gaussian confidence bounds for compositional problems under heavy-tailed assumptions.

preprint, 2020, arXiv

Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization

Epoch gradient descent method (a.k.a. Epoch-GD) proposed by Hazan and Kale (2011) was deemed a breakthrough for stochastic strongly convex minimization, which achieves the optimal convergence rate of $O(1/T)$ with $T$ iterative updates for the objective gap. However, its extension to solving stochastic min-max problems with strong convexity and strong concavity still remains open, and it is still unclear whether a fast rate of $O(1/T)$ for the duality gap is achievable for stochastic min-max optimization under strong convexity and strong concavity. Although some recent studies have proposed stochastic algorithms with fast convergence rates for min-max problems, they require additional assumptions about the problem, e.g., smoothness, bi-linear structure, etc. In this paper, we bridge this gap by providing a sharp analysis of epoch-wise stochastic gradient descent ascent method (referred to as Epoch-GDA) for solving strongly convex strongly concave (SCSC) min-max problems, without imposing any additional assumption about smoothness or the function's structure. To the best of our knowledge, our result is the first one that shows Epoch-GDA can achieve the optimal rate of $O(1/T)$ for the duality gap of general SCSC min-max problems. We emphasize that such generalization of Epoch-GD for strongly convex minimization problems to Epoch-GDA for SCSC min-max problems is non-trivial and requires novel technical analysis. Moreover, we notice that the key lemma can also be used for proving the convergence of Epoch-GDA for weakly-convex strongly-concave min-max problems, leading to a nearly optimal complexity without resorting to smoothness or other structural conditions.

preprint, 2020, arXiv

Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives

Stochastic gradient descent (SGD) has been widely studied in the literature from different angles, and is commonly employed for solving many big data machine learning problems. However, the averaging technique, which combines all iterative solutions into a single solution, is still under-explored. While some increasingly weighted averaging schemes have been considered in the literature, existing works are mostly restricted to strongly convex objective functions and the convergence of optimization error. It remains unclear how these averaging schemes affect the convergence of both optimization error and generalization error (two equally important components of testing error) for non-strongly convex objectives, including non-convex problems. In this paper, we fill the gap by comprehensively analyzing the increasingly weighted averaging on convex, strongly convex and non-convex objective functions in terms of both optimization error and generalization error. In particular, we analyze a family of increasingly weighted averaging, where the weight for the solution at iteration $t$ is proportional to $t^α$ ($α> 0$). We show how $α$ affects the optimization error and the generalization error, and exhibit the trade-off caused by $α$. Experiments have demonstrated this trade-off and the effectiveness of polynomially increased weighted averaging compared with other averaging schemes for a wide range of problems including deep learning.
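The averaging family described in this abstract assigns the iterate at step $t$ a weight proportional to $t^α$. A minimal sketch of that scheme follows, in both a batch form and an equivalent running form that avoids storing all iterates; the class and function names are illustrative assumptions rather than the authors' code.

```python
def weighted_average(iterates, alpha):
    """Average of iterates w_1..w_T with weights proportional to t ** alpha."""
    if not iterates:
        raise ValueError("need at least one iterate")
    weights = [t ** alpha for t in range(1, len(iterates) + 1)]
    total = sum(weights)
    dim = len(iterates[0])
    return [sum(w * x[j] for w, x in zip(weights, iterates)) / total
            for j in range(dim)]


class RunningWeightedAverage:
    """Maintains the same weighted average online, one iterate at a time."""

    def __init__(self, dim, alpha):
        self.alpha = alpha
        self.t = 0
        self.weight_sum = 0.0
        self.average = [0.0] * dim

    def update(self, iterate):
        self.t += 1
        weight = self.t ** self.alpha
        self.weight_sum += weight
        frac = weight / self.weight_sum  # share of the newest iterate
        self.average = [(1.0 - frac) * a + frac * x
                        for a, x in zip(self.average, iterate)]
        return self.average
```

Setting `alpha` to 0 recovers the uniform average, while larger values place more weight on later iterates.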

preprint, 2020, arXiv

Stochastic AUC Maximization with Deep Neural Networks

Stochastic AUC maximization has garnered increasing interest due to its better fit to imbalanced data classification. However, existing works are limited to stochastic AUC maximization with a linear predictive model, which restricts its predictive power when dealing with extremely complex data. In this paper, we consider the stochastic AUC maximization problem with a deep neural network as the predictive model. Building on the saddle point reformulation of a surrogated loss of AUC, the problem can be cast into a non-convex concave min-max problem. The main contribution made in this paper is to make stochastic AUC maximization more practical for deep neural networks and big data with theoretical insights as well. In particular, we propose to explore the Polyak-Łojasiewicz (PL) condition that has been proved and observed in deep learning, which enables us to develop new stochastic algorithms with an even faster convergence rate and a more practical step size scheme. An AdaGrad-style algorithm is also analyzed under the PL condition with adaptive convergence rate. Our experimental results demonstrate the effectiveness of the proposed algorithms.

preprint, 2020, arXiv

Stochastic Optimization for Non-convex Inf-Projection Problems

In this paper, we study a family of non-convex and possibly non-smooth inf-projection minimization problems, where the target objective function is equal to minimization of a joint function over another variable. This problem includes difference of convex (DC) functions and a family of bi-convex functions as special cases. We develop stochastic algorithms and establish their first-order convergence for finding a (nearly) stationary solution of the target non-convex function under different conditions of the component functions. To the best of our knowledge, this is the first work that comprehensively studies stochastic optimization of non-convex inf-projection minimization problems with provable convergence guarantee. Our algorithms enable efficient stochastic optimization of a family of non-decomposable DC functions and a family of bi-convex functions. To demonstrate the power of the proposed algorithms, we consider an important application in variance-based regularization. Experiments verify the effectiveness of our inf-projection based formulation and the proposed stochastic algorithm in comparison with previous stochastic algorithms based on the min-max formulation for achieving the same effect.

preprint, 2020, arXiv

Variance-Reduced Off-Policy Memory-Efficient Policy Search

Off-policy policy optimization is a challenging problem in reinforcement learning (RL). The algorithms designed for this problem often suffer from high variance in their estimators, which results in poor sample efficiency, and have issues with convergence. A few variance-reduced on-policy policy gradient algorithms have been recently proposed that use methods from stochastic optimization to reduce the variance of the gradient estimate in the REINFORCE algorithm. However, these algorithms are not designed for the off-policy setting and are memory-inefficient, since they need to collect and store a large "reference" batch of samples from time to time. To achieve variance-reduced off-policy-stable policy optimization, we propose an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples. Empirical studies validate the effectiveness of the proposed approaches.

preprint2016arXiv

A Simple Homotopy Proximal Mapping for Compressive Sensing

In this paper, we present a novel yet simple homotopy proximal mapping algorithm for compressive sensing. The algorithm adopts a simple proximal mapping of the $\ell_1$ norm at each iteration and gradually reduces the regularization parameter for the $\ell_1$ norm. We prove global linear convergence of the proposed homotopy proximal mapping (HPM) algorithm for solving compressive sensing under three different settings: (i) sparse signal recovery under noiseless measurements, (ii) sparse signal recovery under noisy measurements, and (iii) nearly-sparse signal recovery under sub-gaussian noisy measurements. In particular, we show that when the measurement matrix satisfies the restricted isometry property (RIP), our theoretical results in settings (i) and (ii) almost recover the best condition on the RIP constants for compressive sensing. In addition, in setting (iii), our results for sparse signal recovery are better than the previous results, and furthermore our analysis explicitly shows that more observations lead to not only more accurate recovery but also faster convergence. Compared with previous studies on linear convergence for sparse signal recovery, our algorithm is simple and efficient, and our results are better and provide more insights. Finally, our empirical studies provide further support for the proposed homotopy proximal mapping algorithm and verify the theoretical results.
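The two ingredients named in this abstract, a proximal mapping of the $\ell_1$ norm at each iteration and a gradually shrinking regularization parameter, can be sketched as follows. This is a minimal illustration, not the authors' HPM algorithm: the function names, the geometric shrink factor, the stage and iteration counts, and the unit step size are all assumptions chosen for the example.

```python
def soft_threshold(v, lam):
    """Proximal mapping of lam * ||.||_1, applied elementwise."""
    out = []
    for t in v:
        if t > lam:
            out.append(t - lam)
        elif t < -lam:
            out.append(t + lam)
        else:
            out.append(0.0)
    return out


def homotopy_prox(A, y, lam0, shrink=0.5, stages=4, inner=5, step=1.0):
    """Proximal gradient steps on 0.5*||Ax - y||^2 + lam*||x||_1,
    shrinking lam after every stage (hypothetical schedule)."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    lam = lam0
    for _ in range(stages):
        for _ in range(inner):
            # residual = A x - y
            residual = [sum(a * xj for a, xj in zip(row, x)) - yi
                        for row, yi in zip(A, y)]
            # gradient of the least-squares term: A^T residual
            grad = [sum(A[i][j] * residual[i] for i in range(m))
                    for j in range(n)]
            x = soft_threshold([xj - step * g for xj, g in zip(x, grad)],
                               step * lam)
        lam *= shrink  # homotopy: continue with a smaller regularizer
    return x


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
estimate = homotopy_prox(identity, [3.0, 0.0, -2.0], lam0=1.0)
```

With an identity measurement matrix, each iterate is simply the soft-thresholded measurement vector, so the estimate moves toward the true sparse signal as the regularization parameter shrinks while the zero entry stays exactly zero.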

preprint2016arXiv

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

We study distributed optimization algorithms for minimizing the average of convex functions. The applications include empirical risk minimization problems in statistical machine learning where the datasets are large and have to be stored on different machines. We design a distributed stochastic variance reduced gradient algorithm that, under certain conditions on the condition number, simultaneously achieves the optimal parallel runtime, amount of communication and rounds of communication among all distributed first-order methods up to constant factors. Our method and its accelerated extension also outperform existing distributed algorithms in terms of the rounds of communication as long as the condition number is not too large compared to the size of data in each machine. We also prove a lower bound for the number of rounds of communication for a broad class of distributed first-order methods including the proposed algorithms in this paper. We show that our accelerated distributed stochastic variance reduced gradient algorithm achieves this lower bound so that it uses the fewest rounds of communication among all distributed first-order algorithms.

preprint2016arXiv

Efficient Non-oblivious Randomized Reduction for Risk Minimization with Improved Excess Risk Guarantee

In this paper, we address learning problems for high dimensional data. Previously, oblivious random projection based approaches that project high dimensional features onto a random subspace have been used in practice for tackling the high-dimensionality challenge in machine learning. Recently, various non-oblivious randomized reduction methods have been developed and deployed for solving many numerical problems such as matrix product approximation, low-rank matrix approximation, etc. However, they are less explored for machine learning tasks, e.g., classification. More seriously, the theoretical analysis of excess risk bounds for risk minimization, an important measure of generalization performance, has not been established for non-oblivious randomized reduction methods. It therefore remains an open problem what the benefit of using them over previous oblivious random projection based approaches is. To tackle these challenges, we propose an algorithmic framework for employing non-oblivious randomized reduction methods for general empirical risk minimization in machine learning tasks, where the original high-dimensional features are projected onto a random subspace that is derived from the data with a small matrix approximation error. We then derive the first excess risk bound for the proposed non-oblivious randomized reduction approach without requiring strong assumptions on the training data. The established excess risk bound shows that the proposed approach provides much better generalization performance, and it also sheds more light on different randomized reduction approaches. Finally, we conduct extensive experiments on both synthetic and real-world benchmark datasets, whose dimension scales to $O(10^7)$, to demonstrate the efficacy of our proposed approach.

preprint2016arXiv

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than $O(1/ε)$

In this paper, we develop a novel homotopy smoothing (HOPS) algorithm for solving a family of non-smooth problems that is composed of a non-smooth term with an explicit max-structure and a smooth term or a simple non-smooth term whose proximal mapping is easy to compute. The best known iteration complexity for solving such non-smooth optimization problems is $O(1/ε)$ without any assumption on the strong convexity. In this work, we show that the proposed HOPS achieves a lower iteration complexity of $\widetilde O(1/ε^{1-θ})$ (where $\widetilde O(\cdot)$ suppresses a logarithmic factor) with $θ\in(0,1]$ capturing the local sharpness of the objective function around the optimal solutions. To the best of our knowledge, this is the lowest iteration complexity achieved so far for the considered non-smooth optimization problems without strong convexity assumption. The HOPS algorithm employs Nesterov's smoothing technique and Nesterov's accelerated gradient method and runs in stages, gradually decreasing the smoothing parameter in a stage-wise manner until it yields a sufficiently good approximation of the original function. We show that HOPS enjoys linear convergence for many well-known non-smooth problems (e.g., empirical risk minimization with a piecewise linear loss function and $\ell_1$ norm regularizer, finding a point in a polyhedron, cone programming, etc.). Experimental results verify the effectiveness of HOPS in comparison with Nesterov's smoothing algorithm and the primal-dual style of first-order methods.
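To make the smoothing idea concrete, the sketch below smooths the simplest max-structured function, $|x| = \max_{|u|\le 1} ux$, by subtracting $\frac{\mu}{2}u^2$ inside the max, which yields the Huber function with an approximation gap of at most $\mu/2$. It then minimizes $|x - c|$ in stages while halving $\mu$. This is an illustration only: it uses plain gradient descent where HOPS uses Nesterov's accelerated method, and the names `huber`, `huber_grad` and `staged_smoothing` and the halving schedule are assumptions.

```python
def huber(x, mu):
    """Smoothed |x|: max over |u| <= 1 of (u*x - mu*u*u/2)."""
    if abs(x) <= mu:
        return x * x / (2.0 * mu)
    return abs(x) - mu / 2.0


def huber_grad(x, mu):
    """Gradient of the smoothed function; it is (1/mu)-Lipschitz."""
    return max(-1.0, min(1.0, x / mu))


def staged_smoothing(x0, target, mu0=1.0, stages=5, iters=10):
    """Minimize |x - target| through a sequence of smoothed problems,
    warm-starting each stage and halving mu (hypothetical schedule)."""
    x, mu = x0, mu0
    for _ in range(stages):
        for _ in range(iters):
            # gradient step with step size mu = 1/L for the smoothed objective
            x -= mu * huber_grad(x - target, mu)
        mu /= 2.0  # tighter approximation in the next stage
    return x


solution = staged_smoothing(5.0, 1.0)
```

A smaller $\mu$ gives a closer approximation but a larger gradient Lipschitz constant $1/\mu$, which is why starting with a coarse smoothing and tightening it stage by stage can be cheaper than fixing a small $\mu$ from the start.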

preprint2016arXiv

Hybrid-DCA: A Double Asynchronous Approach for Stochastic Dual Coordinate Ascent

In prior works, stochastic dual coordinate ascent (SDCA) has been parallelized in a multi-core environment where the cores communicate through shared memory, or in a multi-processor distributed memory environment where the processors communicate through message passing. In this paper, we propose a hybrid SDCA framework for multi-core clusters, the most common high performance computing environment that consists of multiple nodes each having multiple cores and its own shared memory. We distribute data across nodes where each node solves a local problem in an asynchronous parallel fashion on its cores, and then the local updates are aggregated via an asynchronous across-node update scheme. The proposed double asynchronous method converges to a global solution for $L$-Lipschitz continuous loss functions, and at a linear convergence rate if a smooth convex loss function is used. Extensive empirical comparison has shown that our algorithm scales better than the best known shared-memory methods and runs faster than previous distributed-memory methods. Big datasets, such as one of 280 GB from the LIBSVM repository, cannot be accommodated on a single node and hence cannot be solved by a parallel algorithm. For such a dataset, our hybrid algorithm takes 30 seconds to achieve a duality gap of $10^{-6}$ on 16 nodes each using 8 cores, which is significantly faster than the best known distributed algorithms, such as CoCoA+, that take more than 300 seconds on 16 nodes.

preprint2016arXiv

Improved Dropout for Shallow and Deep Learning

Dropout has achieved great success in training deep neural networks by independently zeroing out the outputs of neurons at random. It has also received a surge of interest for shallow learning, e.g., logistic regression. However, the independent sampling for dropout could be suboptimal for the sake of convergence. In this paper, we propose to use multinomial sampling for dropout, i.e., sampling features or neurons according to a multinomial distribution with different probabilities for different features/neurons. To derive the optimal dropout probabilities, we analyze shallow learning with multinomial dropout and establish the risk bound for stochastic optimization. By minimizing a sampling dependent factor in the risk bound, we obtain a distribution-dependent dropout with sampling probabilities dependent on the second order statistics of the data distribution. To tackle the issue of evolving distribution of neurons in deep learning, we propose an efficient adaptive dropout (named evolutional dropout) that computes the sampling probabilities on-the-fly from a mini-batch of examples. Empirical studies on several benchmark datasets demonstrate that the proposed dropouts achieve not only much faster convergence but also a smaller testing error than the standard dropout. For example, on the CIFAR-100 data, the evolutional dropout achieves relative improvements over 10% on the prediction performance and over 50% on the convergence speed compared to the standard dropout.
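A rough sketch of multinomial dropout with probabilities estimated from a mini-batch is given below. The abstract only says the probabilities depend on second order statistics; the specific choice here (probability proportional to the square root of each feature's mini-batch second moment) and the mask scaling `count / (k * p)` are assumptions for illustration, as are all function names.

```python
import math
import random


def dropout_probabilities(batch):
    """Assumed rule: p_i proportional to sqrt(mean of x_i^2 over the batch)."""
    d = len(batch[0])
    second = [sum(row[i] ** 2 for row in batch) / len(batch) for i in range(d)]
    roots = [math.sqrt(s) for s in second]
    total = sum(roots)
    if total == 0.0:
        return [1.0 / d] * d  # fall back to uniform sampling
    return [r / total for r in roots]


def multinomial_mask(probs, k, rng):
    """Draw k feature indices with replacement and rescale the counts so
    that each sampled feature's mask entry has expectation 1."""
    counts = [0] * len(probs)
    for i in rng.choices(range(len(probs)), weights=probs, k=k):
        counts[i] += 1
    return [c / (k * p) if p > 0 else 0.0 for c, p in zip(counts, probs)]


batch = [[1.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
probs = dropout_probabilities(batch)
mask = multinomial_mask(probs, k=6, rng=random.Random(0))
masked_example = [m * x for m, x in zip(mask, batch[0])]
```

Recomputing `probs` for every mini-batch is what makes the scheme adaptive: features (or neuron outputs) with larger magnitude in the current batch are sampled more often.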

preprint2016arXiv

Learning Attributes Equals Multi-Source Domain Generalization

Attributes possess appealing properties and benefit many computer vision problems, such as object recognition, learning with humans in the loop, and image retrieval. Whereas the existing work mainly pursues utilizing attributes for various computer vision problems, we contend that the most basic problem, how to accurately and robustly detect attributes from images, has been left underexplored. In particular, the existing work rarely explicitly tackles the need that attribute detectors should generalize well across different categories, including those previously unseen. Noting that this is analogous to the objective of multi-source domain generalization, if we treat each category as a domain, we provide a novel perspective on attribute detection and propose to gear the techniques in multi-source domain generalization for the purpose of learning cross-category generalizable attribute detectors. We validate our understanding and approach with extensive experiments on four challenging datasets and three different problems.

preprint2016arXiv

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections

We consider stochastic strongly convex optimization with a complex inequality constraint. This complex inequality constraint may lead to computationally expensive projections in algorithmic iterations of the stochastic gradient descent (SGD) methods. To reduce the computation costs pertaining to the projections, we propose an Epoch-Projection Stochastic Gradient Descent (Epro-SGD) method. The proposed Epro-SGD method consists of a sequence of epochs; it applies SGD to an augmented objective function at each iteration within the epoch, and then performs a projection at the end of each epoch. Given a strongly convex optimization problem and a total number of $T$ iterations, Epro-SGD requires only $\log(T)$ projections, and meanwhile attains an optimal convergence rate of $O(1/T)$, both in expectation and with a high probability. To exploit the structure of the optimization problem, we propose a proximal variant of Epro-SGD, namely Epro-ORDA, based on the optimal regularized dual averaging method. We apply the proposed methods on real-world applications; the empirical results demonstrate the effectiveness of our methods.

preprint2016arXiv

Sparse Learning for Large-scale and High-dimensional Data: A Randomized Convex-concave Optimization Approach

In this paper, we develop a randomized algorithm and theory for learning a sparse model from large-scale and high-dimensional data, which is usually formulated as an empirical risk minimization problem with a sparsity-inducing regularizer. Under the assumption that there exists an (approximately) sparse solution with high classification accuracy, we argue that the dual solution is also sparse or approximately sparse. The fact that both primal and dual solutions are sparse motivates us to develop a randomized approach for a general convex-concave optimization problem. Specifically, the proposed approach combines the strength of random projection with that of sparse learning: it utilizes random projection to reduce the dimensionality, and introduces $\ell_1$-norm regularization to alleviate the approximation error caused by random projection. Theoretical analysis shows that under favorable conditions, the randomized algorithm can accurately recover the optimal solutions to the convex-concave optimization problem (i.e., recover both the primal and dual solutions).

preprint2016arXiv

Stochastic subGradient Methods with Linear Convergence for Polyhedral Convex Optimization

In this paper, we show that simple Stochastic subGradient Descent methods with multiple Restarting, named RSGD, can achieve a linear convergence rate for a class of non-smooth and non-strongly convex optimization problems where the epigraph of the objective function is a polyhedron, to which we refer as polyhedral convex optimization. Its applications in machine learning include $\ell_1$ constrained or regularized piecewise linear loss minimization and submodular function minimization. To the best of our knowledge, this is the first result on the linear convergence rate of stochastic subgradient methods for non-smooth and non-strongly convex optimization problems.
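The restarting idea can be illustrated on the one-dimensional polyhedral function $f(x) = |x - 3|$. The sketch below runs subgradient steps with a constant step size within each stage, restarts from the averaged iterate, and halves the step size between stages. It is not the authors' RSGD: it uses an exact subgradient instead of a stochastic one so that the run is reproducible, and the halving rule, stage count and iteration count are assumptions.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def restarted_subgradient(subgrad, x0, eta0=1.0, stages=10, iters=10):
    """Subgradient descent with restarts: each stage uses a constant step,
    then restarts from the stage average with half the step size."""
    x, eta = x0, eta0
    for _ in range(stages):
        total = 0.0
        for _ in range(iters):
            x = x - eta * subgrad(x)
            total += x
        x = total / iters  # restart point: average of this stage's iterates
        eta /= 2.0         # assumed schedule: halve the step size
    return x


# f(x) = |x - 3| has subgradient sign(x - 3); its epigraph is a polyhedron.
x_final = restarted_subgradient(lambda v: sign(v - 3.0), x0=0.0)
```

In this toy run the distance to the minimizer is bounded by a quantity that halves with each stage, which is the geometric (linear) decay the abstract refers to; a single run with a fixed step size would instead stall at an error proportional to that step.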

preprint2016arXiv

Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient

This work focuses on dynamic regret of online convex optimization that compares the performance of online learning to a clairvoyant who knows the sequence of loss functions in advance and hence selects the minimizer of the loss function at each step. By assuming that the clairvoyant moves slowly (i.e., the minimizers change slowly), we present several improved variation-based upper bounds of the dynamic regret under the true and noisy gradient feedback, which are optimal in light of the presented lower bounds. The key to our analysis is to explore a regularity metric that measures the temporal changes in the clairvoyant's minimizers, to which we refer as path variation. Firstly, we present a general lower bound in terms of the path variation, and then show that under full information or gradient feedback we are able to achieve an optimal dynamic regret. Secondly, we present a lower bound with noisy gradient feedback and then show that we can achieve optimal dynamic regrets under a stochastic gradient feedback and two-point bandit feedback. Moreover, for a sequence of smooth loss functions that admit a small variation in the gradients, our dynamic regret under the two-point bandit feedback matches what is achieved with full information.

preprint2016arXiv

Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

Recently, stochastic momentum methods have been widely adopted in training deep neural networks. However, their convergence analysis is still underexplored at the moment, in particular for non-convex optimization. This paper fills the gap between practice and theory by developing a basic convergence analysis of two stochastic momentum methods, namely the stochastic heavy-ball method and the stochastic variant of Nesterov's accelerated gradient method. We hope that the basic convergence results developed in this paper can serve as a reference for the convergence of stochastic momentum methods and also serve as baselines for comparison in future development of stochastic momentum methods. The novelty of the convergence analysis presented in this paper is a unified framework, revealing more insights about the similarities and differences between different stochastic momentum methods and the stochastic gradient method. The unified framework exhibits a continuous change from the gradient method to Nesterov's accelerated gradient method and finally the heavy-ball method incurred by a free parameter, which can help explain a similar change observed in the testing error convergence behavior for deep learning. Furthermore, our empirical results for optimizing deep neural networks demonstrate that the stochastic variant of Nesterov's accelerated gradient method achieves a good tradeoff (between speed of convergence in training error and robustness of convergence in testing error) among the three stochastic methods.
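One way to write a single update that moves between these methods with one free parameter is sketched below. The parameter name `s`, the auxiliary sequence `ys` and the exact form of the update are assumptions for illustration; the abstract itself only states that such a parameter exists. For this form, `s = 0` reduces algebraically to the heavy-ball update, `s = 1` to Nesterov's accelerated gradient update, and `s = 1 / (1 - beta)` to a plain gradient step of size `alpha / (1 - beta)`. A deterministic gradient of $f(x) = x^2/2$ stands in for the stochastic gradient so the comparison is reproducible.

```python
def run_unified(x0, grad, alpha, beta, s, steps):
    """Unified momentum sketch (assumed form):
    y_next  = x - alpha * g
    ys_next = x - s * alpha * g
    x_next  = y_next + beta * (ys_next - ys_prev)
    """
    x, ys_prev = x0, x0
    for _ in range(steps):
        g = grad(x)
        y_next = x - alpha * g
        ys_next = x - s * alpha * g
        x = y_next + beta * (ys_next - ys_prev)
        ys_prev = ys_next
    return x


def run_heavy_ball(x0, grad, alpha, beta, steps):
    """Classical heavy-ball: gradient step plus beta * (x_k - x_{k-1})."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x_prev, x = x, x - alpha * grad(x) + beta * (x - x_prev)
    return x


def quadratic_grad(x):
    """Gradient of f(x) = x^2 / 2."""
    return x


heavy_ball_like = run_unified(1.0, quadratic_grad, 0.1, 0.9, s=0.0, steps=200)
nesterov_like = run_unified(1.0, quadratic_grad, 0.1, 0.9, s=1.0, steps=200)
gradient_like = run_unified(1.0, quadratic_grad, 0.1, 0.5, s=2.0, steps=5)
```

Sweeping `s` between these values gives a family of updates that share the same step size and momentum constant, which is convenient for comparing the methods under one analysis.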

preprint2015arXiv

An Explicit Sampling Dependent Spectral Error Bound for Column Subset Selection

In this paper, we consider the problem of column subset selection. We present a novel analysis of the spectral norm reconstruction for a simple randomized algorithm and establish a new bound that depends explicitly on the sampling probabilities. The sampling dependent error bound (i) allows us to better understand the tradeoff in the reconstruction error due to sampling probabilities, (ii) exhibits more insights than existing error bounds that exploit specific probability distributions, and (iii) implies better sampling distributions. In particular, we show that a sampling distribution with probabilities proportional to the square root of the statistical leverage scores is always better than uniform sampling and is better than leverage-based sampling when the statistical leverage scores are very nonuniform. Moreover, by solving a constrained optimization problem related to the error bound with an efficient bisection search, we are able to achieve better performance than using either the leverage-based distribution or that proportional to the square root of the statistical leverage scores. Numerical simulations demonstrate the benefits of the new sampling distributions for low-rank matrix approximation and least-squares approximation compared to state-of-the-art algorithms.

preprint2015arXiv

Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees

In this paper, we study a fast approximation method for large-scale high-dimensional sparse least-squares regression problems by exploiting the Johnson-Lindenstrauss (JL) transforms, which embed a set of high-dimensional vectors into a low-dimensional space. In particular, we propose to apply the JL transforms to the data matrix and the target vector and then to solve a sparse least-squares problem on the compressed data with a slightly larger regularization parameter. Theoretically, we establish the optimization error bound of the learned model for two different sparsity-inducing regularizers, i.e., the elastic net and the $\ell_1$ norm. Compared with previous relevant work, our analysis is non-asymptotic and exhibits more insights on the bound, the sample complexity and the regularization. As an illustration, we also provide an error bound of the Dantzig selector under JL transforms.

preprint2015arXiv

On Data Preconditioning for Regularized Loss Minimization

In this work, we study data preconditioning, a well-known and long-existing technique, for boosting the convergence of first-order methods for regularized loss minimization. It is well understood that the condition number of the problem, i.e., the ratio of the Lipschitz constant to the strong convexity modulus, has a harsh effect on the convergence of first-order optimization methods. Therefore, minimizing a small regularized loss for achieving good generalization performance, which yields an ill-conditioned problem, becomes the bottleneck for big data problems. We provide a theory on data preconditioning for regularized loss minimization. In particular, our analysis exhibits an appropriate data preconditioner and characterizes the conditions on the loss function and on the data under which data preconditioning can reduce the condition number and therefore boost the convergence for minimizing the regularized loss. To make data preconditioning practically useful, we employ and analyze a random sampling approach to efficiently compute the preconditioned data. The preliminary experiments validate our theory.

preprint2015arXiv

Online Stochastic Linear Optimization under One-bit Feedback

In this paper, we study a special bandit setting of online stochastic linear optimization, where only one bit of information is revealed to the learner at each round. This problem has found many applications including online advertisement and online recommendation. We assume the binary feedback is a random variable generated from the logit model, and aim to minimize the regret defined by the unknown linear function. Although the existing method for generalized linear bandits can be applied to our problem, the high computational cost makes it impractical for real-world problems. To address this challenge, we develop an efficient online learning algorithm by exploiting particular structures of the observation model. Specifically, we adopt the online Newton step to estimate the unknown parameter and derive a tight confidence region based on the exponential concavity of the logistic loss. Our analysis shows that the proposed algorithm achieves a regret bound of $O(d\sqrt{T})$, which matches the optimal result of stochastic linear bandits.

preprint2015arXiv

Stochastic Proximal Gradient Descent for Nuclear Norm Regularization

In this paper, we utilize stochastic optimization to reduce the space complexity of convex composite optimization with a nuclear norm regularizer, where the variable is a matrix of size $m \times n$. By constructing a low-rank estimate of the gradient, we propose an iterative algorithm based on stochastic proximal gradient descent (SPGD), and take the last iterate of SPGD as the final solution. The main advantage of the proposed algorithm is that its space complexity is $O(m+n)$; in contrast, most previous algorithms have $O(mn)$ space complexity. Theoretical analysis shows that it achieves $O(\log T/\sqrt{T})$ and $O(\log T/T)$ convergence rates for general convex functions and strongly convex functions, respectively.

preprint2015arXiv

Theory of Dual-sparse Regularized Randomized Reduction

In this paper, we study randomized reduction methods, which reduce high-dimensional features into low-dimensional space by randomized methods (e.g., random projection, random hashing), for large-scale high-dimensional classification. Previous theoretical results on randomized reduction methods hinge on strong assumptions about the data, e.g., low rank of the data matrix or a large separable margin of classification, which hinder their applications in broad domains. To address these limitations, we propose dual-sparse regularized randomized reduction methods that introduce a sparse regularizer into the reduced dual problem. Under a mild condition that the original dual solution is a (nearly) sparse vector, we show that the resulting dual solution is close to the original dual solution and concentrates on its support set. In numerical experiments, we present an empirical study to support the analysis and we also present a novel application of the dual-sparse regularized randomized reduction methods to reducing the communication cost of distributed learning from large-scale high-dimensional data.

preprint2014arXiv

Analysis of Distributed Stochastic Dual Coordinate Ascent

In (Yang, NIPS 2013), the author presented distributed stochastic dual coordinate ascent (DisDCA) algorithms for solving large-scale regularized loss minimization. Extraordinary performance has been observed and reported for the well-motivated updates, referred to as the practical updates, compared to the naive updates. However, no serious analysis has been provided to understand the updates and therefore the convergence rates. In this paper, we bridge the gap by providing a theoretical analysis of the convergence rates of the practical DisDCA algorithm. Our analysis, supported by empirical studies, shows that it could yield an exponential speed-up in the convergence by increasing the number of dual updates at each iteration. This result justifies the superior performance of the practical DisDCA as compared to the naive variant. As a byproduct, our analysis also reveals the convergence behavior of the one-communication DisDCA.

preprint2014arXiv

Object-centric Sampling for Fine-grained Image Classification

This paper proposes to go beyond the state-of-the-art deep convolutional neural network (CNN) by incorporating the information from object detection, focusing on dealing with fine-grained image classification. Unfortunately, CNN suffers from overfitting when it is trained on existing fine-grained image classification benchmarks, which typically consist of fewer than a few tens of thousands of training images. Therefore, we first construct a large-scale fine-grained car recognition dataset that consists of 333 car classes with more than 150 thousand training images. With this large-scale dataset, we are able to build a strong baseline for CNN with top-1 classification accuracy of 81.6%. One major challenge in fine-grained image classification is that many classes are very similar to each other while having large within-class variation. One contributing factor to the within-class variation is cluttered image background. However, the existing CNN training takes uniform window sampling over the image, which is blind to the location of the object of interest. In contrast, this paper proposes an object-centric sampling (OCS) scheme that samples image windows based on the object location information. The challenge in using the location information lies in how to design a powerful object detector and how to handle the imperfection of detection results. To that end, we design a saliency-aware object detection approach specific to the setting of fine-grained image classification, and the uncertainty of detection results is naturally handled in our OCS scheme. Our framework is demonstrated to be very effective, improving top-1 accuracy to 89.3% (from 81.6%) on the large-scale fine-grained car classification dataset.

preprint2014arXiv

Recovering the Optimal Solution by Dual Random Projection

Random projection has been widely used in data classification. It maps high-dimensional data into a low-dimensional subspace in order to reduce the computational cost in solving the related optimization problem. While previous studies are focused on analyzing the classification performance of using random projection, in this work, we consider the recovery problem, i.e., how to accurately recover the optimal solution to the original optimization problem in the high-dimensional space based on the solution learned from the subspace spanned by random projections. We present a simple algorithm, termed Dual Random Projection, that uses the dual solution of the low-dimensional optimization problem to recover the optimal solution to the original problem. Our theoretical analysis shows that with a high probability, the proposed algorithm is able to accurately recover the optimal solution to the original problem, provided that the data matrix is of low rank or can be well approximated by a low rank matrix.

preprint2013arXiv

A New Analysis of Compressive Sensing by Stochastic Proximal Gradient Descent

In this manuscript, we analyze the sparse signal recovery (compressive sensing) problem from the perspective of convex optimization by stochastic proximal gradient descent. This view allows us to significantly simplify the recovery analysis of compressive sensing. More importantly, it leads to an efficient optimization algorithm for solving the regularized optimization problem related to the sparse recovery problem. Compared to the existing approaches, there are two advantages of the proposed algorithm. First, it enjoys a geometric convergence rate and therefore is computationally efficient. Second, it guarantees that the support set of any intermediate solution generated by the proposed algorithm is concentrated on the support set of the optimal solution.

preprint2013arXiv

An Efficient Primal-Dual Prox Method for Non-Smooth Optimization

We study the non-smooth optimization problems in machine learning, where both the loss function and the regularizer are non-smooth functions. Previous studies on efficient empirical loss minimization assume either a smooth loss function or a strongly convex regularizer, making them unsuitable for non-smooth optimization. We develop a simple yet efficient method for a family of non-smooth optimization problems where the dual form of the loss function is bilinear in primal and dual variables. We cast a non-smooth optimization problem into a minimax optimization problem, and develop a primal-dual prox method that solves the minimax optimization problem at a rate of $O(1/T)$, assuming that the proximal step can be efficiently solved, significantly faster than a standard subgradient descent method that has an $O(1/\sqrt{T})$ convergence rate. Our empirical study verifies the efficiency of the proposed method for various non-smooth optimization problems that arise ubiquitously in machine learning by comparing it to the state-of-the-art first order methods.

preprint2013arXiv

Online Stochastic Optimization with Multiple Objectives

In this paper, we propose a general framework to characterize and solve the stochastic optimization problems with multiple objectives underlying many real-world learning applications. We first propose a projection-based algorithm which attains an $O(T^{-1/3})$ convergence rate. Then, by leveraging Lagrangian theory in constrained optimization, we devise a novel primal-dual stochastic approximation algorithm which attains the optimal convergence rate of $O(T^{-1/2})$ for general Lipschitz continuous objectives.

preprint2013arXiv

Sparse Multiple Kernel Learning with Geometric Convergence Rate

In this paper, we study the problem of sparse multiple kernel learning (MKL), where the goal is to efficiently learn a combination of a fixed small number of kernels from a large pool that could lead to a kernel classifier with a small prediction error. We develop an efficient algorithm, based on the greedy coordinate descent algorithm, that is able to achieve a geometric convergence rate under appropriate conditions. The convergence rate is achieved by measuring the size of functional gradients by an empirical $\ell_2$ norm that depends on the empirical data distribution. This is in contrast to previous algorithms that use a functional norm to measure the size of gradients, which is independent of the data samples. We also establish a generalization error bound of the learned sparse kernel classifier using the technique of local Rademacher complexity.

preprint2012arXiv

A Bayesian Framework for Community Detection Integrating Content and Link

This paper addresses the problem of community detection in networked data that combines link and content analysis. Most existing work combines link and content information by a generative model. There are two major shortcomings with the existing approaches. First, they assume that the probability of creating a link between two nodes is determined only by the community memberships of the nodes; however, other factors (e.g., popularity) could also affect the link pattern. Second, they use generative models to model the content of individual nodes, whereas these generative models are vulnerable to the content attributes that are irrelevant to communities. We propose a Bayesian framework for combining link and content information for community detection that explicitly addresses these shortcomings. A new link model is presented that introduces a random variable to capture the node popularity when deciding the link between two nodes; a discriminative model is used to determine the community membership of a node by its content. An approximate inference algorithm is presented for efficient Bayesian inference. Our empirical study shows that the proposed framework outperforms several state-of-the-art approaches in combining link and content information for community detection.

preprint2012arXiv

A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound

In this work, we develop a simple algorithm for semi-supervised regression. The key idea is to use the top eigenfunctions of the integral operator derived from both labeled and unlabeled examples as the basis functions and learn the prediction function by a simple linear regression. We show that under appropriate assumptions about the integral operator, this approach is able to achieve a regression error bound better than existing bounds for supervised learning. We also verify the effectiveness of the proposed algorithm by an empirical study.

preprint2012arXiv

An Improved Bound for the Nystrom Method for Large Eigengap

We develop an improved bound for the approximation error of the Nyström method under the assumption that there is a large eigengap in the spectrum of the kernel matrix. This is based on the empirical observation that the eigengap has a significant impact on the approximation error of the Nyström method. Our approach is based on the concentration inequality of the integral operator and the theory of matrix perturbation. Our analysis shows that when there is a large eigengap, we can improve the approximation error of the Nyström method from $O(N/m^{1/4})$ to $O(N/m^{1/2})$ when measured in Frobenius norm, where $N$ is the size of the kernel matrix, and $m$ is the number of sampled columns.

preprint2012arXiv

Efficient Constrained Regret Minimization

Online learning constitutes a compelling mathematical framework for analyzing sequential decision making problems in adversarial environments. The learner repeatedly chooses an action, the environment responds with an outcome, and then the learner receives a reward for the played action. The goal of the learner is to maximize the total reward. However, there are situations in which, in addition to maximizing the cumulative reward, there are some additional constraints on the sequence of decisions that must be satisfied on average by the learner. In this paper we study an extension of online learning where the learner aims to maximize the total reward given that some additional constraints need to be satisfied. By leveraging the theory of the Lagrangian method in constrained optimization, we propose the Lagrangian exponentially weighted average (LEWA) algorithm, which is a primal-dual variant of the well-known exponentially weighted average algorithm, to efficiently solve constrained online decision making problems. Using novel theoretical analysis, we establish bounds on the regret and on the violation of the constraints in the full information and bandit feedback models.
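
For orientation, the unconstrained exponentially weighted average algorithm that LEWA builds on can be sketched in a few lines. This is the standard full-information version only; it does not include the Lagrangian (primal-dual) part described in the abstract. The function name, the reward format, and the learning rate `eta` are assumptions chosen for illustration.

```python
import math


def exponentially_weighted_average(reward_rounds, eta):
    """Return the action distribution played in each round.

    Illustrative sketch: reward_rounds is a list of per-round reward
    lists (one reward per action, revealed after the learner plays), and
    eta > 0 is the learning rate.
    """
    n_actions = len(reward_rounds[0])
    cumulative = [0.0] * n_actions
    history = []
    for rewards in reward_rounds:
        # Weight each action by exp(eta * cumulative reward); subtracting
        # the maximum keeps the exponentials numerically stable.
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        total = sum(weights)
        history.append([w / total for w in weights])
        # Full-information feedback: every action's reward is observed.
        cumulative = [c + r for c, r in zip(cumulative, rewards)]
    return history
```

The learner starts from the uniform distribution and shifts probability toward actions with a higher cumulative reward; a larger `eta` makes that shift faster.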

preprint2012arXiv

Improved Bound for the Nystrom's Method and its Application to Kernel Classification

We develop two approaches for analyzing the approximation error bound for the Nyström method, one based on the concentration inequality of integral operator, and one based on the compressive sensing theory. We show that the approximation error, measured in the spectral norm, can be improved from $O(N/\sqrt{m})$ to $O(N/m^{1 - \rho})$ in the case of large eigengap, where $N$ is the total number of data points, $m$ is the number of sampled data points, and $\rho \in (0, 1/2)$ is a positive constant that characterizes the eigengap. When the eigenvalues of the kernel matrix follow a $p$-power law, our analysis based on compressive sensing theory further improves the bound to $O(N/m^{p - 1})$ under an incoherence assumption, which explains why the Nyström method works well for kernel matrices with skewed eigenvalues. We present a kernel classification approach based on the Nyström method and derive its generalization performance using the improved bound. We show that when the eigenvalues of the kernel matrix follow a $p$-power law, we can reduce the number of support vectors to $N^{2p/(p^2 - 1)}$, a number less than $N$ when $p > 1+\sqrt{2}$, without seriously sacrificing its generalization performance.
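
The threshold $p > 1+\sqrt{2}$ quoted for the support-vector count follows from a short calculation, written out here under the assumptions $N > 1$ and $p > 1$ (so that $p^2 - 1 > 0$):

```latex
\begin{align*}
N^{2p/(p^2-1)} < N
  &\iff \frac{2p}{p^2 - 1} < 1 \\
  &\iff 2p < p^2 - 1 \\
  &\iff p^2 - 2p - 1 > 0 \\
  &\iff p > 1 + \sqrt{2},
\end{align*}
```

where the last step uses the roots $p = 1 \pm \sqrt{2}$ of $p^2 - 2p - 1$ and discards the negative one because $p > 1$.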

preprint2012arXiv

Influence Analysis in the Blogosphere

In this paper we analyze influence in the blogosphere. Recently, influence analysis has become an increasingly important research topic, as online communities, such as social networks and e-commerce sites, play an increasingly significant role in our daily life. However, so far few studies have succeeded in extracting influence from online communities in a satisfactory way. One of the challenges that limited previous research is that it is difficult to capture user behaviors. Consequently, the influence among users could only be inferred in an indirect and heuristic way, which is inaccurate and noise-prone. In this study, we conduct an extensive investigation of influence among bloggers at a Japanese blog web site, BIGLOBE. By processing the log files of the web servers, we are able to accurately extract the activities of BIGLOBE members in terms of writing their blog posts and reading other members' posts. Based on these activities, we propose a principled framework to detect influence among the members with a high confidence level. From the extracted influence, we conduct in-depth analysis of how influence varies over different topics and over different members. We also show the potential of leveraging the extracted influence to make personalized recommendations in BIGLOBE. To the best of our knowledge, this is one of the first studies to capture and analyze influence in the blogosphere at such a large scale.

preprint2012arXiv

Multiple Kernel Learning from Noisy Labels by Stochastic Programming

We study the problem of multiple kernel learning from noisy labels. This is in contrast to most of the previous studies on multiple kernel learning that mainly focus on developing efficient algorithms and assume perfectly labeled training examples. Directly applying the existing multiple kernel learning algorithms to noisily labeled examples often leads to suboptimal performance due to the incorrect class assignments. We address this challenge by casting multiple kernel learning from noisy labels into a stochastic programming problem, and presenting a minimax formulation. We develop an efficient algorithm for solving the related convex-concave optimization problem with a fast convergence rate of $O(1/T)$ where $T$ is the number of iterations. Empirical studies on UCI data sets verify both the effectiveness of the proposed framework and the efficiency of the proposed optimization algorithm.

preprint2012arXiv

Regret Bound by Variation for Online Convex Optimization

In \citep{Hazan-2008-extract}, the authors showed that the regret of online linear optimization can be bounded by the total variation of the cost vectors. In this paper, we extend this result to general online convex optimization. We first analyze the limitations of the algorithm in \citep{Hazan-2008-extract} when applied to online convex optimization. We then present two algorithms for online convex optimization whose regrets are bounded by the variation of cost functions. We finally consider the bandit setting, and present a randomized algorithm for online bandit convex optimization with a variation-based regret bound. We show that the regret bound for online bandit convex optimization is optimal when the variation of cost functions is independent of the number of trials.

preprint2012arXiv

Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

In this paper we propose a framework for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set $\mathcal{K}$ from which the decisions are made. While for simple shapes (e.g. Euclidean ball) the projection is straightforward, for arbitrary complex sets this is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to $\mathcal{K}$ for all rounds, we only require that the constraints which define the set $\mathcal{K}$ be satisfied in the long run. We show that our framework can be utilized to solve a relaxed version of online learning with side constraints addressed in \cite{DBLP:conf/colt/MannorT06} and \cite{DBLP:conf/aaai/KvetonYTM08}. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound and $\tilde{\mathcal{O}}(T^{3/4})$ bound for the violation of constraints. Then we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting $\tilde{\mathcal{O}}(T^{3/4})$ regret bound. Our second algorithm is based on the Mirror Prox method \citep{nemirovski-2005-prox} to solve variational inequalities which achieves $\tilde{\mathcal{O}}(T^{2/3})$ bound for both regret and the violation of constraints when the domain $\mathcal{K}$ can be described by a finite number of linear constraints. Finally, we extend the result to the setting where we only have partial access to the convex set $\mathcal{K}$ and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm.
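
The convex-concave reformulation described in this abstract can be pictured with a small one-dimensional sketch. It is not the paper's algorithm or analysis: the regularized Lagrangian $f_t(x) + \lambda g(x) - (\delta\eta/2)\lambda^2$, the single constraint $g(x) \le 0$, the interval used as the "simple" projection set, and every parameter name below are assumptions made for illustration.

```python
def primal_dual_ogd(loss_grads, g, g_grad, radius, eta, delta, x0=0.0):
    """Online gradient descent-ascent on a regularized Lagrangian.

    Illustrative sketch: loss_grads is a list of functions x -> f_t'(x),
    g is the constraint function (feasible when g(x) <= 0), and the
    primal iterate is projected only onto the simple set [-radius, radius]
    instead of the constraint set itself.
    """
    x, lam = x0, 0.0
    played = []
    for grad_f in loss_grads:
        played.append(x)
        # Gradients of L_t(x, lam) = f_t(x) + lam * g(x) - (delta*eta/2) * lam^2.
        grad_x = grad_f(x) + lam * g_grad(x)
        grad_lam = g(x) - delta * eta * lam
        # Descent step in x with a cheap projection onto an interval.
        x = max(-radius, min(radius, x - eta * grad_x))
        # Ascent step in the multiplier, kept nonnegative.
        lam = max(0.0, lam + eta * grad_lam)
    return played, lam
```

The multiplier grows while the constraint is violated and pushes later decisions back toward the feasible region, so the constraint is enforced over time rather than by projecting onto the constraint set in every round.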

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arxiv (confidence 95%)

external id: arxiv:2412.15206:author:9:tianbao-yang

Imported May 21, 2026. Synced May 21, 2026.

arxiv (confidence 95%)

external id: arxiv:2605.03832:author:4:tianbao-yang

Imported May 20, 2026. Synced May 21, 2026.

arxiv (confidence 95%)

external id: arxiv:2605.02116:author:3:tianbao-yang

Imported May 20, 2026. Synced May 21, 2026.

arxiv (confidence 95%)

external id: arxiv:2605.03866:author:4:tianbao-yang

Imported May 20, 2026. Synced May 21, 2026.

23 works

Rong Jin

Researcher

16 works

Lijun Zhang

Researcher

11 works

Shenghuo Zhu

Researcher

10 works

Mehrdad Mahdavi

Researcher

55 published item(s)

A Domain Incremental Continual Learning Benchmark for ICU Time Series Model Transportability

AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving

Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Memory-Efficient Continual Learning with CLIP Models

Statistical Consistency and Generalization of Contrastive Representation Learning

AUC Maximization in the Era of Big Data and AI: A Survey

Benchmarking Deep AUROC Optimization: Loss Functions and Algorithmic Choices

GraphFM: Improving Large-Scale GNN Training via Feature Momentum

Momentum Accelerates the Convergence of Stochastic AUPRC Maximization

Multi-block-Single-probe Variance Reduced Estimator for Coupled Compositional Optimization

A Data Efficient and Feasible Level Set Method for Stochastic Convex Optimization with Expectation Constraints

A Simple and Effective Framework for Pairwise Deep Metric Learning

Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition

Minimizing Dynamic Regret and Adaptive Regret Simultaneously

Nearly Optimal Robust Method for Convex Compositional Problems with Heavy-Tailed Noise

Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization

Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives

Stochastic AUC Maximization with Deep Neural Networks

Stochastic Optimization for Non-convex Inf-Projection Problems

Variance-Reduced Off-Policy Memory-Efficient Policy Search

A Simple Homotopy Proximal Mapping for Compressive Sensing

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

Efficient Non-oblivious Randomized Reduction for Risk Minimization with Improved Excess Risk Guarantee

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than $O(1/\epsilon)$

Hybrid-DCA: A Double Asynchronous Approach for Stochastic Dual Coordinate Ascent

Improved Dropout for Shallow and Deep Learning

Learning Attributes Equals Multi-Source Domain Generalization

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections

Sparse Learning for Large-scale and High-dimensional Data: A Randomized Convex-concave Optimization Approach

Stochastic subGradient Methods with Linear Convergence for Polyhedral Convex Optimization

Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient

Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

An Explicit Sampling Dependent Spectral Error Bound for Column Subset Selection

Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees

On Data Preconditioning for Regularized Loss Minimization

Online Stochastic Linear Optimization under One-bit Feedback

Stochastic Proximal Gradient Descent for Nuclear Norm Regularization

Theory of Dual-sparse Regularized Randomized Reduction

Analysis of Distributed Stochastic Dual Coordinate Ascent

Object-centric Sampling for Fine-grained Image Classification

Recovering the Optimal Solution by Dual Random Projection

A New Analysis of Compressive Sensing by Stochastic Proximal Gradient Descent

An Efficient Primal-Dual Prox Method for Non-Smooth Optimization

Online Stochastic Optimization with Multiple Objectives

Sparse Multiple Kernel Learning with Geometric Convergence Rate

A Bayesian Framework for Community Detection Integrating Content and Link

A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound

An Improved Bound for the Nystrom Method for Large Eigengap

Efficient Constrained Regret Minimization

Improved Bound for the Nystrom's Method and its Application to Kernel Classification

Influence Analysis in the Blogosphere

Multiple Kernel Learning from Noisy Labels by Stochastic Programming

Regret Bound by Variation for Online Convex Optimization

Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints