Source author record

Chaobing Song

Chaobing Song appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning math.OC Information Theory math.IT Artificial Intelligence Computer Vision math.NA Numerical Analysis

Catalog footprint

What is connected

10works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions

We study stochastic monotone inclusion problems, which widely appear in machine learning applications, including robust regression and adversarial learning. We propose novel variants of stochastic Halpern iteration with recursive variance reduction. In the cocoercive -- and more generally Lipschitz-monotone -- setup, our algorithm attains $ε$ norm of the operator with $\mathcal{O}(\frac{1}{ε^3})$ stochastic operator evaluations, which significantly improves over state of the art $\mathcal{O}(\frac{1}{ε^4})$ stochastic operator evaluations required for existing monotone inclusion solvers applied to the same problem classes. We further show how to couple one of the proposed variants of stochastic Halpern iteration with a scheduled restart scheme to solve stochastic monotone inclusion problems with ${\mathcal{O}}(\frac{\log(1/ε)}{ε^2})$ stochastic operator evaluations under additional sharpness or strong monotonicity assumptions.

preprint2022arXiv

A Fast Scale-Invariant Algorithm for Non-negative Least Squares with Non-negative Data

Nonnegative (linear) least square problems are a fundamental class of problems that is well-studied in statistical learning and for which solvers have been implemented in many of the standard programming languages used within the machine learning community. The existing off-the-shelf solvers view the non-negativity constraint in these problems as an obstacle and, compared to unconstrained least squares, perform additional effort to address it. However, in many of the typical applications, the data itself is nonnegative as well, and we show that the nonnegativity in this case makes the problem easier. In particular, while the oracle complexity of unconstrained least squares problems necessarily scales with one of the data matrix constants (typically the spectral norm) and these problems are solved to additive error, we show that nonnegative least squares problems with nonnegative data are solvable to multiplicative error and with complexity that is independent of any matrix constants. The algorithm we introduce is accelerated and based on a primal-dual perspective. We further show how to provably obtain linear convergence using adaptive restart coupled with our method and demonstrate its effectiveness on large-scale data via numerical experiments.

preprint2021arXiv

Optimistic Dual Extrapolation for Coherent Non-monotone Variational Inequalities

The optimization problems associated with training generative adversarial neural networks can be largely reduced to certain {\em non-monotone} variational inequality problems (VIPs), whereas existing convergence results are mostly based on monotone or strongly monotone assumptions. In this paper, we propose {\em optimistic dual extrapolation (OptDE)}, a method that only performs {\em one} gradient evaluation per iteration. We show that OptDE is provably convergent to {\em a strong solution} under different coherent non-monotone assumptions. In particular, when a {\em weak solution} exists, the convergence rate of our method is $O(1/{ε^{2}})$, which matches the best existing result of the methods with two gradient evaluations. Further, when a {\em $σ$-weak solution} exists, the convergence guarantee is improved to the linear rate $O(\log\frac{1}ε)$. Along the way--as a byproduct of our inquiries into non-monotone variational inequalities--we provide the near-optimal $O\big(\frac{1}ε\log \frac{1}ε\big)$ convergence guarantee in terms of restricted strong merit function for monotone variational inequalities. We also show how our results can be naturally generalized to the stochastic setting, and obtain corresponding new convergence results. Taken together, our results contribute to the broad landscape of variational inequality--both non-monotone and monotone alike--by providing a novel and more practical algorithm with the state-of-the-art convergence guarantees.

preprint2021arXiv

Variance Reduction via Accelerated Dual Averaging for Finite-Sum Optimization

In this paper, we introduce a simplified and unified method for finite-sum convex optimization, named \emph{Variance Reduction via Accelerated Dual Averaging (VRADA)}. In both general convex and strongly convex settings, VRADA can attain an $O\big(\frac{1}{n}\big)$-accurate solution in $O(n\log\log n)$ number of stochastic gradient evaluations which improves the best-known result $O(n\log n)$, where $n$ is the number of samples. Meanwhile, VRADA matches the lower bound of the general convex setting up to a $\log\log n$ factor and matches the lower bounds in both regimes $n\le Θ(κ)$ and $n\gg κ$ of the strongly convex setting, where $κ$ denotes the condition number. Besides improving the best-known results and matching all the above lower bounds simultaneously, VRADA has more unified and simplified algorithmic implementation and convergence analysis for both the general convex and strongly convex settings. The underlying novel approaches such as the novel initialization strategy in VRADA may be of independent interest. Through experiments on real datasets, we show the good performance of VRADA over existing methods for large-scale machine learning problems.

preprint2020arXiv

Breaking the $O(1/ε)$ Optimal Rate for a Class of Minimax Problems

It is known that for convex optimization $\min_{\mathbf{w}\in\mathcal{W}}f(\mathbf{w})$, the best possible rate of first order accelerated methods is $O(1/\sqrtε)$. However, for the bilinear minimax problem: $\min_{\mathbf{w}\in\mathcal{W}}\max_{\mathbf{v}\in\mathcal{V}}$ $f(\mathbf{w})$ $+\langle\mathbf{w}, \boldsymbol{A}\mathbf{v}\rangle$ $-h(\mathbf{v})$ where both $f(\mathbf{w})$ and $h(\mathbf{v})$ are convex, the best known rate of first order methods slows down to $O(1/ε)$. It is not known whether one can achieve the accelerated rate $O(1/\sqrtε)$ for the bilinear minimax problem without assuming $f(\mathbf{w})$ and $h(\mathbf{v})$ being strongly convex. In this paper, we fill this theoretical gap by proposing a bilinear accelerated extragradient (BAXG) method. We show that when $\mathcal{W}=\mathbb{R}^d$, $f(\mathbf{w})$ and $h(\mathbf{v})$ are convex and smooth, and $\boldsymbol{A}$ has full column rank, then the BAXG method achieves an accelerated rate $O(1/\sqrtε\log \frac{1}ε)$, within a logarithmic factor to the likely optimal rate $O(1/\sqrtε)$. As result, a large class of bilinear convex concave minimax problems, including a few problems of practical importance, can be solved much faster than previously known methods.

preprint2020arXiv

Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction

To learn intrinsic low-dimensional structures from high-dimensional data that most discriminate between classes, we propose the principle of Maximal Coding Rate Reduction ($\text{MCR}^2$), an information-theoretic measure that maximizes the coding rate difference between the whole dataset and the sum of each individual class. We clarify its relationships with most existing frameworks such as cross-entropy, information bottleneck, information gain, contractive and contrastive learning, and provide theoretical guarantees for learning diverse and discriminative features. The coding rate can be accurately computed from finite samples of degenerate subspace-like distributions and can learn intrinsic representations in supervised, self-supervised, and unsupervised settings in a unified manner. Empirically, the representations learned using this principle alone are significantly more robust to label corruptions in classification than those using cross-entropy, and can lead to state-of-the-art results in clustering mixed data from self-learned invariant features.

preprint2016arXiv

Alternating direction algorithms for $\ell_0$ regularization in compressed sensing

In this paper we propose three iterative greedy algorithms for compressed sensing, called \emph{iterative alternating direction} (IAD), \emph{normalized iterative alternating direction} (NIAD) and \emph{alternating direction pursuit} (ADP), which stem from the iteration steps of alternating direction method of multiplier (ADMM) for $\ell_0$-regularized least squares ($\ell_0$-LS) and can be considered as the alternating direction versions of the well-known iterative hard thresholding (IHT), normalized iterative hard thresholding (NIHT) and hard thresholding pursuit (HTP) respectively. Firstly, relative to the general iteration steps of ADMM, the proposed algorithms have no splitting or dual variables in iterations and thus the dependence of the current approximation on past iterations is direct. Secondly, provable theoretical guarantees are provided in terms of restricted isometry property, which is the first theoretical guarantee of ADMM for $\ell_0$-LS to the best of our knowledge. Finally, they outperform the corresponding IHT, NIHT and HTP greatly when reconstructing both constant amplitude signals with random signs (CARS signals) and Gaussian signals.

preprint2016arXiv

Bayesian linear regression with Student-t assumptions

As an automatic method of determining model complexity using the training data alone, Bayesian linear regression provides us a principled way to select hyperparameters. But one often needs approximation inference if distribution assumption is beyond Gaussian distribution. In this paper, we propose a Bayesian linear regression model with Student-t assumptions (BLRS), which can be inferred exactly. In this framework, both conjugate prior and expectation maximization (EM) algorithm are generalized. Meanwhile, we prove that the maximum likelihood solution is equivalent to the standard Bayesian linear regression with Gaussian assumptions (BLRG). The $q$-EM algorithm for BLRS is nearly identical to the EM algorithm for BLRG. It is showed that $q$-EM for BLRS can converge faster than EM for BLRG for the task of predicting online news popularity.

preprint2016arXiv

Nonextensive information theoretical machine

In this paper, we propose a new discriminative model named \emph{nonextensive information theoretical machine (NITM)} based on nonextensive generalization of Shannon information theory. In NITM, weight parameters are treated as random variables. Tsallis divergence is used to regularize the distribution of weight parameters and maximum unnormalized Tsallis entropy distribution is used to evaluate fitting effect. On the one hand, it is showed that some well-known margin-based loss functions such as $\ell_{0/1}$ loss, hinge loss, squared hinge loss and exponential loss can be unified by unnormalized Tsallis entropy. On the other hand, Gaussian prior regularization is generalized to Student-t prior regularization with similar computational complexity. The model can be solved efficiently by gradient-based convex optimization and its performance is illustrated on standard datasets.

preprint2016arXiv

Unifying Decision Trees Split Criteria Using Tsallis Entropy

The construction of efficient and effective decision trees remains a key topic in machine learning because of their simplicity and flexibility. A lot of heuristic algorithms have been proposed to construct near-optimal decision trees. ID3, C4.5 and CART are classical decision tree algorithms and the split criteria they used are Shannon entropy, Gain Ratio and Gini index respectively. All the split criteria seem to be independent, actually, they can be unified in a Tsallis entropy framework. Tsallis entropy is a generalization of Shannon entropy and provides a new approach to enhance decision trees' performance with an adjustable parameter $q$. In this paper, a Tsallis Entropy Criterion (TEC) algorithm is proposed to unify Shannon entropy, Gain Ratio and Gini index, which generalizes the split criteria of decision trees. More importantly, we reveal the relations between Tsallis entropy with different $q$ and other split criteria. Experimental results on UCI data sets indicate that the TEC algorithm achieves statistically significant improvement over the classical algorithms.

Chaobing Song

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions

A Fast Scale-Invariant Algorithm for Non-negative Least Squares with Non-negative Data

Optimistic Dual Extrapolation for Coherent Non-monotone Variational Inequalities

Variance Reduction via Accelerated Dual Averaging for Finite-Sum Optimization

Breaking the $O(1/ε)$ Optimal Rate for a Class of Minimax Problems

Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction

Alternating direction algorithms for $\ell_0$ regularization in compressed sensing

Bayesian linear regression with Student-t assumptions

Nonextensive information theoretical machine

Unifying Decision Trees Split Criteria Using Tsallis Entropy