Researcher profile

Zhi-Hua Zhou

Zhi-Hua Zhou contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
35works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

35 published item(s)

preprint2026arXiv

AeroSketch: Near-Optimal Time Matrix Sketch Framework for Persistent, Sliding Window, and Distributed Streams

Many real-world matrix datasets arrive as high-throughput vector streams, making it impractical to store or process them in their entirety. To enable real-time analytics under limited computational, memory, and communication resources, matrix sketching techniques have been developed over recent decades to provide compact approximations of such streaming data. Some algorithms have achieved optimal space and communication complexity. However, these approaches often require frequent time-consuming matrix factorization operations. In particular, under tight approximation error bounds, each matrix factorization computation incurs cubic time complexity, thereby limiting their update efficiency. In this paper, we introduce AeroSketch, a novel matrix sketching framework that leverages recent advances in randomized numerical linear algebra (RandNLA). AeroSketch achieves optimal communication and space costs while delivering near-optimal update time complexity (within logarithmic factors) across persistent, sliding window, and distributed streaming scenarios. Extensive experiments on both synthetic and real-world datasets demonstrate that AeroSketch consistently outperforms state-of-the-art methods in update throughput. In particular, under tight approximation error constraints, AeroSketch reduces the cubic time complexity to the quadratic level. Meanwhile, it maintains comparable approximation quality while retaining optimal communication and space costs.

preprint2026arXiv

Non-Parametric Rehearsal Learning via Conditional Mean Embeddings

In machine learning, a critical class of decision-related problems concerns preventing predicted undesirable outcomes, referred to as the \textit{avoiding undesired future} (AUF) problem. To address this, the \textit{rehearsal learning} framework has been proposed to model influence relations for effective decisions. However, existing rehearsal methods rely on restrictive parametric assumptions such as linear systems or additive noise, limiting their practical applicability. In this paper, we propose the first non-parametric rehearsal learning approach for AUF without assuming specific functional forms of data generation processes. Specifically, we use kernel machinery to reformulate the AUF objective into a unified representation that disentangles desirability modeling from action-induced distributional changes. To handle the discontinuity of desirability indicator, we present a smooth Probit surrogate and provide an approximation error bound. Meanwhile, we capture the action-induced changes via conditional mean embeddings, and develop a kernel ridge regression based nested estimator for AUF objective with consistency guarantees. Such a formulation naturally accommodates nonlinear systems and non-additive noise, and empirical results on synthetic and real-data-derived semi-synthetic benchmarks demonstrate the effectiveness and flexibility of our approach.

preprint2026arXiv

Order-based Rehearsal Learning

When a machine learning (ML) model forecasts an undesired event, one often seeks a decision to avoid it, known as the avoiding undesired future (AUF) problem. Many rehearsal learning methods have been proposed for AUF, but they rely on an underlying graph structure; learning such a graph from observational data is challenging and can incur substantial estimation error. In this work, we demonstrate that the order structure can be sufficient for AUF decision-making, and propose the first order-based rehearsal learning method. Although an order is less informative than a graph, it can be sufficient to identify the influence of decisions from observational data, suggesting that learning the entire graph is not always necessary. To learn the order, we develop an information-theoretic method that imposes no restrictions on the form of structural functions or the type of noise distributions. For AUF decision-making, we construct an order-based sampler to approximate the influence of decisions and, combined with a surrogate objective for maximizing the post-decision success probability, reduce the AUF task to a differentiable optimization problem. Experiments show that our order learning method outperforms existing methods, and that our AUF approach not only surpasses methods relying on learned graphs or learned orders, but also matches or even exceeds oracle baselines that are given the true graph.

preprint2026arXiv

Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs

Non-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity, including sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and the algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature is due to an inadequate regret analysis, which results in an overly complex algorithm design. We propose a \emph{refined analysis framework}, which simplifies the derivation and, importantly, produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\tilde{O}(k_μ^{5/4} c_μ^{-3/4} d^{3/4} P_T^{1/4}T^{3/4})$ regret, improving the $\tilde{O}(k_μ^{2} c_μ^{-1}d^{9/10} P_T^{1/5}T^{4/5})$ bound in prior work, where $k_μ$ and $c_μ$ characterize the reward model's nonlinearity, $P_T$ measures the non-stationarity, $d$ and $T$ denote the dimension and time horizon. Moreover, we extend our framework to non-stationary Markov Decision Processes (MDPs) with function approximation, focusing on Linear Mixture MDP and Multinomial Logit (MNL) Mixture MDP. For both classes, we propose algorithms based on the weighted strategy and establish dynamic regret guarantees using our analysis framework.

preprint2023arXiv

Adapting to Online Label Shift with Provable Guarantees

The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationary assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of \emph{online label shift} (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationarity nature and the lack of supervision make the problem challenging to be tackled. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal \emph{dynamic regret}, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.

preprint2022arXiv

Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits

We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is poly$(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\widetilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.

preprint2022arXiv

Dynamic Regret of Online Markov Decision Processes

We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. The measure is strictly stronger than the standard static regret that benchmarks the learner's performance with a fixed compared policy. We consider three foundational models of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For these three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees respectively, in which the results for episodic (loop-free) SSP are provably minimax optimal in terms of time horizon and certain non-stationarity measure. Furthermore, when the online environments encountered by the learner are predictable, we design improved algorithms and achieve better dynamic regret bounds for the episodic (loop-free) SSP; and moreover, we demonstrate impossibility results for the infinite-horizon MDPs.

preprint2022arXiv

No-Regret Learning in Time-Varying Zero-Sum Games

Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. We consider a variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. We first present three performance measures to guide the algorithmic design for this problem: 1) the well-studied individual regret, 2) an extension of duality gap, and 3) a new measure called dynamic Nash Equilibrium regret, which quantifies the cumulative difference between the player's payoff and the minimax game value. Next, we develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under all these three performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property, along with several novel ingredients specifically designed for the time-varying game setting. Empirical results further validate the effectiveness of our algorithm.

preprint2022arXiv

Open-environment Machine Learning

Conventional machine learning studies generally assume close-environment scenarios where important factors of the learning process hold invariant. With the great success of machine learning, nowadays, more and more practical tasks, particularly those involving open-environment scenarios where important factors are subject to change, called open-environment machine learning (Open ML) in this article, are present to the community. Evidently it is a grand challenge for machine learning turning from close environment to open environment. It becomes even more challenging since, in various big data tasks, data are usually accumulated with time, like streams, while it is hard to train the machine learning model after collecting all data as in conventional studies. This article briefly introduces some advances in this line of research, focusing on techniques concerning emerging new classes, decremental/incremental features, changing data distributions, varied learning objectives, and discusses some theoretical issues.

preprint2022arXiv

Result Diversification by Multi-objective Evolutionary Algorithms with Theoretical Guarantees

Given a ground set of items, the result diversification problem aims to select a subset with high "quality" and "diversity" while satisfying some constraints. It arises in various real-world artificial intelligence applications, such as web-based search, document summarization and feature selection, and also has applications in other areas, e.g., computational geometry, databases, finance and operations research. Previous algorithms are mainly based on greedy or local search. In this paper, we propose to reformulate the result diversification problem as a bi-objective maximization problem, and solve it by a multi-objective evolutionary algorithm (EA), i.e., the GSEMO. We theoretically prove that the GSEMO can achieve the (asymptotically) optimal theoretical guarantees under both static and dynamic environments. For cardinality constraints, the GSEMO can achieve the optimal polynomial-time approximation ratio, $1/2$. For more general matroid constraints, the GSEMO can achieve an asymptotically optimal polynomial-time approximation ratio, $1/2-ε/(4n)$, where $ε>0$ and $n$ is the size of the ground set of items. Furthermore, when the objective function (i.e., a linear combination of quality and diversity) changes dynamically, the GSEMO can maintain this approximation ratio in polynomial running time, addressing the open question proposed by Borodin. This also theoretically shows the superiority of EAs over local search for solving dynamic optimization problems for the first time, and discloses the robustness of the mutation operator of EAs against dynamic changes. Experiments on the applications of web-based search, multi-label feature selection and document summarization show the superior performance of the GSEMO over the state-of-the-art algorithms (i.e., the greedy algorithm and local search) under both static and dynamic environments.

preprint2021arXiv

Storage Fit Learning with Feature Evolvable Streams

Feature evolvable learning has been widely studied in recent years where old features will vanish and new features will emerge when learning with streams. Conventional methods usually assume that a label will be revealed after prediction at each time step. However, in practice, this assumption may not hold whereas no label will be given at most time steps. A good solution is to leverage the technique of manifold regularization to utilize the previous similar data to assist the refinement of the online model. Nevertheless, this approach needs to store all previous data which is impossible in learning with streams that arrive sequentially in large volume. Thus we need a buffer to store part of them. Considering that different devices may have different storage budgets, the learning approaches should be flexible subject to the storage budget limit. In this paper, we propose a new setting: Storage-Fit Feature-Evolvable streaming Learning (SF$^2$EL) which incorporates the issue of rarely-provided labels into feature evolution. Our framework is able to fit its behavior to different storage budgets when learning with feature evolvable streams with unlabeled data. Besides, both theoretical and empirical results validate that our approach can preserve the merit of the original feature evolvable learning i.e., can always track the best baseline and thus perform well at any time step.

preprint2021arXiv

Towards Understanding Theoretical Advantages of Complex-Reaction Networks

Complex-valued neural networks have attracted increasing attention in recent years, while it remains open on the advantages of complex-valued neural networks in comparison with real-valued networks. This work takes one step on this direction by introducing the \emph{complex-reaction network} with fully-connected feed-forward architecture. We prove the universal approximation property for complex-reaction networks, and show that a class of radial functions can be approximated by a complex-reaction network using the polynomial number of parameters, whereas real-valued networks need at least exponential parameters to reach the same approximation level. For empirical risk minimization, our theoretical result shows that the critical point set of complex-reaction networks is a proper subset of that of real-valued networks, which may show some insights on finding the optimal solutions more easily for complex-reaction networks.

preprint2020arXiv

AliExpress Learning-To-Rank: Maximizing Online Model Performance without Going Online

Learning-to-rank (LTR) has become a key technology in E-commerce applications. Most existing LTR approaches follow a supervised learning paradigm from offline labeled data collected from the online system. However, it has been noticed that previous LTR models can have a good validation performance over offline validation data but have a poor online performance, and vice versa, which implies a possible large inconsistency between the offline and online evaluation. We investigate and confirm in this paper that such inconsistency exists and can have a significant impact on AliExpress Search. Reasons for the inconsistency include the ignorance of item context during the learning, and the offline data set is insufficient for learning the context. Therefore, this paper proposes an evaluator-generator framework for LTR with item context. The framework consists of an evaluator that generalizes to evaluate recommendations involving the context, and a generator that maximizes the evaluator score by reinforcement learning, and a discriminator that ensures the generalization of the evaluator. Extensive experiments in simulation environments and AliExpress Search online system show that, firstly, the classic data-based metrics on the offline dataset can show significant inconsistency with online performance, and can even be misleading. Secondly, the proposed evaluator score is significantly more consistent with the online performance than common ranking metrics. Finally, as the consequence, our method achieves a significant improvement (\textgreater$2\%$) in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.

preprint2020arXiv

Bandit Convex Optimization in Non-stationary Environments

Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the \emph{dynamic regret} as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence that reflects the non-stationarity of environments. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the $Ω(T^{1/2}(1+P_T)^{1/2})$ lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length $P_T$ ahead of time, which is generally unknown.

preprint2020arXiv

Deep Forest

Current deep learning models are mostly build upon neural networks, i.e., multiple layers of parameterized differentiable nonlinear modules that can be trained by backpropagation. In this paper, we explore the possibility of building deep models based on non-differentiable modules. We conjecture that the mystery behind the success of deep neural networks owes much to three characteristics, i.e., layer-by-layer processing, in-model feature transformation and sufficient model complexity. We propose the gcForest approach, which generates \textit{deep forest} holding these characteristics. This is a decision tree ensemble approach, with much less hyper-parameters than deep neural networks, and its model complexity can be automatically determined in a data-dependent way. Experiments show that its performance is quite robust to hyper-parameter settings, such that in most cases, even across different data from different domains, it is able to get excellent performance by using the same default setting. This study opens the door of deep learning based on non-differentiable modules, and exhibits the possibility of constructing deep models without using backpropagation.

preprint2020arXiv

Distributed Deep Forest and its Application to Automatic Detection of Cash-out Fraud

Internet companies are facing the need for handling large-scale machine learning applications on a daily basis and distributed implementation of machine learning algorithms which can handle extra-large scale tasks with great performance is widely needed. Deep forest is a recently proposed deep learning framework which uses tree ensembles as its building blocks and it has achieved highly competitive results on various domains of tasks. However, it has not been tested on extremely large scale tasks. In this work, based on our parameter server system, we developed the distributed version of deep forest. To meet the need for real-world tasks, many improvements are introduced to the original deep forest model, including MART (Multiple Additive Regression Tree) as base learners for efficiency and effectiveness consideration, the cost-based method for handling prevalent class-imbalanced data, MART based feature selection for high dimension data and different evaluation metrics for automatically determining of the cascade level. We tested the deep forest model on an extra-large scale task, i.e., automatic detection of cash-out fraud, with more than 100 millions of training samples. Experimental results showed that the deep forest model has the best performance according to the evaluation metrics from different perspectives even with very little effort for parameter tuning. This model can block fraud transactions in a large amount of money each day. Even compared with the best-deployed model, the deep forest model can additionally bring into a significant decrease in economic loss each day.

preprint2020arXiv

Flexible Transmitter Network

Current neural networks are mostly built upon the MP model, which usually formulates the neuron as executing an activation function on the real-valued weighted aggregation of signals received from other neurons. In this paper, we propose the Flexible Transmitter (FT) model, a novel bio-plausible neuron model with flexible synaptic plasticity. The FT model employs a pair of parameters to model the transmitters between neurons and puts up a neuron-exclusive variable to record the regulated neurotrophin density, which leads to the formulation of the FT model as a two-variable two-valued function, taking the commonly-used MP neuron model as its special case. This modeling manner makes the FT model not only biologically more realistic, but also capable of handling complicated data, even time series. To exhibit its power and potential, we present the Flexible Transmitter Network (FTNet), which is built on the most common fully-connected feed-forward architecture taking the FT model as the basic building block. FTNet allows gradient calculation and can be implemented by an improved back-propagation algorithm in the complex-valued domain. Experiments on a board range of tasks show the superiority of the proposed FTNet. This study provides an alternative basic building block in neural networks and exhibits the feasibility of developing artificial neural networks with neuronal plasticity.

preprint2020arXiv

Learning with Interpretable Structure from Gated RNN

The interpretability of deep learning models has raised extended attention these years. It will be beneficial if we can learn an interpretable structure from deep learning models. In this paper, we focus on Recurrent Neural Networks~(RNNs) especially gated RNNs whose inner mechanism is still not clearly understood. We find that Finite State Automaton~(FSA) that processes sequential data has more interpretable inner mechanism according to the definition of interpretability and can be learned from RNNs as the interpretable structure. We propose two methods to learn FSA from RNN based on two different clustering methods. With the learned FSA and via experiments on artificial and real datasets, we find that FSA is more trustable than the RNN from which it learned, which gives FSA a chance to substitute RNNs in applications involving humans' lives or dangerous facilities. Besides, we analyze how the number of gates affects the performance of RNN. Our result suggests that gate in RNN is important but the less the better, which could be a guidance to design other RNNs. Finally, we observe that the FSA learned from RNN gives semantic aggregated states and its transition graph shows us a very interesting vision of how RNNs intrinsically handle text classification tasks.

preprint2020arXiv

Model Reuse with Reduced Kernel Mean Embedding Specification

Given a publicly available pool of machine learning models constructed for various tasks, when a user plans to build a model for her own machine learning application, is it possible to build upon models in the pool such that the previous efforts on these existing models can be reused rather than starting from scratch? Here, a grand challenge is how to find models that are helpful for the current application, without accessing the raw training data for the models in the pool. In this paper, we present a two-phase framework. In the upload phase, when a model is uploading into the pool, we construct a reduced kernel mean embedding (RKME) as a specification for the model. Then in the deployment phase, the relatedness of the current task and pre-trained models will be measured based on the value of the RKME specification. Theoretical results and extensive experiments validate the effectiveness of our approach.

preprint2020arXiv

Soft Gradient Boosting Machine

Gradient Boosting Machine has proven to be one successful function approximator and has been widely used in a variety of areas. However, since the training procedure of each base learner has to take the sequential order, it is infeasible to parallelize the training process among base learners for speed-up. In addition, under online or incremental learning settings, GBMs achieved sub-optimal performance due to the fact that the previously trained base learners can not adapt with the environment once trained. In this work, we propose the soft Gradient Boosting Machine (sGBM) by wiring multiple differentiable base learners together, by injecting both local and global objectives inspired from gradient boosting, all base learners can then be jointly optimized with linear speed-up. When using differentiable soft decision trees as base learner, such device can be regarded as an alternative version of the (hard) gradient boosting decision trees with extra benefits. Experimental results showed that, sGBM enjoys much higher time efficiency with better accuracy, given the same base learner in both on-line and off-line settings.

preprint2018arXiv

Handling Concept Drift via Model Reuse

In many real-world applications, data are often collected in the form of stream, and thus the distribution usually changes in nature, which is referred as concept drift in literature. We propose a novel and effective approach to handle concept drift via model reuse, leveraging previous knowledge by reusing models. Each model is associated with a weight representing its reusability towards current data, and the weight is adaptively adjusted according to the model performance. We provide generalization and regret analysis. Experimental results also validate the superiority of our approach on both synthetic and real-world datasets.

preprint2018arXiv

Multi-Layered Gradient Boosting Decision Trees

Multi-layered representation is believed to be the key ingredient of deep neural networks especially in cognitive tasks like computer vision. While non-differentiable models such as gradient boosting decision trees (GBDTs) are the dominant methods for modeling discrete or tabular data, they are hard to incorporate with such representation learning ability. In this work, we propose the multi-layered GBDT forest (mGBDTs), with an explicit emphasis on exploring the ability to learn hierarchical representations by stacking several layers of regression GBDTs as its building block. The model can be jointly trained by a variant of target propagation across layers, without the need to derive back-propagation nor differentiability. Experiments and visualizations confirmed the effectiveness of the model in terms of performance and representation learning ability.

preprint2018arXiv

Unorganized Malicious Attacks Detection

Recommender system has attracted much attention during the past decade. Many attack detection algorithms have been developed for better recommendations, mostly focusing on shilling attacks, where an attack organizer produces a large number of user profiles by the same strategy to promote or demote an item. This work considers a different attack style: unorganized malicious attacks, where attackers individually utilize a small number of user profiles to attack different items without any organizer. This attack style occurs in many real applications, yet relevant study remains open. We first formulate the unorganized malicious attacks detection as a matrix completion problem, and propose the Unorganized Malicious Attacks detection (UMA) approach, a proximal alternating splitting augmented Lagrangian method. We verify, both theoretically and empirically, the effectiveness of our proposed approach.

preprint2017arXiv

A Unified View of Multi-Label Performance Measures

Multi-label classification deals with the problem where each instance is associated with multiple class labels. Because evaluation in multi-label classification is more complicated than single-label setting, a number of performance measures have been proposed. It is noticed that an algorithm usually performs differently on different measures. Therefore, it is important to understand which algorithms perform well on which measure(s) and why. In this paper, we propose a unified margin view to revisit eleven performance measures in multi-label classification. In particular, we define label-wise margin and instance-wise margin, and prove that through maximizing these margins, different corresponding performance measures will be optimized. Based on the defined margins, a max-margin approach called LIMO is designed and empirical results verify our theoretical findings.

preprint2017arXiv

AutoEncoder by Forest

Auto-encoding is an important task which is typically realized by deep neural networks (DNNs) such as convolutional neural networks (CNN). In this paper, we propose EncoderForest (abbrv. eForest), the first tree ensemble based auto-encoder. We present a procedure for enabling forests to do backward reconstruction by utilizing the equivalent classes defined by decision paths of the trees, and demonstrate its usage in both supervised and unsupervised setting. Experiments show that, compared with DNN autoencoders, eForest is able to obtain lower reconstruction error with fast training speed, while the model itself is reusable and damage-tolerable.

preprint2017arXiv

Crowdsourcing with Unsure Option

One of the fundamental problems in crowdsourcing is the trade-off between the number of the workers needed for high-accuracy aggregation and the budget to pay. For saving budget, it is important to ensure high quality of the crowd-sourced labels, hence the total cost on label collection will be reduced. Since the self-confidence of the workers often has a close relationship with their abilities, a possible way for quality control is to request the workers to return the labels only when they feel confident, by means of providing unsure option to them. On the other hand, allowing workers to choose unsure option also leads to the potential danger of budget waste. In this work, we propose the analysis towards understanding when providing the unsure option indeed leads to significant cost reduction, as well as how the confidence threshold is set. We also propose an online mechanism, which is alternative for threshold selection when the estimation of the crowd ability distribution is difficult.

preprint2017arXiv

Distribution-Free One-Pass Learning

In many large-scale machine learning applications, data are accumulated with time, and thus, an appropriate model should be able to update in an online paradigm. Moreover, as the whole data volume is unknown when constructing the model, it is desired to scan each data item only once with a storage independent with the data volume. It is also noteworthy that the distribution underlying may change during the data accumulation procedure. To handle such tasks, in this paper we propose DFOP, a distribution-free one-pass learning approach. This approach works well when distribution change occurs during data accumulation, without requiring prior knowledge about the change. Every data item can be discarded once it has been scanned. Besides, theoretical guarantee shows that the estimate error, under a mild assumption, decreases until convergence with high probability. The performance of DFOP for both regression and classification are validated in experiments.

preprint2016arXiv

One-Pass Learning with Incremental and Decremental Features

In many real tasks the features are evolving, with some features being vanished and some other features augmented. For example, in environment monitoring some sensors expired whereas some new ones deployed; in mobile game recommendation some games dropped whereas some new ones added. Learning with such incremental and decremental features is crucial but rarely studied, particularly when the data coming like a stream and thus it is infeasible to keep the whole data for optimization. In this paper, we study this challenging problem and present the OPID approach. Our approach attempts to compress important information of vanished features into functions of survived features, and then expand to include the augmented features. It is the one-pass learning approach, which only needs to scan each instance once and does not need to store the whole data, and thus satisfy the evolving streaming data nature. The effectiveness of our approach is validated theoretically and empirically.

preprint2016arXiv

Optimal Margin Distribution Machine

Support vector machine (SVM) has been one of the most popular learning algorithms, with the central idea of maximizing the minimum margin, i.e., the smallest distance from the instances to the classification boundary. Recent theoretical results, however, disclosed that maximizing the minimum margin does not necessarily lead to better generalization performances, and instead, the margin distribution has been proven to be more crucial. Based on this idea, we propose a new method, named Optimal margin Distribution Machine (ODM), which tries to achieve a better generalization performance by optimizing the margin distribution. We characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance. The proposed method is a general learning approach which can be used in any place where SVM can be applied, and their superiority is verified both theoretically and empirically in this paper.

preprint2014arXiv

Dropout Rademacher Complexity of Deep Neural Networks

Great successes of deep neural networks have been witnessed in various real applications. Many algorithmic and implementation techniques have been developed, however, theoretical understanding of many aspects of deep neural networks is far from clear. A particular interesting issue is the usefulness of dropout, which was motivated from the intuition of preventing complex co-adaptation of feature detectors. In this paper, we study the Rademacher complexity of different types of dropout, and our theoretical results disclose that for shallow neural networks (with one or none hidden layer) dropout is able to reduce the Rademacher complexity in polynomial, whereas for deep neural networks it can amazingly lead to an exponential reduction of the Rademacher complexity.

preprint2014arXiv

On the Consistency of AUC Pairwise Optimization

AUC (area under ROC curve) is an important evaluation criterion, which has been popularly used in many learning tasks such as class-imbalance learning, cost-sensitive learning, learning to rank, etc. Many learning approaches try to optimize AUC, while owing to the non-convexity and discontinuousness of AUC, almost all approaches work with surrogate loss functions. Thus, the consistency of AUC is crucial; however, it has been almost untouched before. In this paper, we provide a sufficient condition for the asymptotic consistency of learning approaches based on surrogate loss functions. Based on this result, we prove that exponential loss and logistic loss are consistent with AUC, but hinge loss is inconsistent. Then, we derive the $q$-norm hinge loss and general hinge loss that are consistent with AUC. We also derive the consistent bounds for exponential loss and logistic loss, and obtain the consistent bounds for many surrogate loss functions under the non-noise setting. Further, we disclose an equivalence between the exponential surrogate loss of AUC and exponential surrogate loss of accuracy, and one straightforward consequence of such finding is that AdaBoost and RankBoost are equivalent.

preprint2014arXiv

Top Rank Optimization in Linear Time

Bipartite ranking aims to learn a real-valued ranking function that orders positive instances before negative instances. Recent efforts of bipartite ranking are focused on optimizing ranking accuracy at the top of the ranked list. Most existing approaches are either to optimize task specific metrics or to extend the ranking loss by emphasizing more on the error associated with the top ranked instances, leading to a high computational cost that is super-linear in the number of training instances. We propose a highly efficient approach, titled TopPush, for optimizing accuracy at the top that has computational complexity linear in the number of training instances. We present a novel analysis that bounds the generalization error for the top ranked instances for the proposed approach. Empirical study shows that the proposed approach is highly competitive to the state-of-the-art approaches and is 10-100 times faster.

preprint2013arXiv

Convex and Scalable Weakly Labeled SVMs

In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the WellSVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the WellSVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and WellSVM is also readily applicable on large data sets.

preprint2013arXiv

Fast Multi-Instance Multi-Label Learning

In many real-world tasks, particularly those involving data objects with complicated semantics such as images and texts, one object can be represented by multiple instances and simultaneously be associated with multiple labels. Such tasks can be formulated as multi-instance multi-label learning (MIML) problems, and have been extensively studied during the past few years. Existing MIML approaches have been found useful in many applications; however, most of them can only handle moderate-sized data. To efficiently handle large data sets, in this paper we propose the MIMLfast approach, which first constructs a low-dimensional subspace shared by all labels, and then trains label specific linear models to optimize approximated ranking loss via stochastic gradient descent. Although the MIML problem is complicated, MIMLfast is able to achieve excellent performance by exploiting label relations with shared space and discovering sub-concepts for complicated labels. Experiments show that the performance of MIMLfast is highly competitive to state-of-the-art techniques, whereas its time cost is much less; particularly, on a data set with 20K bags and 180K instances, MIMLfast is more than 100 times faster than existing MIML approaches. On a larger data set where none of existing approaches can return results in 24 hours, MIMLfast takes only 12 minutes. Moreover, our approach is able to identify the most representative instance for each label, and thus providing a chance to understand the relation between input patterns and output label semantics.

preprint2013arXiv

One-Pass AUC Optimization

AUC is an important performance measure and many algorithms have been devoted to AUC optimization, mostly by minimizing a surrogate convex loss on a training data set. In this work, we focus on one-pass AUC optimization that requires only going through the training data once without storing the entire training dataset, where conventional online learning algorithms cannot be applied directly because AUC is measured by a sum of losses defined over pairs of instances from different classes. We develop a regression-based algorithm which only needs to maintain the first and second order statistics of training data in memory, resulting a storage requirement independent from the size of training data. To efficiently handle high dimensional data, we develop a randomized algorithm that approximates the covariance matrices by low rank matrices. We verify, both theoretically and empirically, the effectiveness of the proposed algorithm.