Source author record

Wu-Jun Li

Wu-Jun Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Machine Learning; Computer Vision; Information Retrieval; Artificial Intelligence; Computation and Language; Computational Engineering, Finance, and Science; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; math.OC; Numerical Analysis; q-fin.TR; Social and Information Networks

Catalog footprint

What is connected

19 works

12 topics

4 close collaborators


preprint, 2026, arXiv

Controllable Financial Market Generation with Diffusion Guided Meta Agent

Generative modeling has transformed many fields, such as language and visual modeling, while its application in financial markets remains under-explored. As the minimal unit within a financial market is an order, order-flow modeling represents a fundamental generative financial task. However, current approaches often yield unsatisfactory fidelity in generating order flow, and their generation lacks controllability, thereby limiting their practical applications. In this paper, we formulate the challenge of controllable financial market generation, and propose a Diffusion Guided Meta Agent (DigMA) model to address it. Specifically, we employ a conditional diffusion model to capture the dynamics of the market state represented by time-evolving distribution parameters of the mid-price return rate and the order arrival rate, and we define a meta agent with financial economic priors to generate orders from the corresponding distributions. Extensive experimental results show that DigMA achieves superior controllability and generation fidelity. Moreover, we validate its effectiveness as a generative environment for downstream high-frequency trading tasks and its computational efficiency.

preprint, 2026, arXiv

Ordered Local Momentum for Asynchronous Distributed Learning under Arbitrary Delays

Momentum SGD (MSGD) serves as a foundational optimizer in training deep models due to momentum's key role in accelerating convergence and enhancing generalization. Meanwhile, asynchronous distributed learning is crucial for training large-scale deep models, especially when the computing capabilities of the workers in the cluster are heterogeneous. To reduce communication frequency, local updates are widely adopted in distributed learning. However, how to implement asynchronous distributed MSGD with local updates remains unexplored. To solve this problem, we propose a novel method, called ordered local momentum (OrLoMo), for asynchronous distributed learning. In OrLoMo, each worker runs MSGD locally. Then the local momentum from each worker will be aggregated by the server in order based on its global iteration index. To the best of our knowledge, OrLoMo is the first method to implement asynchronous distributed MSGD with local updates. We prove the convergence of OrLoMo for non-convex problems under arbitrary delays. Experiments validate that OrLoMo can outperform its synchronous counterpart and other asynchronous methods.

preprint, 2022, arXiv

Buffered Asynchronous SGD for Byzantine Learning

Distributed learning has become a hot research topic due to its wide application in cluster-based large-scale learning, federated learning, edge computing and so on. Most traditional distributed learning methods assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which makes them impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist non-omniscient attacks without storing any instances on the server. Furthermore, we also propose an improved variant of BASGD, called BASGD with momentum (BASGDm), by introducing momentum into BASGD. BASGDm can resist both non-omniscient and omniscient attacks. Compared with those methods which need to store instances on the server, BASGD and BASGDm have a wider scope of application. Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to converge and to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when workers fail or are attacked.
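
The abstract mentions compatibility with "various aggregation rules" without naming them. As a rough illustration of why a robust rule matters under attack, the sketch below compares plain averaging with the coordinate-wise median, one commonly used robust rule. The choice of median, the function names, and the toy gradients are assumptions made for this example. It does not implement BASGD's buffering.

```python
import numpy as np


def mean_aggregate(gradients):
    """Plain averaging: a single corrupted gradient can dominate the result."""
    return np.mean(np.stack(gradients), axis=0)


def coordinate_median_aggregate(gradients):
    """Coordinate-wise median: an illustrative robust rule (assumed, not from the paper)."""
    return np.median(np.stack(gradients), axis=0)


# Four honest gradients near [1, 1] plus one hypothetical Byzantine gradient.
honest = [np.array([1.0, 1.0]), np.array([1.2, 0.8]),
          np.array([0.8, 1.2]), np.array([1.0, 1.0])]
byzantine = np.array([100.0, -100.0])
received = honest + [byzantine]

print("mean:  ", mean_aggregate(received))
print("median:", coordinate_median_aggregate(received))
```

With one outlier among five gradients, the median stays at the honest value while the mean is pulled far away from it.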

preprint, 2020, arXiv

Collaborative Self-Attention for Recommender Systems

Recommender systems (RS), which have become an essential part of a wide range of applications, can be formulated as a matrix completion (MC) problem. To boost the performance of MC, matrix completion with side information, called inductive matrix completion (IMC), was further proposed. In real applications, the factorized version of IMC is favored due to its efficiency of optimization and implementation. Regarding the factorized version, the traditional IMC method can be interpreted as learning an individual representation for each feature, with the representations independent of each other. Moreover, representations for the same features are shared across all users/items. However, the independence of features and the sharing of the same feature representations across all users/items may limit the expressiveness of the model. The limitation also exists in variants of IMC, such as deep learning based IMC models. To break the limitation, we generalize recent advances in self-attention mechanisms to IMC and propose a context-aware model called collaborative self-attention (CSA), which can jointly learn context-aware representations for features and perform the inductive matrix completion process. Extensive experiments on three large-scale datasets from real RS applications demonstrate the effectiveness of CSA.
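
To make "context-aware representations" concrete, here is a minimal sketch of generic scaled dot-product self-attention over a set of feature embeddings. It is the standard mechanism rather than the CSA model itself. The projection matrices `Wq`, `Wk`, `Wv` and all shapes are hypothetical.

```python
import numpy as np


def self_attention(X, Wq, Wk, Wv):
    """Generic scaled dot-product self-attention (not the CSA architecture).

    X has one row per feature embedding; each output row is a weighted mix
    of all value vectors, so it depends on the other features present.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # 4 hypothetical features, dimension 3
Wq, Wk, Wv = (rng.standard_normal((3, 2)) for _ in range(3))
context_aware, attention = self_attention(X, Wq, Wk, Wv)
print(context_aware.shape, attention.sum(axis=1))
```

Unlike a fixed per-feature embedding, each row of `context_aware` changes when the other features in `X` change.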

preprint, 2020, arXiv

ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Image Retrieval

Retrieving content-relevant images from a large-scale fine-grained dataset can suffer from intolerably slow query speed and highly redundant storage cost, due to high-dimensional real-valued embeddings which aim to distinguish subtle visual differences of fine-grained objects. In this paper, we study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. Specifically, we propose a unified end-to-end trainable network, termed ExchNet. Based on attention mechanisms and proposed attention constraints, it first obtains both local and global features to represent object parts and whole fine-grained objects, respectively. Furthermore, to ensure the discriminative ability and semantic consistency of these part-level features across images, we design a local feature alignment approach by performing a feature exchanging operation. Later, an alternative learning algorithm is employed to optimize the whole ExchNet and then generate the final binary hash codes. Validated by extensive experiments, our proposal consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets, which shows its effectiveness. Moreover, compared with other approximate nearest neighbor methods, ExchNet achieves the best speed-up and storage reduction, revealing its efficiency and practicality.

preprint, 2020, arXiv

Stagewise Enlargement of Batch Size for SGD-based Learning

Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily cause a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between initialization and optimum of the model parameter. Then based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
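
The geometric enlargement described above can be sketched as a simple schedule. The initial batch size, the growth factor, and the assumption that every stage processes the same number of samples are illustrative choices. The abstract does not specify them.

```python
def geometric_batch_schedule(initial_batch_size, growth_factor, num_stages):
    """Batch size per stage, multiplied by growth_factor at each new stage."""
    return [initial_batch_size * growth_factor ** stage for stage in range(num_stages)]


def updates_per_stage(samples_per_stage, batch_sizes):
    """Parameter updates per stage, assuming a fixed sample budget per stage."""
    return [-(-samples_per_stage // b) for b in batch_sizes]  # ceiling division


schedule = geometric_batch_schedule(initial_batch_size=128, growth_factor=2, num_stages=4)
print(schedule)                             # [128, 256, 512, 1024]
print(updates_per_stage(50_000, schedule))  # [391, 196, 98, 49]
```

Under these assumptions, each stage needs roughly half as many parameter updates as the previous one, which is the effect the abstract attributes to larger batches.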

preprint, 2020, arXiv

TOMA: Topological Map Abstraction for Reinforcement Learning

Animals are able to discover the topological map (graph) of the surrounding environment, which is then used for navigation. Inspired by this biological phenomenon, researchers have recently proposed to generate a graph representation for a Markov decision process (MDP) and use such graphs for planning in reinforcement learning (RL). However, existing graph generation methods suffer from many drawbacks. One drawback is that existing methods do not learn an abstraction for graphs, which results in high memory and computation cost. This drawback also makes the generated graph non-robust, which degrades the planning performance. Another drawback is that existing methods cannot be used for facilitating exploration, which is important in RL. In this paper, we propose a new method, called topological map abstraction (TOMA), for graph generation. TOMA can generate an abstract graph representation for an MDP, which requires much less memory and computation than existing methods. Furthermore, TOMA can be used for facilitating exploration. In particular, we propose planning to explore, in which TOMA is used to accelerate exploration by guiding the agent towards unexplored states. A novel experience replay module called vertex memory is also proposed to improve exploration performance. Experimental results show that TOMA can outperform existing methods to achieve state-of-the-art performance.

preprint, 2016, arXiv

A Proximal Stochastic Quasi-Newton Algorithm

In this paper, we discuss the problem of minimizing the sum of two convex functions: a smooth function plus a non-smooth function. Further, the smooth part can be expressed by the average of a large number of smooth component functions, and the non-smooth part is equipped with a simple proximal mapping. We propose a proximal stochastic second-order method, which is efficient and scalable. It incorporates the Hessian in the smooth part of the function and exploits a multistage scheme to reduce the variance of the stochastic gradient. We prove that our method can achieve a linear rate of convergence.
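
As an example of a "simple proximal mapping", the sketch below uses the L1 norm, whose proximal operator is soft-thresholding, inside a plain first-order proximal gradient step. Both the L1 choice and the first-order step are assumptions made for illustration. The paper's method is second-order and is not reproduced here.

```python
import numpy as np


def soft_threshold(v, threshold):
    """Proximal mapping of threshold * ||x||_1: shrink each entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)


def proximal_gradient_step(x, smooth_grad, step_size, l1_weight):
    """One first-order step: gradient step on the smooth part, then the prox."""
    return soft_threshold(x - step_size * smooth_grad, step_size * l1_weight)


v = np.array([3.0, -0.5, 1.0, -2.0])
print(soft_threshold(v, 1.0))  # entries with magnitude <= 1 become zero
```

The prox is "simple" in the sense that it has this closed form and costs one pass over the vector.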

preprint, 2016, arXiv

Deep Cross-Modal Hashing

Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications. However, almost all existing CMH methods are based on hand-crafted features which might not be optimally compatible with the hash-code learning procedure. As a result, existing CMH methods with hand-crafted features may not achieve satisfactory performance. In this paper, we propose a novel cross-modal hashing method, called deep cross-modal hashing (DCMH), by integrating feature learning and hash-code learning into the same framework. DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch. Experiments on two real datasets with text-image modalities show that DCMH can outperform other baselines to achieve state-of-the-art performance in cross-modal retrieval applications.
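
The storage and query-speed argument rests on comparing short binary codes by Hamming distance. The sketch below shows that retrieval step only. Thresholding real-valued embeddings at zero and the toy codes are assumptions for illustration. How DCMH learns its codes is not shown.

```python
import numpy as np


def to_binary_codes(embeddings):
    """Binarize real-valued embeddings by sign (an assumed, generic scheme)."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)


def hamming_rank(query_code, database_codes):
    """Rank database items by Hamming distance to the query (ties keep index order)."""
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    order = np.argsort(distances, kind="stable")
    return order, distances


# Hypothetical 4-bit codes: in a cross-modal setting the query could come from
# text and the database from images, as long as both map to the same code space.
database = np.array([[1, 0, 1, 1],
                     [0, 0, 1, 0],
                     [1, 1, 0, 0]], dtype=np.uint8)
query = to_binary_codes([0.7, -0.2, 0.4, -0.9])
order, distances = hamming_rank(query, database)
print(order, distances)
```

Each code here occupies a handful of bits per item, and the distance computation is a bitwise comparison rather than a floating-point dot product.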

preprint, 2016, arXiv

Feature Learning based Deep Supervised Hashing with Pairwise Labels

Recent years have witnessed wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on hand-crafted features which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, which have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with the supervised information given as triplet labels. For another common application scenario with pairwise labels, no methods have existed for simultaneous feature learning and hash-code learning. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Experiments on real datasets show that our DPSH method can outperform other methods to achieve state-of-the-art performance in image retrieval applications.

preprint, 2016, arXiv

Full-Time Supervision based Bidirectional RNN for Factoid Question Answering

Recently, bidirectional recurrent neural network (BRNN) has been widely used for question answering (QA) tasks with promising performance. However, most existing BRNN models extract the information of questions and answers by directly using a pooling operation to generate the representation for loss or similarity calculation. Hence, these existing models don't put supervision (loss or similarity calculation) at every time step, which will lose some useful information. In this paper, we propose a novel BRNN model called full-time supervision based BRNN (FTS-BRNN), which can put supervision at every time step. Experiments on the factoid QA task show that our FTS-BRNN can outperform other baselines to achieve the state-of-the-art accuracy.

preprint, 2016, arXiv

Lock-Free Optimization for Non-Convex Problems

Stochastic gradient descent (SGD) and its variants have attracted much attention in machine learning due to their efficiency and effectiveness for optimization. To handle large-scale problems, researchers have recently proposed several lock-free strategy based parallel SGD (LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide the theoretical proof about the convergence of two representative LF-PSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.

preprint, 2016, arXiv

SCOPE: Scalable Composite Optimization for Learning on Spark

Many machine learning models, such as logistic regression (LR) and support vector machine (SVM), can be formulated as composite optimization problems. Recently, many distributed stochastic optimization (DSO) methods have been proposed to solve the large-scale composite optimization problems, which have shown better performance than traditional batch methods. However, most of these DSO methods are not scalable enough. In this paper, we propose a novel DSO method, called scalable composite optimization for learning (SCOPE), and implement it on the fault-tolerant distributed platform Spark. SCOPE is both computation-efficient and communication-efficient. Theoretical analysis shows that SCOPE is convergent with linear convergence rate when the objective function is convex. Furthermore, empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and DSO methods.

preprint, 2015, arXiv

A New Relaxation Approach to Normalized Hypergraph Cut

Normalized graph cut (NGC) has become a popular research topic due to its wide applications in a large variety of areas like machine learning and very large scale integration (VLSI) circuit design. Most traditional NGC methods are based on pairwise relationships (similarities). However, in real-world applications relationships among the vertices (objects) may be more complex than pairwise, and are typically represented as hyperedges in hypergraphs. Thus, normalized hypergraph cut (NHC) has attracted more and more attention. Existing NHC methods cannot achieve satisfactory performance in real applications. In this paper, we propose a novel relaxation approach, which is called relaxed NHC (RNHC), to solve the NHC problem. Our model is defined as an optimization problem on the Stiefel manifold. To solve this problem, we resort to the Cayley transformation to devise a feasible learning algorithm. Experimental results on a set of large hypergraph benchmarks for clustering and partitioning in the VLSI domain show that RNHC can outperform the state-of-the-art methods.
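
The role of the Cayley transformation can be illustrated with the widely used Cayley-type update for orthogonality constraints, which keeps an iterate on the Stiefel manifold (matrices with orthonormal columns) without re-orthogonalization. This is a generic sketch, with `G` standing in for a gradient and `tau` for a step size. Whether RNHC uses exactly this update is not stated in the abstract.

```python
import numpy as np


def cayley_step(X, G, tau):
    """Move X along a Cayley-transform curve; X.T @ X = I is preserved.

    A is skew-symmetric, so (I + tau/2 A)^{-1} (I - tau/2 A) is orthogonal.
    """
    n = X.shape[0]
    A = G @ X.T - X @ G.T  # skew-symmetric by construction
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ X)


rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # random point with orthonormal columns
G = rng.standard_normal((6, 3))                   # stand-in for a gradient
Y = cayley_step(X, G, tau=0.1)
print(np.allclose(Y.T @ Y, np.eye(3)))
```

Because every step stays feasible, the update is a natural fit for a "feasible learning algorithm" on this manifold.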

preprint, 2015, arXiv

A Parallel algorithm for $\mathcal{X}$-Armed bandits

The goal of the $\mathcal{X}$-armed bandit problem is to find the global maximum of an unknown stochastic function $f$, given a finite budget of $n$ evaluations. Recently, $\mathcal{X}$-armed bandits have been widely used in many situations. Many of these applications need to deal with large-scale data sets. To deal with these large-scale data sets, we study a distributed setting of $\mathcal{X}$-armed bandits, where $m$ players collaborate to find the maximum of the unknown function. We develop a novel anytime distributed $\mathcal{X}$-armed bandit algorithm. Compared with prior work on $\mathcal{X}$-armed bandits, our algorithm uses a quite different searching strategy so as to fit distributed learning scenarios. Our theoretical analysis shows that our distributed algorithm is $m$ times faster than the classical single-player algorithm. Moreover, the number of communication rounds of our algorithm is only logarithmic in $mn$. The numerical results show that our method can make effective use of every player to minimize the loss. Thus, our distributed approach is attractive and useful.

preprint, 2015, arXiv

Fast Asynchronous Parallel Stochastic Gradient Decent

Stochastic gradient descent (SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG). Both theoretical and empirical results show that AsySVRG can outperform existing state-of-the-art parallel SGD methods like Hogwild! in terms of convergence rate and computation cost.
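
For context, the SVRG update that AsySVRG parallelizes can be sketched in its sequential, single-threaded form. The least-squares objective, step size, and loop lengths below are assumptions chosen for a small demo. The asynchronous strategy itself is not shown.

```python
import numpy as np


def svrg_least_squares(A, b, step_size, num_epochs, inner_steps, seed=0):
    """Sequential SVRG on f(w) = (1 / 2n) * sum_i (a_i . w - b_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        w_snapshot = w.copy()
        full_grad = A.T @ (A @ w_snapshot - b) / n  # full gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.integers(n)
            a_i = A[i]
            grad_i = a_i * (a_i @ w - b[i])
            grad_i_snapshot = a_i * (a_i @ w_snapshot - b[i])
            # Variance-reduced direction: unbiased, with variance that shrinks
            # as w and the snapshot approach the optimum.
            w = w - step_size * (grad_i - grad_i_snapshot + full_grad)
    return w


rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
b = A @ w_true  # noise-free targets, so w_true is the exact minimizer
w_hat = svrg_least_squares(A, b, step_size=0.02, num_epochs=20, inner_steps=400)
print(np.linalg.norm(w_hat - w_true))
```

The correction term `grad_i_snapshot - full_grad` is what distinguishes SVRG from plain SGD and allows a constant step size.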

preprint, 2015, arXiv

On the Global Convergence of Majorization Minimization Algorithms for Nonconvex Optimization Problems

In this paper, we study the global convergence of majorization minimization (MM) algorithms for solving nonconvex regularized optimization problems. MM algorithms have received great attention in machine learning. However, when applied to nonconvex optimization problems, the convergence of MM algorithms is a challenging issue. We introduce theory of the Kurdyka-Łojasiewicz inequality to address this issue. In particular, we show that many nonconvex problems enjoy the Kurdyka-Łojasiewicz property and establish the global convergence result of the corresponding MM procedure. We also extend our result to a well-known method called CCCP (concave-convex procedure).
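
For readers unfamiliar with MM, the generic iteration can be written as below. This is the textbook form, with $f$ the objective and $g(\cdot \mid x_k)$ a surrogate built at the current iterate $x_k$. The notation is chosen here for illustration and is not taken from the paper.

```latex
\[
g(x \mid x_k) \ge f(x) \ \ \text{for all } x, \qquad
g(x_k \mid x_k) = f(x_k), \qquad
x_{k+1} \in \arg\min_{x} \, g(x \mid x_k),
\]
\[
f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k).
\]
```

The second line shows that the objective never increases. A non-increasing objective does not by itself guarantee that the iterates $x_k$ converge, which is the gap the abstract addresses for nonconvex problems.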

preprint, 2015, arXiv

S-PowerGraph: Streaming Graph Partitioning for Natural Graphs by Vertex-Cut

One standard solution for analyzing large natural graphs is to adopt distributed computation on clusters. In distributed computation, graph partitioning (GP) methods assign the vertices or edges of a graph to different machines in a balanced way so that distributed algorithms can be applied. Most traditional GP methods are offline, which means that the whole graph has been observed before partitioning. However, the offline methods often incur high computation cost. Hence, streaming graph partitioning (SGP) methods, which can partition graphs in an online way, have recently attracted great attention in distributed computation. There exist two typical GP strategies: edge-cut and vertex-cut. Most SGP methods adopt edge-cut, and few vertex-cut methods have been proposed for SGP. However, the vertex-cut strategy would be a better choice than the edge-cut strategy because the degrees of a natural graph generally follow a highly skewed power-law distribution. Thus, we propose a novel method, called S-PowerGraph, for SGP of natural graphs by vertex-cut. Our S-PowerGraph method is simple but effective. Experiments on several large natural graphs and synthetic graphs show that our S-PowerGraph can outperform the state-of-the-art baselines.
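
To show what streaming vertex-cut partitioning involves, the sketch below assigns each arriving edge to a machine with a simple greedy rule and reports the replication factor (average number of machines holding a copy of each vertex). The rule, the `capacity` cap, and the function names are assumptions for illustration. This is a generic heuristic, not the S-PowerGraph method.

```python
from collections import defaultdict


def greedy_vertex_cut(edges, num_machines, capacity):
    """Assign edges one at a time; vertices are replicated where their edges land."""
    replicas = defaultdict(set)  # vertex -> machines holding a copy of it
    load = [0] * num_machines    # edges assigned to each machine so far
    assignment = []
    for u, v in edges:
        open_machines = {m for m in range(num_machines) if load[m] < capacity}
        if not open_machines:
            raise ValueError("all machines are at capacity")
        shared = replicas[u] & replicas[v] & open_machines
        either = (replicas[u] | replicas[v]) & open_machines
        # Prefer a machine that already holds both endpoints, then one endpoint.
        candidates = shared or either or open_machines
        machine = min(candidates, key=lambda m: (load[m], m))
        assignment.append(machine)
        load[machine] += 1
        replicas[u].add(machine)
        replicas[v].add(machine)
    return assignment, dict(replicas)


def replication_factor(replicas):
    """Average number of machines per vertex; 1.0 means no vertex is cut."""
    return sum(len(machines) for machines in replicas.values()) / len(replicas)


# A small star graph: vertex 0 is the high-degree hub.
star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
assignment, replicas = greedy_vertex_cut(star_edges, num_machines=2, capacity=2)
print(assignment, replication_factor(replicas))
```

In the star example only the hub is replicated across machines, which is the intuition behind preferring vertex-cut for graphs with a few very high-degree vertices.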

preprint, 2015, arXiv

Scalable Stochastic Alternating Direction Method of Multipliers

Stochastic alternating direction method of multipliers (ADMM), which visits only one sample or a mini-batch of samples each time, has recently been proved to achieve better performance than batch ADMM. However, most stochastic methods can only achieve a convergence rate $O(1/\sqrt{T})$ on general convex problems, where $T$ is the number of iterations. Hence, these methods are not scalable with respect to convergence rate (computation cost). There exists only one stochastic method, called SA-ADMM, which can achieve convergence rate $O(1/T)$ on general convex problems. However, extra memory is needed for SA-ADMM to store the historic gradients on all samples, and thus it is not scalable with respect to storage cost. In this paper, we propose a novel method, called scalable stochastic ADMM (SCAS-ADMM), for large-scale optimization and learning problems. Without the need to store the historic gradients, SCAS-ADMM can achieve the same convergence rate $O(1/T)$ as the best stochastic method SA-ADMM and batch ADMM on general convex problems. Experiments on graph-guided fused lasso show that SCAS-ADMM can achieve state-of-the-art performance in real applications.
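
The practical difference between the two rates can be seen with a short calculation. It assumes, purely for illustration, that both bounds share the same constant $C$ and that the target accuracy is $\varepsilon$. Neither value comes from the paper.

```latex
\[
\frac{C}{\sqrt{T}} \le \varepsilon \iff T \ge \left(\frac{C}{\varepsilon}\right)^{2},
\qquad
\frac{C}{T} \le \varepsilon \iff T \ge \frac{C}{\varepsilon}.
\]
\[
\text{With } C = 1,\ \varepsilon = 10^{-3}: \quad
T \ge 10^{6} \ \text{ under } O(1/\sqrt{T}), \qquad
T \ge 10^{3} \ \text{ under } O(1/T).
\]
```

Under these assumptions the faster rate needs about a thousand times fewer iterations, which is the sense in which the abstract ties convergence rate to computation cost.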
