Source author record

Ji Liu

Ji Liu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Catalog footprint

What is connected

88works

39topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

A Class of Optimal Directed Graphs for Network Synchronization

In a paper by Nishikawa and Motter, a quantity called the normalized spread of the Laplacian eigenvalues is used to measure the synchronizability of certain network dynamics. Through simulations, and without theoretical validation, it is conjectured that among all simple directed graphs with a fixed number of vertices and arcs, the optimal value of this quantity is achieved if the Laplacian spectrum satisfies a specific pattern. This paper proves this conjecture and further shows that the conjectured spectral condition is not only sufficient but also necessary. Moreover, the paper proves that the optimal Laplacian spectrum is always achievable by a class of almost regular directed graphs, which can be constructed through an inductive algorithm.

preprint2026arXiv

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

preprint2026arXiv

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

Diffusion models have achieved remarkable success in image and video generation. However, their inherently multiple step inference process imposes substantial computational overhead, hindering real-world deployment. Accelerating diffusion models is therefore essential, yet determining how to combine multiple model acceleration techniques remains a significant challenge. To address this issue, we introduce a framework driven by large language models (LLMs) for automated acceleration code generation and evaluation. First, we present DiffBench, a comprehensive benchmark that implements a three stage automated evaluation pipeline across diverse diffusion architectures, optimization combinations and deployment scenarios. Second, we propose DiffAgent, an agent that generates optimal acceleration strategies and codes for arbitrary diffusion models. DiffAgent employs a closed-loop workflow in which a planning component and a debugging component iteratively refine the output of a code generation component, while a genetic algorithm extracts performance feedback from the execution environment to guide subsequent code refinements. We provide a detailed explanation of the DiffBench construction and the design principles underlying DiffAgent. Extensive experiments show that DiffBench offers a thorough evaluation of generated codes and that DiffAgent significantly outperforms existing LLMs in producing effective diffusion acceleration strategies.

preprint2026arXiv

Oriented Triplet $p$-Wave Pairing from Fermi surface Anisotropy and Nonlocal Attraction

Using constrained-path quantum Monte Carlo, we map the ground-state phase diagram versus the nearest-neighbor (NN) attraction $V$ and spin-dependent hopping anisotropy $α$ for the two-dimensional attractive $t$--$U$--$V$ Hubbard model at filling $n\simeq0.85$. We identify an onsite $s$-wave superfluid, a Cooper pair Bose metal with an uncondensed Bose surface, and an oriented equal-spin triplet $p$-wave pairing phase. The NN attraction activates the odd-parity channel, while hopping anisotropy suppresses the competing $s$-wave coherence and selects a $p_x/p_y$ polar axis, and thus lowers the critical $|V_c|$ for the onset of triplet-dominant $p$-wave pairing. A channel-resolved Landau analysis provides a criterion for the Landau $p$-wave scale $V_c^{\mathrm L}(α)$, consistent with the observed anisotropy dependence of $|V_c|$. Our results establish how NN interaction and Fermi surface anisotropy cooperate to generate the oriented triplet $p$-wave pairing, and suggest that cold-atom and altermagnetic platforms could potentially realize this mechanism.

preprint2024arXiv

Efficient Asynchronous Federated Learning with Sparsification and Quantization

While data is distributed in multiple edge devices, Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data. FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training, while several devices are selected in each round. However, straggler devices may slow down the training process or even make the system crash during training. Meanwhile, other idle edge devices remain unused. As the bandwidth between the devices and the server is relatively low, the communication of intermediate data becomes a bottleneck. In this paper, we propose Time-Efficient Asynchronous federated learning with Sparsification and Quantization, i.e., TEASQ-Fed. TEASQ-Fed can fully exploit edge devices to asynchronously participate in the training process by actively applying for tasks. We utilize control parameters to choose an appropriate number of parallel edge devices, which simultaneously execute the training tasks. In addition, we introduce a caching mechanism and weighted averaging with respect to model staleness to further improve the accuracy. Furthermore, we propose a sparsification and quantitation approach to compress the intermediate data to accelerate the training. The experimental results reveal that TEASQ-Fed improves the accuracy (up to 16.67% higher) while accelerating the convergence of model training (up to twice faster).

preprint2022arXiv

A Hybrid Observer for Estimating the State of a Distributed Linear System

A hybrid observer is described for estimating the state of a system of the form dot x=Ax, y_i=C_ix, i=1,...,m. The system's state x is simultaneously estimated by m agents assuming agent i senses y_i and receives appropriately defined data from its neighbors. Neighbor relations are characterized by a time-varying directed graph N(t). Agent i updates its estimate x_i of x at event times t_{i1},t_{i2} ... using a local continuous-time linear observer and a local parameter estimator which iterates q times during each event time interval [t_{i(s-1)},t_{is}), s>=1, to obtain an estimate of x(t_{is}). Subject to the assumptions that N(t) is strongly connected, and the system is jointly observable, it is possible to design parameters so that x_i converges to x with a pre-assigned rate. This result holds when agents communicate asynchronously with the assumption that N(t) changes slowly. Exponential convergence is also assured if the event time sequence of the agents are slightly different, although only if the system being observed is exponentially stable; this limitation however, is a robustness issue shared by all open loop state estimators with small modeling errors. The result also holds facing abrupt changes in the number of vertices and arcs in the inter-agent communication graph upon which the algorithm depends.

preprint2022arXiv

Accelerated Federated Learning with Decoupled Adaptive Optimization

The federated learning (FL) framework enables edge clients to collaboratively learn a shared inference model while keeping privacy of training data on clients. Recently, many heuristics efforts have been made to generalize centralized adaptive optimization methods, such as SGDM, Adam, AdaGrad, etc., to federated settings for improving convergence and accuracy. However, there is still a paucity of theoretical principles on where to and how to design and utilize adaptive optimization methods in federated settings. This work aims to develop novel adaptive optimization methods for FL from the perspective of dynamics of ordinary differential equations (ODEs). First, an analytic framework is established to build a connection between federated optimization methods and decompositions of ODEs of corresponding centralized optimizers. Second, based on this analytic framework, a momentum decoupling adaptive optimization method, FedDA, is developed to fully utilize the global momentum on each local iteration and accelerate the training convergence. Last but not least, full batch gradients are utilized to mimic centralized optimization in the end of the training process to ensure the convergence and overcome the possible inconsistency caused by adaptive optimization methods.

preprint2022arXiv

Adversarial Contrastive Self-Supervised Learning

Recently, learning from vast unlabeled data, especially self-supervised learning, has been emerging and attracted widespread attention. Self-supervised learning followed by the supervised fine-tuning on a few labeled examples can significantly improve label efficiency and outperform standard supervised training using fully annotated data. In this work, we present a novel self-supervised deep learning paradigm based on online hard negative pair mining. Specifically, we design a student-teacher network to generate multi-view of the data for self-supervised learning and integrate hard negative pair mining into the training. Then we derive a new triplet-like loss considering both positive sample pairs and mined hard negative sample pairs. Extensive experiments demonstrate the effectiveness of the proposed method and its components on ILSVRC-2012.

preprint2022arXiv

Combining Intra-Risk and Contagion Risk for Enterprise Bankruptcy Prediction Using Graph Neural Networks

Predicting the bankruptcy risk of small and medium-sized enterprises (SMEs) is an important step for financial institutions when making decisions about loans. Existing studies in both finance and AI research fields, however, tend to only consider either the intra-risk or contagion risk of enterprises, ignoring their interactions and combinatorial effects. This study for the first time considers both types of risk and their joint effects in bankruptcy prediction. Specifically, we first propose an enterprise intra-risk encoder based on statistically significant enterprise risk indicators for its intra-risk learning. Then, we propose an enterprise contagion risk encoder based on enterprise relation information from an enterprise knowledge graph for its contagion risk embedding. In particular, the contagion risk encoder includes both the newly proposed Hyper-Graph Neural Networks and Heterogeneous Graph Neural Networks, which can model contagion risk in two different aspects, i.e. common risk factors based on hyperedges and direct diffusion risk from neighbors, respectively. To evaluate the model, we collect real-world multi-sources data on SMEs and build a novel benchmark dataset called SMEsD. We provide open access to the dataset, which is expected to further promote research on financial risk analysis. Experiments on SMEsD against twelve state-of-the-art baselines demonstrate the effectiveness of the proposed model for bankruptcy prediction.

preprint2022arXiv

Cooperative Actor-Critic via TD Error Aggregation

In decentralized cooperative multi-agent reinforcement learning, agents can aggregate information from one another to learn policies that maximize a team-average objective function. Despite the willingness to cooperate with others, the individual agents may find direct sharing of information about their local state, reward, and value function undesirable due to privacy issues. In this work, we introduce a decentralized actor-critic algorithm with TD error aggregation that does not violate privacy issues and assumes that communication channels are subject to time delays and packet dropouts. The cost we pay for making such weak assumptions is an increased communication burden for every agent as measured by the dimension of the transmitted data. Interestingly, the communication burden is only quadratic in the graph size, which renders the algorithm applicable in large networks. We provide a convergence analysis under diminishing step size to verify that the agents maximize the team-average objective function.

preprint2022arXiv

Digging into Primary Financial Market: Challenges and Opportunities of Adopting Blockchain

Since the emergence of blockchain technology, its application in the financial market has always been an area of focus and exploration by all parties. With the characteristics of anonymity, trust, tamper-proof, etc., blockchain technology can effectively solve some problems faced by the financial market, such as trust issues and information asymmetry issues. To deeply understand the application scenarios of blockchain in the financial market, the issue of securities issuance and trading in the primary market is a problem that must be studied clearly. We conducted an empirical study to investigate the main difficulties faced by primary market participants in their business practices and the potential challenges of the deepening application of blockchain technology in the primary market. We adopted a hybrid method combining interviews (qualitative methods) and surveys (quantitative methods) to conduct this research in two stages. In the first stage, we interview 15 major primary market participants with different backgrounds and expertise. In the second phase, we conducted a verification survey of 54 primary market practitioners to confirm various insights from the interviews, including challenges and desired improvements. Our interviews and survey results revealed several significant challenges facing blockchain applications in the primary market: complex due diligence, mismatch, and difficult monitoring. On this basis, we believe that our future research can focus on some aspects of these challenges.

preprint2022arXiv

Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification

Recently, self-attention mechanisms have shown impressive performance in various NLP and CV tasks, which can help capture sequential characteristics and derive global information. In this work, we explore how to extend self-attention modules to better learn subtle feature embeddings for recognizing fine-grained objects, e.g., different bird species or person identities. To this end, we propose a dual cross-attention learning (DCAL) algorithm to coordinate with self-attention learning. First, we propose global-local cross-attention (GLCA) to enhance the interactions between global images and local high-response regions, which can help reinforce the spatial-wise discriminative clues for recognition. Second, we propose pair-wise cross-attention (PWCA) to establish the interactions between image pairs. PWCA can regularize the attention learning of an image by treating another image as distractor and will be removed during inference. We observe that DCAL can reduce misleading attentions and diffuse the attention response to discover more complementary parts for recognition. We conduct extensive evaluations on fine-grained visual categorization and object re-identification. Experiments demonstrate that DCAL performs on par with state-of-the-art methods and consistently improves multiple self-attention baselines, e.g., surpassing DeiT-Tiny and ViT-Base by 2.8% and 2.4% mAP on MSMT17, respectively.

preprint2022arXiv

Dynamic Sparse R-CNN

Sparse R-CNN is a recent strong object detection baseline by set prediction on sparse, learnable proposal boxes and proposal features. In this work, we propose to improve Sparse R-CNN with two dynamic designs. First, Sparse R-CNN adopts a one-to-one label assignment scheme, where the Hungarian algorithm is applied to match only one positive sample for each ground truth. Such one-to-one assignment may not be optimal for the matching between the learned proposal boxes and ground truths. To address this problem, we propose dynamic label assignment (DLA) based on the optimal transport algorithm to assign increasing positive samples in the iterative training stages of Sparse R-CNN. We constrain the matching to be gradually looser in the sequential stages as the later stage produces the refined proposals with improved precision. Second, the learned proposal boxes and features remain fixed for different images in the inference process of Sparse R-CNN. Motivated by dynamic convolution, we propose dynamic proposal generation (DPG) to assemble multiple proposal experts dynamically for providing better initial proposal boxes and features for the consecutive training stages. DPG thereby can derive sample-dependent proposal boxes and features for inference. Experiments demonstrate that our method, named Dynamic Sparse R-CNN, can boost the strong Sparse R-CNN baseline with different backbones for object detection. Particularly, Dynamic Sparse R-CNN reaches the state-of-the-art 47.2% AP on the COCO 2017 validation set, surpassing Sparse R-CNN by 2.2% AP with the same ResNet-50 backbone.

preprint2022arXiv

E^2VTS: Energy-Efficient Video Text Spotting from Unmanned Aerial Vehicles

Unmanned Aerial Vehicles (UAVs) based video text spotting has been extensively used in civil and military domains. UAV's limited battery capacity motivates us to develop an energy-efficient video text spotting solution. In this paper, we first revisit RCNN's crop & resize training strategy and empirically find that it outperforms aligned RoI sampling on a real-world video text dataset captured by UAV. To reduce energy consumption, we further propose a multi-stage image processor that takes videos' redundancy, continuity, and mixed degradation into account. Lastly, the model is pruned and quantized before deployed on Raspberry Pi. Our proposed energy-efficient video text spotting solution, dubbed as E^2VTS, outperforms all previous methods by achieving a competitive tradeoff between energy efficiency and performance. All our codes and pre-trained models are available at https://github.com/wuzhenyusjtu/LPCVC20-VideoTextSpotting.

preprint2022arXiv

FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server

Despite achieving remarkable performance, Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency. In this paper, we propose a novel FL framework, i.e., FedDUAP, with two original contributions, to exploit the insensitive data on the server and the decentralized data in edge devices to further improve the training efficiency. First, a dynamic server update algorithm is designed to exploit the insensitive data on the server, in order to dynamically determine the optimal steps of the server update for improving the convergence and accuracy of the global model. Second, a layer-adaptive model pruning method is developed to perform unique pruning operations adapted to the different dimensions and importance of multiple layers, to achieve a good balance between efficiency and effectiveness. By integrating the two original techniques together, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).

preprint2022arXiv

FedHiSyn: A Hierarchical Synchronous Federated Learning Framework for Resource and Data Heterogeneity

Federated Learning (FL) enables training a global model without sharing the decentralized raw data stored on multiple devices to protect data privacy. Due to the diverse capacity of the devices, FL frameworks struggle to tackle the problems of straggler effects and outdated models. In addition, the data heterogeneity incurs severe accuracy degradation of the global model in the FL training process. To address aforementioned issues, we propose a hierarchical synchronous FL framework, i.e., FedHiSyn. FedHiSyn first clusters all available devices into a small number of categories based on their computing capacity. After a certain interval of local training, the models trained in different categories are simultaneously uploaded to a central server. Within a single category, the devices communicate the local updated model weights to each other based on a ring topology. As the efficiency of training in the ring topology prefers devices with homogeneous resources, the classification based on the computing capacity mitigates the impact of straggler effects. Besides, the combination of the synchronous update of multiple categories and the device communication within a single category help address the data heterogeneity issue while achieving high accuracy. We evaluate the proposed framework based on MNIST, EMNIST, CIFAR10 and CIFAR100 datasets and diverse heterogeneous settings of devices. Experimental results show that FedHiSyn outperforms six baseline methods, e.g., FedAvg, SCAFFOLD, and FedAT, in terms of training accuracy and efficiency.

preprint2022arXiv

From Distributed Machine Learning to Federated Learning: A Survey

In recent years, data and computing resources are typically distributed in the devices of end users, various regions or organizations. Because of laws or regulations, the distributed data and computing resources cannot be directly shared among different regions or organizations for machine learning tasks. Federated learning emerges as an efficient approach to exploit distributed data and computing resources, so as to collaboratively train machine learning models, while obeying the laws and regulations and ensuring data security and data privacy. In this paper, we provide a comprehensive survey of existing works for federated learning. We propose a functional architecture of federated learning systems and a taxonomy of related techniques. Furthermore, we present the distributed training, data communication, and security of FL systems. Finally, we analyze their limitations and propose future research directions.

preprint2022arXiv

Hyper-Tune: Towards Efficient Hyper-parameter Tuning at Scale

The ever-growing demand and complexity of machine learning are putting pressure on hyper-parameter tuning systems: while the evaluation cost of models continues to increase, the scalability of state-of-the-arts starts to become a crucial bottleneck. In this paper, inspired by our experience when deploying hyper-parameter tuning in a real-world application in production and the limitations of existing systems, we propose Hyper-Tune, an efficient and robust distributed hyper-parameter tuning framework. Compared with existing systems, Hyper-Tune highlights multiple system optimizations, including (1) automatic resource allocation, (2) asynchronous scheduling, and (3) multi-fidelity optimizer. We conduct extensive evaluations on benchmark datasets and a large-scale real-world dataset in production. Empirically, with the aid of these optimizations, Hyper-Tune outperforms competitive hyper-parameter tuning systems on a wide range of scenarios, including XGBoost, CNN, RNN, and some architectural hyper-parameters for neural networks. Compared with the state-of-the-art BOHB and A-BOHB, Hyper-Tune achieves up to 11.2x and 5.1x speedups, respectively.

preprint2022arXiv

Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond

Deep neural networks have been well-known for their superb handling of various machine learning and artificial intelligence tasks. However, due to their over-parameterized black-box nature, it is often difficult to understand the prediction results of deep models. In recent years, many interpretation tools have been proposed to explain or reveal how deep models make decisions. In this paper, we review this line of research and try to make a comprehensive survey. Specifically, we first introduce and clarify two basic concepts -- interpretations and interpretability -- that people usually get confused about. To address the research efforts in interpretations, we elaborate the designs of a number of interpretation algorithms, from different perspectives, by proposing a new taxonomy. Then, to understand the interpretation results, we also survey the performance metrics for evaluating interpretation algorithms. Further, we summarize the current works in evaluating models' interpretability using "trustworthy" interpretation algorithms. Finally, we review and discuss the connections between deep models' interpretations and other factors, such as adversarial robustness and learning from interpretations, and we introduce several open-source libraries for interpretation algorithms and evaluation approaches.

preprint2022arXiv

Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources

Although more layers and more parameters generally improve the accuracy of the models, such big models generally have high computational complexity and require big memory, which exceed the capacity of small devices for inference and incurs long training time. In addition, it is difficult to afford long training time and inference time of big models even in high performance servers, as well. As an efficient approach to compress a large deep model (a teacher model) to a compact model (a student model), knowledge distillation emerges as a promising approach to deal with the big models. Existing knowledge distillation methods cannot exploit the elastic available computing resources and correspond to low efficiency. In this paper, we propose an Elastic Deep Learning framework for knowledge Distillation, i.e., EDL-Dist. The advantages of EDL-Dist are three-fold. First, the inference and the training process is separated. Second, elastic available computing resources can be utilized to improve the efficiency. Third, fault-tolerance of the training and inference processes is supported. We take extensive experimentation to show that the throughput of EDL-Dist is up to 3.125 times faster than the baseline method (online knowledge distillation) while the accuracy is similar or higher.

preprint2022arXiv

Learning Bi-typed Multi-relational Heterogeneous Graph via Dual Hierarchical Attention Networks

Bi-type multi-relational heterogeneous graph (BMHG) is one of the most common graphs in practice, for example, academic networks, e-commerce user behavior graph and enterprise knowledge graph. It is a critical and challenge problem on how to learn the numerical representation for each node to characterize subtle structures. However, most previous studies treat all node relations in BMHG as the same class of relation without distinguishing the different characteristics between the intra-class relations and inter-class relations of the bi-typed nodes, causing the loss of significant structure information. To address this issue, we propose a novel Dual Hierarchical Attention Networks (DHAN) based on the bi-typed multi-relational heterogeneous graphs to learn comprehensive node representations with the intra-class and inter-class attention-based encoder under a hierarchical mechanism. Specifically, the former encoder aggregates information from the same type of nodes, while the latter aggregates node representations from its different types of neighbors. Moreover, to sufficiently model node multi-relational information in BMHG, we adopt a newly proposed hierarchical mechanism. By doing so, the proposed dual hierarchical attention operations enable our model to fully capture the complex structures of the bi-typed multi-relational heterogeneous graphs. Experimental results on various tasks against the state-of-the-arts sufficiently confirm the capability of DHAN in learning node representations on the BMHGs.

preprint2022arXiv

Multi-Entanglement Routing Design over Quantum Networks

Quantum networks are considered as a promising future platform for quantum information exchange and quantum applications, which have capabilities far beyond the traditional communication networks. Remote quantum entanglement is an essential component of a quantum network. How to efficiently design a multi-routing entanglement protocol is a fundamental yet challenging problem. In this paper, we study a quantum entanglement routing problem to simultaneously maximize the number of quantum-user pairs and their expected throughput. Our approach is to formulate the problem as two sequential integer programming steps. We propose efficient entanglement routing algorithms for the two integer programming steps and analyze their time complexity and performance bounds. Results of evaluation highlight that our approach outperforms existing solutions in both served quantum-user pairs numbers and the network expected throughput.

preprint2022arXiv

Not All SWAPs Have the Same Cost: A Case for Optimization-Aware Qubit Routing

Despite rapid advances in quantum computing technologies, the qubit connectivity limitation remains to be a critical challenge. Both near-term NISQ quantum computers and relatively long-term scalable quantum architectures do not offer full connectivity. As a result, quantum circuits may not be directly executed on quantum hardware, and a quantum compiler needs to perform qubit routing to make the circuit compatible with the device layout. During the qubit routing step, the compiler inserts SWAP gates and performs circuit transformations. Given the connectivity topology of the target hardware, there are typically multiple qubit routing candidates. The state-of-the-art compilers use a cost function to evaluate the number of SWAP gates for different routes and then select the one with the minimum number of SWAP gates. After qubit routing, the quantum compiler performs gate optimizations upon the circuit with the newly inserted SWAP gates. In this paper, we observe that the aforementioned qubit routing is not optimal, and qubit routing should \textit{not} be independent on subsequent gate optimizations. We find that with the consideration of gate optimizations, not all of the SWAP gates have the same basis-gate cost. These insights lead to the development of our qubit routing algorithm, NASSC (Not All Swaps have the Same Cost). NASSC is the first algorithm that considers the subsequent optimizations during the routing step. Our optimization-aware qubit routing leads to better routing decisions and benefits subsequent optimizations. We also propose a new optimization-aware decomposition for the inserted SWAP gates. Our experiments show that the routing overhead compiled with our routing algorithm is reduced by up to $69.30\%$ ($21.30\%$ on average) in the number of CNOT gates and up to $43.50\%$ ($7.61\%$ on average) in the circuit depth compared with the state-of-the-art scheme, SABRE.

preprint2022arXiv

Reaching a Consensus with Limited Information

In its simplest form the well known consensus problem for a networked family of autonomous agents is to devise a set of protocols or update rules, one for each agent, which can enable all of the agents to adjust or tune their "agreement variable" to the same value by utilizing real-time information obtained from their "neighbors" within the network. The aim of this paper is to study the problem of achieving a consensus in the face of limited information transfer between agents. By this it is meant that instead of each agent receiving an agreement variable or real-valued state vector from each of its neighbors, it receives a linear function of each state instead. The specific problem of interest is formulated and provably correct algorithms are developed for a number of special cases of the problem.

preprint2022arXiv

Resilient Constrained Consensus over Complete Graphs via Feasibility Redundancy

This paper considers a resilient high-dimensional constrained consensus problem and studies a resilient distributed algorithm for complete graphs. For convex constrained sets with a singleton intersection, a sufficient condition on feasibility redundancy and set regularity for reaching a desired consensus exponentially fast in the presence of Byzantine agents is derived, which can be directly applied to polyhedral sets. A necessary condition on feasibility redundancy for the resilient constrained consensus problem to be solvable is also provided.

preprint2022arXiv

Stock Movement Prediction Based on Bi-typed Hybrid-relational Market Knowledge Graph via Dual Attention Networks

Stock Movement Prediction (SMP) aims at predicting listed companies' stock future price trend, which is a challenging task due to the volatile nature of financial markets. Recent financial studies show that the momentum spillover effect plays a significant role in stock fluctuation. However, previous studies typically only learn the simple connection information among related companies, which inevitably fail to model complex relations of listed companies in the real financial market. To address this issue, we first construct a more comprehensive Market Knowledge Graph (MKG) which contains bi-typed entities including listed companies and their associated executives, and hybrid-relations including the explicit relations and implicit relations. Afterward, we propose DanSmp, a novel Dual Attention Networks to learn the momentum spillover signals based upon the constructed MKG for stock prediction. The empirical experiments on our constructed datasets against nine SOTA baselines demonstrate that the proposed DanSmp is capable of improving stock prediction with the constructed MKG.

preprint2022arXiv

Unified Visual Transformer Compression

Vision transformers (ViTs) have gained popularity recently. Even without customized image operators such as convolutions, ViTs can yield competitive performance when properly trained on massive data. However, the computational overhead of ViTs remains prohibitive, due to stacking multi-head self-attention modules and else. Compared to the vast literature and prevailing success in compressing convolutional neural networks, the study of Vision Transformer compression has also just emerged, and existing works focused on one or two aspects of compression. This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation. We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations, under a distillation loss. The optimization problem is then solved using the primal-dual algorithm. Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors. For example, DeiT-Tiny can be trimmed down to 50\% of the original FLOPs almost without losing accuracy. Codes are available online:~\url{https://github.com/VITA-Group/UVC}.

preprint2021arXiv

C-Watcher: A Framework for Early Detection of High-Risk Neighborhoods Ahead of COVID-19 Outbreak

The novel coronavirus disease (COVID-19) has crushed daily routines and is still rampaging through the world. Existing solution for nonpharmaceutical interventions usually needs to timely and precisely select a subset of residential urban areas for containment or even quarantine, where the spatial distribution of confirmed cases has been considered as a key criterion for the subset selection. While such containment measure has successfully stopped or slowed down the spread of COVID-19 in some countries, it is criticized for being inefficient or ineffective, as the statistics of confirmed cases are usually time-delayed and coarse-grained. To tackle the issues, we propose C-Watcher, a novel data-driven framework that aims at screening every neighborhood in a target city and predicting infection risks, prior to the spread of COVID-19 from epicenters to the city. In terms of design, C-Watcher collects large-scale long-term human mobility data from Baidu Maps, then characterizes every residential neighborhood in the city using a set of features based on urban mobility patterns. Furthermore, to transfer the firsthand knowledge (witted in epicenters) to the target city before local outbreaks, we adopt a novel adversarial encoder framework to learn "city-invariant" representations from the mobility-related features for precise early detection of high-risk neighborhoods, even before any confirmed cases known, in the target city. We carried out extensive experiments on C-Watcher using the real-data records in the early stage of COVID-19 outbreaks, where the results demonstrate the efficiency and effectiveness of C-Watcher for early detection of high-risk neighborhoods from a large number of cities.

preprint2021arXiv

Influence of flux limitation on large time behavior in a three-dimensional chemotaxis-Stokes system modeling coral fertilization

In this paper, we consider the following system $$\left\{\begin{array}{ll} n_t+u\cdot\nabla n&=Δn-\nabla\cdot(n\mathcal{S}(|\nabla c|^2)\nabla c)-nm,\\ c_t+u\cdot\nabla c&=Δc-c+m,\\ m_t+u\cdot\nabla m&=Δm-mn,\\ u_t&=Δu+\nabla P+(n+m)\nablaΦ,\qquad \nabla\cdot u=0 \end{array}\right.$$ which models the process of coral fertilization, in a smoothly three-dimensional bounded domain, where $\mathcal{S}$ is a given function fulfilling $$|\mathcal{S}(σ)|\leq K_{\mathcal{S}}(1+σ)^{-\fracθ{2}},\qquad σ\geq 0$$ with some $K_{\mathcal{S}}>0.$ Based on conditional estimates of the quantity $c$ and the gradients thereof, a relatively compressed argument as compared to that proceeding in related precedents shows that if $$θ>0,$$ then for any initial data with proper regularity an associated initial-boundary problem under no-flux/no-flux/no-flux/Dirichlet boundary conditions admits a unique classical solution which is globally bounded, and which also enjoys the stabilization features in the sense that $$\|n(\cdot,t)-n_{\infty}\|_{L^{\infty}(Ω)}+\|c(\cdot,t)-m_{\infty}\|_{W^{1,\infty}(Ω)} +\|m(\cdot,t)-m_{\infty}\|_{W^{1,\infty}(Ω)}+\|u(\cdot,t)\|_{L^{\infty}(Ω)}\rightarrow0 \quad\textrm{as}~t\rightarrow \infty$$ with $n_{\infty}:=\frac{1}{|Ω|}\left\{\int_Ωn_0-\int_Ωm_0\right\}_{+}$ and $m_{\infty}:=\frac{1}{|Ω|}\left\{\int_Ωm_0-\int_Ωn_0\right\}_{+}.$

preprint2021arXiv

Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments

Exploration under sparse reward is a long-standing challenge of model-free reinforcement learning. The state-of-the-art methods address this challenge by introducing intrinsic rewards to encourage exploration in novel states or uncertain environment dynamics. Unfortunately, methods based on intrinsic rewards often fall short in procedurally-generated environments, where a different environment is generated in each episode so that the agent is not likely to visit the same state more than once. Motivated by how humans distinguish good exploration behaviors by looking into the entire episode, we introduce RAPID, a simple yet effective episode-level exploration method for procedurally-generated environments. RAPID regards each episode as a whole and gives an episodic exploration score from both per-episode and long-term views. Those highly scored episodes are treated as good exploration behaviors and are stored in a small ranking buffer. The agent then imitates the episodes in the buffer to reproduce the past good exploration behaviors. We demonstrate our method on several procedurally-generated MiniGrid environments, a first-person-view 3D Maze navigation task from MiniWorld, and several sparse MuJoCo tasks. The results show that RAPID significantly outperforms the state-of-the-art intrinsic reward strategies in terms of sample efficiency and final performance. The code is available at https://github.com/daochenzha/rapid

preprint2021arXiv

SpeechNAS: Towards Better Trade-off between Latency and Accuracy for Large-Scale Speaker Verification

Recently, x-vector has been a successful and popular approach for speaker verification, which employs a time delay neural network (TDNN) and statistics pooling to extract speaker characterizing embedding from variable-length utterances. Improvement upon the x-vector has been an active research area, and enormous neural networks have been elaborately designed based on the x-vector, eg, extended TDNN (E-TDNN), factorized TDNN (F-TDNN), and densely connected TDNN (D-TDNN). In this work, we try to identify the optimal architectures from a TDNN based search space employing neural architecture search (NAS), named SpeechNAS. Leveraging the recent advances in the speaker recognition, such as high-order statistics pooling, multi-branch mechanism, D-TDNN and angular additive margin softmax (AAM) loss with a minimum hyper-spherical energy (MHE), SpeechNAS automatically discovers five network architectures, from SpeechNAS-1 to SpeechNAS-5, of various numbers of parameters and GFLOPs on the large-scale text-independent speaker recognition dataset VoxCeleb1. Our derived best neural network achieves an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1, which surpasses previous TDNN based state-of-the-art approaches by a large margin. Code and trained weights are in https://github.com/wentaozhu/speechnas.git

Ji Liu

What is connected

Connect this record

See the researcher in context

Building this map preview

88 published item(s)

A Class of Optimal Directed Graphs for Network Synchronization

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

Oriented Triplet $p$-Wave Pairing from Fermi surface Anisotropy and Nonlocal Attraction

Efficient Asynchronous Federated Learning with Sparsification and Quantization

A Hybrid Observer for Estimating the State of a Distributed Linear System

Accelerated Federated Learning with Decoupled Adaptive Optimization

Adversarial Contrastive Self-Supervised Learning

Combining Intra-Risk and Contagion Risk for Enterprise Bankruptcy Prediction Using Graph Neural Networks

Cooperative Actor-Critic via TD Error Aggregation

Digging into Primary Financial Market: Challenges and Opportunities of Adopting Blockchain

Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification

Dynamic Sparse R-CNN

E^2VTS: Energy-Efficient Video Text Spotting from Unmanned Aerial Vehicles

FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server

FedHiSyn: A Hierarchical Synchronous Federated Learning Framework for Resource and Data Heterogeneity

From Distributed Machine Learning to Federated Learning: A Survey

Hyper-Tune: Towards Efficient Hyper-parameter Tuning at Scale

Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond

Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources

Learning Bi-typed Multi-relational Heterogeneous Graph via Dual Hierarchical Attention Networks

Multi-Entanglement Routing Design over Quantum Networks

Not All SWAPs Have the Same Cost: A Case for Optimization-Aware Qubit Routing

Reaching a Consensus with Limited Information

Resilient Constrained Consensus over Complete Graphs via Feasibility Redundancy

Stock Movement Prediction Based on Bi-typed Hybrid-relational Market Knowledge Graph via Dual Attention Networks

Unified Visual Transformer Compression

C-Watcher: A Framework for Early Detection of High-Risk Neighborhoods Ahead of COVID-19 Outbreak

Influence of flux limitation on large time behavior in a three-dimensional chemotaxis-Stokes system modeling coral fertilization

Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments

SpeechNAS: Towards Better Trade-off between Latency and Accuracy for Large-Scale Speaker Verification

A Distributed Observer for a Continuous-Time Linear System with time-varying network

A Game-Theoretic Framework for Multi-Period-Multi-Company Demand Response Management in the Smart Grid

A novel tree-structured point cloud dataset for skeletonization algorithm evaluation

A Private and Finite-Time Algorithm for Solving a Distributed System of Linear Equations

An Investigation of Containment Measures Against the COVID-19 Pandemic in Mainland China

Analysis of a Nonlinear Opinion Dynamics Model with Biased Assimilation

Analysis, Identification, and Validation of Discrete-Time Epidemic Processes

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

Automatic Neural Network Compression by Sparsity-Quantization Joint Learning: A Constrained Optimization-based Approach

Central Server Free Federated Learning over Single-sided Trust Social Networks

Controlling a Networked SIS Model via a Single Input over Undirected Graphs

Data Poisoning Attacks on Federated Machine Learning

Depth Edge Guided CNNs for Sparse Depth Upsampling

DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

Finite-Sample Analysis of Proximal Gradient TD Algorithms

GAN Slimming: All-in-One GAN Compression by A Unified Optimization Framework

IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA

Neural Network Activation Quantization with Bitwise Information Bottlenecks

On Spectral Properties of Signed Laplacians with Connections to Eventual Positivity

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

Regularized Off-Policy TD-Learning

ServiceNet: A P2P Service Network

Stochastic Recursive Momentum for Policy Gradient Methods

Stochastic Recursive Variance Reduction for Efficient Smooth Non-Convex Compositional Optimization

Streaming Probabilistic Deep Tensor Factorization

The Weather Impacts the Outbreak of COVID-19 in Mainland China

Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search

Model Compression with Adversarial Robustness: A Unified Optimization Framework

Accelerating Stochastic Composition Optimization

Efficient Estimation of Compressible State-Space Models with Application to Calcium Signal Deconvolution

On a Modified DeGroot-Friedkin Model of Opinion Dynamics

On the Analysis of a Continuous-Time Bi-Virus Model

Prognostics of Surgical Site Infections using Dynamic Health Data

Request-Based Gossiping without Deadlocks

Staleness-aware Async-SGD for Distributed Deep Learning

A Distributed Algorithm for Solving a Linear Algebraic Equation

A Stackelberg Game for Multi-Period Demand Response Management in the Smart Grid

Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties

Complete scaling makes differences along the critical isobar: Molecular dynamics simulations of Lennard-Jones fluid with finite-time scaling

Distributed Evaluation and Convergence of Self-Appraisals in Social Networks

Exclusive Sparsity Norm Minimization with Random Groups via Cone Projection

Parametric cooling of a degenerate Fermi gas in an optical trap