Source author record

Bingsheng He

Bingsheng He appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Distributed, Parallel, and Cluster Computing · Hardware Architecture · Machine Learning · Artificial Intelligence · Databases · math.OC · Computer Vision · Cryptography and Security · Data Structures and Algorithms · eess.AS · Information Retrieval · Neural and Evolutionary Computing · Operating Systems · Performance · Programming Languages · Social and Information Networks · Software Engineering · Sound

Catalog footprint

What is connected

23 works

18 topics

4 close collaborators

preprint · 2026 · arXiv

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

High-Level Synthesis (HLS) compiles algorithmic C/C++ descriptions into hardware, with Quality of Results (QoR) -- latency and resource utilization -- critically governed by pragma configurations and code structure. Existing LLM-based HLS approaches train for functional correctness but ignore QoR entirely. We observe that reinforcement learning (RL) for HLS does not require absolute synthesis results -- only relative comparisons between candidates. Based on this insight, we propose HLS-Seek, a QoR-aware NL-to-HLS framework that replaces expensive synthesis-in-the-loop RL with a comparative proxy reward model achieving 99.53% Pareto-dominance accuracy. To prevent reward hacking, we introduce uncertainty-aware Monte Carlo (MC) dropout switching, which selectively invokes real Vitis HLS synthesis for low-confidence candidates and updates the proxy online, creating a self-improving reward system. HLS-Seek achieves 81.5% syntax correctness pass@1 and 81.4% Func@5 on HLS-eval with only 7B parameters, surpassing GPT-5.1 and other frontier models while achieving 8.5× faster training than real-reward RL. On QoR evaluation, HLS-Seek achieves the lowest latency on 16/30 kernels and Pareto-dominates HLS-specific baselines on 9 kernels.
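The relative comparison this abstract relies on, Pareto dominance over latency and resource utilization, can be sketched in a few lines. This is only an illustration of the dominance test itself, not the HLS-Seek reward model: the `QoR` record, its field names, and the single aggregated `resource_units` value are assumptions made for the example.

```python
from typing import NamedTuple


class QoR(NamedTuple):
    """Hypothetical QoR record; both metrics are 'lower is better'."""
    latency_cycles: int
    resource_units: int  # assumed single aggregate of LUT/FF/DSP/BRAM usage


def dominates(a: QoR, b: QoR) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (a.latency_cycles <= b.latency_cycles
                and a.resource_units <= b.resource_units)
    strictly_better = (a.latency_cycles < b.latency_cycles
                       or a.resource_units < b.resource_units)
    return no_worse and strictly_better


def compare(a: QoR, b: QoR) -> int:
    """Comparative label: +1 if a dominates b, -1 if b dominates a, else 0."""
    if dominates(a, b):
        return 1
    if dominates(b, a):
        return -1
    return 0
```

A label of 0 covers both identical candidates and genuine trade-offs (one design faster, the other smaller), which is why a comparative reward needs no absolute synthesis numbers.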

preprint · 2026 · arXiv

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i) Software and hardware are fundamentally different engineering tasks: the same agent loses 37% to 58% from SWE-bench Verified to Phoenix-bench because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii) Failures concentrate on design control-flow / finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii) Localization granularity matters far more than localization itself: a perfect file-level oracle yields only +1.4% because the agent then breaks files that did not need editing, while a single round of test case feedback lifts resolved rate by 42% to 45% because the test case tells where the bug is and what the fix has to look like.

preprint · 2026 · arXiv

Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

LLM-based generation of SystemVerilog Assertions (SVA) is often reported as nearing saturation, with the strongest specialized model reaching ~76% accuracy on NL2SVA-Human. We show that this aggregate hides a temporal gap: models that appear strong overall still collapse to a few implication templates on bounded-delay and liveness specifications. The core issue is that the dominant recipe, supervised fine-tuning on NL/SVA pairs, optimizes token-level mimicry rather than the property equivalence that defines SVA correctness. We introduce Reward-Weighted On-Policy Distillation (RWOPD), an on-policy distillation method that samples student rollouts, scores them with an open SymbiYosys+Z3 Property-Equivalence Checker (PEC), and applies a verifier-reward-weighted forward-KL gradient from a frozen 14B teacher on verifier-passable rollouts. This keeps the supervision dense at every response token while grounding both selection and loss weight in property-equivalent behavior. RWOPD distills CodeV-SVA-14B into a Qwen2.5-Coder-7B-Instruct student that sets a new state of the art on NL2SVA-Human and NL2SVA-Machine across pass@1, pass@5, and pass@10, surpassing both specialized prior SOTA models and 671B general-purpose baselines.

preprint · 2026 · arXiv

RidgeWalker: Perfectly Pipelined Graph Random Walks on FPGAs

Graph Random Walks (GRWs) offer efficient approximations of key graph properties and have been widely adopted in many applications. However, GRW workloads are notoriously difficult to accelerate due to their strong data dependencies, irregular memory access patterns, and imbalanced execution behavior. While recent work explores FPGA-based accelerators for GRWs, existing solutions fall far short of hardware potential due to inefficient pipelining and static scheduling. This paper presents RidgeWalker, a high-performance GRW accelerator designed for datacenter FPGAs. The key insight behind RidgeWalker is that the Markov property of GRWs allows decomposition into stateless, fine-grained tasks that can be executed out-of-order without compromising correctness. Building on this, RidgeWalker introduces an asynchronous pipeline architecture with a feedback-driven scheduler grounded in queuing theory, enabling perfect pipelining and adaptive load balancing. We prototype RidgeWalker on datacenter FPGAs and evaluate it across a range of GRW algorithms and real-world graph datasets. Experimental results demonstrate that RidgeWalker achieves an average speedup of 7.0x over state-of-the-art FPGA solutions and 8.1x over GPU solutions, with peak speedups of up to 71.0x and 22.9x, respectively. The source code is publicly available at https://github.com/Xtra-Computing/RidgeWalker.
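The decomposition described in this abstract can be illustrated in software. The sketch below treats each walk step as a self-contained task `(walk_id, vertex, remaining)` and drains a shared queue in shuffled order; it is a toy model of the idea, not RidgeWalker's hardware pipeline, and the function names and the adjacency-list graph format are assumptions for the example.

```python
import random
from collections import deque


def step(graph, vertex, rng):
    """One stateless task: the next vertex depends only on the current vertex."""
    neighbors = graph[vertex]
    return rng.choice(neighbors) if neighbors else None


def run_walks(graph, starts, length, seed=0):
    """Run one walk per start vertex, executing steps of different walks out of order."""
    rng = random.Random(seed)
    paths = {wid: [v] for wid, v in enumerate(starts)}
    tasks = deque((wid, v, length) for wid, v in enumerate(starts))
    while tasks:
        # Rotate by a random amount so tasks are served in no fixed order.
        tasks.rotate(-rng.randrange(len(tasks)))
        wid, vertex, remaining = tasks.popleft()
        if remaining == 0:
            continue
        nxt = step(graph, vertex, rng)
        if nxt is None:  # dead end: the walk stops early
            continue
        paths[wid].append(nxt)
        tasks.append((wid, nxt, remaining - 1))
    return paths
```

Each walk has at most one task in flight, so reordering tasks across walks never reorders the steps within a walk. That is the property that makes out-of-order execution safe here.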

preprint · 2026 · arXiv

Robust Multimodal Recommendation via Graph Retrieval-Enhanced Modality Completion

Multimodal data plays a critical role in web-based recommendation systems, where information from diverse modalities such as vision and text enhances representation learning. However, real-world multimodal datasets often suffer from modality incompleteness due to sensor failures, annotation scarcity, or privacy constraints, which substantially degrade model performance and reliability. One effective solution to address this issue is modality completion, which reconstructs missing features to provide modality-complete graphs for downstream tasks. Given a query node with missing multimodal features, existing modality completion methods typically infer information from the node itself or its neighbors to reconstruct the missing modality. However, these methods may overlook semantically relevant context in the graph, which contains valuable cues that are non-trivial to capture through simple methods like neighborhood aggregation. In this work, we propose GRE-MC, a Graph Retrieval-Enhanced Modality Completion framework, to overcome these limitations. By introducing a modality-aware subgraph retrieval mechanism, GRE-MC selects semantically relevant subgraphs from the entire graph, providing richer contextual information for completing missing modalities. Subsequently, a graph transformer jointly encodes the query node and the retrieved subgraph via global attention to complete the missing features, while a learnable sparse-routing codebook regularizes latent embeddings into compact bases for improved robustness. Extensive experiments on multimodal recommendation benchmarks demonstrate that GRE-MC consistently outperforms state-of-the-art methods, validating the effectiveness of subgraph retrieval and joint-encoding graph transformer for robust modality completion.

preprint · 2022 · arXiv

A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

Multi-tenant machine learning services have become emerging data-intensive workloads in data centers with heavy usage of GPU resources. Due to the large scale, many tuning parameters and heavy resource usage, it is usually impractical to evaluate and benchmark those machine learning services on real clusters. In this demonstration, we present AnalySIM, a cluster simulator that allows efficient design explorations for multi-tenant machine learning services. Specifically, by trace-driven cluster workload simulation, AnalySIM can easily test and analyze various scheduling policies in a number of performance metrics such as GPU resource utilization. AnalySIM simulates the cluster computational resource based on both physical topology and logical partition. The tool has been used in SenseTime to understand the impact of different scheduling policies with the trace from a real production cluster of over 1000 GPUs. We find that preemption and migration are able to significantly reduce average job completion time and mitigate the resource fragmentation problem.

preprint · 2022 · arXiv

An In-Depth Study of Continuous Subgraph Matching (Complete Version)

Continuous subgraph matching (CSM) algorithms find the occurrences of a given pattern on a stream of data graphs online. A number of incremental CSM algorithms have been proposed. However, a systematic study of these algorithms that identifies their advantages and disadvantages on a wide range of workloads is missing. Therefore, we first propose to model CSM as incremental view maintenance (IVM) to capture the design space of existing algorithms. Then, we implement six representative CSM algorithms, including IncIsoMatch, SJ-Tree, Graphflow, IEDyn, TurboFlux, and SymBi, in a common framework based on IVM. We further conduct extensive experiments to evaluate the overall performance of competing algorithms as well as study the effectiveness of individual techniques to pinpoint the key factors leading to the performance differences. We obtain the following new insights into the performance: (1) existing algorithms start the search from an edge in the query graph that maps to an updated data edge, potentially leading to many invalid partial results; (2) all matching orders are based on simple heuristics, which appear ineffective at times; (3) index updates dominate the query time on some queries; and (4) the algorithm with constant delay enumeration bears significant index update cost. Consequently, no algorithm dominates the others in all cases. Therefore, we give a few recommendations based on our experiment results. In particular, the SymBi index is useful for sparse queries or long running queries. The matching orders of IEDyn and TurboFlux work well on tree queries, those of Graphflow on dense queries or when both query and data graphs are sparse, and otherwise, we recommend SymBi's matching orders.

preprint · 2022 · arXiv

Indefinite linearized augmented Lagrangian method for convex programming with linear inequality constraints

The augmented Lagrangian method (ALM) is a benchmark for convex programming problems with linear constraints; ALM and its variants for linearly equality-constrained convex minimization models have been well studied in the literature. However, much less attention has been paid to ALM for efficiently solving linearly inequality-constrained convex minimization models. In this paper, we exploit an enlightening reformulation of the newly developed indefinite linearized ALM for the equality-constrained convex optimization problem, and present a new indefinite linearized ALM scheme for efficiently solving the convex optimization problem with linear inequality constraints. The proposed method offers two main advantages, especially for large-scale optimization: first, it largely simplifies the challenging key subproblem of the classic ALM by employing its linearized reformulation, while keeping low complexity in computation; second, we show that only a smaller proximity regularization term is needed for provable convergence, which allows a bigger step-size and hence significantly better performance. Moreover, we show the global convergence of the proposed scheme upon its equivalent compact expression of prediction-correction, along with a worst-case $\mathcal{O}(1/N)$ convergence rate. Numerical results on some application problems demonstrate that a smaller regularization term can lead to a better experimental performance, which further confirms the theoretical results presented in this study.
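As background for this abstract, the textbook ALM iteration for the equality-constrained model and its standard linearized variant can be written as below. The symbols $\theta$, $A$, $b$, $\beta$, $r$ and $\lambda$ are notation assumed here for illustration; this is the classical positive-definite linearization, not the paper's indefinite scheme for inequality constraints.

```latex
% Model (notation assumed for illustration): \min_x \{ \theta(x) : Ax = b \}
\begin{aligned}
\text{ALM:}\quad
x^{k+1} &= \arg\min_x \Big\{ \theta(x) - (\lambda^k)^\top (Ax - b)
           + \tfrac{\beta}{2}\,\|Ax - b\|^2 \Big\}, \\
\lambda^{k+1} &= \lambda^k - \beta\,(Ax^{k+1} - b). \\[4pt]
\text{Linearized ALM:}\quad
x^{k+1} &= \arg\min_x \Big\{ \theta(x) + \tfrac{r}{2}\,\big\| x - x^k
           - \tfrac{1}{r} A^\top \big(\lambda^k - \beta (Ax^k - b)\big) \big\|^2 \Big\}, \\
\lambda^{k+1} &= \lambda^k - \beta\,(Ax^{k+1} - b).
\end{aligned}
```

The linearized subproblem follows from adding the proximal term $\tfrac{1}{2}\|x - x^k\|_D^2$ with $D = rI - \beta A^\top A$ to the ALM subproblem, which cancels the $A^\top A$ coupling and leaves a proximal step on $\theta$. The classical analysis takes $r > \beta\,\|A^\top A\|$ so that $D$ is positive definite; the abstract's second claim is that a smaller regularization term still yields provable convergence.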

preprint · 2022 · arXiv

On construction of splitting contraction algorithms in a prediction-correction framework for separable convex optimization

In the past decade, we have developed a series of splitting contraction algorithms for separable convex optimization problems, rooted in the alternating direction method of multipliers. Convergence of these algorithms was studied under specific model-tailored conditions, while these conditions can be conceptually abstracted as two generic conditions when these algorithms are all unified as a prediction-correction framework. In this paper, in turn, we showcase a constructive way for specifying the generic convergence-guaranteeing conditions, via which new splitting contraction algorithms can be generated automatically. It becomes possible to design more application-tailored splitting contraction algorithms by specifying the prediction-correction framework, while proving their convergence is routine.

preprint · 2022 · arXiv

Practical Vertical Federated Learning with Unsupervised Representation Learning

As societal concerns on data privacy recently increase, we have witnessed data silos among multiple parties in various applications. Federated learning emerges as a new learning paradigm that enables multiple parties to collaboratively train a machine learning model without sharing their raw data. Vertical federated learning, where each party owns different features of the same set of samples and only a single party has the label, is an important and challenging topic in federated learning. Communication costs among different parties have been a major hurdle for practical vertical learning systems. In this paper, we propose a novel communication-efficient vertical federated learning algorithm named FedOnce, which requires only one-shot communication among parties. To improve model accuracy and provide privacy guarantee, FedOnce features unsupervised learning representations in the federated setting and privacy-preserving techniques based on moments accountant. The comprehensive experiments on 10 datasets demonstrate that FedOnce achieves close performance compared to state-of-the-art vertical federated learning algorithms with much lower communication costs. Meanwhile, our privacy-preserving technique significantly outperforms the state-of-the-art approaches under the same privacy budget.

preprint · 2022 · arXiv

ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines

The use of FPGAs for efficient graph processing has attracted significant interest. Recent memory subsystem upgrades including the introduction of HBM in FPGAs promise to further alleviate memory bottlenecks. However, modern multi-channel HBM requires many more processing pipelines to fully utilize its bandwidth potential. Existing designs do not scale well, resulting in underutilization of the HBM facilities even when all other resources are fully consumed. In this paper, we re-examined the graph processing workloads and found much diversity in processing. We also found that the diverse workloads can be easily classified into two types, namely dense and sparse partitions. This motivates us to propose a resource-efficient heterogeneous pipeline architecture. Our heterogeneous architecture comprises two types of pipelines: Little pipelines to process dense partitions with good locality and Big pipelines to process sparse partitions with extremely poor locality. Unlike traditional monolithic pipeline designs, the heterogeneous pipelines are tailored for more specific memory access patterns, and hence are more lightweight, allowing the architecture to scale up more effectively with limited resources. In addition, we propose a model-guided task scheduling method that schedules partitions to the right pipeline types, generates the most efficient pipeline combination and balances workloads. Furthermore, we develop an automated open-source framework, called ReGraph, which automates the entire development process. ReGraph outperforms state-of-the-art FPGA accelerators by up to 5.9 times in terms of performance and 12 times in terms of resource efficiency.

preprint · 2022 · arXiv

The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems

This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. The extensive evaluations with reference implementations show the future research opportunities for important aspects of federated learning systems. We have developed reference implementations, and evaluated the important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time. Through these evaluations, we discovered some interesting findings such as federated learning can effectively increase end-to-end throughput.

preprint · 2022 · arXiv

The Serverless Computing Survey: A Technical Primer for Design Architecture

The development of cloud infrastructures inspires the emergence of cloud-native computing. As the most promising architecture for deploying microservices, serverless computing has recently attracted more and more attention in both industry and academia. Due to its inherent scalability and flexibility, serverless computing becomes attractive and more pervasive for ever-growing Internet services. Despite the momentum in the cloud-native community, the existing challenges and compromises still wait for more advanced research and solutions to further explore the potentials of the serverless computing model. As a contribution to this knowledge, this article surveys and elaborates the research domains in the serverless context by decoupling the architecture into four stack layers: Virtualization, Encapsule, System Orchestration, and System Coordination. Inspired by the security model, we highlight the key implications and limitations of these works in each layer, and make suggestions for potential challenges to the field of future serverless computing.

preprint · 2021 · arXiv

TransMask: A Compact and Fast Speech Separation Model Based on Transformer

Speech separation is an important problem in speech processing, which aims to separate and generate clean speech from a mixed audio containing speech from different speakers. Empowered by the deep learning technologies over sequence-to-sequence domain, recent neural speech separation models are now capable of generating highly clean speech audios. To make these models more practical by reducing the model size and inference time while maintaining high separation quality, we propose a new transformer-based speech separation approach, called TransMask. By fully unleashing the power of self-attention on long-term dependency exception, we demonstrate the size of TransMask is more than 60% smaller and the inference is more than 2 times faster than state-of-the-art solutions. TransMask fully utilizes the parallelism during inference, and achieves nearly linear inference time within reasonable input audio lengths. It also outperforms existing solutions on output speech audio quality, achieving SDR above 16 over the LibriMix benchmark.

preprint · 2020 · arXiv

Accelerating Generative Neural Networks on Unmodified Deep Learning Processors -- A Software Approach

Generative neural networks are a new category of neural networks and have been widely utilized in applications such as content generation, unsupervised learning, segmentation and pose estimation. They typically involve massive computing-intensive deconvolution operations that cannot be fitted to conventional neural network processors directly. However, prior works mainly investigated specialized hardware architectures through intensive hardware modifications to the existing deep learning processors to accelerate deconvolution together with the convolution. In contrast, this work proposes a novel deconvolution implementation with a software approach and enables fast and efficient deconvolution execution on the legacy deep learning processors. Our proposed method reorganizes the computation of deconvolution and allows the deep learning processors to treat it as the standard convolution by splitting the original deconvolution filters into multiple small filters. Compared to prior acceleration schemes, the implemented acceleration scheme achieves 2.41x - 4.34x performance speedup and reduces the energy consumption by 27.7% - 54.5% on a set of realistic benchmarks. In addition, we also applied the deconvolution computing approach to the off-the-shelf commodity deep learning processors, where deconvolution also exhibits a significant speedup over prior deconvolution implementations.
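The filter-splitting idea in this abstract can be checked on a small 1D case. The sketch below is a pure-Python illustration under stated assumptions (1D signals, integer stride, no padding), not the authors' implementation: a strided deconvolution is computed once by direct scatter and once as several ordinary convolutions with sub-filters whose outputs are interleaved. All function names are made up for the example.

```python
def transposed_conv1d(x, w, stride):
    """Reference deconvolution: scatter each input sample into the output."""
    out = [0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            out[i * stride + k] += xi * wk
    return out


def full_conv1d(x, w):
    """Ordinary full convolution, the operation a convolution engine already provides."""
    out = [0] * (len(x) + len(w) - 1)
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            out[i + k] += xi * wk
    return out


def split_filter_deconv1d(x, w, stride):
    """Deconvolution as `stride` small convolutions whose outputs are interleaved."""
    out = [0] * ((len(x) - 1) * stride + len(w))
    for phase in range(stride):
        sub_filter = w[phase::stride]  # every stride-th tap, starting at `phase`
        if not sub_filter:
            continue
        for m, value in enumerate(full_conv1d(x, sub_filter)):
            out[m * stride + phase] = value
    return out
```

Output position `m * stride + phase` only ever receives contributions from filter taps `phase, phase + stride, ...`, so each phase is an independent standard convolution and no zero-inserted input has to be materialized.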

preprint · 2015 · arXiv

On Performance Debugging of Unnecessary Lock Contentions on Multicore Processors: A Replay-based Approach

Locks have been widely used as an effective synchronization mechanism among processes and threads. However, we observe that a large number of false inter-thread dependencies (i.e., unnecessary lock contentions) exist during the program execution on multicore processors, thereby incurring significant performance overhead. This paper presents a performance debugging framework, PERFPLAY, to facilitate a comprehensive and in-depth understanding of the performance impact of unnecessary lock contentions. The core technique of our debugging framework is trace replay. Specifically, PERFPLAY records the program execution trace, on the basis of which the unnecessary lock contentions can be identified through trace analysis. We then propose a novel technique of trace transformation to transform these identified unnecessary lock contentions in the original trace into the correct pattern as a new trace free of unnecessary lock contentions. Through replaying both traces, PERFPLAY can quantify the performance impact of unnecessary lock contentions. To demonstrate the effectiveness of our debugging framework, we study five real-world programs and PARSEC benchmarks. Our experimental results demonstrate the significant performance overhead of unnecessary lock contentions, and the effectiveness of PERFPLAY in identifying the performance critical unnecessary lock contentions in real applications.

preprint · 2014 · arXiv

A Taxonomy and Survey on eScience as a Service in the Cloud

Cloud computing has recently evolved as a popular computing infrastructure for many applications. Scientific computing, which was mainly hosted in private clusters and grids, has started to migrate development and deployment to the public cloud environment. eScience as a service becomes an emerging and promising direction for science computing. We review recent efforts in developing and deploying scientific computing applications in the cloud. In particular, we introduce a taxonomy specifically designed for scientific computing in the cloud, and further review the taxonomy with four major kinds of science applications, including life sciences, physics sciences, social and humanities sciences, and climate and earth sciences. Our major finding is that, despite existing efforts in developing cloud-based eScience, eScience still has a long way to go to fully unlock the power of cloud computing paradigm. Therefore, we present the challenges and opportunities in the future development of cloud-based eScience services, and call for collaborations and innovations from both the scientific and computer system communities to address those challenges.

preprint · 2014 · arXiv

Monetary Cost Optimizations for Hosting Workflow-as-a-Service in IaaS Clouds

Recently, we have witnessed workflows from science and other data-intensive applications emerging on Infrastructure-as-a-Service (IaaS) clouds, and many workflow service providers offering workflow as a service (WaaS). The major concern of WaaS providers is to minimize the monetary cost of executing workflows in the IaaS cloud. While there have been previous studies on this concern, most of them assume static task execution time and static pricing scheme, and have the QoS notion of satisfying a deterministic deadline. However, the cloud environment is dynamic, with performance dynamics caused by the interference from concurrent executions and price dynamics like spot prices offered by Amazon EC2. Therefore, we argue that WaaS providers should have the notion of offering probabilistic performance guarantees for individual workflows on IaaS clouds. We develop a probabilistic scheduling framework called Dyna to minimize the monetary cost while offering probabilistic deadline guarantees. The framework includes an A*-based instance configuration method for performance dynamics, and a hybrid instance configuration refinement for utilizing spot instances. Experimental results with three real-world scientific workflow applications on Amazon EC2 demonstrate (1) the accuracy of our framework on satisfying the probabilistic deadline guarantees required by the users; (2) the effectiveness of our framework on reducing monetary cost in comparison with the existing approaches.
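A probabilistic deadline guarantee of the kind this abstract describes can be read as "the workflow finishes by the deadline with at least a required probability". The sketch below estimates that probability from sampled execution times and picks the cheapest configuration that satisfies it. It uses a plain exhaustive scan as a stand-in, not Dyna's A*-based search, and the dictionary keys `cost` and `samples` are hypothetical.

```python
def meets_probabilistic_deadline(samples, deadline, required_probability):
    """Compare the empirical P(time <= deadline) against the required probability."""
    if not samples:
        raise ValueError("need at least one execution-time sample")
    hits = sum(1 for t in samples if t <= deadline)
    return hits / len(samples) >= required_probability


def cheapest_feasible(configs, deadline, required_probability):
    """Return the lowest-cost config meeting the guarantee, or None if none does."""
    feasible = [
        c for c in configs
        if meets_probabilistic_deadline(c["samples"], deadline, required_probability)
    ]
    return min(feasible, key=lambda c: c["cost"], default=None)
```

With a deterministic deadline the check would reduce to a single comparison. Using a distribution of times lets a cheaper configuration qualify even if it occasionally runs long.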

preprint · 2014 · arXiv

Rank-Aware Dynamic Migrations and Adaptive Demotions for DRAM Power Management

Modern DRAM architectures allow a number of low-power states on individual memory ranks for advanced power management. Many previous studies have taken advantage of demotions on low-power states for energy saving. However, most of the demotion schemes are statically performed on a limited number of pre-selected low-power states, and are suboptimal for different workloads and memory architectures. Even worse, the idle periods are often too short for effective power state transitions, especially for memory intensive applications. Wrong decisions on power state transition incur significant energy and delay penalties. In this paper, we propose a novel memory system design named RAMZzz with rank-aware energy saving optimizations including dynamic page migrations and adaptive demotions. Specifically, we group the pages with similar access locality into the same rank with dynamic page migrations. Ranks have their hotness: hot ranks are kept busy for high utilization and cold ranks can have more lengthy idle periods for power state transitions. We further develop adaptive state demotions by considering all low-power states for each rank and a prediction model to estimate the power-down timeout among states. We experimentally compare our algorithm with other energy saving policies with cycle-accurate simulation. Experiments with benchmark workloads show that RAMZzz achieves significant improvement on energy-delay² and energy consumption over other energy saving techniques.

preprint · 2013 · arXiv

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Graphics processors, or GPUs, have recently been widely used as accelerators in the shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership cost. Despite the recently improved runtime support for concurrent GPU kernel executions, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system with dynamic slicing and scheduling techniques to improve the throughput of concurrent kernel executions on the GPU. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices and to fully utilize the GPU resources. We develop a novel and effective Markov chain based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31.1% and 23.4% performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.
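The slicing step in this abstract amounts to cutting a kernel's grid of thread blocks into contiguous ranges that can be launched separately. The sketch below shows only that bookkeeping, plus a simple round-robin interleaving of slices from two kernels. The round-robin order is an assumption standing in for Kernelet's Markov chain based scheduling, and the function names are illustrative.

```python
from itertools import zip_longest


def slice_kernel(total_blocks, slice_size):
    """Split a grid of thread blocks into (block_offset, block_count) slices."""
    if total_blocks < 0 or slice_size <= 0:
        raise ValueError("total_blocks must be >= 0 and slice_size > 0")
    return [
        (offset, min(slice_size, total_blocks - offset))
        for offset in range(0, total_blocks, slice_size)
    ]


def interleave(slices_a, slices_b):
    """Round-robin launch order for slices of two kernels, tagged 'A' and 'B'."""
    order = []
    for sa, sb in zip_longest(slices_a, slices_b):
        if sa is not None:
            order.append(("A", sa))
        if sb is not None:
            order.append(("B", sb))
    return order
```

In a real runtime each slice would be launched as its own kernel, using the block offset to recover the original block index. A smaller `slice_size` gives the scheduler more freedom at the cost of more launches.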

preprint · 2013 · arXiv

Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

With the ease-of-programming, flexibility and yet efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed-computing, and has been extended to various architectures, e.g., multi-core CPUs, GPUs and FPGAs. In this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is the latest product released by Intel based on the Many Integrated Core Architecture. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. In our work, we utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization friendly technique for the map phase to assist the auto-vectorization as well as develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce to improve the resource utilization. We also eliminate multiple local arrays but use low cost atomic operations on the global array for some applications, which can improve the thread scalability and data locality due to the coherent L2 caches. Finally, for a given application, our framework can either automatically detect suitable techniques to apply or provide guidelines for users at compilation time. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). By evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi.

preprint · 2013 · arXiv

Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture

Query co-processing on graphics processors (GPUs) has become an effective means to improve the performance of main memory databases. However, the relatively low bandwidth and high latency of the PCI-e bus are usually bottleneck issues for co-processing. Recently, coupled CPU-GPU architectures have received a lot of attention, e.g. AMD APUs with the CPU and the GPU integrated into a single chip. That opens up new opportunities for optimizing query co-processing. In this paper, we experimentally revisit hash joins, one of the most important join algorithms for main memory databases, on a coupled CPU-GPU architecture. Particularly, we study the fine-grained co-processing mechanisms on hash joins with and without partitioning. The co-processing outlines an interesting design space. We extend existing cost models to automatically guide decisions on the design space. Our experimental results on a recent AMD APU show that (1) the coupled architecture enables fine-grained co-processing and cache reuses, which are inefficient on discrete CPU-GPU architectures; (2) the cost model can automatically guide the design and tuning knobs in the design space; (3) fine-grained co-processing achieves up to 53%, 35% and 28% performance improvement over CPU-only, GPU-only and conventional CPU-GPU co-processing, respectively. We believe that the insights and implications from this study are initial yet important for further research on query co-processing on coupled CPU-GPU architectures.
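For readers unfamiliar with the two variants this abstract compares, a minimal single-threaded sketch of a hash join with and without partitioning follows. It shows only the build, probe and partition phases; the CPU-GPU co-processing, cost model and tuning studied in the paper are not modeled. Rows are assumed to be `(key, payload)` pairs, and the function names are illustrative.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Simple hash join: build a table on one input, then probe it with the other."""
    table = defaultdict(list)
    for key, payload in build_rows:  # build phase
        table[key].append(payload)
    return [
        (key, build_payload, probe_payload)  # probe phase
        for key, probe_payload in probe_rows
        for build_payload in table.get(key, [])
    ]


def partitioned_hash_join(build_rows, probe_rows, num_partitions):
    """Partition both inputs by key hash, then join matching partitions."""
    def partition(rows):
        parts = [[] for _ in range(num_partitions)]
        for row in rows:
            parts[hash(row[0]) % num_partitions].append(row)
        return parts

    result = []
    for build_part, probe_part in zip(partition(build_rows), partition(probe_rows)):
        result.extend(hash_join(build_part, probe_part))
    return result
```

Partitioning keeps each per-partition hash table small, and each of the three phases (partition, build, probe) is a unit of work that could in principle be divided between processors.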

preprint 2011 arXiv

High-Throughput Transaction Executions on Graphics Processors

OLTP (On-Line Transaction Processing) is an important business system sector in various traditional and emerging online services. Due to the increasing number of users, OLTP systems require high throughput for executing tens of thousands of transactions in a short time period. Encouraged by the recent success of GPGPU (General-Purpose computation on Graphics Processors), we propose GPUTx, an OLTP engine performing high-throughput transaction executions on the GPU for in-memory databases. Unlike existing GPGPU studies, which usually optimize a single task, transaction execution requires handling many small tasks concurrently. Specifically, we propose the bulk execution model to group multiple transactions into a bulk and to execute the bulk on the GPU as a single task. The transactions within the bulk are executed concurrently on the GPU. We study three basic execution strategies (one with locks and the other two lock-free), and optimize them using GPU features including hardware support for atomic operations, massive thread parallelism and SPMD (Single Program Multiple Data) execution. We evaluate GPUTx on a recent NVIDIA GPU in comparison with its counterpart on a quad-core CPU. Our experimental results show that the optimizations in GPUTx significantly improve throughput, and the optimized GPUTx achieves 4-10 times higher throughput than its CPU-based counterpart on public transaction processing benchmarks.
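The idea of grouping transactions into a bulk and running the bulk as one task can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the names `make_bulks` and `execute_bulk` are hypothetical, transactions are reduced to `(account, delta)` pairs, and grouping by account is one simple way to keep conflicting updates ordered; it is not claimed to be any of the three strategies studied in the paper.

```python
from collections import defaultdict
from itertools import islice


def make_bulks(transactions, bulk_size):
    """Yield consecutive groups of at most bulk_size transactions."""
    it = iter(transactions)
    while True:
        bulk = list(islice(it, bulk_size))
        if not bulk:
            return
        yield bulk


def execute_bulk(balances, bulk):
    """Apply one bulk of (account, delta) transactions as a single task."""
    by_account = defaultdict(list)
    for account, delta in bulk:
        by_account[account].append(delta)
    # Groups for different accounts touch disjoint records, so on a
    # parallel device each group could be handled by a separate thread.
    # Within a group, updates are applied in arrival order.
    for account, deltas in by_account.items():
        for delta in deltas:
            balances[account] = balances.get(account, 0) + delta


balances = {"a": 100}
pending = [("a", -30), ("b", 50), ("a", 10), ("b", -20), ("c", 5)]
for bulk in make_bulks(pending, bulk_size=2):
    execute_bulk(balances, bulk)
```

After the loop, `balances` holds `{"a": 80, "b": 30, "c": 5}`. A larger bulk exposes more independent groups per task, which is what makes the model a fit for hardware with many threads.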

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arxiv, confidence 95%

external id: arxiv:2605.13536:author:5:bingsheng-he

Imported May 20, 2026. Synced May 21, 2026.

arxiv, confidence 95%

external id: arxiv:2605.13501:author:4:bingsheng-he

Imported May 20, 2026. Synced May 21, 2026.

arxiv, confidence 95%

external id: arxiv:2605.00670:author:5:bingsheng-he

Imported May 20, 2026. Synced May 20, 2026.

arxiv, confidence 95%

external id: arxiv:2605.15226:author:4:bingsheng-he

Imported May 20, 2026. Synced May 20, 2026.

4 works

Hongshi Tan

Researcher

Hongshi Tan contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

Qingyun Zou

Researcher

Qingyun Zou contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

Weng-Fai Wong

Researcher

Weng-Fai Wong contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

Yao Chen

Researcher

Yao Chen contributes to research discovery and scholarly infrastructure.

Open to collaborate

23 published items

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

RidgeWalker: Perfectly Pipelined Graph Random Walks on FPGAs

Robust Multimodal Recommendation via Graph Retrieval-Enhanced Modality Completion

A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

An In-Depth Study of Continuous Subgraph Matching (Complete Version)

Indefinite linearized augmented Lagrangian method for convex programming with linear inequality constraints

On construction of splitting contraction algorithms in a prediction-correction framework for separable convex optimization

Practical Vertical Federated Learning with Unsupervised Representation Learning

ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines

The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems

The Serverless Computing Survey: A Technical Primer for Design Architecture

TransMask: A Compact and Fast Speech Separation Model Based on Transformer

Accelerating Generative Neural Networks on Unmodified Deep Learning Processors -- A Software Approach

On Performance Debugging of Unnecessary Lock Contentions on Multicore Processors: A Replay-based Approach

A Taxonomy and Survey on eScience as a Service in the Cloud

Monetary Cost Optimizations for Hosting Workflow-as-a-Service in IaaS Clouds

Rank-Aware Dynamic Migrations and Adaptive Demotions for DRAM Power Management

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture

High-Throughput Transaction Executions on Graphics Processors