Researcher profile

Zhibin Wang

Zhibin Wang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2026arXiv

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker's executable plan, which serves as the actionable outcome of the Thinker's reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.

preprint2023arXiv

Unicron: Economizing Self-Healing LLM Training at Scale

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.

preprint2022arXiv

FAKD: Feature Augmented Knowledge Distillation for Semantic Segmentation

In this work, we explore data augmentations for knowledge distillation on semantic segmentation. To avoid over-fitting to the noise in the teacher network, a large number of training examples is essential for knowledge distillation. Imagelevel argumentation techniques like flipping, translation or rotation are widely used in previous knowledge distillation framework. Inspired by the recent progress on semantic directions on feature-space, we propose to include augmentations in feature space for efficient distillation. Specifically, given a semantic direction, an infinite number of augmentations can be obtained for the student in the feature space. Furthermore, the analysis shows that those augmentations can be optimized simultaneously by minimizing an upper bound for the losses defined by augmentations. Based on the observation, a new algorithm is developed for knowledge distillation in semantic segmentation. Extensive experiments on four semantic segmentation benchmarks demonstrate that the proposed method can boost the performance of current knowledge distillation methods without any significant overhead. Code is available at: https://github.com/jianlong-yuan/FAKD.

preprint2022arXiv

Interference Management for Over-the-Air Federated Learning in Multi-Cell Wireless Networks

Federated learning (FL) over resource-constrained wireless networks has recently attracted much attention. However, most existing studies consider one FL task in single-cell wireless networks and ignore the impact of downlink/uplink inter-cell interference on the learning performance. In this paper, we investigate FL over a multi-cell wireless network, where each cell performs a different FL task and over-the-air computation (AirComp) is adopted to enable fast uplink gradient aggregation. We conduct convergence analysis of AirComp-assisted FL systems, taking into account the inter-cell interference in both the downlink and uplink model/gradient transmissions, which reveals that the distorted model/gradient exchanges induce a gap to hinder the convergence of FL. We characterize the Pareto boundary of the error-induced gap region to quantify the learning performance trade-off among different FL tasks, based on which we formulate an optimization problem to minimize the sum of error-induced gaps in all cells. To tackle the coupling between the downlink and uplink transmissions as well as the coupling among multiple cells, we propose a cooperative multi-cell FL optimization framework to achieve efficient interference management for downlink and uplink transmission design. Results demonstrate that our proposed algorithm achieves much better average learning performance over multiple cells than non-cooperative baseline schemes.

preprint2022arXiv

Point RCNN: An Angle-Free Framework for Rotated Object Detection

Rotated object detection in aerial images is still challenging due to arbitrary orientations, large scale and aspect ratio variations, and extreme density of objects. Existing state-of-the-art rotated object detection methods mainly rely on angle-based detectors. However, angle regression can easily suffer from the long-standing boundary problem. To tackle this problem, we propose a purely angle-free framework for rotated object detection, called Point RCNN, which mainly consists of PointRPN and PointReg. In particular, PointRPN generates accurate rotated RoIs (RRoIs) by converting the learned representative points with a coarse-to-fine manner, which is motivated by RepPoints. Based on the learned RRoIs, PointReg performs corner points refinement for more accurate detection. In addition, aerial images are often severely unbalanced in categories, and existing methods almost ignore this issue. In this paper, we also experimentally verify that re-sampling the images of the rare categories will stabilize training and further improve the detection performance. Experiments demonstrate that our Point RCNN achieves the new state-of-the-art detection performance on commonly used aerial datasets, including DOTA-v1.0, DOTA-v1.5, and HRSC2016.

preprint2022arXiv

Poseur: Direct Human Pose Regression with Transformers

We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.

preprint2022arXiv

Semantic Data Augmentation based Distance Metric Learning for Domain Generalization

Domain generalization (DG) aims to learn a model on one or more different but related source domains that could be generalized into an unseen target domain. Existing DG methods try to prompt the diversity of source domains for the model's generalization ability, while they may have to introduce auxiliary networks or striking computational costs. On the contrary, this work applies the implicit semantic augmentation in feature space to capture the diversity of source domains. Concretely, an additional loss function of distance metric learning (DML) is included to optimize the local geometry of data distribution. Besides, the logits from cross entropy loss with infinite augmentations is adopted as input features for the DML loss in lieu of the deep features. We also provide a theoretical analysis to show that the logits can approximate the distances defined on original features well. Further, we provide an in-depth analysis of the mechanism and rational behind our approach, which gives us a better understanding of why leverage logits in lieu of features can help domain generalization. The proposed DML loss with the implicit augmentation is incorporated into a recent DG method, that is, Fourier Augmented Co-Teacher framework (FACT). Meanwhile, our method also can be easily plugged into various DG methods. Extensive experiments on three benchmarks (Digits-DG, PACS and Office-Home) have demonstrated that the proposed method is able to achieve the state-of-the-art performance.

preprint2021arXiv

Object Detection Made Simpler by Eliminating Heuristic NMS

We show a simple NMS-free, end-to-end object detection framework, of which the network is a minimal modification to a one-stage object detector such as the FCOS detection model [Tian et al. 2019]. We attain on par or even improved detection accuracy compared with the original one-stage detector. It performs detection at almost the same inference speed, while being even simpler in that now the post-processing NMS (non-maximum suppression) is eliminated during inference. If the network is capable of identifying only one positive sample for prediction for each ground-truth object instance in an image, then NMS would become unnecessary. This is made possible by attaching a compact PSS head for automatic selection of the single positive sample for each instance (see Fig. 1). As the learning objective involves both one-to-many and one-to-one label assignments, there is a conflict in the labels of some training examples, making the learning challenging. We show that by employing a stop-gradient operation, we can successfully tackle this issue and train the detector. On the COCO dataset, our simple design achieves superior performance compared to both the FCOS baseline detector with NMS post-processing and the recent end-to-end NMS-free detectors. Our extensive ablation studies justify the rationale of the design choices.

preprint2020arXiv

Federated Learning via Intelligent Reflecting Surface

Over-the-air computation (AirComp) based federated learning (FL) is capable of achieving fast model aggregation by exploiting the waveform superposition property of multiple access channels. However, the model aggregation performance is severely limited by the unfavorable wireless propagation channels. In this paper, we propose to leverage intelligent reflecting surface (IRS) to achieve fast yet reliable model aggregation for AirComp-based FL. To optimize the learning performance, we formulate an optimization problem that jointly optimizes the device selection, the aggregation beamformer at the base station (BS), and the phase shifts at the IRS to maximize the number of devices participating in the model aggregation of each communication round under certain mean-squared-error (MSE) requirements. To tackle the formulated highly-intractable problem, we propose a two-step optimization framework. Specifically, we induce the sparsity of device selection in the first step, followed by solving a series of MSE minimization problems to find the maximum feasible device set in the second step. We then propose an alternating optimization framework, supported by the difference-of-convex-functions programming algorithm for low-rank optimization, to efficiently design the aggregation beamformers at the BS and phase shifts at the IRS. Simulation results will demonstrate that our proposed algorithm and the deployment of an IRS can achieve a lower training loss and higher FL prediction accuracy than the baseline algorithms.