Researcher profile

Hong-Han Shuai

Hong-Han Shuai contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
8works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

8 published item(s)

preprint2026arXiv

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

World Action Models (WAMs) enable decision-making through imagined rollouts by predicting future observations and actions. However, the reliability of these imagined futures remains under-examined: is a generated future merely visually plausible, or is it dynamically compatible with the action sequence it claims to model? In this work, we identify action-state consistency, the alignment between predicted actions and induced state transitions, as a missing reliability axis for WAMs. Through a systematic study across representative joint-prediction and inverse-dynamics models, we find that action-state consistency systematically separates successful and failed rollouts across many tasks and follows similar success-failure trends as learned value estimates. These results suggest that consistency captures decision-relevant structure beyond visual realism. We further identify background collapse as an important boundary condition, where low-dynamics failed trajectories can become deceptively consistent because static futures are easier to predict. Building on these findings, we introduce a value-free consensus strategy for test-time selection, which ranks candidate rollouts by agreement among predicted futures. This strategy improves success rates on RoboCasa and RoboTwin 2.0 without additional training or reward modeling. Taken together, our findings establish action-state consistency as both a diagnostic tool for evaluating WAM reliability and a practical signal for value-free planning.

preprint2022arXiv

Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness

A notable challenge in Multi-Document Summarization (MDS) is the extremely-long length of the input. In this paper, we present an extract-then-abstract Transformer framework to overcome the problem. Specifically, we leverage pre-trained language models to construct a hierarchical extractor for salient sentence selection across documents and an abstractor for rewriting the selected contents as summaries. However, learning such a framework is challenging since the optimal contents for the abstractor are generally unknown. Previous works typically create pseudo extraction oracle to enable the supervised learning for both the extractor and the abstractor. Nevertheless, we argue that the performance of such methods could be restricted due to the insufficient information for prediction and inconsistent objectives between training and testing. To this end, we propose a loss weighting mechanism that makes the model aware of the unequal importance for the sentences not in the pseudo extraction oracle, and leverage the fine-tuned abstractor to generate summary references as auxiliary signals for learning the extractor. Moreover, we propose a reinforcement learning method that can efficiently apply to the extractor for harmonizing the optimization between training and testing. Experiment results show that our framework substantially outperforms strong baselines with comparable model sizes and achieves the best results on the Multi-News, Multi-XScience, and WikiCatSum corpora.

preprint2022arXiv

Vision Transformers: State of the Art and Research Challenges

Transformers have achieved great success in natural language processing. Due to the powerful capability of self-attention mechanism in transformers, researchers develop the vision transformers for a variety of computer vision tasks, such as image recognition, object detection, image segmentation, pose estimation, and 3D reconstruction. This paper presents a comprehensive overview of the literature on different architecture designs and training tricks (including self-supervised learning) for vision transformers. Our goal is to provide a systematic review with the open research opportunities.

preprint2021arXiv

Spatiotemporal Dilated Convolution with Uncertain Matching for Video-based Crowd Estimation

In this paper, we propose a novel SpatioTemporal convolutional Dense Network (STDNet) to address the video-based crowd counting problem, which contains the decomposition of 3D convolution and the 3D spatiotemporal dilated dense convolution to alleviate the rapid growth of the model size caused by the Conv3D layer. Moreover, since the dilated convolution extracts the multiscale features, we combine the dilated convolution with the channel attention block to enhance the feature representations. Due to the error that occurs from the difficulty of labeling crowds, especially for videos, imprecise or standard-inconsistent labels may lead to poor convergence for the model. To address this issue, we further propose a new patch-wise regression loss (PRL) to improve the original pixel-wise loss. Experimental results on three video-based benchmarks, i.e., the UCSD, Mall and WorldExpo'10 datasets, show that STDNet outperforms both image- and video-based state-of-the-art methods. The source codes are released at \url{https://github.com/STDNet/STDNet}.

preprint2021arXiv

Template-Free Try-on Image Synthesis via Semantic-guided Optimization

The virtual try-on task is so attractive that it has drawn considerable attention in the field of computer vision. However, presenting the three-dimensional (3D) physical characteristic (e.g., pleat and shadow) based on a 2D image is very challenging. Although there have been several previous studies on 2D-based virtual try-on work, most 1) required user-specified target poses that are not user-friendly and may not be the best for the target clothing, and 2) failed to address some problematic cases, including facial details, clothing wrinkles and body occlusions. To address these two challenges, in this paper, we propose an innovative template-free try-on image synthesis (TF-TIS) network. The TF-TIS first synthesizes the target pose according to the user-specified in-shop clothing. Afterward, given an in-shop clothing image, a user image, and a synthesized pose, we propose a novel model for synthesizing a human try-on image with the target clothing in the best fitting pose. The qualitative and quantitative experiments both indicate that the proposed TF-TIS outperforms the state-of-the-art methods, especially for difficult cases.

preprint2020arXiv

Attractive or Faithful? Popularity-Reinforced Learning for Inspired Headline Generation

With the rapid proliferation of online media sources and published news, headlines have become increasingly important for attracting readers to news articles, since users may be overwhelmed with the massive information. In this paper, we generate inspired headlines that preserve the nature of news articles and catch the eye of the reader simultaneously. The task of inspired headline generation can be viewed as a specific form of Headline Generation (HG) task, with the emphasis on creating an attractive headline from a given news article. To generate inspired headlines, we propose a novel framework called POpularity-Reinforced Learning for inspired Headline Generation (PORL-HG). PORL-HG exploits the extractive-abstractive architecture with 1) Popular Topic Attention (PTA) for guiding the extractor to select the attractive sentence from the article and 2) a popularity predictor for guiding the abstractor to rewrite the attractive sentence. Moreover, since the sentence selection of the extractor is not differentiable, techniques of reinforcement learning (RL) are utilized to bridge the gap with rewards obtained from a popularity score predictor. Through quantitative and qualitative experiments, we show that the proposed PORL-HG significantly outperforms the state-of-the-art headline generation models in terms of attractiveness evaluated by both human (71.03%) and the predictor (at least 27.60%), while the faithfulness of PORL-HG is also comparable to the state-of-the-art generation model.

preprint2020arXiv

MER-GCN: Micro Expression Recognition Based on Relation Modeling with Graph Convolutional Network

Micro-Expression (ME) is the spontaneous, involuntary movement of a face that can reveal the true feeling. Recently, increasing researches have paid attention to this field combing deep learning techniques. Action units (AUs) are the fundamental actions reflecting the facial muscle movements and AU detection has been adopted by many researches to classify facial expressions. However, the time-consuming annotation process makes it difficult to correlate the combinations of AUs to specific emotion classes. Inspired by the nodes relationship building Graph Convolutional Networks (GCN), we propose an end-to-end AU-oriented graph classification network, namely MER-GCN, which uses 3D ConvNets to extract AU features and applies GCN layers to discover the dependency laying between AU nodes for ME categorization. To our best knowledge, this work is the first end-to-end architecture for Micro-Expression Recognition (MER) using AUs based GCN. The experimental results show that our approach outperforms CNN-based MER networks.

preprint2020arXiv

Optimizing Item and Subgroup Configurations for Social-Aware VR Shopping

Shopping in VR malls has been regarded as a paradigm shift for E-commerce, but most of the conventional VR shopping platforms are designed for a single user. In this paper, we envisage a scenario of VR group shopping, which brings major advantages over conventional group shopping in brick-and-mortar stores and Web shopping: 1) configure flexible display of items and partitioning of subgroups to address individual interests in the group, and 2) support social interactions in the subgroups to boost sales. Accordingly, we formulate the Social-aware VR Group-Item Configuration (SVGIC) problem to configure a set of displayed items for flexibly partitioned subgroups of users in VR group shopping. We prove SVGIC is NP-hard to approximate within $\frac{32}{31} - ε$. We design an approximation algorithm based on the idea of Co-display Subgroup Formation (CSF) to configure proper items for display to different subgroups of friends. Experimental results on real VR datasets and a user study with hTC VIVE manifest that our algorithms outperform baseline approaches by at least 30.1% of solution quality.