Researcher profile

Bo He

Bo He contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
8 works
0 followers
5 topics
4 close collaborators

Published work

8 published items

preprint | 2026 | arXiv

VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
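The temporal grounding figure quoted above (R1@0.7) is conventionally read as the share of queries whose top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.7. The following is a minimal sketch of that convention, not the paper's evaluation code; the function names `temporal_iou` and `recall_at_1` and the `(start, end)` tuple layout are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold=0.7):
    """Fraction of queries whose top-1 segment reaches the IoU threshold.

    Hypothetical helper: top1_preds and gts are equal-length lists of
    (start, end) tuples, one pair per query.
    """
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Toy data: the first prediction matches exactly, the second covers only 40%.
score = recall_at_1([(0.0, 10.0), (0.0, 4.0)], [(0.0, 10.0), (0.0, 10.0)])
```

Here `score` is the fraction of toy queries counted as correct at the 0.7 threshold; a benchmark score would average the same check over the full test set.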

preprint | 2022 | arXiv

ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization

Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e., untrimmed videos). However, this formulation typically treats snippets in a video as independent instances, ignoring the underlying temporal structures within and across action segments. To address this problem, we propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction. Furthermore, a multi-step refinement strategy is proposed to progressively improve action proposals along the model training process. Extensive experiments on THUMOS-14 and ActivityNet-v1.3 demonstrate the effectiveness of our approach, establishing a new state of the art on both datasets. The code and models are publicly available at https://github.com/boheumd/ASM-Loc.
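The MIL baseline that this abstract contrasts itself with can be illustrated with a small sketch. It assumes top-k mean pooling, a common way to turn snippet scores into a video-level prediction in MIL-based pipelines; the abstract does not say which pooling the authors use, and the names `video_level_scores` and `video_loss` are hypothetical.

```python
import math


def video_level_scores(snippet_scores, k):
    """Pool per-snippet class scores (T lists of C floats) into C video scores.

    Assumption: each class is pooled by averaging its k highest snippet scores.
    """
    num_classes = len(snippet_scores[0])
    pooled = []
    for c in range(num_classes):
        top = sorted((s[c] for s in snippet_scores), reverse=True)[:k]
        pooled.append(sum(top) / len(top))
    return pooled


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def video_loss(snippet_scores, label, k):
    """Cross-entropy on the pooled scores: only the video-level label is used."""
    probs = softmax(video_level_scores(snippet_scores, k))
    return -math.log(probs[label])


# Three snippets, two classes; the video (the "bag") carries the label 0.
scores = [[1.0, 0.0], [3.0, 0.0], [2.0, 4.0]]
pooled = video_level_scores(scores, k=2)
loss = video_loss(scores, label=0, k=2)
```

Each snippet contributes to the loss only through the pooled score, which is the independence assumption the abstract criticizes: nothing in this objective ties neighbouring snippets of the same action segment together.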

preprint | 2022 | arXiv

ColdGuess: A General and Effective Relational Graph Convolutional Network to Tackle Cold Start Cases

Low-quality listings and bad-actor behavior on online retail websites threaten e-commerce business, as they result in a sub-optimal buying experience and erode customer trust. When a new listing is created, how can we tell whether it is of good quality? Is the method effective, fast, and scalable? Previous approaches often have three limitations: (1) they cannot handle cold start problems, where new sellers or listings lack sufficient selling histories; (2) they cannot score hundreds of millions of listings at scale, or they compromise performance for scalability; (3) they face space challenges from the large-scale graph of a giant e-commerce business. To overcome these limitations, we propose ColdGuess, an inductive graph-based risk predictor built upon a heterogeneous seller-product graph, which effectively identifies risky sellers, products, and listings at scale. ColdGuess tackles the large-scale graph by consolidating nodes and addresses the cold start problems using homogeneous influence. The evaluation on real data demonstrates that ColdGuess has stable performance as the number of unknown features increases. It outperforms LightGBM by up to 34 pcp ROC-AUC in a cold start case when a new seller sells a new product. The resulting system, ColdGuess, is effective, adaptable to changing risky seller behavior, and is already in production.

preprint | 2022 | arXiv

High flexoelectric constants in Janus transition-metal dichalcogenides

Due to their combination of mechanical stiffness and flexibility, two-dimensional (2D) materials have received significant interest as potential electromechanical materials. Flexoelectricity is an electromechanical coupling between strain gradient and polarization. Unlike piezoelectricity, which exists only in non-centrosymmetric materials, flexoelectricity theoretically exists in all dielectric materials. However, most work on the electromechanical energy conversion potential of 2D materials has focused on their piezoelectric rather than flexoelectric behavior and properties. In the present work, we demonstrate that the intrinsic structural asymmetry present in monolayer Janus transition metal dichalcogenides (TMDCs) enables significant flexoelectric properties. We report these flexoelectric properties using a recently developed charge-dipole model that couples with classical molecular dynamics simulations. By employing a prescribed bending deformation, we directly calculate the flexoelectric constants while eliminating the piezoelectric contribution to the polarization. We find that the flexoelectric response of a Janus TMDC is positively correlated to its initial degree of asymmetry, which contributes to stronger $σ-σ$ interactions as the initial degree of asymmetry rises. In addition, the high transfer of charge across atoms in Janus TMDCs leads to larger electric fields due to $π-σ$ coupling. These enhanced $σ-σ$ and $π-σ$ interactions are found to cause the flexoelectric coefficients of the Janus TMDCs to be several times higher than those of traditional TMDCs such as MoS$_{2}$, whose flexoelectric constant is already ten times larger than that of graphene.
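For orientation, the distinction between the piezoelectric and flexoelectric contributions mentioned above is usually written in the following textbook continuum form. The symbols are assumptions introduced here and do not appear in the abstract: $P_i$ is the polarization, $\varepsilon_{jk}$ the strain, $e_{ijk}$ the piezoelectric tensor, and $\mu_{ijkl}$ the flexoelectric tensor.

```latex
$$
P_i = e_{ijk}\,\varepsilon_{jk} + \mu_{ijkl}\,\frac{\partial \varepsilon_{jk}}{\partial x_l}
$$
```

The first term responds to strain and vanishes in centrosymmetric materials, while the second responds to the strain gradient and is allowed in any dielectric. A deformation that removes the first term, as the prescribed bending is reported to do, leaves the polarization governed by $\mu_{ijkl}$ alone.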

preprint | 2022 | arXiv

Intrinsic bending flexoelectric constants in two-dimensional materials

Flexoelectricity is a form of electromechanical coupling that has recently emerged because, unlike piezoelectricity, it is theoretically possible in any dielectric material. Two-dimensional (2D) materials have also garnered significant interest because of their unusual electromechanical properties and high flexibility, but the intrinsic flexoelectric properties of these materials remain unresolved. In this work, using atomistic modeling accounting for charge-dipole interactions, we report the intrinsic flexoelectric constants for a range of two-dimensional materials, including graphene allotropes, nitrides, graphene analogs of group-IV elements, and the transition metal dichalcogenides (TMDCs). We accomplish this through a proposed mechanical bending scheme that eliminates the piezoelectric contribution to the total polarization, which enables us to directly measure the flexoelectric constants. While flat 2D materials like graphene have low flexoelectric constants due to weak $π-σ$ interactions, buckling is found to increase the flexoelectric constants in monolayer group-IV elements. Finally, due to significantly enhanced charge transfer coupled with structural asymmetry due to bending, the TMDCs are found to have the largest flexoelectric constants, with MoS$_{2}$ having a flexoelectric constant ten times larger than that of graphene.

preprint | 2022 | arXiv

Learning Semantic Correspondence with Sparse Annotations

Finding dense semantic correspondence is a fundamental problem in computer vision, which remains challenging in complex scenes due to background clutter, extreme intra-class variation, and a severe lack of ground truth. In this paper, we aim to address the challenge of label sparsity in semantic correspondence by enriching supervision signals from sparse keypoint annotations. To this end, we first propose a teacher-student learning paradigm for generating dense pseudo-labels and then develop two novel strategies for denoising pseudo-labels. In particular, we use spatial priors around the sparse annotations to suppress the noisy pseudo-labels. In addition, we introduce a loss-driven dynamic label selection strategy for label denoising. We instantiate our paradigm with two variants of learning strategies: a single offline teacher setting and a mutual online teachers setting. Our approach achieves notable improvements on three challenging benchmarks for semantic correspondence and establishes a new state of the art. Project page: https://shuaiyihuang.github.io/publications/SCorrSAN.

preprint | 2021 | arXiv

GTA: Global Temporal Attention for Video Action Understanding

Self-attention learns pairwise interactions to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. We first demonstrate that the entangled modeling of spatio-temporal information by flattening all pixels is sub-optimal, failing to capture temporal relationships among frames explicitly. To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity. Unlike conventional self-attention that computes an instance-specific attention matrix, GTA directly learns a global attention matrix that is intended to encode temporal structures that generalize across different samples. We further extend GTA in a cross-channel multi-head fashion to exploit channel interactions for better temporal modeling. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
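The contrast drawn above between an instance-specific attention matrix and a single learned global one can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the T x T matrix is stored as logits and row-normalized with a softmax, each frame is reduced to one feature vector, and the names `global_temporal_attention` and `shared_logits` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_temporal_attention(frames, attn_logits):
    """Mix T frame features (each a list of D floats) with a shared T x T matrix.

    The weights come from attn_logits alone, never from the frame content,
    so every clip is mixed with the same temporal pattern.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    weights = [softmax(row) for row in attn_logits]
    return [
        [sum(weights[t][s] * frames[s][d] for s in range(num_frames)) for d in range(dim)]
        for t in range(num_frames)
    ]


# Hypothetical parameters for 2-frame clips; zero logits give uniform mixing.
shared_logits = [[0.0, 0.0], [0.0, 0.0]]
clip_a = [[0.0], [2.0]]
clip_b = [[4.0], [8.0]]
out_a = global_temporal_attention(clip_a, shared_logits)
out_b = global_temporal_attention(clip_b, shared_logits)
```

In conventional self-attention the weights would be recomputed from queries and keys of each clip; here `clip_a` and `clip_b` are mixed with identical weights because the matrix is a parameter, which in a trained model would be learned by gradient descent.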

preprint | 2020 | arXiv

Deep Interactive Reinforcement Learning for Path Following of Autonomous Underwater Vehicle

Autonomous underwater vehicles (AUVs) play an increasingly important role in ocean exploration. Existing AUVs are usually not fully autonomous and are generally limited to pre-planned or pre-programmed tasks. Reinforcement learning (RL) and deep reinforcement learning have been introduced into AUV design and research to improve autonomy. However, these methods are still difficult to apply directly to an actual AUV system because of sparse rewards and low learning efficiency. In this paper, we propose a deep interactive reinforcement learning method for AUV path following that combines the advantages of deep reinforcement learning and interactive RL. In addition, since the human trainer cannot provide human rewards for the AUV when it is running in the ocean and the AUV needs to adapt to a changing environment, we further propose a deep reinforcement learning method that learns from both human rewards and environmental rewards at the same time. We test our methods on two path following tasks, straight line following and sinusoidal curve following, by simulating the AUV in the Gazebo platform. Our experimental results show that with our proposed deep interactive RL method, the AUV can converge faster than a DQN learner that learns only from environmental reward. Moreover, an AUV learning with our deep RL method from both human and environmental rewards can achieve similar or even better performance than one using the deep interactive RL method, and it can adapt to the actual environment by further learning from environmental rewards.