Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
16works
0followers
20topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

16 published item(s)

preprint2026arXiv

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.

preprint2024arXiv

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

Large-scale text-to-image diffusion models achieve unprecedented success in image generation and editing. However, how to extend such success to video editing is unclear. Recent initial attempts at video editing require significant text-to-video data and computation resources for training, which is often not accessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video. At the core of our method is a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code is made available at \url{https://github.com/baaivision/vid2vid-zero}.

preprint2022arXiv

Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers

Detection transformers have recently shown promising object detection results and attracted increasing attention. However, how to develop effective domain adaptation techniques to improve its cross-domain performance remains unexplored and unclear. In this paper, we delve into this topic and empirically find that direct feature distribution alignment on the CNN backbone only brings limited improvements, as it does not guarantee domain-invariant sequence features in the transformer for prediction. To address this issue, we propose a novel Sequence Feature Alignment (SFA) method that is specially designed for the adaptation of detection transformers. Technically, SFA consists of a domain query-based feature alignment (DQFA) module and a token-wise feature alignment (TDA) module. In DQFA, a novel domain query is used to aggregate and align global context from the token sequence of both domains. DQFA reduces the domain discrepancy in global feature representations and object relations when deploying in the transformer encoder and decoder, respectively. Meanwhile, TDA aligns token features in the sequence from both domains, which reduces the domain gaps in local and instance-level feature representations in the transformer encoder and decoder, respectively. Besides, a novel bipartite matching consistency loss is proposed to enhance the feature discriminability for robust object detection. Experiments on three challenging benchmarks show that SFA outperforms state-of-the-art domain adaptive object detection methods. Code has been made available at: https://github.com/encounter1997/SFA.

preprint2022arXiv

Safeguarding NOMA Networks via Reconfigurable Dual-Functional Surface under Imperfect CSI

This paper investigates the use of the reconfigurable dual-functional surface to guarantee the full-space secure transmission in non-orthogonal multiple access (NOMA) networks. In the presence of eavesdroppers, the downlink communication from the base station to the legitimate users is safeguarded by the simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS), where three practical operating protocols, namely energy splitting (ES), mode selection (MS), and time splitting (TS), are studied. The joint optimization of power allocation, active and passive beamforming is investigated to maximize the secrecy energy efficiency (SEE), taking into account the imperfect channel state information (CSI) of all channels. For ES, by approximating the semi-infinite constraints with the S-procedure and general sign-definiteness, the problem is solved by an alternating optimization framework. Besides, the proposed algorithm is extended to the MS protocol by solving a mixed-integer non-convex problem. While for TS, a two-layer iterative method is proposed. Simulation results show that: 1) The proposed STAR-RIS assisted NOMA networks are able to provide up to 33.6\% higher SEE than conventional RIS counterparts; 2) TS and ES protocols are generally preferable for low and high power domain, respectively; 3) The accuracy of CSI estimation and the bit resolution power consumption are crucial to reap the SEE benefits offered by STAR-RIS.

preprint2022arXiv

Supervised Homogeneity Fusion: a Combinatorial Approach

Fusing regression coefficients into homogenous groups can unveil those coefficients that share a common value within each group. Such groupwise homogeneity reduces the intrinsic dimension of the parameter space and unleashes sharper statistical accuracy. We propose and investigate a new combinatorial grouping approach called $L_0$-Fusion that is amenable to mixed integer optimization (MIO). On the statistical aspect, we identify a fundamental quantity called grouping sensitivity that underpins the difficulty of recovering the true groups. We show that $L_0$-Fusion achieves grouping consistency under the weakest possible requirement of the grouping sensitivity: if this requirement is violated, then the minimax risk of group misspecification will fail to converge to zero. Moreover, we show that in the high-dimensional regime, one can apply $L_0$-Fusion coupled with a sure screening set of features without any essential loss of statistical efficiency, while reducing the computational cost substantially. On the algorithmic aspect, we provide a MIO formulation for $L_0$-Fusion along with a warm start strategy. Simulation and real data analysis demonstrate that $L_0$-Fusion exhibits superiority over its competitors in terms of grouping accuracy.

preprint2022arXiv

TGRMPT: A Head-Shoulder Aided Multi-Person Tracker and a New Large-Scale Dataset for Tour-Guide Robot

A service robot serving safely and politely needs to track the surrounding people robustly, especially for Tour-Guide Robot (TGR). However, existing multi-object tracking (MOT) or multi-person tracking (MPT) methods are not applicable to TGR for the following reasons: 1. lacking relevant large-scale datasets; 2. lacking applicable metrics to evaluate trackers. In this work, we target the visual perceptual tasks for TGR and present the TGRDB dataset, a novel large-scale multi-person tracking dataset containing roughly 5.6 hours of annotated videos and over 450 long-term trajectories. Besides, we propose a more applicable metric to evaluate trackers using our dataset. As part of our work, we present TGRMPT, a novel MPT system that incorporates information from head shoulder and whole body, and achieves state-of-the-art performance. We have released our codes and dataset in https://github.com/wenwenzju/TGRMPT.

preprint2022arXiv

The coupling of an EUV coronal wave and ion acceleration in a Fermi-LAT behind-the-limb solar flare

We present the Fermi-LAT observations of the behind-the-limb (BTL) flare of July 17, 2021 and the joint detection of this flare by STIX onboard Solar Orbiter. The separation between Earth and the Solar Orbiter was 99.2$^{\circ}$ at 05:00 UT, allowing STIX to have a front view of the flare. The location of the flare was ~S20E140 in Stonyhurst heliographic coordinates making this the most distant behind-the-limb flare ever detected in $>$100 MeV gamma-rays. The LAT detection lasted for $\sim$16 minutes, the peak flux was $ 3.6 \pm 0.8 $ (10$^{-5}$) ph cm$^{-2}$ s$^{-1}$ with a significance $>$15$σ$. A coronal wave was observed from both STEREO-A and SDO in extreme ultraviolet (EUV) with an onset on the visible disk in coincidence with the LAT onset. A complex type II radio burst was observed by GLOSS also in coincidence with the onset of the LAT emission indicating the presence of a shock wave. We discuss the relation between the time derivative of the EUV wave intensity profile at 193\angstrom\ as observed by STEREO-A and the LAT flux to show that the appearance of the coronal wave at the visible disk and the acceleration of protons as traced by the observed $>$100 MeV gamma-ray emission are coupled. We also report how this coupling is present in the data from 3 other BTL flares detected by Fermi-LAT suggesting that the protons driving the gamma-ray emission of BTL solar flares and the coronal wave share a common origin.

preprint2022arXiv

Towards Data-Efficient Detection Transformers

Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show most of them suffer from significant performance drops on small-size datasets, like Cityscapes. In other words, the detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply alternating how key and value sequences are constructed in the cross-attention layer, with minimum modifications to the original models. Besides, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improve their performance on both small-size and sample-rich datasets. Code will be made publicly available at \url{https://github.com/encounter1997/DE-DETRs}.

preprint2021arXiv

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples with numerous high-stake applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised, heterogeneous models (i.e., different algorithms with varying hyperparameters) for further combination and analysis, rather than relying on a single model. How to accelerate the training and scoring on new-coming samples by outlyingness (referred as prediction throughout the paper) with a large number of unsupervised, heterogeneous OD models? In this study, we propose a modular acceleration system, called SUOD, to address it. The proposed system focuses on three complementary acceleration aspects (data reduction for high-dimensional data, approximation for costly models, and taskload imbalance optimization for distributed environment), while maintaining performance accuracy. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration, along with a real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm. We open-source SUOD for reproducibility and accessibility.

preprint2020arXiv

A Comprehensive Study on Temporal Modeling for Online Action Detection

Online action detection (OAD) is a practical yet challenging task, which has attracted increasing attention in recent years. A typical OAD system mainly consists of three modules: a frame-level feature extractor which is usually based on pre-trained deep Convolutional Neural Networks (CNNs), a temporal modeling module, and an action classifier. Among them, the temporal modeling module is crucial which aggregates discriminative information from historical and current features. Though many temporal modeling methods have been developed for OAD and other topics, their effects are lack of investigation on OAD fairly. This paper aims to provide a comprehensive study on temporal modeling for OAD including four meta types of temporal modeling methods, \ie temporal pooling, temporal convolution, recurrent neural networks, and temporal attention, and uncover some good practices to produce a state-of-the-art OAD system. Many of them are explored in OAD for the first time, and extensively evaluated with various hyper parameters. Furthermore, based on our comprehensive study, we present several hybrid temporal modeling methods, which outperform the recent state-of-the-art methods with sizable margins on THUMOS-14 and TVSeries.

preprint2020arXiv

Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models

The state-of-the-art pre-trained language representation models, such as Bidirectional Encoder Representations from Transformers (BERT), rarely incorporate commonsense knowledge or other knowledge explicitly. We propose a pre-training approach for incorporating commonsense knowledge into language representation models. We construct a commonsense-related multi-choice question answering dataset for pre-training a neural language representation model. The dataset is created automatically by our proposed "align, mask, and select" (AMS) method. We also investigate different pre-training tasks. Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieve significant improvements over previous state-of-the-art models on two commonsense-related benchmarks, including CommonsenseQA and Winograd Schema Challenge. We also observe that fine-tuned models after the proposed pre-training approach maintain comparable performance on other NLP tasks, such as sentence classification and natural language inference tasks, compared to the original BERT models. These results verify that the proposed approach, while significantly improving commonsense-related NLP tasks, does not degrade the general language representation capabilities.

preprint2020arXiv

Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection

With the increased applications of automatic speech recognition (ASR) in recent years, it is essential to automatically insert punctuation marks and remove disfluencies in transcripts, to improve the readability of the transcripts as well as the performance of subsequent applications, such as machine translation, dialogue systems, and so forth. In this paper, we propose a Controllable Time-delay Transformer (CT-Transformer) model that jointly completes the punctuation prediction and disfluency detection tasks in real time. The CT-Transformer model facilitates freezing partial outputs with controllable time delay to fulfill the real-time constraints in partial decoding required by subsequent applications. We further propose a fast decoding strategy to minimize latency while maintaining competitive performance. Experimental results on the IWSLT2011 benchmark dataset and an in-house Chinese annotated dataset demonstrate that the proposed approach outperforms the previous state-of-the-art models on F-scores and achieves a competitive inference speed.

preprint2020arXiv

Sequential Neural Networks for Noetic End-to-End Response Selection

The noetic end-to-end response selection challenge as one track in the 7th Dialog System Technology Challenges (DSTC7) aims to push the state of the art of utterance classification for real world goal-oriented dialog systems, for which participants need to select the correct next utterances from a set of candidates for the multi-turn context. This paper presents our systems that are ranked top 1 on both datasets under this challenge, one focused and small (Advising) and the other more diverse and large (Ubuntu). Previous state-of-the-art models use hierarchy-based (utterance-level and token-level) neural networks to explicitly model the interactions among different turns' utterances for context modeling. In this paper, we investigate a sequential matching model based only on chain sequence for multi-turn response selection. Our results demonstrate that the potentials of sequential matching approaches have not yet been fully exploited in the past for multi-turn response selection. In addition to ranking top 1 in the challenge, the proposed model outperforms all previous models, including state-of-the-art hierarchy-based models, on two large-scale public multi-turn response selection benchmark datasets.

preprint2020arXiv

Transfer Learning for Context-Aware Spoken Language Understanding

Spoken language understanding (SLU) is a key component of task-oriented dialogue systems. SLU parses natural language user utterances into semantic frames. Previous work has shown that incorporating context information significantly improves SLU performance for multi-turn dialogues. However, collecting a large-scale human-labeled multi-turn dialogue corpus for the target domains is complex and costly. To reduce dependency on the collection and annotation effort, we propose a Context Encoding Language Transformer (CELT) model facilitating exploiting various context information for SLU. We explore different transfer learning approaches to reduce dependency on data collection and annotation. In addition to unsupervised pre-training using large-scale general purpose unlabeled corpora, such as Wikipedia, we explore unsupervised and supervised adaptive training approaches for transfer learning to benefit from other in-domain and out-of-domain dialogue corpora. Experimental results demonstrate that the proposed model with the proposed transfer learning approaches achieves significant improvement on the SLU performance over state-of-the-art models on two large-scale single-turn dialogue benchmarks and one large-scale multi-turn dialogue benchmark.

preprint2020arXiv

TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation

Video action anticipation aims to predict future action categories from observed frames. Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states, and predict future actions from the hidden representations. It is well known that the recurrent pipeline is inefficient in capturing long-term information which may limit its performance in predication task. To address this problem, this paper proposes a simple yet efficient Temporal Transformer with Progressive Prediction (TTPP) framework, which repurposes a Transformer-style architecture to aggregate observed features, and then leverages a light-weight network to progressively predict future features and actions. Specifically, predicted features along with predicted probabilities are accumulated into the inputs of subsequent prediction. We evaluate our approach on three action datasets, namely TVSeries, THUMOS-14, and TV-Human-Interaction. Additionally we also conduct a comprehensive study for several popular aggregation and prediction strategies. Extensive results show that TTPP not only outperforms the state-of-the-art methods but also more efficient.

preprint2019arXiv

Polymeric Liquid Layer Densified by Surface Acoustic Wave

With the application of surface acoustic wave (SAW) of 39.5 MHz to a model polymer liquid film,polyisobutylene, deposited on the solid substrates, the liquid film is densified, proved by the decrease of film thickness and the increase of refractive index, measured by ellipsometry. Rotational motion of fluorescent probes doped inside the liquid film, measured by polarization-resolved single molecule fluorescence microscopy, is retarded and the dynamical heterogeneity is reduced. It is demonstrated that the application of SAW of high frequency makes the thin polymeric liquid film densified and more dynamically homogeneous.