Researcher profile

Angela Yao

Angela Yao contributes to research discovery and scholarly infrastructure.


Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, Unclaimed author
20 works
0 followers
5 topics
4 close collaborators

Published work

20 published items

preprint, 2026, arXiv

Decouple and Cache: KV Cache Construction for Streaming Video Understanding

Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, unbounded streams require continuously constructing new key-value (KV) caches and evicting old ones. Second, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs either fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring that KV caches support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average accuracy gain of 2.5% over prior methods.

preprint, 2026, arXiv

Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, a benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration, suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.

preprint, 2026, arXiv

Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models

This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept's identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
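As a rough illustration of merging concept-specific LoRA updates in weight space, the sketch below adds the low-rank updates of the adapters that pass a gate to a frozen base weight matrix. It is not the authors' implementation: the function names, the hard threshold on a scalar gate score, and the toy matrices are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gate_and_merge(
    base: Matrix,
    adapters: Sequence[Tuple[Matrix, Matrix]],
    gate_scores: Sequence[float],
    threshold: float = 0.5,
) -> Matrix:
    """Return base + sum(B @ A) over adapters whose gate score passes the threshold.

    The base matrix is copied, so the pretrained weights stay unchanged.
    The hard threshold is a hypothetical stand-in for a learned gate.
    """
    merged = [row[:] for row in base]
    for (b, a), score in zip(adapters, gate_scores):
        if score < threshold:
            continue  # suppress adapters judged irrelevant to this input
        delta = matmul(b, a)  # low-rank update for one concept
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += value
    return merged


# Toy setup: a 2x2 base weight and two rank-1 adapters given as (B, A) pairs.
base = [[1.0, 0.0], [0.0, 1.0]]
adapter_x = ([[1.0], [0.0]], [[0.0, 2.0]])  # B @ A = [[0, 2], [0, 0]]
adapter_y = ([[0.0], [1.0]], [[3.0, 0.0]])  # B @ A = [[0, 0], [3, 0]]

only_x = gate_and_merge(base, [adapter_x, adapter_y], gate_scores=[0.9, 0.1])
both = gate_and_merge(base, [adapter_x, adapter_y], gate_scores=[0.9, 0.8])
```

Here `only_x` contains just the first concept's update, while `both` composes the two concepts without any joint training.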

preprint, 2026, arXiv

LightAVSeg: Lightweight Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state of the art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.

preprint, 2026, arXiv

Noise-Robust Tiny Object Localization with Flows

Despite significant advances in generic object detection, a persistent performance gap remains for tiny objects compared to normal-scale objects. We demonstrate that tiny objects are highly sensitive to annotation noise, where optimizing strict localization objectives risks overfitting to noise. To address this, we propose Tiny Object Localization with Flows (TOLF), a noise-robust localization framework leveraging normalizing flows for flexible error modeling and uncertainty-guided optimization. Our method captures complex, non-Gaussian prediction distributions through flow-based error modeling, enabling robust learning under noisy supervision. An uncertainty-aware gradient modulation mechanism further suppresses learning from high-uncertainty, noise-prone samples, mitigating overfitting while stabilizing training. Extensive experiments across three datasets validate our approach's effectiveness. In particular, TOLF boosts the DINO baseline by 1.2% AP on the AI-TOD dataset.

preprint, 2022, arXiv

A Closer Look at Branch Classifiers of Multi-exit Architectures

Multi-exit architectures consist of a backbone and branch classifiers that offer shortened inference pathways to reduce the run-time of deep neural networks. In this paper, we analyze different branching patterns that vary in their allocation of computational complexity for the branch classifiers. Constant-complexity branching keeps all branches the same, while complexity-increasing and complexity-decreasing branching place more complex branches later or earlier in the backbone, respectively. Through extensive experimentation on multiple backbones and datasets, we find that complexity-decreasing branches are more effective than constant-complexity or complexity-increasing branches, achieving the best accuracy-cost trade-off. We investigate a cause by using knowledge consistency to probe the effect of adding branches onto a backbone. Our findings show that complexity-decreasing branching yields the least disruption to the feature abstraction hierarchy of the backbone, which explains the effectiveness of the branching patterns.
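To make the idea of shortened inference pathways concrete, the following minimal sketch runs backbone stages in order and stops at the first branch classifier whose softmax confidence reaches a threshold. It illustrates generic early-exit inference only; the confidence-based exit rule, the function names, and the toy stages are assumptions for this example and do not reproduce the branching patterns studied in the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    x: Vector,
    stages: Sequence[Callable[[Vector], Vector]],
    branches: Sequence[Callable[[Vector], Vector]],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Return (predicted class, index of the exit that produced it).

    Each backbone stage is followed by one branch classifier. Inference
    stops at the first branch that is confident enough, otherwise the
    prediction of the last branch is used.
    """
    features = x
    probs: Vector = []
    for idx, (stage, branch) in enumerate(zip(stages, branches)):
        features = stage(features)
        probs = softmax(branch(features))
        if max(probs) >= threshold:
            return probs.index(max(probs)), idx
    return probs.index(max(probs)), len(stages) - 1


# Toy model (hypothetical): every stage doubles the features, and every
# branch uses the features directly as class logits.
stages = [lambda f: [2.0 * v for v in f]] * 3
branches = [lambda f: list(f)] * 3

easy = early_exit_predict([2.0, 0.0], stages, branches)  # confident at exit 0
hard = early_exit_predict([0.5, 0.0], stages, branches)  # needs all three stages
```

An easy input leaves through the first branch and skips the remaining stages, which is where the run-time saving comes from.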

preprint, 2022, arXiv

A Generalized & Robust Framework For Timestamp Supervision in Temporal Action Segmentation

In temporal action segmentation, Timestamp supervision requires only a handful of labelled frames per video sequence. For unlabelled frames, previous works rely on assigning hard labels, and performance rapidly collapses under subtle violations of the annotation assumptions. We propose a novel Expectation-Maximization (EM) based approach that leverages the label uncertainty of unlabelled frames and is robust enough to accommodate possible annotation errors. With accurate timestamp annotations, our proposed method produces SOTA results and even exceeds the fully-supervised setup in several metrics and datasets. When applied to timestamp annotations with missing action segments, our method presents stable performance. To further test our formulation's robustness, we introduce the new challenging annotation setup of Skip-tag supervision. This setup relaxes constraints and requires annotations of any fixed number of random frames in a video, making it more flexible than Timestamp supervision while remaining competitive.

preprint, 2022, arXiv

Accelerating Video Object Segmentation with Compressed Video

We propose an efficient plug-and-play acceleration framework for semi-supervised video object segmentation by exploiting the temporal redundancies in videos presented by the compressed bitstream. Specifically, we propose a motion vector-based warping method for propagating segmentation masks from keyframes to other frames in a bi-directional and multi-hop manner. Additionally, we introduce a residual-based correction module that can fix wrongly propagated segmentation masks from noisy or erroneous motion vectors. Our approach is flexible and can be added on top of several existing video object segmentation algorithms. We achieved highly competitive results on DAVIS17 and YouTube-VOS on various base models with substantial speed-ups of up to 3.5X with minor drops in accuracy.
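The motion-vector-based warping step can be pictured with a small sketch: each block of the target frame carries one motion vector that points to the location in the keyframe it was predicted from, and the mask value is copied from there. This is a single-hop, one-directional toy version written for illustration; the block size, the vector convention, and the function name are assumptions, and the bi-directional multi-hop propagation and residual-based correction from the abstract are not modeled.

```python
from typing import List, Tuple

Mask = List[List[int]]


def warp_mask(
    key_mask: Mask,
    motion_vectors: List[List[Tuple[int, int]]],
    block: int = 2,
) -> Mask:
    """Propagate a keyframe mask to a target frame using block motion vectors.

    motion_vectors[by][bx] = (dy, dx) is the offset from a pixel in target
    block (by, bx) to its reference pixel in the keyframe. Source
    coordinates are clamped to the image border.
    """
    height, width = len(key_mask), len(key_mask[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = motion_vectors[y // block][x // block]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            warped[y][x] = key_mask[src_y][src_x]
    return warped


# Toy 4x4 keyframe mask with the object in the top-left 2x2 block.
key_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The object moved two pixels to the right: the top-right target block
# references the top-left keyframe block, and the top-left target block
# references background two rows below.
motion_vectors = [
    [(2, 0), (0, -2)],
    [(0, 0), (0, 0)],
]
target_mask = warp_mask(key_mask, motion_vectors)
```

Because the vectors are read from the compressed bitstream rather than computed by a network, a propagation step like this costs only a lookup per pixel.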

preprint, 2022, arXiv

Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections. Assembly101 is the first multi-view action dataset, with simultaneous static (8) and egocentric (4) recordings. Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses. We benchmark on three action understanding tasks: recognition, anticipation and temporal segmentation. Additionally, we propose a novel task of detecting mistakes. The unique recording format and rich set of annotations allow us to investigate generalization to new toys, cross-view transfer, long-tailed distributions, and pose vs. appearance. We envision that Assembly101 will serve as a new challenge to investigate various activity understanding problems.

preprint, 2022, arXiv

Discrete-Constrained Regression for Local Counting Models

A local count, i.e., the number of objects in a local area, is a continuous value by nature. Yet recent state-of-the-art methods show that formulating counting as a classification task performs better than regression. Through a series of experiments on carefully controlled synthetic data, we show that this counter-intuitive result is caused by imprecise ground truth local counts. Factors such as biased dot annotations and incorrectly matched Gaussian kernels used to generate ground truth counts introduce deviations from the true local counts. Standard continuous regression is highly sensitive to these errors, explaining the performance gap between classification and regression. To mitigate the sensitivity, we loosen the regression formulation from a continuous scale to a discrete ordering and propose a novel discrete-constrained (DC) regression. Applied to crowd counting, DC-regression is more accurate than both classification and standard regression on three public benchmarks. A similar advantage also holds for the age estimation task, verifying the overall effectiveness of DC-regression.

preprint, 2022, arXiv

Learning Deep Morphological Networks with Neural Architecture Search

Deep Neural Networks (DNNs) are generated by sequentially performing linear and non-linear processes. Using a combination of linear and non-linear procedures is critical for generating a sufficiently deep feature space. The majority of non-linear operators are derivations of activation functions or pooling functions. Mathematical morphology is a branch of mathematics that provides non-linear operators for a variety of image processing problems. In this paper, we investigate the utility of integrating these operations into an end-to-end deep learning framework. DNNs are designed to acquire a realistic representation for a particular task. Morphological operators give topological descriptors that convey salient information about the shapes of objects depicted in images. We propose a method based on meta-learning to incorporate morphological operators into DNNs. The learned architecture demonstrates how our novel morphological operations significantly increase DNN performance on various tasks, including image classification and edge detection.

preprint, 2022, arXiv

Leveraging Action Affinity and Continuity for Semi-supervised Temporal Action Segmentation

We present a semi-supervised learning approach to the temporal action segmentation task. The goal of the task is to temporally detect and segment actions in long, untrimmed procedural videos, where only a small set of videos is densely labelled and a large collection of videos is unlabelled. To this end, we propose two novel loss functions for the unlabelled data: an action affinity loss and an action continuity loss. The action affinity loss guides learning on the unlabelled samples by imposing the action priors induced from the labelled set. The action continuity loss enforces the temporal continuity of actions, which also provides frame-wise classification supervision. In addition, we propose an Adaptive Boundary Smoothing (ABS) approach to build coarser action boundaries for more robust and reliable learning. The proposed loss functions and ABS were evaluated on three benchmarks. Results show that they significantly improved action segmentation performance with a low amount (5% and 10%) of labelled data and achieved comparable results to full supervision with 50% labelled data. Furthermore, ABS succeeded in boosting performance when integrated into fully-supervised learning.

preprint, 2022, arXiv

Multi-Scale Memory-Based Video Deblurring

Video deblurring has achieved remarkable progress thanks to the success of deep neural networks. Most methods solve for the deblurring end-to-end with limited information propagation from the video sequence. However, different frame regions exhibit different characteristics and should be provided with corresponding relevant information. To achieve fine-grained deblurring, we designed a memory branch to memorize the blurry-sharp feature pairs in the memory bank, thus providing useful information for the blurry query input. To enrich the memory of our memory bank, we further designed a bidirectional recurrency and multi-scale strategy based on the memory bank. Experimental results demonstrate that our model outperforms other state-of-the-art methods while keeping the model complexity and inference time low. The code is available at https://github.com/jibo27/MemDeblur.

preprint, 2022, arXiv

Perception-Distortion Balanced ADMM Optimization for Single-Image Super-Resolution

In image super-resolution, both pixel-wise accuracy and perceptual fidelity are desirable. However, most deep learning methods only achieve high performance in one aspect due to the perception-distortion trade-off, and works that successfully balance the trade-off rely on fusing results from separately trained models with ad-hoc post-processing. In this paper, we propose a novel super-resolution model with a low-frequency constraint (LFc-SR), which balances the objective and perceptual quality through a single model and yields super-resolved images with high PSNR and perceptual scores. We further introduce an ADMM-based alternating optimization method for the non-trivial learning of the constrained model. Experiments showed that our method, without cumbersome post-processing procedures, achieved state-of-the-art performance. The code is available at https://github.com/Yuehan717/PDASR.

preprint, 2022, arXiv

TemporalUV: Capturing Loose Clothing with Temporally Coherent UV Coordinates

We propose a novel approach to generate temporally coherent UV coordinates for loose clothing. Our method is not constrained by human body outlines and can capture loose garments and hair. We implemented a differentiable pipeline to learn UV mapping between a sequence of RGB inputs and textures via UV coordinates. Instead of treating the UV coordinates of each frame separately, our data generation approach connects all UV coordinates via feature matching for temporal stability. Subsequently, a generative model is trained to balance the spatial quality and temporal stability. It is driven by supervised and unsupervised losses in both UV and image spaces. Our experiments show that the trained models output high-quality UV coordinates and generalize to new poses. Once a sequence of UV coordinates has been inferred by our model, it can be used to flexibly synthesize new looks and modified visual styles. Compared to existing methods, our approach reduces the computational workload to animate new outfits by several orders of magnitude.

preprint, 2022, arXiv

Video as Conditional Graph Hierarchy for Multi-Granular Question Answering

Video question answering requires models to understand and reason about both complex video and language data to correctly derive the answers. Existing efforts have focused on designing sophisticated cross-modal interactions to fuse the information from the two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods essentially revolve around the sequential nature of video and question contents, providing little insight into the problem of question answering and lacking interpretability as well. In this work, we argue that while video is presented as a frame sequence, the visual elements (e.g., objects, actions, activities and events) are not sequential but rather hierarchical in semantic space. To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues. Despite its simplicity, our extensive experiments demonstrate the superiority of such a conditional hierarchical graph architecture, with clear performance improvements over prior methods and better generalization across different types of questions. Further analyses also demonstrate the model's reliability, as it shows meaningful visual-textual evidence for the predicted answers.

preprint, 2020, arXiv

Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction

We study how well different types of approaches generalise in the task of 3D hand pose estimation under single hand scenarios and hand-object interaction. We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set. Unfortunately, since the space of hand poses is high-dimensional, it is inherently infeasible to cover the whole space densely, despite recent efforts in collecting large-scale training datasets. This sampling problem is even more severe when hands are interacting with objects and/or inputs are RGB rather than depth images, as RGB images also vary with lighting conditions and colors. To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set. More exactly, HANDS'19 is designed (a) to evaluate the influence of both depth and color modalities on 3D hand pose estimation, in the presence or absence of objects; (b) to assess the generalisation abilities w.r.t. four main axes: shapes, articulations, viewpoints, and objects; (c) to explore the use of a synthetic hand model to fill the gaps of current datasets. Through the challenge, the overall accuracy has dramatically improved over the baseline, especially on extrapolation tasks, from 27mm to 13mm mean joint error. Our analyses highlight the impacts of data pre-processing, ensemble approaches, the use of a parametric 3D hand model (MANO), and different HPE methods/backbones.

preprint, 2020, arXiv

Neural network compression via learnable wavelet transforms

Wavelets are well known for data compression, yet have rarely been applied to the compression of neural networks. This paper shows how the fast wavelet transform can be used to compress linear layers in neural networks. Linear layers still occupy a significant portion of the parameters in recurrent neural networks (RNNs). Through our method, we can learn both the wavelet bases and corresponding coefficients to efficiently represent the linear layers of RNNs. Our wavelet compressed RNNs have significantly fewer parameters yet still perform competitively with the state-of-the-art on synthetic and real-world RNN benchmarks. Wavelet optimization adds basis flexibility, without large numbers of extra weights. Source code is available at https://github.com/v0lta/Wavelet-network-compression.

preprint, 2020, arXiv

Sequence Prediction using Spectral RNNs

Fourier methods have a long and proven track record as an excellent tool in data processing. As memory and computational constraints gain importance in embedded and mobile applications, we propose to combine Fourier methods and recurrent neural network architectures. The short-time Fourier transform allows us to efficiently process multiple samples at a time. Additionally, weight reduction through low-pass filtering is possible. We predict time series data drawn from the chaotic Mackey-Glass differential equation and real-world power load and motion capture data.

preprint, 2020, arXiv

Temporal Aggregate Representations for Long-Range Video Understanding

Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
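One way to picture multi-granular temporal aggregation with max-pooling is to pool per-frame features over several look-back windows of different lengths, so that short windows keep recent detail and long windows summarize more of the past. The sketch below is an assumption-laden illustration rather than the paper's model: the window lengths, the function names, and the toy features are hypothetical, and the attention component mentioned in the abstract is omitted.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def max_pool(frames: Sequence[Vector]) -> Vector:
    """Element-wise maximum over a sequence of per-frame feature vectors."""
    return [max(column) for column in zip(*frames)]


def temporal_aggregate(
    frames: Sequence[Vector], spans: Sequence[int] = (2, 4)
) -> Dict[int, Vector]:
    """Max-pool the most recent `span` frames for each span length.

    If a span is longer than the sequence, the whole sequence is pooled.
    """
    return {span: max_pool(frames[-span:]) for span in spans}


# Toy per-frame features, oldest first (hypothetical values).
frames = [
    [1.0, 0.0],
    [0.0, 5.0],
    [2.0, 1.0],
    [3.0, 0.0],
]
aggregates = temporal_aggregate(frames, spans=(2, 4))
```

In this toy case the short window misses the early spike in the second feature, while the long window retains it, which is the kind of complementary information a multi-scale representation is meant to provide.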