Researcher profile

Yifei Huang

Yifei Huang contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, Unclaimed author
11 works
0 followers
8 topics
4 close collaborators

Published work

11 published items

Preprint, 2026, arXiv

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

Preprint, 2026, arXiv

Placement Delivery Array for Cache-Aided MIMO Systems

We consider a $(G,L,K,M,N)$ cache-aided multiple-input multiple-output (MIMO) network, where a server equipped with $L$ antennas and a library of $N$ equal-size files communicates with $K$ users, each equipped with $G$ antennas and a cache of size $M$ files, over a wireless interference channel. Each user requests an arbitrary file from the library. The goal is to design coded caching schemes that simultaneously achieve the maximum sum degrees of freedom (sum-DoF) and low subpacketization. In this paper, we first introduce a unified combinatorial structure, termed the MIMO placement delivery array (MIMO-PDA), which characterizes uncoded placement and one-shot zero-forcing delivery. By analyzing the combinatorial properties of MIMO-PDAs, we derive a sum-DoF upper bound of $\min\{KG, Gt+G\lceil L/G \rceil\}$, where $t=KM/N$, which coincides with the optimal DoF characterization in prior work by Tehrani et al. Based on this upper bound, we present two novel constructions of MIMO-PDAs that achieve the maximum sum-DoF. The first construction achieves linear subpacketization under stringent parameter constraints, while the second achieves ordered exponential subpacketization under substantially milder constraints. Theoretical analysis and numerical comparisons demonstrate that the second construction exponentially reduces subpacketization compared to existing schemes while preserving the maximum sum-DoF.
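
As a quick illustration of the upper bound quoted in this abstract, the sketch below evaluates $\min\{KG, Gt+G\lceil L/G \rceil\}$ with $t=KM/N$ for given parameters. The function name and the sample parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def sum_dof_upper_bound(G: int, L: int, K: int, M: int, N: int) -> float:
    """Evaluate min{K*G, G*t + G*ceil(L/G)} with t = K*M/N.

    G: antennas per user, L: server antennas, K: users,
    M: cache size in files, N: library size in files.
    """
    t = K * M / N  # normalized aggregate cache size
    return min(K * G, G * t + G * math.ceil(L / G))


# Hypothetical parameters: t = 2, so the cache-dependent term is 2*2 + 2*2 = 8,
# which is smaller than K*G = 20.
print(sum_dof_upper_bound(G=2, L=4, K=10, M=2, N=10))
```

With few users the bound is limited by the $KG$ term instead, for example `sum_dof_upper_bound(G=2, L=4, K=3, M=5, N=10)` is capped at 6.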

Preprint, 2026, arXiv

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
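
The two modules described in this abstract can be pictured with a toy cross-attention map. The sketch below is not the authors' implementation: the function name, the owner-id encoding, the `gain` value, and the row renormalization are all assumptions made for illustration, and the spatiotemporal aspect is reduced to a flat list of visual tokens.

```python
from typing import List, Optional


def modulate_cross_attention(
    attn: List[List[float]],
    visual_owner: List[Optional[int]],
    text_owner: List[Optional[int]],
    directional: List[bool],
    gain: float = 2.0,
) -> List[List[float]]:
    """Toy masking and reweighting of a (visual tokens x text tokens) map.

    visual_owner[i]: person id owning visual token i, or None for background.
    text_owner[j]: person id described by text token j, or None if shared.
    directional[j]: True if text token j is a directional word.
    """
    out = []
    for row, v_owner in zip(attn, visual_owner):
        new_row = []
        for weight, t_owner, is_dir in zip(row, text_owner, directional):
            if v_owner is not None and t_owner is not None and v_owner != t_owner:
                # Masking step: ignore text that describes a different person.
                weight = 0.0
            elif is_dir:
                # Reweighting step: amplify attention to directional words.
                weight *= gain
            new_row.append(weight)
        total = sum(new_row)
        # Renormalize so each row still sums to 1 (an assumption of this sketch).
        out.append([w / total for w in new_row] if total > 0 else new_row)
    return out


# One visual token of person 0 over four text tokens: two describe person 0
# (the second is directional), one describes person 1, one is shared.
attn = [[0.25, 0.25, 0.25, 0.25]]
print(modulate_cross_attention(attn, [0], [0, 0, 1, None],
                               [False, True, False, False]))
```

In this toy case the weight on person 1's description drops to zero and the directional word ends up with the largest share of the attention.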

Preprint, 2026, arXiv

TRAM: A Transverse Relaxation Time-Aware Qubit Mapping Algorithm for NISQ Devices

Noisy intermediate-scale quantum (NISQ) devices impose dual challenges on quantum circuit execution: limited qubit connectivity requires extensive SWAP-gate routing, while time-dependent decoherence progressively degrades quantum information. Existing qubit mapping algorithms optimize for hardware topology and static calibration metrics but systematically neglect transverse relaxation dynamics (T2), creating a fundamental gap between compiler decisions and evolving noise characteristics. We present TRAM (Transverse Relaxation Time-Aware Qubit Mapping), a coherence-guided compilation framework that elevates decoherence mitigation to a primary optimization objective. TRAM integrates calibration-informed community detection to construct noise-resilient qubit partitions, generates time-weighted initial mappings that anticipate coherence decay, and dynamically schedules SWAP operations to minimize cumulative error accumulation. Evaluated on Qiskit-based simulators with realistic noise models, TRAM outperforms SABRE by 3.59% in fidelity, reduces gate count by 11.49%, and shortens circuit depth by 12.28%, establishing coherence-aware optimization as essential for practical quantum compilation in the NISQ era.

Preprint, 2022, arXiv

CLRNet: Cross Layer Refinement Network for Lane Detection

Lanes are critical in the vision navigation systems of intelligent vehicles. A lane is naturally a traffic sign with high-level semantics, yet it has a specific local pattern that needs detailed low-level features to be localized accurately. Using different feature levels is of great importance for accurate lane detection, but it is still under-explored. In this work, we present Cross Layer Refinement Network (CLRNet) aiming at fully utilizing both high-level and low-level features in lane detection. In particular, it first detects lanes with high-level semantic features and then performs refinement based on low-level features. In this way, we can exploit more contextual information to detect lanes while leveraging local detailed lane features to improve localization accuracy. We present ROIGather to gather global context, which further enhances the feature representation of lanes. In addition to our novel network design, we introduce Line IoU loss which regresses the lane line as a whole unit to improve the localization accuracy. Experiments demonstrate that the proposed method greatly outperforms the state-of-the-art lane detection approaches.
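
The abstract only says that Line IoU loss treats the lane line as a whole unit. One way to read that, sketched below under stated assumptions, is to widen each sampled lane point into a horizontal segment of half-width `radius`, then divide the summed overlap by the summed union across all rows. The function name, the `radius` parameter, and the sample coordinates are illustrative and not taken from this page.

```python
from typing import Sequence


def line_iou_loss(pred_xs: Sequence[float], gt_xs: Sequence[float],
                  radius: float) -> float:
    """1 - (sum of per-row overlaps) / (sum of per-row unions).

    pred_xs, gt_xs: x-coordinates of the predicted and ground-truth lane,
    sampled at the same image rows. Each point is widened to the segment
    [x - radius, x + radius] before comparing.
    """
    inter = 0.0
    union = 0.0
    for xp, xg in zip(pred_xs, gt_xs):
        # Overlap may be negative when the segments are far apart, which
        # still gives a useful signal for non-overlapping lines.
        inter += min(xp + radius, xg + radius) - max(xp - radius, xg - radius)
        union += max(xp + radius, xg + radius) - min(xp - radius, xg - radius)
    return 1.0 - inter / union


# Identical lines give zero loss; a constant 5-pixel offset gives a positive loss.
print(line_iou_loss([10.0, 20.0], [10.0, 20.0], radius=5.0))
print(line_iou_loss([0.0, 0.0], [5.0, 5.0], radius=5.0))
```

Summing over rows before dividing is what makes the quantity a property of the whole line rather than of individual points.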

Preprint, 2022, arXiv

Ego4D: Around the World in 3,000 Hours of Egocentric Video

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

Preprint, 2021, arXiv

Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling

Visual storytelling is a task of generating relevant and interesting stories for given image sequences. In this work we aim at increasing the diversity of the generated stories while preserving the informative content from the images. We propose to foster the diversity and informativeness of a generated story by using a concept selection module that suggests a set of concept candidates. Then, we utilize a large scale pre-trained model to convert concepts and images into full stories. To enrich the candidate concepts, a commonsense knowledge graph is created for each image sequence from which the concept candidates are proposed. To obtain appropriate concepts from the graph, we propose two novel modules that consider the correlation among candidate concepts and the image-concept correlation. Extensive automatic and human evaluation results demonstrate that our model can produce reasonable concepts. This enables our model to outperform the previous models by a large margin on the diversity and informativeness of the story, while retaining the relevance of the story to the image sequence.

Preprint, 2021, arXiv

Feynman-path type simulation using stabilizer projector decomposition of unitaries

We propose a classical simulation method for quantum circuits based on decomposing unitary gates into a sum of stabilizer projectors. By only decomposing the non-Clifford gates, we take advantage of the Gottesman-Knill theorem and build a bridge between stabilizer-based simulation and Feynman-path-type simulation. We give two variants of this method: stabilizer-based path-integral recursion (SPIR) and stabilizer projector contraction (SPC). We also analyze further advantages and disadvantages of our method compared to the Bravyi-Gosset algorithm and recursive Feynman path-integral algorithms. We construct a parametrized circuit ensemble and identify the parameter regime in this ensemble where our method offers superior performance. We also estimate the time cost for simulating quantum supremacy experiments with our method and motivate potential improvements of the method.
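
As a small worked example of writing a non-Clifford gate as a sum of stabilizer projectors, consider the $T$ gate. This is a standard identity given for illustration; whether it matches the exact decomposition used in the paper is an assumption.

```latex
T = |0\rangle\langle 0| + e^{i\pi/4}\,|1\rangle\langle 1|
  = \frac{I + Z}{2} + e^{i\pi/4}\,\frac{I - Z}{2}
```

Both $(I+Z)/2$ and $(I-Z)/2$ project onto stabilizer states (the $\pm 1$ eigenstates of $Z$), so each $T$ gate splits a circuit into two branches that contain only Clifford gates and stabilizer projectors. A circuit with $k$ such gates yields at most $2^k$ branches, which is the sense in which the approach resembles a sum over Feynman paths.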

Preprint, 2021, arXiv

Goal-Oriented Gaze Estimation for Zero-Shot Learning

Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen classes. Since semantic knowledge is built on attributes shared between different classes, which are highly local, a strong prior for localizing object attributes is beneficial for visual-semantic embedding. Interestingly, when recognizing unseen images, humans also automatically gaze at regions with certain semantic clues. Therefore, we introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization based on the class-level attributes for ZSL. We aim to predict the actual human gaze location to get the visual attention regions for recognizing a novel object guided by attribute description. Specifically, the task-dependent attention is learned with the goal-oriented GEM, and the global image features are simultaneously optimized with the regression of local attribute features. Experiments on three ZSL benchmarks, i.e., CUB, SUN and AWA2, show the superiority or competitiveness of our proposed method against the state-of-the-art ZSL methods. The ablation analysis on real gaze data CUB-VWSW also validates the benefits and accuracy of our gaze estimation module. This work implies the promising benefits of collecting human gaze datasets and automatic gaze estimation algorithms on high-level computer vision tasks. The code is available at https://github.com/osierboy/GEM-ZSL.

Preprint, 2021, arXiv

Rethinking Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics

Margin enlargement over training data has been an important strategy since perceptrons in machine learning for the purpose of boosting the robustness of classifiers toward a good generalization ability. Yet Breiman (1999) showed a dilemma that a uniform improvement on margin distribution does NOT necessarily reduce generalization errors. In this paper, we revisit Breiman's dilemma in deep neural networks with recently proposed spectrally normalized margins, from a novel perspective based on phase transitions of normalized margin distributions in training dynamics. The normalized margin distribution of a classifier over the data can be divided into two parts: low/small margins, such as some negative margins for misclassified samples, vs. high/large margins for high-confidence correctly classified samples, which often behave differently during the training process. Low margins for training and test datasets are often effectively reduced in training, along with reductions of training and test errors, while high margins may exhibit different dynamics, reflecting the trade-off between expressive power of models and complexity of data. When data complexity is comparable to the model expressiveness, high margin distributions for both training and test data undergo similar decrease-increase phase transitions during training. In such cases, one can predict the trend of generalization or test error by margin-based generalization bounds with restricted Rademacher complexities, shown in two ways in this paper with early stopping time exploiting such phase transitions. On the other hand, over-expressive models may have both low and high training margins undergoing uniform improvements, with a distinct phase transition in test margin dynamics. This reconfirms Breiman's dilemma associated with overparameterized neural networks where margins fail to predict overfitting.

Preprint, 2020, arXiv

Situation Awareness and Information Fusion in Sales and Customer Engagement: A Paradigm Shift

With today's savvy and empowered customers, sales requires more judgment and becomes more cognitively intense than ever before. We argue that Situation Awareness (SA) is at the center of effective sales and customer engagement in this new era, and Information Fusion (IF) is the key for developing the next generation of decision support systems for digital and AI transformation, leveraging the ubiquitous virtual presence of sales and customer engagement which provides substantially richer capacity to access information. We propose a vision and path for the paradigm shift from Customer Relationship Management (CRM) to the new paradigm of IF. We argue this new paradigm solves major problems of the current CRM paradigm: (1) it reduces the burden of manual data entry and enables more reliable, comprehensive and up-to-date data and knowledge, (2) it enhances individual and team SA and alleviates information silos with increased knowledge transferability, and (3) it enables a more powerful ecosystem of applications by providing a common shared layer of computable knowledge assets.