Source author record

Yuheng Wang

Yuheng Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Artificial Intelligence, Computer Vision, Computation and Language, eess.SP, eess.SY, Human-Computer Interaction, Logic in Computer Science, Machine Learning, Methodology, Systems and Control

Catalog footprint

What is connected

7 works

10 topics

4 close collaborators

Preprint, 2026, arXiv

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.
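The Execution-Spatial Gap described above can be illustrated with a short calculation. This is a minimal sketch, assuming the drop is the plain difference between a model's execution success rate and its spatial pass rate, averaged over models; the abstract does not spell out the exact definition. The function name and the example numbers are hypothetical and are not results from PRISM.

```python
from statistics import mean


def execution_spatial_gap(results):
    """Average drop from execution success rate to spatial pass rate.

    `results` maps a model name to (execution_success_rate, spatial_pass_rate),
    both given as fractions in [0, 1]. Taking the drop as a simple difference
    of rates is an assumption about the benchmark's definition.
    """
    drops = [exec_rate - spatial_rate for exec_rate, spatial_rate in results.values()]
    return mean(drops)


# Hypothetical numbers for illustration only, not results from PRISM.
example = {
    "model_a": (0.90, 0.50),
    "model_b": (0.80, 0.40),
}
gap = execution_spatial_gap(example)
print(f"average gap: {gap:.2f}")
```

A model can therefore score well on executability alone while a large share of its runnable programs still fail the spatial check.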

Preprint, 2026, arXiv

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Video generation powers a vast array of downstream applications. However, while latent diffusion models, the de facto standard, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
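For readers unfamiliar with the reconstruction metric quoted above, PSNR can be computed as follows. This is a minimal sketch of the standard definition, 10 log10(MAX^2 / MSE), over flat sequences of pixel values; the function name and the default peak value of 255 are assumptions for illustration, and this is not the evaluation code used for RefDecoder.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and the reconstruction.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of a few dB corresponds to a substantial reduction in mean squared error.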

Preprint, 2022, arXiv

Knowledge Authoring with Factual English

Knowledge representation and reasoning (KRR) systems represent knowledge as collections of facts and rules. Like databases, KRR systems contain information about domains of human activities like industrial enterprises, science, and business. KRR systems can represent complex concepts and relations, and they can query and manipulate information in sophisticated ways. Unfortunately, KRR technology has been hindered by the fact that specifying the requisite knowledge requires skills that most domain experts do not have, and professional knowledge engineers are hard to find. One solution could be to extract knowledge from English text, and a number of works have attempted to do so (OpenSesame, Google's Sling, etc.). Unfortunately, at present, extraction of logical facts from unrestricted natural language is still too inaccurate to be used for reasoning, while restricting the grammar of the language (so-called controlled natural language, or CNL) is hard for the users to learn and use. Nevertheless, some recent CNL-based approaches, such as the Knowledge Authoring Logic Machine (KALM), have been shown to have very high accuracy compared to others, and a natural question is to what extent the CNL restrictions can be lifted. In this paper, we address this issue by transplanting the KALM framework to a neural natural language parser, mStanza. Here we limit our attention to authoring facts and queries, and therefore our focus is what we call factual English statements. Authoring other types of knowledge, such as rules, will be considered in our follow-up work. As it turns out, neural-network-based parsers have problems of their own, and the mistakes they make range from part-of-speech tagging to lemmatization to dependency errors. We present a number of techniques for combating these problems and test the new system, KALMFL (i.e., KALM for factual language), on a number of benchmarks, which show KALMFL achieves correctness in excess of 95%.

Preprint, 2022, arXiv

Risk-averse autonomous systems: A brief history and recent developments from the perspective of optimal control

We present a historical overview of the connections between the analysis of risk and the control of autonomous systems. We offer two main contributions. Our first contribution is to propose three overlapping paradigms to classify the vast body of literature: the worst-case, risk-neutral, and risk-averse paradigms. We consider an appropriate assessment for the risk of an autonomous system to depend on the application at hand. In contrast, it is typical to assess risk using an expectation, variance, or probability alone. Our second contribution is to unify the concepts of risk and autonomous systems. We achieve this by connecting approaches for quantifying and optimizing the risk that arises from a system's behaviour across academic fields. The survey is highly multidisciplinary. We include research from the communities of reinforcement learning, stochastic and robust control theory, operations research, and formal verification. We describe both model-based and model-free methods, with emphasis on the former. Lastly, we highlight fruitful areas for further research. A key direction is to blend risk-averse model-based and model-free methods to enhance the real-time adaptive capabilities of systems to improve human and environmental welfare.

Preprint, 2022, arXiv

SSD-KD: A Self-supervised Diverse Knowledge Distillation Method for Lightweight Skin Lesion Classification Using Dermoscopic Images

Skin cancer is one of the most common types of malignancy, affecting a large population and causing a heavy economic burden worldwide. Over the last few years, computer-aided diagnosis has developed rapidly and made great progress in healthcare and medical practice due to advances in artificial intelligence. However, most studies in skin cancer detection keep pursuing high prediction accuracies without considering the limited computing resources of portable devices. In this case, knowledge distillation (KD) has proven to be an efficient tool for improving the adaptability of lightweight models under limited resources while keeping a high-level representation capability. To bridge the gap, this study proposes a novel method, termed SSD-KD, that unifies diverse knowledge into a generic KD framework for skin disease classification. Our method models an intra-instance relational feature representation and integrates it with existing KD research. A dual relational knowledge distillation architecture is trained in a self-supervised manner, while the weighted softened outputs are also exploited to enable the student model to capture richer knowledge from the teacher model. To demonstrate the effectiveness of our method, we conduct experiments on ISIC 2019, a large-scale open-access benchmark of dermoscopic images of skin diseases. Experiments show that our distilled lightweight model can achieve an accuracy as high as 85% for the classification of 8 different skin diseases with minimal parameters and computing requirements. Ablation studies confirm the effectiveness of our intra- and inter-instance relational knowledge integration strategy. Compared with state-of-the-art knowledge distillation techniques, the proposed method demonstrates improved performance for multi-disease classification on the large-scale dermoscopy database.

Preprint, 2022, arXiv

Toward Scalable Risk Analysis for Stochastic Systems Using Extreme Value Theory

We aim to analyze the behaviour of a finite-time stochastic system, whose model is not available, in the context of more rare and harmful outcomes. Standard estimators are not effective in making predictions about such outcomes due to their rarity. Instead, we use Extreme Value Theory (EVT), the theory of the long-term behaviour of normalized maxima of random variables. We quantify risk using the upper-semideviation $\rho(Y) = E(\max\{Y - \mu, 0\})$ of an integrable random variable $Y$ with mean $\mu = E(Y)$. $\rho(Y)$ is the risk-aware part of the common mean-upper-semideviation functional $\mu + \lambda \rho(Y)$ with $\lambda \in [0,1]$. To assess more rare and harmful outcomes, we propose an EVT-based estimator for $\rho(Y)$ in a given fraction of the worst cases. We show that our estimator enjoys a closed-form representation in terms of the popular conditional value-at-risk functional. In experiments, we illustrate the extrapolation power of our estimator using a small number of i.i.d. samples ($< 50$). Our approach is useful for estimating the risk of finite-time systems when models are inaccessible and data collection is expensive. The numerical complexity does not grow with the size of the state space.
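To make the risk functional above concrete, the following is a minimal sketch of a plain empirical estimate of the upper-semideviation and of the mean-upper-semideviation functional from i.i.d. samples. It is the naive sample-average estimator, not the EVT-based estimator proposed in the paper, and the function names are hypothetical.

```python
from statistics import fmean


def upper_semideviation(samples):
    """Sample estimate of E[max(Y - mu, 0)], with mu the sample mean."""
    mu = fmean(samples)
    # Only outcomes above the mean contribute to the risk term.
    return fmean([max(y - mu, 0.0) for y in samples])


def mean_upper_semideviation(samples, lam):
    """Sample estimate of mu + lam * rho(Y), with lam restricted to [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return fmean(samples) + lam * upper_semideviation(samples)


# Toy data with one large outcome: mean is 4, only the value 10 exceeds it.
samples = [1, 2, 3, 4, 10]
rho = upper_semideviation(samples)
risk = mean_upper_semideviation(samples, 0.5)
print(rho, risk)
```

With so few samples, an average like this says little about the rarest outcomes, which is the limitation the abstract attributes to standard estimators.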

Preprint, 2020, arXiv

Long-Range Gesture Recognition Using Millimeter Wave Radar

Millimeter wave (mmWave) based gesture recognition technology provides a good human-computer interaction (HCI) experience. Prior works focus on close-range gesture recognition but fall short in range extension, i.e., they are unable to recognize gestures more than one meter away from considerable noise motions. In this paper, we design a long-range gesture recognition model which utilizes a novel data processing method and a customized artificial Convolutional Neural Network (CNN). First, we break down gestures into multiple reflection points and extract their spatial-temporal features, which depict gesture details. Second, we design a CNN to learn the changing patterns of the extracted features respectively and output the recognition result. We thoroughly evaluate our proposed system by implementing it on a commodity mmWave radar. We also provide more extensive assessments to demonstrate that the proposed system is practical in several real-world scenarios.
