Source author record

Liyuan Li

Liyuan Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Computer Vision, cond-mat.mes-hall, cond-mat.mtrl-sci, cond-mat.str-el, Human-Computer Interaction, Robotics

Catalog footprint

What is connected

5 works

7 topics

4 close collaborators

preprint · 2026 · arXiv

Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric

Dominant accuracy evaluation might reward unwarranted guessing by Large Language Models, and it might not be applicable to novel tasks for model validation without ground-truth (gt) annotation. Based on basic logic principles, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define the Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests and recent NaturalBench tests without the need for gt annotation. Through systematic experiments on the representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite significant progress of recent MLLMs on accuracy, logical consistency lags behind significantly. Extensive evaluations of the correlations of VL-LCM with metrics on gt, the reliability of LCM, and the relation of VL-LCM with response distribution justify the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that, beyond accuracy, logical consistency could be employed for both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.

preprint · 2022 · arXiv

Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition

Fine-grained action recognition is a challenging task in computer vision. As fine-grained datasets have small inter-class variations in spatial and temporal space, a fine-grained action recognition model requires good temporal reasoning and discrimination of attribute action semantics. Leveraging the CNN's ability to capture high-level spatial-temporal feature representations and the Transformer's modeling efficiency in capturing latent semantics and global dependencies, we investigate two frameworks that combine a CNN vision backbone and a Transformer encoder to enhance fine-grained action recognition: 1) a vision-based encoder to learn latent temporal semantics, and 2) a multi-modal video-text cross encoder to exploit additional text input and learn cross association between visual and text semantics. Our experimental results show that both of our Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality association, with improved recognition performance over the CNN vision model. We achieve new state-of-the-art performance on the FineGym benchmark dataset for both proposed architectures.

preprint · 2022 · arXiv

TAILOR: Teaching with Active and Incremental Learning for Object Registration

When deploying a robot to a new task, one often has to train it to detect novel objects, which is time-consuming and labor-intensive. We present TAILOR -- a method and system for object registration with active and incremental learning. When instructed by a human teacher to register an object, TAILOR is able to automatically select viewpoints to capture informative images by actively exploring viewpoints, and employs a fast incremental learning algorithm to learn new objects without potential forgetting of previously learned objects. We demonstrate the effectiveness of our method with a KUKA robot to learn novel objects used in a real-world gearbox assembly task through natural interactions.

preprint · 2020 · arXiv

Maximizing spin-orbit torque efficiency of Ta(O)/Py via modulating oxygen-induced interface orbital hybridization

Spin-orbit torques due to interfacial Rashba and spin Hall effects have been widely considered as a potentially more efficient approach than the conventional spin-transfer torque to control the magnetization of ferromagnets. We report a comprehensive study of spin-orbit torque efficiency in Ta(O)/Ni81Fe19 bilayers by tuning the low-level oxidation of β-phase tantalum, and find that the spin Hall angle θDL increases from ~ -0.18 for pure Ta/Py to a maximum value of ~ -0.30 for Ta(O)/Py with 7.8% oxidation. Furthermore, we distinguish the efficiency of the spin-orbit torque generated by the bulk spin Hall effect from that generated by the interfacial Rashba effect via a series of Py/Cu(0-2 nm)/Ta(O) control experiments. The latter shows a more than twofold enhancement, even more significant than that of the former at the optimum oxidation level. Our results indicate that the 65% enhancement of the efficiency should be related to the modulation of the interfacial Rashba-like spin-orbit torque due to oxygen-induced orbital hybridization across the interface. Our results suggest that the modulation of interfacial coupling via oxygen-induced orbital hybridization can be an alternative method to boost the charge-spin conversion rate.

preprint · 2016 · arXiv

Eye-2-I: Eye-tracking for just-in-time implicit user profiling

For many applications, such as targeted advertising and content recommendation, knowing users' traits and interests is a prerequisite. User profiling is a helpful approach for this purpose. However, current methods, i.e. self-reporting, web-activity monitoring and social media mining, are either intrusive or require data over long periods of time. Recently, there has been growing evidence in cognitive science that a variety of user profile attributes are significantly correlated with eye-tracking data. We propose a novel just-in-time implicit profiling method, Eye-2-I, which learns the user's interests, demographic and personality traits from the eye-tracking data while the user is watching videos. Although seemingly conspicuous by closely monitoring the user's eye behaviors, our method is unobtrusive and privacy-preserving owing to its unique characteristics, including (1) fast speed - the profile is available by the first video shot, typically within a few seconds, and (2) self-contained - not relying on historical data or functional modules. As a proof-of-concept, our method is evaluated in a user study with 51 subjects. It achieved a mean accuracy of 0.89 on 37 attributes of user profile with 9 minutes of eye-tracking data.