Source author record

Hanchao Yu

Hanchao Yu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, eess.IV, Machine Learning, Artificial Intelligence, Information Retrieval

Catalog footprint

What is connected

8 works

5 topics

4 close collaborators

preprint, 2026, arXiv

Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Inference time techniques such as decoding time scaling and self refinement have been shown to substantially improve mathematical reasoning in large language models (LLMs), largely attributed to emergent self correction and self verification behaviors often elicited through reinforcement learning (RL). In this work, we ask whether the same recipe transfers to vision language models (VLMs), especially RL finetuned variants that claim strong visual mathematical reasoning. Through extensive evaluation, we reach three main findings that differ markedly from text only models. First, generation time capability matters more than verification and refinement: simple majority voting consistently and substantially outperforms verification centric strategies such as best of N with self verification. Second, behaviors often associated with RL tuned models at inference time, such as the 'Aha moment,' do not yield reliable reasoning performance improvements. Third, visual information is not effectively integrated into the model's self verification process. Overall, our analysis highlights a key limitation: current RL trained VLMs derive limited benefit from self verification in the visual modality, which constrains the effectiveness of inference time scaling for visual mathematical reasoning.
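The two inference-time strategies contrasted in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names, the sampled answers, and the self-verification scores are all invented for the example, and the scores stand in for whatever a model would assign to its own outputs.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among N sampled generations."""
    counts = Counter(answers)
    # most_common(1) returns a one-element list of (answer, count).
    return counts.most_common(1)[0][0]


def best_of_n(answers, verifier_scores):
    """Return the answer with the highest (hypothetical) self-verification score."""
    best_index = max(range(len(answers)), key=lambda i: verifier_scores[i])
    return answers[best_index]


# Toy data, invented for illustration only.
samples = ["12", "15", "12", "12", "9"]
scores = [0.40, 0.90, 0.50, 0.45, 0.10]

voted = majority_vote(samples)       # picks the answer generated most often
verified = best_of_n(samples, scores)  # picks the answer the verifier trusts most
```

In this toy case the two strategies disagree: voting depends only on how often an answer is generated, while best of N depends entirely on the quality of the verifier's scores.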

preprint, 2026, arXiv

Efficient Sequential Recommendation for Long Term User Interest Via Personalization

Recent years have witnessed the success of sequential modeling, generative recommenders, and large language models for recommendation. Although the scaling law has been validated for sequential models, these models are computationally inefficient in real-world applications such as recommendation, because the cost of the transformer grows non-linearly (quadratically) with sequence length. To improve the efficiency of sequential models, we introduce a novel approach to sequential recommendation that leverages personalization techniques to enhance both efficiency and performance. Our method compresses long user interaction histories into learnable tokens, which are then combined with recent interactions to generate recommendations. This approach significantly reduces computational costs while maintaining high recommendation accuracy. Our method can be applied to existing transformer-based recommendation models, e.g., HSTU and HLLM. Extensive experiments on multiple sequential models demonstrate its versatility and effectiveness. Source code is available at https://github.com/facebookresearch/PerSRec.
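The efficiency argument in this abstract rests on the quadratic cost of self-attention. A rough back-of-the-envelope sketch follows; the history length, the number of learnable summary tokens, and the number of recent interactions are assumed values chosen for illustration, not figures from the paper, and the count ignores whatever it costs to produce the summary tokens in the first place.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in one full self-attention layer."""
    return seq_len * seq_len


def compressed_length(num_summary_tokens: int, num_recent: int) -> int:
    """Sequence length when a long history is replaced by a few summary tokens."""
    return num_summary_tokens + num_recent


# Assumed sizes, for illustration only.
full_history = 8192      # all past interactions fed to the model
summary_tokens = 64      # hypothetical learnable tokens summarizing old history
recent_items = 512       # recent interactions kept verbatim

full_cost = attention_pairs(full_history)
short_cost = attention_pairs(compressed_length(summary_tokens, recent_items))
savings_ratio = full_cost / short_cost
```

Because the cost is quadratic, shrinking the sequence by a factor of about 14 reduces the pairwise attention work by a factor of roughly 200 under these assumed sizes.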

preprint, 2026, arXiv

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: (1) how think and embedding tokens should be extracted from the same LLM backbone, and (2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, motivating a pilot study of adaptive think budget allocation based on task requirements.
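The abstract says the embedding tokens are trained with a contrastive loss but does not state which form. As a hedged illustration, the sketch below uses the common in-batch InfoNCE formulation, where each query's matching target is the positive and the other targets in the batch act as negatives. The function names, the temperature value, and the toy vectors are assumptions for the example, not details from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.05):
    """Mean in-batch InfoNCE loss; queries[i] is paired with targets[i]."""
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarities to every target, scaled by the temperature.
        logits = [dot(qi, tj) / temperature for tj in t]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings, invented for illustration.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query is closest to its own target and grows large when the pairs are mismatched, which is the signal that pulls matching embeddings together.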

preprint, 2021, arXiv

Study Group Learning: Improving Retinal Vessel Segmentation Trained with Noisy Labels

Retinal vessel segmentation from retinal images is an essential task for developing computer-aided diagnosis systems for retinal diseases. Efforts have been made on high-performance deep learning-based approaches to segment the retinal images in an end-to-end manner. However, the acquisition of retinal vessel images and segmentation labels requires onerous work from professional clinicians, which results in smaller training datasets with incomplete labels. As is well known, data-driven methods suffer from data insufficiency, and the models easily over-fit small-scale training data. The situation becomes more severe when the training vessel labels are incomplete or incorrect. In this paper, we propose a Study Group Learning (SGL) scheme to improve the robustness of the model trained on noisy labels. In addition, a learned enhancement map provides better visualization than conventional methods as an auxiliary tool for clinicians. Experiments demonstrate that the proposed method further improves the vessel segmentation performance on the DRIVE and CHASE_DB1 datasets, especially when the training labels are noisy.

preprint, 2020, arXiv

Anatomy-Aware Cardiac Motion Estimation

Cardiac motion estimation is critical to the assessment of cardiac function. Myocardium feature tracking (FT) can directly estimate cardiac motion from cine MRI, which requires no special scanning procedure. However, current deep learning-based FT methods may result in unrealistic myocardium shapes since the learning is solely guided by image intensities without considering anatomy. On the other hand, motion estimation through learning is challenging because ground-truth motion fields are almost impossible to obtain. In this study, we propose a novel Anatomy-Aware Tracker (AATracker) for cardiac motion estimation that preserves anatomy through weak supervision. A convolutional variational autoencoder (VAE) is trained to encapsulate realistic myocardium shapes. A baseline dense motion tracker is trained to approximate the motion fields and then refined to estimate anatomy-aware motion fields under the weak supervision from the VAE. We evaluate the proposed method on long-axis cardiac cine MRI, which has more complex myocardium appearances and motions than short-axis. Compared with other methods, AATracker significantly improves the tracking performance and provides visually more realistic tracking results, demonstrating the effectiveness of the proposed weak supervision scheme in cardiac motion estimation.

preprint, 2020, arXiv

FOAL: Fast Online Adaptive Learning for Cardiac Motion Estimation

Motion estimation of cardiac MRI videos is crucial for the evaluation of human heart anatomy and function. Recent research shows promising results with deep learning-based methods. In clinical deployment, however, these methods suffer dramatic performance drops due to mismatched distributions between training and testing datasets, which are commonly encountered in the clinical environment. On the other hand, it is arguably impossible to collect all representative datasets and to train a universal tracker before deployment. In this context, we propose a novel fast online adaptive learning (FOAL) framework: an online gradient descent based optimizer that is optimized by a meta-learner. The meta-learner enables the online optimizer to perform a fast and robust adaptation. We evaluated our method through extensive experiments on two public clinical datasets. The results showed the superior accuracy of FOAL compared to the offline-trained tracking method. On average, FOAL took only 0.4 seconds per video for online optimization.
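The idea of adapting a model to each test video with a few gradient steps can be illustrated with a deliberately tiny stand-in. The sketch below is not the FOAL method: it omits the meta-learner entirely, replaces the tracker with a single scalar weight, and uses a made-up reconstruction loss between two "frames" given as short lists of intensities. All names, the learning rate, and the step count are assumptions for the example.

```python
def online_adapt(weight, source_frame, target_frame, learning_rate=0.1, steps=5):
    """Run a few gradient-descent steps on one test sample.

    Toy model: predicted target intensity = weight * source intensity.
    Loss: mean squared error between the prediction and the observed target
    frame, which is available at test time without ground-truth labels.
    """
    n = len(source_frame)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(
            2.0 * (weight * s - t) * s for s, t in zip(source_frame, target_frame)
        ) / n
        weight -= learning_rate * grad
    return weight


# Toy "frames", invented for illustration: the target is exactly 2x the source.
source = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]

offline_weight = 0.0  # stands in for a model trained on mismatched data
adapted_weight = online_adapt(offline_weight, source, target)
```

Only a handful of steps are run per sample, which mirrors the abstract's emphasis on adaptation being fast enough to perform per video; in the actual framework the update rule itself is shaped by the meta-learner rather than fixed by hand.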

preprint, 2020, arXiv

Measure Anatomical Thickness from Cardiac MRI with Deep Neural Networks

Accurate estimation of shape thickness from medical images is crucial in clinical applications. For example, the thickness of the myocardium is one of the keys to cardiac disease diagnosis. While mathematical models are available to obtain accurate dense thickness estimation, they suffer from heavy computational overhead due to iterative solvers. To this end, we propose novel methods for dense thickness estimation, including a fast solver that estimates thickness from binary annular shapes and an end-to-end network that estimates thickness directly from raw cardiac images. We test the proposed models on three cardiac datasets and one synthetic dataset, achieving impressive results and generalizability on all. Thickness estimation is performed without iterative solvers or manual correction, which is 100 times faster than the mathematical model. We also analyze thickness patterns on different cardiac pathologies with a standard clinical model, and the results demonstrate the potential clinical value of our method for thickness-based cardiac disease diagnosis.

preprint, 2020, arXiv

Motion Pyramid Networks for Accurate and Efficient Cardiac Motion Estimation

Cardiac motion estimation plays a key role in MRI cardiac feature tracking and function assessment such as myocardium strain. In this paper, we propose Motion Pyramid Networks, a novel deep learning-based approach for accurate and efficient cardiac motion estimation. We predict and fuse a pyramid of motion fields from multiple scales of feature representations to generate a more refined motion field. We then use a novel cyclic teacher-student training strategy to make the inference end-to-end and further improve the tracking performance. Our teacher model provides more accurate motion estimation as supervision through progressive motion compensations. Our student model learns from the teacher model to estimate motion in a single step while maintaining accuracy. The teacher-student knowledge distillation is performed in a cyclic way for a further performance boost. Our proposed method significantly outperforms a strong baseline model on two publicly available clinical datasets, as evaluated by a variety of metrics and by inference time. New evaluation metrics are also proposed to represent errors in a clinically meaningful manner.
