Source author record

Yun Ye

Yun Ye appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Human-Computer Interaction

Catalog footprint

What is connected

10works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Are eHMIs always helpful? Investigating how eHMIs interfere with pedestrian behavior on multi-lane streets: An eye-tracking virtual reality experiment

Appropriate communication is crucial for efficient and safe interactions between pedestrians and autonomous vehicles (AVs). External human-machine interfaces (eHMIs) on AVs, which can be categorized as allocentric or egocentric, are considered a promising solution. While the effectiveness of eHMIs has been extensively studied, in complex environments, such as unsignalized multi-lane streets, their potential to interfere with pedestrian crossing behavior remains underexplored. Hence, a virtual reality-based experiment was conducted to examine how different types of eHMIs displayed on AVs affect the crossing behavior of pedestrians in multi-lane streets environments, with a focus on the gaze patterns of pedestrians during crossing. The results revealed that the presence of eHMIs significantly influenced the cognitive load on pedestrians and increased the possibility of distraction, even misleading pedestrians in cases involving multiple AVs on multi-lane streets. Notably, allocentric eHMIs induced higher cognitive loads and greater distraction in pedestrians than egocentric eHMIs. This was primarily evidenced by longer gaze time and higher proportions of attention for the eHMI on the interacting vehicle, as well as a broader distribution of gaze toward vehicles in the non-interacting lane. However, misleading behavior was mainly triggered by eHMI signals from yielding vehicles in the non-interacting lane. Under such asymmetric signal configurations, egocentric eHMIs resulted in a higher misjudgment rate than allocentric eHMIs. These findings highlight the importance of enhancing eHMI designs to balance the clarity and consistency of the displayed information across different perspectives, especially in complex multi-lane traffic scenarios. This study provides valuable insights regarding the application and standardization of future eHMI systems for AVs.

preprint2026arXiv

Enhancing Safety in Automated Ports: A Virtual Reality Study of Pedestrian-Autonomous Vehicle Interactions under Time Pressure, Visual Constraints, and Varying Vehicle Size

Autonomous driving improves traffic efficiency but presents safety challenges in complex port environments. This study investigates how environmental factors, traffic factors, and pedestrian characteristics influence interaction safety between autonomous vehicles and pedestrians in ports. Using virtual reality (VR) simulations of typical port scenarios, 33 participants completed pedestrian crossing tasks under varying visibility, vehicle sizes, and time pressure conditions. Results indicate that low-visibility conditions, partial occlusions and larger vehicle sizes significantly increase perceived risk, prompting pedestrians to wait longer and accept larger gaps. Specifically, pedestrians tended to accept larger gaps and waited longer when interacting with large autonomous truck platoons, reflecting heightened caution due to their perceived threat. However, local obstructions also reduce post-encroachment time, compressing safety margins. Individual attributes such as age, gender, and driving experience further shape decision-making, while time pressure undermines compensatory behaviors and increases risk. Based on these findings, safety strategies are proposed, including installing wide-angle cameras at multiple viewpoints, enabling real-time vehicle-infrastructure communication, enhancing port lighting and signage, and strengthening pedestrian safety training. This study offers practical recommendations for improving the safety and deployment of vision-based autonomous systems in port settings.

preprint2026arXiv

Wait or cross? Understanding the influence of behavioral tendency, trust, and risk perception on pedestrian gap-acceptance of automated truck platoons

Although automated trucks have the potential to improve freight efficiency, reduce costs, and address driver shortages, organizing two or more trucks in a convoy has raised considerable concerns for pedestrian safety. This study conducted a controlled experiment to examine the influence of behavioral tendency, trust, and risk perception on pedestrian intention to cross in front of an automated truck platoon. A total of 603 subjects participated in the virtual reality video-based questionnaire survey. By fusing the merits of structural equation modeling and artificial neural networks, a two-stage, hybrid model was developed to examine complex relationships between latent variables and gap-acceptance behaviors. Our results indicated that subjects watched an average of five vehicle gaps before starting crossing and the average time gap accepted was about 5.35 seconds. Risk perception not only played the most dominant role in shaping pedestrian crossing decisions, but also served as the strong bone, mediating the effects of behavioral tendency and trust on gap-acceptance. Participants who frequently violated traffic rules were more likely to accept a smaller time gap, while those who showed positive behaviors to other road users tended to wait for a larger time gap. Participants who often committed errors, showed aggressive behaviors, and held greater trust in the safety of automated trucks generally reported a lower level of risk for road-crossing in front of automated truck platoons. Built on these findings, a range of tailored countermeasures were proposed to ensure safer and smother interactions between pedestrians and automated truck platoons.

preprint2022arXiv

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

Autonomous driving perceives its surroundings for decision making, which is one of the most complex scenarios in visual perception. The success of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for fundamentally pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet performs 3D object detection in Bird-Eye-View (BEV), where most target values are defined and route planning can be handily performed. We merely reuse existing modules to build its framework but substantially develop its performance by constructing an exclusive data augmentation strategy and upgrading the Non-Maximum Suppression strategy. In the experiment, BEVDet offers an excellent trade-off between accuracy and time-efficiency. As a fast version, BEVDet-Tiny scores 31.2% mAP and 39.2% NDS on the nuScenes val set. It is comparable with FCOS3D, but requires just 11% computational budget of 215.3 GFLOPs and runs 9.2 times faster at 15.6 FPS. Another high-precision version dubbed BEVDet-Base scores 39.3% mAP and 47.2% NDS, significantly exceeding all published results. With a comparable inference speed, it surpasses FCOS3D by a large margin of +9.8% mAP and +10.0% NDS. The source code is publicly available for further research at https://github.com/HuangJunJie2017/BEVDet .

preprint2022arXiv

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects. However, the depth accuracy is limited due to the inherent ambiguity in monocular geometric modeling. In contrast, multi-frame depth estimation methods improve the depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly makes use of geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion and depth supervision. Therefore, we propose MOVEDepth, which exploits the MOnocular cues and VElocity guidance to improve multi-frame Depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key of our approach is to utilize monocular depth as a geometric priority to construct MVS cost volume, and adjust depth candidates of cost volume under the guidance of predicted camera velocity. We further fuse monocular depth and MVS depth by learning uncertainty in the cost volume, which results in a robust depth estimation against ambiguity in multi-view geometry. Extensive experiments show MOVEDepth achieves state-of-the-art performance: Compared with Monodepth2 and PackNet, our method relatively improves the depth accuracy by 20\% and 19.8\% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, relatively outperforming ManyDepth by 7.2\%. The code is available at https://github.com/JeffWang987/MOVEDepth.

preprint2022arXiv

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

Learning-based Multi-View Stereo (MVS) methods warp source images into the reference camera frustum to form 3D volumes, which are fused as a cost volume to be regularized by subsequent networks. The fusing step plays a vital role in bridging 2D semantics and 3D spatial associations. However, previous methods utilize extra networks to learn 2D information as fusing cues, underusing 3D spatial correlations and bringing additional computation costs. Therefore, we present MVSTER, which leverages the proposed epipolar Transformer to learn both 2D semantics and 3D spatial associations efficiently. Specifically, the epipolar Transformer utilizes a detachable monocular depth estimator to enhance 2D semantics and uses cross-attention to construct data-dependent 3D associations along epipolar line. Additionally, MVSTER is built in a cascade structure, where entropy-regularized optimal transport is leveraged to propagate finer depth estimations in each stage. Extensive experiments show MVSTER achieves state-of-the-art reconstruction performance with significantly higher efficiency: Compared with MVSNet and CasMVSNet, our MVSTER achieves 34% and 14% relative improvements on the DTU benchmark, with 80% and 51% relative reductions in running time. MVSTER also ranks first on Tanks&Temples-Advanced among all published works. Code is released at https://github.com/JeffWang987.

preprint2022arXiv

WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

Face benchmarks empower the research community to train and evaluate high-performance face recognition systems. In this paper, we contribute a new million-scale recognition benchmark, containing uncurated 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) training data, as well as an elaborately designed time-constrained evaluation protocol. Firstly, we collect 4M name lists and download 260M faces from the Internet. Then, a Cleaning Automatically utilizing Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set and we expect to close the data gap between academia and industry. Referring to practical deployments, Face Recognition Under Inference Time conStraint (FRUITS) protocol and a new test set with rich attributes are constructed. Besides, we gather a large-scale masked face sub-set for biometrics assessment under COVID-19. For a comprehensive evaluation of face matchers, three recognition tasks are performed under standard, masked and unbiased settings, respectively. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without tampering with the performance. Enabled by WebFace42M, we reduce 40% failure rate on the challenging IJB-C set and rank 3rd among 430 entries on NIST-FRVT. Even 10% data (WebFace4M) shows superior performance compared with the public training sets. Furthermore, comprehensive baselines are established under the FRUITS-100/500/1000 milliseconds protocols. The proposed benchmark shows enormous potential on standard, masked and unbiased face recognition scenarios. Our WebFace260M website is https://www.face-benchmark.org.

preprint2021arXiv

WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition

In this paper, we contribute a new million-scale face benchmark containing noisy 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) training data, as well as an elaborately designed time-constrained evaluation protocol. Firstly, we collect 4M name list and download 260M faces from the Internet. Then, a Cleaning Automatically utilizing Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set and we expect to close the data gap between academia and industry. Referring to practical scenarios, Face Recognition Under Inference Time conStraint (FRUITS) protocol and a test set are constructed to comprehensively evaluate face matchers. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without tampering with the performance. Empowered by WebFace42M, we reduce relative 40% failure rate on the challenging IJB-C set, and ranks the 3rd among 430 entries on NIST-FRVT. Even 10% data (WebFace4M) shows superior performance compared with public training set. Furthermore, comprehensive baselines are established on our rich-attribute test set under FRUITS-100ms/500ms/1000ms protocol, including MobileNet, EfficientNet, AttentionNet, ResNet, SENet, ResNeXt and RegNet families. Benchmark website is https://www.face-benchmark.org.

preprint2020arXiv

Channel Pruning via Optimal Thresholding

Structured pruning, especially channel pruning is widely used for the reduced computational cost and the compatibility with off-the-shelf hardware devices. Among existing works, weights are typically removed using a predefined global threshold, or a threshold computed from a predefined metric. The predefined global threshold based designs ignore the variation among different layers and weights distribution, therefore, they may often result in sub-optimal performance caused by over-pruning or under-pruning. In this paper, we present a simple yet effective method, termed Optimal Thresholding (OT), to prune channels with layer dependent thresholds that optimally separate important from negligible channels. By using OT, most negligible or unimportant channels are pruned to achieve high sparsity while minimizing performance degradation. Since most important weights are preserved, the pruned model can be further fine-tuned and quickly converge with very few iterations. Our method demonstrates superior performance, especially when compared to the state-of-the-art designs at high levels of sparsity. On CIFAR-100, a pruned and fine-tuned DenseNet-121 by using OT achieves 75.99% accuracy with only 1.46e8 FLOPs and 0.71M parameters.

preprint2020arXiv

PipeNet: Selective Modal Pipeline of Fusion Network for Multi-Modal Face Anti-Spoofing

Face anti-spoofing has become an increasingly important and critical security feature for authentication systems, due to rampant and easily launchable presentation attacks. Addressing the shortage of multi-modal face dataset, CASIA recently released the largest up-to-date CASIA-SURF Cross-ethnicity Face Anti-spoofing(CeFA) dataset, covering 3 ethnicities, 3 modalities, 1607 subjects, and 2D plus 3D attack types in four protocols, and focusing on the challenge of improving the generalization capability of face anti-spoofing in cross-ethnicity and multi-modal continuous data. In this paper, we propose a novel pipeline-based multi-stream CNN architecture called PipeNet for multi-modal face anti-spoofing. Unlike previous works, Selective Modal Pipeline (SMP) is designed to enable a customized pipeline for each data modality to take full advantage of multi-modal data. Limited Frame Vote (LFV) is designed to ensure stable and accurate prediction for video classification. The proposed method wins the third place in the final ranking of Chalearn Multi-modal Cross-ethnicity Face Anti-spoofing Recognition Challenge@CVPR2020. Our final submission achieves the Average Classification Error Rate (ACER) of 2.21 with Standard Deviation of 1.26 on the test set.

Yun Ye

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Are eHMIs always helpful? Investigating how eHMIs interfere with pedestrian behavior on multi-lane streets: An eye-tracking virtual reality experiment

Enhancing Safety in Automated Ports: A Virtual Reality Study of Pedestrian-Autonomous Vehicle Interactions under Time Pressure, Visual Constraints, and Varying Vehicle Size

Wait or cross? Understanding the influence of behavioral tendency, trust, and risk perception on pedestrian gap-acceptance of automated truck platoons

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition

Channel Pruning via Optimal Thresholding

PipeNet: Selective Modal Pipeline of Fusion Network for Multi-Modal Face Anti-Spoofing