Researcher profile

Yilei Shi

Yilei Shi contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2026arXiv

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Despite their intuitive nature for humans and straightforward construction, these tasks pose significant challenges to current models. Our evaluation of ten state-of-the-art models reveals the following findings: All models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Moreover, Multimodal Large Language Model-based embedding models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings

preprint2026arXiv

Reconstructing Building Height from Spaceborne TomoSAR Point Clouds Using a Dual-Topology Network

Reliable building height estimation is essential for various urban applications. Spaceborne SAR tomography (TomoSAR) provides weather-independent, side-looking observations that capture facade-level structure, offering a promising alternative to conventional optical methods. However, TomoSAR point clouds often suffer from noise, anisotropic point distributions, and data voids on incoherent surfaces, all of which hinder accurate height reconstruction. To address these challenges, we introduce a learning-based framework for converting raw TomoSAR points into high-resolution building height maps. Our dual-topology network alternates between a point branch that models irregular scatterer features and a grid branch that enforces spatial consistency. By jointly processing these representations, the network denoises the input points and inpaints missing regions to produce continuous height estimates. To our knowledge, this is the first proof of concept for large-scale urban height mapping directly from TomoSAR point clouds. Extensive experiments on data from Munich and Berlin validate the effectiveness of our approach. Moreover, we demonstrate that our framework can be extended to incorporate optical satellite imagery, further enhancing reconstruction quality. The source code is available at https://github.com/zhu-xlab/tomosar2height.

preprint2022arXiv

Disentangled Latent Transformer for Interpretable Monocular Height Estimation

Monocular height estimation (MHE) from remote sensing imagery has high potential in generating 3D city models efficiently for a quick response to natural disasters. Most existing works pursue higher performance. However, there is little research exploring the interpretability of MHE networks. In this paper, we target at exploring how deep neural networks predict height from a single monocular image. Towards a comprehensive understanding of MHE networks, we propose to interpret them from multiple levels: 1) Neurons: unit-level dissection. Exploring the semantic and height selectivity of the learned internal deep representations; 2) Instances: object-level interpretation. Studying the effects of different semantic classes, scales, and spatial contexts on height estimation; 3) Attribution: pixel-level analysis. Understanding which input pixels are important for the height estimation. Based on the multi-level interpretation, a disentangled latent Transformer network is proposed towards a more compact, reliable, and explainable deep model for monocular height estimation. Furthermore, a novel unsupervised semantic segmentation task based on height estimation is first introduced in this work. Additionally, we also construct a new dataset for joint semantic segmentation and height estimation. Our work provides novel insights for both understanding and designing MHE models.

preprint2022arXiv

GLF-CR: SAR-Enhanced Cloud Removal with Global-Local Fusion

The challenge of the cloud removal task can be alleviated with the aid of Synthetic Aperture Radar (SAR) images that can penetrate cloud cover. However, the large domain gap between optical and SAR images as well as the severe speckle noise of SAR images may cause significant interference in SAR-based cloud removal, resulting in performance degeneration. In this paper, we propose a novel global-local fusion based cloud removal (GLF-CR) algorithm to leverage the complementary information embedded in SAR images. Exploiting the power of SAR information to promote cloud removal entails two aspects. The first, global fusion, guides the relationship among all local optical windows to maintain the structure of the recovered region consistent with the remaining cloud-free regions. The second, local fusion, transfers complementary information embedded in the SAR image that corresponds to cloudy areas to generate reliable texture details of the missing regions, and uses dynamic filtering to alleviate the performance degradation caused by speckle noise. Extensive evaluation demonstrates that the proposed algorithm can yield high quality cloud-free images and outperform state-of-the-art cloud removal algorithms with a gain about 1.7dB in terms of PSNR on SEN12MS-CR dataset.

preprint2022arXiv

Lebanon Solar Rooftop Potential Assessment using Buildings Segmentation from Aerial Images

Estimating solar rooftop potential at a national level is a fundamental building block for every country to utilize solar power efficiently. Solar rooftop potential assessment relies on several features such as building geometry, location, and surrounding facilities. Hence, national-level approximations that do not take these factors into deep consideration are often inaccurate. This paper introduces Lebanon's first comprehensive footprint and solar rooftop potential maps using deep learning-based instance segmentation to extract buildings' footprints from satellite images. A photovoltaic panels placement algorithm that considers the morphology of each roof is proposed. We show that the average rooftop's solar potential can fulfill the yearly electric needs of a single-family residence while using only 5% of the roof surface. The usage of 50% of a residential apartment rooftop area would achieve energy security for up to 8 households. We also compute the average and total solar rooftop potential per district to localize regions corresponding to the highest and lowest solar rooftop potential yield. Factors such as size, ground coverage ratio and PV_out are carefully investigated for each district. Baalbeck district yielded the highest total solar rooftop potential despite its low built-up area. While, Beirut capital city has the highest average solar rooftop potential due to its extremely populated urban nature. Reported results and analysis reveal solar rooftop potential urban patterns and provides policymakers and key stakeholders with tangible insights. Lebanon's total solar rooftop potential is about 28.1 TWh/year, two times larger than the national energy consumption in 2019.

preprint2022arXiv

Semi-Supervised Building Footprint Generation with Feature and Output Consistency Training

Accurate and reliable building footprint maps are vital to urban planning and monitoring, and most existing approaches fall back on convolutional neural networks (CNNs) for building footprint generation. However, one limitation of these methods is that they require strong supervisory information from massive annotated samples for network learning. State-of-the-art semi-supervised semantic segmentation networks with consistency training can help to deal with this issue by leveraging a large amount of unlabeled data, which encourages the consistency of model output on data perturbation. Considering that rich information is also encoded in feature maps, we propose to integrate the consistency of both features and outputs in the end-to-end network training of unlabeled samples, enabling to impose additional constraints. Prior semi-supervised semantic segmentation networks have established the cluster assumption, in which the decision boundary should lie in the vicinity of low sample density. In this work, we observe that for building footprint generation, the low-density regions are more apparent at the intermediate feature representations within the encoder than the encoder's input or output. Therefore, we propose an instruction to assign the perturbation to the intermediate feature representations within the encoder, which considers the spatial resolution of input remote sensing imagery and the mean size of individual buildings in the study area. The proposed method is evaluated on three datasets with different resolutions: Planet dataset (3 m/pixel), Massachusetts dataset (1 m/pixel), and Inria dataset (0.3 m/pixel). Experimental results show that the proposed approach can well extract more complete building structures and alleviate omission errors.

preprint2022arXiv

The Outcome of the 2022 Landslide4Sense Competition: Advanced Landslide Detection from Multi-Source Satellite Imagery

The scientific outcomes of the 2022 Landslide4Sense (L4S) competition organized by the Institute of Advanced Research in Artificial Intelligence (IARAI) are presented here. The objective of the competition is to automatically detect landslides based on large-scale multiple sources of satellite imagery collected globally. The 2022 L4S aims to foster interdisciplinary research on recent developments in deep learning (DL) models for the semantic segmentation task using satellite imagery. In the past few years, DL-based models have achieved performance that meets expectations on image interpretation, due to the development of convolutional neural networks (CNNs). The main objective of this article is to present the details and the best-performing algorithms featured in this competition. The winning solutions are elaborated with state-of-the-art models like the Swin Transformer, SegFormer, and U-Net. Advanced machine learning techniques and strategies such as hard example mining, self-training, and mix-up data augmentation are also considered. Moreover, we describe the L4S benchmark data set in order to facilitate further comparisons, and report the results of the accuracy assessment online. The data is accessible on \textit{Future Development Leaderboard} for future evaluation at \url{https://www.iarai.ac.at/landslide4sense/challenge/}, and researchers are invited to submit more prediction results, evaluate the accuracy of their methods, compare them with those of other users, and, ideally, improve the landslide detection results reported in this article.

preprint2021arXiv

$\boldsymbolγ$-Net: Superresolving SAR Tomographic Inversion via Deep Learning

Synthetic aperture radar tomography (TomoSAR) has been extensively employed in 3-D reconstruction in dense urban areas using high-resolution SAR acquisitions. Compressive sensing (CS)-based algorithms are generally considered as the state of the art in super-resolving TomoSAR, in particular in the single look case. This superior performance comes at the cost of extra computational burdens, because of the sparse reconstruction, which cannot be solved analytically and we need to employ computationally expensive iterative solvers. In this paper, we propose a novel deep learning-based super-resolving TomoSAR inversion approach, $\boldsymbolγ$-Net, to tackle this challenge. $\boldsymbolγ$-Net adopts advanced complex-valued learned iterative shrinkage thresholding algorithm (CV-LISTA) to mimic the iterative optimization step in sparse reconstruction. Simulations show the height estimate from a well-trained $\boldsymbolγ$-Net approaches the Cramér-Rao lower bound while improving the computational efficiency by 1 to 2 orders of magnitude comparing to the first-order CS-based methods. It also shows no degradation in the super-resolution power comparing to the state-of-the-art second-order TomoSAR solvers, which are much more computationally expensive than the first-order methods. Specifically, $\boldsymbolγ$-Net reaches more than $90\%$ detection rate in moderate super-resolving cases at 25 measurements at 6dB SNR. Moreover, simulation at limited baselines demonstrates that the proposed algorithm outperforms the second-order CS-based method by a fair margin. Test on real TerraSAR-X data with just 6 interferograms also shows high-quality 3-D reconstruction with high-density detected double scatterers.

preprint2020arXiv

Instance segmentation of buildings using keypoints

Building segmentation is of great importance in the task of remote sensing imagery interpretation. However, the existing semantic segmentation and instance segmentation methods often lead to segmentation masks with blurred boundaries. In this paper, we propose a novel instance segmentation network for building segmentation in high-resolution remote sensing images. More specifically, we consider segmenting an individual building as detecting several keypoints. The detected keypoints are subsequently reformulated as a closed polygon, which is the semantic boundary of the building. By doing so, the sharp boundary of the building could be preserved. Experiments are conducted on selected Aerial Imagery for Roof Segmentation (AIRS) dataset, and our method achieves better performance in both quantitative and qualitative results with comparison to the state-of-the-art methods. Our network is a bottom-up instance segmentation method that could well preserve geometric details.