Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
27 works
0 followers
21 topics
4 close collaborators


Published work

27 published item(s)

preprint, 2026, arXiv

A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems

Conventional recommendation systems frequently fail to fully exploit the high-dimensional semantic signals inherent in multimedia content, thereby limiting the fidelity of user preference modeling. While Multimodal Large Language Models (MM-LLMs) offer robust mechanisms for interpreting such complex data, their integration into latency-constrained, industrial-scale architectures remains a significant challenge. To address this, we propose a generalized framework for MM-LLM-driven multimedia understanding. Our methodology employs a tripartite architecture encompassing content interpretation, representation extraction, and systematic pipeline integration, instantiated via a LLaMA2-based model that generates descriptive captions subsequently ingested as tokenized categorical features. Empirical evaluation demonstrates the efficacy of this approach, yielding a $0.35\%$ increase in offline AUC and a $0.02\%$ improvement in online metrics at scale, substantiating the practical viability of leveraging MM-LLMs to enhance large-scale recommendation performance.

preprint, 2026, arXiv

Intelligent Nano-Fingerprinting: An Efficient and Precise Approach for Liquid Biopsy

Biological matrices are rich in information related to life processes, serving as invaluable media for assessing an individual's overall physiological status and its dynamic fluctuations, as well as crucial foundations for disease diagnosis. However, the inherent complexity of these matrices, coupled with our incomplete understanding of their full composition, presents significant challenges for comprehensive analysis and accurate diagnostic interpretation. The advent of single-molecule technologies has revolutionized biomedical research, enabling the direct observation of life processes at the molecular scale. We have proposed an Intelligent Nano-Fingerprinting strategy based on single-molecule nanopore technology, designed to capture the global molecular fingerprints of complex plasma matrices. Furthermore, we developed an intelligent algorithmic model capable of achieving precise classification of plasma samples. This approach is characterized by its simplicity, efficiency, and considerable potential for large-scale adoption and transferable applications.

preprint, 2026, arXiv

Learning Geometric Invariance for Gait Recognition

The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner to pull different gait conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective to achieve gait recognition: variations in different gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need is to determine the types of geometric transformations and achieve geometric invariance, then identity invariance naturally follows. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a $\mathcal{R}$eflect-$\mathcal{R}$otate-$\mathcal{S}$cale invariance learning framework, named ${\mathcal{RRS}}$-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.

preprint, 2026, arXiv

Meta-Backscatter: Long-Distance Battery-Free Metamaterial-Backscatter Sensing and Communication

Battery-free Internet of Things (BF-IoT) enabled by backscatter communication is a rapidly evolving technology offering advantages of low cost, ultra-low power consumption, and robustness. However, the practical deployment of BF-IoT is significantly constrained by the limited communication range of common backscatter tags, which typically operate with a range of merely a few meters due to inherent round-trip path loss. Meta-backscatter systems that utilize metamaterial tags present a promising solution, retaining the inherent advantages of BF-IoT while breaking the critical communication range barrier. By leveraging densely paved sub-wavelength units to concentrate the reflected signal power, metamaterial tags enable a significant communication range extension over existing BF-IoT tags that employ omni-directional antennas. In this paper, we synthesize the principles and paradigms of metamaterial sensing to establish a unified design framework and a forward-looking research roadmap. Specifically, we first provide an overview of backscatter communication, encompassing its development history, working principles, and tag classification. We then introduce the design methodology for both metamaterial tags and their compatible transceivers. Moreover, we present the implementation of a meta-backscatter system prototype and report the experimental results based on it. Finally, we conclude by highlighting key challenges and outlining potential avenues for future research.

preprint, 2026, arXiv

Quantum tunnelling-integrated optoplasmonic nanotrap enables conductance visualisation of individual proteins

Biological electron transfer (ET) relies on quantum mechanical tunnelling through a dynamically folded protein. Yet, the spatiotemporal coupling between structural fluctuations and electron flux remains poorly understood, largely due to limitations in existing experimental techniques, such as ensemble averaging and non-physiological operating conditions. Here, we introduce a quantum tunnelling-integrated optoplasmonic nanotrap (QTOP-trap), an optoelectronic platform that combines plasmonic optical trapping with real-time quantum tunnelling measurements. This label-free approach enables single-molecule resolution of protein conductance in physiological electrolytes, achieving sub-3 nm spatial precision and 10-μs temporal resolution. By synchronising optoelectronic measurements, QTOP-trap resolves protein-specific conductance signatures and directly correlates tertiary structure dynamics with conductance using a "protein switch" strategy. This methodology establishes a universal framework for dissecting non-equilibrium ET mechanisms in individual conformational-active proteins, with broad implications for bioenergetics research and biomimetic quantum device design.

preprint, 2026, arXiv

ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in suboptimal tradeoff between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
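
The abstract describes choosing the data mix with a multi-armed bandit. The sketch below illustrates that general idea only and is not the authors' algorithm: it uses the standard UCB1 rule, whereas ROAD is guided by a surrogate objective approximating the bi-level gradient. The class `MixingRatioBandit`, the helper `sample_batch`, the candidate ratios, and the scalar reward (for example, a recent evaluation return scaled to [0, 1]) are all hypothetical names and assumptions introduced for illustration.

```python
import math
import random


class MixingRatioBandit:
    """UCB1 bandit whose arms are candidate offline-data fractions (hypothetical)."""

    def __init__(self, ratios):
        self.ratios = list(ratios)
        self.counts = [0] * len(self.ratios)   # number of pulls per arm
        self.means = [0.0] * len(self.ratios)  # running mean reward per arm

    def select(self):
        # Play every arm once before applying the UCB1 rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean reward for the chosen arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def sample_batch(offline, online, ratio, batch_size, rng):
    """Draw a batch with roughly `ratio` of its items from the offline buffer."""
    n_offline = round(ratio * batch_size)
    batch = [rng.choice(offline) for _ in range(n_offline)]
    batch += [rng.choice(online) for _ in range(batch_size - n_offline)]
    return batch
```

In a training loop, one would call `select()` before each fine-tuning phase, build batches with `sample_batch` using `ratios[arm]`, and pass the observed reward to `update`. Both buffers are assumed to be non-empty.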

preprint, 2023, arXiv

Towards Exascale Computation for Turbomachinery Flows

A state-of-the-art large eddy simulation code has been developed to solve compressible flows in turbomachinery. The code has been engineered with a high degree of scalability, enabling it to effectively leverage the many-core architecture of the new Sunway system. A consistent performance of 115.8 DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of over 1.69 billion mesh elements and 865 billion degrees of freedom (DOFs). By leveraging a high-order unstructured solver and its portability to large heterogeneous parallel systems, we have progressed towards solving the grand challenge problem outlined by NASA, which involves a time-dependent simulation of a complete engine, incorporating all the aerodynamic and heat transfer components.

preprint, 2022, arXiv

ApolloRL: a Reinforcement Learning Platform for Autonomous Driving

We introduce ApolloRL, an open platform for research in reinforcement learning for autonomous driving. The platform provides a complete closed-loop pipeline with training, simulation, and evaluation components. It comes with 300 hours of real-world data in driving scenarios and popular baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) agents. We elaborate in this paper on the architecture and the environment defined in the platform. In addition, we discuss the performance of the baseline agents in the ApolloRL environment.

preprint, 2022, arXiv

BinGo: Pinpointing Concurrency Bugs in Go via Binary Analysis

Golang (Go for short) has become popular for building concurrent programs in distributed systems. Among its distinctive features, Go employs lightweight goroutines to support a high degree of parallelism in user space. Moreover, Go leverages channels to enable explicit communication among threads. However, recent studies show that concurrency bugs are not uncommon in Go applications. Pinpointing these concurrency bugs in real Go applications is both important and challenging. Existing approaches are mostly based on compiler-aided static or dynamic analysis, which has two limitations. First, existing approaches require the availability and recompilation of the source code, which works well in testing but not in production environments where no source code is available for applications or external libraries. Second, existing approaches work on pure Go code bases only, not on programs that mix Go with other languages. To address these limitations, we develop BinGo, the first tool to identify concurrency bugs in Go applications via dynamic binary analysis. BinGo correlates binary execution with Go semantics and employs novel bug detection algorithms. BinGo is an end-to-end tool that is ready for deployment in the production environment with no modification to source code, compilers, or runtimes in the Go ecosystem. Our experiments show that BinGo has a high coverage of concurrency bugs with no false positives. We are able to use BinGo to identify concurrency bugs in real applications with moderate overhead.

preprint, 2022, arXiv

Deep Learning-based Occluded Person Re-identification: A Survey

Occluded person re-identification (Re-ID) aims at addressing the occlusion problem when retrieving the person of interest across multiple cameras. With the promotion of deep learning technology and the increasing demand for intelligent video surveillance, the frequent occlusion in real-world applications has made occluded person Re-ID draw considerable interest from researchers. A large number of occluded person Re-ID methods have been proposed while there are few surveys that focus on occlusion. To fill this gap and help boost future research, this paper provides a systematic survey of occluded person Re-ID. Through an in-depth analysis of the occlusion in person Re-ID, most existing methods are found to only consider part of the problems brought by occlusion. Therefore, we review occlusion-related person Re-ID methods from the perspective of issues and solutions. We summarize four issues caused by occlusion in person Re-ID, i.e., position misalignment, scale misalignment, noisy information, and missing information. The occlusion-related methods addressing different issues are then categorized and introduced accordingly. After that, we summarize and compare the performance of recent occluded person Re-ID methods on four popular datasets: Partial-ReID, Partial-iLIDS, Occluded-ReID, and Occluded-DukeMTMC. Finally, we provide insights on promising future research directions.

preprint, 2022, arXiv

Functional varying index coefficient model for dynamic gene-environment interactions

Rooted in genetics, human complex diseases are largely influenced by environmental factors. Existing literature has shown the power of integrative gene-environment interaction analysis by considering the joint effect of environmental mixtures on a disease risk. In this work, we propose a functional varying index coefficient model for longitudinal measurements of a phenotypic trait together with multiple environmental variables, and assess how the genetic effects on a longitudinal disease trait are nonlinearly modified by a mixture of environmental influences. We derive an estimation procedure for the nonparametric functional varying index coefficients under the quadratic inference function and penalized spline framework. Theoretical results such as estimation consistency and asymptotic normality of the estimates are established. In addition, we propose a hypothesis testing procedure to assess the significance of the nonparametric index coefficient function. We evaluate the performance of our estimation and testing procedure through Monte Carlo simulation studies. The proposed method is illustrated by applying to a real data set from a pain sensitivity study in which SNP effects are nonlinearly modulated by the combination of dosage levels and other environmental variables to affect patients' blood pressure and heart rate.

preprint, 2022, arXiv

Mars Entry Trajectory Planning with Range Discretization and Successive Convexification

This paper develops a sequential convex programming approach for Mars entry trajectory planning by range discretization. To improve the accuracy of numerical integration, the range of entry trajectory is selected as the independent variable rather than time or energy. A dilation factor is employed to normalize the entry dynamics and integration interval of the performance index so that the difficult free-final-time programming problem can be converted to a fixed-final-range optimization problem. The bank angle rate with respect to the range is introduced as the new control input in order to decouple the control from the state and facilitate convexification of constraints on the bank angle and its rate. The nonlinear bank angle rate constraint is further relaxed into a linear one via inequality relaxation. Moreover, the nonconvex minimum-time performance index is convexified by regarding flight time as a state variable. Then, the Mars entry trajectory planning problem can be formulated into the framework of convex programming after linearization. By range discretization and successive convexification, the reformulated Mars entry trajectory planning problem is transcribed into a series of convex optimization sub-problems that can be sequentially solved using the convex programming algorithm. The virtual control and adaptive trust-region techniques are employed to improve the accuracy, robustness, and computation efficiency of the algorithm. Numerical simulations with comparative studies are presented to demonstrate the convergence performance and efficiency of the proposed algorithm.

preprint, 2022, arXiv

Multispectral large-area X-ray imaging enabled by stacked multilayer scintillators

Conventional energy-integration black-white X-ray imaging lacks spectral information of X-ray photons. Although X-ray spectra (energy) can be distinguished by the photon-counting technique, typically with CdZnTe detectors, it is very challenging to apply this technique to large-area flat-panel X-ray imaging (FPXI). Herein, we design multi-layer stacked scintillators with different X-ray absorption capabilities and scintillation spectra. In this scenario, the X-ray energy can be discriminated by detecting the emission spectrum of each scintillator, so multispectral X-ray imaging can be obtained with a color or multispectral visible-light camera in a single X-ray shot. To verify this idea, stacked multilayer scintillators based on several emerging metal halides were fabricated by a cost-effective and scalable solution process, and proof-of-concept multi-energy FPXI was experimentally demonstrated. The dual-energy X-ray image of a bone-muscle model clearly showed details that were invisible in conventional energy-integration FPXI. By stacking four layers of specifically designed multilayer scintillators with appropriate thicknesses, a prototype FPXI with four energy channels was realized, proving its extendibility to multispectral or even hyperspectral X-ray imaging. This study provides a facile and effective strategy to realize energy-resolved flat-panel X-ray imaging.

preprint, 2022, arXiv

OJXPerf: Featherlight Object Replica Detection for Java Programs

Memory bloat is an important source of inefficiency in complex production software, especially in software written in managed languages such as Java. Prior approaches to this problem have focused on identifying objects that outlive their life span. Few studies have, however, looked into whether and to what extent myriad objects of the same type are identical. A quantitative assessment of identical objects with code-level attribution can assist developers in refactoring code to eliminate object bloat, and favor reuse of existing object(s). The result is reduced memory pressure, reduced allocation and garbage collection, enhanced data locality, and reduced re-computation, all of which result in superior performance. We develop OJXPerf, a lightweight sampling-based profiler, which probabilistically identifies identical objects. OJXPerf employs hardware performance monitoring units (PMU) in conjunction with hardware debug registers to sample and compare field values of different objects of the same type allocated at the same calling context but potentially accessed at different program points. The result is a lightweight measurement, a combination of object allocation contexts and usage contexts ordered by duplication frequency. This class of duplicated objects is relatively easier to optimize. OJXPerf incurs 9% runtime and 6% memory overheads on average. We empirically show the benefit of OJXPerf by using its profiles to instruct us to optimize a number of Java programs, including well-known benchmarks and real-world applications. The results show a noticeable reduction in memory usage (up to 11%) and a significant speedup (up to 25%).

preprint, 2022, arXiv

Rapid Elastic Architecture Search under Specialized Classes and Resource Constraints

In many real-world applications, we often need to handle various deployment scenarios, where the resource constraint and the superclass of interest corresponding to a group of classes are dynamically specified. How to efficiently deploy deep models for diverse deployment scenarios is a new challenge. Previous NAS approaches seek to design architectures for all classes simultaneously, which may not be optimal for some individual superclasses. A straightforward solution is to search an architecture from scratch for each deployment scenario, which however is computation-intensive and impractical. To address this, we present a novel and general framework, called Elastic Architecture Search (EAS), permitting instant specializations at runtime for diverse superclasses with various resource constraints. To this end, we first propose to effectively train an over-parameterized network via a superclass dropout strategy during training. In this way, the resulting model is robust to the subsequent superclasses dropping at inference time. Based on the well-trained over-parameterized network, we then propose an efficient architecture generator to obtain promising architectures within a single forward pass. Experiments on three image classification datasets show that EAS is able to find more compact networks with better performance while remarkably being orders of magnitude faster than state-of-the-art NAS methods, e.g., outperforming OFA (once-for-all) by 1.3% on Top-1 accuracy at a budget around 361M #MAdds on ImageNet-10. More critically, EAS is able to find compact architectures within 0.1 second for 50 deployment scenarios.

preprint, 2022, arXiv

Temperature Field Inversion of Heat-Source Systems via Physics-Informed Neural Networks

Temperature field inversion of heat-source systems (TFI-HSS) with limited observations is essential to monitor the system health. Although some methods such as interpolation have been proposed to solve TFI-HSS, those existing methods ignore correlations between data constraints and physics constraints, resulting in low precision. In this work, we develop a physics-informed neural network-based temperature field inversion (PINN-TFI) method to solve the TFI-HSS task and a coefficient matrix condition number based position selection of observations (CMCN-PSO) method to select optimal positions for noisy observations. For the TFI-HSS task, the PINN-TFI method encodes constraint terms into the loss function, so the task is transformed into an optimization problem of minimizing the loss function. In addition, we have found that noisy observations significantly affect the reconstruction performance of the PINN-TFI method. To alleviate the effect of noisy observations, the CMCN-PSO method is proposed to find optimal positions, where the condition number of observations is used to evaluate positions. The results demonstrate that the PINN-TFI method can significantly improve prediction precision and the CMCN-PSO method can find good positions to acquire a more robust temperature field.

preprint, 2022, arXiv

The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems

This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. The extensive evaluations with reference implementations show the future research opportunities for important aspects of federated learning systems. We have developed reference implementations, and evaluated the important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time. Through these evaluations, we discovered some interesting findings such as federated learning can effectively increase end-to-end throughput.

preprint, 2022, arXiv

Transformers Meet Visual Learning Understanding: A Comprehensive Review

A dynamic attention mechanism and global modeling ability give the Transformer strong feature learning ability. In recent years, the Transformer has become comparable to CNN-based methods in computer vision. This review investigates current research progress on the Transformer in image and video applications and gives a comprehensive overview of the Transformer in visual learning understanding. First, the attention mechanism, which plays an essential part in the Transformer, is reviewed. Second, the visual Transformer model and the principle of each module are introduced. Third, existing Transformer-based models are investigated, and their performance is compared in visual learning understanding applications. Three image tasks and two video tasks of computer vision are investigated. The former include image classification, object detection, and image segmentation. The latter comprise object tracking and video classification. The review compares the performance of different models across these tasks on several public benchmark data sets. Finally, ten general problems are summarized, and the development prospects of the visual Transformer are given.

preprint, 2021, arXiv

A novel meta-learning initialization method for physics-informed neural networks

Physics-informed neural networks (PINNs) have been widely used to solve various scientific computing problems. However, large training costs limit PINNs for some real-time applications. Although some works have been proposed to improve the training efficiency of PINNs, few consider the influence of initialization. To this end, we propose a New Reptile initialization based Physics-Informed Neural Network (NRPINN). The original Reptile algorithm is a meta-learning initialization method based on labeled data. PINNs can be trained with less labeled data or even without any labeled data by adding partial differential equations (PDEs) as a penalty term into the loss function. Inspired by this idea, we propose the new Reptile initialization to sample more tasks from the parameterized PDEs and adapt the penalty term of the loss. The new Reptile initialization can acquire initialization parameters from related tasks by supervised, unsupervised, and semi-supervised learning. Then, PINNs with initialization parameters can efficiently solve PDEs. Besides, the new Reptile initialization can also be used for the variants of PINNs. Finally, we demonstrate and verify the NRPINN considering both forward problems, including solving Poisson, Burgers, and Schrödinger equations, as well as inverse problems, where unknown parameters in the PDEs are estimated. Experimental results show that the NRPINN training is much faster and achieves higher accuracy than PINNs with other initialization methods.
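
For orientation, the original Reptile meta-update that the abstract builds on can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the NRPINN method: each "task" here is a hypothetical one-dimensional quadratic loss `(theta - target)^2` standing in for a PDE-penalty loss, and the function names and hyperparameters are illustrative only.

```python
import random


def inner_sgd(theta, target, lr=0.1, steps=5):
    """Adapt theta to one task with a few gradient steps on (theta - target)^2."""
    for _ in range(steps):
        grad = 2.0 * (theta - target)
        theta -= lr * grad
    return theta


def reptile(task_targets, meta_lr=0.5, meta_steps=200, seed=0):
    """Reptile: move the initialization toward the task-adapted parameters."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(meta_steps):
        target = rng.choice(task_targets)      # sample a task
        adapted = inner_sgd(theta, target)     # inner-loop adaptation
        theta += meta_lr * (adapted - theta)   # Reptile meta-update
    return theta
```

With targets such as `[1.0, 3.0]`, the returned initialization settles between the task optima, so a few inner steps suffice to adapt to either task. In the PINN setting described above, the inner loss would instead include the PDE residual penalty.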

preprint, 2021, arXiv

NumaPerf: Predictive and Full NUMA Profiling

Achieving optimal performance for parallel applications on the NUMA architecture is extremely challenging, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share similar shortcomings in portability, effectiveness, and helpfulness. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, instead of only on the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses. NumaPerf further detects potential thread migrations and load imbalance issues that could significantly affect the performance but are omitted by existing profilers. NumaPerf also separates cache coherence issues that may require different fix strategies. Based on our extensive evaluation, NumaPerf is able to identify more performance issues than any existing tool, while fixing these bugs leads to up to 5.94x performance speedup.

preprint, 2020, arXiv

An entanglement-based quantum network based on symmetric dispersive optics quantum key distribution

Quantum key distribution (QKD) is a crucial technology for information security in the future. Developing simple and efficient ways to establish QKD among multiple users is important for extending the applications of QKD in communication networks. Herein, we proposed a scheme of symmetric dispersive optics QKD (DO-QKD) and demonstrated an entanglement-based quantum network based on it. In the experiment, a broadband entangled photon pair source was shared by end users via wavelength and space division multiplexing. The wide spectrum of generated entangled photon pairs was divided into 16 combinations of frequency-conjugate channels. Photon pairs in each channel combination supported a fully-connected subnet with 8 users by a passive beam splitter. The results showed that an entanglement-based QKD network of over 100 users could be supported by one entangled photon pair source in this architecture. It has great potential for applications in local quantum networks with a large number of users.

preprint, 2020, arXiv

Feature Super-Resolution Based Facial Expression Recognition for Multi-scale Low-Resolution Faces

Facial expression recognition (FER) on low-resolution images is necessary for applications like group expression recognition in crowd scenarios (stations, classrooms, etc.). Classifying a small facial image into the right expression category is still a challenging task. The main cause of this problem is the loss of discriminative features due to reduced resolution. Super-resolution methods are often used to enhance low-resolution images, but their performance on the FER task is limited on images of very low resolution. In this work, inspired by feature super-resolution methods for object detection, we proposed a novel generative adversarial network-based feature-level super-resolution method for robust facial expression recognition (FSR-FER). In particular, a pre-trained FER model was employed as a feature extractor, and a generator network G and a discriminator network D were trained with features extracted from low-resolution images and the original high-resolution images. The generator network G tries to transform features of low-resolution images into more discriminative ones by making them closer to those of the corresponding high-resolution images. For better classification performance, we also proposed an effective classification-aware loss re-weighting strategy, based on the classification probability calculated by a fixed FER model, to make our model focus more on samples that are easily misclassified. Experimental results on the Real-World Affective Faces (RAF) Database demonstrate that our method achieves satisfying results on various down-sample factors with a single model and has better performance on low-resolution images compared with methods using image super-resolution and expression recognition separately.

preprint, 2020, arXiv

MultiResolution Attention Extractor for Small Object Detection

Small objects are difficult to detect because of their low resolution and small size. The existing small object detection methods mainly focus on data preprocessing or narrowing the differences between large and small objects. Inspired by human vision "attention" mechanism, we exploit two feature extraction methods to mine the most useful information of small objects. Both methods are based on multiresolution feature extraction. We initially design and explore the soft attention method, but we find that its convergence speed is slow. Then we present the second method, an attention-based feature interaction method, called a MultiResolution Attention Extractor (MRAE), showing significant improvement as a generic feature extractor in small object detection. After each building block in the vanilla feature extractor, we append a small network to generate attention weights followed by a weighted-sum operation to get the final attention maps. Our attention-based feature extractor is 2.0 times the AP of the "hard" attention counterpart (plain architecture) on the COCO small object detection benchmark, proving that MRAE can capture useful location and contextual information through adaptive learning.
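
The "attention weights followed by a weighted-sum operation" step mentioned above can be illustrated with a minimal sketch. It is not the MRAE implementation: the per-resolution scores are passed in directly rather than produced by the small network the abstract mentions, normalizing them with a softmax is an assumption, and the names `softmax` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores):
    """Weighted sum of same-length feature vectors, one per resolution level.

    `scores` holds one attention score per level (assumed to come from a
    small scoring network in a real model).
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
```

With equal scores the fusion reduces to a plain average of the levels; a larger score shifts the output toward that level's features.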

preprint, 2020, arXiv

ScalAna: Automating Scaling Loss Detection with Graph Analysis

Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis. Tracing collects all information, but at prohibitive overhead. In this work, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds: it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root cause of scaling loss. We evaluate ScalAna with real applications. Results show that our approach can effectively locate the root cause of scaling loss for real applications and incurs 1.73% overhead on average for up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by ScalAna on 2,048 processes.
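
To give a feel for what backtracking over a performance graph can look like, here is a deliberately simplified sketch. It is not ScalAna's algorithm, which the abstract does not specify. It assumes each graph node carries a single "scaling loss" number and that dependence edges are stored as a map from a node to its predecessors; the walk starts at a node with a visible problem and repeatedly steps to the predecessor with the largest loss above a threshold. The function name, data layout, and node names are all hypothetical.

```python
def backtrack_root_cause(preds, loss, start, threshold=0.0):
    """Walk backwards along dependence edges to a candidate root cause.

    preds:     dict mapping a node to the list of nodes it depends on.
    loss:      dict mapping a node to its (hypothetical) scaling-loss value.
    start:     node where the scaling problem was observed.
    threshold: predecessors with loss at or below this value are ignored.
    Returns (root, path), where path lists the visited nodes in order.
    """
    path = [start]
    seen = {start}
    current = start
    while True:
        candidates = [
            p for p in preds.get(current, [])
            if p not in seen and loss.get(p, 0.0) > threshold
        ]
        if not candidates:
            # No suspicious predecessor left: report the current node.
            return current, path
        current = max(candidates, key=lambda p: loss.get(p, 0.0))
        seen.add(current)
        path.append(current)


if __name__ == "__main__":
    # Illustrative graph: a slow collective depends on a receive, which
    # depends on a send, which depends on an imbalanced loop.
    preds = {
        "allreduce": ["compute_b", "recv"],
        "recv": ["send"],
        "send": ["loop_a"],
    }
    loss = {"allreduce": 5.0, "recv": 4.0, "compute_b": 0.5,
            "send": 3.0, "loop_a": 2.5}
    print(backtrack_root_cause(preds, loss, "allreduce", threshold=1.0))
```

The `seen` set keeps the walk from cycling if the dependence map contains loops.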

preprint, 2020, arXiv

Variational Inference-Based Dropout in Recurrent Neural Networks for Slot Filling in Spoken Language Understanding

This paper proposes to generalize the variational recurrent neural network (RNN), in which variational inference (VI)-based dropout regularization is applied to long short-term memory (LSTM) cells, to more advanced RNN architectures such as the gated recurrent unit (GRU) and bi-directional LSTM/GRU. The new variational RNNs are employed for slot filling, which is an intriguing but challenging task in spoken language understanding. Experiments on the ATIS dataset suggest that the variational RNNs with VI-based dropout regularization can significantly improve on baseline systems built on RNNs with naive dropout regularization in terms of F-measure. In particular, the variational RNN with bi-directional LSTM/GRU obtains the best F-measure score.
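
The usual distinction between the two dropout schemes is that naive dropout samples a fresh mask at every time step, while variational dropout samples one mask per sequence and reuses it at every step. The sketch below shows only that masking difference on a sequence of input vectors. It is a simplified illustration: a full variational RNN also masks the recurrent connections, and the function names here are hypothetical rather than taken from the paper.

```python
import random


def sample_mask(size, drop_prob, rng):
    """Inverted-dropout mask: kept units are scaled by 1 / keep probability."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]


def apply_dropout(sequence, drop_prob, rng, variational=True):
    """Apply dropout to every time step of a sequence of equal-length vectors.

    variational=True  -> one mask sampled once and shared by all time steps.
    variational=False -> a new mask sampled independently at each time step.
    Returns (masked_sequence, masks_used).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    size = len(sequence[0])
    shared = sample_mask(size, drop_prob, rng) if variational else None
    masked, masks = [], []
    for step in sequence:
        mask = shared if variational else sample_mask(size, drop_prob, rng)
        masks.append(mask)
        masked.append([m * x for m, x in zip(mask, step)])
    return masked, masks


if __name__ == "__main__":
    rng = random.Random(0)
    seq = [[1.0] * 8 for _ in range(5)]  # 5 time steps, 8 input units
    _, var_masks = apply_dropout(seq, 0.5, rng, variational=True)
    _, naive_masks = apply_dropout(seq, 0.5, rng, variational=False)
    print("shared mask at every step:", all(m == var_masks[0] for m in var_masks))
    print("naive masks:", naive_masks)
```

Passing a seeded `random.Random` instance keeps the example reproducible without touching the global random state.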

preprint, 2019, arXiv

SLOAM: Semantic Lidar Odometry and Mapping for Forest Inventory

This paper describes an end-to-end pipeline for tree diameter estimation based on semantic segmentation and lidar odometry and mapping. Accurate mapping of this type of environment is challenging since the ground and the trees are surrounded by leaves, thorns and vines, and the sensor typically experiences extreme motion. We propose a semantic-feature-based pose optimization that simultaneously refines the tree models while estimating the robot pose. The pipeline utilizes a custom virtual reality tool for labeling 3D scans, which are used to train a semantic segmentation network. The masked point cloud is used to compute a trellis graph that identifies individual instances and extracts relevant features that are used by the SLAM module. We show that traditional lidar and image-based methods fail in the forest environment on both Unmanned Aerial Vehicle (UAV) and hand-carried systems, while our method is more robust, scalable, and automatically generates tree diameter estimates.