Researcher profile

Xuanzhe Liu

Xuanzhe Liu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
13works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

13 published item(s)

preprint2026arXiv

A First Look at Bugs in LLM Inference Engines

Large language model-specific inference engines (in short as \emph{LLM inference engines}) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities of cross-platform compatibility. However, a systematic understanding of these bugs remains lacking. To bridge this gap, we present the first empirical study on bugs in LLM inference engines. We mine official repositories of 5 widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Through a rigorous open coding process, we analyze these bugs to uncover their symptoms, root causes, commonality, fix effort, fix strategies, and temporal evolution. Our findings reveal six bug symptom types and a taxonomy of 28 root causes, shedding light on the key challenges in bug detection and location within LLM inference engines. Based on these insights, we propose a series of actionable implications for researchers, inference engine vendors, and LLM app developers, along with general guidelines for developing LLM inference engines.

preprint2022arXiv

Benchmarking of DL Libraries and Models on Mobile Devices

Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role as algorithms and hardware do. Unfortunately, no prior work ever dives deep into the ecosystem of modern DL libs and provides quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libs and 15 diversified DL models. We then perform extensive experiments on 10 mobile devices, which help reveal a complete landscape of the current mobile DL libs ecosystem. For example, we find that the best-performing DL lib is severely fragmented across different models and hardware, and the gap between those DL libs can be rather huge. In fact, the impacts of DL libs can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Finally, atop the observations, we summarize practical implications to different roles in the DL lib ecosystem.

preprint2022arXiv

Federated Neural Architecture Search

To preserve user privacy while enabling mobile intelligence, techniques have been proposed to train deep neural networks on decentralized data. However, training over decentralized data makes the design of neural architecture quite difficult as it already was. Such difficulty is further amplified when designing and deploying different neural architectures for heterogeneous mobile platforms. In this work, we propose an automatic neural architecture search into the decentralized training, as a new DNN training paradigm called Federated Neural Architecture Search, namely federated NAS. To deal with the primary challenge of limited on-client computational and communication resources, we present FedNAS, a highly optimized framework for efficient federated NAS. FedNAS fully exploits the key opportunity of insufficient model candidate re-training during the architecture search process, and incorporates three key optimizations: parallel candidates training on partial clients, early dropping candidates with inferior performance, and dynamic round numbers. Tested on large-scale datasets and typical CNN architectures, FedNAS achieves comparable model accuracy as state-of-the-art NAS algorithm that trains models with centralized data, and also reduces the client cost by up to two orders of magnitude compared to a straightforward design of federated NAS.

preprint2022arXiv

Mandheling: Mixed-Precision On-Device DNN Training with DSP Offloading

This paper proposes Mandheling, the first system that enables highly resource-efficient on-device training by orchestrating the mixed-precision training with on-chip Digital Signal Processing (DSP) offloading. Mandheling fully explores the advantages of DSP in integer-based numerical calculation by four novel techniques: (1) a CPU-DSP co-scheduling scheme to mitigate the overhead from DSP-unfriendly operators; (2) a self-adaptive rescaling algorithm to reduce the overhead of dynamic rescaling in backward propagation; (3) a batch-splitting algorithm to improve the DSP cache efficiency; (4) a DSP-compute subgraph reusing mechanism to eliminate the preparation overhead on DSP. We have fully implemented Mandheling and demonstrate its effectiveness through extensive experiments. The results show that, compared to the state-of-the-art DNN engines from TFLite and MNN, Mandheling reduces the per-batch training time by 5.5$\times$ and the energy consumption by 8.9$\times$ on average. In end-to-end training tasks, Mandheling reduces up to 10.7$\times$ convergence time and 13.1$\times$ energy consumption, with only 1.9%-2.7% accuracy loss compared to the FP32 precision setting.

preprint2022arXiv

Software Engineering for Serverless Computing

Serverless computing is an emerging cloud computing paradigm that has been applied to various domains, including machine learning, scientific computing, video processing, etc. To develop serverless computing-based software applications (a.k.a., serverless applications), developers follow the new cloud-based software architecture, where they develop event-driven applications without the need for complex and error-prone server management. The great demand for developing serverless applications poses unique challenges to software developers. However, Software Engineering (SE) has not yet wholeheartedly tackled these challenges. In this paper, we outline a vision for how SE can facilitate the development of serverless applications and call for actions by the SE research community to reify this vision. Specifically, we discuss possible directions in which researchers and cloud providers can facilitate serverless computing from the SE perspective, including configuration management, data security, application migration, performance, testing and debugging, etc.

preprint2021arXiv

A First Look at Deep Learning Apps on Smartphones

We are in the dawn of deep learning explosion for smartphones. To bridge the gap between research and practice, we present the first empirical study on 16,500 the most popular Android apps, demystifying how smartphone apps exploit deep learning in the wild. To this end, we build a new static tool that dissects apps and analyzes their deep learning functions. Our study answers threefold questions: what are the early adopter apps of deep learning, what do they use deep learning for, and how do their deep learning models look like. Our study has strong implications for app developers, smartphone vendors, and deep learning R\&D. On one hand, our findings paint a promising picture of deep learning for smartphones, showing the prosperity of mobile deep learning frameworks as well as the prosperity of apps building their cores atop deep learning. On the other hand, our findings urge optimizations on deep learning models deployed on smartphones, the protection of these models, and validation of research ideas on these models.

preprint2021arXiv

An Empirical Study on Deployment Faults of Deep Learning Based Mobile Applications

Deep Learning (DL) is finding its way into a growing number of mobile software applications. These software applications, named as DL based mobile applications (abbreviated as mobile DL apps) integrate DL models trained using large-scale data with DL programs. A DL program encodes the structure of a desirable DL model and the process by which the model is trained using training data. Due to the increasing dependency of current mobile apps on DL, software engineering (SE) for mobile DL apps has become important. However, existing efforts in SE research community mainly focus on the development of DL models and extensively analyze faults in DL programs. In contrast, faults related to the deployment of DL models on mobile devices (named as deployment faults of mobile DL apps) have not been well studied. Since mobile DL apps have been used by billions of end users daily for various purposes including for safety-critical scenarios, characterizing their deployment faults is of enormous importance. To fill the knowledge gap, this paper presents the first comprehensive study on the deployment faults of mobile DL apps. We identify 304 real deployment faults from Stack Overflow and GitHub, two commonly used data sources for studying software faults. Based on the identified faults, we construct a fine-granularity taxonomy consisting of 23 categories regarding to fault symptoms and distill common fix strategies for different fault types. Furthermore, we suggest actionable implications and research avenues that could further facilitate the deployment of DL models on mobile devices.

preprint2021arXiv

DeepWear: Adaptive Local Offloading for On-Wearable Deep Learning

Due to their on-body and ubiquitous nature, wearables can generate a wide range of unique sensor data creating countless opportunities for deep learning tasks. We propose DeepWear, a deep learning (DL) framework for wearable devices to improve the performance and reduce the energy footprint. DeepWear strategically offloads DL tasks from a wearable device to its paired handheld device through local network. Compared to the remote-cloud-based offloading, DeepWear requires no Internet connectivity, consumes less energy, and is robust to privacy breach. DeepWear provides various novel techniques such as context-aware offloading, strategic model partition, and pipelining support to efficiently utilize the processing capacity from nearby paired handhelds. Deployed as a user-space library, DeepWear offers developer-friendly APIs that are as simple as those in traditional DL libraries such as TensorFlow. We have implemented DeepWear on the Android OS and evaluated it on COTS smartphones and smartwatches with real DL models. DeepWear brings up to 5.08X and 23.0X execution speedup, as well as 53.5% and 85.5% energy saving compared to wearable-only and handheld-only strategies, respectively.

preprint2021arXiv

Exploring the Generalizability of Spatio-Temporal Traffic Prediction: Meta-Modeling and an Analytic Framework

The Spatio-Temporal Traffic Prediction (STTP) problem is a classical problem with plenty of prior research efforts that benefit from traditional statistical learning and recent deep learning approaches. While STTP can refer to many real-world problems, most existing studies focus on quite specific applications, such as the prediction of taxi demand, ridesharing order, traffic speed, and so on. This hinders the STTP research as the approaches designed for different applications are hardly comparable, and thus how an application-driven approach can be generalized to other scenarios is unclear. To fill in this gap, this paper makes three efforts: (i) we propose an analytic framework, called STAnalytic, to qualitatively investigate STTP approaches regarding their design considerations on various spatial and temporal factors, aiming to make different application-driven approaches comparable; (ii) we design a spatio-temporal meta-model, called STMeta, which can flexibly integrate generalizable temporal and spatial knowledge identified by STAnalytic, (iii) we build an STTP benchmark platform including ten real-life datasets with five scenarios to quantitatively measure the generalizability of STTP approaches. In particular, we implement STMeta with different deep learning techniques, and STMeta demonstrates better generalizability than state-of-the-art approaches by achieving lower prediction error on average across all the datasets.

preprint2021arXiv

VM Matters: A Comparison of WASM VMs and EVMs in the Performance of Blockchain Smart Contracts

WebAssemly is an emerging runtime for Web applications and has been supported in almost all browsers. Recently, WebAssembly is further regarded to be a the next-generation environment for blockchain applications, and has been adopted by Ethereum, namely eWASM, to replace the state-of-the-art EVM. However, whether and how well current eWASM outperforms EVM on blockchain clients is still unknown. This paper conducts the first measurement study, to measure the performance on WASM VM and EVM for executing smart contracts on blockchain. To our surprise, the current WASM VM does not perform in expected performance. The overhead introduced by WASM is really non-trivial. Our results highlight the challenges when deploying WASM in practice, and provide insightful implications for improvement space.

preprint2020arXiv

Approximate Query Service on Autonomous IoT Cameras

Elf is a runtime for an energy-constrained camera to continuously summarize video scenes as approximate object counts. Elf's novelty centers on planning the camera's count actions under energy constraint. (1) Elf explores the rich action space spanned by the number of sample image frames and the choice of per-frame object counters; it unifies errors from both sources into one single bounded error. (2) To decide count actions at run time, Elf employs a learning-based planner, jointly optimizing for past and future videos without delaying result materialization. Tested with more than 1,000 hours of videos and under realistic energy constraints, Elf continuously generates object counts within only 11% of the true counts on average. Alongside the counts, Elf presents narrow errors shown to be bounded and up to 3.4x smaller than competitive baselines. At a higher level, Elf makes a case for advancing the geographic frontier of video analytics.

preprint2020arXiv

Characterizing EOSIO Blockchain

EOSIO has become one of the most popular blockchain platforms since its mainnet launch in June 2018. In contrast to the traditional PoW-based systems (e.g., Bitcoin and Ethereum), which are limited by low throughput, EOSIO is the first high throughput Delegated Proof of Stake system that has been widely adopted by many applications. Although EOSIO has millions of accounts and billions of transactions, little is known about its ecosystem, especially related to security and fraud. In this paper, we perform a large-scale measurement study of the EOSIO blockchain and its associated DApps. We gather a large-scale dataset of EOSIO and characterize activities including money transfers, account creation and contract invocation. Using our insights, we then develop techniques to automatically detect bots and fraudulent activity. We discover thousands of bot accounts (over 30\% of the accounts in the platform) and a number of real-world attacks (301 attack accounts). By the time of our study, 80 attack accounts we identified have been confirmed by DApp teams, causing 828,824 EOS tokens losses (roughly 2.6 million US\$) in total.

preprint2020arXiv

DeepCache: Principled Cache for Mobile Deep Vision

We present DeepCache, a principled cache design for deep learning inference in continuous mobile vision. DeepCache benefits model execution efficiency by exploiting temporal locality in input video streams. It addresses a key challenge raised by mobile vision: the cache must operate under video scene variation, while trading off among cacheability, overhead, and loss in model accuracy. At the input of a model, DeepCache discovers video temporal locality by exploiting the video's internal structure, for which it borrows proven heuristics from video compression; into the model, DeepCache propagates regions of reusable results by exploiting the model's internal structure. Notably, DeepCache eschews applying video heuristics to model internals which are not pixels but high-dimensional, difficult-to-interpret data. Our implementation of DeepCache works with unmodified deep learning models, requires zero developer's manual effort, and is therefore immediately deployable on off-the-shelf mobile devices. Our experiments show that DeepCache saves inference execution time by 18% on average and up to 47%. DeepCache reduces system energy consumption by 20% on average.