Researcher profile

Zhen Ming

Zhen Ming contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
8works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

8 published item(s)

preprint2026arXiv

Towards AI-Native Software Engineering (SE 3.0): A Vision and a Challenge Roadmap

The rise of AI-assisted software engineering (SE 2.0), powered by Foundation Models (FMs) and FM-powered coding assistants, has shown promise in improving developer productivity. However, it has also exposed inherent limitations, such as cognitive overload on developers and inefficiencies. We propose a shift towards Software Engineering 3.0 (SE 3.0), an AI-native approach characterized by intent-centric, conversation-oriented development between human developers and AI teammates. SE 3.0 envisions AI systems evolving beyond task-driven copilots into intelligent collaborators, capable of deeply understanding and reasoning about software engineering principles and intents. We outline the key components of the SE 3.0 technology stack, which includes Teammate.next for adaptive and personalized AI partnership, IDE.next for intent-centric conversation-oriented development, Compiler.next for multi-objective code synthesis, and Runtime.next for SLA-aware execution with edge-computing support. Our vision addresses the inefficiencies and cognitive strain of SE 2.0 by fostering a symbiotic relationship between human developers and AI, maximizing their complementary strengths. We also present a roadmap of challenges that must be overcome to realize our vision of SE 3.0. This paper lays the foundation for future discussions on the role of AI in the next era of software engineering.

preprint2023arXiv

Bugs in Machine Learning-based Systems: A Faultload Benchmark

The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and integrating them into the ML-based system safely. Although most of these tools use bugs' lifecycle, there is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses. In this study, we firstly investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark namely defect4ML that satisfies all criteria of standard benchmark, i.e. relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers in GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, like: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to ML-based systems practitioners and researchers to assess their testing tools and techniques.

preprint2022arXiv

An Empirical Study of Challenges in Converting Deep Learning Models

There is an increase in deploying Deep Learning (DL)-based software systems in real-world applications. Usually DL models are developed and trained using DL frameworks that have their own internal mechanisms/formats to represent and train DL models, and usually those formats cannot be recognized by other frameworks. Moreover, trained models are usually deployed in environments different from where they were developed. To solve the interoperability issue and make DL models compatible with different frameworks/environments, some exchange formats are introduced for DL models, like ONNX and CoreML. However, ONNX and CoreML were never empirically evaluated by the community to reveal their prediction accuracy, performance, and robustness after conversion. Poor accuracy or non-robust behavior of converted models may lead to poor quality of deployed DL-based software systems. We conduct, in this paper, the first empirical study to assess ONNX and CoreML for converting trained DL models. In our systematic approach, two popular DL frameworks, Keras and PyTorch, are used to train five widely used DL models on three popular datasets. The trained models are then converted to ONNX and CoreML and transferred to two runtime environments designated for such formats, to be evaluated. We investigate the prediction accuracy before and after conversion. Our results unveil that the prediction accuracy of converted models are at the same level of originals. The performance (time cost and memory consumption) of converted models are studied as well. The size of models are reduced after conversion, which can result in optimized DL-based software deployment. Converted models are generally assessed as robust at the same level of originals. However, obtained results show that CoreML models are more vulnerable to adversarial attacks compared to ONNX.

preprint2022arXiv

Can I use this publicly available dataset to build commercial AI software? -- A Case Study on Publicly Available Image Datasets

Publicly available datasets are one of the key drivers for commercial AI software. The use of publicly available datasets is governed by dataset licenses. These dataset licenses outline the rights one is entitled to on a given dataset and the obligations that one must fulfil to enjoy such rights without any license compliance violations. Unlike standardized Open Source Software (OSS) licenses, existing dataset licenses are defined in an ad-hoc manner and do not clearly outline the rights and obligations associated with their usage. Further, a public dataset may be hosted in multiple locations and created from multiple data sources each of which may have different licenses. Hence, existing approaches on checking OSS license compliance cannot be used. In this paper, we propose a new approach to assessing the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software. We conduct a case study with our approach on 6 commonly used publicly available image datasets. Our results show that there exists potential risks of license violations associated with all of the studied datasets if they were used for commercial purposes.

preprint2022arXiv

Towards a consistent interpretation of AIOps models

Artificial Intelligence for IT Operations (AIOps) has been adopted in organizations in various tasks, including interpreting models to identify indicators of service failures. To avoid misleading practitioners, AIOps model interpretations should be consistent (i.e., different AIOps models on the same task agree with one another on feature importance). However, many AIOps studies violate established practices in the machine learning community when deriving interpretations, such as interpreting models with suboptimal performance, though the impact of such violations on the interpretation consistency has not been studied. In this paper, we investigate the consistency of AIOps model interpretation along three dimensions: internal consistency, external consistency, and time consistency. We conduct a case study on two AIOps tasks: predicting Google cluster job failures, and Backblaze hard drive failures. We find that the randomness from learners, hyperparameter tuning, and data sampling should be controlled to generate consistent interpretations. AIOps models with AUCs greater than 0.75 yield more consistent interpretation compared to low-performing models. Finally, AIOps models that are constructed with the Sliding Window or Full History approaches have the most consistent interpretation with the trends presented in the entire datasets. Our study provides valuable guidelines for practitioners to derive consistent AIOps model interpretation.

preprint2022arXiv

Towards Build Verifiability for Java-based Systems

Build verifiability refers to the property that the build of a software system can be verified by independent third parties and it is crucial for the trustworthiness of a software system. Various efforts towards build verifiability have been made to C/C++-based systems, yet the techniques for Java-based systems are not systematic and are often specific to a particular build tool (e.g., Maven). In this study, we present a systematic approach towards build verifiability on Java-based systems. Our approach consists of three parts: a unified build process, a tool that dynamically controls non-determinism during the build process, and another tool that eliminates non-equivalences by post-processing the build artifacts. We apply our approach on 46 unverified open source projects from Reproducible Central and 13 open source projects that are widely used by Huawei commercial products. As a result, 91% of the unverified Reproducible Central projects and 100% of the commercially adopted OSS projects are successfully verified with our approach. In addition, based on our experience in analyzing thousands of builds for both commercial and open source Java-based systems, we present 14 patterns that introduce non-equivalences in generated build artifacts and their respective mitigation strategies. Among these patterns, 11 (78%) are unique for Java-based system, whereas the remaining 3 (22%) are common with C/C++-based systems. The approach and the findings of this paper are useful for both practitioners and researchers who are interested in build verifiability.

preprint2022arXiv

Towards Training Reproducible Deep Learning Models

Reproducibility is an increasing concern in Artificial Intelligence (AI), particularly in the area of Deep Learning (DL). Being able to reproduce DL models is crucial for AI-based systems, as it is closely tied to various tasks like training, testing, debugging, and auditing. However, DL models are challenging to be reproduced due to issues like randomness in the software (e.g., DL algorithms) and non-determinism in the hardware (e.g., GPU). There are various practices to mitigate some of the aforementioned issues. However, many of them are either too intrusive or can only work for a specific usage context. In this paper, we propose a systematic approach to training reproducible DL models. Our approach includes three main parts: (1) a set of general criteria to thoroughly evaluate the reproducibility of DL models for two different domains, (2) a unified framework which leverages a record-and-replay technique to mitigate software-related randomness and a profile-and-patch technique to control hardware-related non-determinism, and (3) a reproducibility guideline which explains the rationales and the mitigation strategies on conducting a reproducible training process for DL models. Case study results show our approach can successfully reproduce six open source and one commercial DL models.

preprint2020arXiv

An Exploratory Study on Machine Learning Model Stores

Recent advances in Artificial Intelligence, especially in Machine Learning (ML), have brought applications previously considered as science fiction (e.g., virtual personal assistants and autonomous cars) into the reach of millions of everyday users. Since modern ML technologies like deep learning require considerable technical expertise and resource to build custom models, reusing existing models trained by experts has become essential. This is why in the past year model stores have been introduced, which, similar to mobile app stores, offer organizations and developers access to pre-trained models and/or their code to train, evaluate, and predict samples. This paper conducts an exploratory study on three popular model stores (AWS marketplace, Wolfram neural net repository, and ModelDepot) that compares the information elements (features and policies) provided by model stores to those used by the two popular mobile app stores (Google Play and Apple's App Store). We have found that the model information elements vary among the different model stores, with 65% elements shared by all three studied stores. Model stores share five information elements with mobile app stores, while eight elements are unique to model stores and four elements unique to app stores. Only few models were available on multiple model stores. Our findings allow to better understand the differences between ML models and "regular" source code components or applications, and provide inspiration to identify software engineering practices (e.g., in requirements and delivery) specific to ML applications.