Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
36works
0followers
13topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

36 published item(s)

preprint2026arXiv

A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar

The rise of agentic AI is reshaping software engineering in two intertwined directions: agents are increasingly applied to support software engineering tasks, and Agentic AI systems themselves are complex systems that require re-thinking currently established software engineering practices. To chart a coherent research agenda covering the two directions, we organized the A2SE seminar in Rio de Janeiro, bringing together 18 experts from academia and industry. Through structured presentations, collaborative topic clustering, and focused group discussions, participants identified six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code, and they prioritized short-term and long-term research directions for each. This paper presents the resulting community-driven, opinionated research agenda, offering the SE community a structured foundation for coordinating efforts at this critical juncture.

preprint2026arXiv

An Empirical Study of Policy-as-Code Adoption in Open-Source Software Projects

\textbf{Context:} Policy-as-Code (PaC) has become a foundational approach for embedding governance, compliance, and security requirements directly into software systems. While organizations increasingly adopt PaC tools, the software engineering community lacks an empirical understanding of how these tools are used in real-world development practices. \textbf{Objective:} This paper aims to bridge this gap by conducting the first large-scale study of PaC usage in open-source software. Our goal is to characterize how PaC tools are adopted, what purposes they serve, and what governance activities they support across diverse software ecosystems. \textbf{Method:} We analyzed 399 GitHub repositories using nine widely adopted PaC tools. Our mixed-methods approach combines quantitative analysis of tool usage and project characteristics with a qualitative investigation of policy files. We further employ a Large Language Model (LLM)--assisted classification pipeline, refined through expert validation, to derive a taxonomy of PaC usage consisting of 5 categories and 15 sub-categories. \textbf{Results:} Our study reveals substantial diversity in PaC adoption. PaC tools are frequently used in early-stage projects and are heavily oriented toward governance, configuration control, and documentation. We also observe emerging PaC usage in MLOps pipelines and strong co-usage patterns, such as between OPA and Gatekeeper. Our taxonomy highlights recurring governance intents. \textbf{Conclusion:} Our findings offer actionable insights for practitioners and tool developers. They highlight concrete usage patterns, emphasize actual PaC usage, and motivate opportunities for improving tool interoperability. This study lays the empirical foundation for future research on PaC practices and their role in ensuring trustworthy, compliant software systems.

preprint2026arXiv

Empirical Characterization of Logging Smells in Machine Learning Code

\underline{Context:} Logging is a fundamental yet complex practice in software engineering, essential for monitoring, debugging, and auditing software systems. With the increasing integration of machine learning (ML) components into software systems, effective logging has become critical to ensure reproducibility, traceability, and observability throughout model training and deployment. Although various general-purpose and ML-specific logging frameworks exist, little is known about how these tools are actually used in practice or whether ML practitioners adopt consistent and effective logging strategies. To date, no empirical study has systematically characterized recurring bad logging practices--or logging smells--in ML System. \underline{Goal:} This study aims to empirically identify and characterize logging smells in ML systems, providing an evidence-based understanding of how logging is implemented and challenged in practice. \underline{Method:} We propose to conduct a large-scale mining of open-source ML repositories hosted on GitHub to catalogue recurring logging smells. Subsequently, a practitioner survey involving ML engineers will be conducted to assess the perceived relevance, severity, and frequency of the identified smells. \underline{Limitations:} % While The study's limitations include that While our findings may not be generalizable to closed-source industrial projects, we believe our study provides an essential step toward understanding and improving logging practices in ML development.

preprint2026arXiv

Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis

The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.

preprint2026arXiv

Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models

The advent of transformer-based language models has reshaped how AI systems process and generate text. In software engineering (SE), these models now support diverse activities, accelerating automation and decision-making. Yet, evidence shows that these models can reproduce or amplify social biases, raising fairness concerns. Recent work on neuron editing has shown that internal activations in pre-trained transformers can be traced and modified to alter model behavior. Building on the concept of knowledge neurons, neurons that encode factual information, we hypothesize the existence of biased neurons that capture stereotypical associations within pre-trained transformers. To test this hypothesis, we build a dataset of biased relations, i.e., triplets encoding stereotypes across nine bias types, and adapt neuron attribution strategies to trace and suppress biased neurons in BERT models. We then assess the impact of suppression on SE tasks. Our findings show that biased knowledge is localized within small neuron subsets, and suppressing them substantially reduces bias with minimal performance loss. This demonstrates that bias in transformers can be traced and mitigated at the neuron level, offering an interpretable approach to fairness in SE.

preprint2023arXiv

AmbieGen: A Search-based Framework for Autonomous Systems Testing

Thorough testing of safety-critical autonomous systems, such as self-driving cars, autonomous robots, and drones, is essential for detecting potential failures before deployment. One crucial testing stage is model-in-the-loop testing, where the system model is evaluated by executing various scenarios in a simulator. However, the search space of possible parameters defining these test scenarios is vast, and simulating all combinations is computationally infeasible. To address this challenge, we introduce AmbieGen, a search-based test case generation framework for autonomous systems. AmbieGen uses evolutionary search to identify the most critical scenarios for a given system, and has a modular architecture that allows for the addition of new systems under test, algorithms, and search operators. Currently, AmbieGen supports test case generation for autonomous robots and autonomous car lane keeping assist systems. In this paper, we provide a high-level overview of the framework's architecture and demonstrate its practical use cases.

preprint2023arXiv

Bugs in Machine Learning-based Systems: A Faultload Benchmark

The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and integrating them into the ML-based system safely. Although most of these tools use bugs' lifecycle, there is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses. In this study, we firstly investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark namely defect4ML that satisfies all criteria of standard benchmark, i.e. relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers in GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, like: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to ML-based systems practitioners and researchers to assess their testing tools and techniques.

preprint2023arXiv

Mutation Testing of Deep Reinforcement Learning Based on Real Faults

Testing Deep Learning (DL) systems is a complex task as they do not behave like traditional systems would, notably because of their stochastic nature. Nonetheless, being able to adapt existing testing techniques such as Mutation Testing (MT) to DL settings would greatly improve their potential verifiability. While some efforts have been made to extend MT to the Supervised Learning paradigm, little work has gone into extending it to Reinforcement Learning (RL) which is also an important component of the DL ecosystem but behaves very differently from SL. This paper builds on the existing approach of MT in order to propose a framework, RLMutation, for MT applied to RL. Notably, we use existing taxonomies of faults to build a set of mutation operators relevant to RL and use a simple heuristic to generate test cases for RL. This allows us to compare different mutation killing definitions based on existing approaches, as well as to analyze the behavior of the obtained mutation operators and their potential combinations called Higher Order Mutation(s) (HOM). We show that the design choice of the mutation killing definition can affect whether or not a mutation is killed as well as the generated test cases. Moreover, we found that even with a relatively small number of test cases and operators we manage to generate HOM with interesting properties which can enhance testing capability in RL systems.

preprint2023arXiv

Partial Order in Chaos: Consensus on Feature Attributions in the Rashomon Set

Post-hoc global/local feature attribution methods are progressively being employed to understand the decisions of complex machine learning models. Yet, because of limited amounts of data, it is possible to obtain a diversity of models with good empirical performance but that provide very different explanations for the same prediction, making it hard to derive insight from them. In this work, instead of aiming at reducing the under-specification of model explanations, we fully embrace it and extract logical statements about feature attributions that are consistent across all models with good empirical performance (i.e. all models in the Rashomon Set). We show that partial orders of local/global feature importance arise from this methodology enabling more nuanced interpretations by allowing pairs of features to be incomparable when there is no consensus on their relative importance. We prove that every relation among features present in these partial orders also holds in the rankings provided by existing approaches. Finally, we present three use cases employing hypothesis spaces with tractable Rashomon Sets (Additive models, Kernel Ridge, and Random Forests) and show that partial orders allow one to extract consistent local and global interpretations of models despite their under-specification.

preprint2023arXiv

Studying Logging Practice in Machine Learning-based Applications

Logging is a common practice in traditional software development. Several research works have been done to investigate the different characteristics of logging practices in traditional software systems (e.g., Android applications, JAVA applications, C/C++ applications). Nowadays, we are witnessing more and more development of Machine Learning-based applications (ML-based applications). Today, there are many popular libraries that facilitate and contribute to the development of such applications, among which we can mention: Pytorch, Tensorflow, Theano, MXNet, Scikit-Learn, Caffe, and Keras. Despite the popularity of ML, we don't have a clear understanding of logging practices in ML applications. In this paper, we aim to fill this knowledge gap and help ML practitioners understand the characteristics of logging in ML-based applications. In particular, we conduct an empirical study on 110 open-source ML-based applications. Through a quantitative analysis, we find that logging practice in ML-based applications is less pervasive than in traditional applications including Android, JAVA, and C/C++ applications. Furthermore, the majority of logging statements in ML-based applications are in info and warn levels, compared to traditional applications where info is the majority of logging statement in C/C++ application and debug, error levels constitute the majority of logging statement in Android application. We also perform a quantitative and qualitative analysis of a random sample of logging statements to understand where ML developers put most of logging statements and examine why and how they are using logging. These analyses led to the following observations: (i) ML developers put most of the logging statements in model training, and in non-ML components. (ii) Data and model management appear to be the main reason behind the introduction of logging statements in ML-based applications.

preprint2022arXiv

A Probabilistic Framework for Mutation Testing in Deep Neural Networks

Context: Mutation Testing (MT) is an important tool in traditional Software Engineering (SE) white-box testing. It aims to artificially inject faults in a system to evaluate a test suite's capability to detect them, assuming that the test suite defects finding capability will then translate to real faults. If MT has long been used in SE, it is only recently that it started gaining the attention of the Deep Learning (DL) community, with researchers adapting it to improve the testability of DL models and improve the trustworthiness of DL systems. Objective: If several techniques have been proposed for MT, most of them neglected the stochasticity inherent to DL resulting from the training phase. Even the latest MT approaches in DL, which propose to tackle MT through a statistical approach, might give inconsistent results. Indeed, as their statistic is based on a fixed set of sampled training instances, it can lead to different results across instances set when results should be consistent for any instance. Methods: In this work, we propose a Probabilistic Mutation Testing (PMT) approach that alleviates the inconsistency problem and allows for a more consistent decision on whether a mutant is killed or not. Results: We show that PMT effectively allows a more consistent and informed decision on mutations through evaluation using three models and eight mutation operators used in previously proposed MT methods. We also analyze the trade-off between the approximation error and the cost of our method, showing that relatively small error can be achieved for a manageable cost. Conclusion: Our results showed the limitation of current MT practices in DNN and the need to rethink them. We believe PMT is the first step in that direction which effectively removes the lack of consistency across test executions of previous methods caused by the stochasticity of DNN training.

preprint2022arXiv

An Empirical Study of Challenges in Converting Deep Learning Models

There is an increase in deploying Deep Learning (DL)-based software systems in real-world applications. Usually DL models are developed and trained using DL frameworks that have their own internal mechanisms/formats to represent and train DL models, and usually those formats cannot be recognized by other frameworks. Moreover, trained models are usually deployed in environments different from where they were developed. To solve the interoperability issue and make DL models compatible with different frameworks/environments, some exchange formats are introduced for DL models, like ONNX and CoreML. However, ONNX and CoreML were never empirically evaluated by the community to reveal their prediction accuracy, performance, and robustness after conversion. Poor accuracy or non-robust behavior of converted models may lead to poor quality of deployed DL-based software systems. We conduct, in this paper, the first empirical study to assess ONNX and CoreML for converting trained DL models. In our systematic approach, two popular DL frameworks, Keras and PyTorch, are used to train five widely used DL models on three popular datasets. The trained models are then converted to ONNX and CoreML and transferred to two runtime environments designated for such formats, to be evaluated. We investigate the prediction accuracy before and after conversion. Our results unveil that the prediction accuracy of converted models are at the same level of originals. The performance (time cost and memory consumption) of converted models are studied as well. The size of models are reduced after conversion, which can result in optimized DL-based software deployment. Converted models are generally assessed as robust at the same level of originals. However, obtained results show that CoreML models are more vulnerable to adversarial attacks compared to ONNX.

preprint2022arXiv

An Empirical Study on the Usage of Automated Machine Learning Tools

The popularity of automated machine learning (AutoML) tools in different domains has increased over the past few years. Machine learning (ML) practitioners use AutoML tools to automate and optimize the process of feature engineering, model training, and hyperparameter optimization and so on. Recent work performed qualitative studies on practitioners' experiences of using AutoML tools and compared different AutoML tools based on their performance and provided features, but none of the existing work studied the practices of using AutoML tools in real-world projects at a large scale. Therefore, we conducted an empirical study to understand how ML practitioners use AutoML tools in their projects. To this end, we examined the top 10 most used AutoML tools and their respective usages in a large number of open-source project repositories hosted on GitHub. The results of our study show 1) which AutoML tools are mostly used by ML practitioners and 2) the characteristics of the repositories that use these AutoML tools. Also, we identified the purpose of using AutoML tools (e.g. model parameter sampling, search space management, model evaluation/error-analysis, Data/ feature transformation, and data labeling) and the stages of the ML pipeline (e.g. feature engineering) where AutoML tools are used. Finally, we report how often AutoML tools are used together in the same source code files. We hope our results can help ML practitioners learn about different AutoML tools and their usages, so that they can pick the right tool for their purposes. Besides, AutoML tool developers can benefit from our findings to gain insight into the usages of their tools and improve their tools to better fit the users' usages and needs.

preprint2022arXiv

Bug Characteristics in Quantum Software Ecosystem

With the advance in quantum computing in recent years, quantum software becomes vital for exploring the full potential of quantum computing systems. Quantum programming is different from classical programming, for example, the state of a quantum program is probabilistic in nature, and a quantum computer is error-prone due to the instability of quantum mechanisms. Therefore, the characteristics of bugs in quantum software projects may be very different from that of classical software projects. This work aims to understand the characteristics of bugs in quantum software projects, in order to provide insights to help devise effective testing and debugging mechanisms. To achieve this goal, we conduct an empirical study on the bug reports of 125 quantum software projects. We observe that quantum software projects are more buggy than classical software projects and that quantum project bugs are more costly to fix than classical project bugs. We also identify the types of the bugs and the quantum programming components where they occurred. Our study shows that the bugs are spread across different components, but quantum-specific bugs particularly appear in the compiler, gate operation, and state preparation components. The three most occurring types of bugs are Program anomaly bugs, Configuration bugs, and Data type and structure bugs. Our study highlights some particularly challenging areas in quantum software development, such as the lack of scientific quantum computation libraries that implement comprehensive mathematical functions for quantum computing. Quantum developers also seek specialized data manipulation libraries for quantum software engineering like Numpy for quantum computing. Our findings also provide insights for future work to advance the quantum program development, testing, and debugging of quantum software, such as providing tooling support for debugging low-level circuits.

preprint2022arXiv

Data-access performance anti-patterns in data-intensive systems

Data-intensive systems handle variable, high volume, and high-velocity data generated by human and digital devices. Like traditional software, data-intensive systems are prone to technical debts introduced to cope-up with the pressure of time and resource constraints on developers. Data-access is a critical component of data-intensive systems as it determines the overall performance and functionality of such systems. While data access technical debts are getting attention from the research community, technical debts affecting the performance, are not well investigated. Objective: Identify, categorize, and validate data access performance issues in the context of NoSQL-based and polyglot persistence data-intensive systems using qualitative study. Method: We collect issues from NoSQL-based and polyglot persistence open-source data-intensive systems and identify data access performance issues using inductive coding and build a taxonomy of the root causes. Then, we validate the perceived relevance of the newly identified performance issues using a developer survey.

preprint2022arXiv

Do Developers Refactor Data Access Code? An Empirical Study

Developers often refactor code to improve the maintainability and comprehension of the software. There are many studies on refactoring activities in traditional software systems. However, refactoring in data-intensive systems is not well explored. Understanding the refactoring practices of developers is important to develop efficient tool support.We conducted a longitudinal study of refactoring activities in data access classes using 12 data-intensive subject systems. We investigated the prevalence and evolution of refactorings and the association of refactorings with data access smells. We also conducted a manual analysis of over 378 samples of data access refactoring instances to identify the functionalities of the code that are targeted by such refactorings. Our results show that (1) data access refactorings are prevalent and different in type. \textit{Rename variable} is the most prevalent data access refactoring. (2) The prevalence and type of refactorings vary as systems evolve in time. (3) Most data access refactorings target codes that implement data fetching and insertion. (4) Data access refactorings do not generally touch SQL queries. Overall, the results show that data access refactorings focus on improving the code quality but not the underlying data access operations. Hence, more work is needed from the research community on providing awareness and support to practitioners on the benefits of addressing data access smells with refactorings.

preprint2022arXiv

FIXME: Synchronize with Database An Empirical Study of Data Access Self-Admitted Technical Debt

Developers sometimes choose design and implementation shortcuts due to the pressure from tight release schedules. However, shortcuts introduce technical debt that increases as the software evolves. The debt needs to be repaid as fast as possible to minimize its impact on software development and software quality. Sometimes, technical debt is admitted by developers in comments and commit messages. Such debt is known as self-admitted technical debt (SATD). In data-intensive systems, where data manipulation is a critical functionality, the presence of SATD in the data access logic could seriously harm performance and maintainability. Understanding the composition and distribution of the SATDs across software systems and their evolution could provide insights into managing technical debt efficiently. We present a large-scale empirical study on the prevalence, composition, and evolution of SATD in data-intensive systems. We analyzed 83 open-source systems relying on relational databases as well as 19 systems relying on NoSQL databases. We detected SATD in source code comments obtained from different snapshots of the subject systems. To understand the evolution dynamics of SATDs, we conducted a survival analysis. Next, we performed a manual analysis of 361 sample data-access SATDs, investigating the composition of data-access SATDs and the reasons behind their introduction and removal. We identified 15 new SATD categories, out of which 11 are specific to database access operations. We found that most of the data-access SATDs are introduced in the later stages of change history rather than at the beginning. We also observed that bug fixing and refactoring are the main reasons behind the introduction of data-access SATDs.

preprint2022arXiv

Never trust, always verify : a roadmap for Trustworthy AI?

Artificial Intelligence (AI) is becoming the corner stone of many systems used in our daily lives such as autonomous vehicles, healthcare systems, and unmanned aircraft systems. Machine Learning is a field of AI that enables systems to learn from data and make decisions on new data based on models to achieve a given goal. The stochastic nature of AI models makes verification and validation tasks challenging. Moreover, there are intrinsic biaises in AI models such as reproductibility bias, selection bias (e.g., races, genders, color), and reporting bias (i.e., results that do not reflect the reality). Increasingly, there is also a particular attention to the ethical, legal, and societal impacts of AI. AI systems are difficult to audit and certify because of their black-box nature. They also appear to be vulnerable to threats; AI systems can misbehave when untrusted data are given, making them insecure and unsafe. Governments, national and international organizations have proposed several principles to overcome these challenges but their applications in practice are limited and there are different interpretations in the principles that can bias implementations. In this paper, we examine trust in the context of AI-based systems to understand what it means for an AI system to be trustworthy and identify actions that need to be undertaken to ensure that AI systems are trustworthy. To achieve this goal, we first review existing approaches proposed for ensuring the trustworthiness of AI systems, in order to identify potential conceptual gaps in understanding what trustworthy AI is. Then, we suggest a trust (resp. zero-trust) model for AI and suggest a set of properties that should be satisfied to ensure the trustworthiness of AI systems.

preprint2022arXiv

On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems

Code smells indicate software design problems that harm software quality. Data-intensive systems that frequently access databases often suffer from SQL code smells besides the traditional smells. While there have been extensive studies on traditional code smells, recently, there has been a growing interest in SQL code smells. In this paper, we conduct an empirical study to investigate the prevalence and evolution of SQL code smells in open-source, data-intensive systems. We collected 150 projects and examined both traditional and SQL code smells in these projects. Our investigation delivers several important findings. First, SQL code smells are indeed prevalent in data-intensive software systems. Second, SQL code smells have a weak co-occurrence with traditional code smells. Third, SQL code smells have a weaker association with bugs than that of traditional code smells. Fourth, SQL code smells are more likely to be introduced at the beginning of the project lifetime and likely to be left in the code without a fix, compared to traditional code smells. Overall, our results show that SQL code smells are indeed prevalent and persistent in the studied data-intensive software systems. Developers should be aware of these smells and consider detecting and refactoring SQL code smells and traditional code smells separately, using dedicated tools.

preprint2022arXiv

Physics-Guided Adversarial Machine Learning for Aircraft Systems Simulation

In the context of aircraft system performance assessment, deep learning technologies allow to quickly infer models from experimental measurements, with less detailed system knowledge than usually required by physics-based modeling. However, this inexpensive model development also comes with new challenges regarding model trustworthiness. This work presents a novel approach, physics-guided adversarial machine learning (ML), that improves the confidence over the physics consistency of the model. The approach performs, first, a physics-guided adversarial testing phase to search for test inputs revealing behavioral system inconsistencies, while still falling within the range of foreseeable operational conditions. Then, it proceeds with physics-informed adversarial training to teach the model the system-related physics domain foreknowledge through iteratively reducing the unwanted output deviations on the previously-uncovered counterexamples. Empirical evaluation on two aircraft system performance models shows the effectiveness of our adversarial ML approach in exposing physical inconsistencies of both models and in improving their propensity to be consistent with physics domain knowledge.

preprint2022arXiv

Quality issues in Machine Learning Software Systems

Context: An increasing demand is observed in various domains to employ Machine Learning (ML) for solving complex problems. ML models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs). Problem: There is a strong need for ensuring the serving quality of MLSSs. False or poor decisions of such systems can lead to malfunction of other systems, significant financial losses, or even threat to human life. The quality assurance of MLSSs is considered as a challenging task and currently is a hot research topic. Moreover, it is important to cover all various aspects of the quality in MLSSs. Objective: This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners. This empirical study aims to identify a catalog of bad-practices related to poor quality in MLSSs. Method: We plan to conduct a set of interviews with practitioners/experts, believing that interviews are the best method to retrieve their experience and practices when dealing with quality issues. We expect that the catalog of issues developed at this step will also help us later to identify the severity, root causes, and possible remedy for quality issues of MLSSs, allowing us to develop efficient quality assurance tools for ML models and MLSSs.

preprint2022arXiv

SmOOD: Smoothness-based Out-of-Distribution Detection Approach for Surrogate Neural Networks in Aircraft Design

Aircraft industry is constantly striving for more efficient design optimization methods in terms of human efforts, computation time, and resource consumption. Hybrid surrogate optimization maintains high results quality while providing rapid design assessments when both the surrogate model and the switch mechanism for eventually transitioning to the HF model are calibrated properly. Feedforward neural networks (FNNs) can capture highly nonlinear input-output mappings, yielding efficient surrogates for aircraft performance factors. However, FNNs often fail to generalize over the out-of-distribution (OOD) samples, which hinders their adoption in critical aircraft design optimization. Through SmOOD, our smoothness-based out-of-distribution detection approach, we propose to codesign a model-dependent OOD indicator with the optimized FNN surrogate, to produce a trustworthy surrogate model with selective but credible predictions. Unlike conventional uncertainty-grounded methods, SmOOD exploits inherent smoothness properties of the HF simulations to effectively expose OODs through revealing their suspicious sensitivities, thereby avoiding over-confident uncertainty estimates on OOD samples. By using SmOOD, only high-risk OOD inputs are forwarded to the HF model for re-evaluation, leading to more accurate results at a low overhead cost. Three aircraft performance models are investigated. Results show that FNN-based surrogates outperform their Gaussian Process counterparts in terms of predictive performance. Moreover, SmOOD does cover averagely 85% of actual OODs on all the study cases. When SmOOD plus FNN surrogates are deployed in hybrid surrogate optimization settings, they result in a decrease error rate of 34.65% and a computational speed up rate of 58.36 times, respectively.

preprint2022arXiv

Studying the Practices of Deploying Machine Learning Projects on Docker

Docker is a containerization service that allows for convenient deployment of websites, databases, applications' APIs, and machine learning (ML) models with a few lines of code. Studies have recently explored the use of Docker for deploying general software projects with no specific focus on how Docker is used to deploy ML-based projects. In this study, we conducted an exploratory study to understand how Docker is being used to deploy ML-based projects. As the initial step, we examined the categories of ML-based projects that use Docker. We then examined why and how these projects use Docker, and the characteristics of the resulting Docker images. Our results indicate that six categories of ML-based projects use Docker for deployment, including ML Applications, MLOps/ AIOps, Toolkits, DL Frameworks, Models, and Documentation. We derived the taxonomy of 21 major categories representing the purposes of using Docker, including those specific to models such as model management tasks (e.g., testing, training). We then showed that ML engineers use Docker images mostly to help with the platform portability, such as transferring the software across the operating systems, runtimes such as GPU, and language constraints. However, we also found that more resources may be required to run the Docker images for building ML-based software projects due to the large number of files contained in the image layers with deeply nested directories. We hope to shed light on the emerging practices of deploying ML software projects using containers and highlight aspects that should be improved.

preprint2022arXiv

Technical Debts and Faults in Open-source Quantum Software Systems: An Empirical Study

Quantum computing is a rapidly growing field attracting the interest of both researchers and software developers. Supported by its numerous open-source tools, developers can now build, test, or run their quantum algorithms. Although the maintenance practices for traditional software systems have been extensively studied, the maintenance of quantum software is still a new field of study but a critical part to ensure the quality of a whole quantum computing system. In this work, we set out to investigate the distribution and evolution of technical debts in quantum software and their relationship with fault occurrences. Understanding these problems could guide future quantum development and provide maintenance recommendations for the key areas where quantum software developers and researchers should pay more attention. In this paper, we empirically studied 118 open-source quantum projects, which were selected from GitHub. The projects are categorized into 10 categories. We found that the studied quantum software suffers from the issues of code convention violation, error-handling, and code design. We also observed a statistically significant correlation between code design, redundant code or code convention, and the occurrences of faults in quantum software.

preprint2022arXiv

Testing Feedforward Neural Networks Training Programs

Nowadays, we are witnessing an increasing effort to improve the performance and trustworthiness of Deep Neural Networks (DNNs), with the aim to enable their adoption in safety critical systems such as self-driving cars. Multiple testing techniques are proposed to generate test cases that can expose inconsistencies in the behavior of DNN models. These techniques assume implicitly that the training program is bug-free and appropriately configured. However, satisfying this assumption for a novel problem requires significant engineering work to prepare the data, design the DNN, implement the training program, and tune the hyperparameters in order to produce the model for which current automated test data generators search for corner-case behaviors. All these model training steps can be error-prone. Therefore, it is crucial to detect and correct errors throughout all the engineering steps of DNN-based software systems and not only on the resulting DNN model. In this paper, we gather a catalog of training issues and based on their symptoms and their effects on the behavior of the training program, we propose practical verification routines to detect the aforementioned issues, automatically, by continuously validating that some important properties of the learning dynamics hold during the training. Then, we design, TheDeepChecker, an end-to-end property-based debugging approach for DNN training programs. We assess the effectiveness of TheDeepChecker on synthetic and real-world buggy DL programs and compare it with Amazon SageMaker Debugger (SMD). Results show that TheDeepChecker's on-execution validation of DNN-based program's properties succeeds in revealing several coding bugs and system misconfigurations, early on and at a low cost. Moreover, TheDeepChecker outperforms the SMD's offline rules verification on training logs in terms of detection accuracy and DL bugs coverage.

preprint2022arXiv

The Different Faces of AI Ethics Across the World: A Principle-Implementation Gap Analysis

Artificial Intelligence (AI) is transforming our daily life with several applications in healthcare, space exploration, banking and finance. These rapid progresses in AI have brought increasing attention to the potential impacts of AI technologies on society, with ethically questionable consequences. In recent years, several ethical principles have been released by governments, national and international organisations. These principles outline high-level precepts to guide the ethical development, deployment, and governance of AI. However, the abstract nature, diversity, and context-dependency of these principles make them difficult to implement and operationalize, resulting in gaps between principles and their execution. Most recent work analysed and summarized existing AI principles and guidelines but they did not provide findings on principle-implementation gaps and how to mitigate them. These findings are particularly important to ensure that AI implementations are aligned with ethical principles and values. In this paper, we provide a contextual and global evaluation of current ethical AI principles for all continents, with the aim to identify potential principle characteristics tailored to specific countries or applicable across countries. Next, we analyze the current level of AI readiness and current implementations of ethical AI principles in different countries, to identify gaps in the implementation of AI principles and their causes. Finally, we propose recommendations to mitigate the principle-implementation gaps.

preprint2022arXiv

Threat Assessment in Machine Learning based Systems

Machine learning is a field of artificial intelligence (AI) that is becoming essential for several critical systems, making it a good target for threat actors. Threat actors exploit different Tactics, Techniques, and Procedures (TTPs) against the confidentiality, integrity, and availability of Machine Learning (ML) systems. During the ML cycle, they exploit adversarial TTPs to poison data and fool ML-based systems. In recent years, multiple security practices have been proposed for traditional systems but they are not enough to cope with the nature of ML-based systems. In this paper, we conduct an empirical study of threats reported against ML-based systems with the aim to understand and characterize the nature of ML threats and identify common mitigation strategies. The study is based on 89 real-world ML attack scenarios from the MITRE's ATLAS database, the AI Incident Database, and the literature; 854 ML repositories from the GitHub search and the Python Packaging Advisory database, selected based on their reputation. Attacks from the AI Incident Database and the literature are used to identify vulnerabilities and new types of threats that were not documented in ATLAS. Results show that convolutional neural networks were one of the most targeted models among the attack scenarios. ML repositories with the largest vulnerability prominence include TensorFlow, OpenCV, and Notebook. In this paper, we also report the most frequent vulnerabilities in the studied ML repositories, the most targeted ML phases and models, the most used TTPs in ML phases and attack scenarios. This information is particularly important for red/blue teams to better conduct attacks/defenses, for practitioners to prevent threats during ML development, and for researchers to develop efficient defense mechanisms.

preprint2022arXiv

Understanding Quantum Software Engineering Challenges An Empirical Study on Stack Exchange Forums and GitHub Issues

With the advance in quantum computing, quantum software becomes critical for exploring the full potential of quantum computing systems. Recently, quantum software engineering (QSE) becomes an emerging area attracting more and more attention. However, it is not clear what are the challenges and opportunities of quantum computing facing the software engineering community. This work aims to understand the QSE-related challenges perceived by developers. We perform an empirical study on Stack Exchange forums where developers post-QSE-related questions & answers and Github issue reports where developers raise QSE-related issues in practical quantum computing projects. Based on an existing taxonomy of question types on Stack Overflow, we first perform a qualitative analysis of the types of QSE-related questions asked on Stack Exchange forums. We then use automated topic modeling to uncover the topics in QSE-related Stack Exchange posts and GitHub issue reports. Our study highlights some particularly challenging areas of QSE that are different from that of traditional software engineering, such as explaining the theory behind quantum computing code, interpreting quantum program outputs, and bridging the knowledge gap between quantum computing and classical computing, as well as their associated opportunities.

preprint2021arXiv

An Empirical Study of C++ Vulnerabilities in Crowd-Sourced Code Examples

Software developers share programming solutions in Q&A sites like Stack Overflow. The reuse of crowd-sourced code snippets can facilitate rapid prototyping. However, recent research shows that the shared code snippets may be of low quality and can even contain vulnerabilities. This paper aims to understand the nature and the prevalence of security vulnerabilities in crowd-sourced code examples. To achieve this goal, we investigate security vulnerabilities in the C++ code snippets shared on Stack Overflow over a period of 10 years. In collaborative sessions involving multiple human coders, we manually assessed each code snippet for security vulnerabilities following CWE (Common Weakness Enumeration) guidelines. From the 72,483 reviewed code snippets used in at least one project hosted on GitHub, we found a total of 69 vulnerable code snippets categorized into 29 types. Many of the investigated code snippets are still not corrected on Stack Overflow. The 69 vulnerable code snippets found in Stack Overflow were reused in a total of 2859 GitHub projects. To help improve the quality of code snippets shared on Stack Overflow, we developed a browser extension that allow Stack Overflow users to check for vulnerabilities in code snippets when they upload them on the platform.

preprint2021arXiv

Automatic API Usage Scenario Documentation from Technical Q&A Sites

The online technical Q&A site Stack Overflow (SO) is popular among developers to support their coding and diverse development needs. To address shortcomings in API official documentation resources, several research has thus focused on augmenting official API documentation with insights (e.g., code examples) from SO. The techniques propose to add code examples/insights about APIs into its official documentation. Reviews are opinionated sentences with positive/negative sentiments. However, we are aware of no previous research that attempts to automatically produce API documentation from SO by considering both API code examples and reviews. In this paper, we present two novel algorithms that can be used to automatically produce API documentation from SO by combining code examples and reviews towards those examples. The first algorithm is called statistical documentation, which shows the distribution of positivity and negativity around the code examples of an API using different metrics (e.g., star ratings). The second algorithm is called concept-based documentation, which clusters similar and conceptually relevant usage scenarios. An API usage scenario contains a code example, a textual description of the underlying task addressed by the code example, and the reviews (i.e., opinions with positive and negative sentiments) from other developers towards the code example. We deployed the algorithms in Opiner, a web-based platform to aggregate information about APIs from online forums. We evaluated the algorithms by mining all Java JSON-based posts in SO and by conducting three user studies based on produced documentation from the posts.

preprint2021arXiv

How to Certify Machine Learning Based Safety-critical Systems? A Systematic Literature Review

Context: Machine Learning (ML) has been at the heart of many innovations over the past years. However, including it in so-called 'safety-critical' systems such as automotive or aeronautic has proven to be very challenging, since the shift in paradigm that ML brings completely changes traditional certification approaches. Objective: This paper aims to elucidate challenges related to the certification of ML-based safety-critical systems, as well as the solutions that are proposed in the literature to tackle them, answering the question 'How to Certify Machine Learning Based Safety-critical Systems?'. Method: We conduct a Systematic Literature Review (SLR) of research papers published between 2015 to 2020, covering topics related to the certification of ML systems. In total, we identified 217 papers covering topics considered to be the main pillars of ML certification: Robustness, Uncertainty, Explainability, Verification, Safe Reinforcement Learning, and Direct Certification. We analyzed the main trends and problems of each sub-field and provided summaries of the papers extracted. Results: The SLR results highlighted the enthusiasm of the community for this subject, as well as the lack of diversity in terms of datasets and type of models. It also emphasized the need to further develop connections between academia and industries to deepen the domain study. Finally, it also illustrated the necessity to build connections between the above mention main pillars that are for now mainly studied separately. Conclusion: We highlighted current efforts deployed to enable the certification of ML based software systems, and discuss some future research directions.

preprint2021arXiv

Machine Learning Application Development: Practitioners' Insights

Nowadays, intelligent systems and services are getting increasingly popular as they provide data-driven solutions to diverse real-world problems, thanks to recent breakthroughs in Artificial Intelligence (AI) and Machine Learning (ML). However, machine learning meets software engineering not only with promising potentials but also with some inherent challenges. Despite some recent research efforts, we still do not have a clear understanding of the challenges of developing ML-based applications and the current industry practices. Moreover, it is unclear where software engineering researchers should focus their efforts to better support ML application developers. In this paper, we report about a survey that aimed to understand the challenges and best practices of ML application development. We synthesize the results obtained from 80 practitioners (with diverse skills, experience, and application domains) into 17 findings; outlining challenges and best practices for ML application development. Practitioners involved in the development of ML-based software systems can leverage the summarized best practices to improve the quality of their system. We hope that the reported challenges will inform the research community about topics that need to be investigated to improve the engineering process and the quality of ML-based applications.

preprint2021arXiv

Mining API Usage Scenarios from Stack Overflow

We propose a framework to mine API usage scenarios from Stack Overflow. Each task consists of a code example, the task description, and the reactions of developers towards the code example. First, we present an algorithm to automatically link a code example in a forum post to an API mentioned in the textual contents of the forum post. Second, we generate a natural language description of the task by summarizing the discussions around the code example. Third, we automatically associate developers reactions (i.e., positive and negative opinions) towards the code example to offer information about code quality. We evaluate the algorithms using three benchmarks.

preprint2021arXiv

Understanding How and Why Developers Seek and Analyze API-related Opinions

With the advent and proliferation of online developer forums as informal documentation, developers often share their opinions about the APIs they use. Thus, opinions of others often shape the developer's perception and decisions related to software development. For example, the choice of an API or how to reuse the functionality the API offers are, to a considerable degree, conditioned upon what other developers think about the API. While many developers refer to and rely on such opinion-rich information about APIs, we found little research that investigates the use and benefits of public opinions. To understand how developers seek and evaluate API opinions, we conducted two surveys involving a total of 178 software developers. We analyzed the data in two dimensions, each corresponding to specific needs related to API reviews: (1) Needs for seeking API reviews, and (2) Needs for automated tool support to assess the reviews. We observed that developers seek API reviews and often have to summarize those for diverse development needs (e.g., API suitability). Developers also make conscious efforts to judge the trustworthiness of the provided opinions and believe that automated tool support for API reviews analysis can assist in diverse development scenarios, including, for example, saving time in API selection as well as making informed decisions on a particular API features.

preprint2020arXiv

A Large Scale Empirical Study of the Impact of Spaghetti Code and Blob Anti-patterns on Program Comprehension

Context: Several studies investigated the impact of anti-patterns (i.e., "poor" solutions to recurring design problems) during maintenance activities and reported that anti-patterns significantly affect the developers' effort required to edit files. However, before developers edit files, they must understand the source code of the systems. This source code must be easy to understand by developers. Objective: In this work, we provide a complete assessment of the impact of two instances of two anti-patterns, Blob or Spaghetti Code, on program comprehension. Method: We analyze the impact of these two anti-patterns through three empirical studies conducted at Polytechnique Montréal (Canada) with 24 participants; at Carlton University (Canada) with 30 participants; and at University Basilicata (Italy) with 79 participants. Results: We collect data from 372 tasks obtained thanks to 133 different participants from the three universities. We use three metrics to assess the developers' comprehension of the source code: (1) the duration to complete each task; (2) their percentage of correct answers; and, (3) the NASA task load index for their effort. Conclusions: We report that, although single occurrences of Blob or Spaghetti code anti-patterns have little effect on code comprehension, two occurrences of either Blob or Spaghetti Code significantly increases the developers' time spent in their tasks, reduce their percentage of correct answers, and increase their effort. Hence, we recommend that developers act on both anti-patterns, which should be refactored out of the source code whenever possible. We also recommend further studies on combinations of anti-patterns rather than on single anti-patterns one at a time.

preprint2020arXiv

Are Multi-language Design Smells Fault-prone? An Empirical Study

Nowadays, modern applications are developed using components written in different programming languages. These systems introduce several advantages. However, as the number of languages increases, so does the challenges related to the development and maintenance of these systems. In such situations, developers may introduce design smells (i.e., anti-patterns and code smells) which are symptoms of poor design and implementation choices. Design smells are defined as poor design and coding choices that can negatively impact the quality of a software program despite satisfying functional requirements. Studies on mono-language systems suggest that the presence of design smells affects code comprehension, thus making systems harder to maintain. However, these studies target only mono-language systems and do not consider the interaction between different programming languages. In this paper, we present an approach to detect multi-language design smells in the context of JNI systems. We then investigate the prevalence of those design smells. Specifically, we detect 15 design smells in 98 releases of nine open-source JNI projects. Our results show that the design smells are prevalent in the selected projects and persist throughout the releases of the systems. We observe that in the analyzed systems, 33.95% of the files involving communications between Java and C/C++ contains occurrences of multi-language design smells. Some kinds of smells are more prevalent than others, e.g., Unused Parameters, Too Much Scattering, Unused Method Declaration. Our results suggest that files with multi-language design smells can often be more associated with bugs than files without these smells, and that specific smells are more correlated to fault-proneness than others.