Source author record

Markus Borg

Markus Borg appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Software Engineering Artificial Intelligence Computer Vision Computation and Language cs.CY Machine Learning Performance

Catalog footprint

What is connected

18works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics

We are entering a hybrid era in which human developers and AI coding agents work in the same codebases. While industry practice has long optimized code for human comprehension, it is increasingly important to ensure that LLMs with different capabilities can edit code reliably. In this study, we investigate the concept of ``AI-friendly code'' via LLM-based refactoring on a dataset of 5,000 Python files from competitive programming. We find a meaningful association between CodeHealth, a quality metric calibrated for human comprehension, and semantic preservation after AI refactoring. Our findings confirm that human-friendly code is also more compatible with AI tooling. These results suggest that organizations can use CodeHealth to guide where AI interventions are lower risk and where additional human oversight is warranted. Investing in maintainability not only helps humans; it also prepares for large-scale AI adoption.

preprint2022arXiv

Code Red: The Business Impact of Code Quality -- A Quantitative Study of 39 Proprietary Production Codebases

Code quality remains an abstract concept that fails to get traction at the business level. Consequently, software companies keep trading code quality for time-to-market and new features. The resulting technical debt is estimated to waste up to 42% of developers' time. At the same time, there is a global shortage of software developers, meaning that developer productivity is key to software businesses. Our overall mission is to make code quality a business concern, not just a technical aspect. Our first goal is to understand how code quality impacts 1) the number of reported defects, 2) the time to resolve issues, and 3) the predictability of resolving issues on time. We analyze 39 proprietary production codebases from a variety of domains using the CodeScene tool based on a combination of source code analysis, version-control mining, and issue information from Jira. By analyzing activity in 30,737 files, we find that low quality code contains 15 times more defects than high quality code. Furthermore, resolving issues in low quality code takes on average 124% more time in development. Finally, we report that issue resolutions in low quality code involve higher uncertainty manifested as 9 times longer maximum cycle times. This study provides evidence that code quality cannot be dismissed as a technical concern. With 15 times fewer defects, twice the development speed, and substantially more predictable issue resolution times, the business advantage of high quality code should be unmistakably clear.

preprint2022arXiv

Exploring ML testing in practice -- Lessons learned from an interactive rapid review with Axis Communications

There is a growing interest in industry and academia in machine learning (ML) testing. We believe that industry and academia need to learn together to produce rigorous and relevant knowledge. In this study, we initiate a collaboration between stakeholders from one case company, one research institute, and one university. To establish a common view of the problem domain, we applied an interactive rapid review of the state of the art. Four researchers from Lund University and RISE Research Institutes and four practitioners from Axis Communications reviewed a set of 180 primary studies on ML testing. We developed a taxonomy for the communication around ML testing challenges and results and identified a list of 12 review questions relevant for Axis Communications. The three most important questions (data testing, metrics for assessment, and test generation) were mapped to the literature, and an in-depth analysis of the 35 primary studies matching the most important question (data testing) was made. A final set of the five best matches were analysed and we reflect on the criteria for applicability and relevance for the industry. The taxonomies are helpful for communication but not final. Furthermore, there was no perfect match to the case company's investigated review question (data testing). However, we extracted relevant approaches from the five studies on a conceptual level to support later context-specific improvements. We found the interactive rapid review approach useful for triggering and aligning communication between the different stakeholders.

preprint2022arXiv

Performance Analysis of Out-of-Distribution Detection on Trained Neural Networks

Several areas have been improved with Deep Learning during the past years. Implementing Deep Neural Networks (DNN) for non-safety related applications have shown remarkable achievements over the past years; however, for using DNNs in safety critical applications, we are missing approaches for verifying the robustness of such models. A common challenge for DNNs occurs when exposed to out-of-distribution samples that are outside of the scope of a DNN, but which result in high confidence outputs despite no prior knowledge of such input. In this paper, we analyze three methods that separate between in- and out-of-distribution data, called supervisors, on four well-known DNN architectures. We find that the outlier detection performance improves with the quality of the model. We also analyse the performance of the particular supervisors during the training procedure by applying the supervisor at a predefined interval to investigate its performance as the training proceeds. We observe that understanding the relationship between training results and supervisor performance is crucial to improve the model's robustness and to indicate, what input samples require further measures to improve the robustness of a DNN. In addition, our work paves the road towards an instrument for safety argumentation for safety critical applications. This paper is an extended version of our previous work presented at 2019 SEAA (cf. [1]); here, we elaborate on the used metrics, add an additional supervisor and test them on two additional datasets.

preprint2022arXiv

Quality Assurance of Generative Dialog Models in an Evolving Conversational Agent Used for Swedish Language Practice

Due to the migration megatrend, efficient and effective second-language acquisition is vital. One proposed solution involves AI-enabled conversational agents for person-centered interactive language practice. We present results from ongoing action research targeting quality assurance of proprietary generative dialog models trained for virtual job interviews. The action team elicited a set of 38 requirements for which we designed corresponding automated test cases for 15 of particular interest to the evolving solution. Our results show that six of the test case designs can detect meaningful differences between candidate models. While quality assurance of natural language processing applications is complex, we provide initial steps toward an automated framework for machine learning model selection in the context of an evolving conversational agent. Future work will focus on model selection in an MLOps setting.

preprint2021arXiv

Digital Twins Are Not Monozygotic -- Cross-Replicating ADAS Testing in Two Industry-Grade Automotive Simulators

The increasing levels of software- and data-intensive driving automation call for an evolution of automotive software testing. As a recommended practice of the Verification and Validation (V&V) process of ISO/PAS 21448, a candidate standard for safety of the intended functionality for road vehicles, simulation-based testing has the potential to reduce both risks and costs. There is a growing body of research on devising test automation techniques using simulators for Advanced Driver-Assistance Systems (ADAS). However, how similar are the results if the same test scenarios are executed in different simulators? We conduct a replication study of applying a Search-Based Software Testing (SBST) solution to a real-world ADAS (PeVi, a pedestrian vision detection system) using two different commercial simulators, namely, TASS/Siemens PreScan and ESI Pro-SiVIC. Based on a minimalistic scene, we compare critical test scenarios generated using our SBST solution in these two simulators. We show that SBST can be used to effectively and efficiently generate critical test scenarios in both simulators, and the test results obtained from the two simulators can reveal several weaknesses of the ADAS under test. However, executing the same test scenarios in the two simulators leads to notable differences in the details of the test outputs, in particular, related to (1) safety violations revealed by tests, and (2) dynamics of cars and pedestrians. Based on our findings, we recommend future V&V plans to include multiple simulators to support robust simulation-based testing and to base test objectives on measures that are less dependant on the internals of the simulators.

preprint2021arXiv

Test Automation with Grad-CAM Heatmaps -- A Future Pipe Segment in MLOps for Vision AI?

Machine Learning (ML) is a fundamental part of modern perception systems. In the last decade, the performance of computer vision using trained deep neural networks has outperformed previous approaches based on careful feature engineering. However, the opaqueness of large ML models is a substantial impediment for critical applications such as in the automotive context. As a remedy, Gradient-weighted Class Activation Mapping (Grad-CAM) has been proposed to provide visual explanations of model internals. In this paper, we demonstrate how Grad-CAM heatmaps can be used to increase the explainability of an image recognition model trained for a pedestrian underpass. We argue how the heatmaps support compliance to the EU's seven key requirements for Trustworthy AI. Finally, we propose adding automated heatmap analysis as a pipe segment in an MLOps pipeline. We believe that such a building block can be used to automatically detect if a trained ML-model is activated based on invalid pixels in test images, suggesting biased models.

preprint2020arXiv

An Autonomous Performance Testing Framework using Self-Adaptive Fuzzy Reinforcement Learning

Test automation brings the potential to reduce costs and human effort, but several aspects of software testing remain challenging to automate. One such example is automated performance testing to find performance breaking points. Current approaches to tackle automated generation of performance test cases mainly involve using source code or system model analysis or use-case based techniques. However, source code and system models might not always be available at testing time. On the other hand, if the optimal performance testing policy for the intended objective in a testing process instead could be learned by the testing system, then test automation without advanced performance models could be possible. Furthermore, the learned policy could later be reused for similar software systems under test, thus leading to higher test efficiency. We propose SaFReL, a self-adaptive fuzzy reinforcement learning-based performance testing framework. SaFReL learns the optimal policy to generate performance test cases through an initial learning phase, then reuses it during a transfer learning phase, while keeping the learning running and updating the policy in the long term. Through multiple experiments on a simulated environment, we demonstrate that our approach generates the target performance test cases for different programs more efficiently than a typical testing process, and performs adaptively without access to source code and performance models.

preprint2020arXiv

Enabling Image Recognition on Constrained Devices Using Neural Network Pruning and a CycleGAN

Smart cameras are increasingly used in surveillance solutions in public spaces. Contemporary computer vision applications can be used to recognize events that require intervention by emergency services. Smart cameras can be mounted in locations where citizens feel particularly unsafe, e.g., pathways and underpasses with a history of incidents. One promising approach for smart cameras is edge AI, i.e., deploying AI technology on IoT devices. However, implementing resource-demanding technology such as image recognition using deep neural networks (DNN) on constrained devices is a substantial challenge. In this paper, we explore two approaches to reduce the need for compute in contemporary image recognition in an underpass. First, we showcase successful neural network pruning, i.e., we retain comparable classification accuracy with only 1.1\% of the neurons remaining from the state-of-the-art DNN architecture. Second, we demonstrate how a CycleGAN can be used to transform out-of-distribution images to the operational design domain. We posit that both pruning and CycleGANs are promising enablers for efficient edge AI in smart cameras.

preprint2020arXiv

Illuminating a Blind Spot in Digitalization -- Software Development in Sweden's Private and Public Sector

As Netscape co-founder Marc Andreessen famously remarked in 2011, software is eating the world - becoming a pervasive invisible critical infrastructure. Data on the distribution of software use and development in society is scarce, but we compile results from two novel surveys to provide a fuller picture of the role software plays in the public and private sectors in Sweden, respectively. Three out of ten Swedish firms, across industry sectors, develop software in-house. The corresponding figure for Sweden's government agencies is four out of ten, i.e., the public sector should not be underestimated. The digitalization of society will continue, thus the demand for software developers will further increase. Many private firms report that the limited supply of software developers in Sweden is directly affecting their expansion plans. Based on our findings, we outline directions that need additional research to allow evidence-informed policy-making. We argue that such work should ideally be conducted by academic researchers and national statistics agencies in collaboration.

preprint2020arXiv

Making Lab Sessions Mandatory -- On Student Work Distribution in a Gamified Project Course on Market-Driven Software Engineering

Unfair work distribution in student teams is a common issue in project-based learning. One contributing factor is that students are differently skilled developers. In a course with group work intertwining engineering and business aspects, we designed an intervention to help novice programmers, i.e., we introduced mandatory programming lab sessions. However, the intervention did not affect the work distribution, showing that more is needed to balance the workload. Contrary to our goal, the intervention was very well received among experienced students, but unpopular with students weak at programming.

preprint2020arXiv

Practical relevance of software engineering research: Synthesizing the community's voice

Software engineering (SE) research should be relevant to industrial practice. There have been regular discussions in the SE community on this issue since the 1980's, led by pioneers such as Robert Glass. As we recently passed the milestone of "50 years of software engineering", some recent positive efforts have been made in this direction, e.g., establishing "industrial" tracks in several SE conferences. However, many researchers and practitioners believe that we, as a community, are still struggling with research relevance and utility. The goal of this paper is to synthesize the evidence and experience-based opinions shared on this topic so far in the SE community, and to encourage the community to further reflect and act on the research relevance. For this purpose, we have conducted a Multi-vocal Literature Review (MLR) of 54 systematically-selected sources (papers and non peer-reviewed articles). Instead of relying on and considering the individual opinions on research relevance, mentioned in each of the sources, the MLR aims to synthesize and provide the "holistic" view on the topic. The highlights of our MLR findings are as follows. The top three root causes of low relevance, discussed in the community, are: (1) Researchers having simplistic views (or wrong assumptions) about SE in practice; (2) Lack of connection with industry; and (3) Wrong identification of research problems. The top three suggestions for improving research relevance are: (1) Using appropriate research approaches such as action-research; (2) Choosing relevant research problems; and (3) Collaborating with industry. By synthesizing all the discussions on this important topic so far, this paper aims to encourage further discussions and actions in the community to increase our collective efforts to improve the research relevance.

preprint2020arXiv

The AIQ Meta-Testbed: Pragmatically Bridging Academic AI Testing and Industrial Q Needs

AI solutions seem to appear in any and all application domains. As AI becomes more pervasive, the importance of quality assurance increases. Unfortunately, there is no consensus on what artificial intelligence means and interpretations range from simple statistical analysis to sentient humanoid robots. On top of that, quality is a notoriously hard concept to pinpoint. What does this mean for AI quality? In this paper, we share our working definition and a pragmatic approach to address the corresponding quality assurance with a focus on testing. Finally, we present our ongoing work on establishing the AIQ Meta-Testbed.

preprint2016arXiv

Advancing Trace Recovery Evaluation - Applied Information Retrieval in a Software Engineering Context

Successful development of software systems involves efficient navigation among software artifacts. One state-of-practice approach to structure information is to establish trace links between artifacts, a practice that is also enforced by several development standards. Unfortunately, manually maintaining trace links in an evolving system is a tedious task. To tackle this issue, several researchers have proposed treating the capture and recovery of trace links as an Information Retrieval (IR) problem. The work contains a Systematic Literature Review (SLR) of previous evaluations of IR-based trace recovery. We show that a majority of previous evaluations have been technology-oriented, conducted in "the cave of IR evaluation", using small datasets as experimental input. Also, software artifacts originating from student projects have frequently been used in evaluations. We conducted a survey among traceability researchers, and found that a majority consider student artifacts to be only partly representative to industrial counterparts. Our findings call for additional case studies to evaluate IR-based trace recovery within the full complexity of an industrial setting. Also, this thesis contributes to the body of empirical evidence of IR-based trace recovery in two experiments with industrial software artifacts. The technology-oriented experiment highlights the clear dependence between datasets and the accuracy of IR-based trace recovery, in line with findings from the SLR. The human-oriented experiment investigates how different quality levels of tool output affect the tracing accuracy of engineers. Finally, we present how tools and methods are evaluated in the general field of IR research, and propose a taxonomy of evaluation contexts tailored for IR-based trace recovery.

preprint2016arXiv

An Industrial Case Study on Measuring the Quality of the Requirements Scoping Process

Decision making and requirements scoping occupy central roles in helping to develop products that are demanded by the customers and ensuring company strategies are accurately realized in product scope. Many companies experience continuous and frequent scope changes and fluctuations but struggle to measure the phenomena and correlate the measurement to the quality of the requirements process. We present the results from an exploratory interview study among 22 participants working with requirements management processes at a large company that develops embedded systems for a global market. Our respondents shared their opinions about the current set of requirements management process metrics as well as what additional metrics they envisioned as useful. We present a set of metrics that describe the quality of the requirements scoping process. The findings provide practical insights that can be used as input when introducing new measurement programs for requirements management and decision making.

preprint2016arXiv

Practitioners' Perspectives on Change Impact Analysis for Safety-Critical Software - A Preliminary Analysis

Safety standards prescribe change impact analysis (CIA) during evolution of safety-critical software systems. Although CIA is a fundamental activity, there is a lack of empirical studies about how it is performed in practice. We present a case study on CIA in the context of an evolving automation system, based on 14 interviews in Sweden and India. Our analysis suggests that engineers on average spend 50-100 hours on CIA per year, but the effort varies considerably with the phases of projects. Also, the respondents presented different connotations to CIA and perceived the importance of CIA differently. We report the most pressing CIA challenges, and several ideas on how to support future CIA. However, we show that measuring the effect of such improvement solutions is non-trivial, as CIA is intertwined with other development activities. While this paper only reports preliminary results, our work contributes empirical insights into practical CIA.

preprint2016arXiv

Testing Quality Requirements of a System-of-Systems in the Public Sector - Challenges and Potential Remedies

Quality requirements is a difficult concept in software projects, and testing software qualities is a well-known challenge. Without proper management of quality requirements, there is an increased risk that the software product under development will not meet the expectations of its future users. In this paper, we share experiences from testing quality requirements when developing a large system-of-systems in the public sector in Sweden. We complement the experience reporting by analyzing documents from the case under study. As a final step, we match the identified challenges with solution proposals from the literature. We report five main challenges covering inadequate requirements engineering and disconnected test managers. Finally, we match the challenges to solutions proposed in the scientific literature, including integrated requirements engineering, the twin peaks model, virtual plumblines, the QUPER model, and architecturally significant requirements. Our experiences are valuable to other large development projects struggling with testing of quality requirements. Furthermore, the report could be used by as input to process improvement activities in the case under study.

preprint2014arXiv

Workshop Summary of the 1st International Workshop on Requirements and Testing (RET'14)

The main objective of the RET workshop was to explore the interaction of Requirements Engineering (RE) and Testing, i.e. RET, in research and industry, and the challenges that result from this interaction. While much work has been done in the respective fields of requirements engineering and testing, there exists much more than can be done to understand the connection between the processes of RE and of testing.

Markus Borg

What is connected

Connect this record

See the researcher in context

Building this map preview

18 published item(s)

Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics

Code Red: The Business Impact of Code Quality -- A Quantitative Study of 39 Proprietary Production Codebases

Exploring ML testing in practice -- Lessons learned from an interactive rapid review with Axis Communications

Performance Analysis of Out-of-Distribution Detection on Trained Neural Networks

Quality Assurance of Generative Dialog Models in an Evolving Conversational Agent Used for Swedish Language Practice

Digital Twins Are Not Monozygotic -- Cross-Replicating ADAS Testing in Two Industry-Grade Automotive Simulators

Test Automation with Grad-CAM Heatmaps -- A Future Pipe Segment in MLOps for Vision AI?

An Autonomous Performance Testing Framework using Self-Adaptive Fuzzy Reinforcement Learning

Enabling Image Recognition on Constrained Devices Using Neural Network Pruning and a CycleGAN

Illuminating a Blind Spot in Digitalization -- Software Development in Sweden's Private and Public Sector

Making Lab Sessions Mandatory -- On Student Work Distribution in a Gamified Project Course on Market-Driven Software Engineering

Practical relevance of software engineering research: Synthesizing the community's voice

The AIQ Meta-Testbed: Pragmatically Bridging Academic AI Testing and Industrial Q Needs

Advancing Trace Recovery Evaluation - Applied Information Retrieval in a Software Engineering Context

An Industrial Case Study on Measuring the Quality of the Requirements Scoping Process

Practitioners' Perspectives on Change Impact Analysis for Safety-Critical Software - A Preliminary Analysis

Testing Quality Requirements of a System-of-Systems in the Public Sector - Challenges and Potential Remedies

Workshop Summary of the 1st International Workshop on Requirements and Testing (RET'14)