Researcher profile

Andreas Vogelsang

Andreas Vogelsang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
15works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

15 published item(s)

preprint2026arXiv

Context-Adaptive Requirements Defect Prediction through Human-LLM Collaboration

Automated requirements assessment traditionally relies on universal patterns as proxies for defectiveness, implemented through rule-based heuristics or machine learning classifiers trained on large annotated datasets. However, what constitutes a "defect" is inherently context-dependent and varies across projects, domains, and stakeholder interpretations. In this paper, we propose a Human-LLM Collaboration (HLC) approach that treats defect prediction as an adaptive process rather than a static classification task. HLC leverages LLM Chain-of-Thought reasoning in a feedback loop: users validate predictions alongside their explanations, and these validated examples adaptively guide future predictions through few-shot learning. We evaluate this approach using the weak word smell on the QuRE benchmark of 1,266 annotated Mercedes-Benz requirements. Our results show that HLC effectively adapts to the provision of validated examples, with rapid performance gains from as few as 20 validated examples. Incorporating validated explanations, not just labels, enables HLC to substantially outperform both standard few-shot prompting and fine-tuned BERT models while maintaining high recall. These results highlight how the in-context and Chain-of-Thought learning capabilities of LLMs enable adaptive classification approaches that move beyond one-size-fits-all models, creating opportunities for tools that learn continuously from stakeholder feedback.

preprint2026arXiv

From Issues to Insights: RAG-based Explanation Generation from Software Engineering Artifacts

The increasing complexity of modern software systems has made understanding their behavior increasingly challenging, driving the need for explainability to improve transparency and user trust. Traditional documentation is often outdated or incomplete, making it difficult to derive accurate, context-specific explanations. Meanwhile, issue-tracking systems capture rich and continuously updated development knowledge, but their potential for explainability remains untapped. With this work, we are the first to apply a Retrieval-Augmented Generation (RAG) approach for generating explanations from issue-tracking data. Our proof-of-concept system is implemented using open-source tools and language models, demonstrating the feasibility of leveraging structured issue data for explanation generation. Evaluating our approach on an exemplary project's set of GitHub issues, we achieve 90% alignment with human-written explanations. Additionally, our system exhibits strong faithfulness and instruction adherence, ensuring reliable and grounded explanations. These findings suggest that RAG-based methods can extend explainability beyond black-box ML models to a broader range of software systems, provided that issue-tracking data is available - making system behavior more accessible and interpretable.

preprint2026arXiv

Reporting LLM Prompting in Automated Software Engineering: A Guideline Based on Current Practices and Expectations

Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a critical factor in system performance and behavior. Despite their growing role in SE research, prompt-related decisions are rarely documented in a systematic or transparent manner, hindering reproducibility and comparability across studies. To address this gap, we conducted a two-phase empirical study. First, we analyzed nearly 300 papers published at the top-3 SE conferences since 2022 to assess how prompt design, testing, and optimization are currently reported. Second, we surveyed 105 program committee members from these conferences to capture their expectations for prompt reporting in LLM-driven research. Based on the findings, we derived a structured guideline that distinguishes essential, desirable, and exceptional reporting elements. Our results reveal significant misalignment between current practices and reviewer expectations, particularly regarding version disclosure, prompt justification, and threats to validity. We present our guideline as a step toward improving transparency, reproducibility, and methodological rigor in LLM-based SE research.

preprint2022arXiv

How Do Drivers Self-Regulate their Secondary Task Engagements? The Effect of Driving Automation on Touchscreen Interactions and Glance Behavior

With ever-improving driver assistance systems and large touchscreens becoming the main in-vehicle interface, drivers are more tempted than ever to engage in distracting non-driving-related tasks. However, little research exists on how driving automation affects drivers' self-regulation when interacting with center stack touchscreens. To investigate this, we employ multilevel models on a real-world driving dataset consisting of 10,139 sequences. Our results show significant differences in drivers' interaction and glance behavior in response to varying levels of driving automation, vehicle speed, and road curvature. During partially automated driving, drivers are not only more likely to engage in secondary touchscreen tasks, but their mean glance duration toward the touchscreen also increases by 12% (Level 1) and 20% (Level 2) compared to manual driving. We further show that the effect of driving automation on drivers' self-regulation is larger than that of vehicle speed and road curvature. The derived knowledge can facilitate the safety evaluation of infotainment systems and the development of context-aware driver monitoring systems.

preprint2022arXiv

ICEBOAT: An Interactive User Behavior Analysis Tool for Automotive User Interfaces

In this work, we present ICEBOAT an interactive tool that enables automotive UX experts to explore how users interact with In-vehicle Information Systems. Based on large naturalistic driving data continuously collected from production line vehicles, ICEBOAT visualizes drivers' interactions and driving behavior on different levels of detail. Hence, it allows to easily compare different user flows based on performance- and safety-related metrics.

preprint2021arXiv

Automatic Detection of Causality in Requirement Artifacts: the CiRA Approach

System behavior is often expressed by causal relations in requirements (e.g., If event 1, then event 2). Automatically extracting this embedded causal knowledge supports not only reasoning about requirements dependencies, but also various automated engineering tasks such as seamless derivation of test cases. However, causality extraction from natural language is still an open research challenge as existing approaches fail to extract causality with reasonable performance. We understand causality extraction from requirements as a two-step problem: First, we need to detect if requirements have causal properties or not. Second, we need to understand and extract their causal relations. At present, though, we lack knowledge about the form and complexity of causality in requirements, which is necessary to develop a suitable approach addressing these two problems. We conduct an exploratory case study with 14,983 sentences from 53 requirements documents originating from 18 different domains and shed light on the form and complexity of causality in requirements. Based on our findings, we develop a tool-supported approach for causality detection (CiRA). This constitutes a first step towards causality extraction from NL requirements. We report on a case study and the resulting tool-supported approach for causality detection in requirements. Our case study corroborates, among other things, that causality is, in fact, a widely used linguistic pattern to describe system behavior, as about a third of the analyzed sentences are causal. We further demonstrate that our tool CiRA achieves a macro-F1 score of 82 % on real word data and that it outperforms related approaches with an average gain of 11.06 % in macro-Recall and 11.43 % in macro-Precision. Finally, we disclose our open data sets as well as our tool to foster the discourse on the automatic detection of causality in the RE community.

preprint2021arXiv

Causality in Requirements Artifacts: Prevalence, Detection, and Impact

Background: Causal relations in natural language (NL) requirements convey strong, semantic information. Automatically extracting such causal information enables multiple use cases, such as test case generation, but it also requires to reliably detect causal relations in the first place. Currently, this is still a cumbersome task as causality in NL requirements is still barely understood and, thus, barely detectable. Objective: In our empirically informed research, we aim at better understanding the notion of causality and supporting the automatic extraction of causal relations in NL requirements. Method: In a first case study, we investigate 14.983 sentences from 53 requirements documents to understand the extent and form in which causality occurs. Second, we present and evaluate a tool-supported approach, called CiRA, for causality detection. We conclude with a second case study where we demonstrate the applicability of our tool and investigate the impact of causality on NL requirements. Results: The first case study shows that causality constitutes around 28% of all NL requirements sentences. We then demonstrate that our detection tool achieves a macro-F1 score of 82% on real-world data and that it outperforms related approaches with an average gain of 11.06% in macro-Recall and 11.43% in macro-Precision. Finally, our second case study corroborates the positive correlations of causality with features of NL requirements. Conclusion: The results strengthen our confidence in the eligibility of causal relations for downstream reuse, while our tool and publicly available data constitute a first step in the ongoing endeavors of utilizing causality in RE and beyond.

preprint2020arXiv

Data-driven Risk Management for Requirements Engineering: An Automated Approach based on Bayesian Networks

Requirements Engineering (RE) is a means to reduce the risk of delivering a product that does not fulfill the stakeholders' needs. Therefore, a major challenge in RE is to decide how much RE is needed and what RE methods to apply. The quality of such decisions is strongly based on the RE expert's experience and expertise in carefully analyzing the context and current state of a project. Recent work, however, shows that lack of experience and qualification are common causes for problems in RE. We trained a series of Bayesian Networks on data from the NaPiRE survey to model relationships between RE problems, their causes, and effects in projects with different contextual characteristics. These models were used to conduct (1) a postmortem (diagnostic) analysis, deriving probable causes of suboptimal RE performance, and (2) to conduct a preventive analysis, predicting probable issues a young project might encounter. The method was subject to a rigorous cross-validation procedure for both use cases before assessing

preprint2020arXiv

Destination Prediction Based on Partial Trajectory Data

Two-thirds of the people who buy a new car prefer to use a substitute instead of the built-in navigation system. However, for many applications, knowledge about a user's intended destination and route is crucial. For example, suggestions for available parking spots close to the destination can be made or ride-sharing opportunities along the route are facilitated. Our approach predicts probable destinations and routes of a vehicle, based on the most recent partial trajectory and additional contextual data. The approach follows a three-step procedure: First, a $k$-d tree-based space discretization is performed, mapping GPS locations to discrete regions. Secondly, a recurrent neural network is trained to predict the destination based on partial sequences of trajectories. The neural network produces destination scores, signifying the probability of each region being the destination. Finally, the routes to the most probable destinations are calculated. To evaluate the method, we compare multiple neural architectures and present the experimental results of the destination prediction. The experiments are based on two public datasets of non-personalized, timestamped GPS locations of taxi trips. The best performing models were able to predict the destination of a vehicle with a mean error of 1.3 km and 1.43 km respectively.

preprint2020arXiv

How do Quantifiers Affect the Quality of Requirements?

Context: Requirements quality can have a substantial impact on the effectiveness and efficiency of using requirements artifacts in a development process. Quantifiers such as "at least", "all", or "exactly" are common language constructs used to express requirements. Quantifiers can be formulated by affirmative phrases ("At least") or negative phrases ("Not less than"). Problem: It is long assumed that negation in quantification negatively affects the readability of requirements, however, empirical research on these topics remains sparse. Principal Idea: In a web-based experiment with 51 participants, we compare the impact of negations and quantifiers on readability in terms of reading effort, reading error rate and perceived reading difficulty of requirements. Results: For 5 out of 9 quantifiers, our participants performed better on the affirmative phrase compared to the negative phrase. Only for one quantifier, the negative phrase was more effective. Contribution: This research focuses on creating an empirical understanding of the effect of language in Requirements Engineering. It furthermore provides concrete advice on how to phrase requirements.

preprint2020arXiv

The Role and Potentials of Field User Interaction Data in the Automotive UX Development Lifecycle: An Industry Perspective

We are interested in the role of field user interaction data in the development of IVIS, the potentials practitioners see in analyzing this data, the concerns they share, and how this compares to companies with digital products. We conducted interviews with 14 UX professionals, 8 from automotive and 6 from digital companies, and analyzed the results by emergent thematic coding. Our key findings indicate that implicit feedback through field user interaction data is currently not evident in the automotive UX development process. Most decisions regarding the design of IVIS are made based on personal preferences and the intuitions of stakeholders. However, the interviewees also indicated that user interaction data has the potential to lower the influence of guesswork and assumptions in the UX design process and can help to make the UX development lifecycle more evidence-based and user-centered.

preprint2020arXiv

Topic Modeling on User Stories using Word Mover's Distance

Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that a combination of word embeddings and Word Mover's Distance is most promising. Depending on the word embeddings we use in our approaches, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that allows new insights into the dataset, e.g. to find potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.

preprint2020arXiv

Towards Causality Extraction from Requirements

System behavior is often based on causal relations between certain events (e.g. If event1, then event2). Consequently, those causal relations are also textually embedded in requirements. We want to extract this causal knowledge and utilize it to derive test cases automatically and to reason about dependencies between requirements. Existing NLP approaches fail to extract causality from natural language (NL) with reasonable performance. In this paper, we describe first steps towards building a new approach for causality extraction and contribute: (1) an NLP architecture based on Tree Recursive Neural Networks (TRNN) that we will train to identify causal relations in NL requirements and (2) an annotation scheme and a dataset that is suitable for training TRNNs. Our dataset contains 212,186 sentences from 463 publicly available requirement documents and is a first step towards a gold standard corpus for causality extraction. We encourage fellow researchers to contribute to our dataset and help us in finalizing the causality annotation process. Additionally, the dataset can also be annotated further to serve as a benchmark for other RE-relevant NLP tasks such as requirements classification.

preprint2020arXiv

Views on Quality Requirements in Academia and Practice: Commonalities, Differences, and Context-Dependent Grey Areas

Context: Quality requirements (QRs) are a topic of constant discussions both in industry and academia. Debates entwine around the definition of quality requirements, the way how to handle them, or their importance for project success. While many academic endeavors contribute to the body of knowledge about QRs, practitioners may have different views. In fact, we still lack a consistent body of knowledge on QRs since much of the discussion around this topic is still dominated by observations that are strongly context-dependent. This holds for both academic and practitioners' views. Our assumption is that, in consequence, those views may differ. Objective: We report on a study to better understand the extent to which available research statements on quality requirements, as found in exemplary peer-reviewed and frequently cited publications, are reflected in the perception of practitioners. Our goal is to analyze differences, commonalities, and context-dependent grey areas in the views of academics and practitioners to allow a discussion on potential misconceptions (on either sides) and opportunities for future research. Method: We conducted a survey with 109 practitioners to assess whether they agree with research statements about QRs reflected in the literature. Based on a statistical model, we evaluate the impact of a set of context factors to the perception of research statements. Results: Our results show that a majority of the statements is well respected by practitioners; however, not all of them. When examining the different groups and backgrounds of respondents, we noticed interesting deviations of perceptions within different groups that may lead to new research questions. Conclusions: Our results help identifying prevalent context-dependent differences about how academics and practitioners view QRs and pinpointing statements where further research might be useful.

preprint2020arXiv

What Makes Agile Test Artifacts Useful? An Activity-Based Quality Model from a Practitioners' Perspective

Background: The artifacts used in Agile software testing and the reasons why these artifacts are used are fairly well-understood. However, empirical research on how Agile test artifacts are eventually designed in practice and which quality factors make them useful for software testing remains sparse. Aims: Our objective is two-fold. First, we identify current challenges in using test artifacts to understand why certain quality factors are considered good or bad. Second, we build an Activity-Based Artifact Quality Model that describes what Agile test artifacts should look like. Method: We conduct an industrial survey with 18 practitioners from 12 companies operating in seven different domains. Results: Our analysis reveals nine challenges and 16 factors describing the quality of six test artifacts from the perspective of Agile testers. Interestingly, we observed mostly challenges regarding language and traceability, which are well-known to occur in non-Agile projects. Conclusions: Although Agile software testing is becoming the norm, we still have little confidence about general do's and don'ts going beyond conventional wisdom. This study is the first to distill a list of quality factors deemed important to what can be considered as useful test artifacts.