Source author record

Sebastian Baltes

Sebastian Baltes appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Software Engineering, Artificial Intelligence, cs.CY, General Literature, Information Retrieval, Neural and Evolutionary Computing

Catalog footprint

What is connected

9 works

6 topics

4 close collaborators


Preprint, 2026, arXiv

On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems

Flaky tests are a common problem in software testing. They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. However, its prevalence and underlying causes are unclear. We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite. We amplified test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, to assess the flakiness of the generated test cases. Our results suggest that generated tests have a slightly higher proportion of flaky tests compared to existing tests. Based on a manual inspection, we found that the most common root cause of flakiness was the reliance of a test on a certain order that is not guaranteed ("unordered collection"), which was present in 72 of 115 flaky tests (63%). Furthermore, both LLMs transferred the flakiness from the existing tests to the newly generated tests via the provided prompt context. Our experiments suggest that flakiness transfer is more prevalent in closed-source systems such as SAP HANA than in open-source systems. Our study informs developers on what types of flakiness to expect from LLM-generated tests. It also highlights the importance of providing LLMs with tailored context when employing LLMs for test generation.
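The "unordered collection" root cause named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the table `users`, the helpers `make_db` and `fetch_names`, and the sample rows are hypothetical, and SQLite (via Python's standard `sqlite3` module) is used only because it is one of the four systems studied. A `SELECT` without `ORDER BY` does not guarantee any row order, so a test that compares the result against a fixed list may pass or fail depending on the execution plan.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?)",
        [("carol",), ("alice",), ("bob",)],
    )
    return conn


def fetch_names(conn, ordered):
    """Return all names; row order is only guaranteed when ordered=True."""
    query = "SELECT name FROM users"
    if ordered:
        query += " ORDER BY name"
    return [row[0] for row in conn.execute(query)]


conn = make_db()

# Fragile pattern (left commented out on purpose): the assertion depends on
# whatever order the engine happens to return rows in.
# assert fetch_names(conn, ordered=False) == ["carol", "alice", "bob"]

# Robust option 1: make the order part of the query.
assert fetch_names(conn, ordered=True) == ["alice", "bob", "carol"]

# Robust option 2: compare the results without regard to order.
assert sorted(fetch_names(conn, ordered=False)) == ["alice", "bob", "carol"]
```

Either fix removes the dependence on an unspecified order. Which one fits depends on whether the order itself is part of the behavior under test.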

Preprint, 2022, arXiv

Paving the Way for Mature Secondary Research: The Seven Types of Literature Review

Confusion over different kinds of secondary research, and their divergent purposes, is undermining the effectiveness and usefulness of secondary studies in software engineering. This short paper therefore explains the differences between ad hoc review, case survey, critical review, meta-analysis (aka systematic literature review), meta-synthesis (aka thematic analysis), rapid review and scoping review (aka systematic mapping study). These definitions and associated guidelines help researchers better select and describe their literature reviews, while helping reviewers select more appropriate evaluation criteria.

Preprint, 2021, arXiv

Automated Query Reformulation for Efficient Search based on Query Logs From Stack Overflow

As a popular Q&A site for programming, Stack Overflow is a treasure for developers. However, the number of questions and answers on Stack Overflow makes it difficult for developers to efficiently locate the information they are looking for. There are two gaps leading to poor search results: the gap between the user's intention and the textual query, and the semantic gap between the query and the post content. Therefore, developers have to constantly reformulate their queries by correcting misspelled words, adding limitations to certain programming languages or platforms, etc. As query reformulation is tedious for developers, especially for novices, we propose an automated software-specific query reformulation approach based on deep learning. With query logs provided by Stack Overflow, we construct a large-scale query reformulation corpus, including the original queries and corresponding reformulated ones. Our approach trains a Transformer model that can automatically generate candidate reformulated queries when given the user's original query. The evaluation results show that our approach outperforms five state-of-the-art baselines, and achieves a 5.6% to 33.5% boost in terms of $\mathit{ExactMatch}$ and a 4.8% to 14.4% boost in terms of $\mathit{GLEU}$.
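As a rough illustration of an exact-match style metric like the one reported above, the sketch below computes the fraction of generated queries that are identical to their reference reformulation. This is one common reading of the term and an assumption here: the paper's precise definition (for example, matching within the top-k candidates) may differ, and the function name `exact_match`, the whitespace and case normalization, and the sample queries are all hypothetical.

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference.

    Assumption: comparison ignores surrounding whitespace and letter case.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Hypothetical reformulated queries and their references.
predicted = ["python sort list", "java read file"]
expected = ["python sort list", "java read text file"]
score = exact_match(predicted, expected)  # one of two matches, so 0.5
```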

Preprint, 2021, arXiv

Empirical Standards for Software Engineering Research

Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.

Preprint, 2020, arXiv

An Annotated Dataset of Stack Overflow Post Edits

To improve software engineering, software repositories have been mined for code snippets and bug fixes. Typically, this mining takes place at the level of files or commits. To be able to dig deeper and to extract insights at a higher resolution, we hereby present an annotated dataset that contains over 7 million edits of code and text on Stack Overflow. Our preliminary study indicates that these edits might be a treasure trove for mining information about fine-grained patches, e.g., for the optimisation of non-functional properties.

Preprint, 2020, arXiv

Code Duplication on Stack Overflow

Despite the unarguable importance of Stack Overflow (SO) for the daily work of many software developers and despite existing knowledge about the impact of code duplication on software maintainability, the prevalence and implications of code clones on SO have not yet received the attention they deserve. In this paper, we motivate why studies on code duplication within SO are needed and how existing studies on code reuse differ from this new research direction. We present similarities and differences between code clones in general and code clones on SO and point to open questions that need to be addressed to be able to make data-informed decisions about how to properly handle clones on this important platform. We present results from a first preliminary investigation, indicating that clones on SO are common and diverse. We further point to specific challenges, including incentives for users to clone successful answers and difficulties with bulk edits on the platform, and conclude with possible directions for future work.

Preprint, 2020, arXiv

Contextual Documentation Referencing on Stack Overflow

Software engineering is knowledge-intensive and requires software developers to continually search for knowledge, often on community question answering platforms such as Stack Overflow. Such information sharing platforms do not exist in isolation, and part of the evidence that they exist in a broader software documentation ecosystem is the common presence of hyperlinks to other documentation resources found in forum posts. With the goal of helping to improve the information diffusion between Stack Overflow and other documentation resources, we conducted a study to answer the question of how and why documentation is referenced in Stack Overflow threads. We sampled and classified 759 links from two different domains, regular expressions and Android development, to qualitatively and quantitatively analyze the links' context and purpose, including attribution, awareness, and recommendations. We found that links on Stack Overflow serve a wide range of distinct purposes, ranging from citation links attributing content copied into Stack Overflow, through links clarifying concepts using Wikipedia pages, to recommendations of software components and resources for background reading. This purpose spectrum has major corollaries, including our observation that links to documentation resources are a reflection of the information needs typical to a technology domain. We contribute a framework and method to analyze the context and purpose of Stack Overflow links, a public dataset of annotated links, and a description of five major observations about linking practices on Stack Overflow. We further point to potential tool support to enhance the information diffusion between Stack Overflow and other documentation resources.

Preprint, 2020, arXiv

Is 40 the new 60? How popular media portrays the employability of older software developers

Alerted by our previous research as well as media reports and discussions in online forums about ageism in the software industry, we set out to study the public discourse around age and software development. With a focus on the USA, we analyzed popular online articles and related discussions on Hacker News through the lens of (perceived) employability issues and potential mitigation strategies. Besides rather controversial strategies such as disguising age-related aspects in résumés or undergoing plastic surgeries to appear young, we highlight the importance of keeping up-to-date, specializing in certain tasks or technologies, and present role transitions as a way forward for veteran developers. With this article, we want to build awareness among decision makers in software projects to help them anticipate and mitigate challenges that their older employees may face.

Preprint, 2016, arXiv

Empirical Research Plan: Effects of Sketching on Program Comprehension

Sketching is an important means of communication in software engineering practice. Yet, there is little research investigating the use of sketches. We want to contribute a better understanding of sketching, in particular its use during program comprehension. We propose a controlled experiment to investigate the effectiveness and efficiency of program comprehension with the support of sketches as well as what sketches are used in what way.
