Source author record

Gordon Fraser

Gordon Fraser appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Software Engineering, cs.CY, Programming Languages

Catalog footprint

What is connected

20 works

3 topics

4 close collaborators


preprint, 2022, arXiv

A Survey on How Test Flakiness Affects Developers and What Support They Need To Address It

Non-deterministically passing and failing test cases, so-called flaky tests, have recently become a focus area of software engineering research. While this research focus has been met with some enthusiastic endorsement from industry, prior work nevertheless mostly studied flakiness using a code-centric approach by mining software repositories. What data extracted from software repositories cannot tell us, however, is how developers perceive flakiness: How prevalent is test flakiness in developers' daily routine, how does it affect them, and most importantly: What do they want us researchers to do about it? To answer these questions, we surveyed 335 professional software developers and testers in different domains. The survey respondents confirm that flaky tests are a common and serious problem, thus reinforcing ongoing research on flaky test detection. Developers are less worried about the computational costs caused by re-running tests and more about the loss of trust in the test outcomes. Therefore, they would like to have IDE plugins to detect flaky code as well as better visualizations of the problem, particularly dashboards showing test outcomes over time; they also wish for more training and information on flakiness. These important aspects will require the attention of researchers as well as tool developers.

preprint, 2022, arXiv

An Empirical Study of Automated Unit Test Generation for Python

Various mature automated test generation tools exist for statically typed programming languages such as Java. Automatically generating unit tests for dynamically typed programming languages such as Python, however, is substantially more difficult due to the dynamic nature of these languages as well as the lack of type information. Our Pynguin framework provides automated unit test generation for Python. In this paper, we extend our previous work on Pynguin to support more aspects of the Python language, and we study a larger variety of well-established state-of-the-art test-generation algorithms, namely DynaMOSA, MIO, and MOSA. Furthermore, we improved our Pynguin tool to generate regression assertions, whose quality we also evaluate. Our experiments confirm that evolutionary algorithms can outperform random test generation also in the context of Python, and similar to the Java world, DynaMOSA yields the highest coverage results. However, our results also demonstrate that there are still fundamental remaining issues, such as inferring type information for code without this information, currently limiting the effectiveness of test generation for Python.

preprint, 2022, arXiv

Automated Test Generation for Scratch Programs

The importance of programming education has led to dedicated educational programming environments, where users visually arrange block-based programming constructs that typically control graphical, interactive game-like programs. The Scratch programming environment is particularly popular, with more than 70 million registered users at the time of this writing. While the block-based nature of Scratch helps learners by preventing syntactical mistakes, there nevertheless remains a need to provide feedback and support in order to implement desired functionality. To support individual learning and classroom settings, this feedback and support should ideally be provided in an automated fashion, which requires tests to enable dynamic program analysis. The Whisker framework enables automated testing of Scratch programs, but creating these automated tests for Scratch programs is challenging. In this paper, we therefore investigate how to automatically generate Whisker tests. This raises important challenges: First, game-like programs are typically randomised, leading to flaky tests. Second, Scratch programs usually consist of animations and interactions with long delays, inhibiting the application of classical test generation approaches. Evaluation on common programming exercises, a random sample of 1000 Scratch user programs, and the 1000 most popular Scratch programs demonstrates that our approach enables Whisker to reliably accelerate test executions, and even though many Scratch programs are small and easy to cover, there are many unique challenges for which advanced search-based test generation using many-objective algorithms is needed in order to achieve high coverage.

preprint, 2022, arXiv

Common Patterns in Block-Based Robot Programs

Programmable robots are engaging and fun to play with, interact with the real world, and are therefore well suited to introduce young learners to programming. Introductory robot programming languages often extend existing block-based languages such as Scratch. While teaching programming with such languages is well established, the interaction with the real world in robot programs leads to specific challenges, for which learners and educators may require assistance and feedback. A practical approach to provide this feedback is by identifying and pointing out patterns in the code that are indicative of good or bad solutions. While such patterns have been defined for regular block-based programs, robot-specific programming aspects have not been considered so far. The aim of this paper is therefore to identify patterns specific to robot programming for the Scratch-based mBlock programming language, which is used for the popular mBot and Codey Rocky robots. We identify: (1) 26 bug patterns, which indicate erroneous code; (2) three code smells, which indicate code that may work but is written in a confusing or difficult to understand way; and (3) 18 code perfumes, which indicate aspects of code that are likely good. We extend the LitterBox analysis framework to automatically identify these patterns in mBlock programs. Evaluated on a dataset of 3,540 mBlock programs, we find a total of 6,129 instances of bug patterns, 592 code smells and 14,495 code perfumes. This demonstrates the potential of our approach to provide feedback and assistance to learners and educators alike for their mBlock robot programs.

preprint, 2022, arXiv

Common Problems and Effects of Feedback on Fun When Programming Ozobots in Primary School

Computational thinking is increasingly introduced at primary school level, usually with some form of programming activity. In particular, educational robots provide an opportunity for engaging students with programming through hands-on experiences. However, primary school teachers might not be adequately prepared for teaching computer science related topics, and giving feedback to students can often be challenging: Besides the content of the feedback (e.g., what problems have to be handled), the way the feedback is given is also important, as it can lead to negative emotional effects. To support teachers with the way of giving feedback on common problems when teaching programming with robotics, we conducted a study consisting of seven workshops with three third and four fourth grade primary school classes. Within seven different activities, the 116 primary school children first programmed the Ozobot Evo robot in the pen-and-paper mode and then on a digital device. Throughout these activities we collected data on the problems the students encountered, the feedback given, and the fun they experienced. Our analysis reveals eight categories of problems, which we summarise in this paper together with corresponding possible feedback. We observed that problems that are urgent or can harm the students' self-efficacy have a negative impact on how enjoyable an activity is perceived. While direct instruction significantly decreased the experienced fun, hints had a positive effect. Generally, we found programming the Ozobot Evo to be encouraging for both girls and boys. To support teachers, we discuss ideas for giving encouraging feedback on common problems of Ozobot Evo programming activities and how our findings transfer to other robots.

preprint, 2022, arXiv

Gamekins: Gamifying Software Testing in Jenkins

Developers have to write thorough tests for their software in order to find bugs and to prevent regressions. Writing tests, however, is not every developer's favourite occupation, and if a lack of motivation leads to a lack of tests, then this may have dire consequences, such as programs with poor quality or even project failures. This paper introduces Gamekins, a tool that uses gamification to motivate developers to write more and better tests. Gamekins is integrated into the Jenkins continuous integration platform where game elements are based on commits to the source code repository: Developers can earn points for completing test challenges and quests posed by Gamekins, compete with other developers or developer teams on a leaderboard, and are rewarded for their test-related achievements.

preprint, 2022, arXiv

Gender-dependent Contribution, Code and Creativity in a Virtual Programming Course

Since computer science is still mainly male dominated, academia, industry and education jointly seek ways to motivate and inspire girls, for example by introducing them to programming at an early age. The recent COVID-19 pandemic has forced many such endeavours to move to an online setting. While the gender-dependent differences in programming courses have been studied previously, for example revealing that girls may feel safer in same-sex groups, much less is known about gender-specific differences in online programming courses. In order to investigate whether gender-specific differences can be observed in online courses, we conducted an online introductory programming course for Scratch, in which we observed the gender-specific characteristics of participants with respect to how they interact, their enjoyment, the code they produce, and the creativity exposed by their programs. Overall, we observed no significant differences between how girls participated in all-female vs. mixed groups, and girls generally engaged with the course more actively than boys. This suggests that online courses can be a useful means to avoid gender-dependent group dynamics. However, when encouraging creative freedom in programming, girls and boys seem to fall back to socially inherited stereotypical behavior also in an online setting, influencing the choice of programming concepts applied. This may inhibit learning and is a challenge that needs to be addressed independently of whether courses are held online.

preprint, 2022, arXiv

Model-based Testing of Scratch Programs

Learners are often introduced to programming via dedicated languages such as Scratch, where block-based commands are assembled visually in order to control the interactions of graphical sprites. Automated testing of such programs is an important prerequisite for supporting debugging, providing hints, or assessing learning outcomes. However, writing tests for Scratch programs can be challenging: The game-like and randomised nature of typical Scratch programs makes it difficult to identify specific timed input sequences used to control the programs. Furthermore, precise test assertions to check the resulting program states are incompatible with the fundamental principle of creative freedom in programming in Scratch, where correct program behaviour may be implemented with deviations in the graphical appearance or timing of the program. The event-driven and actor-oriented nature of Scratch programs, however, makes them a natural fit for describing program behaviour using finite state machines. In this paper, we introduce a model-based testing approach by extending Whisker, an automated testing framework for Scratch programs. The model-based extension describes expected program behaviour in terms of state machines, which makes it feasible to check the abstract behaviour of a program independent of exact timing and pixel-precise graphical details, and to automatically derive test inputs testing even challenging programs. A video demonstrating model-based testing with Whisker is available at the following URL: https://youtu.be/edgCNbGSGEY
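As a rough illustration of checking behaviour against a state machine rather than against exact timings, the sketch below validates an observed event trace against a small model. The state names, event names, and the `conforms` helper are hypothetical and are not Whisker's API; they only show the general idea.

```python
from typing import Dict, Iterable, Tuple

# (current state, observed event) -> next state
Transitions = Dict[Tuple[str, str], str]

# Hypothetical model of a simple catching game.
MODEL: Transitions = {
    ("idle", "green_flag"): "playing",
    ("playing", "score_increases"): "playing",
    ("playing", "sprite_touches_enemy"): "game_over",
}


def conforms(model: Transitions, start: str, trace: Iterable[str]) -> bool:
    """Return True if every event in the trace is allowed by the model.

    Only the order of events matters; when they happen is ignored.
    """
    state = start
    for event in trace:
        next_state = model.get((state, event))
        if next_state is None:
            return False  # the model does not allow this event here
        state = next_state
    return True
```

A trace such as `["green_flag", "score_increases", "sprite_touches_enemy"]` conforms to this model, while a trace that scores after the game is over does not.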

preprint, 2022, arXiv

Neuroevolution-Based Generation of Tests and Oracles for Games

Game-like programs have become increasingly popular in many software engineering domains such as mobile apps, web applications, or programming education. However, creating tests for programs that have the purpose of challenging human players is a daunting task for automatic test generators. Even if test generation succeeds in finding a relevant sequence of events to exercise a program, the randomized nature of games means that it may neither be possible to reproduce the exact program behavior underlying this sequence, nor to create test assertions checking if observed randomized game behavior is correct. To overcome these problems, we propose Neatest, a novel test generator based on the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. Neatest systematically explores a program's statements, and creates neural networks that operate the program in order to reliably reach each statement -- that is, Neatest learns to play the game in a way to reliably cover different parts of the code. As the networks learn the actual game behavior, they can also serve as test oracles by evaluating how surprising the observed behavior of a program under test is compared to a supposedly correct version of the program. We evaluate this approach in the context of Scratch, an educational programming environment. Our empirical study on 25 non-trivial Scratch games demonstrates that our approach can successfully train neural networks that are not only far more resilient to random influences than traditional test suites consisting of static input sequences, but are also highly effective with an average mutation score of more than 65%.

preprint, 2022, arXiv

Pynguin: Automated Unit Test Generation for Python

Automated unit test generation is a well-known methodology aiming to reduce the developers' effort of writing tests manually. Prior research focused mainly on statically typed programming languages like Java. In practice, however, dynamically typed languages have received a huge gain in popularity over the last decade. This introduces the need for tools and research on test generation for these languages, too. We introduce Pynguin, an extendable test-generation framework for Python, which generates regression tests with high code coverage. Pynguin is designed to be easily usable by practitioners; it is also extensible to allow researchers to adapt it for their needs and to enable future research. We provide a demo of Pynguin at https://youtu.be/UiGrG25Vts0; further information, documentation, the tool, and its source code are available at https://www.pynguin.eu.

preprint, 2022, arXiv

Scratch as Social Network: Topic Modeling and Sentiment Analysis in Scratch Projects

Societal matters like the Black Lives Matter (BLM) movement influence software engineering, as the recent debate on replacing certain discriminatory terms such as whitelist/blacklist has shown. Identifying relevant and trending societal matters is important, and often done using social network analysis for traditional social media channels such as Twitter. In this paper we explore whether this type of analysis can also be used for introspection of the software world, by looking at the thriving scene of young Scratch programmers. The educational programming language Scratch is not only used for teaching programming concepts, but offers a platform for young programmers to express and share their creativity on any topics of relevance. By analyzing titles and project comments in a dataset of 106,032 Scratch projects, we explore which topics are common in the Scratch community, whether socially relevant events are reflected, and what the sentiment in the comments is. It turns out that the diversity of topics within the Scratch projects makes the analysis process challenging. Our results nevertheless show that topics from pop and net culture in particular are present, and even recent societal events such as the Covid-19 pandemic or BLM are to some extent reflected in Scratch. The tone in the comments is mostly positive with catchy youth language. Hence, despite the challenges, Scratch projects can be studied in the same way as social networks, which opens up new possibilities to improve our understanding of the behavior and motivation of novice programmers.

preprint, 2021, arXiv

An Empirical Study of Flaky Tests in Python

Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness is well established, prior research focused on Java projects, thus raising the question of how the findings generalize. In order to provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, one of the currently most popular programming languages. For this, we sampled 22,352 open source projects from the popular PyPI package index, and analyzed their 876,186 test cases for flakiness. Our investigation suggests that flakiness is equally prevalent in Python as it is in Java. The reasons, however, are different: Order dependency is a much more dominant problem in Python, causing 59% of the 7,571 flaky tests in our dataset. Another 28% were caused by test infrastructure problems, which represent a previously undocumented cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs by the projects, which is indicative of the type of software commonly written in Python. Our data also suggests that finding flaky tests requires more runs than are often done in the literature: Reaching 95% confidence that a passing test case is not flaky would require 170 reruns on average.
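To give a feel for how a rerun count relates to a confidence level, here is a minimal sketch under a simplifying assumption that the abstract does not state: each run of a flaky test fails independently with a fixed probability. The function names are hypothetical, and the paper's own estimate may be derived differently.

```python
import math


def reruns_needed(fail_prob: float, confidence: float = 0.95) -> int:
    """Smallest n such that a flaky test fails at least once in n runs
    with the given confidence, assuming independent runs."""
    if not 0.0 < fail_prob < 1.0:
        raise ValueError("fail_prob must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # P(no failure in n runs) = (1 - p)^n must drop to 1 - confidence or below.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_prob))


def detection_probability(fail_prob: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - fail_prob) ** runs
```

Under this assumption, a test that fails in 10% of its runs needs 29 reruns for 95% confidence, and a budget of 170 reruns corresponds to a per-run failure probability of roughly 1.7%. This is a back-of-the-envelope reading, not the paper's derivation.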

preprint, 2021, arXiv

Finding Anomalies in Scratch Assignments

In programming education, teachers need to monitor and assess the progress of their students by investigating the code they write. Code quality of programs written in traditional programming languages can be automatically assessed with automated tests, verification tools, or linters. In many cases these approaches rely on some form of manually written formal specification to analyze the given programs. Writing such specifications, however, is hard for teachers, who are often not adequately trained for this task. Furthermore, automated tool support for popular block-based introductory programming languages like Scratch is lacking. Anomaly detection is an approach to automatically identify deviations of common behavior in datasets without any need for writing a specification. In this paper, we use anomaly detection to automatically find deviations of Scratch code in a classroom setting, where anomalies can represent erroneous code, alternative solutions, or distinguished work. Evaluation on solutions of different programming tasks demonstrates that anomaly detection can successfully be applied to tightly specified as well as open-ended programming tasks.

preprint, 2021, arXiv

Gradeer: An Open-Source Modular Hybrid Grader

Automated assessment has been shown to greatly simplify the process of assessing students' programs. However, manual assessment still offers benefits to both students and tutors. We introduce Gradeer, a hybrid assessment tool, which allows tutors to leverage the advantages of both automated and manual assessment. The tool features a modular design, allowing new grading functionality to be added. Gradeer directly assists manual grading, by automatically loading code inspectors, running students' programs, and allowing grading to be stopped and resumed in place at a later time. We used Gradeer to assess an end of year assignment for an introductory Java programming course, and found that its hybrid approach offers several benefits.

preprint, 2021, arXiv

LitterBox: A Linter for Scratch Programs

Creating programs with block-based programming languages like Scratch is easy and fun. Block-based programs can nevertheless contain bugs, in particular when learners have misconceptions about programming. Even when they do not, Scratch code is often of low quality and contains code smells, further inhibiting understanding, reuse, and fun. To address this problem, in this paper we introduce LitterBox, a linter for Scratch programs. Given a program or its public project ID, LitterBox checks the program against patterns of known bugs and code smells. For each issue identified, LitterBox provides not only the location in the code, but also a helpful explanation of the underlying reason and possible misconceptions. Learners can access LitterBox through an easy to use web interface with visual information about the errors in the block-code, while for researchers LitterBox provides a general, open source, and extensible framework for static analysis of Scratch programs.

preprint, 2021, arXiv

Practical Mutation Testing at Scale

Mutation analysis assesses a test suite's adequacy by measuring its ability to detect small artificial faults, systematically seeded into the tested program. Mutation analysis is considered one of the strongest test-adequacy criteria. Mutation testing builds on top of mutation analysis and is a testing technique that uses mutants as test goals to create or improve a test suite. Mutation testing has long been considered intractable because the sheer number of mutants that can be created represents an insurmountable problem -- both in terms of human and computational effort. This has hindered the adoption of mutation testing as an industry standard. For example, Google has a codebase of two billion lines of code and more than 500,000,000 tests are executed on a daily basis. The traditional approach to mutation testing does not scale to such an environment. To address these challenges, this paper presents a scalable approach to mutation testing based on the following main ideas: (1) Mutation testing is done incrementally, mutating only changed code during code review, rather than the entire code base; (2) Mutants are filtered, removing mutants that are likely to be irrelevant to developers, and limiting the number of mutants per line and per code review process; (3) Mutants are selected based on the historical performance of mutation operators, further eliminating irrelevant mutants and improving mutant quality. Evaluation in a code-review-based setting with more than 24,000 developers on more than 1,000 projects shows that the proposed approach produces orders of magnitude fewer mutants and that context-based mutant filtering and selection improve mutant quality and actionability. Overall, the proposed approach represents a mutation testing framework that seamlessly integrates into the software development workflow and is applicable up to large-scale industrial settings.

preprint, 2020, arXiv

Search-based Testing for Scratch Programs

Block-based programming languages enable young learners to quickly implement fun programs and games. The Scratch programming environment is particularly successful at this, with more than 50 million registered users at the time of this writing. Although Scratch simplifies creating syntactically correct programs, learners and educators nevertheless frequently require feedback and support. Dynamic program analysis could enable automation of this support, but the test suites necessary for dynamic analysis do not usually exist for Scratch programs. It is, however, possible to cast test generation for Scratch as a search problem. In this paper, we introduce an approach for automatically generating test suites for Scratch programs using grammatical evolution. The use of grammatical evolution clearly separates the search encoding from framework-specific implementation details, and allows us to use advanced test acceleration techniques. We implemented our approach as an extension of the Whisker test framework. Evaluation on sample Scratch programs demonstrates the potential of the approach.
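The separation between search encoding and implementation details that grammatical evolution provides can be sketched as follows: the search manipulates only a list of integers (codons), and a grammar decodes them into a test. The grammar, the event names, and the `decode` function are hypothetical illustrations, not Whisker's actual encoding.

```python
from typing import Dict, List

# Hypothetical grammar: a test is a sequence of one or more input events.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<test>": [["<event>"], ["<event>", "<test>"]],
    "<event>": [["key", "<key>"], ["click"], ["wait"]],
    "<key>": [["left"], ["right"], ["space"]],
}


def decode(codons: List[int], start: str = "<test>", max_steps: int = 100) -> List[str]:
    """Map integer codons to a token sequence by leftmost derivation.

    Each codon picks a production via modulo; codons are reused cyclically,
    and max_steps bounds derivations that would otherwise never terminate.
    """
    if not codons:
        raise ValueError("at least one codon is required")
    symbols = [start]
    output: List[str] = []
    steps = 0
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in GRAMMAR:
            output.append(symbol)  # terminal token
            continue
        if steps >= max_steps:
            raise ValueError("derivation did not terminate within max_steps")
        options = GRAMMAR[symbol]
        choice = options[codons[steps % len(codons)] % len(options)]
        steps += 1
        symbols = choice + symbols
    return output
```

For example, `decode([1, 0, 2, 0, 1])` yields `["key", "space", "click"]`. Because mutation and crossover operate on the integer list, the same search operators work regardless of which events the grammar describes.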

preprint, 2016, arXiv

Inferring Loop Invariants by Mutation, Dynamic Analysis, and Static Checking

Verifiers that can prove programs correct against their full functional specification require, for programs with loops, additional annotations in the form of loop invariants, that is, properties that hold for every iteration of a loop. We show that significant loop invariant candidates can be generated by systematically mutating postconditions; then, dynamic checking (based on automatically generated tests) weeds out invalid candidates, and static checking selects provably valid ones. We present a framework that automatically applies these techniques to support a program prover, paving the way for fully automatic verification without manually written loop invariants: Applied to 28 methods (including 39 different loops) from various java.util classes (occasionally modified to avoid using Java features not fully supported by the static checker), our DYNAMATE prototype automatically discharged 97% of all proof obligations, resulting in automatic complete correctness proofs of 25 out of the 28 methods, outperforming several state-of-the-art tools for fully automatic verification.
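The first two steps described above, mutating a postcondition into invariant candidates and discarding candidates that fail at run time, can be sketched on a toy summation loop. Everything here is a hypothetical illustration in Python rather than the DYNAMATE implementation, and the final static checking step is not modelled.

```python
from typing import Callable, Dict, List, Set, Tuple

# A candidate takes (xs, i, total) observed at the loop head.
Candidate = Callable[[List[int], int, int], bool]


def traced_sum(xs: List[int]) -> Tuple[int, List[Tuple[int, int]]]:
    """Sum a list and record (i, total) each time the loop head is reached."""
    states = []
    total = 0
    i = 0
    while i < len(xs):
        states.append((i, total))
        total += xs[i]
        i += 1
    states.append((i, total))  # loop head reached one last time on exit
    return total, states


# Postcondition: total == sum(xs[:len(xs)]). Candidates are obtained by
# replacing len(xs) with expressions over the loop variable i.
CANDIDATES: Dict[str, Candidate] = {
    "total == sum(xs[:i])": lambda xs, i, total: total == sum(xs[:i]),
    "total == sum(xs[:i + 1])": lambda xs, i, total: total == sum(xs[:i + 1]),
    "total == sum(xs)": lambda xs, i, total: total == sum(xs),  # unmutated
}


def surviving_candidates(test_inputs: List[List[int]]) -> Set[str]:
    """Keep only candidates that hold in every observed loop-head state."""
    survivors = set(CANDIDATES)
    for xs in test_inputs:
        _, states = traced_sum(xs)
        for i, total in states:
            survivors = {n for n in survivors if CANDIDATES[n](xs, i, total)}
    return survivors
```

With the inputs `[[1, 2, 3], [4]]` only `total == sum(xs[:i])` survives. With the uninformative input `[[0, 0]]` all three candidates survive, which shows why dynamic checking can only weed out candidates and a static check is still needed to confirm validity.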

preprint, 2016, arXiv

Uncertainty-Driven Black-Box Test Data Generation

We can never be certain that a software system is correct simply by testing it, but with every additional successful test we become less uncertain about its correctness. In the absence of source code or elaborate specifications and models, tests are usually generated or chosen randomly. However, rather than randomly choosing tests, it would be preferable to choose those tests that decrease our uncertainty about correctness the most. In order to guide test generation, we apply what is referred to in Machine Learning as "Query Strategy Framework": We infer a behavioural model of the system under test and select those tests which the inferred model is "least certain" about. Running these tests on the system under test thus directly targets those parts about which tests so far have failed to inform the model. We provide an implementation that uses a genetic programming engine for model inference in order to enable an uncertainty sampling technique known as "query by committee", and evaluate it on eight subject systems from the Apache Commons Math framework and JodaTime. The results indicate that test generation using uncertainty sampling outperforms conventional and Adaptive Random Testing.
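The selection step of query by committee can be sketched in a few lines: given several inferred models, pick the candidate input on which their predictions disagree most. In the paper the models come from a genetic programming engine; the three hand-written lambdas below are hypothetical stand-ins, as are the function names.

```python
from typing import Callable, List, Sequence

Model = Callable[[int], int]


def disagreement(committee: Sequence[Model], x: int) -> int:
    """Number of distinct predictions for x, minus one (0 means full agreement)."""
    return len({model(x) for model in committee}) - 1


def select_most_uncertain(committee: Sequence[Model], candidates: Sequence[int]) -> int:
    """Return the candidate input the committee disagrees on most.

    Ties are resolved in favour of the earliest candidate.
    """
    return max(candidates, key=lambda x: disagreement(committee, x))


# Hypothetical committee of inferred models for an absolute-value-like system.
COMMITTEE: List[Model] = [
    lambda x: x,
    lambda x: abs(x),
    lambda x: 0 if x < -5 else x,
]
```

For the candidates `[3, 0, -2, -7]` the committee agrees on `3` and `0`, splits two ways on `-2`, and splits three ways on `-7`, so `-7` is selected as the next test to run against the real system.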

preprint, 2013, arXiv

Using State Infection Conditions to Detect Equivalent Mutants and Speed up Mutation Analysis

Mutation analysis evaluates test suites and testing techniques by measuring how well they detect seeded defects (mutants). Even though well established in research, mutation analysis is rarely used in practice due to scalability problems --- there are multiple mutations per code statement leading to a large number of mutants, and hence executions of the test suite. In addition, the use of mutation to improve test suites is futile for mutants that are equivalent, which means that there exists no test case that distinguishes them from the original program. This paper introduces two optimizations based on state infection conditions, i.e., conditions that determine for a test execution whether the same execution on a mutant would lead to a different state. First, redundant test execution can be avoided by monitoring state infection conditions, leading to an overall performance improvement. Second, state infection conditions can aid in identifying equivalent mutants, thus guiding efforts to improve test suites.
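Both optimizations named above can be illustrated with a toy, single-expression program. Each mutant carries a condition that says whether a given input makes the mutated expression produce a different state; executions that cannot infect the state are skipped, and mutants that are never infected are flagged as candidates for equivalence. All names are hypothetical, and in real programs an infected state still has to propagate to an observable output, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Mutant:
    name: str
    run: Callable[[int, int], int]
    # True if the mutated expression yields a different value on (a, b).
    infects: Callable[[int, int], bool]


def original(a: int, b: int) -> int:
    return a + b


MUTANTS: List[Mutant] = [
    Mutant("plus_to_minus", lambda a, b: a - b, lambda a, b: b != 0),
    Mutant("plus_to_times", lambda a, b: a * b, lambda a, b: a + b != a * b),
    Mutant("add_zero", lambda a, b: a + b + 0, lambda a, b: False),
]


def analyse(inputs: List[Tuple[int, int]]) -> Tuple[Set[str], Set[str], int]:
    """Return (killed mutants, never-infected mutants, skipped executions)."""
    killed: Set[str] = set()
    never_infected: Set[str] = set()
    skipped = 0
    for mutant in MUTANTS:
        infected_once = False
        for a, b in inputs:
            if not mutant.infects(a, b):
                skipped += 1  # state is identical, so running the mutant is redundant
                continue
            infected_once = True
            if mutant.run(a, b) != original(a, b):
                killed.add(mutant.name)
        if not infected_once:
            never_infected.add(mutant.name)
    return killed, never_infected, skipped
```

For the inputs `[(0, 0), (2, 2), (1, 2)]`, six of the nine mutant executions are skipped, the two non-equivalent mutants are still killed, and `add_zero` is reported as never infected.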
