Source author record

Markus Wagner

Markus Wagner appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Neural and Evolutionary Computing, Software Engineering, Artificial Intelligence, Machine Learning, eess.SP, Human-Computer Interaction, Data Structures and Algorithms, Computation and Language, math.OC, Performance

Catalog footprint

What is connected

39 works

10 topics

4 close collaborators

preprint 2023 arXiv

NCQ: Code reuse support for node.js developers

Code reuse is an important part of software development. The adoption of code reuse practices is especially common among Node.js developers. The Node.js package manager, NPM, indexes over 1 million packages, and developers often seek out packages to solve programming tasks. Due to the vast number of packages, selecting the right package is difficult and time-consuming. With the goal of improving the productivity of developers who heavily reuse code through third-party packages, we present Node Code Query (NCQ), a Read-Eval-Print-Loop environment that allows developers to 1) search for NPM packages using natural language queries, 2) search for code snippets related to those packages, 3) automatically correct errors in these code snippets, 4) quickly set up new environments for testing those snippets, and 5) transition between search and editing modes. In two user studies with a total of 20 participants, we find that participants begin programming faster and conclude tasks faster with NCQ than with baseline approaches, and that they like, among other features, the search for code snippets and packages. Our results suggest that NCQ makes Node.js developers more efficient in reusing code.

preprint 2022 arXiv

Automatically Categorising GitHub Repositories by Application Domain

GitHub is the largest host of open source software on the Internet. This large, freely accessible database has attracted the attention of practitioners and researchers alike. But as GitHub's growth continues, it is becoming increasingly hard to navigate the plethora of repositories which span a wide range of domains. Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository and reasoning about project quality. In this work, we build on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier for categorising repositories by their application domain. The classifier uses state-of-the-art natural language processing techniques and machine learning to learn from multiple data sources and catalogue repositories according to five application domains. We contribute with (1) an automated classifier that can assign popular repositories to each application domain with at least 70% precision, (2) an investigation of the approach's performance on less popular repositories, and (3) a practical application of this approach to answer how the adoption of software engineering practices differs across application domains. Our work aims to help the GitHub community identify repositories of interest and opens promising avenues for future work investigating differences between repositories from different application domains.

preprint 2022 arXiv

Is Surprisal in Issue Trackers Actionable?

Background. From information theory, surprisal is a measurement of how unexpected an event is. Statistical language models provide a probabilistic approximation of natural languages, and because surprisal is constructed with the probability of an event occurring, it is therefore possible to determine the surprisal associated with English sentences. The issues and pull requests of software repository issue trackers give insight into the development process and likely contain the surprising events of this process. Objective. Prior works have identified that unusual events in software repositories are of interest to developers, and use simple code metrics-based methods for detecting them. In this study we will propose a new method for unusual event detection in software repositories using surprisal. With the ability to find surprising issues and pull requests, we intend to further analyse them to determine if they actually hold importance in a repository, or if they pose a significant challenge to address. If it is possible to find bad surprises early, or before they cause additional troubles, it is plausible that effort, cost and time will be saved as a result. Method. After extracting the issues and pull requests from 5000 of the most popular software repositories on GitHub, we will train a language model to represent these issues. We will measure their perceived importance in the repository, measure their resolution difficulty using several analogues, measure the surprisal of each, and finally generate inferential statistics to describe any correlations.
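As a rough illustration of the surprisal measure described in this abstract, the sketch below computes surprisal as the negative base-2 logarithm of a probability and averages it over the tokens of a text. The add-one smoothed unigram model, the function names and the toy corpus are assumptions made for illustration only; the abstract does not say which language model the authors train.

```python
import math
from collections import Counter


def surprisal(probability: float) -> float:
    """Surprisal in bits of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def train_unigram(corpus):
    """Count whitespace-separated tokens over a list of texts (toy model)."""
    return Counter(token for text in corpus for token in text.split())


def mean_surprisal(text, counts):
    """Average per-token surprisal under an add-one smoothed unigram model."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    tokens = text.split()
    if not tokens:
        return 0.0
    bits = [surprisal((counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(bits) / len(bits)


# Hypothetical issue titles standing in for a real issue tracker.
counts = train_unigram(["fix typo in readme", "fix typo in docs", "fix build"])
common = mean_surprisal("fix typo", counts)
rare = mean_surprisal("kernel panic", counts)
```

In this toy setting, a title made of frequently seen tokens scores lower than one made of unseen tokens, which is the property an unusual-event detector would rely on.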

preprint 2022 arXiv

On the Fitness Landscapes of Interdependency Models in the Travelling Thief Problem

Since its inception in 2013, the Travelling Thief Problem (TTP) has been widely studied as an example of problems with multiple interconnected sub-problems. The dependency in this model arises when tying the travelling time of the "thief" to the weight of the knapsack. However, other forms of dependency as well as combinations of dependencies should be considered for investigation, as they are often found in complex real-world problems. Our goal is to study the impact of different forms of dependency in the TTP using a simple local search algorithm. To achieve this, we use Local Optima Networks, a technique for analysing the fitness landscape.
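The dependency mentioned in this abstract, where travel time is tied to knapsack weight, can be sketched as follows. This assumes the commonly used TTP formulation in which speed falls linearly from a maximum (empty knapsack) to a minimum (full knapsack) and the objective is profit minus a rent rate times travel time. The function and parameter names are hypothetical, and the alternative dependency models studied in the paper are not shown.

```python
def ttp_objective(tour, dist, items_at, capacity, v_min, v_max, rent):
    """Profit minus rent-weighted travel time for a closed tour.

    tour:     list of city indices, starting at the home city
    dist:     dist[a][b] is the distance between cities a and b
    items_at: maps a city to the (weight, profit) picked up there
    """
    weight = 0.0
    profit = 0.0
    time = 0.0
    for i, city in enumerate(tour):
        w, p = items_at.get(city, (0.0, 0.0))
        weight += w
        profit += p
        if weight > capacity:
            raise ValueError("knapsack capacity exceeded")
        # Speed drops linearly from v_max (empty) to v_min (full knapsack).
        speed = v_max - weight * (v_max - v_min) / capacity
        nxt = tour[(i + 1) % len(tour)]  # wrap around to return home
        time += dist[city][nxt] / speed
    return profit - rent * time


dist = [[0, 10], [10, 0]]
empty_run = ttp_objective([0, 1], dist, {}, 10, 0.1, 1.0, 1.0)
loaded_run = ttp_objective([0, 1], dist, {1: (10, 100)}, 10, 0.1, 1.0, 1.0)
```

Picking up the heavy item slows the return leg, so the profit gained has to outweigh the extra rent paid for the longer travel time.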

preprint 2022 arXiv

On the Utility of Marrying GIN and PMD for Improving Stack Overflow Code Snippets

Software developers are increasingly dependent on question and answer portals and blogs for coding solutions. While such interfaces provide useful information, there are concerns that code hosted here is often incorrect, insecure or incomplete. Previous work indeed detected a range of faults in code provided on Stack Overflow through the use of static analysis. Static analysis may go a long way towards quickly establishing the health of software code available online. In addition, mechanisms that enable rapid automated program improvement may then enhance such code. Accordingly, we present this proof of concept. We use the PMD static analysis tool to detect performance faults for a sample of Stack Overflow Java code snippets, before performing mutations on these snippets using GIN. We then re-analyse the performance faults in these snippets after the GIN mutations. GIN's RandomSampler was used to perform 17,986 unique line and statement patches on 3,034 snippets where PMD violations were removed from 770 patched versions. Our outcomes indicate that static analysis techniques may be combined with automated program improvement methods to enhance publicly available code with very low resource requirements. We discuss our planned research agenda in this regard.

preprint 2022 arXiv

Optimal offering strategy for an aggregator across multiple products of European day-ahead market

Most literature surrounding optimal bidding strategies for aggregators in the European day-ahead market (DAM) considers only hourly orders. While other order types (e.g., block orders) may better represent the temporal characteristics of certain sources of flexibility (e.g., behind-the-meter flexibility), the increased combinations from these orders make it hard to develop a tractable optimization formulation. Thus, our aim in this paper is to develop a tractable optimal offering strategy for flexibility aggregators in the European DAM (a.k.a. Elspot) considering these orders. Towards this, we employ a price-based mechanism of procuring flexibility and place hourly and regular block orders in the market. We develop two mixed-integer bi-linear programs: 1) a brute force formulation for validation and 2) a novel formulation based on logical constraints. To evaluate the performance of these formulations, we propose a generic flexibility model for an aggregated cluster of prosumers that considers the prosumers' responsiveness, inter-temporal dependencies, and seasonal and diurnal variations. The simulation results show that the proposed model significantly outperforms the brute force model in terms of computation speed. Also, we observed that using block orders has potential for profitability of an aggregator.

preprint 2022 arXiv

Self-Adaptive Systems: A Systematic Literature Review Across Categories and Domains

Context: Championed by IBM's 2003 vision paper on autonomic computing, the autonomic computing research field has seen increased research activity over the last 20 years. Several conferences and workshops have been established and have contributed to the autonomic computing knowledge base in search of a new kind of system -- a self-adaptive system (SAS). These systems are characterized by being context-aware and can act on that awareness. The actions carried out could be on the system or on the context (or environment). The underlying goal of a SAS is the sustained achievement of its goals despite changes in its environment. Objective: Despite a number of literature reviews on specific aspects of SASs ranging from their requirements to quality attributes, we lack a systematic understanding of the current state of the art. Method: This paper contributes a systematic literature review into self-adaptive systems using the dblp computer science bibliography as a database. We filtered the records systematically in successive steps to arrive at 293 relevant papers. Each paper was critically analyzed and categorized into an attribute matrix. This matrix consisted of five categories, with each category having multiple attributes. The attributes of each paper, along with the summary of its contents, formed the basis of the literature review that spanned 30 years. Results: We characterize the maturation process of the research area from theoretical papers over practical implementations to more holistic and generic approaches, frameworks, and exemplars, applied to areas such as networking, web services, and robotics, with much of the recent work focusing on IoT and IaaS. Conclusion: While there is an ebb and flow of application domains, domains like bio-inspired approaches, security, and cyber-physical systems showed promise to grow heading into the 2020s.

preprint 2022 arXiv

Software Engineering User Study Recruitment on Prolific: An Experience Report

Online participant recruitment platforms such as Prolific have been gaining popularity in research, as they enable researchers to easily access large pools of participants. However, participant quality can be an issue; participants may give incorrect information to gain access to more studies, adding unwanted noise to results. This paper details our experience recruiting participants from Prolific for a user study requiring programming skills in Node.js, with the aim of helping other researchers conduct similar studies. We explore a method of recruiting programmer participants using prescreening validation, attention checks and a series of programming knowledge questions. We received 680 responses, and determined that 55 met the criteria to be invited to our user study. We ultimately conducted user study sessions via video calls with 10 participants. We conclude this paper with a series of recommendations for researchers.

preprint 2022 arXiv

Visualization Onboarding Grounded in Educational Theories

The aim of visualization is to support people in dealing with large and complex information structures, to make these structures more comprehensible, facilitate exploration, and enable knowledge discovery. However, users often have problems reading and interpreting data from visualizations, in particular when they experience them for the first time. A lack of visualization literacy, i.e., knowledge in terms of domain, data, visual encoding, interaction, and also analytical methods can be observed. To support users in learning how to use new digital technologies, the concept of onboarding has been successfully applied in other domains. However, it has not received much attention from the visualization community so far. This chapter aims to fill this gap by defining the concept and systematically laying out the design space of onboarding in the context of visualization as a descriptive design space. On this basis, we present a survey of approaches from the academic community as well as from commercial products, especially surveying educational theories that inform the onboarding strategies. Additionally, we derived design considerations based on previous publications and present some guidelines for the design of visualization onboarding concepts.

preprint 2021 arXiv

GTOPX Space Mission Benchmarks

This contribution introduces the GTOPX space mission benchmark collection, which is an extension of the GTOP database published by the European Space Agency (ESA). GTOPX consists of ten individual benchmark instances representing real-world interplanetary space trajectory design problems. Compared with the original GTOP collection, GTOPX includes three new problem instances featuring mixed-integer and multi-objective properties. GTOPX offers simplified user handling, a unified benchmark function call and some minor bug corrections to the original GTOP implementation. Furthermore, GTOPX is linked from its original C++ source code to Python and Matlab based on dynamic link libraries, assuring computationally fast and accurate reproduction of the benchmark results in all three programming languages. Space mission trajectory design problems such as those represented in GTOPX are known to be highly non-linear and difficult to solve. The GTOPX collection, therefore, aims particularly at researchers wishing to put advanced (meta)heuristic and hybrid optimization algorithms to the test. The goal of this paper is to provide researchers with a manual and reference to the newly available GTOPX benchmark software.

preprint 2021 arXiv

MATE: A Model-based Algorithm Tuning Engine

In this paper, we introduce a Model-based Algorithm Tuning Engine, namely MATE, where the parameters of an algorithm are represented as expressions of the features of a target optimisation problem. In contrast to most static (feature-independent) algorithm tuning engines such as irace and SPOT, our approach aims to derive the best parameter configuration of a given algorithm for a specific problem, exploiting the relationships between the algorithm parameters and the features of the problem. We formulate the problem of finding the relationships between the parameters and the problem features as a symbolic regression problem and we use genetic programming to extract these expressions. For the evaluation, we apply our approach to the configuration of the (1+1) EA and RLS algorithms for the OneMax, LeadingOnes, BinValue and Jump optimisation problems, where the theoretically optimal algorithm parameters to the problems are available as functions of the features of the problems. Our study shows that the found relationships typically comply with known theoretical results, thus demonstrating a new opportunity to consider model-based parameter tuning as an effective alternative to the static algorithm tuning engines.
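To make the idea of a parameter expressed as a function of a problem feature concrete, here is a minimal sketch of the (1+1) EA on OneMax in which the mutation rate is computed from the problem size n. The 1/n expression is the commonly cited rate from the theory literature and is hard-coded here as an assumption; it is not an expression produced by MATE, and all names are illustrative.

```python
import random


def one_max(bits):
    """OneMax fitness: the number of ones in the bit string."""
    return sum(bits)


def mutation_rate(n):
    """Parameter written as an expression of the problem feature n."""
    return 1.0 / n


def one_plus_one_ea(n, max_iters=100_000, seed=0):
    """Run the (1+1) EA on OneMax; return the final parent and iterations used."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n)]
    p = mutation_rate(n)
    for iteration in range(max_iters):
        if one_max(parent) == n:
            return parent, iteration
        # Flip each bit independently with probability p.
        child = [bit ^ 1 if rng.random() < p else bit for bit in parent]
        # Accept the child if it is at least as good as the parent.
        if one_max(child) >= one_max(parent):
            parent = child
    return parent, max_iters


best, iterations = one_plus_one_ea(20)
```

A feature-independent tuner would return a single constant in place of `mutation_rate`, whereas a model-based one would search for the expression itself.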

preprint 2020 arXiv

A Non-Dominated Sorting Based Customized Random-Key Genetic Algorithm for the Bi-Objective Traveling Thief Problem

In this paper, we propose a method to solve a bi-objective variant of the well-studied Traveling Thief Problem (TTP). The TTP is a multi-component problem that combines two classic combinatorial problems: Traveling Salesman Problem (TSP) and Knapsack Problem (KP). We address the BI-TTP, a bi-objective version of the TTP, where the goal is to minimize the overall traveling time and to maximize the profit of the collected items. Our proposed method is based on a biased-random key genetic algorithm with customizations addressing problem-specific characteristics. We incorporate domain knowledge through a combination of near-optimal solutions of each subproblem in the initial population and use a custom repair operator to avoid the evaluation of infeasible solutions. The bi-objective aspect of the problem is addressed through an elite population extracted based on the non-dominated rank and crowding distance. Furthermore, we provide a comprehensive study showing the influence of each parameter on the performance. Finally, we discuss the results of the BI-TTP competitions at EMO-2019 and GECCO-2019 conferences where our method has won first and second places, respectively, thus proving its ability to find high-quality solutions consistently.
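The non-dominated rank mentioned in this abstract rests on a dominance test over the two objectives. The sketch below extracts the first non-dominated front from (time, profit) pairs, with time minimised and profit maximised. It is a simple quadratic-time illustration with hypothetical names and data, and it does not implement crowding distance or the authors' genetic algorithm.

```python
def dominates(a, b):
    """True if solution a dominates b; each solution is a (time, profit) pair."""
    time_a, profit_a = a
    time_b, profit_b = b
    no_worse = time_a <= time_b and profit_a >= profit_b
    strictly_better = time_a < time_b or profit_a > profit_b
    return no_worse and strictly_better


def non_dominated_front(solutions):
    """Solutions that no other solution dominates (rank 0)."""
    return [
        s for s in solutions
        if not any(dominates(other, s) for other in solutions)
    ]


# Hypothetical (travel time, profit) values for four candidate solutions.
candidates = [(10, 5), (12, 9), (11, 4), (15, 9)]
front = non_dominated_front(candidates)
```

Here (11, 4) is dominated by (10, 5), and (15, 9) by (12, 9), so only the two trade-off solutions remain in the front.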

preprint 2020 arXiv

An Annotated Dataset of Stack Overflow Post Edits

To improve software engineering, software repositories have been mined for code snippets and bug fixes. Typically, this mining takes place at the level of files or commits. To be able to dig deeper and to extract insights at a higher resolution, we hereby present an annotated dataset that contains over 7 million edits of code and text on Stack Overflow. Our preliminary study indicates that these edits might be a treasure trove for mining information about fine-grained patches, e.g., for the optimisation of non-functional properties.

preprint 2020 arXiv

An Evolutionary Deep Learning Method for Short-term Wind Speed Prediction: A Case Study of the Lillgrund Offshore Wind Farm

Accurate short-term wind speed forecasting is essential for large-scale integration of wind power generation. However, the seasonal and stochastic characteristics of wind speed make forecasting a challenging task. This study uses a new hybrid evolutionary approach that employs a popular evolutionary search algorithm, CMA-ES, to tune the hyper-parameters of two long short-term memory (LSTM) ANN models for wind prediction. The proposed hybrid approach is trained on data gathered from an offshore wind turbine installed in a Swedish wind farm located in the Baltic Sea. Two forecasting horizons including ten-minutes ahead (absolute short term) and one-hour ahead (short term) are considered in our experiments. Our experimental results indicate that the new approach is superior to five other applied machine learning models, i.e., polynomial neural network (PNN), feed-forward neural network (FNN), nonlinear autoregressive neural network (NAR) and adaptive neuro-fuzzy inference system (ANFIS), as measured by five performance criteria.

preprint 2020 arXiv

Ants can orienteer a thief in their robbery

The Thief Orienteering Problem (ThOP) is a multi-component problem that combines features of two classic combinatorial optimization problems: Orienteering Problem and Knapsack Problem. The ThOP is challenging due to the given time constraint and the interaction between its components. We propose an Ant Colony Optimization algorithm together with a new packing heuristic to deal individually and interactively with problem components. Our approach outperforms existing work on more than 90% of the benchmarking instances, with an average improvement of over 300%.

preprint 2020 arXiv

Design optimisation of a multi-mode wave energy converter

A wave energy converter (WEC) similar to the CETO system developed by Carnegie Clean Energy is considered for design optimisation. This WEC is able to absorb power from heave, surge and pitch motion modes, making the optimisation problem nontrivial. The WEC dynamics is simulated using the spectral-domain model taking into account hydrodynamic forces, viscous drag, and power take-off forces. The design parameters for optimisation include the buoy radius, buoy height, tether inclination angles, and control variables (damping and stiffness). The WEC design is optimised for the wave climate at Albany test site in Western Australia considering unidirectional irregular waves. Two objective functions are considered: (i) maximisation of the annual average power output, and (ii) minimisation of the levelised cost of energy (LCoE) for a given sea site. The LCoE calculation is approximated as a ratio of the produced energy to the significant mass of the system that includes the mass of the buoy and anchor system. Six different heuristic optimisation methods are applied in order to evaluate and compare the performance of the best known evolutionary algorithms, a swarm intelligence technique and a numerical optimisation approach. The results demonstrate that if we are interested in maximising energy production without taking into account the cost of manufacturing such a system, the buoy should be built as large as possible (20 m radius and 30 m height). However, if we want the system that produces cheap energy, then the radius of the buoy should be approximately 11-14 m while the height should be as low as possible. These results coincide with the overall design that Carnegie Clean Energy has selected for its CETO 6 multi-moored unit. However, it should be noted that this study is not informed by them, so this can be seen as an independent validation of the design choices.

preprint 2020 arXiv

Fitness Landscape Analysis of Dimensionally-Aware Genetic Programming Featuring Feynman Equations

Genetic programming is an often-used technique for symbolic regression: finding symbolic expressions that match data from an unknown function. To make the symbolic regression more efficient, one can also use dimensionally-aware genetic programming that constrains the physical units of the equation. Nevertheless, there is no formal analysis of how much dimensionality awareness helps in the regression process. In this paper, we conduct a fitness landscape analysis of dimensionally-aware genetic programming search spaces on a subset of equations from Richard Feynman's well-known lectures. We define an initialisation procedure and an accompanying set of neighbourhood operators for conducting the local search within the physical unit constraints. Our experiments show that the added information about the variable dimensionality can efficiently guide the search algorithm. Still, further analysis of the differences between the dimensionally-aware and standard genetic programming landscapes is needed to help in the design of efficient evolutionary operators to be used in a dimensionally-aware regression.

preprint 2020 arXiv

Genetic Improvement @ ICSE 2020

Following Prof. Mark Harman of Facebook's keynote and formal presentations (which are recorded in the proceedings) there was a wide-ranging discussion at the eighth international Genetic Improvement workshop, GI-2020 @ ICSE (held as part of the 42nd ACM/IEEE International Conference on Software Engineering on Friday 3rd July 2020). Topics included industry take-up, human factors, explainability (explainability, justifiability, exploitability) and GI benchmarks. We also contrast various recent online approaches (e.g. SBST 2020) to holding virtual computer science conferences and workshops via the WWW on the Internet without face-to-face interaction. Finally we speculate on how the Coronavirus Covid-19 Pandemic will affect research next year and into the future.

preprint 2020 arXiv

Human-Like Summaries from Heterogeneous and Time-Windowed Software Development Artefacts

Automatic text summarisation has drawn considerable interest in the area of software engineering. It is challenging to summarise the activities related to a software project, (1) because of the volume and heterogeneity of involved software artefacts, and (2) because it is unclear what information a developer seeks in such a multi-document summary. We present the first framework for summarising multi-document software artefacts containing heterogeneous data within a given time frame. To produce human-like summaries, we employ a range of iterative heuristics to minimise the cosine-similarity between texts and high-dimensional feature vectors. A first study shows that users find the automatically generated summaries the most useful when they are generated using word similarity and based on the eight most relevant software artefacts.
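The cosine similarity that this abstract uses as its optimisation signal can be sketched over simple bag-of-words counts. This is only an illustration of the measure itself, assuming whitespace tokenisation and raw term counts; the feature vectors and iterative heuristics of the actual framework are not reproduced, and the example texts are made up.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same = cosine_similarity("fix login bug", "fix login bug")
partial = cosine_similarity("fix login bug", "fix logout bug")
disjoint = cosine_similarity("fix login bug", "update readme")
```

Identical texts score close to 1, texts with no shared tokens score 0, and partial overlap falls in between.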

preprint 2020 arXiv

Hybrid Neuro-Evolutionary Method for Predicting Wind Turbine Power Output

Reliable wind turbine power prediction is imperative to the planning, scheduling and control of wind energy farms for stable power production. In recent years Machine Learning (ML) methods have been successfully applied in a wide range of domains, including renewable energy. However, due to the challenging nature of power prediction in wind farms, current models fall far short of the accuracy required by industry. In this paper, we deploy a composite ML approach--namely a hybrid neuro-evolutionary algorithm--for accurate forecasting of the power output in wind-turbine farms. We use historical data in the supervisory control and data acquisition (SCADA) systems as input to estimate the power output from an onshore wind farm in Sweden. At the beginning stage, the k-means clustering method and an Autoencoder are employed, respectively, to detect and filter noise in the SCADA measurements. Next, with the prior knowledge that the underlying wind patterns are highly non-linear and diverse, we combine a self-adaptive differential evolution (SaDE) algorithm as a hyper-parameter optimizer, and a recurrent neural network (RNN) called Long Short-Term Memory (LSTM) to model the power curve of a wind turbine in a farm. Two short time forecasting horizons, including ten-minutes ahead and one-hour ahead, are considered in our experiments. We show that our approach outperforms its counterparts.

preprint 2020 arXiv

Optimisation of Large Wave Farms using a Multi-strategy Evolutionary Framework

Wave energy is a fast-developing and promising renewable energy resource. The primary goal of this research is to maximise the total harnessed power of a large wave farm consisting of fully-submerged three-tether wave energy converters (WECs). Energy maximisation for large farms is a challenging search problem due to the costly calculations of the hydrodynamic interactions between WECs in a large wave farm and the high dimensionality of the search space. To address this problem, we propose a new hybrid multi-strategy evolutionary framework combining smart initialisation, binary population-based evolutionary algorithm, discrete local search and continuous global optimisation. For assessing the performance of the proposed hybrid method, we compare it with a wide variety of state-of-the-art optimisation approaches, including six continuous evolutionary algorithms, four discrete search techniques and three hybrid optimisation methods. The results show that the proposed method performs considerably better in terms of convergence speed and farm output.

preprint 2020 arXiv

Optimising the Fit of Stack Overflow Code Snippets into Existing Code

Software developers often reuse code from online sources such as Stack Overflow within their projects. However, the process of searching for code snippets and integrating them within existing source code can be tedious. In order to improve efficiency and reduce time spent on code reuse, we present an automated code reuse tool for the Eclipse IDE (Integrated Development Environment), NLP2TestableCode. NLP2TestableCode can not only search for Java code snippets using natural language tasks, but also evaluate code snippets based on a user's existing code, modify snippets to improve fit and correct errors, before presenting the user with the best snippet, all without leaving the editor. NLP2TestableCode also includes functionality to automatically generate customisable test cases and suggest argument and return types, in order to further evaluate code snippets. In evaluation, NLP2TestableCode was capable of finding compilable code snippets for 82.9% of tasks, and testable code snippets for 42.9%.

preprint 2020 arXiv

The Dynamic Travelling Thief Problem: Benchmarks and Performance of Evolutionary Algorithms

Many real-world optimisation problems involve dynamic and stochastic components. While problems with multiple interacting components are omnipresent in inherently dynamic domains like supply-chain optimisation and logistics, most research on dynamic problems focuses on single-component problems. With this article, we define a number of scenarios based on the Travelling Thief Problem to enable research on the effect of dynamic changes to sub-components. Our investigations of 72 scenarios and seven algorithms show that -- depending on the instance, the magnitude of the change, and the algorithms in the portfolio -- it is preferable to either restart the optimisation from scratch or to continue with the previously valid solutions.

preprint 2020 arXiv

Towards a Structural Framework for Explicit Domain Knowledge in Visual Analytics

Clinicians and other analysts working with healthcare data are in need of better support to cope with large and complex data. While an increasing number of visual analytics environments integrate explicit domain knowledge as a means to deliver a precise representation of the available data, theoretical work so far has focused on the role of knowledge in the visual analytics process. There has been little discussion about how such explicit domain knowledge can be structured in a generalized framework. This paper collects desiderata for such a structural framework, proposes how to address these desiderata based on the model of linked data, and demonstrates the applicability in a visual analytics environment for physiotherapy.

preprint 2020 arXiv

Towards Rigorous Validation of Energy Optimisation Experiments

The optimisation of software energy consumption is of growing importance across all scales of modern computing, i.e., from embedded systems to data-centres. Practitioners in the field of Search-Based Software Engineering and Genetic Improvement of Software acknowledge that optimising software energy consumption is difficult due to noisy and expensive fitness evaluations. However, it is apparent from results to date that more progress needs to be made in rigorously validating optimisation results. This problem is pressing because modern computing platforms have highly complex and variable behaviour with respect to energy consumption. To compare solutions fairly we propose in this paper a new validation approach called R3-validation which exercises software variants in a rotated-round-robin order. Using a case study, we present an in-depth analysis of the impacts of changing system states on software energy usage, and we show how R3-validation mitigates these. We compare it with current validation approaches across multiple devices and operating systems, and we show that it aligns better with actual platform behaviour.
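One plausible reading of the "rotated-round-robin order" named in this abstract is that every round measures all software variants, with the starting variant shifted by one each round, so that no variant always runs first or last while the system state drifts. The sketch below builds such a schedule. This interpretation, the function name and the variant labels are assumptions; the abstract does not spell out the exact R3-validation procedure.

```python
def rotated_round_robin(variants, rounds):
    """Build a schedule where each round rotates the variant order by one."""
    if not variants:
        return []
    n = len(variants)
    schedule = []
    for round_index in range(rounds):
        shift = round_index % n
        # Move the first `shift` variants to the end of this round's order.
        schedule.append(variants[shift:] + variants[:shift])
    return schedule


# Hypothetical variant labels: an original program and two optimised versions.
schedule = rotated_round_robin(["A", "B", "C"], rounds=3)
```

Over three rounds, each of the three variants occupies each position exactly once, which spreads position-dependent noise evenly across variants.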

preprint 2019 arXiv

A Hybrid Cooperative Co-evolution Algorithm Framework for Optimising Power Take Off and Placements of Wave Energy Converters

Wave energy technologies have the potential to play a significant role in the supply of renewable energy on a world scale. One of the most promising designs for wave energy converters (WECs) is the fully submerged buoy. In this work, we explore the optimisation of WEC arrays consisting of a three-tether buoy model called CETO. Such arrays can be optimised for total energy output by adjusting both the relative positions of buoys in farms and also the power-take-off (PTO) parameters for each buoy. The search space for these parameters is complex and multi-modal. Moreover, the evaluation of each parameter setting is computationally expensive -- limiting the number of full model evaluations that can be made. To handle this problem, we propose a new hybrid cooperative co-evolution algorithm (HCCA). HCCA consists of a symmetric local search plus Nelder-Mead and a cooperative co-evolution algorithm (CC) with a backtracking strategy for optimising the positions and PTO settings of WECs, respectively. Moreover, a new adaptive scenario is proposed for tuning the hyper-parameters of the grey wolf optimiser (AGWO). AGWO participates notably with other applied optimisers in HCCA. For assessing the effectiveness of the proposed approach, five popular Evolutionary Algorithms (EAs), four alternating optimisation methods and two modern hybrid ideas (LS-NM and SLS-NM-B) are carefully compared in four real wave situations (Adelaide, Tasmania, Sydney and Perth) with two wave farm sizes (4 and 16). According to the experimental outcomes, the hybrid cooperative framework exhibits better performance in terms of both runtime and quality of obtained solutions.

preprint2019arXiv

Better Software Analytics via "DUO": Data Mining Algorithms Using/Used-by Optimizers

This paper claims that a new field of empirical software engineering research and practice is emerging: data mining using/used-by optimizers for empirical studies or DUO. For example, data miners can generate models that are explored by optimizers. Also, optimizers can advise how to best adjust the control parameters of a data miner. This combined approach acts like an agent leaning over the shoulder of an analyst that advises "ask this question next" or "ignore that problem, it is not relevant to your goals". Further, those agents can help us build "better" predictive models, where "better" can be either greater predictive accuracy or faster modeling time (which, in turn, enables the exploration of a wider range of options). We also caution that the era of papers that just use data miners is coming to an end. Results obtained from an unoptimized data miner can be quickly refuted, just by applying an optimizer to produce a different (and better performing) model. Our conclusion, hence, is that for software analytics it is possible, useful and necessary to combine data mining and optimization using DUO.

preprint2018arXiv

Data-Driven Search-based Software Engineering

This paper introduces Data-Driven Search-based Software Engineering (DSE), which combines insights from Mining Software Repositories (MSR) and Search-based Software Engineering (SBSE). While MSR formulates software engineering problems as data mining problems, SBSE reformulates SE problems as optimization problems and uses meta-heuristic algorithms to solve them. Both MSR and SBSE share the common goal of providing insights to improve software engineering. The algorithms used in these two areas also have intrinsic relationships. We, therefore, argue that combining these two fields is useful for situations (a) which require learning from a large data source or (b) when optimizers need to know the lay of the land to find better solutions, faster. This paper aims to answer the following three questions: (1) What are the various topics addressed by DSE? (2) What types of data are used by the researchers in this area? (3) What research approaches do researchers use? The paper briefly sets out to act as a practical guide to develop new DSE techniques and also to serve as a teaching resource. This paper also presents a resource (tiny.cc/data-se) for exploring DSE. The resource contains 89 artifacts which are related to DSE, divided into 13 groups such as requirements engineering, software product lines, software processes. All the materials in this repository have been used in recent software engineering papers; i.e., for all this material, there exist baseline results against which researchers can comparatively assess their new ideas.

preprint2016arXiv

A case study of algorithm selection for the traveling thief problem

Many real-world problems are composed of several interacting components. In order to facilitate research on such interactions, the Traveling Thief Problem (TTP) was created in 2013 as the combination of two well-understood combinatorial optimization problems. With this article, we contribute in four ways. First, we create a comprehensive dataset that comprises the performance data of 21 TTP algorithms on the full original set of 9720 TTP instances. Second, we define 55 characteristics for all TTP instances that can be used to select the best algorithm on a per-instance basis. Third, we use these algorithms and features to construct the first algorithm portfolios for TTP, clearly outperforming the single best algorithm. Finally, we study which algorithms contribute most to this portfolio.

preprint2016arXiv

A Generic Bet-and-run Strategy for Speeding Up Traveling Salesperson and Minimum Vertex Cover

A common strategy for improving optimization algorithms is to restart the algorithm when it is believed to be trapped in an inferior part of the search space. However, while specific restart strategies have been developed for specific problems (and specific algorithms), restarts are typically not regarded as a general tool to speed up an optimization algorithm. In fact, many optimization algorithms do not employ restarts at all. Recently, "bet-and-run" was introduced in the context of mixed-integer programming, where first a number of short runs with randomized initial conditions is made, and then the most promising run of these is continued. In this article, we consider two classical NP-complete combinatorial optimization problems, traveling salesperson and minimum vertex cover, and study the effectiveness of different bet-and-run strategies. In particular, our restart strategies do not take any problem knowledge into account, nor are they tailored to the optimization algorithm. Therefore, they can be used off-the-shelf. We observe that state-of-the-art solvers for these problems can benefit significantly from restarts on standard benchmark instances.
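The two phases described in the abstract (several short randomized runs, then continuing the most promising one) can be sketched as follows. This is a toy illustration under stated assumptions: the objective is the OneMax bit-counting function rather than traveling salesperson or minimum vertex cover, the inner optimizer is a simple randomized local search standing in for a real solver, and the names `bet_and_run`, `k`, `t_short` and `t_total` are hypothetical.

```python
import random
from typing import List, Tuple


def one_max(bits: List[int]) -> int:
    """Toy objective (an assumption of this sketch): number of ones."""
    return sum(bits)


def local_search(bits: List[int], steps: int,
                 rng: random.Random) -> Tuple[List[int], int]:
    """Randomized local search: flip one random bit, keep it if not worse."""
    current = list(bits)
    best = one_max(current)
    for _ in range(steps):
        i = rng.randrange(len(current))
        current[i] ^= 1
        fit = one_max(current)
        if fit >= best:
            best = fit
        else:
            current[i] ^= 1  # undo a worsening flip
    return current, best


def bet_and_run(n: int, k: int, t_short: int, t_total: int,
                seed: int = 0) -> Tuple[List[int], int, int]:
    """Phase 1: k short runs from random starts. Phase 2: continue the best."""
    rng = random.Random(seed)
    runs = []
    for _ in range(k):
        start = [rng.randint(0, 1) for _ in range(n)]
        runs.append(local_search(start, t_short, rng))
    best_bits, phase1_fit = max(runs, key=lambda run: run[1])
    remaining = max(t_total - k * t_short, 0)  # budget left after phase 1
    final_bits, final_fit = local_search(best_bits, remaining, rng)
    return final_bits, final_fit, phase1_fit


final_bits, final_fit, phase1_fit = bet_and_run(
    n=50, k=5, t_short=20, t_total=1000)
print(phase1_fit, final_fit)
```

The strategy only needs to start the solver, read its current objective value and resume it, which is why it requires no problem knowledge. How to split the budget between `k` and `t_short` is the design question the article studies; the values above are arbitrary.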

preprint2016arXiv

Evolutionary computation for multicomponent problems: opportunities and future directions

Over the past 30 years many researchers in the field of evolutionary computation have put a lot of effort into introducing various approaches for solving hard problems. Most of these problems have been inspired by major industries so that solving them, by providing either optimal or near-optimal solutions, was of major significance. Indeed, this was a very promising trajectory as advances in these problem-solving approaches could result in adding value to major industries. In this paper we revisit this trajectory to find out whether the attempts that started three decades ago are still aligned with the same goal, as complexities of real-world problems increased significantly. We present some examples of modern real-world problems, discuss why they might be difficult to solve, and whether there is any mismatch between these examples and the problems that are investigated in the evolutionary computation area.

preprint2014arXiv

Seeding the Initial Population of Multi-Objective Evolutionary Algorithms: A Computational Study

Most experimental studies initialize the population of evolutionary algorithms with random genotypes. In practice, however, optimizers are typically seeded with good candidate solutions either previously known or created according to some problem-specific method. This "seeding" has been studied extensively for single-objective problems. For multi-objective problems, however, very little literature is available on the approaches to seeding and their individual benefits and disadvantages. In this article, we are trying to narrow this gap via a comprehensive computational study on common real-valued test functions. We investigate the effect of two seeding techniques for five algorithms on 48 optimization problems with 2, 3, 4, 6, and 8 objectives. We observe that some functions (e.g., DTLZ4 and the LZ family) benefit significantly from seeding, while others (e.g., WFG) profit less. The advantage of seeding also depends on the examined algorithm.

preprint2012arXiv

A Fast and Effective Local Search Algorithm for Optimizing the Placement of Wind Turbines

The placement of wind turbines on a given area of land such that the wind farm produces a maximum amount of energy is a challenging optimization problem. In this article, we tackle this problem, taking into account wake effects that are produced by the different turbines on the wind farm. We significantly improve upon existing results for the minimization of wake effects by developing a new problem-specific local search algorithm. One key step in the speed-up of our algorithm is the reduction in computation time needed to assess a given wind farm layout compared to previous approaches. Our new method allows the optimization of large real-world scenarios within a single night on a standard computer, whereas weeks on specialized computing servers were required for previous approaches.

preprint2012arXiv

A Novel Feature-Based Approach to Characterize Algorithm Performance for the Traveling Salesman Problem

Meta-heuristics are frequently used to tackle NP-hard combinatorial optimization problems. With this paper we contribute to the understanding of the success of 2-opt based local search algorithms for solving the traveling salesman problem (TSP). Although 2-opt is widely used in practice, it is hard to understand its success from a theoretical perspective. We take a statistical approach and examine the features of TSP instances that make the problem either hard or easy to solve. As a measure of problem difficulty for 2-opt we use the approximation ratio that it achieves on a given instance. Our investigations point out important features that make TSP instances hard or easy to be approximated by 2-opt.
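For readers unfamiliar with 2-opt, the move it repeats is: remove two edges of the tour and reconnect the two resulting paths the other way round, keeping the change if the tour gets shorter. The sketch below is a minimal textbook-style version on Euclidean points, not the implementation evaluated in the paper. The helper names and the four-point instance are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def tour_length(points: Sequence[Point], tour: Sequence[int]) -> float:
    """Length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]])
               for i in range(n))


def two_opt(points: Sequence[Point], tour: Sequence[int]) -> List[int]:
    """Apply improving 2-opt moves until none is left (a local optimum)."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a,b),(c,d) become (a,c),(b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour


# Unit square visited in a self-crossing order.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
start = [0, 1, 2, 3]
result = two_opt(points, start)
print(tour_length(points, start), tour_length(points, result))
```

The difficulty measure named in the abstract, the approximation ratio, would then be the length of the 2-opt tour divided by the optimal tour length for the same instance. Computing the optimum is outside the scope of this sketch.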

preprint2011arXiv

Computational Complexity Results for Genetic Programming and the Sorting Problem

Genetic Programming (GP) has found various applications. Understanding this type of algorithm from a theoretical point of view is a challenging task. The first results on the computational complexity of GP have been obtained for problems with isolated program semantics. With this paper, we push forward the computational complexity analysis of GP on a problem with dependent program semantics. We study the well-known sorting problem in this context and analyze rigorously how GP can deal with different measures of sortedness.

preprint2011arXiv

Evolving Pacing Strategies for Team Pursuit Track Cycling

Team pursuit track cycling is a bicycle racing sport held on velodromes and is part of the Summer Olympics. It involves the use of strategies to minimize the overall time that a team of cyclists needs to complete a race. We present an optimisation framework for team pursuit track cycling and show how to evolve strategies using metaheuristics for this interesting real-world problem. Our experimental results show that these heuristics lead to significantly better strategies than state-of-the-art strategies that are currently used by teams of cyclists.

preprint2011arXiv

Predicting the Energy Output of Wind Farms Based on Weather Data: Important Variables and their Correlation

Wind energy plays an increasing role in the supply of energy world-wide. The energy output of a wind farm is highly dependent on the weather conditions present at the wind farm. If the output can be predicted more accurately, energy suppliers can coordinate the collaborative production of different energy sources more efficiently to avoid costly overproduction. With this paper, we take a computer science perspective on energy prediction based on weather data and analyze the important parameters as well as their correlation with the energy output. To deal with the interaction of the different parameters we use symbolic regression based on the genetic programming tool DataModeler. Our studies are carried out on publicly available weather and energy data for a wind farm in Australia. We reveal the correlation of the different variables with the energy output. The model obtained for energy prediction gives a very reliable prediction of the energy output for newly given weather data.

preprint2010arXiv

Faster Black-Box Algorithms Through Higher Arity Operators

We extend the work of Lehre and Witt (GECCO 2010) on the unbiased black-box model by considering higher arity variation operators. In particular, we show that already for binary operators the black-box complexity of LeadingOnes drops from $\Theta(n^2)$ for unary operators to $O(n \log n)$. For OneMax, the $\Omega(n \log n)$ unary black-box complexity drops to $O(n)$ in the binary case. For $k$-ary operators, $k \leq n$, the OneMax complexity further decreases to $O(n/\log k)$.

preprint2010arXiv

Simple Max-Min Ant Systems and the Optimization of Linear Pseudo-Boolean Functions

With this paper, we contribute to the understanding of ant colony optimization (ACO) algorithms by formally analyzing their runtime behavior. We study simple MAX-MIN ant systems on the class of linear pseudo-Boolean functions defined on binary strings of length $n$. Our investigations point out how the progress according to function values is stored in pheromone. We provide a general upper bound of $O((n^3 \log n)/\rho)$ for two ACO variants on all linear functions, where $\rho$ determines the pheromone update strength. Furthermore, we show improved bounds for two well-known linear pseudo-Boolean functions called OneMax and BinVal and give additional insights using an experimental study.
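To make the role of the update strength concrete, here is a minimal sketch of a simple MAX-MIN ant system on OneMax, following the formulation commonly used in the runtime-analysis literature. Several details are assumptions not stated in the abstract: pheromone values are kept within $[1/n, 1 - 1/n]$, they start at $1/2$, the best-so-far solution is replaced only on strict improvement, and the function name `mmas_onemax` is hypothetical.

```python
import random
from typing import List, Tuple


def mmas_onemax(n: int, rho: float, max_iters: int,
                seed: int = 0) -> Tuple[List[int], int, int, List[float]]:
    """Simple MAX-MIN ant system on OneMax (illustrative sketch).

    tau[i] is the probability that the ant sets bit i to 1.
    """
    rng = random.Random(seed)
    lo, hi = 1.0 / n, 1.0 - 1.0 / n   # assumed pheromone borders
    tau = [0.5] * n                   # assumed initial pheromone
    best: List[int] = []
    best_fit = -1
    for iteration in range(1, max_iters + 1):
        # One ant constructs a bit string from the pheromone values.
        x = [1 if rng.random() < t else 0 for t in tau]
        fit = sum(x)                  # OneMax: number of ones
        if fit > best_fit:            # strict improvement only
            best, best_fit = x, fit
        # Reinforce the best-so-far solution, evaporate the rest.
        tau = [min((1.0 - rho) * t + rho, hi) if bit
               else max((1.0 - rho) * t, lo)
               for t, bit in zip(tau, best)]
        if best_fit == n:
            return best, best_fit, iteration, tau
    return best, best_fit, max_iters, tau


best, best_fit, iterations, tau = mmas_onemax(n=20, rho=0.5, max_iters=5000)
print(best_fit, iterations)
```

A larger $\rho$ moves the pheromone towards the best-so-far solution faster, which matches the $1/\rho$ factor in the upper bound quoted above. The sketch is for intuition only and does not reproduce the analysis.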

39 published item(s)

NCQ: Code reuse support for node.js developers

Automatically Categorising GitHub Repositories by Application Domain

Is Surprisal in Issue Trackers Actionable?

On the Fitness Landscapes of Interdependency Models in the Travelling Thief Problem

On the Utility of Marrying GIN and PMD for Improving Stack Overflow Code Snippets

Optimal offering strategy for an aggregator across multiple products of European day-ahead market

Self-Adaptive Systems: A Systematic Literature Review Across Categories and Domains

Software Engineering User Study Recruitment on Prolific: An Experience Report

Visualization Onboarding Grounded in Educational Theories

GTOPX Space Mission Benchmarks

MATE: A Model-based Algorithm Tuning Engine

A Non-Dominated Sorting Based Customized Random-Key Genetic Algorithm for the Bi-Objective Traveling Thief Problem

An Annotated Dataset of Stack Overflow Post Edits

An Evolutionary Deep Learning Method for Short-term Wind Speed Prediction: A Case Study of the Lillgrund Offshore Wind Farm

Ants can orienteer a thief in their robbery

Design optimisation of a multi-mode wave energy converter

Fitness Landscape Analysis of Dimensionally-Aware Genetic Programming Featuring Feynman Equations

Genetic Improvement @ ICSE 2020

Human-Like Summaries from Heterogeneous and Time-Windowed Software Development Artefacts

Hybrid Neuro-Evolutionary Method for Predicting Wind Turbine Power Output

Optimisation of Large Wave Farms using a Multi-strategy Evolutionary Framework

Optimising the Fit of Stack Overflow Code Snippets into Existing Code

The Dynamic Travelling Thief Problem: Benchmarks and Performance of Evolutionary Algorithms

Towards a Structural Framework for Explicit Domain Knowledge in Visual Analytics

Towards Rigorous Validation of Energy Optimisation Experiments

A Hybrid Cooperative Co-evolution Algorithm Framework for Optimising Power Take Off and Placements of Wave Energy Converters

Better Software Analytics via "DUO": Data Mining Algorithms Using/Used-by Optimizers

Data-Driven Search-based Software Engineering

A case study of algorithm selection for the traveling thief problem

A Generic Bet-and-run Strategy for Speeding Up Traveling Salesperson and Minimum Vertex Cover

Evolutionary computation for multicomponent problems: opportunities and future directions

Seeding the Initial Population of Multi-Objective Evolutionary Algorithms: A Computational Study

A Fast and Effective Local Search Algorithm for Optimizing the Placement of Wind Turbines

A Novel Feature-Based Approach to Characterize Algorithm Performance for the Traveling Salesman Problem

Computational Complexity Results for Genetic Programming and the Sorting Problem

Evolving Pacing Strategies for Team Pursuit Track Cycling

Predicting the Energy Output of Wind Farms Based on Weather Data: Important Variables and their Correlation

Faster Black-Box Algorithms Through Higher Arity Operators

Simple Max-Min Ant Systems and the Optimization of Linear Pseudo-Boolean Functions