Researcher profile

Markus Wagner

Markus Wagner contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
28works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

28 published item(s)

preprint2023arXiv

NCQ: Code reuse support for node.js developers

Code reuse is an important part of software development. The adoption of code reuse practices is especially common among Node.js developers. The Node.js package manager, NPM, indexes over 1 Million packages and developers often seek out packages to solve programming tasks. Due to the vast number of packages, selecting the right package is difficult and time consuming. With the goal of improving productivity of developers that heavily reuse code through third-party packages, we present Node Code Query (NCQ), a Read-Eval-Print-Loop environment that allows developers to 1) search for NPM packages using natural language queries, 2) search for code snippets related to those packages, 3) automatically correct errors in these code snippets, 4) quickly setup new environments for testing those snippets, and 5) transition between search and editing modes. In two user studies with a total of 20 participants, we find that participants begin programming faster and conclude tasks faster with NCQ than with baseline approaches, and that they like, among other features, the search for code snippets and packages. Our results suggest that NCQ makes Node.js developers more efficient in reusing code.

preprint2022arXiv

Automatically Categorising GitHub Repositories by Application Domain

GitHub is the largest host of open source software on the Internet. This large, freely accessible database has attracted the attention of practitioners and researchers alike. But as GitHub's growth continues, it is becoming increasingly hard to navigate the plethora of repositories which span a wide range of domains. Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository and reasoning about project quality. In this work, we build on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier for categorising repositories by their application domain. The classifier uses state-of-the-art natural language processing techniques and machine learning to learn from multiple data sources and catalogue repositories according to five application domains. We contribute with (1) an automated classifier that can assign popular repositories to each application domain with at least 70% precision, (2) an investigation of the approach's performance on less popular repositories, and (3) a practical application of this approach to answer how the adoption of software engineering practices differs across application domains. Our work aims to help the GitHub community identify repositories of interest and opens promising avenues for future work investigating differences between repositories from different application domains.

preprint2022arXiv

Is Surprisal in Issue Trackers Actionable?

Background. From information theory, surprisal is a measurement of how unexpected an event is. Statistical language models provide a probabilistic approximation of natural languages, and because surprisal is constructed with the probability of an event occuring, it is therefore possible to determine the surprisal associated with English sentences. The issues and pull requests of software repository issue trackers give insight into the development process and likely contain the surprising events of this process. Objective. Prior works have identified that unusual events in software repositories are of interest to developers, and use simple code metrics-based methods for detecting them. In this study we will propose a new method for unusual event detection in software repositories using surprisal. With the ability to find surprising issues and pull requests, we intend to further analyse them to determine if they actually hold importance in a repository, or if they pose a significant challenge to address. If it is possible to find bad surprises early, or before they cause additional troubles, it is plausible that effort, cost and time will be saved as a result. Method. After extracting the issues and pull requests from 5000 of the most popular software repositories on GitHub, we will train a language model to represent these issues. We will measure their perceived importance in the repository, measure their resolution difficulty using several analogues, measure the surprisal of each, and finally generate inferential statistics to describe any correlations.

preprint2022arXiv

On the Fitness Landscapes of Interdependency Models in the Travelling Thief Problem

Since its inception in 2013, the Travelling Thief Problem (TTP) has been widely studied as an example of problems with multiple interconnected sub-problems. The dependency in this model arises when tying the travelling time of the "thief" to the weight of the knapsack. However, other forms of dependency as well as combinations of dependencies should be considered for investigation, as they are often found in complex real-world problems. Our goal is to study the impact of different forms of dependency in the TTP using a simple local search algorithm. To achieve this, we use Local Optima Networks, a technique for analysing the fitness landscape.

preprint2022arXiv

On the Utility of Marrying GIN and PMD for Improving Stack Overflow Code Snippets

Software developers are increasingly dependent on question and answer portals and blogs for coding solutions. While such interfaces provide useful information, there are concerns that code hosted here is often incorrect, insecure or incomplete. Previous work indeed detected a range of faults in code provided on Stack Overflow through the use of static analysis. Static analysis may go a far way towards quickly establishing the health of software code available online. In addition, mechanisms that enable rapid automated program improvement may then enhance such code. Accordingly, we present this proof of concept. We use the PMD static analysis tool to detect performance faults for a sample of Stack Overflow Java code snippets, before performing mutations on these snippets using GIN. We then re-analyse the performance faults in these snippets after the GIN mutations. GIN's RandomSampler was used to perform 17,986 unique line and statement patches on 3,034 snippets where PMD violations were removed from 770 patched versions. Our outcomes indicate that static analysis techniques may be combined with automated program improvement methods to enhance publicly available code with very little resource requirements. We discuss our planned research agenda in this regard.

preprint2022arXiv

Optimal offering strategy for an aggregator across multiple products of European day-ahead market

Most literature surrounding optimal bidding strategies for aggregators in European day-ahead market (DAM) considers only hourly orders. While other order types (e.g., block orders) may better represent the temporal characteristics of certain sources of flexibility (e.g., behind-the-meter flexibility), the increased combinations from these orders make it hard to develop a tractable optimization formulation. Thus, our aim in this paper is to develop a tractable optimal offering strategy for flexibility aggregators in the European DAM (a.k.a. Elspot) considering these orders. Towards this, we employ a price-based mechanism of procuring flexibility and place hourly and regular block orders in the market. We develop two mixed-integer bi-linear programs: 1) a brute force formulation for validation and 2) a novel formulation based on logical constraints. To evaluate the performance of these formulations, we proposed a generic flexibility model for an aggregated cluster of prosumers that considers the prosumers' responsiveness, inter-temporal dependencies, and seasonal and diurnal variations. The simulation results show that the proposed model significantly outperforms the brute force model in terms of computation speed. Also, we observed that using block orders has potential for profitability of an aggregator.

preprint2022arXiv

Self-Adaptive Systems: A Systematic Literature Review Across Categories and Domains

Context: Championed by IBM's vision of autonomic computing paper in 2003, the autonomic computing research field has seen increased research activity over the last 20 years. Several conferences and workshops have been established and have contributed to the autonomic computing knowledge base in search of a new kind of system -- a self-adaptive system (SAS). These systems are characterized by being context-aware and can act on that awareness. The actions carried out could be on the system or on the context (or environment). The underlying goal of a SAS is the sustained achievement of its goals despite changes in its environment. Objective: Despite a number of literature reviews on specific aspects of SASs ranging from their requirements to quality attributes, we lack a systematic understanding of the current state of the art. Method: This paper contributes a systematic literature review into self-adaptive systems using the dblp computer science bibliography as a database. We filtered the records systematically in successive steps to arrive at 293 relevant papers. Each paper was critically analyzed and categorized into an attribute matrix. This matrix consisted of five categories, with each category having multiple attributes. The attributes of each paper, along with the summary of its contents formed the basis of the literature review that spanned 30 years. Results: We characterize the maturation process of the research area from theoretical papers over practical implementations to more holistic and generic approaches, frameworks, and exemplars, applied to areas such as networking, web services, and robotics, with much of the recent work focusing on IoT and IaaS. Conclusion: While there is an ebb and flow of application domains, domains like bio-inspired approaches, security, and cyber physical systems showed promise to grow heading into the 2020s.

preprint2022arXiv

Software Engineering User Study Recruitment on Prolific: An Experience Report

Online participant recruitment platforms such as Prolific have been gaining popularity in research, as they enable researchers to easily access large pools of participants. However, participant quality can be an issue; participants may give incorrect information to gain access to more studies, adding unwanted noise to results. This paper details our experience recruiting participants from Prolific for a user study requiring programming skills in Node.js, with the aim of helping other researchers conduct similar studies. We explore a method of recruiting programmer participants using prescreening validation, attention checks and a series of programming knowledge questions. We received 680 responses, and determined that 55 met the criteria to be invited to our user study. We ultimately conducted user study sessions via video calls with 10 participants. We conclude this paper with a series of recommendations for researchers.

preprint2022arXiv

Visualization Onboarding Grounded in Educational Theories

The aim of visualization is to support people in dealing with large and complex information structures, to make these structures more comprehensible, facilitate exploration, and enable knowledge discovery. However, users often have problems reading and interpreting data from visualizations, in particular when they experience them for the first time. A lack of visualization literacy, i.e., knowledge in terms of domain, data, visual encoding, interaction, and also analytical methods can be observed. To support users in learning how to use new digital technologies, the concept of onboarding has been successfully applied in other domains. However, it has not received much attention from the visualization community so far. This chapter aims to fill this gap by defining the concept and systematically laying out the design space of onboarding in the context of visualization as a descriptive design space. On this basis, we present a survey of approaches from the academic community as well as from commercial products, especially surveying educational theories that inform the onboarding strategies. Additionally, we derived design considerations based on previous publications and present some guidelines for the design of visualization onboarding concepts.

preprint2021arXiv

GTOPX Space Mission Benchmarks

This contribution introduces the GTOPX space mission benchmark collection, which is an extension of GTOP database published by the European Space Agency (ESA). GTOPX consists of ten individual benchmark instances representing real-world interplanetary space trajectory design problems. In regard to the original GTOP collection, GTOPX includes three new problem instances featuring mixed-integer and multi-objective properties. GTOPX enables a simplified user handling, unified benchmark function call and some minor bug corrections to the original GTOP implementation. Furthermore, GTOPX is linked from it's original C++ source code to Python and Matlab based on dynamic link libraries, assuring computationally fast and accurate reproduction of the benchmark results in all three programming languages. Space mission trajectory design problems as those represented in GTOPX are known to be highly non-linear and difficult to solve. The GTOPX collection, therefore, aims particularly at researchers wishing to put advanced (meta)heuristic and hybrid optimization algorithms to the test. The goal of this paper is to provide researchers with a manual and reference to the newly available GTOPX benchmark software.

preprint2021arXiv

MATE: A Model-based Algorithm Tuning Engine

In this paper, we introduce a Model-based Algorithm Turning Engine, namely MATE, where the parameters of an algorithm are represented as expressions of the features of a target optimisation problem. In contrast to most static (feature-independent) algorithm tuning engines such as irace and SPOT, our approach aims to derive the best parameter configuration of a given algorithm for a specific problem, exploiting the relationships between the algorithm parameters and the features of the problem. We formulate the problem of finding the relationships between the parameters and the problem features as a symbolic regression problem and we use genetic programming to extract these expressions. For the evaluation, we apply our approach to configuration of the (1+1) EA and RLS algorithms for the OneMax, LeadingOnes, BinValue and Jump optimisation problems, where the theoretically optimal algorithm parameters to the problems are available as functions of the features of the problems. Our study shows that the found relationships typically comply with known theoretical results, thus demonstrating a new opportunity to consider model-based parameter tuning as an effective alternative to the static algorithm tuning engines.

preprint2020arXiv

A Non-Dominated Sorting Based Customized Random-Key Genetic Algorithm for the Bi-Objective Traveling Thief Problem

In this paper, we propose a method to solve a bi-objective variant of the well-studied Traveling Thief Problem (TTP). The TTP is a multi-component problem that combines two classic combinatorial problems: Traveling Salesman Problem (TSP) and Knapsack Problem (KP). We address the BI-TTP, a bi-objective version of the TTP, where the goal is to minimize the overall traveling time and to maximize the profit of the collected items. Our proposed method is based on a biased-random key genetic algorithm with customizations addressing problem-specific characteristics. We incorporate domain knowledge through a combination of near-optimal solutions of each subproblem in the initial population and use a custom repair operator to avoid the evaluation of infeasible solutions. The bi-objective aspect of the problem is addressed through an elite population extracted based on the non-dominated rank and crowding distance. Furthermore, we provide a comprehensive study showing the influence of each parameter on the performance. Finally, we discuss the results of the BI-TTP competitions at EMO-2019 and GECCO-2019 conferences where our method has won first and second places, respectively, thus proving its ability to find high-quality solutions consistently.

preprint2020arXiv

An Annotated Dataset of Stack Overflow Post Edits

To improve software engineering, software repositories have been mined for code snippets and bug fixes. Typically, this mining takes place at the level of files or commits. To be able to dig deeper and to extract insights at a higher resolution, we hereby present an annotated dataset that contains over 7 million edits of code and text on Stack Overflow. Our preliminary study indicates that these edits might be a treasure trove for mining information about fine-grained patches, e.g., for the optimisation of non-functional properties.

preprint2020arXiv

An Evolutionary Deep Learning Method for Short-term Wind Speed Prediction: A Case Study of the Lillgrund Offshore Wind Farm

Accurate short-term wind speed forecasting is essential for large-scale integration of wind power generation. However, the seasonal and stochastic characteristics of wind speed make forecasting a challenging task. This study uses a new hybrid evolutionary approach that uses a popular evolutionary search algorithm, CMA-ES, to tune the hyper-parameters of two Long short-term memory(LSTM) ANN models for wind prediction. The proposed hybrid approach is trained on data gathered from an offshore wind turbine installed in a Swedish wind farm located in the Baltic Sea. Two forecasting horizons including ten-minutes ahead (absolute short term) and one-hour ahead (short term) are considered in our experiments. Our experimental results indicate that the new approach is superior to five other applied machine learning models, i.e., polynomial neural network (PNN), feed-forward neural network (FNN), nonlinear autoregressive neural network (NAR) and adaptive neuro-fuzzy inference system (ANFIS), as measured by five performance criteria.

preprint2020arXiv

Ants can orienteer a thief in their robbery

The Thief Orienteering Problem (ThOP) is a multi-component problem that combines features of two classic combinatorial optimization problems: Orienteering Problem and Knapsack Problem. The ThOP is challenging due to the given time constraint and the interaction between its components. We propose an Ant Colony Optimization algorithm together with a new packing heuristic to deal individually and interactively with problem components. Our approach outperforms existing work on more than 90% of the benchmarking instances, with an average improvement of over 300%.

preprint2020arXiv

Design optimisation of a multi-mode wave energy converter

A wave energy converter (WEC) similar to the CETO system developed by Carnegie Clean Energy is considered for design optimisation. This WEC is able to absorb power from heave, surge and pitch motion modes, making the optimisation problem nontrivial. The WEC dynamics is simulated using the spectral-domain model taking into account hydrodynamic forces, viscous drag, and power take-off forces. The design parameters for optimisation include the buoy radius, buoy height, tether inclination angles, and control variables (damping and stiffness). The WEC design is optimised for the wave climate at Albany test site in Western Australia considering unidirectional irregular waves. Two objective functions are considered: (i) maximisation of the annual average power output, and (ii) minimisation of the levelised cost of energy (LCoE) for a given sea site. The LCoE calculation is approximated as a ratio of the produced energy to the significant mass of the system that includes the mass of the buoy and anchor system. Six different heuristic optimisation methods are applied in order to evaluate and compare the performance of the best known evolutionary algorithms, a swarm intelligence technique and a numerical optimisation approach. The results demonstrate that if we are interested in maximising energy production without taking into account the cost of manufacturing such a system, the buoy should be built as large as possible (20 m radius and 30 m height). However, if we want the system that produces cheap energy, then the radius of the buoy should be approximately 11-14~m while the height should be as low as possible. These results coincide with the overall design that Carnegie Clean Energy has selected for its CETO 6 multi-moored unit. However, it should be noted that this study is not informed by them, so this can be seen as an independent validation of the design choices.

preprint2020arXiv

Fitness Landscape Analysis of Dimensionally-Aware Genetic Programming Featuring Feynman Equations

Genetic programming is an often-used technique for symbolic regression: finding symbolic expressions that match data from an unknown function. To make the symbolic regression more efficient, one can also use dimensionally-aware genetic programming that constrains the physical units of the equation. Nevertheless, there is no formal analysis of how much dimensionality awareness helps in the regression process. In this paper, we conduct a fitness landscape analysis of dimensionallyaware genetic programming search spaces on a subset of equations from Richard Feynmans well-known lectures. We define an initialisation procedure and an accompanying set of neighbourhood operators for conducting the local search within the physical unit constraints. Our experiments show that the added information about the variable dimensionality can efficiently guide the search algorithm. Still, further analysis of the differences between the dimensionally-aware and standard genetic programming landscapes is needed to help in the design of efficient evolutionary operators to be used in a dimensionally-aware regression.

preprint2020arXiv

Genetic Improvement @ ICSE 2020

Following Prof. Mark Harman of Facebook's keynote and formal presentations (which are recorded in the proceedings) there was a wide ranging discussion at the eighth international Genetic Improvement workshop, GI-2020 @ ICSE (held as part of the 42nd ACM/IEEE International Conference on Software Engineering on Friday 3rd July 2020). Topics included industry take up, human factors, explainabiloity (explainability, justifyability, exploitability) and GI benchmarks. We also contrast various recent online approaches (e.g. SBST 2020) to holding virtual computer science conferences and workshops via the WWW on the Internet without face-2-face interaction. Finally we speculate on how the Coronavirus Covid-19 Pandemic will affect research next year and into the future.

preprint2020arXiv

Human-Like Summaries from Heterogeneous and Time-Windowed Software Development Artefacts

Automatic text summarisation has drawn considerable interest in the area of software engineering. It is challenging to summarise the activities related to a software project, (1) because of the volume and heterogeneity of involved software artefacts, and (2) because it is unclear what information a developer seeks in such a multi-document summary. We present the first framework for summarising multi-document software artefacts containing heterogeneous data within a given time frame. To produce human-like summaries, we employ a range of iterative heuristics to minimise the cosine-similarity between texts and high-dimensional feature vectors. A first study shows that users find the automatically generated summaries the most useful when they are generated using word similarity and based on the eight most relevant software artefacts.

preprint2020arXiv

Hybrid Neuro-Evolutionary Method for Predicting Wind Turbine Power Output

Reliable wind turbine power prediction is imperative to the planning, scheduling and control of wind energy farms for stable power production. In recent years Machine Learning (ML) methods have been successfully applied in a wide range of domains, including renewable energy. However, due to the challenging nature of power prediction in wind farms, current models are far short of the accuracy required by industry. In this paper, we deploy a composite ML approach--namely a hybrid neuro-evolutionary algorithm--for accurate forecasting of the power output in wind-turbine farms. We use historical data in the supervisory control and data acquisition (SCADA) systems as input to estimate the power output from an onshore wind farm in Sweden. At the beginning stage, the k-means clustering method and an Autoencoder are employed, respectively, to detect and filter noise in the SCADA measurements. Next, with the prior knowledge that the underlying wind patterns are highly non-linear and diverse, we combine a self-adaptive differential evolution (SaDE) algorithm as a hyper-parameter optimizer, and a recurrent neural network (RNN) called Long Short-term memory (LSTM) to model the power curve of a wind turbine in a farm. Two short time forecasting horizons, including ten-minutes ahead and one-hour ahead, are considered in our experiments. We show that our approach outperforms its counterparts.

preprint2020arXiv

Optimisation of Large Wave Farms using a Multi-strategy Evolutionary Framework

Wave energy is a fast-developing and promising renewable energy resource. The primary goal of this research is to maximise the total harnessed power of a large wave farm consisting of fully-submerged three-tether wave energy converters (WECs). Energy maximisation for large farms is a challenging search problem due to the costly calculations of the hydrodynamic interactions between WECs in a large wave farm and the high dimensionality of the search space. To address this problem, we propose a new hybrid multi-strategy evolutionary framework combining smart initialisation, binary population-based evolutionary algorithm, discrete local search and continuous global optimisation. For assessing the performance of the proposed hybrid method, we compare it with a wide variety of state-of-the-art optimisation approaches, including six continuous evolutionary algorithms, four discrete search techniques and three hybrid optimisation methods. The results show that the proposed method performs considerably better in terms of convergence speed and farm output.

preprint2020arXiv

Optimising the Fit of Stack Overflow Code Snippets into Existing Code

Software developers often reuse code from online sources such as Stack Overflow within their projects. However, the process of searching for code snippets and integrating them within existing source code can be tedious. In order to improve efficiency and reduce time spent on code reuse, we present an automated code reuse tool for the Eclipse IDE (Integrated Developer Environment), NLP2TestableCode. NLP2TestableCode can not only search for Java code snippets using natural language tasks, but also evaluate code snippets based on a user's existing code, modify snippets to improve fit and correct errors, before presenting the user with the best snippet, all without leaving the editor. NLP2TestableCode also includes functionality to automatically generate customisable test cases and suggest argument and return types, in order to further evaluate code snippets. In evaluation, NLP2TestableCode was capable of finding compilable code snippets for 82.9% of tasks, and testable code snippets for 42.9%.

preprint2020arXiv

The Dynamic Travelling Thief Problem: Benchmarks and Performance of Evolutionary Algorithms

Many real-world optimisation problems involve dynamic and stochastic components. While problems with multiple interacting components are omnipresent in inherently dynamic domains like supply-chain optimisation and logistics, most research on dynamic problems focuses on single-component problems. With this article, we define a number of scenarios based on the Travelling Thief Problem to enable research on the effect of dynamic changes to sub-components. Our investigations of 72 scenarios and seven algorithms show that -- depending on the instance, the magnitude of the change, and the algorithms in the portfolio -- it is preferable to either restart the optimisation from scratch or to continue with the previously valid solutions.

preprint2020arXiv

Towards a Structural Framework for Explicit Domain Knowledge in Visual Analytics

Clinicians and other analysts working with healthcare data are in need for better support to cope with large and complex data. While an increasing number of visual analytics environments integrates explicit domain knowledge as a means to deliver a precise representation of the available data, theoretical work so far has focused on the role of knowledge in the visual analytics process. There has been little discussion about how such explicit domain knowledge can be structured in a generalized framework. This paper collects desiderata for such a structural framework, proposes how to address these desiderata based on the model of linked data, and demonstrates the applicability in a visual analytics environment for physiotherapy.

preprint2020arXiv

Towards Rigorous Validation of Energy Optimisation Experiments

The optimisation of software energy consumption is of growing importance across all scales of modern computing, i.e., from embedded systems to data-centres. Practitioners in the field of Search-Based Software Engineering and Genetic Improvement of Software acknowledge that optimising software energy consumption is difficult due to noisy and expensive fitness evaluations. However, it is apparent from results to date that more progress needs to be made in rigorously validating optimisation results. This problem is pressing because modern computing platforms have highly complex and variable behaviour with respect to energy consumption. To compare solutions fairly we propose in this paper a new validation approach called R3-validation which exercises software variants in a rotated-round-robin order. Using a case study, we present an in-depth analysis of the impacts of changing system states on software energy usage, and we show how R3-validation mitigates these. We compare it with current validation approaches across multiple devices and operating systems, and we show that it aligns better with actual platform behaviour.

preprint2019arXiv

A Hybrid Cooperative Co-evolution Algorithm Framework for Optimising Power Take Off and Placements of Wave Energy Converters

Wave energy technologies have the potential to play a significant role in the supply of renewable energy on a world scale. One of the most promising designs for wave energy converters (WECs) are fully submerged buoys. In this work, we explore the optimisation of WEC arrays consisting of a three-tether buoy model called CETO. Such arrays can be optimised for total energy output by adjusting both the relative positions of buoys in farms and also the power-take-off (PTO) parameters for each buoy. The search space for these parameters is complex and multi-modal. Moreover, the evaluation of each parameter setting is computationally expensive -- limiting the number of full model evaluations that can be made. To handle this problem, we propose a new hybrid cooperative co-evolution algorithm (HCCA). HCCA consists of a symmetric local search plus Nelder-Mead and a cooperative co-evolution algorithm (CC) with a backtracking strategy for optimising the positions and PTO settings of WECs, respectively. Moreover, a new adaptive scenario is proposed for tuning grey wolf optimiser (AGWO) hyper-parameter. AGWO participates notably with other applied optimisers in HCCA. For assessing the effectiveness of the proposed approach five popular Evolutionary Algorithms (EAs), four alternating optimisation methods and two modern hybrid ideas (LS-NM and SLS-NM-B) are carefully compared in four real wave situations (Adelaide, Tasmania, Sydney and Perth) with two wave farm sizes (4 and 16). According to the experimental outcomes, the hybrid cooperative framework exhibits better performance in terms of both runtime and quality of obtained solutions.

preprint2019arXiv

Better Software Analytics via "DUO": Data Mining Algorithms Using/Used-by Optimizers

This paper claims that a new field of empirical software engineering research and practice is emerging: data mining using/used-by optimizers for empirical studies or DUO. For example, data miners can generate models that are explored by optimizers. Also, optimizers can advise how to best adjust the control parameters of a data miner. This combined approach acts like an agent leaning over the shoulder of an analyst that advises "ask this question next" or "ignore that problem, it is not relevant to your goals". Further, those agents can help us build "better" predictive models, where "better" can be either greater predictive accuracy or faster modeling time (which, in turn, enables the exploration of a wider range of options). We also caution that the era of papers that just use data miners is coming to an end. Results obtained from an unoptimized data miner can be quickly refuted, just by applying an optimizer to produce a different (and better performing) model. Our conclusion, hence, is that for software analytics it is possible, useful and necessary to combine data mining and optimization using DUO.

preprint2018arXiv

Data-Driven Search-based Software Engineering

This paper introduces Data-Driven Search-based Software Engineering (DSE), which combines insights from Mining Software Repositories (MSR) and Search-based Software Engineering (SBSE). While MSR formulates software engineering problems as data mining problems, SBSE reformulates SE problems as optimization problems and use meta-heuristic algorithms to solve them. Both MSR and SBSE share the common goal of providing insights to improve software engineering. The algorithms used in these two areas also have intrinsic relationships. We, therefore, argue that combining these two fields is useful for situations (a) which require learning from a large data source or (b) when optimizers need to know the lay of the land to find better solutions, faster. This paper aims to answer the following three questions: (1) What are the various topics addressed by DSE? (2) What types of data are used by the researchers in this area? (3) What research approaches do researchers use? The paper briefly sets out to act as a practical guide to develop new DSE techniques and also to serve as a teaching resource. This paper also presents a resource (tiny.cc/data-se) for exploring DSE. The resource contains 89 artifacts which are related to DSE, divided into 13 groups such as requirements engineering, software product lines, software processes. All the materials in this repository have been used in recent software engineering papers; i.e., for all this material, there exist baseline results against which researchers can comparatively assess their new ideas.