Mika Mäntylä

Software Engineering

Catalog footprint: 9 works, 1 topic, 4 close collaborators

Preprint, 2022, arXiv

How to Configure Masked Event Anomaly Detection on Software Logs?

Software log anomaly event detection with masked event prediction has various technical approaches with countless configurations and parameters. Our objective is to provide a baseline of settings for similar studies in the future. The models we use are the N-Gram model, which is a classic approach in the field of natural language processing (NLP), and two deep learning (DL) models: long short-term memory (LSTM) and convolutional neural network (CNN). We used four datasets: Profilence, BlueGene/L (BGL), Hadoop Distributed File System (HDFS), and Hadoop. Other settings are the size of the sliding window, which determines how many surrounding events we use to predict a given event; the mask position (the position within the window we are predicting); the usage of only unique sequences; and the portion of data that is used for training. The results show clear indications of settings that can be generalized across datasets. The performance of the DL models does not deteriorate as the window size increases, while the N-Gram model shows worse performance with large window sizes on the BGL and Profilence datasets. Despite the popularity of Next Event Prediction, the results show that in this context it is better not to predict events at the edges of the subsequence, i.e., the first or last event, with the best result coming from predicting the fourth event when the window size is five. Regarding the amount of data used for training, the results show differences across datasets and models. For example, the N-Gram model appears to be more sensitive to a lack of data than the DL models. Overall, for similar experimental setups we suggest the following general baseline: window size 10, mask position second to last, do not filter out non-unique sequences, and use half of the total data for training.
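As a rough illustration of masked event prediction with a count-based N-Gram style model, the sketch below hides one position in a sliding window and predicts it from the surrounding events. It is not the authors' implementation: the function names, the toy log, and the rule "flag the event if it differs from the most frequent filler" are assumptions made for this example. The window size of five with the fourth event masked follows the configuration mentioned in the abstract.

```python
from collections import Counter, defaultdict


def train_masked_ngram(sequences, window=5, mask_pos=3):
    """Count which event fills the masked slot for each surrounding context."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - window + 1):
            win = tuple(seq[i:i + window])
            # Context = all events in the window except the masked one.
            context = win[:mask_pos] + win[mask_pos + 1:]
            table[context][win[mask_pos]] += 1
    return table


def predict_masked(table, window_events, mask_pos=3):
    """Return the most frequent filler for this context, or None if unseen."""
    win = tuple(window_events)
    context = win[:mask_pos] + win[mask_pos + 1:]
    counts = table.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical "normal" log used for training (invented for this sketch).
normal_logs = [
    ["open", "read", "read", "write", "close",
     "open", "read", "read", "write", "close"],
]
model = train_masked_ngram(normal_logs, window=5, mask_pos=3)

# A window whose fourth event (index 3) deviates from the training data.
observed = ["open", "read", "read", "delete", "close"]
predicted = predict_masked(model, observed, mask_pos=3)
is_anomalous = predicted is not None and predicted != observed[3]
```

Because the context includes events on both sides of the mask, the same table cannot be reused for a different mask position; a new table has to be trained for each configuration being compared.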

Preprint, 2022, arXiv

Pinpointing Anomaly Events in Logs from Stability Testing -- N-Grams vs. Deep-Learning

As stability testing execution logs can be very long, software engineers need help in locating anomalous events. We develop and evaluate two models for scoring individual log-events for anomalousness, namely an N-Gram model and a Deep Learning model with LSTM (Long short-term memory). Both are trained on normal log sequences only. We evaluate the models with long log sequences of Android stability testing in our company case and with short log sequences from HDFS (Hadoop Distributed File System) public dataset. We evaluate next event prediction accuracy and computational efficiency. The LSTM model is more accurate in stability testing logs (0.848 vs 0.865), whereas in HDFS logs the N-Gram is slightly more accurate (0.904 vs 0.900). The N-Gram model has far superior computational efficiency compared to the Deep model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness seems like a good aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning in software system anomaly detection, we found limited benefits in doing so. However, future work should consider whether our finding holds with different LSTM-model hyper-parameters, other datasets, and with other deep-learning approaches that promise better accuracy and computational efficiency than LSTM based models.

Preprint, 2022, arXiv

Test Automation Maturity Improves Product Quality -- Quantitative Study of Open Source Projects Using Continuous Integration

The popularity of continuous integration (CI) is increasing as a result of market pressure to release product features or updates frequently. The ability of CI to deliver quality at speed depends on reliable test automation. In this paper, we present an empirical study to observe the effect of test automation maturity (assessed by standard best practices in the literature) on product quality, test automation effort, and release cycle in the CI context of open source projects. We ran our test automation maturity survey and received responses from 37 open source Java projects. We also mined the software repositories of the same projects. The main results of the regression analysis reveal that higher levels of test automation maturity are positively associated with higher product quality (p-value=0.000624) and a shorter release cycle (p-value=0.01891); there is no statistically significant evidence of increased test automation effort due to higher levels of test automation maturity and product quality. Thus, we conclude that a potential benefit of improving test automation maturity (using standard best practices) is product quality improvement and release cycle acceleration in the CI context of open source projects. We encourage future research to extend our findings by adding more datasets with different programming languages and CI tools, closed source projects, and large-scale industrial projects. Our recommendation to practitioners (in a similar CI context) is to utilize standard best practices to improve test automation maturity.

Preprint, 2020, arXiv

20-MAD -- 20 Years of Issues and Commits of Mozilla and Apache Development

Data from long-lived and high-profile projects is valuable for research on successful software engineering in the wild. Having a dataset with different linked software repositories of such projects enables deeper investigations. This paper presents 20-MAD, a dataset linking the commit and issue data of Mozilla and Apache projects. It includes over 20 years of information about 765 projects, 3.4M commits, 2.3M issues, and 17.3M issue comments, and its compressed size is over 6 GB. The data contains all the typical information about source code commits (e.g., lines added and removed, message and commit time) and issues (status, severity, votes, and summary). The issue comments have been pre-processed for natural language processing and sentiment analysis. This includes emoticons and valence and arousal scores. Linking code repository and issue tracker information allows studying individuals in two types of repositories and provides more accurate time zone information for issue trackers as well. To our knowledge, this is the largest linked dataset in size and in project lifetime that is not based on GitHub.

Preprint, 2020, arXiv

Chat activity is a better predictor than chat sentiment on software developers productivity

Recent works have proposed that software developers' positive emotion has a positive impact on their productivity. In this paper we investigate two data sources: developers' chat messages (from Slack and Hipchat) and source code commits of a single co-located Agile team over 200 working days. Our regression analysis shows that the number of chat messages is the best predictor and predicts productivity measured both in the number of commits and lines of code with $R^2$ of 0.33 and 0.27 respectively. We then add sentiment analysis variables until the AIC of our model no longer improves and obtain $R^2$ values of 0.37 (commits) and 0.30 (lines of code). Thus, analyzing chat sentiment improves productivity prediction over chat activity alone, but the difference is not massive. This work supports the idea that emotional state and productivity are linked in software development. We find that three positive sentiment metrics, but surprisingly also one negative sentiment metric, are associated with higher productivity.

Preprint, 2020, arXiv

Prevalence, Contents and Automatic Detection of KL-SATD

When developers use different keywords such as TODO and FIXME in source code comments to describe self-admitted technical debt (SATD), we refer to it as Keyword-Labeled SATD (KL-SATD). We study KL-SATD from 33 software repositories with 13,588 KL-SATD comments. We find that the median percentage of KL-SATD comments among all comments is only 1.52%. We find that KL-SATD comment contents include words expressing code changes and uncertainty, such as remove, fix, maybe and probably. This makes them different compared to other comments. KL-SATD comment contents are similar to manually labeled SATD comments of prior work. Our machine learning classifier using logistic Lasso regression has good performance in detecting KL-SATD comments (AUC-ROC 0.88). Finally, we demonstrate that using machine learning we can identify comments that are currently missing a SATD keyword but should have one. Automating SATD identification of comments that lack SATD keywords can save time and effort by replacing manual identification of comments. Using KL-SATD offers the potential to bootstrap a complete SATD detector.
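To make the keyword-labeling idea concrete, the sketch below flags comments that contain a SATD keyword and reports their share among all comments. It is only an illustration: the keyword set is limited to the two keywords named in the abstract (TODO and FIXME), and the helper names and example comments are invented for this example rather than taken from the study.

```python
import re

# Only the two keywords named in the abstract; the study's full list may differ.
KEYWORDS = re.compile(r"\b(TODO|FIXME)\b")


def is_kl_satd(comment):
    """True if the comment carries one of the SATD keywords."""
    return KEYWORDS.search(comment) is not None


def kl_satd_percentage(comments):
    """Percentage of comments that are keyword-labeled SATD."""
    if not comments:
        return 0.0
    flagged = sum(1 for c in comments if is_kl_satd(c))
    return 100.0 * flagged / len(comments)


# Hypothetical comments, invented for this sketch.
comments = [
    "# TODO: remove this workaround once the parser is fixed",
    "# compute the checksum",
    "// FIXME maybe wrong for empty input",
    "/* returns the user id */",
]
share = kl_satd_percentage(comments)
```

Labels produced this way could serve as training targets for a classifier such as the logistic Lasso regression the abstract mentions, with the keywords themselves removed from the input text so the model has to learn from the remaining words.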

Preprint, 2020, arXiv

Time Pressure in Software Engineering: A Systematic Review

Large project overruns and overtime work have been reported in the software industry, resulting in additional expense for companies and personal issues for developers. The present work aims to provide an overview of studies related to time pressure in software engineering; specifically, existing definitions, possible causes, and metrics relevant to time pressure were collected, and a mapping of the studies to software processes and approaches was performed. Moreover, we synthesize results of existing quantitative studies on the effects of time pressure on software development, and offer practical takeaways for practitioners and researchers, based on empirical evidence. Our search strategy examined 5,414 sources, found through repository searches and snowballing. Applying inclusion and exclusion criteria resulted in the selection of 102 papers, which made relevant contributions related to time pressure in software engineering. The majority of high-quality studies report increased productivity and decreased quality under time pressure. Frequent categories of studies focus on quality assurance, cost estimation, and process simulation. It appears that time pressure is usually caused by errors in cost estimation. The effect of time pressure is most often identified during software quality assurance. The majority of empirical studies report increased productivity under time pressure, while most cost estimation and process simulation models assume that compressing the schedule increases the total needed hours. We also find evidence of the mediating effect of knowledge on the effects of time pressure, and that tight deadlines impact tasks with an algorithmic nature more severely. Future research should better contextualize quantitative studies to account for the existing conflicting results and to provide an understanding of situations when time pressure is either beneficial or harmful.

Preprint, 2016, arXiv

Benchmarking Web-testing - Selenium versus Watir and the Choice of Programming Language and Browser

Context: Selenium is claimed to be the most popular software test automation tool. Past academic works have mainly neglected testing tools in favor of more methodological topics. Objective: We investigated the performance of web-testing tools, to provide empirical evidence supporting choices in software test tool selection and configuration. Method: We used a 4*5 factorial design to study 20 different configurations for testing a web-store. We studied 5 programming language bindings (C#, Java, Python, and Ruby for Selenium, while Watir supports Ruby only) and 4 browsers (Google Chrome, Internet Explorer, Mozilla Firefox and Opera). Performance was measured with execution time, memory usage, length of the test scripts and stability of the tests. Results: Considering all measures, the best configuration was Selenium with the Python language binding for Chrome. Selenium with Python bindings was the best option for all browsers. The effect size of the difference between the slowest and fastest configuration was very high (Cohen's d=41.5, 91% increase in execution time). Overall, Internet Explorer was the fastest browser while having the worst stability results. Conclusions: We recommend benchmarking tools before adopting them. Weighting of factors, e.g. how much test stability one is willing to sacrifice for faster performance, affects the decision.
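For readers unfamiliar with the effect size reported above, the sketch below computes Cohen's d using the common pooled-standard-deviation form. Whether the paper used exactly this variant is an assumption, and the timing values are invented for illustration; they are not the study's measurements.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical execution times in seconds for two configurations.
slow = [10.0, 11.0, 12.0]
fast = [7.0, 8.0, 9.0]

d = cohens_d(slow, fast)
increase_pct = 100.0 * (mean(slow) - mean(fast)) / mean(fast)
```

A very large d, such as the 41.5 in the abstract, arises when run-to-run variation within each configuration is tiny compared to the gap between their means.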

Preprint, 2016, arXiv

Mining Valence, Arousal, and Dominance - Possibilities for Detecting Burnout and Productivity?

Similar to other industries, the software engineering domain is plagued by psychological diseases such as burnout, which lead developers to lose interest, exhibit lower activity and/or feel powerless. Prevention is essential for such diseases, which in turn requires early identification of symptoms. The emotional dimensions of Valence, Arousal and Dominance (VAD) make it possible to derive a person's interest (attraction), level of activation and perceived level of control for a particular situation from textual communication, such as emails. As an initial step towards identifying symptoms of productivity loss in software engineering, this paper explores the VAD metrics and their properties on 700,000 Jira issue reports containing over 2,000,000 comments, since issue reports keep track of a developer's progress on addressing bugs or new features. Using a general-purpose lexicon of 14,000 English words with known VAD scores, our results show that issue reports of different types (e.g., Feature Request vs. Bug) have a fair variation of Valence, while an increase in issue priority (e.g., from Minor to Critical) typically increases Arousal. Furthermore, we show that as an issue's resolution time increases, so does the arousal of the individual the issue is assigned to. Finally, the resolution of an issue increases valence, especially for the issue Reporter and for quickly addressed issues. The existence of such relations between VAD and issue report activities shows promise that text mining could in the future offer an alternative to work health assessment surveys.
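Lexicon-based VAD scoring of a comment can be sketched as averaging the scores of the words found in the lexicon. The three-word lexicon below and its numeric values are invented placeholders, not entries from the 14,000-word lexicon used in the paper, and averaging matched words is an assumed aggregation rule chosen for simplicity.

```python
import re

# Toy lexicon: word -> (valence, arousal, dominance). Values are hypothetical.
TOY_LEXICON = {
    "great": (8.0, 6.0, 7.0),
    "broken": (2.0, 6.0, 3.0),
    "fixed": (7.0, 4.0, 7.0),
}


def vad_score(text, lexicon=TOY_LEXICON):
    """Mean (valence, arousal, dominance) over lexicon words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return None  # no lexicon coverage for this text
    n = len(hits)
    # zip(*hits) groups the valence, arousal and dominance values per dimension.
    return tuple(sum(dim) / n for dim in zip(*hits))


score = vad_score("Build is broken again, but the parser is fixed.")
```

Returning None for texts without any lexicon words keeps uncovered comments distinguishable from comments that score as neutral.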
