Source author record

Kenichi Matsumoto

Kenichi Matsumoto appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Software Engineering Cryptography and Security Artificial Intelligence Digital Libraries Machine Learning

Catalog footprint

What is connected

20works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2024arXiv

"My GitHub Sponsors profile is live!" Investigating the Impact of Twitter/X Mentions on GitHub Sponsors

GitHub Sponsors was launched in 2019, enabling donations to open-source software developers to provide financial support, as per GitHub's slogan: "Invest in the projects you depend on". However, a 2022 study on GitHub Sponsors found that only two-fifths of developers who were seeking sponsorship received a donation. The study found that, other than internal actions (such as offering perks to sponsors), developers had advertised their GitHub Sponsors profiles on social media, such as Twitter (also known as X). Therefore, in this work, we investigate the impact of tweets that contain links to GitHub Sponsors profiles on sponsorship, as well as their reception on Twitter/X. We further characterize these tweets to understand their context and find that (1) such tweets have the impact of increasing the number of sponsors acquired, (2) compared to other donation platforms such as Open Collective and Patreon, GitHub Sponsors has significantly fewer interactions but is more visible on Twitter/X, and (3) developers tend to contribute more to open-source software during the week of posting such tweets. Our findings are the first step toward investigating the impact of social media on obtaining funding to sustain open-source software.

preprint2022arXiv

An Empirical Evaluation of Competitive Programming AI: A Case Study of AlphaCode

AlphaCode is a code generation system for assisting software developers in solving competitive programming problems using natural language problem descriptions. Despite the advantages of the code generating system, the open source community expressed concerns about practicality and data licensing. However, there is no research investigating generated codes in terms of code clone and performance. In this paper, we conduct an empirical study to find code similarities and performance differences between AlphaCode-generated codes and human codes. The results show that (i) the generated codes from AlphaCode are similar to human codes (i.e., the average maximum similarity score is 0.56) and (ii) the generated code performs on par with or worse than the human code in terms of execution time and memory usage. Moreover, AlphaCode tends to generate more similar codes to humans for low-difficulty problems (i.e., four cases have the exact same codes). It also employs excessive nested loops and unnecessary variable declarations for high-difficulty problems, which cause low performance regarding our manual investigation. The replication package is available at https:/doi.org/10.5281/zenodo.6820681

preprint2022arXiv

An Exploration of npm Package Co-Usage Examples from Stack Overflow: A Case Study

Third-party package usage has become a common practice in contemporary software development. Developers often face different challenges, including choosing the right libraries, installing errors, discrepancies, setting up the environment, and building failures during software development. The risks of maintaining a third-party package are well known, but it is unclear how information from Stack Overflow (SO) can be useful. This paper performed an empirical study to explore npm co-usage in SO. From over 30,000 SO posts, we extracted 2,100 SO posts related to npm and matched them to 217,934 npm library packages. We find that, popular and highly used libraries are not discussed as often in SO. However, we can see that the accepted answers may prove useful, as we believe that the usage examples and executable commands could be reused for tool support.

preprint2022arXiv

Bug-Fix Variants: Visualizing Unique Source Code Changes across GitHub Forks

Forking is a common practice for developers when building upon on already existing projects. These forks create variants, which have a common code base but then evolve the code in different directions, which is specific to that forked project requirements. An interesting side-effect of having multiple forks is the ability to select between different evolution directions of the code which is based on developers fixing bugs in the code base. However, the key issue that this decentralized form of information is difficult to analyze. In this study, we propose a visualization to analyze active changes in fork repositories that have not been merged back to the original project. Our visualization shows code commit activities in multiple forks with highlight on bug fix commits in the history of forks. While the commit activity of each repository is visualized similarly to the code frequency view of GitHub, our view shows only commits unique to fork repositories. To illustrate the effectiveness of our visualization, we have applied our view to two use cases: identifying forks from a repository no longer maintained, and identifying a bug fix among forks. In the first case, we identify a fork of a suspended project named Obfuscator-LLVM. Our view shows the original repository and its most active fork that continue the development on the top. In the second case, we identify a bug fix in a fork of Clipy project. Our view shows that the most active fork has its own bug fixes; we could easily identify a patch for the bug highlighted in the view. As a new ideas paper, we then present our outline of three research questions to spark real world use-cases and goals for our visualization has the potential to uncover. A prototype of our visualization is available at \textcolor{blue}{\url{https://naist-se.github.io/vissoft2022/}

preprint2022arXiv

GitHub Sponsors: Exploring a New Way to Contribute to Open Source

GitHub Sponsors, launched in 2019, enables donations to individual open source software (OSS) developers. Financial support for OSS maintainers and developers is a major issue in terms of sustaining OSS projects, and the ability to donate to individuals is expected to support the sustainability of developers, projects, and community. In this work, we conducted a mixed-methods study of GitHub Sponsors, including quantitative and qualitative analyses, to understand the characteristics of developers who are likely to receive donations and what developers think about donations to individuals. We found that: (1) sponsored developers are more active than non-sponsored developers, (2) the possibility to receive donations is related to whether there is someone in their community who is donating, and (3) developers are sponsoring as a new way to contribute to OSS. Our findings are the first step towards data-informed guidance for using GitHub Sponsors, opening up avenues for future work on this new way of financially sustaining the OSS community.

preprint2022arXiv

Giving Back: Contributions Congruent to Library Dependency Changes in a Software Ecosystem

Popular adoption of third-party libraries for contemporary software development has led to the creation of large inter-dependency networks, where sustainability issues of a single library can have widespread network effects. Maintainers of these libraries are often overworked, relying on the contributions of volunteers to sustain these libraries. In this work, we measure contributions that are aligned with dependency changes, to understand where they come from (i.e., non-maintainer, client maintainer, library maintainer, and library and client maintainer), analyze whether they contribute to library dormancy (i.e., a lack of activity), and investigate the similarities between these contributions and developers' typical contributions. Hence, we leverage socio-technical techniques to measure the dependency-contribution congruence (DC congruence), i.e., the degree to which contributions align with dependencies. We conduct a large-scale empirical study to measure the DC congruence for the NPM ecosystem using 1.7 million issues, 970 thousand pull requests (PR), and over 5.3 million commits belonging to 107,242 NPM packages. At the ecosystem level, we pinpoint in time peaks of congruence with dependency changes (i.e., 16% DC congruence score). Surprisingly, these contributions came from the ecosystem itself (i.e., non-maintainers of either client and library). At the project level, we find that DC congruence shares a statistically significant relationship with the likelihood of a package becoming dormant. Finally, by comparing source code of contributions, we find that congruent contributions are statistically different to typical contributions. Our work has implications to encourage and sustain contributions, especially to support library maintainers that require dependency changes.

preprint2022arXiv

Intertwining Ecosystems: A Large Scale Empirical Study of Libraries that Cross Software Ecosystems

An increase in diverse technology stacks and third-party library usage has led developers to inevitably switch technologies. To assist these developers, maintainers have started to release their libraries to multiple technologies, i.e., a cross-ecosystem library. Our goal is to explore the extent to which these cross-ecosystem libraries are intertwined between ecosystems. We perform a large-scale empirical study of 1.1 million libraries from five different software ecosystems, i.e., PyPI for Python, CRAN for R, Maven for Java, RubyGems for Ruby, and NPM for JavaScript to identify 4,146 GitHub projects that release libraries to these five ecosystems. Analyzing their contributions, we first find that a significant majority (median of 37.5%) of contributors of these cross-ecosystem libraries come from a single ecosystem, while also receiving a significant portion of contributions (median of 24.06%) from outside their target ecosystems. We also find that a cross-ecosystem library is written using multiple programming languages. Specifically, three (i.e., PyPI, CRAN, RubyGems) out of the five ecosystems has the majority of source code is written using languages not specific to that ecosystem. As ecosystems become intertwined, this opens up new avenues for research, such as whether or not cross-ecosystem libraries will solve the search for replacement libraries, or how these libraries fit within each ecosystem just to name a few.

preprint2022arXiv

Newcomer OSS-Candidates: Characterizing Contributions of Novice Developers to GitHub

The ability of an Open Source Software (OSS) project to attract, onboard, and retain any newcomer is vital to its livelihood. Although, evidence suggests an upsurge in novice developers joining social coding platforms (such as GitHub), the extent to which their activities result in a OSS contribution is unknown. Henceforth, we execute the protocols of a registered report to study activities of a "Newcomer OSS-Candidate", who is a novice developer that is new to that social coding platform, and has the intention to later onboard an OSS project. Using GitHub as a case platform, we analyze 171 identified Newcomer OSS-Candidates to characterize their contribution activities. Results show that Newcomer OSS-Candidates are likely to target software based repositories (i.e., 66%), and their first contributions are mainly associated with development (commits) and maintenance (PRs). Newcomer OSS-Candidates are less likely to practice social coding, but eventually end up onboarding (i.e., 30% quantitative, 70% follow-up survey) an OSS project. Furthermore, they cite finding a way to start as the most challenging barrier to contribute. Our work reveals insights on how newcomers to social coding platforms are potential sources of OSS contributions.

preprint2022arXiv

On the Use of Refactoring in Security Vulnerability Fixes: An Exploratory Study on Maven Libraries

Third-party library dependencies are commonplace in today's software development. With the growing threat of security vulnerabilities, applying security fixes in a timely manner is important to protect software systems. As such, the community developed a list of software and hardware weakness known as Common Weakness Enumeration (CWE) to assess vulnerabilities. Prior work has revealed that maintenance activities such as refactoring code potentially correlate with security-related aspects in the source code. In this work, we explore the relationship between refactoring and security by analyzing refactoring actions performed jointly with vulnerability fixes in practice. We conducted a case study to analyze 143 maven libraries in which 351 known vulnerabilities had been detected and fixed. Surprisingly, our exploratory results show that developers incorporate refactoring operations in their fixes, with 31.9% (112 out of 351) of the vulnerabilities paired with refactoring actions. We envision this short paper to open up potential new directions to motivate automated tool support, allowing developers to deliver fixes faster, while maintaining their code.

preprint2022arXiv

pycefr: Python Competency Level through Code Analysis

Python is known to be a versatile language, well suited both for beginners and advanced users. Some elements of the language are easier to understand than others: some are found in any kind of code, while some others are used only by experienced programmers. The use of these elements lead to different ways to code, depending on the experience with the language and the knowledge of its elements, the general programming competence and programming skills, etc. In this paper, we present pycefr, a tool that detects the use of the different elements of the Python language, effectively measuring the level of Python proficiency required to comprehend and deal with a fragment of Python code. Following the well-known Common European Framework of Reference for Languages (CEFR), widely used for natural languages, pycefr categorizes Python code in six levels, depending on the proficiency required to create and understand it. We also discuss different use cases for pycefr: identifying code snippets that can be understood by developers with a certain proficiency, labeling code examples in online resources such as Stackoverflow and GitHub to suit them to a certain level of competency, helping in the onboarding process of new developers in Open Source Software projects, etc. A video shows availability and usage of the tool: https://tinyurl.com/ypdt3fwe.

preprint2022arXiv

Understanding the Role of External Pull Requests in the NPM Ecosystem

The risk to using third-party libraries in a software application is that much needed maintenance is solely carried out by library maintainers. These libraries may rely on a core team of maintainers (who might be a single maintainer that is unpaid and overworked) to serve a massive client user-base. On the other hand, being open source has the benefit of receiving contributions (in the form of External PRs) to help fix bugs and add new features. In this paper, we investigate the role by which External PRs (contributions from outside the core team of maintainers) contribute to a library. Through a preliminary analysis, we find that External PRs are prevalent, and just as likely to be accepted as maintainer PRs. We find that 26.75% of External PRs submitted fix existing issues. Moreover, fixes also belong to labels such as breaking changes, urgent, and on-hold. Differently from Internal PRs, External PRs cover documentation changes (44 out of 384 PRs), while not having as much refactoring (34 out of 384 PRs). On the other hand, External PRs also cover new features (380 out of 384 PRs) and bugs (120 out of 384). Our results lay the groundwork for understanding how maintainers decide which external contributions they select to evolve their libraries and what role they play in reducing the workload.

preprint2021arXiv

FLOSS != GitHub: A Case Study of Linux/BSD Perceptions from Microsoft's Acquisition of GitHub

In 2018, the software industry giants Microsoft made a move into the Open Source world by completing the acquisition of mega Open Source platform, GitHub. This acquisition was not without controversy, as it is well-known that the free software communities includes not only the ability to use software freely, but also the libre nature in Open Source Software. In this study, our aim is to explore these perceptions in FLOSS developers. We conducted a survey that covered traditional FLOSS source Linux, and BSD communities and received 246 developer responses. The results of the survey confirm that the free community did trigger some communities to move away from GitHub and raised discussions into free and open software on the GitHub platform. The study reminds us that although GitHub is influential and trendy, it does not representative all FLOSS communities.

preprint2021arXiv

SōjiTantei: Function-Call Reachability Detection of Vulnerable Code for npm Packages

It has become common practice for software projects to adopt third-party dependencies. Developers are encouraged to update any outdated dependency to remain safe from potential threats of vulnerabilities. In this study, we present an approach to aid developers show whether or not a vulnerable code is reachable for JavaScript projects. Our prototype, SōjiTantei, is evaluated in two ways (i) the accuracy when compared to a manual approach and (ii) a larger-scale analysis of 780 clients from 78 security vulnerability cases. The first evaluation shows that SōjiTantei has a high accuracy of 83.3%, with a speed of less than a second analysis per client. The second evaluation reveals that 68 out of the studied 78 vulnerabilities reported having at least one clean client. The study proves that automation is promising with the potential for further improvement.

preprint2020arXiv

A Simulation Study of Bandit Algorithms to Address External Validity of Software Fault Prediction

Various software fault prediction models and techniques for building algorithms have been proposed. Many studies have compared and evaluated them to identify the most effective ones. However, in most cases, such models and techniques do not have the best performance on every dataset. This is because there is diversity of software development datasets, and therefore, there is a risk that the selected model or technique shows bad performance on a certain dataset. To avoid selecting a low accuracy model, we apply bandit algorithms to predict faults. Consider a case where player has 100 coins to bet on several slot machines. Ordinary usage of software fault prediction is analogous to the player betting all 100 coins in one slot machine. In contrast, bandit algorithms bet one coin on each machine (i.e., use prediction models) step-by-step to seek the best machine. In the experiment, we developed an artificial dataset that includes 100 modules, 15 of which include faults. Then, we developed various artificial fault prediction models and selected them dynamically using bandit algorithms. The Thomson sampling algorithm showed the best or second-best prediction performance compared with using only one prediction model.

preprint2020arXiv

Code-based Vulnerability Detection in Node.js Applications: How far are we?

With one of the largest available collection of reusable packages, the JavaScript runtime environment Node.js is one of the most popular programming application. With recent work showing evidence that known vulnerabilities are prevalent in both open source and industrial software, we propose and implement a viable code-based vulnerability detection tool for Node.js applications. Our case study lists the challenges encountered while implementing our Node.js vulnerable code detector.

preprint2020arXiv

DevReplay: Automatic Repair with Editable Fix Pattern

Static analysis tools, or linters, detect violation of source code conventions to maintain project readability. Those tools automatically fix specific violations while developers edit the source code. However, existing tools are designed for the general conventions of programming languages. These tools do not check the project/API-specific conventions. We propose a novel static analysis tool DevReplay that generates code change patterns by mining the code change history, and we recommend changes using the matched patterns. Using DevReplay, developers can automatically detect and fix project/API-specific problems in the code editor and code review. Also, we evaluate the accuracy of DevReplay using automatic program repair tool benchmarks and real software. We found that DevReplay resolves more bugs than state-of-the-art APR tools. Finally, we submitted patches to the most popular open-source projects that are implemented by different languages, and project reviewers accepted 80% (8 of 10) patches. DevReplay is available on https://devreplay.github.io.

preprint2020arXiv

From Academia to Software Development: Publication Citations in Source Code Comments

Academic publications have been evaluated in terms of their impact on research communities based on many metrics, such as the number of citations. On the other hand, the impact of academic publications on industry has been rarely studied. This paper investigates how academic publications contribute to software development by analyzing publication citations in source code comments in open source software repositories. We propose an automated approach for detecting academic publications based on Named Entity Recognition, and achieve 0.90 in $F_1$ as detection accuracy. We conduct a large-scale study of publication citations with 319,438,977 comments collected from 25,925 active repositories written in seven programming languages. Our findings indicate that academic publications can be knowledge sources for software development. These referenced publications are particularly from journals. In terms of knowledge transfer, algorithm is the most prevalent type of knowledge transferred from the publications, with proposed formulas or equations typically implemented in methods or functions in source code files. In a closer look at GitHub repositories referencing academic publications, we find that science-related repositories are the most frequent among GitHub repositories with publication citations, and that the vast majority of these publications are referenced by repository owners who are different from the publication authors. We also find that referencing older publications can lead to potential issues related to obsolete knowledge.

preprint2020arXiv

Newcomer Candidate: Characterizing Contributions of a Novice Developer to GitHub

Context: To attract, onboard, and retain any new-comer in Open Source Software (OSS) projects is vital to their livelihood. Recent studies conclude that OSS projects risk failure due to abandonment and poor participation of newcomers. Evidence suggests more new users are joining GitHub, however, the extent to which they contribute to OSS projects is unknown. Objective: In this study, we coin the term 'newcomer candidate' to describe new users to the GitHub platform. Our objective is to track and characterize their initial contributions. As a preliminary survey, we collected 208 newcomer candidate contributions in GitHub. Using this dataset, we then plan to track their contributions to reveal insights. Method: We will use a mixed-methods approach, i.e., quantitative and qualitative, to identify whether or not newcomer candidates practice social coding, the kinds of their contributions, projects they target, and the proportion that they eventually onboard to an OSS project. Limitation: The key limitation is that our newcomer candidates are restricted to those that were collected from our preliminary survey.

preprint2020arXiv

Predicting Defective Lines Using a Model-Agnostic Technique

Defect prediction models are proposed to help a team prioritize source code areas files that need Software QualityAssurance (SQA) based on the likelihood of having defects. However, developers may waste their unnecessary effort on the whole filewhile only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of lines of a file are defective. Hence, in this work, we propose a novel framework (called LINE-DP) to identify defective lines using a model-agnostic technique, i.e., an Explainable AI technique that provides information why the model makes such a prediction. Broadly speaking, our LINE-DP first builds a file-level defect model using code token features. Then, our LINE-DP uses a state-of-the-art model-agnostic technique (i.e.,LIME) to identify risky tokens, i.e., code tokens that lead the file-level defect model to predict that the file will be defective. Then, the lines that contain risky tokens are predicted as defective lines. Through a case study of 32 releases of nine Java open source systems, our evaluation results show that our LINE-DP achieves an average recall of 0.61, a false alarm rate of 0.47, a top 20%LOC recall of0.27, and an initial false alarm of 16, which are statistically better than six baseline approaches. Our evaluation shows that our LINE-DP requires an average computation time of 10 seconds including model construction and defective line identification time. In addition, we find that 63% of defective lines that can be identified by our LINE-DP are related to common defects (e.g., argument change, condition change). These results suggest that our LINE-DP can effectively identify defective lines that contain common defectswhile requiring a smaller amount of inspection effort and a manageable computation cost.

preprint2019arXiv

Towards Generation of Visual Attention Map for Source Code

Program comprehension is a dominant process in software development and maintenance. Experts are considered to comprehend the source code efficiently by directing their gaze, or attention, to important components in it. However, reflecting the importance of components is still a remaining issue in gaze behavior analysis for source code comprehension. Here we show a conceptual framework to compare the quantified importance of source code components with the gaze behavior of programmers. We use "attention" in attention models (e.g., code2vec) as the importance indices for source code components and evaluate programmers' gaze locations based on the quantified importance. In this report, we introduce the idea of our gaze behavior analysis using the attention map, and the results of a preliminary experiment.

Kenichi Matsumoto

What is connected

Connect this record

See the researcher in context

Building this map preview

20 published item(s)

"My GitHub Sponsors profile is live!" Investigating the Impact of Twitter/X Mentions on GitHub Sponsors

An Empirical Evaluation of Competitive Programming AI: A Case Study of AlphaCode

An Exploration of npm Package Co-Usage Examples from Stack Overflow: A Case Study

Bug-Fix Variants: Visualizing Unique Source Code Changes across GitHub Forks

GitHub Sponsors: Exploring a New Way to Contribute to Open Source

Giving Back: Contributions Congruent to Library Dependency Changes in a Software Ecosystem

Intertwining Ecosystems: A Large Scale Empirical Study of Libraries that Cross Software Ecosystems

Newcomer OSS-Candidates: Characterizing Contributions of Novice Developers to GitHub

On the Use of Refactoring in Security Vulnerability Fixes: An Exploratory Study on Maven Libraries

pycefr: Python Competency Level through Code Analysis

Understanding the Role of External Pull Requests in the NPM Ecosystem

FLOSS != GitHub: A Case Study of Linux/BSD Perceptions from Microsoft's Acquisition of GitHub

SōjiTantei: Function-Call Reachability Detection of Vulnerable Code for npm Packages

A Simulation Study of Bandit Algorithms to Address External Validity of Software Fault Prediction

Code-based Vulnerability Detection in Node.js Applications: How far are we?

DevReplay: Automatic Repair with Editable Fix Pattern

From Academia to Software Development: Publication Citations in Source Code Comments

Newcomer Candidate: Characterizing Contributions of a Novice Developer to GitHub

Predicting Defective Lines Using a Model-Agnostic Technique

Towards Generation of Visual Attention Map for Source Code