Researcher profile

Yuming Zhou

Yuming Zhou contributes to research discovery and scholarly infrastructure.

Researcher - Affiliation not imported - Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
7 works
0 followers
2 topics
4 close collaborators

Published work

7 published items

Preprint, 2024, arXiv

Coverage Goal Selector for Combining Multiple Criteria in Search-Based Unit Test Generation

Unit testing is critical to the software development process, ensuring the correctness of basic programming units in a program (e.g., a method). Search-based software testing (SBST) is an automated approach to generating test cases. SBST generates test cases with genetic algorithms by specifying the coverage criterion (e.g., branch coverage). However, a good test suite must have different properties, which cannot be captured using an individual coverage criterion. Therefore, the state-of-the-art approach combines multiple criteria to generate test cases. Since combining multiple coverage criteria brings multiple objectives for optimization, it hurts the test suites' coverage for certain criteria compared with using a single criterion. To cope with this problem, we propose a novel approach named smart selection. Based on the coverage correlations among criteria and the subsumption relationships among coverage goals, smart selection selects a subset of coverage goals to reduce the number of optimization objectives and avoid missing any property of any criterion. We conduct experiments to evaluate smart selection on 400 Java classes with three state-of-the-art genetic algorithms under a 2-minute budget. On average, smart selection outperforms combining all goals on 65.1% of the classes having significant differences between the two approaches. We also conduct experiments to verify our assumptions about coverage criteria relationships. Furthermore, we assess the coverage performance of smart selection under varying budgets of 5, 8, and 10 minutes and explore its effect on bug detection, confirming the advantage of smart selection over combining all goals.
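To make the role of subsumption concrete, here is a minimal sketch of pruning coverage goals that are implied by other goals. It is a simplified illustration only, not the smart selection algorithm from the paper: the goal names, the `subsumes` mapping, and the rule "drop every goal that some other goal subsumes" are all assumptions, and the sketch ignores the coverage correlations among criteria that the abstract also mentions. It further assumes the subsumption relation is acyclic.

```python
def select_goals(goals, subsumes):
    """Keep only goals that no other goal subsumes (directly or transitively).

    `subsumes` maps a goal to the set of goals it directly subsumes, meaning
    that covering the key is assumed to imply covering each listed goal.
    """
    implied = set()
    for goal in goals:
        stack = list(subsumes.get(goal, ()))
        seen = set()
        while stack:
            other = stack.pop()
            if other not in seen and other != goal:
                seen.add(other)
                stack.extend(subsumes.get(other, ()))
        implied |= seen
    return [g for g in goals if g not in implied]


# Hypothetical goals for a single method; names are illustrative only.
goals = ["line:10", "line:11", "branch:10-true", "method:foo"]
subsumes = {
    "branch:10-true": {"line:11"},
    "line:11": {"method:foo"},
    "line:10": {"method:foo"},
}
selected = select_goals(goals, subsumes)
print(selected)
```

Fewer selected goals means fewer optimization objectives for the genetic algorithm, while the pruned goals are still covered indirectly under the stated assumption.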

Preprint, 2022, arXiv

Selectively Combining Multiple Coverage Goals in Search-Based Unit Test Generation

Unit testing is a critical part of the software development process, ensuring the correctness of basic programming units in a program (e.g., a method). Search-based software testing (SBST) is an automated approach to generating test cases. SBST generates test cases with genetic algorithms by specifying the coverage criterion (e.g., branch coverage). However, a good test suite must have different properties, which cannot be captured by using an individual coverage criterion. Therefore, the state-of-the-art approach combines multiple criteria to generate test cases. As combining multiple coverage criteria brings multiple objectives for optimization, it hurts the test suites' coverage for certain criteria compared with using a single criterion. To cope with this problem, we propose a novel approach named smart selection. Based on the coverage correlations among criteria and the coverage goals' subsumption relationships, smart selection selects a subset of coverage goals to reduce the number of optimization objectives and avoid missing any property of any criterion. We conduct experiments to evaluate smart selection on 400 Java classes with three state-of-the-art genetic algorithms. On average, smart selection outperforms combining all goals on 65.1% of the classes having significant differences between the two approaches.

Preprint, 2022, arXiv

Test suite effectiveness metric evaluation: what do we know and what should we do?

Comparing test suite effectiveness metrics has always been a research hotspot. However, prior studies comparing test suite effectiveness metrics reach different or even contradictory conclusions. The problem we found most troubling to our community is that researchers tend to oversimplify the description of the ground truth they use. For example, a common expression is that "we studied the correlation between real faults and the metric to evaluate (MTE)". However, the meaning of "real faults" is not clear-cut. As a result, there is a need to scrutinize the meaning of "real faults". Without this, the conclusions can only be partially understood. To tackle this challenge, we propose a framework ASSENT (evAluating teSt Suite EffectiveNess meTrics) to guide follow-up research. In essence, ASSENT consists of three fundamental components: ground truth, benchmark test suites, and agreement indicator. First, materialize the ground truth for determining the real order in effectiveness among test suites. Second, generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, for the benchmark test suites, generate the MTE order in effectiveness using the MTE. Finally, calculate the agreement indicator between the two orders. Under ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score metrics and code coverage metrics. Our results show that, based on the real faults, mutation score and subsuming mutation score are the best metrics to quantify test suite effectiveness. Meanwhile, by using mutants instead of real faults, MTEs will be overestimated by more than 20% in values.
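The final ASSENT step, computing agreement between the ground truth order and the MTE order, can be illustrated with a rank correlation. The abstract does not say which agreement indicator ASSENT uses, so the sketch below assumes Kendall's tau (the tau-a variant, which does not correct for ties) as a stand-in. The scores are made-up illustration data, not results from the paper.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equally long score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical benchmark of four test suites.
ground_truth = [1, 3, 4, 6]          # e.g. real faults detected per suite
mte_scores = [0.20, 0.50, 0.45, 0.90]  # e.g. a coverage or mutation metric

tau = kendall_tau_a(ground_truth, mte_scores)
print(round(tau, 3))
```

A value near 1 means the metric ranks the suites almost exactly as the ground truth does; here one of the six pairs is ordered differently.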

Preprint, 2021, arXiv

An extensive empirical study of inconsistent labels in multi-version-project defect data sets

The label quality of defect data sets has a direct influence on the reliability of defect prediction models. In this study, for multi-version-project defect data sets, we propose an approach to automatically detect instances with inconsistent labels (i.e., the phenomenon of instances having the same source code but different labels over multiple versions of a software project) and to understand their influence on the evaluation and interpretation of defect prediction models. Based on five multi-version-project defect data sets (either widely used or the most up-to-date in the literature) collected by diverse approaches, we find that: (1) most versions in the investigated defect data sets contain inconsistent labels to varying degrees; (2) the existence of inconsistent labels in a training data set may considerably change the prediction performance of a defect prediction model and can lead to the identification of substantially different true defective modules; and (3) the importance ranking of independent variables in a defect prediction model can be substantially shifted due to the existence of inconsistent labels. The above findings reveal that inconsistent labels in defect data sets can profoundly change the prediction ability and interpretation of a defect prediction model. Therefore, we strongly suggest that practitioners detect and exclude inconsistent labels in defect data sets to avoid their potential negative influence on defect prediction models. Moreover, researchers need to improve existing defect label collection approaches to reduce inconsistent labels. Furthermore, there is a need to re-examine the experimental conclusions of previous studies using multi-version-project defect data sets with a high ratio of inconsistent labels.
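The definition of an inconsistent label (same source code, different labels across versions) suggests a simple detection pass. The following is a minimal sketch under stated assumptions, not the authors' tool: the record layout `(version, module, source, label)`, the use of a SHA-256 digest of the raw source text as the notion of "same source code", and the sample data are all hypothetical.

```python
import hashlib
from collections import defaultdict


def find_inconsistent(instances):
    """Return groups of identical-source instances that carry different labels.

    `instances` is an iterable of (version, module, source, label) tuples.
    The result maps (module, source digest) to a sorted list of
    (version, label) pairs, for groups with more than one distinct label.
    """
    groups = defaultdict(list)
    for version, module, source, label in instances:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        groups[(module, digest)].append((version, label))
    return {
        key: sorted(observations)
        for key, observations in groups.items()
        if len({label for _, label in observations}) > 1
    }


# Hypothetical two-version data set.
instances = [
    ("1.0", "Foo.java", "class Foo {}", "buggy"),
    ("1.1", "Foo.java", "class Foo {}", "clean"),   # same code, label changed
    ("1.0", "Bar.java", "class Bar {}", "clean"),
    ("1.1", "Bar.java", "class Bar { int x; }", "buggy"),  # code changed
]
inconsistent = find_inconsistent(instances)
flagged_modules = sorted({module for module, _ in inconsistent})
print(flagged_modules)
```

Only `Foo.java` is flagged: `Bar.java` also changes label, but its source changed between versions, so the labels are not inconsistent in the sense defined above.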

Preprint, 2021, arXiv

Measuring Discrimination to Boost Comparative Testing for Multiple Deep Learning Models

The boom of deep learning (DL) technology has led to massive numbers of DL models being built and shared, which facilitates the acquisition and reuse of DL models. For a given task, we encounter multiple DL models available with the same functionality, which are considered as candidates to achieve this task. Testers are expected to compare multiple DL models and select the more suitable ones w.r.t. the whole testing context. Due to limited labeling effort, testers aim to select an efficient subset of samples to make as precise a rank estimation as possible for these models. To tackle this problem, we propose Sample Discrimination based Selection (SDS) to select efficient samples that could discriminate multiple models, i.e., the prediction behaviors (right/wrong) of these samples would be helpful to indicate the trend of model performance. To evaluate SDS, we conduct an extensive empirical study with three widely used image datasets and 80 real-world DL models. The experimental results show that, compared with state-of-the-art baseline methods, SDS is an effective and efficient sample selection method to rank multiple DL models.
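The idea of preferring samples that separate the models can be illustrated with a label-free proxy. The abstract does not describe how SDS scores samples, so the sketch below is an assumption and not SDS itself: it scores each unlabeled sample by how much the candidate models disagree on it and spends the labeling budget on the highest-scoring samples. The predictions are made-up data.

```python
from collections import Counter


def disagreement(predictions):
    """Share of models that deviate from the most common prediction."""
    counts = Counter(predictions)
    return 1.0 - counts.most_common(1)[0][1] / len(predictions)


def select_samples(pred_matrix, budget):
    """Pick the `budget` sample indices with the highest disagreement.

    `pred_matrix[i]` holds every model's predicted class for sample i.
    Ties are broken by the lower sample index.
    """
    scored = [(disagreement(preds), idx) for idx, preds in enumerate(pred_matrix)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:budget]]


# Hypothetical predictions from four models on four unlabeled samples.
pred_matrix = [
    ["cat", "cat", "cat", "cat"],   # all agree: tells us nothing about rank
    ["cat", "dog", "cat", "dog"],
    ["dog", "dog", "dog", "cat"],
    ["cat", "dog", "bird", "cat"],
]
chosen = select_samples(pred_matrix, budget=2)
print(chosen)
```

Samples on which every model agrees cannot change the relative ranking of the models, which is why they score zero here.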

Preprint, 2021, arXiv

Mutant reduction evaluation: what is there and what is missing?

Background. Many mutation reduction strategies, which aim to reduce the number of mutants, have been proposed. Problem. It is important to measure the ability of a mutation reduction strategy to maintain test suite effectiveness evaluation. However, existing evaluation indicators are unable to measure the "order-preserving ability". Objective. We aim to propose evaluation indicators to measure the "order-preserving ability" of a mutation reduction strategy, which is important but missing in our community. Method. Given a test suite on a Software Under Test (SUT) with a set of original mutants, we leverage the test suite to generate a group of test suites that have a partial order relationship in fault-detecting potential. When evaluating a reduction strategy, we first construct two partial order relationships among the generated test suites in terms of mutation score, one with the original mutants and another with the reduced mutants. Then, we measure the extent to which the two partial order relationships are consistent. The more consistent the two partial order relationships are, the stronger the Order Preservation (OP) of the mutation reduction strategy is, and the more effective the reduction strategy is. Furthermore, we propose Effort-aware Relative Order Preservation (EROP) to measure how much gain a mutation reduction strategy can provide compared with a random reduction strategy. Result. The experimental results show that OP and EROP are able to efficiently measure the "order-preserving ability" of a mutation reduction strategy. As a result, they have a better ability to distinguish various mutation reduction strategies compared with the existing evaluation indicators. Conclusion. We suggest that researchers use OP and EROP to measure the effectiveness of a mutant reduction strategy.
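The comparison of the two orderings can be sketched as follows. This is a simplified stand-in for OP, not the paper's definition, which the abstract does not give: it assumes each test suite is described by the set of mutants it kills, computes mutation scores against the original and the reduced mutant sets, and reports the fraction of suite pairs whose relative order (lower, equal, higher) is the same under both. The suite names and mutant IDs are hypothetical.

```python
from itertools import combinations


def mutation_score(killed, mutants):
    """Fraction of the given mutants that the suite kills."""
    return len(killed & mutants) / len(mutants)


def order_agreement(killed_by_suite, original, reduced):
    """Fraction of suite pairs ordered identically under both mutant sets."""
    names = sorted(killed_by_suite)
    same = total = 0
    for a, b in combinations(names, 2):
        orig_a = mutation_score(killed_by_suite[a], original)
        orig_b = mutation_score(killed_by_suite[b], original)
        red_a = mutation_score(killed_by_suite[a], reduced)
        red_b = mutation_score(killed_by_suite[b], reduced)
        # Sign of the comparison: -1, 0, or 1.
        sign_orig = (orig_a > orig_b) - (orig_a < orig_b)
        sign_red = (red_a > red_b) - (red_a < red_b)
        same += sign_orig == sign_red
        total += 1
    return same / total


# Hypothetical suites with increasing fault-detecting potential.
killed_by_suite = {"T1": {1, 2}, "T2": {1, 2, 3, 4}, "T3": {1, 2, 3, 4, 5, 6}}
original = set(range(1, 9))

good = order_agreement(killed_by_suite, original, reduced={2, 4, 6})
bad = order_agreement(killed_by_suite, original, reduced={1, 2})
print(good, bad)
```

The second reduced set keeps only mutants that every suite kills, so all suites tie and the original ordering is lost entirely.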

Preprint, 2020, arXiv

Prioritizing documentation effort: Can we do better?

Code documentation is essential for software quality assurance, but due to time or economic pressures, developers are often unable to write documentation for all modules in a project. Recently, a supervised artificial neural network (ANN) approach was proposed to prioritize important modules for documentation effort. However, as a supervised approach, it needs labeled training data to train the prediction model, which may not be easy to obtain in practice. Furthermore, it is unclear whether the ANN approach is generalizable, as it was only evaluated on several small data sets. In this paper, we propose an unsupervised approach based on PageRank to prioritize documentation effort. This approach identifies "important" modules based only on the dependence relationships between modules in a project. As a result, the PageRank approach does not need any training data to build the prediction model. In order to evaluate the effectiveness of the PageRank approach, we use six additional large data sets to conduct the experiments in addition to the same data sets collected from open-source projects as used in prior studies. The experimental results show that the PageRank approach is superior to the state-of-the-art ANN approach in prioritizing important modules for documentation effort. In particular, due to its simplicity and effectiveness, we advocate that the PageRank approach should be used as an easy-to-implement baseline in future research on documentation effort prioritization, and any new approach should be compared with it to demonstrate its effectiveness.
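A minimal sketch of ranking modules with PageRank over a dependency graph follows. It is not the authors' implementation: the edge direction (from a module to the modules it depends on, so that heavily used modules rank higher), the damping factor of 0.85, and the module names are assumptions made for illustration.

```python
def pagerank(deps, damping=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank.

    `deps` maps each module to the list of modules it depends on.
    Returns a dict of module -> rank; ranks sum to 1.
    """
    nodes = sorted(set(deps) | {m for targets in deps.values() for m in targets})
    n = len(nodes)
    rank = {m: 1.0 / n for m in nodes}
    for _ in range(iters):
        # Rank held by modules without dependencies is spread evenly.
        dangling = sum(rank[m] for m in nodes if not deps.get(m))
        new = {m: (1.0 - damping) / n + damping * dangling / n for m in nodes}
        for src in nodes:
            targets = deps.get(src, [])
            for dst in targets:
                new[dst] += damping * rank[src] / len(targets)
        delta = sum(abs(new[m] - rank[m]) for m in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical project: keys depend on the modules in their lists.
deps = {"ui": ["core", "util"], "core": ["util"], "cli": ["core"], "util": []}
ranks = pagerank(deps)
priority = sorted(ranks, key=ranks.get, reverse=True)
print(priority[:2])
```

No labeled data is involved: the ranking is derived from the dependency structure alone, which is the property the abstract highlights.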