Researcher profile

Yanhui Li

Yanhui Li contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2026arXiv

Investigating red packet fraud in Android applications: Insights from user reviews

With the popularization of smartphones, red packets have been widely used in mobile apps. However, the issues of fraud associated with them have also become increasingly prominent. As reported in user reviews from mobile app markets, many users have complained about experiencing red packet fraud and being persistently troubled by fraudulent red packets. To uncover this phenomenon, we conduct the first investigation into an extensive collection of user reviews on apps with red packets. In this paper, we first propose a novel automated approach, ReckDetector, for effectively identifying apps with red packets from app markets. We then collect over 360,000 real user reviews from 334 apps with red packets available on Google Play and three popular alternative Android app markets. We preprocess the user reviews to extract those related to red packets and fine-tune a pre-trained BERT model to identify negative reviews. Finally, based on semantic analysis, we have summarized six distinct categories of red packet fraud issues reported by users. Through our study, we found that red packet fraud is highly prevalent, significantly impacting user experience and damaging the reputation of apps. Moreover, red packets have been widely exploited by unscrupulous app developers as a deceptive incentive mechanism to entice users into completing their designated tasks, thereby maximizing their profits.

preprint2022arXiv

Test suite effectiveness metric evaluation: what do we know and what should we do?

Comparing test suite effectiveness metrics has always been a research hotspot. However, prior studies have different conclusions or even contradict each other for comparing different test suite effectiveness metrics. The problem we found most troubling to our community is that researchers tend to oversimplify the description of the ground truth they use. For example, a common expression is that "we studied the correlation between real faults and the metric to evaluate (MTE)". However, the meaning of "real faults" is not clear-cut. As a result, there is a need to scrutinize the meaning of "real faults". Without this, it will be half-knowledgeable with the conclusions. To tackle this challenge, we propose a framework ASSENT (evAluating teSt Suite EffectiveNess meTrics) to guide the follow-up research. In nature, ASSENT consists of three fundamental components: ground truth, benchmark test suites, and agreement indicator. First, materialize the ground truth for determining the real order in effectiveness among test suites. Second, generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, for the benchmark test suites, generate the MTE order in effectiveness by the metric to evaluate (MTE). Finally, calculate the agreement indicator between the two orders. Under ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score metrics and code coverage metrics. Our results show that, based on the real faults, mutation score and subsuming mutation score are the best metrics to quantify test suite effectiveness. Meanwhile, by using mutants instead of real faults, MTEs will be overestimated by more than 20% in values.

preprint2021arXiv

An extensive empirical study of inconsistent labels in multi-version-project defect data sets

The label quality of defect data sets has a direct influence on the reliability of defect prediction models. In this study, for multi-version-project defect data sets, we propose an approach to automatically detecting instances with inconsistent labels (i.e. the phenomena of instances having the same source code but different labels over multiple versions of a software project) and understand their influence on the evaluation and interpretation of defect prediction models. Based on five multi-version-project defect data sets (either widely used or the most up-to-date in the literature) collected by diverse approaches, we find that: (1) most versions in the investigated defect data sets contain inconsistent labels with varying degrees; (2) the existence of inconsistent labels in a training data set may considerably change the prediction performance of a defect prediction model as well as can lead to the identification of substantially different true defective modules; and (3) the importance ranking of independent variables in a defect prediction model can be substantially shifted due to the existence of inconsistent labels. The above findings reveal that inconsistent labels in defect data sets can profoundly change the prediction ability and interpretation of a defect prediction model. Therefore, we strongly suggest that practitioners should detect and exclude inconsistent labels in defect data sets to avoid their potential negative influence on defect prediction models. What is more, it is necessary for researchers to improve existing defect label collection approaches to reduce inconsistent labels. Furthermore, there is a need to re-examine the experimental conclusions of previous studies using multi-version-project defect data sets with a high ratio of inconsistent labels.

preprint2021arXiv

Measuring Discrimination to Boost Comparative Testing for Multiple Deep Learning Models

The boom of DL technology leads to massive DL models built and shared, which facilitates the acquisition and reuse of DL models. For a given task, we encounter multiple DL models available with the same functionality, which are considered as candidates to achieve this task. Testers are expected to compare multiple DL models and select the more suitable ones w.r.t. the whole testing context. Due to the limitation of labeling effort, testers aim to select an efficient subset of samples to make an as precise rank estimation as possible for these models. To tackle this problem, we propose Sample Discrimination based Selection (SDS) to select efficient samples that could discriminate multiple models, i.e., the prediction behaviors (right/wrong) of these samples would be helpful to indicate the trend of model performance. To evaluate SDS, we conduct an extensive empirical study with three widely-used image datasets and 80 real world DL models. The experimental results show that, compared with state-of-the-art baseline methods, SDS is an effective and efficient sample selection method to rank multiple DL models.

preprint2021arXiv

Mutant reduction evaluation: what is there and what is missing?

Background. Many mutation reduction strategies, which aim to reduce the number of mutants, have been proposed. Problem. It is important to measure the ability of a mutation reduction strategy to maintain test suite effectiveness evaluation. However, existing evaluation indicators are unable to measure the "order-preserving ability". Objective. We aim to propose evaluation indicators to measure the "order-preserving ability" of a mutation reduction strategy, which is important but missing in our community. Method. Given a test suite on a Software Under Test (SUT) with a set of original mutants, we leverage the test suite to generate a group of test suites that have a partial order relationship in fault detecting potential. When evaluating a reduction strategy, we first construct two partial order relationships among the generated test suites in terms of mutation score, one with the original mutants and another with the reduced mutants. Then, we measure the extent to which the two partial order relationships are consistent. The more consistent the two partial order relationships are, the stronger the Order Preservation (OP) of the mutation reduction strategy is, and the more effective the reduction strategy is. Furthermore, we propose Effort-aware Relative Order Preservation (EROP) to measure how much gain a mutation reduction strategy can provide compared with a random reduction strategy. Result. The experimental results show that OP and EROP are able to efficiently measure the "order-preserving ability" of a mutation reduction strategy. As a result, they have a better ability to distinguish various mutation reduction strategies compared with the existing evaluation indicators. Conclusion. We suggest, for the researchers, that OP and EROP should be used to measure the effectiveness of a mutant reduction strategy.

preprint2020arXiv

Prioritizing documentation effort: Can we do better?

Code documentations are essential for software quality assurance, but due to time or economic pressures, code developers are often unable to write documents for all modules in a project. Recently, a supervised artificial neural network (ANN) approach is proposed to prioritize important modules for documentation effort. However, as a supervised approach, there is a need to use labeled training data to train the prediction model, which may not be easy to obtain in practice. Furthermore, it is unclear whether the ANN approach is generalizable, as it is only evaluated on several small data sets. In this paper, we propose an unsupervised approach based on PageRank to prioritize documentation effort. This approach identifies "important" modules only based on the dependence relationships between modules in a project. As a result, the PageRank approach does not need any training data to build the prediction model. In order to evaluate the effectiveness of the PageRank approach, we use six additional large data sets to conduct the experiments in addition to the same data sets collected from open-source projects as used in prior studies. The experimental results show that the PageRank approach is superior to the state-of-the-art ANN approach in prioritizing important modules for documentation effort. In particular, due to the simplicity and effectiveness, we advocate that the PageRank approach should be used as an easy-to-implement baseline in future research on documentation effort prioritization, and any new approach should be compared with it to demonstrate its effectiveness.