Source author record

Ziyuan Wang

Ziyuan Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Software Engineering; Computation and Language; Computer Vision; cs.CY; Distributed, Parallel, and Cluster Computing; eess.IV; math.OC; Social and Information Networks

Catalog footprint

What is connected

8 works

8 topics

4 close collaborators

preprint · 2023 · arXiv

How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection

The introduction of ChatGPT has garnered widespread attention in both academic and industrial communities. ChatGPT is able to respond effectively to a wide range of human questions, providing fluent and comprehensive answers that significantly surpass previous public chatbots in terms of security and usefulness. On one hand, people are curious about how ChatGPT is able to achieve such strength and how far it is from human experts. On the other hand, people are starting to worry about the potential negative impacts that large language models (LLMs) like ChatGPT could have on society, such as fake news, plagiarism, and social security issues. In this work, we collected tens of thousands of comparison responses from both human experts and ChatGPT, with questions spanning open-domain, financial, medical, legal, and psychological areas. We call the collected dataset the Human ChatGPT Comparison Corpus (HC3). Based on the HC3 dataset, we study the characteristics of ChatGPT's responses, the differences and gaps from human experts, and future directions for LLMs. We conducted comprehensive human evaluations and linguistic analyses of ChatGPT-generated content compared with that of humans, revealing many interesting results. After that, we conducted extensive experiments on how to effectively detect whether a certain text is generated by ChatGPT or humans. We build three different detection systems, explore several key factors that influence their effectiveness, and evaluate them in different scenarios. The dataset, code, and models are all publicly available at https://github.com/Hello-SimpleAI/chatgpt-comparison-detection.

preprint · 2022 · arXiv

A Bregman inertial forward-reflected-backward method for nonconvex minimization

We propose a Bregman inertial forward-reflected-backward (BiFRB) method for nonconvex composite problems. Our analysis relies on a novel approach that imposes general conditions on implicit merit function parameters, which yields a stepsize condition that is independent of inertial parameters. In turn, a question of Malitsky and Tam regarding whether FRB can be equipped with a Nesterov-type acceleration is resolved. Assuming the generalized concave Kurdyka-Łojasiewicz property of a quadratic regularization of the objective, we obtain sequential convergence of BiFRB, as well as convergence rates on both the function value and actual sequence. We also present formulae for the Bregman subproblem, supplementing not only BiFRB but also the work of Boţ-Csetnek-László and Boţ-Csetnek. Numerical simulations are conducted to evaluate the performance of our proposed algorithm.
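For orientation, here is a minimal sketch of the plain forward-reflected-backward (FRB) iteration of Malitsky and Tam that BiFRB builds on. It deliberately omits the Bregman distance and the inertial term, so it is not the authors' method. The toy objective (an l1 penalty plus a quadratic), the step size, and all function names are assumptions chosen for illustration.

```python
# Plain FRB sketch: no Bregman distance, no inertia (not the paper's BiFRB).
# Toy problem (an assumption): minimize  mu * ||x||_1 + 0.5 * ||x - b||^2


def soft_threshold(v, t):
    """Proximal map of t * |.| applied to a scalar."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def grad_g(x, b):
    """Gradient of the smooth part g(x) = 0.5 * ||x - b||^2 (Lipschitz constant 1)."""
    return [xi - bi for xi, bi in zip(x, b)]


def frb(b, mu, step=0.4, iters=500):
    """Iterate x_{k+1} = prox_{step*f}(x_k - 2*step*grad_g(x_k) + step*grad_g(x_{k-1})).

    The step is kept below 1/(2L) with L = 1, the usual FRB condition in the convex case.
    """
    x_prev = [0.0] * len(b)
    x = [0.0] * len(b)
    for _ in range(iters):
        g_now = grad_g(x, b)
        g_old = grad_g(x_prev, b)
        x_next = [
            soft_threshold(xi - 2.0 * step * gn + step * go, step * mu)
            for xi, gn, go in zip(x, g_now, g_old)
        ]
        x_prev, x = x, x_next
    return x


b = [3.0, -0.2, 1.5]
solution = frb(b, mu=0.5)
# For this toy problem the exact minimizer is the soft-thresholding of b by mu.
expected = [soft_threshold(bi, 0.5) for bi in b]
```

On this convex toy problem the iterates approach the closed-form minimizer, which makes the sketch easy to sanity-check. The nonconvex setting analyzed in the paper needs the conditions stated in the abstract.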

preprint · 2022 · arXiv

Program Repair: Automated vs. Manual

Various automated program repair (APR) techniques have been proposed to fix bugs automatically in the last decade. Although recent research has made significant progress in effectiveness and efficiency, it is still unclear how APR techniques perform with human intervention in a real debugging scenario. To bridge this gap, we conduct an extensive study to compare three state-of-the-art APR tools with manual program repair, and further investigate whether the assistance of APR tools (i.e., repair reports) can improve manual program repair. To that end, we recruit 20 participants for a controlled experiment, resulting in a total of 160 manual repair tasks and a questionnaire survey. The experiment reveals several notable observations: (1) manual program repair may sometimes be influenced by the frequency of repair actions; (2) APR tools are more efficient in terms of debugging time, while manual program repair tends to generate a correct patch with fewer attempts; (3) APR tools can further improve manual program repair regarding the number of correctly fixed bugs, while there exists a negative impact on patch correctness; (4) participants tend to spend more time identifying incorrect patches, yet they are still easily misled; (5) participants are positive about the tools' repair performance, while they generally lack confidence about their usability in practice. In addition, we provide some guidelines for improving the usability of APR tools (e.g., the misleading information in reports and the observation of feedback).

preprint · 2022 · arXiv

Test suite effectiveness metric evaluation: what do we know and what should we do?

Comparing test suite effectiveness metrics has always been a research hotspot. However, prior studies reach different or even contradictory conclusions when comparing test suite effectiveness metrics. The problem we found most troubling to our community is that researchers tend to oversimplify the description of the ground truth they use. For example, a common expression is that "we studied the correlation between real faults and the metric to evaluate (MTE)". However, the meaning of "real faults" is not clear-cut. As a result, there is a need to scrutinize the meaning of "real faults". Without this, the conclusions can be only partly understood. To tackle this challenge, we propose a framework ASSENT (evAluating teSt Suite EffectiveNess meTrics) to guide the follow-up research. In essence, ASSENT consists of three fundamental components: ground truth, benchmark test suites, and agreement indicator. First, materialize the ground truth for determining the real order in effectiveness among test suites. Second, generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, for the benchmark test suites, generate the MTE order in effectiveness by the metric to evaluate (MTE). Finally, calculate the agreement indicator between the two orders. Under ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score metrics and code coverage metrics. Our results show that, based on the real faults, mutation score and subsuming mutation score are the best metrics to quantify test suite effectiveness. Meanwhile, when mutants are used instead of real faults, MTE values are overestimated by more than 20%.
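The final ASSENT step, computing agreement between the ground-truth order and the MTE order, can be illustrated with a rank correlation. The abstract does not name the agreement indicator, so the sketch below assumes Kendall's tau (the tau-a variant) as a stand-in. The scores and function name are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(truth_scores, metric_scores):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Both inputs give one score per benchmark test suite, in the same suite order.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(truth_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")
    concordant = discordant = 0
    pairs = list(combinations(range(len(truth_scores)), 2))
    for i, j in pairs:
        product = (truth_scores[i] - truth_scores[j]) * (metric_scores[i] - metric_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Hypothetical effectiveness of four benchmark suites under the ground truth
# and under a metric to evaluate (MTE).
ground_truth = [0.2, 0.5, 0.7, 0.9]
mte_values = [0.1, 0.6, 0.4, 0.8]
agreement = kendall_tau_a(ground_truth, mte_values)  # 5 concordant, 1 discordant pair
```

A value of 1 means the metric ranks the suites exactly as the ground truth does, and -1 means it reverses the order. Different metrics can then be compared by their agreement values on the same benchmark suites.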

preprint · 2021 · arXiv

Mutant reduction evaluation: what is there and what is missing?

Background. Many mutation reduction strategies, which aim to reduce the number of mutants, have been proposed. Problem. It is important to measure the ability of a mutation reduction strategy to maintain test suite effectiveness evaluation. However, existing evaluation indicators are unable to measure the "order-preserving ability". Objective. We aim to propose evaluation indicators to measure the "order-preserving ability" of a mutation reduction strategy, which is important but missing in our community. Method. Given a test suite on a Software Under Test (SUT) with a set of original mutants, we leverage the test suite to generate a group of test suites that have a partial order relationship in fault detecting potential. When evaluating a reduction strategy, we first construct two partial order relationships among the generated test suites in terms of mutation score, one with the original mutants and another with the reduced mutants. Then, we measure the extent to which the two partial order relationships are consistent. The more consistent the two partial order relationships are, the stronger the Order Preservation (OP) of the mutation reduction strategy is, and the more effective the reduction strategy is. Furthermore, we propose Effort-aware Relative Order Preservation (EROP) to measure how much gain a mutation reduction strategy can provide compared with a random reduction strategy. Result. The experimental results show that OP and EROP are able to efficiently measure the "order-preserving ability" of a mutation reduction strategy. As a result, they have a better ability to distinguish various mutation reduction strategies compared with the existing evaluation indicators. Conclusion. We suggest, for the researchers, that OP and EROP should be used to measure the effectiveness of a mutant reduction strategy.
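The idea of comparing test suite orderings under the original and the reduced mutant sets can be sketched as follows. This is only an illustration: the abstract does not give the formulas for OP or EROP, so the pairwise agreement fraction below is an assumed simplification. The kill data and function names are hypothetical.

```python
from itertools import combinations


def mutation_score(killed, mutants):
    """Fraction of the given mutants that a test suite kills."""
    return len(killed & mutants) / len(mutants)


def sign(value):
    return (value > 0) - (value < 0)


def order_agreement(kills_by_suite, original, reduced):
    """Share of suite pairs ranked the same way under both mutant sets.

    A pair agrees when the sign of the mutation score difference matches.
    This is a simplified stand-in, not the paper's OP definition.
    """
    pairs = list(combinations(sorted(kills_by_suite), 2))
    agree = 0
    for a, b in pairs:
        diff_orig = mutation_score(kills_by_suite[a], original) - mutation_score(kills_by_suite[b], original)
        diff_red = mutation_score(kills_by_suite[a], reduced) - mutation_score(kills_by_suite[b], reduced)
        if sign(diff_orig) == sign(diff_red):
            agree += 1
    return agree / len(pairs)


# Hypothetical data: which mutant ids each test suite kills.
kills = {"A": {1, 2, 3, 4, 5}, "B": {1, 2, 6}, "C": {4, 5, 6}}
original_mutants = {1, 2, 3, 4, 5, 6}
reduced_mutants = {1, 2, 3}  # output of some reduction strategy
agreement = order_agreement(kills, original_mutants, reduced_mutants)
```

Here suites B and C tie on the original mutants but not on the reduced set, so two of the three pairs keep their order. A reduction strategy that preserves order well would push this fraction toward 1.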

preprint · 2020 · arXiv

Automatic Data Augmentation via Deep Reinforcement Learning for Effective Kidney Tumor Segmentation

Conventional data augmentation realized by performing simple pre-processing operations (e.g., rotation, crop, etc.) has been validated for its advantage in enhancing the performance for medical image segmentation. However, the data generated by these conventional augmentation methods are random and sometimes harmful to the subsequent segmentation. In this paper, we developed a novel automatic learning-based data augmentation method for medical image segmentation which models the augmentation task as a trial-and-error procedure using deep reinforcement learning (DRL). In our method, we innovatively combine the data augmentation module and the subsequent segmentation module in an end-to-end training manner with a consistent loss. Specifically, the best sequential combination of different basic operations is automatically learned by directly maximizing the performance improvement (i.e., Dice ratio) on the available validation set. We extensively evaluated our method on CT kidney tumor segmentation which validated the promising results of our method.
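The Dice ratio used as the learning signal is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. The sketch below computes it on flattened binary masks and derives a reward as the change in validation Dice. The reward form, the convention for empty masks, and the function names are assumptions; the abstract only states that the Dice improvement on the validation set is maximized.

```python
def dice_ratio(pred, target):
    """Dice ratio of two equally sized binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks match perfectly
    return 2.0 * intersection / total


def augmentation_reward(dice_before, dice_after):
    """Assumed reward: improvement in validation Dice after applying an augmentation."""
    return dice_after - dice_before


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
score = dice_ratio(pred_mask, true_mask)  # one overlapping pixel out of four labeled pixels -> 0.5
reward = augmentation_reward(dice_before=0.5, dice_after=0.75)
```

In a reinforcement learning loop, a positive reward would reinforce the sequence of augmentation operations that preceded it, while a negative reward would discourage it.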

preprint · 2016 · arXiv

LeaveNow: A Social Network-based Smart Evacuation System for Disaster Management

The importance of timely response to natural disasters and evacuating affected people to safe areas is paramount to save lives. Emergency services are often handicapped by the amount of rescue resources at their disposal. We present a system that leverages the power of a social network forming new connections among people based on real-time location and expands the rescue resources pool by adding private sector cars. We also introduce a car-sharing algorithm to identify safe routes in an emergency with the aim of minimizing evacuation time, maximizing pick-up of people without cars, and avoiding traffic congestion.

preprint · 2009 · arXiv

Decentralized Traffic Management Strategies for Sensor-Enabled Cars

Traffic congestion and accidents are major concerns in today's transportation systems. This thesis investigates how to optimize traffic flow on highways, in particular for merging situations such as intersections where a ramp leads onto the highway. In our work, cars are equipped with sensors that can detect distance to neighboring cars, and communicate their velocity and acceleration readings with one another. Sensor-enabled cars can locally exchange sensed information about the traffic and adapt their behavior much earlier than regular cars. We propose proactive algorithms for merging different streams of sensor-enabled cars into a single stream. A proactive merging algorithm decouples the decision point from the actual merging point. Sensor-enabled cars allow us to decide where and when a car merges before it arrives at the actual merging point. This leads to a significant improvement in traffic flow as velocities can be adjusted appropriately. We compare proactive merging algorithms against the conventional priority-based merging algorithm in a controlled simulation environment. Experimental results show that proactive merging algorithms outperform the priority-based merging algorithm in terms of flow and delay.
