Source author record

Haipeng Cai

Haipeng Cai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Software Engineering, Graphics, Human-Computer Interaction, Programming Languages

Catalog footprint

What is connected

4 works

4 topics

4 close collaborators

preprint · 2024 · arXiv

VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses

Accompanying the successes of learning-based defensive software vulnerability analyses is the lack of large, high-quality sets of labeled vulnerable program samples, which impedes further advancement of those defenses. Existing automated sample generation approaches have shown potential yet still fall short of practical expectations due to the high noise in the generated samples. This paper proposes VGX, a new technique aimed at large-scale generation of high-quality vulnerability datasets. Given a normal program, VGX identifies the code contexts in which vulnerabilities can be injected, using a customized Transformer featuring a new value-flow-based position encoding and pre-trained against new objectives designed specifically for learning code structure and context. Then, VGX materializes vulnerability-injection code editing in the identified contexts using patterns of such edits obtained from both historical fixes and human knowledge about real-world vulnerabilities. Compared to four state-of-the-art (SOTA) baselines (pattern-, Transformer-, GNN-, and pattern+Transformer-based), VGX achieved 99.09%-890.06% higher F1 and 22.45%-328.47% higher label accuracy. For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair. Our results show that SOTA techniques for these three application tasks achieved 19.15%-330.80% higher F1, 12.86%-19.31% higher top-10 accuracy, and 85.02%-99.30% higher top-50 accuracy, respectively, by adding those samples to their original training data. These samples also helped a SOTA vulnerability detector discover 13 more real-world vulnerabilities (CVEs) in critical systems (e.g., Linux kernel) that would be missed by the original model.

preprint · 2022 · arXiv

Automatically Detecting API-induced Compatibility Issues in Android Apps: A Comparative Analysis (Replicability Study)

Fragmentation is a serious problem in the Android ecosystem. This problem is mainly caused by the fast evolution of the system itself and the various customizations independently maintained by different smartphone manufacturers. Many efforts have attempted to mitigate its impact via approaches to automatically pinpoint compatibility issues in Android apps. Unfortunately, at this stage, it is still unknown whether this objective has been fulfilled and whether the existing approaches can indeed be replicated and reliably leveraged to pinpoint compatibility issues in the wild. We therefore propose to fill this gap by first conducting a literature review within this topic to identify all the available approaches. Among the nine identified approaches, we then try our best to reproduce them based on their original datasets. After that, we go one step further to empirically compare those approaches against common datasets with real-world apps containing compatibility issues. Experimental results show that existing tools can indeed be reproduced, but their capabilities are quite distinct, as confirmed by the fact that there is only a small overlap among the results reported by the selected tools. This evidence suggests that our community should devote more effort to achieving sound detection of compatibility issues.

preprint · 2021 · arXiv

A Lightweight Approach of Human-Like Playtesting

A playtest is the process in which human testers are recruited to play video games and to reveal software bugs. Manual testing is expensive and time-consuming, especially when there are many mobile games to test and every software version requires extensive testing before being released. Existing testing frameworks (e.g., Android Monkey) are limited because they adopt no domain knowledge to play games. Learning-based tools (e.g., Wuji) involve a huge amount of training data and computation before testing any game. This paper presents LIT, our lightweight approach to generalize playtesting tactics from manual testing and to adopt the generalized tactics to automate game testing. LIT consists of two phases. In Phase I, while a human plays an Android game app G for a short period of time (e.g., eight minutes), LIT records the user's actions (e.g., swipe) and the scene before each action. Based on the collected data, LIT generalizes a set of context-aware, abstract playtesting tactics, which describe under what circumstances what actions can be taken to play the game. In Phase II, LIT tests G based on the generalized tactics. Namely, given a randomly generated game scene, LIT searches for a match with the abstract context of any inferred tactic; if there is a match, LIT customizes the tactic and generates a feasible event to play the game. Our evaluation with nine games shows that LIT outperforms two state-of-the-art tools. This implies that by automating playtesting, LIT will significantly reduce manual testing and boost the quality of game apps.

preprint · 2013 · arXiv

Composing DTI Visualizations with End-user Programming

We present the design and prototype implementation of a scientific visualization language called Zifazah for composing 3D visualizations of diffusion tensor magnetic resonance imaging (DT-MRI or DTI) data. Unlike existing tools that allow flexible customization of data visualizations but are programmer-oriented, we focus on domain scientists as end users in order to enable them to freely compose visualizations of their scientific data sets. To collect syntax and semantics for the language design, we analyzed end-user descriptions, extracted from interviews with neurologists and physicians who use DTI in clinical practice, of how they would build and use DTI visualizations; from these we derived the elements and structure of the proposed language. Zifazah makes use of the initial set of lexical terms and semantics to provide a declarative language in the spirit of intuitive syntax and usage. This work contributes three main design principles, among others, for scientific visualization language design, as well as a practical instance of such a language for DTI visualization with Zifazah. First, Zifazah incorporates visual symbolic mapping based on color, size and shape, which is a subset of Bertin's taxonomy migrated to scientific visualizations. Second, Zifazah is defined as a spatial language in which lexical representations of spatial relationships for 3D object visualization and manipulation, which are characteristic of scientific data, can be programmed. Third, built on top of Bertin's semiology, flexible data encoding specifically for scientific visualizations is integrated in our language in order to allow end users to achieve optimal visual composition. Along with sample scripts representative of our language design features, some new DTI visualizations created by end users with the language are also presented.