Source author record

Shaohua Wang

Shaohua Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Software Engineering Programming Languages

Catalog footprint

What is connected

4works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Context-Free Grammar Inference for Complex Programming Languages in Black Box Settings

Grammar inference for complex programming languages remains a significant challenge, as existing approaches fail to scale to real world datasets within practical time constraints. In our experiments, none of the state-of-the-art tools, including Arvada, Treevada and Kedavra were able to infer grammars for complex languages such as C, C++, and Java within 48 hours. Arvada and Treevada perform grammar inference directly on full-length input examples, which proves inefficient for large files commonly found in such languages. While Kedavra introduces data decomposition to create shorter examples for grammar inference, its lexical analysis still relies on the original inputs. Additionally, its strict no-overgeneralization constraint limits the construction of complex grammars. To overcome these limitations, we propose Crucio, which builds a decomposition forest to extract short examples for lexical and grammar inference via a distributional matrix. Experimental results show that Crucio is the only method capable of successfully inferring grammars for complex programming languages (where the number of nonterminals is up to 23x greater than in prior benchmarks) within reasonable time limits. On the prior simple benchmark, Crucio achieves an average recall improvement of 1.37x and 1.19x over Treevada and Kedavra, respectively, and improves F1 scores by 1.21x and 1.13x.

preprint2022arXiv

DEAR: A Novel Deep Learning-based Approach for Automated Program Repair

The existing deep learning (DL)-based automated program repair (APR) models are limited in fixing general software defects. % We present {\tool}, a DL-based approach that supports fixing for the general bugs that require dependent changes at once to one or multiple consecutive statements in one or multiple hunks of code. % We first design a novel fault localization (FL) technique for multi-hunk, multi-statement fixes that combines traditional spectrum-based (SB) FL with deep learning and data-flow analysis. It takes the buggy statements returned by the SBFL model, detects the buggy hunks to be fixed at once, and expands a buggy statement $s$ in a hunk to include other suspicious statements around $s$. We design a two-tier, tree-based LSTM model that incorporates cycle training and uses a divide-and-conquer strategy to learn proper code transformations for fixing multiple statements in the suitable fixing context consisting of surrounding subtrees. We conducted several experiments to evaluate {\tool} on three datasets: Defects4J (395 bugs), BigFix (+26k bugs), and CPatMiner (+44k bugs). On Defects4J dataset, {\tool} outperforms the baselines from 42\%--683\% in terms of the number of auto-fixed bugs with only the top-1 patches. On BigFix dataset, it fixes 31--145 more bugs than existing DL-based APR models with the top-1 patches. On CPatMiner dataset, among 667 fixed bugs, there are 169 (25.3\%) multi-hunk/multi-statement bugs. {\tool} fixes 71 and 164 more bugs, including 52 and 61 more multi-hunk/multi-statement bugs, than the state-of-the-art, DL-based APR models.

preprint2021arXiv

A Context-based Automated Approach for Method Name Consistency Checking and Suggestion

Misleading method names in software projects can confuse developers, which may lead to software defects and affect code understandability. In this paper, we present DeepName, a context-based, deep learning approach to detect method name inconsistencies and suggest a proper name for a method. The key departure point is the philosophy of "Show Me Your Friends, I'll Tell You Who You Are". Unlike the state-of-the-art approaches, in addition to the method's body, we also consider the interactions of the current method under study with the other ones including the caller and callee methods, and the sibling methods in the same enclosing class. The sequences of sub-tokens in the program entities' names in the contexts are extracted and used as the input for an RNN-based encoder-decoder to produce the representations for the current method. We modify that RNN model to integrate the copy mechanism and our newly developed component, called the non-copy mechanism, to emphasize on the possibility of a certain sub-token not to be copied to follow the current sub-token in the currently generated method name. We conducted several experiments to evaluate DeepName on large datasets with +14M methods. For consistency checking, DeepName improves the state-of-the-art approach by 2.1%, 19.6%, and 11.9% relatively in recall, precision, and F-score, respectively. For name suggestion, DeepName improves relatively over the state-of-the-art approaches in precision (1.8%--30.5%), recall (8.8%--46.1%), and F-score (5.2%--38.2%). To assess DeepName's usefulness, we detected inconsistent methods and suggested new method names in active projects. Among 50 pull requests, 12 were merged into the main branch. In total, in 30/50 cases, the team members agree that our suggested method names are more meaningful than the current names.

preprint2021arXiv

Fault Localization with Code Coverage Representation Learning

In this paper, we propose DeepRL4FL, a deep learning fault localization (FL) approach that locates the buggy code at the statement and method levels by treating FL as an image pattern recognition problem. DeepRL4FL does so via novel code coverage representation learning (RL) and data dependencies RL for program statements. Those two types of RL on the dynamic information in a code coverage matrix are also combined with the code representation learning on the static information of the usual suspicious source code. This combination is inspired by crime scene investigation in which investigators analyze the crime scene (failed test cases and statements) and related persons (statements with dependencies), and at the same time, examine the usual suspects who have committed a similar crime in the past (similar buggy code in the training data). For the code coverage information, DeepRL4FL first orders the test cases and marks error-exhibiting code statements, expecting that a model can recognize the patterns discriminating between faulty and non-faulty statements/methods. For dependencies among statements, the suspiciousness of a statement is seen taking into account the data dependencies to other statements in execution and data flows, in addition to the statement by itself. Finally, the vector representations for code coverage matrix, data dependencies among statements, and source code are combined and used as the input of a classifier built from a Convolution Neural Network to detect buggy statements/methods. Our empirical evaluation shows that DeepRL4FL improves the top-1 results over the state-of-the-art statement-level FL baselines from 173.1% to 491.7%. It also improves the top-1 results over the existing method-level FL baselines from 15.0% to 206.3%.

Shaohua Wang

What is connected

Connect this record

See the researcher in context

Building this map preview

4 published item(s)

Context-Free Grammar Inference for Complex Programming Languages in Black Box Settings

DEAR: A Novel Deep Learning-based Approach for Automated Program Repair

A Context-based Automated Approach for Method Name Consistency Checking and Suggestion

Fault Localization with Code Coverage Representation Learning