Source author record

Linzhang Wang

Linzhang Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Software Engineering, Machine Learning, Programming Languages, Logic in Computer Science

Catalog footprint

What is connected

4 works

4 topics

4 close collaborators

Preprint, 2022, arXiv

WheaCha: A Method for Explaining the Predictions of Models of Code

Attribution methods have emerged as a popular approach to interpreting model predictions based on the relevance of input features. Although a feature importance ranking can provide insight into how a model arrives at a prediction from a raw input, it does not give a clear-cut definition of the key features the model uses for that prediction. In this paper, we present a new method, called WheaCha, for explaining the predictions of code models. Although WheaCha employs the same mechanism of tracing model predictions back to the input features, it differs from all existing attribution methods in crucial ways. Specifically, for any prediction of a learned code model, WheaCha divides an input program into "wheat" (i.e., the defining features that are the reason the model predicts the label it predicts) and the rest, "chaff". We realize WheaCha in a tool, HuoYan, and use it to explain four prominent code models: code2vec, seq-GNN, GGNN, and CodeBERT. Results show that (1) HuoYan is efficient, taking on average under twenty seconds to compute the wheat for an input program in an end-to-end fashion (i.e., including model prediction time); (2) the wheat that all models use to predict input programs is made of simple syntactic or even lexical properties (i.e., identifier names); and (3) based on wheat, we present a novel approach to explaining the predictions of code models through the lens of training data.

Preprint, 2020, arXiv

Learning a Static Bug Finder from Data

We present an alternative approach to creating static bug finders. Instead of relying on human expertise, we utilize deep neural networks to train static analyzers directly from data. In particular, we frame the problem of bug finding as a classification task and train a classifier to differentiate buggy from non-buggy programs using a Graph Neural Network (GNN). Crucially, we propose a novel interval-based propagation mechanism that leads to a significantly more efficient, accurate and scalable generalization of GNN. We have realized our approach in a framework, NeurSA, and extensively evaluated it. In a cross-project prediction task, three neural bug detectors we instantiate from NeurSA are effective in catching null pointer dereference, array index out of bounds and class cast bugs in unseen code. We compare NeurSA against several static analyzers (e.g., Facebook Infer and Pinpoint) on a set of null pointer dereference bugs. Results show that NeurSA is more precise in catching the real bugs and suppressing the spurious warnings. We also apply NeurSA to several popular Java projects on GitHub and discover 50 new bugs, among which 9 have been fixed and 3 have been confirmed.

Preprint, 2020, arXiv

Learning Semantic Program Embeddings with Graph Interval Neural Network

Learning distributed representations of source code has been a challenging task for machine learning models. Earlier works treated programs as text so that natural language methods could be readily applied. Unfortunately, such approaches do not capitalize on the rich structural information possessed by source code. More recently, the Graph Neural Network (GNN) was proposed to learn embeddings of programs from their graph representations. Due to its homogeneous and expensive message-passing procedure, a GNN can suffer from precision issues, especially when dealing with programs rendered into large graphs. In this paper, we present a new graph neural architecture, called Graph Interval Neural Network (GINN), to tackle the weaknesses of the existing GNN. Unlike the standard GNN, GINN generalizes from a curated graph representation obtained through an abstraction method designed to help models learn. In particular, GINN focuses exclusively on intervals for mining the feature representation of a program. Furthermore, GINN operates on a hierarchy of intervals to scale the learning to large graphs. We evaluate GINN on two popular downstream applications: variable misuse prediction and method name prediction. Results show that in both cases GINN outperforms the state-of-the-art models by a comfortable margin. We have also created a neural bug detector based on GINN to catch null pointer dereference bugs in Java code. While learning from the same 9,000 methods extracted from 64 projects, the GINN-based bug detector significantly outperforms the GNN-based bug detector on 13 unseen test projects. Next, we deploy our trained GINN-based bug detector and Facebook Infer to scan the codebases of 20 highly starred projects on GitHub. Through manual inspection, we confirm 38 bugs out of 102 warnings raised by the GINN-based bug detector, compared to 34 bugs out of 129 warnings for Facebook Infer.

Preprint, 2011, arXiv

Online Verification of Control Parameter Calculations in Communication Based Train Control System

The Communication Based Train Control (CBTC) system is the state-of-the-art train control system. In a CBTC system, to guarantee the safety of train operation, trains communicate with each other intensively and adjust their control modes autonomously by computing critical control parameters, e.g., velocity range, according to the information they receive. As the correctness of the generated control parameters is critical to the safety of the system, a method to verify these parameters is highly desirable in the area of train control systems. In this paper, we present our ideas on how to model and verify the control parameter calculations in a CBTC system efficiently.

- As the behavior of the system is highly nondeterministic, it is difficult to build and verify the complete behavior space model of the system online in advance. Thus, we propose to model the system according to the ongoing behavior model induced by the control parameters.
- As the parameters are generated online and updated very quickly, the verification result will be meaningless if it is given beyond the time bound, since by that time the model will already have changed. Thus, we propose a method to quickly verify online the existence of certain dangerous scenarios in the model.

To demonstrate the feasibility of these proposed approaches, we present composed linear hybrid automata with readable shared variables as a modeling language for the control parameter calculations and give a path-oriented reachability analysis technique for the scenario-based verification of this model. We demonstrate the model built for the CBTC system and show the performance of our technique in fast online verification. Last but not least, as a CBTC system is a typical cyber-physical system (CPS), we also give a short discussion of potential directions for CPS verification.