Source author record

Rui Shu

Rui Shu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Software Engineering Cryptography and Security Computation and Language Computer Vision cond-mat.mtrl-sci Digital Libraries eess.SY Systems and Control

Catalog footprint

What is connected

12works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue

Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced (since the percentage of observed vulnerability is usually very low). Goal: To help security practitioners address software security data class imbalanced issues and further help build better prediction models with resampled datasets. Method: We introduce an approach called Dazzle which is an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with a novel optimizer called Bayesian Optimization. We use Dazzle to generate minority class samples to resample the original imbalanced training dataset. Results: We evaluate Dazzle with three software security datasets, i.e., Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and demonstrates promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., with an average of about 60% improvement rate over SMOTE in recall among all datasets). Conclusion: Based on this study, we would suggest the use of optimized GANs as an alternative method for security vulnerability data class imbalanced issues.

preprint2022arXiv

Do I really need all this work to find vulnerabilities? An empirical case study comparing vulnerability detection techniques on a Java application

CONTEXT: Applying vulnerability detection techniques is one of many tasks using the limited resources of a software project. OBJECTIVE: The goal of this research is to assist managers and other decision-makers in making informed choices about the use of software vulnerability detection techniques through an empirical study of the efficiency and effectiveness of four techniques on a Java-based web application. METHOD: We apply four different categories of vulnerability detection techniques \textendash~ systematic manual penetration testing (SMPT), exploratory manual penetration testing (EMPT), dynamic application security testing (DAST), and static application security testing (SAST) \textendash\ to an open-source medical records system. RESULTS: We found the most vulnerabilities using SAST. However, EMPT found more severe vulnerabilities. With each technique, we found unique vulnerabilities not found using the other techniques. The efficiency of manual techniques (EMPT, SMPT) was comparable to or better than the efficiency of automated techniques (DAST, SAST) in terms of Vulnerabilities per Hour (VpH). CONCLUSIONS: The vulnerability detection technique practitioners should select may vary based on the goals and available resources of the project. If the goal of an organization is to find "all" vulnerabilities in a project, they need to use as many techniques as their resources allow.

preprint2022arXiv

Predicting Health Indicators for Open Source Projects (using Hyperparameter Optimization)

Software developed on public platform is a source of data that can be used to make predictions about those projects. While the individual developing activity may be random and hard to predict, the developing behavior on project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like $k$-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted, using recent data for predicting multiple health indicators of open-source projects.

preprint2022arXiv

Solid-state Janus nanoprecipitation enables amorphous-like heat conduction in crystalline Mg3Sb2-based thermoelectric materials

Solid-state precipitation can be used to tailor materials properties, ranging from ferromagnets and catalysts to mechanical strengthening and energy storage. Thermoelectric properties can be modified by precipitation to enhance phonon scattering while retaining charge-carrier transmission. Here, we uncover unconventional dual Janus-type nanoprecipitates in Mg3Sb1.5Bi0.5 formed by side-by-side Bi- and Ge-rich appendages, in contrast to separate nanoprecipitate formation. These Janus nanoprecipitates result from local co-melting of Bi and Ge during sintering, enabling an amorphous-like lattice thermal conductivity. A precipitate size effect on phonon scattering is observed due to the balance between alloy-disorder and nanoprecipitate scattering. The thermoelectric figure-of-merit ZT reaches 0.6 near room temperature and 1.6 at 773 K. The Janus nanoprecipitation can be introduced into other materials and may act as a general property-tailoring mechanism.

preprint2021arXiv

Anytime Sampling for Autoregressive Models via Ordered Autoencoding

Autoregressive models are widely used for tasks such as image and audio generation. The sampling process of these models, however, does not allow interruptions and cannot adapt to real-time computational resources. This challenge impedes the deployment of powerful autoregressive models, which involve a slow sampling process that is sequential in nature and typically scales linearly with respect to the data dimension. To address this difficulty, we propose a new family of autoregressive models that enables anytime sampling. Inspired by Principal Component Analysis, we learn a structured representation space where dimensions are ordered based on their importance with respect to reconstruction. Using an autoregressive model in this latent space, we trade off sample quality for computational efficiency by truncating the generation process before decoding into the original data space. Experimentally, we demonstrate in several image and audio generation tasks that sample quality degrades gracefully as we reduce the computational budget for sampling. The approach suffers almost no loss in sample quality (measured by FID) using only 60\% to 80\% of all latent dimensions for image data. Code is available at https://github.com/Newbeeer/Anytime-Auto-Regressive-Model .

preprint2021arXiv

Structuring a Comprehensive Software Security Course Around the OWASP Application Security Verification Standard

Lack of security expertise among software practitioners is a problem with many implications. First, there is a deficit of security professionals to meet current needs. Additionally, even practitioners who do not plan to work in security may benefit from increased understanding of security. The goal of this paper is to aid software engineering educators in designing a comprehensive software security course by sharing an experience running a software security course for the eleventh time. Through all the eleven years of running the software security course, the course objectives have been comprehensive - ranging from security testing, to secure design and coding, to security requirements to security risk management. For the first time in this eleventh year, a theme of the course assignments was to map vulnerability discovery to the security controls of the Open Web Application Security Project (OWASP) Application Security Verification Standard (ASVS). Based upon student performance on a final exploratory penetration testing project, this mapping may have increased students' depth of understanding of a wider range of security topics. The students efficiently detected 191 unique and verified vulnerabilities of 28 different Common Weakness Enumeration (CWE) types during a three-hour period in the OpenMRS project, an electronic health record application in active use.

preprint2020arXiv

Fair Generative Modeling via Weak Supervision

Real-world datasets are often biased with respect to key demographic factors such as race and gender. Due to the latent nature of the underlying factors, detecting and mitigating bias is especially challenging for unsupervised machine learning. We present a weakly supervised algorithm for overcoming dataset bias for deep generative models. Our approach requires access to an additional small, unlabeled reference dataset as the supervision signal, thus sidestepping the need for explicit labels on the underlying bias factors. Using this supplementary dataset, we detect the bias in existing datasets via a density ratio technique and learn generative models which efficiently achieve the twin goals of: 1) data efficiency by using training examples from both biased and reference datasets for learning; and 2) data generation close in distribution to the reference dataset at test time. Empirically, we demonstrate the efficacy of our approach which reduces bias w.r.t. latent factors by an average of up to 34.6% over baselines for comparable image generation using generative adversarial networks.

preprint2020arXiv

Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Many real-world sequential decision-making problems can be formulated as optimal control with high-dimensional observations and unknown dynamics. A promising approach is to embed the high-dimensional observations into a lower-dimensional latent representation space, estimate the latent dynamics model, then utilize this model for control in the latent space. An important open question is how to learn a representation that is amenable to existing control algorithms? In this paper, we focus on learning representations for locally-linear control algorithms, such as iterative LQR (iLQR). By formulating and analyzing the representation learning problem from an optimal control perspective, we establish three underlying principles that the learned representation should comprise: 1) accurate prediction in the observation space, 2) consistency between latent and observation space dynamics, and 3) low curvature in the latent space transitions. These principles naturally correspond to a loss function that consists of three terms: prediction, consistency, and curvature (PCC). Crucially, to make PCC tractable, we derive an amortized variational bound for the PCC loss function. Extensive experiments on benchmark domains demonstrate that the new variational-PCC learning algorithm benefits from significantly more stable and reproducible training, and leads to superior control performance. Further ablation studies give support to the importance of all three PCC components for learning a good latent space for control.

preprint2020arXiv

Predictive Coding for Locally-Linear Control

High-dimensional observations and unknown dynamics are major challenges when applying optimal control to many real-world decision making tasks. The Learning Controllable Embedding (LCE) framework addresses these challenges by embedding the observations into a lower dimensional latent space, estimating the latent dynamics, and then performing control directly in the latent space. To ensure the learned latent dynamics are predictive of next-observations, all existing LCE approaches decode back into the observation space and explicitly perform next-observation prediction---a challenging high-dimensional task that furthermore introduces a large number of nuisance parameters (i.e., the decoder) which are discarded during control. In this paper, we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. We then use predictive coding to develop a decoder-free LCE model whose latent dynamics are amenable to locally-linear control. Extensive experiments on benchmark tasks show that our model reliably learns a controllable latent space that leads to superior performance when compared with state-of-the-art LCE baselines.

preprint2020arXiv

Sequential Model Optimization for Software Process Control

Many methods have been proposed to estimate how much effort is required to build and maintain software. Much of that research assumes a ``classic'' waterfall-based approach rather than contemporary projects (where the developing process may be more iterative than linear in nature). Also, much of that work tries to recommend a single method-- an approach that makes the dubious assumption that one method can handle the diversity of software project data. To address these drawbacks, we apply a configuration technique called ``ROME'' (Rapid Optimizing Methods for Estimation), which uses sequential model-based optimization (SMO) to find what combination of effort estimation techniques works best for a particular data set. We test this method using data from 1161 classic waterfall projects and 120 contemporary projects (from Github). In terms of magnitude of relative error and standardized accuracy, we find that ROME achieves better performance than existing state-of-the-art methods for both classic and contemporary problems. In addition, we conclude that we should not recommend one method for estimation. Rather, it is better to search through a wide range of different methods to find what works best for local data. To the best of our knowledge, this is the largest effort estimation experiment yet attempted and the only one to test its methods on classic and contemporary projects.

preprint2020arXiv

Weakly Supervised Disentanglement with Guarantees

Learning disentangled representations that correspond to factors of variation in real-world data is critical to interpretable and human-controllable machine learning. Recently, concerns about the viability of learning disentangled representations in a purely unsupervised manner has spurred a shift toward the incorporation of weak supervision. However, there is currently no formalism that identifies when and how weak supervision will guarantee disentanglement. To address this issue, we provide a theoretical framework to assist in analyzing the disentanglement guarantees (or lack thereof) conferred by weak supervision when coupled with learning algorithms based on distribution matching. We empirically verify the guarantees and limitations of several weak supervision methods (restricted labeling, match-pairing, and rank-pairing), demonstrating the predictive power and usefulness of our theoretical framework.

preprint2014arXiv

Automated Attribution and Intertextual Analysis

In this work, we employ quantitative methods from the realm of statistics and machine learning to develop novel methodologies for author attribution and textual analysis. In particular, we develop techniques and software suitable for applications to Classical study, and we illustrate the efficacy of our approach in several interesting open questions in the field. We apply our numerical analysis techniques to questions of authorship attribution in the case of the Greek tragedian Euripides, to instances of intertextuality and influence in the poetry of the Roman statesman Seneca the Younger, and to cases of "interpolated" text with respect to the histories of Livy.

Rui Shu

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue

Do I really need all this work to find vulnerabilities? An empirical case study comparing vulnerability detection techniques on a Java application

Predicting Health Indicators for Open Source Projects (using Hyperparameter Optimization)

Solid-state Janus nanoprecipitation enables amorphous-like heat conduction in crystalline Mg3Sb2-based thermoelectric materials

Anytime Sampling for Autoregressive Models via Ordered Autoencoding

Structuring a Comprehensive Software Security Course Around the OWASP Application Security Verification Standard

Fair Generative Modeling via Weak Supervision

Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Predictive Coding for Locally-Linear Control

Sequential Model Optimization for Software Process Control

Weakly Supervised Disentanglement with Guarantees

Automated Attribution and Intertextual Analysis