Source author record

Aftab Hussain

Aftab Hussain appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Software Engineering, Machine Learning, Computation and Language, Programming Languages

Catalog footprint

What is connected

4 works

4 topics

4 close collaborators


Preprint, 2022, arXiv

Memorization and Generalization in Neural Code Intelligence Models

Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers.

Preprint, 2022, arXiv

Readle: A Formal Framework for Designing AI-based Edge Systems

With the widespread use of AI-driven systems in the edge (a.k.a. edge intelligence systems), such as autonomous driving vehicles, wearable biotech devices, intelligent manufacturing, etc., such systems are becoming very critical for our day-to-day lives. A challenge in designing edge intelligence systems is that we have to deal with a large number of constraints in two design spaces that form the basis of such systems: the edge design space and the deep learning design space. Thus, in this work, a new systematic, extendable, manual approach, READLE, is proposed for creating representations of specifications in edge intelligent systems, capturing constraints in the edge system design space (e.g., timing constraints and other performance constraints) and constraints in the deep learning space (e.g., model training duration, required level of accuracy) in a coherent fashion. In particular, READLE leverages the benefits of real-time logic and binary decision diagrams to generate unified specifications. Several insights learned in building READLE are also discussed, which should help future research in the domain of formal specifications for edge intelligent systems.

Preprint, 2022, arXiv

Testing the Robustness of a BiLSTM-based Structural Story Classifier

The growing prevalence of counterfeit stories on the internet has fostered significant interest in fast and scalable detection of fake news in the machine learning community. While several machine learning techniques for this purpose have emerged, we observe that there is a need to evaluate the impact of noise on these techniques' performance, where noise constitutes news articles being mistakenly labeled as fake (or real). This work takes a step in that direction, where we examine the impact of noise on a state-of-the-art structural model based on BiLSTM (Bidirectional Long Short-Term Memory) for fake news detection, Hierarchical Discourse-level Structure for Fake News Detection by Karimi and Tang (Reference no. 9).

Preprint, 2016, arXiv

From Query to Usable Code: An Analysis of Stack Overflow Code Snippets

Enriched by natural language texts, Stack Overflow code snippets are an invaluable code-centric knowledge base of small units of source code. Besides being useful for software developers, these annotated snippets can potentially serve as the basis for automated tools that provide working code solutions to specific natural language queries. With the goal of developing automated tools with the Stack Overflow snippets and surrounding text, this paper investigates the following questions: (1) How usable are the Stack Overflow code snippets? and (2) When using text search engines for matching on the natural language questions and answers around the snippets, what percentage of the top results contain usable code snippets? A total of 3M code snippets are analyzed across four languages: C#, Java, JavaScript, and Python. Python and JavaScript proved to be the languages for which the most code snippets are usable. Conversely, Java and C# proved to be the languages with the lowest usability rate. Further qualitative analysis on usable Python snippets shows the characteristics of the answers that solve the original question. Finally, we use Google search to investigate the alignment of usability and the natural language annotations around code snippets, and explore how to make snippets in Stack Overflow an adequate base for future automatic program generation.
