Source author record

Houari Sahraoui

Houari Sahraoui appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Software Engineering, Machine Learning, Artificial Intelligence, Computation and Language, Programming Languages

Catalog footprint

What is connected

9 works

5 topics

4 close collaborators

Preprint, 2023, arXiv

On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code

Pre-trained language models (PLMs) have become a prevalent technique in deep learning for code, utilizing a two-stage pre-training and fine-tuning procedure to acquire general knowledge about code and specialize in a variety of downstream tasks. However, the dynamic nature of software codebases poses a challenge to the effectiveness and robustness of PLMs. In particular, real-world scenarios potentially lead to significant differences between the distribution of the pre-training and test data, i.e., distribution shift, resulting in a degradation of the PLM's performance on downstream tasks. In this paper, we stress the need for adapting PLMs of code to software data whose distribution changes over time, a crucial problem that has been overlooked in previous works. The motivation of this work is to consider the PLM in a non-stationary environment, where fine-tuning data evolves over time according to a software evolution scenario. Specifically, we design a scenario where the model needs to learn from a stream of programs containing new, unseen APIs over time. We study two widely used PLM architectures, i.e., a GPT2 decoder and a RoBERTa encoder, on two downstream tasks, API call and API usage prediction. We demonstrate that the most commonly used fine-tuning technique from prior work is not robust enough to handle the dynamic nature of APIs, leading to the loss of previously acquired knowledge, i.e., catastrophic forgetting. To address these issues, we implement five continual learning approaches, including replay-based and regularization-based methods. Our findings demonstrate that utilizing these straightforward methods effectively mitigates catastrophic forgetting in PLMs across both downstream tasks while achieving comparable or superior performance.
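To make the replay-based idea concrete, here is a minimal sketch of one generic variant: a fixed-size buffer filled by reservoir sampling, whose contents are mixed into each new fine-tuning batch. This is not the paper's implementation. The names `ReplayBuffer` and `training_batches`, the string stand-ins for training examples, and the choice of reservoir sampling are all assumptions made for illustration.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past examples (hypothetical helper, not from the paper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        # Reservoir sampling: every example seen so far has the same
        # probability of being in the buffer.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def training_batches(stream, buffer, replay_size):
    """Yield each new batch extended with examples replayed from earlier batches."""
    for task_batch in stream:
        replayed = buffer.sample(replay_size)
        yield task_batch + replayed
        for example in task_batch:
            buffer.add(example)


# Strings stand in for programs that use progressively newer APIs.
stream = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
buffer = ReplayBuffer(capacity=3)
batches = list(training_batches(stream, buffer, replay_size=2))
```

In an actual fine-tuning loop, each yielded batch would be passed to the optimizer step, so that older API usages keep contributing to the loss while new ones arrive.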

Preprint, 2022, arXiv

AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models

The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the linguistic properties of hidden vector spaces, previous works have shown that these pre-trained language models encode simple linguistic properties in their hidden representations. However, no previous work has assessed whether these models encode the whole grammatical structure of a programming language. In this paper, we prove the existence of a syntactic subspace, lying in the hidden representations of pre-trained language models, which contains the syntactic information of the programming language. We show that this subspace can be extracted from the models' representations and define a novel probing method, the AST-Probe, that enables recovering the whole abstract syntax tree (AST) of an input code snippet. In our experiments, we show that this syntactic subspace exists in five state-of-the-art pre-trained language models. In addition, we highlight that the middle layers of the models are the ones that encode most of the AST information. Finally, we estimate the optimal size of this syntactic subspace and show that its dimension is substantially lower than those of the models' representation spaces. This suggests that pre-trained language models use a small part of their representation spaces to encode syntactic information of the programming languages.

Preprint, 2022, arXiv

Code Sophistication: From Code Recommendation to Logic Recommendation

A typical approach to programming is to first code the main execution scenario, and then focus on filling out alternative behaviors and corner cases. But, almost always, there exist unusual conditions that trigger atypical behaviors, which are hard to predict in program specifications, and are thus often not coded. In this paper, we consider the problem of detecting and recommending such missing behaviors, a task that we call code sophistication. Previous research on coding assistants usually focuses on recommending code fragments based on specifications of the intended behavior. In contrast, code sophistication happens in the absence of a specification, aiming to help developers complete the logic of their programs with missing and unspecified behaviors. We outline the research challenges to this problem and present early results showing how program logic can be completed by leveraging code structure and information about the usage of input parameters.

Preprint, 2022, arXiv

Recommending Metamodel Concepts during Modeling Activities with Pre-Trained Language Models

The design of conceptually sound metamodels that embody proper semantics in relation to the application domain is particularly tedious in Model-Driven Engineering. As metamodels define complex relationships between domain concepts, it is crucial for a modeler to define these concepts thoroughly while being consistent with respect to the application domain. We propose an approach to assist a modeler in the design of a metamodel by recommending relevant domain concepts in several modeling scenarios. Our approach does not require extracting knowledge from the domain or hand-designing completion rules. Instead, we design a fully data-driven approach using a deep learning model that is able to abstract domain concepts by learning from both structural and lexical metamodel properties in a corpus of thousands of independent metamodels. We evaluate our approach on a test set containing 166 metamodels, unseen during the model training, with more than 5000 test samples. Our preliminary results show that the trained model is able to provide accurate top-5 lists of relevant recommendations for concept renaming scenarios. Although promising, the results are less compelling for the scenario of the iterative construction of the metamodel, in part because of the conservative strategy we use to evaluate the recommendations.

Preprint, 2022, arXiv

Social Diversity for ATL Repair

Model transformations play an essential role in the Model-Driven Engineering paradigm. Writing a correct transformation program requires proficiency with the source and target modeling languages, a clear understanding of the mapping between the elements of the two, and mastery of the transformation language to properly describe the transformation. Transformation programs are thus complex and error-prone, and finding and fixing errors in such programs typically involves a tedious and time-consuming effort by developers. In this paper, we propose a novel search-based approach to automatically repair transformation programs containing many semantic errors. To prevent the fitness plateaus and the single fitness peak limitations, we leverage the notion of social diversity to promote repair patches tackling errors that are less covered by the other patches of the population. We evaluate our approach on 71 semantically incorrect transformation programs written in ATL, and containing up to five semantic errors simultaneously. The evaluation shows that integrating social diversity when searching for repair patches improves the quality of those patches and speeds up convergence even when up to five semantic errors are involved.

Preprint, 2016, arXiv

Automated Inference of Software Library Usage Patterns

Modern software systems are increasingly dependent on third-party libraries. It is widely recognized that using mature and well-tested third-party libraries can improve developers' productivity, reduce time-to-market, and produce more reliable software. Today's open-source repositories provide a wide range of libraries that can be freely downloaded and used. However, as software libraries are documented separately but intended to be used together, developers are unlikely to fully take advantage of these reuse opportunities. In this paper, we present a novel approach to automatically identify third-party library usage patterns, i.e., collections of libraries that are commonly used together by developers. Our approach employs a hierarchical clustering technique to group together software libraries based on external client usage. To evaluate our approach, we mined a large set of over 6,000 popular libraries from the Maven Central Repository and investigated their usage by over 38,000 client systems from GitHub. Our experiments show that our technique is able to detect the majority (77%) of highly consistent and cohesive library usage patterns across a considerable number of client systems.
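As a rough illustration of grouping libraries by client usage, the sketch below runs a small average-linkage agglomerative clustering over Jaccard distances between the sets of clients that use each library. It is not the authors' algorithm. The usage data, the `max_distance` cut-off, and the function names are assumptions chosen for the example.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of client identifiers."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_libraries(usage, max_distance):
    """Agglomerative, average-linkage clustering of libraries.

    usage maps a library name to the set of clients that use it.
    Merging stops once the closest pair of clusters is farther apart
    than max_distance (a hypothetical cut-off).
    """
    clusters = [frozenset([lib]) for lib in sorted(usage)]

    def linkage(c1, c2):
        dists = [jaccard_distance(usage[x], usage[y]) for x in c1 for y in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > max_distance:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Made-up usage data: which client systems depend on which library.
usage = {
    "junit": {"app1", "app2", "app3"},
    "mockito": {"app1", "app2", "app3"},
    "slf4j": {"app4", "app5"},
    "logback": {"app4", "app5", "app6"},
    "guava": {"app7"},
}
patterns = cluster_libraries(usage, max_distance=0.5)
print(patterns)
```

Libraries that share most of their clients end up in the same group, which is the intuition behind a usage pattern. A library with no overlapping clients stays in a cluster of its own.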

Preprint, 2016, arXiv

Mining Software Components from Object-Oriented APIs

Object-oriented Application Programming Interfaces (APIs) support software reuse by providing pre-implemented functionalities. Due to the huge number of included classes, reusing and understanding large APIs is a complex task. In contrast, software components are acknowledged to be more reusable and understandable entities than object-oriented ones. Thus, in this paper, we propose an approach for reengineering object-oriented APIs into component-based ones. We mine components as groups of classes based on how frequently they are used together and their ability to form a quality-centric component. To validate our approach, we experimented on 100 Java applications that used Android APIs.

Preprint, 2016, arXiv

Recovering Architectural Variability of a Family of Product Variants

A Software Product Line (SPL) aims at applying a pre-planned systematic reuse of large-grained software artifacts to increase software productivity and reduce development cost. The idea of SPL is to analyze the business domain of a family of products to identify the common and the variable parts between the products. However, it is common for companies to develop, in an ad-hoc manner (e.g., clone and own), a set of products that share common functionalities and differ in terms of others. Thus, many recent research contributions have been proposed to re-engineer existing product variants into an SPL. Nevertheless, these contributions are mostly focused on managing the variability at the requirement level. Very few contributions address the variability at the architectural level despite its major importance. Starting from this observation, we propose, in this paper, an approach to reverse engineer the architecture of a set of product variants. Our goal is to identify the variability and dependencies among architectural-element variants at the architectural level. Our work relies on Formal Concept Analysis (FCA) to analyze the variability. To validate the proposed approach, we experimented on two families of open-source product variants: Mobile Media and Health Watcher. The results show that our approach is able to identify the architectural variability and the dependencies.
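For readers unfamiliar with Formal Concept Analysis, the following sketch enumerates the formal concepts of a tiny context by brute force. Product variants are the objects and architectural elements are the attributes. The variant and element names are invented, and the exhaustive enumeration is only practical for very small contexts. It is not the approach evaluated in the paper.

```python
from itertools import combinations


def formal_concepts(context):
    """Return all (extent, intent) pairs of a formal context.

    context maps each object (product variant) to its set of attributes
    (architectural elements). A concept is a maximal set of objects
    together with exactly the attributes they all share.
    """
    objects = sorted(context)
    all_attrs = set().union(*context.values())
    concepts = set()
    for r in range(len(objects) + 1):
        for group in combinations(objects, r):
            # Intent: attributes shared by every object in the group.
            intent = set(all_attrs)
            for obj in group:
                intent &= context[obj]
            # Extent: every object that has all attributes of the intent.
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, frozenset(intent)))
    return concepts


# Hypothetical variants and the components each one contains.
context = {
    "V1": {"Core", "Photo"},
    "V2": {"Core", "Photo", "Music"},
    "V3": {"Core", "Music", "Video"},
}
concepts = formal_concepts(context)

# Elements shared by all variants are candidates for mandatory components.
mandatory = next(intent for extent, intent in concepts if len(extent) == len(context))
```

In this toy context, the concept whose extent covers every variant has the intent {"Core"}, which is the common part. The remaining concepts show which elements vary together across variants.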

Preprint, 2015, arXiv

Modeling and Analyzing Release Trajectory based on the Process of Issue Tracking

The software release development process, which we refer to as "release trajectory", involves development activities that are usually sorted into different categories, such as incorporating new features, improving software, or fixing bugs, and associated with "issues". Release trajectory management is a difficult and crucial task. Managers must be aware of every aspect of the development process for managing the software-related issues. Issue Tracking Systems (ITS) play a central role in supporting the management of release trajectory. These systems, which support reporting and tracking issues of different kinds (such as "bug", "feature", "improvement", etc.), record rich data about the software development process. Yet, recorded historical data in ITS are still not well-modeled for supporting practical needs of release trajectory management. In this paper, we describe a sequence analysis approach for modeling and analyzing releases' trajectories, using the tracking process of reported issues. Release trajectory analysis is based on the categories of tracked issues and their changes over time, and aims to address important questions regarding the co-habitation of unresolved issues, the transitions between different statuses in release trajectory, the recurrent patterns of release trajectories, and the properties of a release trajectory.
