Researcher profile

Gustavo A. Oliva

Gustavo A. Oliva contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 17 - UnverifiedVerification L1Unclaimed author
4works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2026arXiv

Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study

Large Language Models (LLMs) such as GPT-4, Claude and LLaMA have shown impressive performance in code generation, typically evaluated using benchmarks (e.g., HumanEval). However, effective code generation requires models to understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may yield incomplete. Despite this concern, the representativeness of code concepts in benchmarks has not been systematically examined. To address this gap, we present the first empirical study that analyzes the representativeness of code generation benchmarks through the lens of Knowledge Units (KUs) - cohesive sets of programming language capabilities provided by language constructs and APIs. We analyze KU coverage in two widely used Python benchmarks, HumanEval and MBPP, and compare them with 30 real-world Python projects. Our results show that each benchmark covers only half of the identified 20 KUs, whereas projects exercise all KUs with relatively balanced distributions. In contrast, benchmark tasks exhibit highly skewed KU distributions. To mitigate this misalignment, we propose a prompt-based LLM framework that synthesizes KU-based tasks to rebalance benchmark KU distributions and better align them with real-world usage. Using this framework, we generate 440 new tasks and augment existing benchmarks. The augmented benchmarks substantially improve KU coverage and achieve over a 60% improvement in distributional alignment. Evaluations of state-of-the-art LLMs on these augmented benchmarks reveal consistent and statistically significant performance drops (12.54-44.82%), indicating that existing benchmarks overestimate LLM performance due to their limited KU coverage. Our findings provide actionable guidance for building more realistic evaluations of LLM code-generation capabilities.

preprint2026arXiv

Towards AI-Native Software Engineering (SE 3.0): A Vision and a Challenge Roadmap

The rise of AI-assisted software engineering (SE 2.0), powered by Foundation Models (FMs) and FM-powered coding assistants, has shown promise in improving developer productivity. However, it has also exposed inherent limitations, such as cognitive overload on developers and inefficiencies. We propose a shift towards Software Engineering 3.0 (SE 3.0), an AI-native approach characterized by intent-centric, conversation-oriented development between human developers and AI teammates. SE 3.0 envisions AI systems evolving beyond task-driven copilots into intelligent collaborators, capable of deeply understanding and reasoning about software engineering principles and intents. We outline the key components of the SE 3.0 technology stack, which includes Teammate.next for adaptive and personalized AI partnership, IDE.next for intent-centric conversation-oriented development, Compiler.next for multi-objective code synthesis, and Runtime.next for SLA-aware execution with edge-computing support. Our vision addresses the inefficiencies and cognitive strain of SE 2.0 by fostering a symbiotic relationship between human developers and AI, maximizing their complementary strengths. We also present a roadmap of challenges that must be overcome to realize our vision of SE 3.0. This paper lays the foundation for future discussions on the role of AI in the next era of software engineering.

preprint2022arXiv

Is my transaction done yet? An empirical study of transaction processing times in the Ethereum Blockchain Platform

Ethereum is one of the most popular platforms for the development of blockchain-powered applications. These applications are known as Dapps. When engineering Dapps, developers need to translate requests captured in the front-end of their application into one or more smart contract transactions. Developers need to pay for these transactions and, the more they pay (i.e., the higher the gas price), the faster the transaction is likely to be processed. Therefore developers need to optimize the balance between cost (transaction fees) and user experience (transaction processing times). Online services have been developed to provide transaction issuers (e.g., Dapp developers) with an estimate of how long transactions will take to be processed given a certain gas price. These estimation services are crucial in the Ethereum domain and several popular wallets such as Metamask rely on them. However, their accuracy has not been empirically investigated so far. In this paper, we quantify the transaction processing times in Ethereum, investigate the relationship between processing times and gas prices, and determine the accuracy of state-of-the-practice estimation services. We find that transactions are processed in a median of 57s and that 90% of the transactions are processed within 8m. The higher gas prices result in faster transaction processing times with diminishing returns. In particular, we observe no practical difference in processing time between expensive and very expensive transactions. In terms of accuracy of processing time estimation services, we note that they are equivalent. However, when stratifying transactions by gas prices, Etherscan's Gas Tracker is the most accurate estimation service for very cheap and cheap transaction. EthGasStation's Gas Price API, in turn, is the most accurate estimation service for regular, expensive, and very expensive transactions.

preprint2022arXiv

What makes Ethereum blockchain transactions be processed fast or slow? An empirical study

The Ethereum platform allows developers to implement and deploy applications called Dapps onto the blockchain for public use through the use of smart contracts. To execute code within a smart contract, a paid transaction must be issued towards one of the functions that are exposed in the interface of a contract. However, such a transaction is only processed once one of the miners in the peer-to-peer network selects it, adds it to a block, and appends that block to the blockchain This creates a delay between transaction submission and code execution. It is crucial for Dapp developers to be able to precisely estimate when transactions will be processed, since this allows them to define and provide a certain Quality of Service (QoS) level (e.g., 95% of the transactions processed within 1 minute). However, the impact that different factors have on these times have not yet been studied. Processing time estimation services are used by Dapp developers to achieve predefined QoS. Yet, these services offer minimal insights into what factors impact processing times. Considering the vast amount of data that surrounds the Ethereum blockchain, changes in processing times are hard for Dapp developers to predict, making it difficult to maintain said QoS. In our study, we build random forest models to understand the factors that are associated with transaction processing times. We engineer several features that capture blockchain internal factors, as well as gas pricing behaviors of transaction issuers. By interpreting our models, we conclude that features surrounding gas pricing behaviors are very strongly associated with transaction processing times. Based on our empirical results, we provide Dapp developers with concrete insights that can help them provide and maintain high levels of QoS.