Researcher profile

Francesco Trovò

Francesco Trovò contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2022arXiv

ARLO: A Framework for Automated Reinforcement Learning

Automated Reinforcement Learning (AutoRL) is a relatively new area of research that is gaining increasing attention. The objective of AutoRL consists in easing the employment of Reinforcement Learning (RL) techniques for the broader public by alleviating some of its main challenges, including data collection, algorithm selection, and hyper-parameter tuning. In this work, we propose a general and flexible framework, namely ARLO: Automated Reinforcement Learning Optimizer, to construct automated pipelines for AutoRL. Based on this, we propose a pipeline for offline and one for online RL, discussing the components, interaction, and highlighting the difference between the two settings. Furthermore, we provide a Python implementation of such pipelines, released as an open-source library. Our implementation has been tested on an illustrative LQG domain and on classic MuJoCo environments, showing the ability to reach competitive performances requiring limited human intervention. We also showcase the full pipeline on a realistic dam environment, automatically performing the feature selection and the model generation tasks.

preprint2022arXiv

Last-iterate Convergence to Trembling-hand Perfect Equilibria

Designing efficient algorithms to find Nash equilibrium (NE) refinements in sequential games is of paramount importance in practice. Indeed, it is well known that the NE has several weaknesses, since it may prescribe to play sub-optimal actions in those parts of the game that are never reached at the equilibrium. NE refinements, such as the extensive-form perfect equilibrium (EFPE), amend such weaknesses by accounting for the possibility of players' mistakes. This is crucial in real-world applications, where bounded rationality players are usually involved, and it turns out being useful also in boosting the performances of superhuman agents for recreational games like Poker. Nevertheless, only few works addressed the problem of computing NE refinements. Most of them propose algorithms finding exact NE refinements by means of linear programming, and, thus, these do not have the potential of scaling up to real-world-size games. On the other hand, existing iterative algorithms that exploit the tree structure of sequential games only provide convergence guarantees to approximate refinements. In this paper, we provide the first efficient last-iterate algorithm that provably converges to an EFPE in two-player zero-sum sequential games with imperfect information. Our algorithm works by tracking a sequence of equilibria of suitably-defined, regularized-perturbed games. In order to do that, it uses a procedure that is tailored to converge last-iterate to the equilibria of such games. Crucially, the updates performed by such a procedure can be performed efficiently by visiting the game tree, thus making our algorithm potentially more scalable than its linear-programming-based competitors. Finally, we evaluate our algorithm on a standard testbed of games, showing that it produces strategies which are much more robust to players' mistakes than those of state-of-the-art NE-computation algorithms.

preprint2022arXiv

Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts

There is a rising interest in industrial online applications where data becomes available sequentially. Inspired by the recommendation of playlists to users where their preferences can be collected during the listening of the entire playlist, we study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward associated with the pull of an arm is partitioned over a finite number of consecutive rounds following the pull. This setting, unexplored so far to the best of our knowledge, is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull instead of being fully disclosed in a single, potentially delayed round. We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW, which exploit the partial information disclosed by the reward collected over time. We show that our algorithms provide better asymptotical regret upper bounds than delayed-feedback bandit algorithms when a property characterizing a broad set of reward structures of practical interest, namely alpha-smoothness, holds. We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.

preprint2020arXiv

Learning Probably Approximately Correct Maximin Strategies in Simulation-Based Games with Infinite Strategy Spaces

We tackle the problem of learning equilibria in simulation-based games. In such games, the players' utility functions cannot be described analytically, as they are given through a black-box simulator that can be queried to obtain noisy estimates of the utilities. This is the case in many real-world games in which a complete description of the elements involved is not available upfront, such as complex military settings and online auctions. In these situations, one usually needs to run costly simulation processes to get an accurate estimate of the game outcome. As a result, solving these games begets the challenge of designing learning algorithms that can find (approximate) equilibria with high confidence, using as few simulator queries as possible. Moreover, since running the simulator during the game is unfeasible, the algorithms must first perform a pure exploration learning phase and, then, use the (approximate) equilibrium learned this way to play the game. In this work, we focus on two-player zero-sum games with infinite strategy spaces. Drawing from the best arm identification literature, we design two algorithms with theoretical guarantees to learn maximin strategies in these games. The first one works in the fixed-confidence setting, guaranteeing the desired confidence level while minimizing the number of queries. Instead, the second algorithm fits the fixed-budget setting, maximizing the confidence without exceeding the given maximum number of queries. First, we formally prove δ-PAC theoretical guarantees for our algorithms under some regularity assumptions, which are encoded by letting the utility functions be drawn from a Gaussian process. Then, we experimentally evaluate our techniques on a testbed made of randomly generated games and instances representing simple real-world security settings.

preprint2020arXiv

Online Joint Bid/Daily Budget Optimization of Internet Advertising Campaigns

Pay-per-click advertising includes various formats (\emph{e.g.}, search, contextual, social) with a total investment of more than 200 billion USD per year worldwide. An advertiser is given a daily budget to allocate over several, even thousands, campaigns, mainly distinguishing for the ad, target, or channel. Furthermore, publishers choose the ads to display and how to allocate them employing auctioning mechanisms, in which every day the advertisers set for each campaign a bid corresponding to the maximum amount of money per click they are willing to pay and the fraction of the daily budget to invest. In this paper, we study the problem of automating the online joint bid/daily budget optimization of pay-per-click advertising campaigns over multiple channels. We formulate our problem as a combinatorial semi-bandit problem, which requires solving a special case of the Multiple-Choice Knapsack problem every day. Furthermore, for every campaign, we capture the dependency of the number of clicks on the bid and daily budget by Gaussian Processes, thus requiring mild assumptions on the regularity of these functions. We design four algorithms and show that they suffer from a regret that is upper bounded with high probability as O(sqrt{T}), where T is the time horizon of the learning process. We experimentally evaluate our algorithms with synthetic settings generated from real data from Yahoo!, and we present the results of the adoption of our algorithms in a real-world application with a daily average spent of 1,000 Euros for more than one year.