Source author record

Francesco Trovò

Francesco Trovò appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Topics: Machine Learning; Computer Science and Game Theory; Artificial Intelligence

Catalog footprint

What is connected

8 works

3 topics

4 close collaborators

Preprint, 2022, arXiv

ARLO: A Framework for Automated Reinforcement Learning

Automated Reinforcement Learning (AutoRL) is a relatively new area of research that is gaining increasing attention. The objective of AutoRL consists in easing the employment of Reinforcement Learning (RL) techniques for the broader public by alleviating some of its main challenges, including data collection, algorithm selection, and hyper-parameter tuning. In this work, we propose a general and flexible framework, namely ARLO: Automated Reinforcement Learning Optimizer, to construct automated pipelines for AutoRL. Based on this, we propose a pipeline for offline and one for online RL, discussing the components, interaction, and highlighting the difference between the two settings. Furthermore, we provide a Python implementation of such pipelines, released as an open-source library. Our implementation has been tested on an illustrative LQG domain and on classic MuJoCo environments, showing the ability to reach competitive performances requiring limited human intervention. We also showcase the full pipeline on a realistic dam environment, automatically performing the feature selection and the model generation tasks.

Preprint, 2022, arXiv

Last-iterate Convergence to Trembling-hand Perfect Equilibria

Designing efficient algorithms to find Nash equilibrium (NE) refinements in sequential games is of paramount importance in practice. Indeed, it is well known that the NE has several weaknesses, since it may prescribe to play sub-optimal actions in those parts of the game that are never reached at the equilibrium. NE refinements, such as the extensive-form perfect equilibrium (EFPE), amend such weaknesses by accounting for the possibility of players' mistakes. This is crucial in real-world applications, where bounded rationality players are usually involved, and it turns out being useful also in boosting the performances of superhuman agents for recreational games like Poker. Nevertheless, only few works addressed the problem of computing NE refinements. Most of them propose algorithms finding exact NE refinements by means of linear programming, and, thus, these do not have the potential of scaling up to real-world-size games. On the other hand, existing iterative algorithms that exploit the tree structure of sequential games only provide convergence guarantees to approximate refinements. In this paper, we provide the first efficient last-iterate algorithm that provably converges to an EFPE in two-player zero-sum sequential games with imperfect information. Our algorithm works by tracking a sequence of equilibria of suitably-defined, regularized-perturbed games. In order to do that, it uses a procedure that is tailored to converge last-iterate to the equilibria of such games. Crucially, the updates performed by such a procedure can be performed efficiently by visiting the game tree, thus making our algorithm potentially more scalable than its linear-programming-based competitors. Finally, we evaluate our algorithm on a standard testbed of games, showing that it produces strategies which are much more robust to players' mistakes than those of state-of-the-art NE-computation algorithms.

Preprint, 2022, arXiv

Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts

There is a rising interest in industrial online applications where data becomes available sequentially. Inspired by the recommendation of playlists to users where their preferences can be collected during the listening of the entire playlist, we study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward associated with the pull of an arm is partitioned over a finite number of consecutive rounds following the pull. This setting, unexplored so far to the best of our knowledge, is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull instead of being fully disclosed in a single, potentially delayed round. We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW, which exploit the partial information disclosed by the reward collected over time. We show that our algorithms provide better asymptotic regret upper bounds than delayed-feedback bandit algorithms when a property characterizing a broad set of reward structures of practical interest, namely alpha-smoothness, holds. We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.
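
The feedback structure described in this abstract can be illustrated with a minimal sketch. It is not the authors' TP-UCB-FR or TP-UCB-EW algorithm; it only shows how the reward of one pull might be spread over consecutive rounds. The function name `observed_per_round`, the input layout, and the convention that the first piece arrives in the round of the pull itself are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def observed_per_round(
    pulls: Sequence[Tuple[int, Sequence[float]]], horizon: int
) -> List[float]:
    """Return the total reward the learner observes in each round.

    Each pull is (round_of_pull, partial_rewards). The partial rewards are
    revealed one per round, starting (by assumption) in the round of the pull.
    Pieces that would arrive after the horizon are never observed.
    """
    observed = [0.0] * horizon
    for start, parts in pulls:
        for offset, piece in enumerate(parts):
            t = start + offset
            if t < horizon:
                observed[t] += piece
    return observed


# Two pulls whose rewards overlap in time (hypothetical numbers).
example_pulls = [(0, [0.25, 0.25, 0.5]), (1, [0.5, 0.25, 0.25])]
print(observed_per_round(example_pulls, horizon=4))
```

In a delayed-feedback bandit the learner would see nothing until the full reward of a pull is disclosed; here a partial signal is already available in the first round after each pull.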

Preprint, 2020, arXiv

Learning Probably Approximately Correct Maximin Strategies in Simulation-Based Games with Infinite Strategy Spaces

We tackle the problem of learning equilibria in simulation-based games. In such games, the players' utility functions cannot be described analytically, as they are given through a black-box simulator that can be queried to obtain noisy estimates of the utilities. This is the case in many real-world games in which a complete description of the elements involved is not available upfront, such as complex military settings and online auctions. In these situations, one usually needs to run costly simulation processes to get an accurate estimate of the game outcome. As a result, solving these games begets the challenge of designing learning algorithms that can find (approximate) equilibria with high confidence, using as few simulator queries as possible. Moreover, since running the simulator during the game is unfeasible, the algorithms must first perform a pure exploration learning phase and, then, use the (approximate) equilibrium learned this way to play the game. In this work, we focus on two-player zero-sum games with infinite strategy spaces. Drawing from the best arm identification literature, we design two algorithms with theoretical guarantees to learn maximin strategies in these games. The first one works in the fixed-confidence setting, guaranteeing the desired confidence level while minimizing the number of queries. Instead, the second algorithm fits the fixed-budget setting, maximizing the confidence without exceeding the given maximum number of queries. First, we formally prove δ-PAC theoretical guarantees for our algorithms under some regularity assumptions, which are encoded by letting the utility functions be drawn from a Gaussian process. Then, we experimentally evaluate our techniques on a testbed made of randomly generated games and instances representing simple real-world security settings.

Preprint, 2020, arXiv

Online Joint Bid/Daily Budget Optimization of Internet Advertising Campaigns

Pay-per-click advertising includes various formats (e.g., search, contextual, social) with a total investment of more than 200 billion USD per year worldwide. An advertiser is given a daily budget to allocate over several, even thousands of, campaigns, which differ mainly in the ad, target, or channel. Furthermore, publishers choose the ads to display and how to allocate them employing auctioning mechanisms, in which every day the advertisers set for each campaign a bid corresponding to the maximum amount of money per click they are willing to pay and the fraction of the daily budget to invest. In this paper, we study the problem of automating the online joint bid/daily budget optimization of pay-per-click advertising campaigns over multiple channels. We formulate our problem as a combinatorial semi-bandit problem, which requires solving a special case of the Multiple-Choice Knapsack problem every day. Furthermore, for every campaign, we capture the dependency of the number of clicks on the bid and daily budget by Gaussian Processes, thus requiring mild assumptions on the regularity of these functions. We design four algorithms and show that they suffer from a regret that is upper bounded with high probability as $O(\sqrt{T})$, where $T$ is the time horizon of the learning process. We experimentally evaluate our algorithms with synthetic settings generated from real data from Yahoo!, and we present the results of the adoption of our algorithms in a real-world application with a daily average spend of 1,000 Euros for more than one year.
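
The daily optimization step mentioned in this abstract is a Multiple-Choice Knapsack problem: pick exactly one option per campaign so that total cost stays within the budget and total value is maximal. The sketch below is a generic dynamic program over integer budget units, not the authors' formulation. The function name `solve_mckp`, the `(cost, value)` option layout, and the example numbers are hypothetical; in the paper's setting the values would come from learned click estimates.

```python
from typing import List, Sequence, Tuple


def solve_mckp(
    options: Sequence[Sequence[Tuple[int, float]]], budget: int
) -> Tuple[float, List[int]]:
    """Choose one (cost, value) option per campaign with total cost <= budget.

    Costs are non-negative integers (budget units). Returns the best total
    value and the index of the chosen option for each campaign.
    """
    neg_inf = float("-inf")
    # best[b]: best value over the campaigns processed so far using at most b.
    best = [0.0] * (budget + 1)
    picks_per_stage: List[List[int]] = []
    for campaign in options:
        new_best = [neg_inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            for i, (cost, value) in enumerate(campaign):
                if cost <= b and best[b - cost] > neg_inf:
                    candidate = best[b - cost] + value
                    if candidate > new_best[b]:
                        new_best[b] = candidate
                        pick[b] = i
        best = new_best
        picks_per_stage.append(pick)
    if best[budget] == neg_inf:
        raise ValueError("no feasible assignment within the budget")
    # Walk back through the stages to recover the chosen options.
    chosen: List[int] = []
    remaining = budget
    for c in range(len(options) - 1, -1, -1):
        i = picks_per_stage[c][remaining]
        chosen.append(i)
        remaining -= options[c][i][0]
    chosen.reverse()
    return best[budget], chosen


# Two campaigns, three (cost, expected clicks) options each, budget of 3 units.
campaigns = [
    [(0, 0.0), (1, 3.0), (2, 4.0)],
    [(0, 0.0), (1, 2.0), (2, 5.0)],
]
print(solve_mckp(campaigns, budget=3))  # -> (8.0, [1, 2])
```

The running time is proportional to the number of campaigns times the budget resolution times the options per campaign, which is why discretizing bids and budgets keeps a daily solve cheap.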

Preprint, 2016, arXiv

Machine Learning Techniques for Stackelberg Security Games: a Survey

The present survey aims at presenting the current machine learning techniques employed in security games domains. Specifically, we focused on papers and works developed by the Teamcore of University of Southern California, which deepened different directions in this field. After a brief introduction on Stackelberg Security Games (SSGs) and the poaching setting, the rest of the work presents how to model a boundedly rational attacker taking into account her human behavior, then describes how to face the problem of having attacker's payoffs not defined and how to estimate them and, finally, presents how online learning techniques have been exploited to learn a model of the attacker.

Preprint, 2016, arXiv

Unimodal Thompson Sampling for Graph-Structured Arms

We study, to the best of our knowledge, the first Bayesian algorithm for unimodal Multi-Armed Bandit (MAB) problems with graph structure. In this setting, each arm corresponds to a node of a graph and each edge provides a relationship, unknown to the learner, between two nodes in terms of expected reward. Furthermore, for any node of the graph there is a path leading to the unique node providing the maximum expected reward, along which the expected reward is monotonically increasing. Previous results on this setting describe the behavior of frequentist MAB algorithms. In our paper, we design a Thompson Sampling-based algorithm whose asymptotic pseudo-regret matches the lower bound for the considered setting. We show that, as happens in a wide range of scenarios, Bayesian MAB algorithms dramatically outperform frequentist ones. In particular, we provide a thorough experimental evaluation of the performance of our and state-of-the-art algorithms as the properties of the graph vary.
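
To make the idea of exploiting graph structure concrete, here is a minimal sketch of one round of Thompson Sampling restricted to the neighborhood of the empirical leader. It is an illustration under stated assumptions, not the algorithm from the paper: rewards are assumed Bernoulli with Beta(1, 1) priors, the adjacency lists are assumed known to the learner, and the name `neighborhood_thompson_step` is hypothetical.

```python
import random
from typing import Dict, List, Sequence


def neighborhood_thompson_step(
    successes: Sequence[int],
    failures: Sequence[int],
    neighbors: Dict[int, List[int]],
    rng: random.Random,
) -> int:
    """Pick the next arm by sampling only around the empirical leader.

    The leader is the arm with the highest empirical mean (unpulled arms
    count as 0.0). A Beta posterior sample is drawn for the leader and each
    of its neighbours, and the arm with the largest sample is returned.
    """
    def empirical_mean(arm: int) -> float:
        pulls = successes[arm] + failures[arm]
        return successes[arm] / pulls if pulls else 0.0

    leader = max(range(len(successes)), key=empirical_mean)
    candidates = [leader] + list(neighbors[leader])
    samples = {
        arm: rng.betavariate(successes[arm] + 1, failures[arm] + 1)
        for arm in candidates
    }
    return max(candidates, key=lambda arm: samples[arm])


# Path graph 0 - 1 - 2 - 3 - 4 with hypothetical counts; arm 2 is the leader.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
arm = neighborhood_thompson_step(
    [0, 0, 10, 0, 0], [10, 10, 0, 10, 10], path, random.Random(0)
)
print(arm)  # always one of 1, 2, 3
```

Because the expected reward increases along a path to the optimum, restricting attention to the leader's neighbours lets the search climb toward the best node without sampling every arm in each round.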

Preprint, 2014, arXiv

Truthful Learning Mechanisms for Multi-Slot Sponsored Search Auctions with Externalities

Sponsored search auctions constitute one of the most successful applications of microeconomic mechanisms. In mechanism design, auctions are usually designed to incentivize advertisers to bid their truthful valuations and to assure both the advertisers and the auctioneer a non-negative utility. Nonetheless, in sponsored search auctions, the click-through-rates (CTRs) of the advertisers are often unknown to the auctioneer and thus standard truthful mechanisms cannot be directly applied and must be paired with an effective learning algorithm for the estimation of the CTRs. This introduces the critical problem of designing a learning mechanism able to estimate the CTRs at the same time as implementing a truthful mechanism with a revenue loss as small as possible compared to an optimal mechanism designed with the true CTRs. Previous work showed that, when dominant-strategy truthfulness is adopted, in single-slot auctions the problem can be solved using suitable exploration-exploitation mechanisms able to achieve a per-step regret (over the auctioneer's revenue) of order $O(T^{-1/3})$ (where $T$ is the number of times the auction is repeated). It is also known that, when truthfulness in expectation is adopted, a per-step regret (over the social welfare) of order $O(T^{-1/2})$ can be obtained. In this paper we extend the results known in the literature to the case of multi-slot auctions. In this case, a model of the user is needed to characterize how the advertisers' valuations change over the slots. We adopt the cascade model, which is the most famous model in the literature for sponsored search auctions. We prove a number of novel upper bounds and lower bounds both on the auctioneer's revenue loss and social welfare w.r.t. the VCG auction and we report numerical simulations investigating the accuracy of the bounds in predicting the dependency of the regret on the auction parameters.
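
For readers more used to cumulative regret, the per-step rates quoted in this abstract convert by multiplying by the number of repetitions. Writing $R_T$ for the cumulative regret over $T$ repetitions (a symbol introduced here for illustration, not taken from the abstract):

$$
\frac{R_T}{T} = O\!\left(T^{-1/3}\right) \iff R_T = O\!\left(T^{2/3}\right),
\qquad
\frac{R_T}{T} = O\!\left(T^{-1/2}\right) \iff R_T = O\!\left(T^{1/2}\right).
$$

Both cumulative rates are sublinear in $T$, so in each case the average loss per auction vanishes as the auction is repeated.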
