Researcher profile

Claudio Zito

Claudio Zito contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
9 works · 0 followers · 7 topics · 4 close collaborators

Published work

9 published item(s)

preprint · 2026 · arXiv

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.

preprint · 2022 · arXiv

Direct Mutation and Crossover in Genetic Algorithms Applied to Reinforcement Learning Tasks

Neuroevolution has recently been shown to be quite competitive in reinforcement learning (RL) settings, and is able to alleviate some of the drawbacks of gradient-based approaches. This paper focuses on applying neuroevolution using a simple genetic algorithm (GA) to find the weights of a neural network that produce optimally behaving agents. In addition, we present two novel modifications that improve the data efficiency and speed of convergence when compared to the initial implementation. The modifications are evaluated on the FrozenLake environment provided by OpenAI Gym and prove to be significantly better than the baseline approach.

preprint · 2022 · arXiv

Dyna-T: Dyna-Q and Upper Confidence Bounds Applied to Trees

In this work we present a preliminary investigation of a novel algorithm called Dyna-T. In reinforcement learning (RL) a planning agent has its own representation of the environment as a model. To discover an optimal policy to interact with the environment, the agent collects experience in a trial-and-error fashion. Experience can be used to learn a better model or to directly improve the value function and policy. While these two uses are typically treated separately, Dyna-Q is a hybrid approach which, at each iteration, exploits the real experience to update the model as well as the value function, while planning its action using simulated data from its model. However, the planning process is computationally expensive and strongly depends on the dimensionality of the state-action space. We propose to build an Upper Confidence Tree (UCT) on the simulated experience and search for the best action to be selected during the on-line learning process. We demonstrate the effectiveness of our proposed method in a set of preliminary tests on three testbed environments from OpenAI. In contrast to Dyna-Q, Dyna-T outperforms state-of-the-art RL agents in the stochastic environments by choosing a more robust action selection strategy.
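As a rough illustration of the upper-confidence-bound idea behind UCT, the sketch below scores each action with the standard UCB1 rule and picks the highest-scoring one. It is not the Dyna-T implementation: the function names, the `stats` dictionary layout, and the default exploration constant are assumptions made for this example only.

```python
import math


def ucb1_score(mean_value, parent_visits, child_visits, c=math.sqrt(2)):
    """UCB1 score: estimated value plus an exploration bonus.

    The bonus shrinks as an action is tried more often, so rarely
    visited actions keep getting a chance to be explored.
    """
    if child_visits == 0:
        return math.inf  # untried actions are always selected first
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def select_action(stats, c=math.sqrt(2)):
    """Pick the action with the highest UCB1 score.

    `stats` maps action -> (mean_value, visit_count); this layout is a
    hypothetical choice for the sketch, not taken from the paper.
    """
    parent_visits = sum(count for _, count in stats.values())
    return max(
        stats,
        key=lambda a: ucb1_score(stats[a][0], parent_visits, stats[a][1], c),
    )
```

With `c=0` the rule reduces to greedy selection on the mean value; larger values of `c` favour actions that have been simulated less often.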

preprint · 2022 · arXiv

Grasping and Manipulation with a Multi-Fingered Hand

This thesis is concerned with deriving planning algorithms for robot manipulators. Manipulation has two effects: the robot has a physical effect on the object, and it also acquires information about the object. This thesis presents algorithms that treat both problems. First, I present an extension of the well-known piano mover's problem where a robot pushing an object must plan its movements as well as those of the object. This requires simultaneous planning in the joint space of the robot and the configuration space of the object, in contrast to the original problem which only requires planning in the latter space. The effects of a robot action on the object configuration are determined by the non-invertible rigid body mechanics. Second, I consider planning under uncertainty and in particular planning for information effects. I consider the case where a robot has to reach and grasp an object under pose uncertainty caused by shape incompleteness. The approach presented in this report is to study and possibly extend a new approach to artificial intelligence (A.I.) which has emerged in recent years in response to the necessity of building intelligent controllers for agents operating in unstructured stochastic environments. Such agents require the ability to learn an optimal action-selection behaviour by interacting with their environment. The main issue is that real-world problems are usually dynamic and unpredictable. Thus, the agent needs to constantly update its current image of the world using its sensors, which provide only a noisy description of the surrounding environment. Although there are different schools of thought, each with its own set of techniques, a brand new direction which unifies much A.I. research is to formalise such agent/environment interactions as embedded systems with stochastic dynamics.

preprint · 2022 · arXiv

Multi-Hypothesis Scan Matching through Clustering

Graph-SLAM is a well-established algorithm for constructing a topological map of the environment while simultaneously attempting the localisation of the robot. It relies on scan matching algorithms to align noisy observations along the robot's movements to compute an estimate of the robot's current location. We propose a fundamentally different approach to scan matching tasks to improve the estimation of roto-translation displacements and therefore the performance of the full SLAM algorithm. A Monte-Carlo approach is used to generate weighted hypotheses of the geometrical displacement between two scans, and then we cluster these hypotheses to compute the displacement that results in the best alignment. To cope with clustering on roto-translations, we propose a novel clustering approach that robustly extends Gaussian Mean-Shift to orientations by factorizing the kernel density over the roto-translation components. We demonstrate the effectiveness of our method in an extensive set of experiments using both synthetic data and the Intel Research Lab's benchmarking datasets. The results confirm that our approach outperforms the state-of-the-art iterative point-based scan matching algorithms in terms of matching accuracy and runtime.

preprint · 2022 · arXiv

Underwater Object Classification and Detection: first results and open challenges

This work reviews the problem of object detection in underwater environments. We analyse and quantify the shortcomings of conventional state-of-the-art (SOTA) algorithms in the computer vision community when applied to this challenging environment, and provide insights and general guidelines for future research efforts. First, we assess whether pretraining with the conventional ImageNet is beneficial when the object detector needs to be applied to environments that may be characterised by a different feature distribution. We then investigate whether two-stage detectors yield better performance than single-stage detectors, in terms of accuracy, intersection over union (IoU), floating-point operations per second (FLOPS), and inference time. Finally, we assess the generalisation capability of each model to a lower quality dataset to simulate performance in a real scenario, in which harsher conditions ought to be expected. Our experimental results provide evidence that underwater object detection requires searching for "ad-hoc" architectures rather than merely training SOTA architectures on new data, and that pretraining is not beneficial.

preprint · 2020 · arXiv

Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech

In this study, we present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems. Unlike conventional techniques for speaker adaptation, which attempt to improve the accuracy of the segmentation using acoustic models that are more robust in the face of the speaker's characteristics, we aim to use only context-dependent characteristics extrapolated with linguistic analysis techniques. In simple terms, we use the intuitive idea that context-dependent information is tightly correlated with the related acoustic waveform. We propose a statistical model, which predicts correcting values to reduce the systematic error produced by a state-of-the-art Hidden Markov Model (HMM) based speech segmentation. Our approach consists of two phases: (1) identifying context-dependent phonetic unit classes (for instance, the class which identifies vowels as being the nucleus of monosyllabic words); and (2) building a regression model that associates the mean error value made by the ASR during the segmentation of a single-speaker corpus to each class. The success of the approach is evaluated by comparing the corrected boundaries of units and the state-of-the-art HMM segmentation against a reference alignment, which is supposed to be the optimal solution. In conclusion, our work supplies a first analysis of a model sensitive to speaker-dependent characteristics, robust to defective and noisy information, and a very simple implementation which could be utilized as an alternative to either more expensive speaker-adaptation systems or numerous manual correction sessions.
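The two-phase idea in this abstract can be pictured with a deliberately simplified sketch: estimate the mean signed boundary error for each context-dependent class, then subtract that offset from new HMM boundaries of the same class. This stands in for the paper's regression model and is not its implementation; the function names, the class labels, and the use of integer milliseconds are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_class_offsets(samples):
    """Estimate the mean signed boundary error per unit class.

    `samples` is an iterable of (unit_class, hmm_boundary_ms,
    reference_boundary_ms) tuples; this layout is hypothetical.
    """
    errors = defaultdict(list)
    for unit_class, hmm_ms, ref_ms in samples:
        errors[unit_class].append(hmm_ms - ref_ms)
    return {cls: sum(errs) / len(errs) for cls, errs in errors.items()}


def correct_boundary(unit_class, hmm_boundary_ms, offsets):
    """Shift an HMM boundary by the mean error learned for its class.

    Classes that were never seen during fitting are left unchanged.
    """
    return hmm_boundary_ms - offsets.get(unit_class, 0.0)
```

For example, fitting on a few aligned boundaries and then calling `correct_boundary("vowel_nucleus", 3030, offsets)` moves the boundary earlier by whatever average lag the HMM segmentation showed for that class.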

preprint · 2019 · arXiv

Let's Push Things Forward: A Survey on Robot Pushing

As robots make their way out of factories into human environments, outer space, and beyond, they require the skill to manipulate their environment in multifarious, unforeseeable circumstances. In this regard, pushing is an essential motion primitive that dramatically extends a robot's manipulation repertoire. In this work, we review the robotic pushing literature. While focusing on work concerned with predicting the motion of pushed objects, we also cover relevant applications of pushing for planning and control. Beginning with analytical approaches, under which we also subsume physics engines, we then proceed to discuss work on learning models from data. In doing so, we dedicate a separate section to deep learning approaches which have seen a recent upsurge in the literature. Concluding remarks and further research perspectives are given at the end of the paper.