Researcher profile

Matthieu Zimmer

Matthieu Zimmer contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
6 works
0 followers
3 topics
4 close collaborators

Published work

6 published items

preprint | 2026 | arXiv

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha > 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising. We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. Across reasoning benchmarks, APPS improves the accuracy-runtime trade-off of training-free decoding and suggests that part of the gap to post-trained systems can be recovered through more faithful inference-time power approximation.
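To make the idea of a "future-dependent correction factor" concrete, the following is a minimal sketch on a toy two-token model, not the APPS algorithm itself. The vocabulary, the probabilities, and all function names are hypothetical and chosen only for illustration. It compares the exact prefix marginal of the sequence-level power target p(x)^alpha with the naive choice of raising each next-token distribution to the power alpha.

```python
from itertools import product

# Hypothetical toy model over two-token sequences (illustration only).
ALPHA = 4.0
P_FIRST = {"a": 0.6, "b": 0.4}
P_SECOND = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.95, "b": 0.05}}


def sequence_prob(seq: str) -> float:
    """Base-model probability of a full two-token sequence."""
    return P_FIRST[seq[0]] * P_SECOND[seq[0]][seq[1]]


def power_target(alpha: float) -> dict:
    """Exact sequence-level power distribution, proportional to p(x)^alpha."""
    weights = {
        x1 + x2: sequence_prob(x1 + x2) ** alpha
        for x1, x2 in product("ab", repeat=2)
    }
    total = sum(weights.values())
    return {seq: w / total for seq, w in weights.items()}


def prefix_marginal_power(alpha: float) -> dict:
    """First-token marginal under the sequence-level power target."""
    target = power_target(alpha)
    return {t: sum(p for s, p in target.items() if s[0] == t) for t in "ab"}


def prefix_tokenwise(alpha: float) -> dict:
    """First-token distribution when only the local distribution is tempered."""
    weights = {t: P_FIRST[t] ** alpha for t in "ab"}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


if __name__ == "__main__":
    print("power target prefix marginal:", prefix_marginal_power(ALPHA))
    print("token-wise tempered prefix:  ", prefix_tokenwise(ALPHA))
```

In this toy case the token-wise rule favors the prefix "a", while the sequence-level power target favors "b", because "b" leads to a single high-probability continuation. The difference between the two is exactly the future-dependent factor that a practical approximation has to estimate.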

preprint | 2022 | arXiv

A Survey on Interpretable Reinforcement Learning

Although deep reinforcement learning has become a promising machine learning approach for sequential decision-making problems, it is still not mature enough for high-stakes domains such as autonomous driving or medical applications. In such contexts, a learned policy needs, for instance, to be interpretable, so that it can be inspected before any deployment (e.g., for safety and verifiability reasons). This survey provides an overview of various approaches to achieve higher interpretability in reinforcement learning (RL). To that end, we distinguish interpretability (as a property of a model) and explainability (as a post-hoc operation, with the intervention of a proxy) and discuss them in the context of RL with an emphasis on the former notion. In particular, we argue that interpretable RL may embrace different facets: interpretable inputs, interpretable (transition/reward) models, and interpretable decision-making. Based on this scheme, we summarize and analyze recent work related to interpretable RL with an emphasis on papers published in the past 10 years. We also briefly discuss some related research areas and point to some promising research directions.

preprint | 2022 | arXiv

Sample-Efficient Optimisation with Probabilistic Transformer Surrogates

Faced with problems of increasing complexity, recent research in Bayesian Optimisation (BO) has focused on adapting deep probabilistic models as flexible alternatives to Gaussian Processes (GPs). In a similar vein, this paper investigates the feasibility of employing state-of-the-art probabilistic transformers in BO. Upon further investigation, we observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation. First, we notice that these models are trained on uniformly distributed inputs, which impairs predictive accuracy on non-uniform data - a setting arising from any typical BO loop due to exploration-exploitation trade-offs. Second, we realise that training losses (e.g., cross-entropy) only asymptotically guarantee accurate posterior approximations, i.e., after arriving at the global optimum, which generally cannot be ensured. At the stationary points of the loss function, however, we observe a degradation in predictive performance especially in exploratory regions of the input space. To tackle these shortcomings we introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance. In a large panel of experiments, we demonstrate, for the first time, that one transformer pre-trained on data sampled from random GP priors produces competitive results on 16 benchmark black-boxes compared to GP-based BO. Since our model is only pre-trained once and used in all tasks without any retraining and/or fine-tuning, we report an order of magnitude time-reduction, while matching and sometimes outperforming GPs.

preprint | 2020 | arXiv

Invariant Transform Experience Replay: Data Augmentation for Deep Reinforcement Learning

Deep Reinforcement Learning (RL) is a promising approach for adaptive robot control, but its application to robotics is currently hindered by high sample requirements. To alleviate this issue, we propose to exploit the symmetries present in robotic tasks. Intuitively, symmetries from observed trajectories define transformations that leave the space of feasible RL trajectories invariant and can be used to generate new feasible trajectories, which could be used for training. Based on this data augmentation idea, we formulate a general framework, called Invariant Transform Experience Replay, that we present with two techniques: (i) Kaleidoscope Experience Replay, which exploits reflectional symmetries, and (ii) Goal-augmented Experience Replay, which takes advantage of lax goal definitions. In the Fetch tasks from OpenAI Gym, our experimental results show significant increases in learning rates and success rates. In particular, we attain 13, 3, and 5 times speedups in the pushing, sliding, and pick-and-place tasks, respectively, in the multi-goal setting. Performance gains are also observed in similar tasks with obstacles, and we successfully deployed a trained policy on a real Baxter robot. Our work demonstrates that invariant transformations on RL trajectories are a promising methodology to speed up learning in deep RL.
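The reflection-based augmentation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Transition` layout, the choice of the plane y = 0 as the symmetry plane, and the treatment of actions as Cartesian displacements are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec = Tuple[float, float, float]


@dataclass(frozen=True)
class Transition:
    """Hypothetical replay-buffer entry for a goal-conditioned task."""
    position: Vec
    action: Vec        # assumed to be a Cartesian displacement
    next_position: Vec
    goal: Vec
    reward: float


def reflect_y(v: Vec) -> Vec:
    """Mirror a 3D vector across the plane y = 0 (assumed symmetry plane)."""
    return (v[0], -v[1], v[2])


def reflect_transition(t: Transition) -> Transition:
    """Mirror every spatial quantity; the reward is assumed invariant."""
    return Transition(
        position=reflect_y(t.position),
        action=reflect_y(t.action),
        next_position=reflect_y(t.next_position),
        goal=reflect_y(t.goal),
        reward=t.reward,
    )


def augment(buffer: List[Transition]) -> List[Transition]:
    """Return the original transitions followed by their mirrored copies."""
    return list(buffer) + [reflect_transition(t) for t in buffer]


if __name__ == "__main__":
    t = Transition((0.1, 0.2, 0.3), (0.0, 0.05, 0.0), (0.1, 0.25, 0.3),
                   (0.4, -0.1, 0.3), -1.0)
    for item in augment([t]):
        print(item)
```

The sketch only holds when the task really is symmetric about the chosen plane; if the workspace, obstacles, or reward break that symmetry, the mirrored transitions would no longer be feasible trajectories.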

preprint | 2020 | arXiv

Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards

As the operations of autonomous systems generally affect several users simultaneously, it is crucial that their designs account for fairness considerations. In contrast to standard (deep) reinforcement learning (RL), we investigate the problem of learning a policy that treats its users equitably. In this paper, we formulate this novel RL problem, in which an objective function, which encodes a notion of fairness that we formally define, is optimized. For this problem, we provide a theoretical discussion where we examine the case of discounted rewards and that of average rewards. During this analysis, we notably derive a new result in the standard RL setting, which is of independent interest: it states a novel bound on the approximation error with respect to the optimal average reward of that of a policy optimal for the discounted reward. Since learning with discounted rewards is generally easier, this discussion further justifies finding a fair policy for the average reward by learning a fair policy for the discounted reward. Thus, we describe how several classic deep RL algorithms can be adapted to our fair optimization problem, and we validate our approach with extensive experiments in three different domains.
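The abstract does not name the fairness objective it optimizes. As one illustrative possibility, and purely as an assumption, the sketch below uses a generalized Gini welfare function: per-user returns are sorted in increasing order and combined with non-increasing weights, so the worst-off user counts most. The function name and the example weights are hypothetical.

```python
from typing import Sequence


def generalized_gini(utilities: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of utilities sorted from worst-off to best-off.

    `weights` is assumed to be non-increasing, so that lower utilities
    receive larger weights and equitable outcomes are preferred.
    """
    if len(utilities) != len(weights):
        raise ValueError("utilities and weights must have the same length")
    ordered = sorted(utilities)  # worst-off user first
    return sum(w * u for w, u in zip(weights, ordered))


if __name__ == "__main__":
    weights = [1.0, 0.5]
    # Both allocations have the same total return (4.0) across two users.
    print(generalized_gini([1.0, 3.0], weights))  # unequal split
    print(generalized_gini([2.0, 2.0], weights))  # equal split scores higher
```

With equal totals, the equal split scores higher than the unequal one, which is the kind of behavior a fairness-encoding objective is meant to reward. A policy-learning method would then maximize such a score over the vector of per-user returns instead of a single scalar return.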

preprint | 2019 | arXiv

Exploiting the Sign of the Advantage Function to Learn Deterministic Policies in Continuous Domains

In the context of learning deterministic policies in continuous domains, we revisit an approach, which was first proposed in Continuous Actor Critic Learning Automaton (CACLA) and later extended in Neural Fitted Actor Critic (NFAC). This approach is based on a policy update different from that of deterministic policy gradient (DPG). Previous work has observed its excellent performance empirically, but a theoretical justification is lacking. To fill this gap, we provide a theoretical explanation to motivate this unorthodox policy update by relating it to another update and making explicit the objective function of the latter. We furthermore discuss in depth the properties of these updates to get a deeper understanding of the overall approach. In addition, we extend it and propose a new trust region algorithm, Penalized NFAC (PeNFAC). Finally, we experimentally demonstrate in several classic control problems that it surpasses the state-of-the-art algorithms to learn deterministic policies.
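The sign-based policy update that this abstract revisits can be sketched in a few lines. This is a simplified, CACLA-style illustration with a hypothetical one-dimensional linear actor, not the PeNFAC algorithm: the actor is moved toward an explored action only when the estimated advantage is positive, and the magnitude of the advantage is ignored.

```python
def cacla_actor_update(weight: float, state: float, action: float,
                       advantage: float, lr: float = 0.1) -> float:
    """One CACLA-style update for a linear actor mu(s) = weight * s.

    Only the sign of `advantage` is used: if it is positive, the actor
    output is regressed toward the explored action; otherwise the actor
    is left unchanged. All names here are illustrative assumptions.
    """
    if advantage <= 0.0:
        return weight
    predicted = weight * state
    # Gradient step on the squared error 0.5 * (action - predicted)^2.
    return weight + lr * (action - predicted) * state


if __name__ == "__main__":
    w = 0.0
    w = cacla_actor_update(w, state=1.0, action=2.0, advantage=0.5)
    print(w)  # moved toward the explored action
    w = cacla_actor_update(w, state=1.0, action=-3.0, advantage=-1.0)
    print(w)  # unchanged: the explored action was worse than expected
```

By contrast, a deterministic policy gradient update would follow the gradient of the critic with respect to the action, so its step size depends on the critic's shape and not only on whether the explored action turned out better than expected.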