Source author record

Assaf Hallak

Assaf Hallak appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning · Artificial Intelligence

Catalog footprint

What is connected

7 works

2 topics

4 close collaborators

Preprint · 2026 · arXiv

From Actions to Words: Towards Abstractive-Textual Policy Summarization in RL

Explaining reinforcement learning agents is challenging because policies emerge from complex reward structures and neural representations that are difficult for humans to interpret. Existing approaches often rely on curated demonstrations that expose local behaviors but provide limited insight into an agent's global strategy, leaving users to infer intent from raw observations. We propose SySLLM (Synthesized Summary using Large Language Models), a framework that reframes policy interpretation as a language-generation problem. Instead of visual demonstrations, SySLLM converts spatiotemporal trajectories into structured text and prompts an LLM to generate coherent summaries describing the agent's goals, exploration style, and decision patterns. SySLLM scales to long-horizon, semantically rich environments without task-specific fine-tuning, leveraging LLM world knowledge and compositional reasoning to capture latent behavioral structure across policies. Expert evaluations show strong alignment with human analyses, and a large-scale user study found that 75.5% of participants preferred SySLLM summaries over state-of-the-art demonstration-based explanations. Together, these results position abstractive textual summarization as a paradigm for interpreting complex RL behavior.

Preprint · 2023 · arXiv

Planning and Learning with Adaptive Lookahead

Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.

Preprint · 2015 · arXiv

Contextual Markov Decision Processes

We consider a planning problem where the dynamics and rewards of the environment depend on a hidden static parameter referred to as the context. The objective is to learn a strategy that maximizes the accumulated reward across all contexts. The new model, called Contextual Markov Decision Process (CMDP), can model a customer's behavior when interacting with a website (the learner). The customer's behavior depends on gender, age, location, device, etc. Based on that behavior, the website objective is to determine customer characteristics, and to optimize the interaction between them. Our work focuses on one basic scenario--finite horizon with a small known number of possible contexts. We suggest a family of algorithms with provable guarantees that learn the underlying models and the latent contexts, and optimize the CMDPs. Bounds are obtained for specific naive implementations, and extensions of the framework are discussed, laying the ground for future research.

Preprint · 2015 · arXiv

Emphatic TD Bellman Operator is a Contraction

Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation that underlies ETD involves a contraction operator, with a $\sqrt{γ}$-contraction modulus (where $γ$ is the discount factor). This allows us to provide error bounds on the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.
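As a reading aid, the contraction claim in the abstract above can be written out as follows. This is only a sketch: the symbols $T$ (the operator), $\Pi$ (the projection), $V$ (the true value function), $\Phi\theta^*$ (the fixed point of the projected equation) and the norm are illustrative names, not taken from this record, and the second line additionally assumes that $\Pi$ is an orthogonal projection in the same norm in which $T$ contracts. The note itself should be consulted for the exact norm and constants.

```latex
% A \sqrt{\gamma}-contraction: applying T shrinks distances by at least \sqrt{\gamma}.
\| T v - T u \| \;\le\; \sqrt{\gamma}\, \| v - u \| \qquad \text{for all } v, u .

% Standard consequence for a projected fixed point \Phi\theta^* = \Pi T \Phi\theta^*,
% assuming V = T V and \Pi non-expansive (orthogonal) in the same norm:
\| \Phi\theta^* - V \| \;\le\; \frac{1}{\sqrt{1-\gamma}}\, \| \Pi V - V \| .
```

The second inequality follows from the Pythagorean identity $\|\Phi\theta^* - V\|^2 = \|\Pi T \Phi\theta^* - \Pi T V\|^2 + \|\Pi V - V\|^2$ together with the first inequality, which bounds the first term by $\gamma \|\Phi\theta^* - V\|^2$.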

Preprint · 2015 · arXiv

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton et al., 2015), which encompasses the original ETD($λ$), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD($λ$, $β$), where our introduced parameter $β$ controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD($λ$, $β$) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD($λ$, $β$). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling $β$, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.

Preprint · 2015 · arXiv

Off-policy evaluation for MDPs with unknown structure

Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use. But how can we prove superiority without testing the new policy? To answer this question, we introduce the G-SCOPE algorithm that evaluates a new policy based on data generated by the existing policy. Our algorithm is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment. We present a finite sample analysis of our approach and show through experiments that the algorithm scales well on high-dimensional problems with few samples.
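To make the off-policy evaluation setting concrete, here is a minimal sketch of the problem the abstract describes: estimating a new policy's value from data generated by the existing one. It is not G-SCOPE and uses none of its factored-structure learning; it is plain per-trajectory importance sampling, a textbook baseline. The function name, the `(state, action, reward)` trajectory layout and the policy-probability callables are all hypothetical choices made for this illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout: a trajectory is a sequence of (state, action, reward) steps.
Step = Tuple[int, int, float]


def importance_sampling_estimate(
    trajectories: Sequence[Sequence[Step]],
    target_prob: Callable[[int, int], float],
    behavior_prob: Callable[[int, int], float],
    gamma: float = 1.0,
) -> float:
    """Estimate the target policy's expected return from behavior-policy data.

    Each trajectory's discounted return is reweighted by the product of
    per-step probability ratios target_prob / behavior_prob, then averaged.
    Assumes behavior_prob(s, a) > 0 for every logged (s, a) pair.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0    # cumulative importance ratio for this trajectory
        ret = 0.0       # discounted return of this trajectory
        discount = 1.0
        for state, action, reward in trajectory:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)
```

For example, with a uniform behavior policy over two actions and a target policy that always takes action 1, the one-step trajectories `[(0, 1, 1.0)]` and `[(0, 0, 0.0)]` get weights 2 and 0, so the estimate is 1.0. The product of ratios is what makes this estimator high-variance on long horizons, which is the kind of sample inefficiency that structure-exploiting methods aim to reduce.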

Preprint · 2012 · arXiv

How to sample if you must: on optimal functional sampling

We examine a fundamental problem that models various active sampling setups, such as network tomography. We analyze sampling of a multivariate normal distribution with an unknown expectation that needs to be estimated: in our setup it is possible to sample the distribution from a given set of linear functionals, and the difficulty addressed is how to optimally select the combinations to achieve low estimation error. Although this problem is in the heart of the field of optimal design, no efficient solutions for the case with many functionals exist. We present some bounds and an efficient sub-optimal solution for this problem for more structured sets such as binary functionals that are induced by graph walks.
