Researcher profile

Rasul Tutunov

Rasul Tutunov contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2026arXiv

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha > 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising. We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. Across reasoning benchmarks, APPS improves the accuracy-runtime trade-off of training-free decoding and suggests that part of the gap to post-trained systems can be recovered through more faithful inference-time power approximation.

preprint2022arXiv

HEBO Pushing The Limits of Sample-Efficient Hyperparameter Optimisation

In this work we rigorously analyse assumptions inherent to black-box optimisation hyper-parameter tuning tasks. Our results on the Bayesmark benchmark indicate that heteroscedasticity and non-stationarity pose significant challenges for black-box optimisers. Based on these findings, we propose a Heteroscedastic and Evolutionary Bayesian Optimisation solver (HEBO). HEBO performs non-linear input and output warping, admits exact marginal log-likelihood optimisation and is robust to the values of learned parameters. We demonstrate HEBO's empirical efficacy on the NeurIPS 2020 Black-Box Optimisation challenge, where HEBO placed first. Upon further analysis, we observe that HEBO significantly outperforms existing black-box optimisers on 108 machine learning hyperparameter tuning tasks comprising the Bayesmark benchmark. Our findings indicate that the majority of hyper-parameter tuning tasks exhibit heteroscedasticity and non-stationarity, multi-objective acquisition ensembles with Pareto front solutions improve queried configurations, and robust acquisition maximisers afford empirical advantages relative to their non-robust counterparts. We hope these findings may serve as guiding principles for practitioners of Bayesian optimisation. All code is made available at https://github.com/huawei-noah/HEBO.

preprint2022arXiv

Sample-Efficient Optimisation with Probabilistic Transformer Surrogates

Faced with problems of increasing complexity, recent research in Bayesian Optimisation (BO) has focused on adapting deep probabilistic models as flexible alternatives to Gaussian Processes (GPs). In a similar vein, this paper investigates the feasibility of employing state-of-the-art probabilistic transformers in BO. Upon further investigation, we observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation. First, we notice that these models are trained on uniformly distributed inputs, which impairs predictive accuracy on non-uniform data - a setting arising from any typical BO loop due to exploration-exploitation trade-offs. Second, we realise that training losses (e.g., cross-entropy) only asymptotically guarantee accurate posterior approximations, i.e., after arriving at the global optimum, which generally cannot be ensured. At the stationary points of the loss function, however, we observe a degradation in predictive performance especially in exploratory regions of the input space. To tackle these shortcomings we introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance. In a large panel of experiments, we demonstrate, for the first time, that one transformer pre-trained on data sampled from random GP priors produces competitive results on 16 benchmark black-boxes compared to GP-based BO. Since our model is only pre-trained once and used in all tasks without any retraining and/or fine-tuning, we report an order of magnitude time-reduction, while matching and sometimes outperforming GPs.

preprint2022arXiv

Self-consistent Gradient-like Eigen Decomposition in Solving Schrödinger Equations

The Schrödinger equation is at the heart of modern quantum mechanics. Since exact solutions of the ground state are typically intractable, standard approaches approximate Schrödinger equation as forms of nonlinear generalized eigenvalue problems $F(V)V = SVΛ$ in which $F(V)$, the matrix to be decomposed, is a function of its own top-$k$ smallest eigenvectors $V$, leading to a "self-consistency problem". Traditional iterative methods heavily rely on high-quality initial guesses of $V$ generated via domain-specific heuristics methods based on quantum mechanics. In this work, we eliminate such a need for domain-specific heuristics by presenting a novel framework, Self-consistent Gradient-like Eigen Decomposition (SCGLED) that regards $F(V)$ as a special "online data generator", thus allows gradient-like eigendecomposition methods in streaming $k$-PCA to approach the self-consistency of the equation from scratch in an iterative way similar to online learning. With several critical numerical improvements, SCGLED is robust to initial guesses, free of quantum-mechanism-based heuristics designs, and neat in implementation. Our experiments show that it not only can simply replace traditional heuristics-based initial guess methods with large performance advantage (achieved averagely 25x more precise than the best baseline in similar wall time), but also is capable of finding highly precise solutions independently without any traditional iterative methods.

preprint2021arXiv

Efficient Semi-Implicit Variational Inference

In this paper, we propose CI-VI an efficient and scalable solver for semi-implicit variational inference (SIVI). Our method, first, maps SIVI's evidence lower bound (ELBO) to a form involving a nonlinear functional nesting of expected values and then develops a rigorous optimiser capable of correctly handling bias inherent to nonlinear nested expectations using an extrapolation-smoothing mechanism coupled with gradient sketching. Our theoretical results demonstrate convergence to a stationary point of the ELBO in general non-convex settings typically arising when using deep network models and an order of $O(t^{-\frac{4}{5}})$ gradient-bias-vanishing rate. We believe these results generalise beyond the specific nesting arising from SIVI to other forms. Finally, in a set of experiments, we demonstrate the effectiveness of our algorithm in approximating complex posteriors on various data-sets including those from natural language processing.

preprint2020arXiv

$α^α$-Rank: Practically Scaling $α$-Rank through Stochastic Optimisation

Recently, $α$-Rank, a graph-based algorithm, has been proposed as a solution to ranking joint policy profiles in large scale multi-agent systems. $α$-Rank claimed tractability through a polynomial time implementation with respect to the total number of pure strategy profiles. Here, we note that inputs to the algorithm were not clearly specified in the original presentation; as such, we deem complexity claims as not grounded, and conjecture solving $α$-Rank is NP-hard. The authors of $α$-Rank suggested that the input to $α$-Rank can be an exponentially-sized payoff matrix; a claim promised to be clarified in subsequent manuscripts. Even though $α$-Rank exhibits a polynomial-time solution with respect to such an input, we further reflect additional critical problems. We demonstrate that due to the need of constructing an exponentially large Markov chain, $α$-Rank is infeasible beyond a small finite number of agents. We ground these claims by adopting amount of dollars spent as a non-refutable evaluation metric. Realising such scalability issue, we present a stochastic implementation of $α$-Rank with a double oracle mechanism allowing for reductions in joint strategy spaces. Our method, $α^α$-Rank, does not need to save exponentially-large transition matrix, and can terminate early under required precision. Although theoretically our method exhibits similar worst-case complexity guarantees compared to $α$-Rank, it allows us, for the first time, to practically conduct large-scale multi-agent evaluations. On $10^4 \times 10^4$ random matrices, we achieve $1000x$ speed reduction. Furthermore, we also show successful results on large joint strategy profiles with a maximum size in the order of $\mathcal{O}(2^{25})$ ($\approx 33$ million joint strategies) -- a setting not evaluable using $α$-Rank with reasonable computational budget.

preprint2020arXiv

Compositional ADAM: An Adaptive Compositional Solver

In this paper, we present C-ADAM, the first adaptive solver for compositional problems involving a non-linear functional nesting of expected values. We proof that C-ADAM converges to a stationary point in $\mathcal{O}(δ^{-2.25})$ with $δ$ being a precision parameter. Moreover, we demonstrate the importance of our results by bridging, for the first time, model-agnostic meta-learning (MAML) and compositional optimisation showing fastest known rates for deep network adaptation to-date. Finally, we validate our findings in a set of experiments from portfolio optimisation and meta-learning. Our results manifest significant sample complexity reductions compared to both standard and compositional solvers.