Source author record

Krzysztof Choromanski

Krzysztof Choromanski appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning, math.CO, Artificial Intelligence, Neural and Evolutionary Computing, Robotics, Computer Vision, Data Structures and Algorithms, math.OC, Computation and Language, Databases, Discrete Mathematics, math.CA, math.DS, Numerical Analysis, Programming Languages

Catalog footprint

What is connected

39 works

15 topics

4 close collaborators

preprint · 2026 · arXiv

Evaluating Gemini Robotics Policies in a Veo World Simulator

Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.

preprint · 2026 · arXiv

RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

We present a new class of efficient attention mechanisms applying universal 3D Relative Positional Encoding (RPE) methods given by arbitrary integrable modulation functions $f$. They lead to a new class of 3D-Transformer models, called RelFlexformers, flexibly integrating those RPEs, and characterized by the $O(L \log L)$ time complexity of the attention computation for input sequences of length $L$. RelFlexformers build on the theory of the Non-Uniform Fourier Transform (NU-FFT), naturally generalizing several existing efficient RPE-attention methods from structured settings with tokens homogeneously embedded in unweighted grids into general non-structured heterogeneous scenarios, where tokens' positions are arbitrarily distributed in the corresponding 3D spaces. As such, RelFlexformers can be applied in particular to model point clouds. Our extensive empirical evaluation on a large portfolio of 3D datasets confirms quality improvements provided by the NU-FFT-driven attention modulation techniques in the RelFlexformers.

preprint · 2022 · arXiv

Chefs' Random Tables: Non-Trigonometric Random Features

We introduce chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is to our knowledge the first RF method for unbiased softmax kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied in OPRFs provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity.

preprint · 2022 · arXiv

Hybrid Random Features

We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide the most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees: unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct exhaustive empirical evaluation of HRFs, ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure, to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating their quality in a wide spectrum of machine learning problems.
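
As an illustration of one of the special cases named in the abstract above, the following is a minimal sketch of positive random features for the softmax kernel exp(x.y). It relies on the identity E[exp(w.x - |x|^2/2) exp(w.y - |y|^2/2)] = exp(x.y) for standard Gaussian w. The function names, the feature count and the seed are illustrative assumptions, and this is not the hybrid mechanism itself.

```python
import math
import random


def positive_random_features(x, omegas):
    """Map x to strictly positive features exp(w.x - |x|^2 / 2) / sqrt(m)."""
    sq_norm = sum(v * v for v in x)
    m = len(omegas)
    return [
        math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - sq_norm / 2.0)
        / math.sqrt(m)
        for w in omegas
    ]


def estimate_softmax_kernel(x, y, num_features=20000, seed=0):
    """Monte Carlo estimate of exp(x.y) as a dot product of feature maps.

    num_features and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    d = len(x)
    # Each omega is a standard Gaussian vector shared by both inputs.
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]
    phi_x = positive_random_features(x, omegas)
    phi_y = positive_random_features(y, omegas)
    return sum(a * b for a, b in zip(phi_x, phi_y))
```

Because every feature is an exponential, the estimate can never be negative, which is the property the abstract highlights for linear-attention Transformers.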

preprint · 2022 · arXiv

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot, i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.

preprint · 2021 · arXiv

CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices

We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs. As in earlier work, we parametrize an orthogonal matrix as a product of Householder reflections. However, to overcome low parallelization capabilities of computing Householder reflections sequentially, we propose employing an accumulation scheme called the compact WY (or CWY) transform -- a compact parallelization-friendly matrix representation for the series of Householder reflections. We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization which has a competitive complexity and, again, yields benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a stationary point of the training objective when coupled with stochastic gradient descent. We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.
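
The compact WY idea in the abstract above can be sketched on small matrices. The sketch assumes the standard compact WY identity H_1 H_2 ... H_L = I - U S^{-1} U^T, where the columns of U are the normalized Householder vectors and S = I/2 + strictly_upper(U^T U), and checks it against the sequential product. It uses plain Python lists for self-containment; the helper names are hypothetical, and an accelerator implementation would express the same formula with matrix products and one triangular solve.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def householder_product(vectors):
    """Sequential baseline: H_1 H_2 ... H_L with H_i = I - 2 u_i u_i^T."""
    n = len(vectors[0])
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for v in vectors:
        u = normalize(v)
        h = [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] for j in range(n)]
             for i in range(n)]
        q = matmul(q, h)
    return q


def cwy_product(vectors):
    """Compact WY form: I - U S^{-1} U^T with S = I/2 + strict_upper(U^T U)."""
    n, num = len(vectors[0]), len(vectors)
    units = [normalize(v) for v in vectors]  # units[i] is column i of U
    s = [[0.0] * num for _ in range(num)]
    for i in range(num):
        s[i][i] = 0.5
        for j in range(i + 1, num):
            s[i][j] = sum(a * b for a, b in zip(units[i], units[j]))
    # Solve S X = U^T by back substitution (S is upper triangular).
    x = [[0.0] * n for _ in range(num)]
    for i in range(num - 1, -1, -1):
        for c in range(n):
            acc = units[i][c] - sum(s[i][j] * x[j][c] for j in range(i + 1, num))
            x[i][c] = acc / s[i][i]
    # Q = I - U X, where U[r][i] = units[i][r].
    return [[(1.0 if r == c else 0.0) - sum(units[i][r] * x[i][c] for i in range(num))
             for c in range(n)] for r in range(n)]
```

The sequential version has L dependent steps, while the compact form needs only the Gram matrix U^T U and one triangular solve, which is what makes it friendlier to parallel hardware.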

preprint · 2021 · arXiv

MLGO: a Machine Learning Guided Compiler Optimizations Framework

Leveraging machine-learning (ML) techniques for compiler optimizations has been widely studied and explored in academia. However, the adoption of ML in general-purpose, industry-strength compilers has yet to happen. We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler -- LLVM. As a case study, we present the details and results of replacing the heuristics-based inlining-for-size optimization in LLVM with machine-learned models. To the best of our knowledge, this work is the first full integration of ML in a complex compiler pass in a real-world setting. It is available in the main LLVM repository. We use two different ML algorithms, Policy Gradient and Evolution Strategies, to train the inlining-for-size model, and achieve up to 7% size reduction compared to state-of-the-art LLVM -Oz. The same model, trained on one corpus, generalizes well to a diversity of real-world targets, as well as to the same set of targets after months of active development. This property of the trained models is beneficial for deploying ML techniques in real-world settings.

preprint · 2020 · arXiv

An Ode to an ODE

We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d). This nested system of two flows, where the parameter-flow is constrained to lie on the compact manifold, provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem which is intrinsically related to training deep neural network architectures such as Neural ODEs. Consequently, it leads to better downstream models, as we show on the example of training reinforcement learning policies with evolution strategies, and in the supervised learning setting, by comparing with previous SOTA baselines. We provide strong convergence results for our proposed mechanism that are independent of the depth of the network, supporting our empirical studies. Our results show an intriguing connection between the theory of deep neural networks and the field of matrix flows on compact manifolds.

preprint · 2020 · arXiv

Demystifying Orthogonal Monte Carlo and Beyond

Orthogonal Monte Carlo (OMC) is a very effective sampling algorithm imposing structural geometric conditions (orthogonality) on samples for variance reduction. Due to its simplicity and superior performance as compared to its Quasi Monte Carlo counterparts, OMC is used in a wide spectrum of challenging machine learning applications ranging from scalable kernel methods to predictive recurrent neural networks, generative models and reinforcement learning. However, theoretical understanding of the method remains very limited. In this paper we shed new light on the theoretical principles behind OMC, applying the theory of negatively dependent random variables to obtain several new concentration results. We also propose a novel extension of the method leveraging number theory techniques and particle algorithms, called Near-Orthogonal Monte Carlo (NOMC). We show that NOMC is the first algorithm consistently outperforming OMC in applications ranging from kernel methods to approximating distances in probabilistic metric spaces.

preprint · 2020 · arXiv

ES-MAML: Simple Hessian-Free Meta Learning

We introduce ES-MAML, a new framework for solving the model agnostic meta learning (MAML) problem based on Evolution Strategies (ES). Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries.

preprint · 2020 · arXiv

Learning to Score Behaviors for Guided Policy Optimization

We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open source demo.

preprint · 2020 · arXiv

Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies

Off-policy learning algorithms have been known to be sensitive to the choice of hyper-parameters. However, unlike near on-policy algorithms for which hyper-parameters could be optimized via e.g. meta-gradients, similar techniques could not be straightforwardly applied to off-policy learning. In this work, we propose a framework which entails the application of Evolutionary Strategies to online hyper-parameter tuning in off-policy learning. Our formulation draws close connections to meta-gradients and leverages the strengths of black-box optimization with relatively low-dimensional search spaces. We show that our method outperforms state-of-the-art off-policy learning baselines with static hyper-parameters and recent prior work over a wide range of continuous control benchmarks.

preprint · 2020 · arXiv

Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning

Learning adaptable policies is crucial for robots to operate autonomously in our complex and quickly changing world. In this work, we present a new meta-learning method that allows robots to quickly adapt to changes in dynamics. In contrast to gradient-based meta-learning algorithms that rely on second-order gradient estimation, we introduce a more noise-tolerant Batch Hill-Climbing adaptation operator and combine it with meta-learning based on evolutionary strategies. Our method significantly improves adaptation to changes in dynamics in high noise settings, which are common in robotics applications. We validate our approach on a quadruped robot that learns to walk while subject to changes in dynamics. We observe that our method significantly outperforms prior gradient-based approaches, enabling the robot to adapt its policy to changes based on less than 3 minutes of real data.

preprint · 2020 · arXiv

Ready Policy One: World Building Through Active Learning

Model-Based Reinforcement Learning (MBRL) offers a promising direction for sample-efficient learning, often achieving state-of-the-art results for continuous control tasks. However, many existing MBRL methods rely on combining greedy policies with exploration heuristics, and even those which utilize principled exploration bonuses construct dual objectives in an ad hoc fashion. In this paper we introduce Ready Policy One (RP1), a framework that views MBRL as an active learning problem, where we aim to improve the world model in the fewest samples possible. RP1 achieves this by utilizing a hybrid objective function, which crucially adapts during optimization, allowing the algorithm to trade off reward vs. exploration at different stages of learning. In addition, we introduce a principled mechanism to terminate sample collection once we have a rich enough trajectory batch to improve the model. We rigorously evaluate our method on a variety of continuous control tasks, and demonstrate statistically significant gains over existing approaches.

preprint · 2020 · arXiv

Robotic Table Tennis with Model-Free Reinforcement Learning

We propose a model-free algorithm for learning efficient policies capable of returning table tennis balls by controlling robot joints at a rate of 100 Hz. We demonstrate that evolutionary search (ES) methods acting on CNN-based policy architectures for non-visual inputs and convolving across time learn compact controllers leading to smooth motions. Furthermore, we show that with appropriately tuned curriculum learning on the task and rewards, policies are capable of developing multi-modal styles, specifically forehand and backhand stroke, whilst achieving 80% return rate on a wide range of ball throws. We observe that multi-modality does not require any architectural priors, such as multi-head architectures or hierarchical policies.

preprint · 2020 · arXiv

Stochastic Flows and Geometric Optimization on the Orthogonal Group

We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group $O(d)$ and naturally reductive homogeneous manifolds obtained from the action of the rotation group $SO(d)$. We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult Humanoid agent from OpenAI Gym and improving convolutional neural networks.

preprint · 2020 · arXiv

Time Dependence in Non-Autonomous Neural ODEs

Neural Ordinary Differential Equations (ODEs) are elegant reinterpretations of deep networks where continuous time can replace the discrete notion of depth, ODE solvers perform forward propagation, and the adjoint method enables efficient, constant memory backpropagation. Neural ODEs are universal approximators only when they are non-autonomous, that is, the dynamics depends explicitly on time. We propose a novel family of Neural ODEs with time-varying weights, where time-dependence is non-parametric, and the smoothness of weight trajectories can be explicitly controlled to allow a tradeoff between expressiveness and efficiency. Using this enhanced expressiveness, we outperform previous Neural ODE variants in both speed and representational capacity, ultimately outperforming standard ResNet and CNN models on select image classification and video prediction tasks.

preprint · 2020 · arXiv

Variance Reduction for Evolution Strategies via Structured Control Variates

Evolution Strategies (ES) are a powerful class of blackbox optimization techniques that recently became a competitive alternative to state-of-the-art policy gradient (PG) algorithms for reinforcement learning (RL). We propose a new method for improving the accuracy of ES algorithms that, as opposed to recent approaches utilizing only the Monte Carlo structure of the gradient estimator, takes advantage of the underlying MDP structure to reduce the variance. We observe that the gradient estimator of the ES objective can be alternatively computed using reparametrization and PG estimators, which leads to new control variate techniques for gradient estimation in ES optimization. We provide theoretical insights and show through extensive experiments that this RL-specific variance reduction approach outperforms general-purpose variance reduction methods.
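
For context, the plain Monte Carlo ES gradient estimator that the abstract above seeks to improve can be sketched as follows. This is the widely used antithetic form, (1 / (2 N sigma)) * sum_i (F(theta + sigma eps_i) - F(theta - sigma eps_i)) eps_i, and not the structured control variate method of the paper; the function name, the default smoothing parameter and the sample count are illustrative assumptions.

```python
import random


def es_gradient(f, theta, sigma=0.1, num_pairs=2000, seed=0):
    """Antithetic Monte Carlo estimate of the gradient of the Gaussian smoothing of f.

    sigma, num_pairs and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    d = len(theta)
    grad = [0.0] * d
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Evaluate the blackbox objective at a mirrored pair of perturbations.
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        scale = (plus - minus) / (2.0 * sigma * num_pairs)
        for i in range(d):
            grad[i] += scale * eps[i]
    return grad
```

Only function evaluations are used, so f may be any blackbox return, such as the total reward of a policy rollout. The variance of this estimator grows with the dimension of theta, which is what motivates variance reduction.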

preprint · 2016 · arXiv

Binary embeddings with structured hashed projections

We consider the hashing mechanism for constructing binary embeddings that involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix, where not all entries are independent random variables but instead a fixed "budget of randomness" is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, provide a reduction in randomness usage (i.e. the number of required random values), and very often lead to computational speed-ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical ground for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of the approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers, and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show the dependence of learning via structured hashed projections on the performance of a neural network as well as a nearest neighbor classifier.

preprint · 2016 · arXiv

Fast nonlinear embeddings via structured matrices

We present a new paradigm for speeding up randomized computations of several frequently used functions in machine learning. In particular, our paradigm can be applied to improve computations of kernels based on random embeddings. Beyond that, the presented framework covers multivariate randomized functions. As a byproduct, we propose an algorithmic approach that also leads to a significant reduction of space complexity. Our method is based on careful recycling of Gaussian vectors into structured matrices that share properties of fully random matrices. The quality of the proposed structured approach follows from combinatorial properties of the graphs encoding correlations between rows of these structured matrices. Our framework covers as special cases already known structured approaches such as the Fast Johnson-Lindenstrauss Transform, but is much more general since it can be applied also to highly nonlinear embeddings. We provide strong concentration results showing the quality of the presented paradigm.

preprint · 2016 · arXiv

On the boosting ability of top-down decision tree learning algorithm for multiclass classification

We analyze the performance of the top-down multiclass classification algorithm for decision tree learning called LOMtree, recently proposed by Choromanska and Langford (2014) for efficiently solving classification problems with a very large number of classes. The algorithm optimizes online an objective function that simultaneously controls the depth of the tree and its statistical accuracy. We prove important properties of this objective and explore its connection to three well-known entropy-based decision tree objectives, i.e. Shannon entropy, Gini-entropy and its modified version, for which online optimization schemes have not yet been developed. We show, via boosting-type guarantees, that maximizing the considered objective also leads to the reduction of all of these entropy-based objectives. The bounds we obtain critically depend on the strong-concavity properties of the entropy-based criteria, where the mildest dependence on the number of classes (only logarithmic) corresponds to the Shannon entropy.

preprint · 2016 · arXiv

Orthogonal Random Features

We present an intriguing discovery related to Random Fourier Features: in Gaussian kernel approximation, replacing the random Gaussian matrix by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for this behavior. Motivated by this discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from $\mathcal{O}(d^2)$ to $\mathcal{O}(d \log d)$, where $d$ is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of applications.
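
The ORF construction described above can be sketched in a few lines: orthogonalize a square Gaussian matrix, then rescale each row so its norm is distributed like that of a Gaussian vector. The sketch assumes the unit-bandwidth Gaussian kernel exp(-|x - y|^2 / 2), uses Gram-Schmidt in plain Python instead of a library QR routine, and stacks independent blocks when more than d features are wanted; all function names are hypothetical.

```python
import math
import random


def orthonormal_rows(g):
    """Modified Gram-Schmidt: orthonormalize the rows of a square matrix."""
    basis = []
    for row in g:
        v = list(row)
        for b in basis:
            proj = sum(a * c for a, c in zip(v, b))
            v = [a - proj * c for a, c in zip(v, b)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis


def orf_block(d, rng):
    """One d x d block: orthogonal directions with Gaussian-like row norms."""
    gauss = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    q = orthonormal_rows(gauss)
    # Row norms are drawn as norms of fresh Gaussian vectors (chi distributed).
    norms = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
             for _ in range(d)]
    return [[n * a for a in row] for n, row in zip(norms, q)]


def orf_matrix(d, num_blocks, rng):
    """Stack independent blocks to obtain num_blocks * d projection rows."""
    rows = []
    for _ in range(num_blocks):
        rows.extend(orf_block(d, rng))
    return rows


def gaussian_kernel_estimate(w, x, y):
    """Random Fourier feature estimate: mean of cos(w_i . (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(math.cos(sum(wi * di for wi, di in zip(row, diff)))
               for row in w) / len(w)
```

SORF, as described in the abstract, replaces the dense orthogonal block with structured discrete orthogonal matrices so that the projection costs O(d log d) instead of O(d^2); that variant is not shown here.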

preprint · 2016 · arXiv

Recycling Randomness with Structure for Sublinear time Kernel Expansions

We propose a scheme for recycling Gaussian random vectors into structured matrices to approximate various kernel functions in sublinear time via random embeddings. Our framework includes the Fastfood construction as a special case, but also extends to Circulant, Toeplitz and Hankel matrices, and the broader family of structured matrices that are characterized by the concept of low-displacement rank. We introduce notions of coherence and graph-theoretic structural constants that control the approximation quality, and prove unbiasedness and low-variance properties of random feature maps that arise within our framework. For the case of low-displacement matrices, we show how the degree of structure and randomness can be controlled to reduce statistical variance at the cost of increased computation and storage requirements. Empirical results strongly support our theory and justify the use of a broader family of structured matrices for scaling up kernel methods using random features.

preprint · 2016 · arXiv

TripleSpin - a generic compact paradigm for fast machine learning computations

We present a generic compact computational framework relying on structured random matrices that can be applied to speed up several machine learning algorithms with almost no loss of accuracy. The applications include new fast LSH-based algorithms, efficient kernel computations via random feature maps, convex optimization algorithms, quantization techniques and many more. Certain models of the presented paradigm are even more compressible since they apply only bit matrices. This makes them suitable for deploying on mobile devices. All our findings come with strong theoretical guarantees. In particular, as a byproduct of the presented techniques and by using a relatively new Berry-Esseen-type CLT for random vectors, we give the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the $\mathbf{HD}_{3}\mathbf{HD}_{2}\mathbf{HD}_{1}$ structured matrix ("Practical and Optimal LSH for Angular Distance"). These guarantees as well as theoretical results for other aforementioned applications follow from the same general theoretical principle that we present in the paper. Our structured family contains as special cases all previously considered structured schemes, including the recently introduced $P$-model. Experimental evaluation confirms the accuracy and efficiency of TripleSpin matrices.
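
The structured matrix named in the abstract above, a product of three blocks each made of a normalized Hadamard matrix H and a random sign diagonal D_i, can be applied to a vector in O(n log n) time with the fast Walsh-Hadamard transform. The sketch below is a minimal illustration under the assumption that n is a power of two; the function names are hypothetical and it covers only this one member of the family.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def sample_signs(n, rng):
    """Diagonal of a random +-1 matrix, stored as a list."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def hd3hd2hd1(x, sign_diagonals):
    """Apply H D_3 H D_2 H D_1 to x, with sign_diagonals = [D_1, D_2, D_3]."""
    out = list(x)
    for signs in sign_diagonals:
        out = fwht([s * v for s, v in zip(signs, out)])
    return out
```

Each factor is orthogonal, so the product preserves Euclidean norms, and only 3n random bits need to be stored. This is the compressibility that the abstract points to for mobile deployment.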

preprint · 2015 · arXiv

$P_{k}$-freeness implies small dichromatic number

We propose a purely combinatorial quadratic time algorithm that for any $n$-vertex $P_{k}$-free tournament $T$, where $P_{k}$ is a directed path of length $k$, finds in $T$ a transitive subset of order $n^{\frac{c}{k\log(k)^{2}}}$. As a byproduct of our method, we obtain a subcubic $O(n^{1-\frac{c}{k\log(k)^{2}}})$-approximation algorithm for the optimal acyclic coloring problem on $P_{k}$-free tournaments. Our results are tight up to the $\log(k)$-factor in the following sense: there exist infinite families of $P_{k}$-free tournaments with largest transitive subsets of order at most $n^{\frac{c\log(k)}{k}}$. As a corollary, we give tight asymptotic results regarding the so-called Erdős-Hajnal coefficients of directed paths. These are some of the first asymptotic results on these coefficients for infinite families of prime graphs.

preprint · 2015 · arXiv

An $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for retrieving heavily perturbated statistical databases in the low-dimensional querying mode

We give the first $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for reconstructing noisy statistical databases, where $T$ is the number of (online) sample queries received. The algorithm, which requires only $O(\log T)$ memory, aims to learn a hidden database-vector $w^{*} \in \mathbb{R}^{D}$ in order to accurately answer a stream of queries regarding the hidden database, which arrive in an online fashion from some unknown distribution $\mathcal{D}$. We assume the distribution $\mathcal{D}$ is defined on the neighborhood of a low-dimensional manifold. The presented algorithm runs in $O(dD)$-time per query, where $d$ is the dimensionality of the query-space. Contrary to the classical setting, there is no separate training set that is used by the algorithm to learn the database --- the stream on which the algorithm will be evaluated must also be used to learn the database-vector. The algorithm only has access to a binary oracle $\mathcal{O}$ that answers whether a particular linear function of the database-vector plus random noise is larger than a threshold, which is specified by the algorithm. We note that we allow for a significant $O(D)$ amount of noise to be added while other works focused on the low noise $o(\sqrt{D})$-setting. For a stream of $T$ queries our algorithm achieves an average error $\tilde{O}(\frac{1}{\sqrt{T}})$ by filtering out random noise, adapting threshold values given to the oracle based on its previous answers and, as a consequence, recovering with high precision a projection of a database-vector $w^{*}$ onto the manifold defining the query-space.

preprint · 2015 · arXiv

Coloring tournaments with forbidden substructures

Coloring graphs is an important algorithmic problem in combinatorics with many applications in computer science. In this paper we study coloring tournaments. The chromatic number of a random tournament is of order $\Omega(\frac{n}{\log(n)})$. The question arises whether the chromatic number can be proven to be smaller for more structured nontrivial classes of tournaments. We analyze the class of tournaments defined by a forbidden subtournament $H$. This paper gives the first quasi-polynomial algorithm, running in time $e^{O(\log(n)^{2})}$, that constructs colorings of $H$-free tournaments using only $O(n^{1-\epsilon(H)}\log(n))$ colors, where $\epsilon(H) \geq 2^{-2^{50|H|^{2}+1}}$ for many forbidden tournaments $H$. To the best of our knowledge all previously known related results required at least sub-exponential time and relied on the regularity lemma. Since we do not use the regularity lemma, we obtain the first known lower bounds on $\epsilon(H)$ that can be given by a closed-form expression. As a corollary, we give a constructive proof of the celebrated open Erdős-Hajnal conjecture with explicitly given lower bounds on the EH coefficients for all classes of prime tournaments for which the conjecture is known. Such a constructive proof was not known before. Thus we significantly reduce the gap between the best lower and upper bounds on the EH coefficients from the conjecture for all known prime tournaments that satisfy it. We also briefly explain how our methods may be used for coloring $H$-free tournaments under the following conditions: $H$ is any tournament with $\leq 5$ vertices, or $H$ is any but one tournament on six vertices.

preprint2015arXiv

Differentially- and non-differentially-private random decision trees

We consider supervised learning with random decision trees, where the tree construction is completely random. The method is popularly used and works well in practice despite the simplicity of the setting, but its statistical mechanism is not yet well-understood. In this paper we provide strong theoretical guarantees regarding learning with random decision trees. We analyze and compare three different variants of the algorithm that have minimal memory requirements: majority voting, threshold averaging and probabilistic averaging. The random structure of the tree enables us to adapt these methods to a differentially-private setting; thus we also propose differentially-private versions of all three schemes. We give upper bounds on the generalization error and mathematically explain how the accuracy depends on the number of random decision trees. Furthermore, we prove that only a logarithmic (in the size of the dataset) number of independently selected random decision trees suffices to correctly classify most of the data, even when differential-privacy guarantees must be maintained. We empirically show that majority voting and threshold averaging give the best accuracy, also for conservative users requiring high privacy guarantees. Furthermore, we demonstrate that a simple majority voting rule is an especially good candidate for the differentially-private classifier since it is much less sensitive to the choice of forest parameters than other methods.

preprint2015arXiv

Efficient data hashing with structured binary embeddings

We present here new mechanisms for hashing data via binary embeddings. Contrary to most of the techniques presented before, the embedding matrix of our mechanism is highly structured. That enables us to perform hashing more efficiently and use less memory. What is crucial and nonintuitive is the fact that imposing a structured mechanism does not affect the quality of the produced hash. To the best of our knowledge, we are the first to give strong theoretical guarantees for the proposed binary hashing method by proving the efficiency of the mechanism for several classes of structured projection matrices. As a corollary, we obtain binary hashing mechanisms with strong concentration results for circulant and Toeplitz matrices. Our approach is, however, much more general.
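A minimal sketch of a binary embedding with a circulant projection matrix may help make the memory saving concrete. It is an illustration under stated assumptions, not the paper's construction: the abstract does not specify the mechanism, so the Gaussian first row, the absence of any preprocessing step, and the names `circulant_row`, `binary_hash` and `hamming` are all hypothetical choices made here.

```python
import random

def circulant_row(first_row, i):
    """Row i of the circulant matrix generated by first_row (cyclic shift by i)."""
    n = len(first_row)
    return [first_row[(j - i) % n] for j in range(n)]

def binary_hash(x, first_row, num_bits):
    """Hash x to num_bits bits: the signs of the first num_bits rows applied to x."""
    bits = []
    for i in range(num_bits):
        row = circulant_row(first_row, i)
        projection = sum(r * v for r, v in zip(row, x))
        bits.append(1 if projection >= 0 else 0)
    return bits

def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(p != q for p, q in zip(a, b))

rng = random.Random(0)
n = 16
# Assumption: a Gaussian first row. Only n random numbers are stored,
# instead of the n * n entries of an unstructured projection matrix.
g = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

h_x = binary_hash(x, g, 8)
h_scaled = binary_hash([2.0 * v for v in x], g, 8)   # same direction as x
h_neg = binary_hash([-v for v in x], g, 8)           # opposite direction
```

The hash depends only on the direction of `x`, so `h_scaled` matches `h_x`, while the antipodal vector flips every bit. A practical implementation would compute the circulant product with an FFT rather than materialising each row.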

preprint2015arXiv

Excluding hooks and their complements

The celebrated Erdős-Hajnal conjecture states that for every undirected graph $H$ there exists $ε(H)>0$ such that every $n$-vertex graph $G$ that does not contain $H$ as an induced subgraph contains a clique or an independent set of size at least $n^{ε(H)}$. A weaker version of the conjecture states that the polynomial-size clique/independent set phenomenon occurs if one excludes both $H$ and its complement $H^{c}$. We show that the weaker conjecture holds if $H$ is any path with a pendant edge at its third vertex; thus we give a new infinite family of graphs for which the conjecture holds.

preprint2015arXiv

Fast Online Clustering with Randomized Skeleton Sets

We present a new fast online clustering algorithm that reliably recovers arbitrary-shaped data clusters in high-throughput data streams. Unlike the existing state-of-the-art online clustering methods based on k-means or k-medoids, it does not make any restrictive generative assumptions. In addition, in contrast to existing nonparametric clustering techniques such as DBSCAN or DenStream, it gives provable theoretical guarantees. To achieve fast clustering, we propose to represent each cluster by a skeleton set which is updated continuously as new data is seen. A skeleton set consists of weighted samples from the data where weights encode local densities. The size of each skeleton set is adapted according to the cluster geometry. The proposed technique automatically detects the number of clusters and is robust to outliers. The algorithm works for an infinite data stream where more than one pass over the data is not feasible. We provide theoretical guarantees on the quality of the clustering and also demonstrate its advantage over the existing state-of-the-art on several datasets.

preprint2015arXiv

Learning how to rank from heavily perturbed statistics - digraph clustering approach

Ranking is one of the most fundamental problems in machine learning with applications in many branches of computer science such as information retrieval systems, recommendation systems, machine translation and computational biology. Ranking objects based on possibly conflicting preferences is a central problem in voting research and social choice theory. In this paper we present a new simple combinatorial ranking algorithm adapted to the preference-based setting. We apply this new algorithm to the well-known scenario where the edges of the preference tournament are determined by the majority-voting model. It outperforms existing methods when it cannot be assumed that there exists a global ranking of good enough quality, and it applies combinatorial techniques that have not been used in the ranking context before. The experiments performed show the superiority of the new algorithm over existing methods, including those that were designed to handle heavily perturbed statistics. By combining our techniques with those presented in \cite{mohri}, we obtain a purely combinatorial algorithm that correctly answers most of the queries in the heterogeneous scenario, where the preference tournament is only locally of good quality but is not necessarily pseudotransitive. As a byproduct of our methods, we obtain an algorithm solving the clustering problem for the directed planted partition model. To the best of our knowledge, it is the first purely combinatorial algorithm tackling this problem.

preprint2015arXiv

On Learning from Label Proportions

Learning from Label Proportions (LLP) is a learning setting where the training data is provided in groups, or "bags", and only the proportion of each class in each bag is known. The task is to learn a model to predict the class labels of the individual instances. LLP has broad applications in political science, marketing, healthcare, and computer vision. This work answers the fundamental question of when and why LLP is possible by introducing a general framework, Empirical Proportion Risk Minimization (EPRM). EPRM learns an instance label classifier to match the given label proportions on the training data. Our result is based on a two-step analysis. First, we provide a VC bound on the generalization error of the bag proportions. We show that the bag sample complexity is only mildly sensitive to the bag size. Second, we show that under some mild assumptions, good bag proportion prediction guarantees good instance label prediction. The results together provide a formal guarantee that the individual labels can indeed be learned in the LLP setting. We discuss applications of the analysis, including justification of LLP algorithms, learning with population proportions, and a paradigm for learning algorithms with privacy guarantees. We also demonstrate the feasibility of LLP based on a case study in a real-world setting: predicting income based on census data.
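The idea of fitting a classifier so that its predicted bag proportions match the given ones can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the hypothesis class (one-dimensional thresholds), the absolute-difference loss, the exhaustive search, and the names `predicted_proportion`, `proportion_risk` and `fit_threshold`.

```python
def predicted_proportion(bag, threshold):
    """Fraction of instances in the bag that the threshold rule labels positive."""
    return sum(1 for x in bag if x >= threshold) / len(bag)

def proportion_risk(bags, proportions, threshold):
    """Mean absolute gap between predicted and given positive proportions."""
    gaps = [abs(predicted_proportion(bag, threshold) - p)
            for bag, p in zip(bags, proportions)]
    return sum(gaps) / len(bags)

def fit_threshold(bags, proportions):
    """Pick the observed value that minimises the empirical proportion risk."""
    candidates = sorted({x for bag in bags for x in bag})
    return min(candidates, key=lambda t: proportion_risk(bags, proportions, t))

# Two bags of scalar features; only the positive fraction of each bag is known.
bags = [[0.1, 0.2, 0.9, 0.8], [0.3, 0.7, 0.6, 0.95]]
proportions = [0.5, 0.75]

best = fit_threshold(bags, proportions)
risk = proportion_risk(bags, proportions, best)
```

On this toy data the rule `x >= 0.6` reproduces both bag proportions exactly, so no individual label is ever needed to select it.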

preprint2015arXiv

On the Erdős-Hajnal conjecture for six-vertex tournaments

A celebrated unresolved conjecture of Erdős and Hajnal states that for every undirected graph $H$ there exists $ε(H)>0$ such that every undirected graph on $n$ vertices that does not contain $H$ as an induced subgraph contains a clique or stable set of size at least $n^{ε(H)}$. The conjecture has a directed equivalent version stating that for every tournament $H$ there exists $ε(H)>0$ such that every $H$-free $n$-vertex tournament $T$ contains a transitive subtournament of order at least $n^{ε(H)}$. We say that a tournament is \textit{prime} if it does not have nontrivial homogeneous sets. So far the conjecture was proved only for some specific families of prime tournaments (\cite{chorochudber, choromanski2}) and tournaments constructed according to the so-called \textit{substitution procedure}(\cite{alon}). In particular, recently the conjecture was proved for all five-vertex tournaments (\cite{chorochudber}), but the question about the correctness of the conjecture for all six-vertex tournaments remained open. In this paper we prove that all but at most one six-vertex tournament satisfy the Erdős-Hajnal conjecture. That reduces the six-vertex case to a single tournament.
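To make the notion of a transitive subtournament concrete, the sketch below brute-forces the largest one in a small tournament. It relies on the standard fact that a tournament is transitive exactly when it contains no directed 3-cycle. The adjacency convention `beats[u][v]` and the function names are hypothetical, and the search is exponential, so it is only meant for tiny examples.

```python
from itertools import combinations

def is_transitive(beats, vertices):
    """True if the subtournament induced on vertices has no directed 3-cycle.

    beats[u][v] is True when the edge between u and v is oriented u -> v.
    """
    for a, b, c in combinations(vertices, 3):
        if beats[a][b] and beats[b][c] and beats[c][a]:
            return False
        if beats[b][a] and beats[a][c] and beats[c][b]:
            return False
    return True

def largest_transitive(beats):
    """Brute force: the first largest vertex subset inducing a transitive subtournament."""
    n = len(beats)
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if is_transitive(beats, subset):
                return list(subset)
    return []

# Rotational tournament on 5 vertices: i beats i+1 and i+2 (mod 5).
n = 5
rotational = [[(j - i) % n in (1, 2) for j in range(n)] for i in range(n)]
best = largest_transitive(rotational)
```

For this 5-vertex example every 4-vertex subset contains a directed triangle, so the largest transitive subtournament has three vertices.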

preprint2015arXiv

Quantization based Fast Inner Product Search

We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS). Each database vector is quantized in multiple subspaces via a set of codebooks, learned directly by minimizing the inner product quantization error. Then, the inner product of a query to a database vector is approximated as the sum of inner products with the subspace quantizers. Different from recently proposed LSH approaches to MIPS, the database vectors and queries do not need to be augmented in a higher dimensional feature space. We also provide a theoretical analysis of the proposed approach, consisting of the concentration results under mild assumptions. Furthermore, if a small sample of example queries is given at the training time, we propose a modified codebook learning procedure which further improves the accuracy. Experimental results on a variety of datasets including those arising from deep neural networks show that the proposed approach significantly outperforms the existing state-of-the-art.
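The approximation of an inner product as a sum of per-subspace inner products with quantizers can be sketched as follows. This is an illustration under assumptions: the codebooks are hand-picked rather than learned by minimizing the inner product quantization error as the paper does, database vectors are assigned to the nearest codeword in Euclidean distance, and all function names are hypothetical.

```python
def split(v, num_subspaces):
    """Split v into num_subspaces contiguous chunks of equal length."""
    size = len(v) // num_subspaces
    return [v[i * size:(i + 1) * size] for i in range(num_subspaces)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def encode(v, codebooks):
    """Per subspace, keep only the index of the nearest codeword (an assumption)."""
    chunks = split(v, len(codebooks))
    return [min(range(len(cb)), key=lambda k: sq_dist(chunk, cb[k]))
            for chunk, cb in zip(chunks, codebooks)]

def approx_inner_product(query, code, codebooks):
    """Sum of inner products between query chunks and the selected codewords."""
    chunks = split(query, len(codebooks))
    return sum(dot(chunk, cb[k]) for chunk, cb, k in zip(chunks, codebooks, code))

# Hand-picked codebooks: two subspaces, two codewords each (illustrative only).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, -1.0]],
]
database_vector = [0.9, 0.1, -0.8, -1.2]
query = [2.0, 3.0, 1.0, 1.0]

code = encode(database_vector, codebooks)                 # stored instead of the vector
approx = approx_inner_product(query, code, codebooks)
exact = dot(query, database_vector)
```

The query is used as-is, without being lifted into a higher dimensional space. In a search setting the per-subspace products between the query and every codeword would be computed once and then looked up for each database code.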

preprint2014arXiv

All known prime Erdős-Hajnal tournaments satisfy $ε(H) = Ω(\frac{1}{|H|^{5}\log(|H|)})$

We prove that there exists $C>0$ such that $ε(H) \geq \frac{C}{|H|^{5}\log(|H|)}$, where $ε(H)$ is the Erdős-Hajnal coefficient of the tournament $H$, for every prime tournament $H$ for which the celebrated Erdős-Hajnal Conjecture has been proven so far. This is the first polynomial bound on the EH coefficient obtained for all known prime Erdős-Hajnal tournaments, in particular for infinitely many prime tournaments. As a byproduct of our analysis, we answer affirmatively the question whether there exists an infinite family of prime tournaments $H$ with $ε(H)$ lower-bounded by $\frac{1}{\textit{poly}(|H|)}$, where $\textit{poly}$ is a polynomial function. Furthermore, we give much tighter bounds than those known so far for the EH coefficients of tournaments without large homogeneous sets. This enables us to significantly reduce the gap between the best known lower and upper bounds for the EH coefficients of tournaments. As a corollary we prove that every known prime Erdős-Hajnal tournament $H$ satisfies: $-5 + o(1) \leq \frac{\log(ε(H))}{\log(|H|)} \leq -1 + o(1)$. No lower bound on that expression was known before. We also show the applications of those results to the tournament coloring problem. In particular, we prove that for every known prime Erdős-Hajnal tournament $H$ every $H$-free tournament has \textit{chromatic number} at most $O(n^{1-\frac{C}{|H|^{5}\log(|H|)}}\log(n))$, where $C>0$ is some universal constant. The related coloring can be constructed algorithmically in quasipolynomial time by straightforwardly following the proof of our main result. In comparison, standard Ramsey theory gives only $O(\frac{n}{\log(n)})$ bounds for the tournament chromatic number.

preprint2014arXiv

Excluding pairs of tournaments

The Erdős-Hajnal conjecture states that for every given undirected graph $H$ there exists a constant $c(H)>0$ such that every graph $G$ that does not contain $H$ as an induced subgraph contains a clique or a stable set of size at least $|V(G)|^{c(H)}$. The conjecture is still open. Its equivalent directed version states that for every given tournament $H$ there exists a constant $c(H)>0$ such that every $H$-free tournament $T$ contains a transitive subtournament of order at least $|V(T)|^{c(H)}$. We prove in this paper that $\{H_{1},H_{2}\}$-free tournaments $T$ contain transitive subtournaments of size at least $|V(T)|^{c(H_{1},H_{2})}$ for some $c(H_{1},H_{2})>0$ and several pairs of tournaments $H_{1}$, $H_{2}$. In particular we prove that $\{H,H^{c}\}$-freeness implies the existence of polynomial-size transitive subtournaments for several tournaments $H$ for which the conjecture is still open ($H^{c}$ stands for the \textit{complement of $H$}). To the best of our knowledge these are the first nontrivial results of this type.

preprint2014arXiv

Notes on using Determinantal Point Processes for Clustering with Applications to Text Clustering

In this paper, we compare three initialization schemes for the KMEANS clustering algorithm: 1) random initialization (KMEANSRAND), 2) KMEANS++, and 3) KMEANSD++. Both KMEANSRAND and KMEANS++ have a major drawback: the value of k needs to be set by the user of the algorithms. (Kang 2013) recently proposed a novel use of determinantal point processes for sampling the initial centroids for the KMEANS algorithm (we call it KMEANSD++). They do not, however, provide any evaluation establishing that KMEANSD++ is better than other algorithms. In this paper, we show that the performance of KMEANSD++ is comparable to KMEANS++ (both of which are better than KMEANSRAND), with KMEANSD++ having the additional advantage that it can automatically approximate the value of k.

preprint2014arXiv

The Strong EH-Property and the Erdős-Hajnal Conjecture

The Erdős-Hajnal Conjecture states that for every $H$ there exists a constant $ε(H)>0$ such that every graph $G$ that does not contain $H$ as an induced subgraph contains a clique or a stable set of size at least $|V(G)|^{ε(H)}$. The Conjecture is still open. Some time ago its directed version was formulated (see \cite{alon}). In the directed version graphs are replaced by tournaments, and cliques and stable sets by transitive subtournaments. If the Conjecture is not true then the smallest counterexample is a prime tournament. For a long time the Conjecture was known only for finitely many prime tournaments. Recently in \cite{bcc} and \cite{choromanski2} the Conjecture was proven for the families of galaxies and constellations that contain infinitely many prime tournaments. In \cite{bcc} the Conjecture was also proven for all $5$-vertex tournaments. We say that a tournament $H$ has the $EH$-property if it satisfies the Conjecture. In this paper we introduce the so-called \textit{strong EH-property}, which enables us to prove the Conjecture for new prime tournaments and, what is even more interesting, provides a mechanism to combine tournaments satisfying the Conjecture to get bigger tournaments that do so and are not necessarily nonprime. We give several examples of families of tournaments constructed according to this procedure. The only procedure known before for constructing bigger tournaments satisfying the Conjecture from smaller tournaments satisfying the Conjecture was the so-called \textit{substitution procedure} (see \cite{alon}). However, the outcome of this procedure is always a nonprime tournament and, from what we have said before, prime tournaments are those that play a crucial role in the research on the Conjecture. Our method may potentially be used to prove the Conjecture for several new classes of tournaments.

39 published item(s)

Evaluating Gemini Robotics Policies in a Veo World Simulator

RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

Chefs' Random Tables: Non-Trigonometric Random Features

Hybrid Random Features

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices

MLGO: a Machine Learning Guided Compiler Optimizations Framework

An Ode to an ODE

Demystifying Orthogonal Monte Carlo and Beyond

ES-MAML: Simple Hessian-Free Meta Learning

Learning to Score Behaviors for Guided Policy Optimization

Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies

Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning

Ready Policy One: World Building Through Active Learning

Robotic Table Tennis with Model-Free Reinforcement Learning

Stochastic Flows and Geometric Optimization on the Orthogonal Group

Time Dependence in Non-Autonomous Neural ODEs

Variance Reduction for Evolution Strategies via Structured Control Variates

Binary embeddings with structured hashed projections

Fast nonlinear embeddings via structured matrices

On the boosting ability of top-down decision tree learning algorithm for multiclass classification

Orthogonal Random Features

Recycling Randomness with Structure for Sublinear time Kernel Expansions

TripleSpin - a generic compact paradigm for fast machine learning computations

$P_{k}$-freeness implies small dichromatic number

An $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for retrieving heavily perturbated statistical databases in the low-dimensional querying mode

Coloring tournaments with forbidden substructures

Differentially- and non-differentially-private random decision trees

Efficient data hashing with structured binary embeddings

Excluding hooks and their complements

Fast Online Clustering with Randomized Skeleton Sets

Learning how to rank from heavily perturbed statistics - digraph clustering approach

On Learning from Label Proportions

On the Erdős-Hajnal conjecture for six-vertex tournaments

Quantization based Fast Inner Product Search

All known prime Erdős-Hajnal tournaments satisfy $ε(H) = Ω(\frac{1}{|H|^{5}\log(|H|)})$

Excluding pairs of tournaments

Notes on using Determinantal Point Processes for Clustering with Applications to Text Clustering

The Strong EH-Property and the Erdős-Hajnal Conjecture