Source author record

Daniil Ryabko

Daniil Ryabko appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning · Information Theory · math.IT · math.ST · Statistics Theory · Artificial Intelligence · math.PR · Computational Complexity · Computer Vision · Cryptography and Security · Data Structures and Algorithms · math.OC · Quantitative Methods

Catalog footprint

What is connected

26 works

13 topics

4 close collaborators

preprint · 2019 · arXiv

Unsupervised model-free representation learning

Numerous control and learning problems face the situation where sequences of high-dimensional highly dependent data are available but little or no feedback is provided to the learner, which makes any inference rather challenging. To address this challenge, we formulate the following problem. Given a series of observations $X_0,\dots,X_n$ coming from a large (high-dimensional) space $\mathcal X$, find a representation function $f$ mapping $\mathcal X$ to a finite space $\mathcal Y$ such that the series $f(X_0),\dots,f(X_n)$ preserves as much information as possible about the original time-series dependence in $X_0,\dots,X_n$. We show that, for stationary time series, the function $f$ can be selected as the one maximizing a certain information criterion that we call time-series information. Some properties of this function are investigated, including its uniqueness and consistency of its empirical estimates. Implications for the problem of optimal control are presented.

preprint · 2016 · arXiv

Things Bayes can't do

The problem of forecasting conditional probabilities of the next event given the past is considered in a general probabilistic setting. Given an arbitrary (large, uncountable) set C of predictors, we would like to construct a single predictor that performs asymptotically as well as the best predictor in C, on any data. Here we show that there are sets C for which such predictors exist, but none of them is a Bayesian predictor with a prior concentrated on C. In other words, there is a predictor with sublinear regret, but every Bayesian predictor must have a linear regret. This negative finding is in sharp contrast with previous results that establish the opposite for the case when one of the predictors in $C$ achieves asymptotically vanishing error. In such a case, if there is a predictor that achieves asymptotically vanishing error for any measure in C, then there is a Bayesian predictor that also has this property, and whose prior is concentrated on (a countable subset of) C.

preprint · 2016 · arXiv

Universality of Bayesian mixture predictors

The problem is that of sequential probability forecasting for finite-valued time series. The data is generated by an unknown probability distribution over the space of all one-way infinite sequences. It is known that this measure belongs to a given set C, but the latter is completely arbitrary (uncountably infinite, without any structure given). The performance is measured with asymptotic average log loss. In this work it is shown that the minimax asymptotic performance is always attainable, and it is attained by a convex combination of countably many measures from the set C (a Bayesian mixture). This was previously only known for the case when the best achievable asymptotic error is 0. This also contrasts with previous results that show that in the non-realizable case all Bayesian mixtures may be suboptimal, while there is a predictor that achieves the optimal performance.
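The mechanics of a Bayesian mixture under log loss can be sketched as follows. This is only an illustration under strong simplifying assumptions: a finite set of three Bernoulli predictors stands in for the countable subset of C, and the function name `mixture_log_loss`, the parameter values and the uniform prior are hypothetical choices that do not come from the paper.

```python
import math

def mixture_log_loss(sequence, thetas, prior):
    """Run a Bayesian mixture of Bernoulli predictors over a binary sequence.

    thetas: probability that each component assigns to the symbol 1
            (hypothetical components, each strictly between 0 and 1).
    prior:  initial weights of the components, summing to 1.
    Returns (cumulative log loss of the mixture,
             list of cumulative log losses of the components), in bits.
    """
    weights = list(prior)
    mix_loss = 0.0
    component_loss = [0.0] * len(thetas)
    for x in sequence:
        # Probability each component assigns to the observed symbol.
        probs = [t if x == 1 else 1.0 - t for t in thetas]
        # The mixture predicts with the weighted average of the components.
        p_mix = sum(w * p for w, p in zip(weights, probs))
        mix_loss -= math.log2(p_mix)
        for k, p in enumerate(probs):
            component_loss[k] -= math.log2(p)
        # Posterior update: reweight components by how well they predicted.
        weights = [w * p / p_mix for w, p in zip(weights, probs)]
    return mix_loss, component_loss

if __name__ == "__main__":
    seq = [1, 1, 0, 1] * 50
    mix, comps = mixture_log_loss(seq, [0.25, 0.5, 0.75], [1 / 3] * 3)
    print("mixture:", mix, "components:", comps)
```

For any component with prior weight w_k, the mixture's cumulative log loss exceeds that component's loss by at most log2(1/w_k) bits, so the average extra loss per symbol vanishes as the sequence grows.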

preprint · 2015 · arXiv

Characterizing predictable classes of processes

The problem is sequence prediction in the following setting. A sequence $x_1,\dots,x_n,\dots$ of discrete-valued observations is generated according to some unknown probabilistic law (measure) $\mu$. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure $\mu$ belongs to an arbitrary class $C$ of stochastic processes. We are interested in predictors $\rho$ whose conditional probabilities converge to the 'true' $\mu$-conditional probabilities if any $\mu \in C$ is chosen to generate the data. We show that if such a predictor exists, then a predictor can also be obtained as a convex combination of countably many elements of $C$. In other words, it can be obtained as a Bayesian predictor whose prior is concentrated on a countable set. This result is established for two very different measures of performance of prediction, one of which is very strong, namely, total variation, and the other is very weak, namely, prediction in expected average Kullback-Leibler divergence.

preprint · 2015 · arXiv

Multiple Change Point Estimation in Stationary Ergodic Time Series

Given a heterogeneous time-series sample, the objective is to find points in time (called change points) where the probability distribution generating the data has changed. The data are assumed to have been generated by arbitrary unknown stationary ergodic distributions. No modelling, independence or mixing assumptions are made. A novel, computationally efficient, nonparametric method is proposed, and is shown to be asymptotically consistent in this general framework. The theoretical results are complemented with experimental evaluations.

preprint · 2015 · arXiv

Predicting the outcomes of every process for which an asymptotically accurate stationary predictor exists is impossible

The problem of prediction consists in forecasting the conditional distribution of the next outcome given the past. Assume that the source generating the data is such that there is a stationary ergodic predictor whose error converges to zero (in a certain sense). The question is whether there is a universal predictor for all such sources, that is, a predictor whose error goes to zero if any of the sources that have this property is chosen to generate the data. This question is answered in the negative, contrasting a number of previously established positive results concerning related but smaller sets of processes.

preprint · 2014 · arXiv

A criterion for hypothesis testing for stationary processes

Given a finite-valued sample $X_1,...,X_n$ we wish to test whether it was generated by a stationary ergodic process belonging to a family $H_0$, or it was generated by a stationary ergodic process outside $H_0$. We require the Type I error of the test to be uniformly bounded, while the Type II error has to be made not more than a finite number of times with probability 1. For this notion of consistency we provide necessary and sufficient conditions on the family $H_0$ for the existence of a consistent test. This criterion is illustrated with applications to testing for membership in parametric families, generalizing some existing results. In addition, we analyze a stronger notion of consistency, which requires finite-sample guarantees on error of both types, and provide some necessary and some sufficient conditions for the existence of a consistent test. We emphasize that no assumptions on the process distributions are made beyond stationarity and ergodicity.

preprint · 2014 · arXiv

On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

A sequence $x_1,\dots,x_n,\dots$ of discrete-valued observations is generated according to some unknown probabilistic law (measure) $μ$. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure $μ$ belongs to an arbitrary but known class $\mathcal C$ of process measures. The non-realizable case is when $μ$ is completely arbitrary, but the prediction performance is measured with respect to a given set $\mathcal C$ of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists, it can be obtained as a Bayes mixture over a countable subset of $\mathcal C$. We also obtain several characterizations of those sets $\mathcal C$ for which solutions to the considered problems exist. As an illustration of the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the process measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family.

preprint · 2014 · arXiv

Selecting Near-Optimal Approximate State Representations in Reinforcement Learning

We consider a reinforcement learning setting introduced in (Maillard et al., NIPS 2011) where the learner does not have explicit access to the states of the underlying Markov decision process (MDP). Instead, she has access to several models that map histories of past interactions to states. Here we improve over known regret bounds in this setting, and more importantly generalize to the case where the models given to the learner do not contain a true model resulting in an MDP representation but only approximations of it. We also give improved error bounds for state aggregation.

preprint · 2014 · arXiv

Uniform hypothesis testing for ergodic time series distributions

Given a discrete-valued sample $X_1,...,X_n$ we wish to decide whether it was generated by a distribution belonging to a family $H_0$, or it was generated by a distribution belonging to a family $H_1$. In this work we assume that all distributions are stationary ergodic, and do not make any further assumptions (e.g. no independence or mixing rate assumptions). We would like to have a test whose probability of error (both Type I and Type II) is uniformly bounded. More precisely, we require that for each $ε$ there exists a sample size $n$ such that the probability of error is upper-bounded by $ε$ for samples longer than $n$. We find some necessary and some sufficient conditions on $H_0$ and $H_1$ under which a consistent test (with this notion of consistency) exists. These conditions are topological, with respect to the topology of distributional distance.

preprint · 2013 · arXiv

A consistent clustering-based approach to estimating the number of change-points in highly dependent time-series

The problem of change-point estimation is considered under a general framework where the data are generated by unknown stationary ergodic process distributions. In this context, the consistent estimation of the number of change-points is provably impossible. However, it is shown that a consistent clustering method may be used to estimate the number of change points, under the additional constraint that the correct number of process distributions that generate the data is provided. This additional parameter has a natural interpretation in many real-world applications. An algorithm is proposed that estimates the number of change-points and locates the changes. The proposed algorithm is shown to be asymptotically consistent; its empirical evaluations are provided.

preprint · 2013 · arXiv

Clustering processes

The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes (again, no parametric or independence assumptions). In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.

preprint · 2013 · arXiv

Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.

preprint · 2013 · arXiv

Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning

We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order $O(T^{2/3})$ with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after $T$ time steps is $O(\sqrt{T})$, with all constants reasonably small. This is optimal in $T$ since $O(\sqrt{T})$ is the optimal regret in the setting of learning in a (single discrete) MDP.

preprint · 2013 · arXiv

Reducing statistical time-series problems to binary classification

We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data.

preprint · 2013 · arXiv

Selecting the State-Representation in Reinforcement Learning

The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing either which of the models is the correct one or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order $T^{2/3}$ where $T$ is the horizon time.

preprint · 2012 · arXiv

Confidence Sets in Time-Series Filtering

The problem of filtering of finite-alphabet stationary ergodic time series is considered. A method for constructing a confidence set for the (unknown) signal is proposed, such that the resulting set has the following properties. First, it includes the unknown signal with probability $γ$, where $γ$ is a parameter supplied to the filter. Second, the size of the confidence set grows exponentially at a rate that is asymptotically equal to the conditional entropy of the signal given the data. Moreover, it is shown that this rate is optimal.

preprint · 2012 · arXiv

Nonparametric Statistical Inference for Ergodic Processes

In this work a method for statistical analysis of time series is proposed, which is used to obtain solutions to some classical problems of mathematical statistics under the only assumption that the process generating the data is stationary ergodic. Namely, three problems are considered: goodness-of-fit (or identity) testing, process classification, and the change point problem. For each of the problems a test is constructed that is asymptotically accurate for the case when the data is generated by stationary ergodic processes. The tests are based on empirical estimates of distributional distance.

preprint · 2012 · arXiv

Regret Bounds for Restless Markov Bandits

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after $T$ steps achieves $\tilde{O}(\sqrt{T})$ regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.

preprint · 2011 · arXiv

Constructing Perfect Steganographic Systems

We propose steganographic systems for the case when covertexts (containers) are generated by a finite-memory source with possibly unknown statistics. The probability distributions of covertexts with and without hidden information are the same; this means that the proposed stegosystems are perfectly secure, i.e. an observer cannot determine whether hidden information is being transmitted. The speed of transmission of hidden information can be made arbitrarily close to the theoretical limit - the Shannon entropy of the source of covertexts. An interesting feature of the suggested stegosystems is that they do not require any (secret or public) key. At the same time, we outline some principled computational limitations on steganography. We show that there are sources of covertexts such that any stegosystem that has linear (in the length of the covertext) speed of transmission of hidden text must have an exponential Kolmogorov complexity. This shows, in particular, that some assumptions on the sources of covertext are necessary.

preprint · 2010 · arXiv

Clustering processes

preprint · 2009 · arXiv

Characterizing predictable classes of processes

The problem is sequence prediction in the following setting. A sequence $x_1,...,x_n,...$ of discrete-valued observations is generated according to some unknown probabilistic law (measure) $μ$. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure $μ$ belongs to an arbitrary class $\mathcal C$ of stochastic processes. We are interested in predictors $ρ$ whose conditional probabilities converge to the "true" $μ$-conditional probabilities if any $μ\in\mathcal C$ is chosen to generate the data. We show that if such a predictor exists, then a predictor can also be obtained as a convex combination of countably many elements of $\mathcal C$. In other words, it can be obtained as a Bayesian predictor whose prior is concentrated on a countable set. This result is established for two very different measures of performance of prediction, one of which is very strong, namely, total variation, and the other is very weak, namely, prediction in expected average Kullback-Leibler divergence.

preprint · 2009 · arXiv

On Finding Predictors for Arbitrary Families of Processes

The problem is sequence prediction in the following setting. A sequence $x_1,...,x_n,...$ of discrete-valued observations is generated according to some unknown probabilistic law (measure) $μ$. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure $μ$ belongs to an arbitrary but known class $C$ of stochastic process measures. We are interested in predictors $ρ$ whose conditional probabilities converge (in some sense) to the "true" $μ$-conditional probabilities if any $μ\in C$ is chosen to generate the sequence. The contribution of this work is in characterizing the families $C$ for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family $C$, as well as in terms of local behaviour of the measures in $C$, which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family.

preprint · 2007 · arXiv

Using Data Compressors to Construct Rank Tests

Nonparametric rank tests for homogeneity and component independence are proposed, which are based on data compressors. For homogeneity testing the idea is to compress the binary string obtained by ordering the two joint samples and writing 0 if the element is from the first sample and 1 if it is from the second sample and breaking ties by randomization (extension to the case of multiple samples is straightforward). $H_0$ should be rejected if the string is compressed (to a certain degree) and accepted otherwise. We show that such a test obtained from an ideal data compressor is valid against all alternatives. Component independence is reduced to homogeneity testing by constructing two samples, one of which is the first half of the original and the other is the second half with one of the components randomly permuted.
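The homogeneity test described above can be sketched in a few lines. This is an illustrative approximation, not the authors' procedure: `zlib` stands in for the ideal data compressor, and the function names and the `slack_bits` threshold are hypothetical (the abstract only says the string must be compressed "to a certain degree", so the threshold here is not a calibrated significance level).

```python
import random
import zlib

def label_string(sample_a, sample_b, rng):
    """Order the pooled sample and emit 0 for first-sample elements,
    1 for second-sample elements; ties are broken at random."""
    pooled = [(x, rng.random(), 0) for x in sample_a]
    pooled += [(x, rng.random(), 1) for x in sample_b]
    pooled.sort(key=lambda item: (item[0], item[1]))
    return [label for _, _, label in pooled]

def compressed_bits(labels):
    """Pack the 0/1 labels into bytes and return the zlib output size in bits."""
    padded = labels + [0] * (-len(labels) % 8)
    packed = bytes(
        sum(bit << (7 - i) for i, bit in enumerate(padded[k:k + 8]))
        for k in range(0, len(padded), 8)
    )
    return 8 * len(zlib.compress(packed, 9))

def reject_homogeneity(sample_a, sample_b, slack_bits, seed=0):
    """Reject H0 (same distribution) when the label string compresses
    to fewer than len(labels) - slack_bits bits (assumed threshold)."""
    rng = random.Random(seed)
    labels = label_string(sample_a, sample_b, rng)
    return compressed_bits(labels) < len(labels) - slack_bits

if __name__ == "__main__":
    rng = random.Random(1)
    same_a = [rng.random() for _ in range(2000)]
    same_b = [rng.random() for _ in range(2000)]
    shifted_b = [x + 5.0 for x in same_b]
    print(reject_homogeneity(same_a, same_b, slack_bits=64))
    print(reject_homogeneity(same_a, shifted_b, slack_bits=64))
```

A general-purpose compressor carries fixed overhead, so a sketch like this is only meaningful for samples of at least a few thousand elements.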

preprint · 2005 · arXiv

On sample complexity for computational pattern recognition

In the statistical setting of the pattern recognition problem the number of examples required to approximate an unknown labelling function is linear in the VC dimension of the target learning class. In this work we consider the question whether such bounds exist if we restrict our attention to computable pattern recognition methods, assuming that the unknown labelling function is also computable. We find that in this case the number of examples required for a computable method to approximate the labelling function not only is not linear, but grows faster (in the VC dimension of the class) than any computable function. No time or space constraints are put on the predictors or target functions; the only resource we consider is the training examples. The task of pattern recognition is considered in conjunction with another learning problem -- data compression. An impossibility result for the task of data compression allows us to estimate the sample complexity for pattern recognition.

preprint · 2005 · arXiv

Pattern Recognition for Conditionally Independent Data

In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over to a weaker assumption, namely conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of a classification error achieved under the classical i.i.d. assumption can be generalised to similar estimates for the case of conditionally i.i.d. examples.