Source author record

Holden Lee

Holden Lee appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Machine Learning, Data Structures and Algorithms, math.PR, Computation and Language, Computational Complexity, eess.SY, math.NT, math.OC, math.ST, Statistics Theory, Systems and Control

Catalog footprint

What is connected

8 works

11 topics

4 close collaborators


preprint, 2023, arXiv

Principled Gradient-based Markov Chain Monte Carlo for Text Generation

Recent papers have demonstrated the possibility of energy-based text generation by adapting gradient-based sampling algorithms, a paradigm of MCMC algorithms that promises fast convergence. However, as we show in this paper, previous attempts at this approach to text generation all fail to sample correctly from the target language model distributions. To address this limitation, we consider the problem of designing text samplers that are faithful, meaning that they have the target text distribution as their limiting distribution. We propose several faithful gradient-based sampling algorithms to sample from the target energy-based text distribution correctly, and study their theoretical properties. Through experiments on various forms of text generation, we demonstrate that faithful samplers are able to generate more fluent text while adhering to the control objectives better.
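To make "faithful" concrete, here is a minimal sketch on a toy four-state space. It uses a Metropolis-Hastings correction, which is one standard way to make the target the limiting distribution; it is not claimed to be the paper's construction. The energies and the random proposal matrix `Q` are hypothetical stand-ins.

```python
import numpy as np


def mh_kernel(pi, Q):
    """Transition matrix of Metropolis-Hastings with target pi and proposal Q."""
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and Q[i, j] > 0:
                # Accept the move i -> j with the usual MH ratio.
                accept = min(1.0, (pi[j] * Q[j, i]) / (pi[i] * Q[i, j]))
                P[i, j] = Q[i, j] * accept
        P[i, i] = 1.0 - P[i].sum()  # rejected proposals stay in place
    return P


# Hypothetical energies of four candidate "texts"; target is pi ∝ exp(-energy).
energies = np.array([0.0, 1.0, 2.0, 3.0])
pi = np.exp(-energies) / np.exp(-energies).sum()

# Arbitrary row-stochastic proposal (assumption: any strictly positive Q works here).
rng = np.random.default_rng(0)
Q = rng.random((4, 4))
Q /= Q.sum(axis=1, keepdims=True)

P = mh_kernel(pi, Q)
print("proposal alone keeps pi stationary:", np.allclose(pi @ Q, pi))
print("MH-corrected kernel keeps pi stationary:", np.allclose(pi @ P, pi))
```

The uncorrected proposal has its own stationary distribution, so running it alone samples from the wrong target; the corrected kernel satisfies detailed balance with respect to `pi`.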

preprint, 2022, arXiv

Improved rates for prediction and identification of partially observed linear dynamical systems

Identification of a linear time-invariant dynamical system from partial observations is a fundamental problem in control theory. Particularly challenging are systems exhibiting long-term memory. A natural question is how to learn such systems with non-asymptotic statistical rates depending on the inherent dimensionality (order) $d$ of the system, rather than on the possibly much larger memory length. We propose an algorithm that, given a single trajectory of length $T$ with Gaussian observation noise, learns the system with a near-optimal rate of $\widetilde O\left(\sqrt{\frac{d}{T}}\right)$ in $\mathcal{H}_2$ error, with only logarithmic, rather than polynomial, dependence on memory length. We also give bounds under process noise and improved bounds for learning a realization of the system. Our algorithm is based on multi-scale low-rank approximation: SVD applied to Hankel matrices of geometrically increasing sizes. Our analysis relies on careful application of concentration bounds in the Fourier domain -- we give sharper concentration bounds for sample covariance of correlated inputs and for $\mathcal H_\infty$ norm estimation, which may be of independent interest.
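The low-rank structure that the Hankel SVD exploits can be seen in a noiseless, single-scale sketch: the Hankel matrix of the impulse response of an order-$d$ system has rank $d$. The system matrices `A`, `B`, `C` and the window size `L` below are hypothetical, and the sketch omits the noise handling and the multi-scale scheme described in the abstract.

```python
import numpy as np

# Hypothetical order-3 system: x_{t+1} = A x_t + B u_t, y_t = C x_t.
A = np.diag([0.9, 0.5, -0.3])
B = np.ones((3, 1))
C = np.ones((1, 3))

L = 10  # Hankel window size (assumption; must exceed the system order)

# Impulse response (Markov parameters) h_k = C A^k B for k = 0 .. 2L-2.
h = np.array(
    [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(2 * L - 1)]
)

# Hankel matrix with entries H[i, j] = h[i + j].
idx = np.add.outer(np.arange(L), np.arange(L))
H = h[idx]

# The number of non-negligible singular values recovers the order.
s = np.linalg.svd(H, compute_uv=False)
order = int(np.sum(s > 1e-8 * s[0]))
print("singular values:", np.round(s[:5], 6))
print("estimated order:", order)
```

With noisy data the trailing singular values are no longer near zero, which is where a truncation rule and concentration bounds become necessary.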

preprint, 2022, arXiv

Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods

We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.

preprint, 2020, arXiv

Estimating Normalizing Constants for Log-Concave Distributions: Algorithms and Lower Bounds

Estimating the normalizing constant of an unnormalized probability distribution has important applications in computer science, statistical physics, machine learning, and statistics. In this work, we consider the problem of estimating the normalizing constant $Z=\int_{\mathbb{R}^d} e^{-f(x)}\,\mathrm{d}x$ to within a multiplicative factor of $1 \pm \varepsilon$ for a $\mu$-strongly convex and $L$-smooth function $f$, given query access to $f(x)$ and $\nabla f(x)$. We give both algorithms and lower bounds for this problem. Using an annealing algorithm combined with a multilevel Monte Carlo method based on underdamped Langevin dynamics, we show that $\widetilde{\mathcal{O}}\Bigl(\frac{d^{4/3}\kappa + d^{7/6}\kappa^{7/6}}{\varepsilon^2}\Bigr)$ queries to $\nabla f$ are sufficient, where $\kappa = L / \mu$ is the condition number. Moreover, we provide an information-theoretic lower bound, showing that at least $\frac{d^{1-o(1)}}{\varepsilon^{2-o(1)}}$ queries are necessary. This provides a first nontrivial lower bound for the problem.
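The annealing idea can be illustrated with a toy sketch: write $Z$ as a known starting constant times a telescoping product of ratios $Z_{i+1}/Z_i$, each estimated as an expectation under the previous level. This sketch assumes $f(x)=\|x\|^2/2$ so that every intermediate distribution is Gaussian and can be sampled exactly; it uses plain Monte Carlo in place of the multilevel method and underdamped Langevin sampler named in the abstract. The schedule `precisions` and the sample size are arbitrary choices.

```python
import numpy as np


def annealed_log_z(precisions, dim, n_samples, rng):
    """Estimate log Z for f(x) = |x|^2 / 2 via annealing.

    Level i uses f_i(x) = a_i |x|^2 / 2 with a_i = precisions[i], decreasing
    to a_K = 1. Each ratio is Z_{i+1} / Z_i = E_{p_i}[exp(-(f_{i+1} - f_i))].
    """
    # Z_0 is a Gaussian integral with precision a_0, known in closed form.
    log_z = 0.5 * dim * np.log(2.0 * np.pi / precisions[0])
    for a, a_next in zip(precisions[:-1], precisions[1:]):
        # Exact samples from p_i ∝ exp(-a |x|^2 / 2) (stand-in for a Langevin sampler).
        x = rng.normal(scale=1.0 / np.sqrt(a), size=(n_samples, dim))
        sq_norm = np.sum(x ** 2, axis=1)
        log_z += np.log(np.mean(np.exp(0.5 * (a - a_next) * sq_norm)))
    return log_z


dim = 2
precisions = np.geomspace(16.0, 1.0, 17)  # slow geometric schedule keeps variance small
rng = np.random.default_rng(0)
estimate = annealed_log_z(precisions, dim, n_samples=20_000, rng=rng)
exact = 0.5 * dim * np.log(2.0 * np.pi)
print(f"estimated log Z = {estimate:.4f}, exact log Z = {exact:.4f}")
```

The schedule matters: if consecutive precisions differ by a factor of 2 or more, the ratio estimator in this toy setting has infinite variance.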

preprint, 2020, arXiv

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima -- at least those discovered by gradient-based optimization -- turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piecewise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.

preprint, 2020, arXiv

Simulated Tempering Langevin Monte Carlo II: An Improved Proof using Soft Markov Chain Decomposition

A key task in Bayesian machine learning is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). One prevalent example of this is sampling posteriors in parametric distributions, such as latent-variable generative models. However, sampling (even very approximately) can be #P-hard. Classical results going back to Bakry and Émery (1985) on sampling focus on log-concave distributions, and show that a natural Markov chain called Langevin diffusion mixes in polynomial time. However, all log-concave distributions are unimodal, while in practice it is very common for the distribution of interest to have multiple modes. In this case, Langevin diffusion suffers from torpid mixing. We address this problem by combining Langevin diffusion with simulated tempering. The result is a Markov chain that mixes more rapidly by transitioning between different temperatures of the distribution. We analyze this Markov chain for a mixture of (strongly) log-concave distributions of the same shape. In particular, our technique applies to the canonical multi-modal distribution: a mixture of Gaussians (of equal variance). Our algorithm efficiently samples from these distributions given only access to the gradient of the log-pdf. For the analysis, we introduce novel techniques for proving spectral gaps based on decomposing the action of the generator of the diffusion. Previous approaches rely on decomposing the state space as a partition of sets, while our approach can be thought of as decomposing the stationary measure as a mixture of distributions (a "soft partition"). Additional materials for the paper can be found at http://holdenlee.github.io/Simulated%20tempering%20Langevin%20Monte%20Carlo.html. The proof and results have been improved and generalized from the precursor at arXiv:1710.02736.
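For orientation, the two ingredients can be written in their standard textbook form. The notation is an assumption of this note rather than the paper's: $f$ is the negative log-density up to a constant, $\beta_1 < \dots < \beta_K = 1$ are inverse temperatures, and $r_i > 0$ are level weights.

```latex
\begin{align*}
  % Langevin diffusion with stationary density p(x) \propto e^{-f(x)}
  \mathrm{d}X_t &= -\nabla f(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t, \\
  % Tempered target at level i
  p_i(x) &\propto e^{-\beta_i f(x)}, \qquad i = 1, \dots, K, \\
  % Metropolis acceptance probability for a level change i \to j at fixed x
  \alpha_{i \to j}(x) &= \min\left\{ 1,\ \frac{r_j\, e^{-\beta_j f(x)}}{r_i\, e^{-\beta_i f(x)}} \right\}.
\end{align*}
```

Within a level the chain follows the Langevin dynamics for $\beta_i f$; across levels it proposes a change of $i$ and accepts with probability $\alpha_{i \to j}$, so the joint stationary density on $(x, i)$ is proportional to $r_i e^{-\beta_i f(x)}$. Low $\beta_i$ flattens the landscape, which is what lets the chain cross between modes.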

preprint, 2015, arXiv

$\ell$-adic properties of partition functions

Folsom, Kent, and Ono used the theory of modular forms modulo $\ell$ to establish remarkable "self-similarity" properties of the partition function and give an overarching explanation of many partition congruences. We generalize their work to analyze powers $p_r$ of the partition function as well as Andrews's spt-function. By showing that certain generating functions reside in a small space made up of reductions of modular forms, we set up a general framework for congruences for $p_r$ and spt on arithmetic progressions of the form $\ell^m n+\delta$ modulo powers of $\ell$. Our work gives a conceptual explanation of the exceptional congruences of $p_r$ observed by Boylan, as well as striking congruences of spt modulo 5, 7, and 13 recently discovered by Andrews and Garvan.

preprint, 2015, arXiv

Quadratic polynomials of small modulus cannot represent OR

An open problem in complexity theory is to find the minimal degree of a polynomial representing the $n$-bit OR function modulo composite $m$. This problem is related to understanding the power of circuits with $\text{MOD}_m$ gates where $m$ is composite. The OR function is of particular interest because it is the simplest function not amenable to bounds from communication complexity. Tardos and Barrington established a lower bound of $\Omega((\log n)^{O_m(1)})$, and Barrington, Beigel, and Rudich established an upper bound of $n^{O_m(1)}$. No progress has been made on closing this gap for twenty years, and progress will likely require new techniques. We make progress on this question viewed from a different perspective: rather than fixing the modulus $m$ and bounding the minimum degree $d$ in terms of the number of variables $n$, we fix the degree $d$ and bound $n$ in terms of the modulus $m$. For degree $d=2$, we prove a quasipolynomial bound of $n\le m^{O(d)}\le m^{O(\log m)}$, improving the previous best bound of $2^{O(m)}$ implied by Tardos and Barrington's general bound. To understand the computational power of quadratic polynomials modulo $m$, we introduce a certain dichotomy which may be of independent interest. Namely, we define a notion of Boolean rank of a quadratic polynomial $f$ and relate it to the notion of diagonal rigidity. Using additive combinatorics, we show that when the rank is low, $f(\mathbf x)=0$ must have many solutions. Using techniques from exponential sums, we show that when the rank of $f$ is high, $f$ is close to equidistributed. In either case, $f$ cannot represent the OR function in many variables.
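A brute-force checker makes the notion of "representing OR" concrete for small $n$. It assumes the usual weak-representation convention: $f$ represents OR modulo $m$ if, on inputs in $\{0,1\}^n$, $f(\mathbf x) \equiv 0 \pmod m$ exactly when $\mathbf x$ is all zeros. The two example polynomials are illustrative choices, not constructions from the paper.

```python
from itertools import product


def represents_or(poly, n, m):
    """Check whether poly is 0 mod m exactly on the all-zeros input of {0,1}^n."""
    for x in product((0, 1), repeat=n):
        is_zero_mod_m = poly(x) % m == 0
        if is_zero_mod_m != (not any(x)):
            return False
    return True


def linear_sum(x):
    """Degree-1 example: x_1 + ... + x_n."""
    return sum(x)


def quadratic_example(x):
    """Degree-2 example: 3w + 4w^2 with w = x_1 + ... + x_n."""
    w = sum(x)
    return 3 * w + 4 * w * w


for n in (5, 6):
    print(n, represents_or(linear_sum, n, 6), represents_or(quadratic_example, n, 6))
```

Modulo 6, the sum polynomial works for five variables and fails for six, because the all-ones input sums to 6. The quadratic example fails at six variables for the same reason, which illustrates why bounding $n$ in terms of $m$ is the natural question once the degree is fixed.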
