Source author record

William Chan

William Chan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning, Computation and Language, math.LO, Computer Vision, Neural and Evolutionary Computing, eess.AS, Sound, Artificial Intelligence, Digital Libraries, Information Retrieval, math.CO

Catalog footprint

What is connected

17 works

11 topics

4 close collaborators

Preprint, 2022, arXiv

A Graph Polynomial from Chromatic Symmetric Functions

This paper describes how many known graph polynomials arise from the coefficients of chromatic symmetric function expansions in different bases, and studies a new polynomial arising by expanding over a basis given by chromatic symmetric functions of trees.

Preprint, 2022, arXiv

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Preprint, 2022, arXiv

Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality

Diffusion models have emerged as an expressive family of generative models rivaling GANs in sample quality and autoregressive models in likelihood scores. Standard diffusion models typically require hundreds of forward passes through the model to generate a single high-fidelity sample. We introduce Differentiable Diffusion Sampler Search (DDSS): a method that optimizes fast samplers for any pre-trained diffusion model by differentiating through sample quality scores. We also present Generalized Gaussian Diffusion Models (GGDM), a family of flexible non-Markovian samplers for diffusion models. We show that optimizing the degrees of freedom of GGDM samplers by maximizing sample quality scores via gradient descent leads to improved sample quality. Our optimization procedure backpropagates through the sampling process using the reparametrization trick and gradient rematerialization. DDSS achieves strong results on unconditional image generation across various datasets (e.g., FID scores on LSUN church 128x128 of 11.6 with only 10 inference steps, and 4.82 with 20 steps, compared to 51.1 and 14.9 with strongest DDPM/DDIM baselines). Our method is compatible with any pre-trained diffusion model without fine-tuning or re-training required.

Preprint, 2022, arXiv

Palette: Image-to-Image Diffusion Models

This paper develops a unified framework for image-to-image translation based on conditional diffusion models and evaluates this framework on four challenging image-to-image translation tasks, namely colorization, inpainting, uncropping, and JPEG restoration. Our simple implementation of image-to-image diffusion models outperforms strong GAN and regression baselines on all tasks, without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss or sophisticated new techniques needed. We uncover the impact of an L2 vs. L1 loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention in the neural architecture through empirical studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, with human evaluation and sample quality scores (FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against original images). We expect this standardized evaluation protocol to play a role in advancing image-to-image translation research. Finally, we show that a generalist, multi-task diffusion model performs as well as or better than task-specific specialist counterparts. Check out https://diffusion-palette.github.io for an overview of the results.

Preprint, 2022, arXiv

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.

Preprint, 2022, arXiv

Video Diffusion Models

Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/

Preprint, 2020, arXiv

An Introduction to Combinatorics of Determinacy

This article is an introduction to combinatorics under the axiom of determinacy with a focus on partition properties and infinity Borel codes.

Preprint, 2020, arXiv

Imputer: Sequence Modelling via Imputation and Dynamic Programming

This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations. The Imputer is an iterative generative model, requiring only a constant number of generation steps independent of the number of input or output tokens. The Imputer can be trained to approximately marginalize over all possible alignments between the input and output sequences, and all possible generation orders. We present a tractable dynamic programming training algorithm, which yields a lower bound on the log marginal likelihood. When applied to end-to-end speech recognition, the Imputer outperforms prior non-autoregressive models and achieves results competitive with autoregressive models. On LibriSpeech test-other, the Imputer achieves 11.1 WER, outperforming CTC at 13.0 WER and seq2seq at 12.5 WER.

Preprint, 2020, arXiv

Insertion-Deletion Transformer

We propose the Insertion-Deletion Transformer, a novel transformer-based neural architecture and training method for sequence generation. The model consists of two phases that are executed iteratively, 1) an insertion phase and 2) a deletion phase. The insertion phase parameterizes a distribution of insertions on the current output hypothesis, while the deletion phase parameterizes a distribution of deletions over the current output hypothesis. The training method is a principled and simple algorithm, where the deletion model obtains its signal directly on-policy from the insertion model output. We demonstrate the effectiveness of our Insertion-Deletion Transformer on synthetic translation tasks, obtaining significant BLEU score improvement over an insertion-only model.

Preprint, 2016, arXiv

The Countable Admissible Ordinal Equivalence Relation

Let $F_{ω_1}$ be the countable admissible ordinal equivalence relation defined on ${}^ω2$ by $x \ F_{ω_1} \ y$ if and only if $ω_1^x = ω_1^y$. It will be shown that $F_{ω_1}$ is classifiable by countable structures and must be classified by structures of high Scott rank. If $E$ and $F$ are equivalence relations, then $E$ is almost Borel reducible to $F$ if and only if there is a Borel reduction of $E$ to $F$, except possibly on countably many $E$-classes. Let $E_{ω_1}$ denote the equivalence of order types of reals coding well-orderings. It will be shown that in the constructible universe $L$ and set generic extensions of $L$, $E_{ω_1}$ is not almost Borel reducible to $F_{ω_1}$, although a result of Zapletal implies such an almost Borel reduction exists if there is a measurable cardinal. Lastly, it will be shown that the isomorphism relation induced by a counterexample to Vaught's conjecture cannot be Borel reducible to $F_{ω_1}$ in $L$ and set generic extensions of $L$. This shows the consistency of a negative answer to a question of Sy-David Friedman.

Preprint, 2016, arXiv

Very Deep Convolutional Networks for End-to-End Speech Recognition

Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure in the feature space and add computational depth without overfitting issues. We experiment with the WSJ ASR task and achieve 10.5% word error rate without any dictionary or language model using a 15 layer deep network.

Preprint, 2016, arXiv

When an Equivalence Relation with All Borel Classes will be Borel Somewhere?

In $\mathsf{ZFC}$, if there is a measurable cardinal with infinitely many Woodin cardinals below it, then for every equivalence relation $E \in L(\mathbb{R})$ on $\mathbb{R}$ with all $\mathbf{\Delta}_1^1$ classes and every $σ$-ideal $I$ on $\mathbb{R}$ so that the associated forcing $\mathbb{P}_I$ of $I^+$ $\mathbf{\Delta}_1^1$ subsets is proper, there exists some $I^+$ $\mathbf{\Delta}_1^1$ set $C$ so that $E \upharpoonright C$ is a $\mathbf{\Delta}_1^1$ equivalence relation. In $\mathsf{ZF} + \mathsf{DC} + \mathsf{AD}_\mathbb{R} + V = L(\mathscr{P}(\mathbb{R}))$, for every equivalence relation $E$ on $\mathbb{R}$ with all $\mathbf{\Delta}_1^1$ classes and every $σ$-ideal $I$ on $\mathbb{R}$ so that the associated forcing $\mathbb{P}_I$ is proper, there is some $I^+$ $\mathbf{\Delta}_1^1$ set $C$ so that $E \upharpoonright C$ is a $\mathbf{\Delta}_1^1$ equivalence relation.

Preprint, 2015, arXiv

Deep Recurrent Neural Networks for Acoustic Modelling

We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM) and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.

Preprint, 2015, arXiv

Equivalence Relations Which Are Borel Somewhere

The following will be shown: Let $I$ be a $σ$-ideal on a Polish space $X$ with the property that the associated forcing of $I^+$ Borel subsets ordered by $\subseteq$ is a proper forcing. Let $E$ be an analytic or coanalytic equivalence relation on this Polish space with all equivalence classes Borel. If sharps of certain sets exist, then there is an $I^+$ Borel subset $C$ of $X$ such that $E \upharpoonright C$ is a Borel equivalence relation.

Preprint, 2015, arXiv

Listen, Attend and Spell

We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.

Preprint, 2015, arXiv

Transferring Knowledge from a RNN to a DNN

Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline 4.54 WER, or more than 13% relative improvement.
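
As a rough illustration of the abstract above, the following minimal Python sketch computes a Kullback-Leibler divergence between a teacher's soft targets and a student's predicted distribution, and checks the quoted relative WER improvement. The function names and the logit values are hypothetical and chosen only for illustration. Treating the teacher (RNN) distribution as the reference in the KL term is an assumption, since the abstract does not state the direction.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same states."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def relative_improvement(baseline, new):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - new) / baseline

# Hypothetical logits for one frame over three acoustic states (illustrative values only).
teacher_probs = softmax([2.0, 0.5, -1.0])   # soft alignment from the large RNN
student_probs = softmax([1.0, 1.0, 0.0])    # prediction of the small DNN

# Distillation loss for this frame: the student is trained to reduce this value.
loss = kl_divergence(teacher_probs, student_probs)

# WER figures quoted in the abstract: 4.54 baseline, 3.93 after distillation.
wer_gain = relative_improvement(4.54, 3.93)
```

With the figures from the abstract, `wer_gain` comes out slightly above 0.13, which agrees with the stated "more than 13% relative improvement".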

Preprint, 2012, arXiv

Leveraging Subjective Human Annotation for Clustering Historic Newspaper Articles

The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the newspapers are scanned and high resolution Optical Character Recognition (OCR) software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, categorization of articles provided by the OCR engine is rudimentary and a large number of the articles are labeled editorial without further grouping. Manually sorting articles into fine-grained categories is time consuming if not impossible given the size of the corpus. This paper studies techniques for automatic categorization of newspaper articles so as to enhance search and retrieval on the archive. We explore unsupervised (e.g. KMeans) and semi-supervised (e.g. constrained clustering) learning algorithms to develop article categorization schemes geared towards the needs of end-users. A pilot study was designed to understand whether there was unanimous agreement amongst patrons regarding how articles can be categorized. It was found that the task was very subjective and consequently automated algorithms that could deal with subjective labels were used. While the small scale pilot study was extremely helpful in designing machine learning algorithms, a much larger system needs to be developed to collect annotations from users of the archive. The "BODHI" system currently being developed is a step in that direction, allowing users to correct wrongly scanned OCR and to provide keywords and tags for frequently used newspaper articles. On successful implementation of the beta version of this system, we hope that it can be integrated with existing software being developed for the Chronicling America project.
