Source author record

Maxim Krikun

Maxim Krikun appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computation and Language, Machine Learning, Artificial Intelligence, math.CO, math.PR, Populations and Evolution

Catalog footprint

What is connected

9 works

6 topics

4 close collaborators

preprint · 2022 · arXiv

Building Machine Translation Systems for the Next Thousand Languages

In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.

preprint · 2022 · arXiv

Data Scaling Laws in NMT: The Effect of Noise and Architecture

In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. In particular, we change the following: (1) architecture and task setup: we compare to a transformer-LSTM hybrid and a decoder-only transformer with a language modeling loss; (2) noise level in the training distribution: we experiment with filtering and with adding iid synthetic noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data can significantly degrade the scaling exponent.
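The power-law claim in this abstract can be illustrated with a small fit. The sketch below assumes a pure power law, loss = a * n^(-p), with no irreducible-loss offset; the abstract does not give the exact functional form, so this form, the function name `fit_power_law`, and the synthetic numbers are all illustrative assumptions rather than the paper's method or data.

```python
import math

def fit_power_law(samples, losses):
    """Fit loss = a * n**(-p) by least squares in log-log space.

    Hypothetical helper: assumes a pure power law without an offset term.
    Returns the pair (a, p).
    """
    xs = [math.log(n) for n in samples]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # equals -p under the assumed form
    intercept = mean_y - slope * mean_x  # equals log(a)
    return math.exp(intercept), -slope

# Synthetic, made-up points generated from a = 3.0 and p = 0.25.
sizes = [10**4, 10**5, 10**6, 10**7]
synthetic_losses = [3.0 * n ** -0.25 for n in sizes]
a, p = fit_power_law(sizes, synthetic_losses)
```

Under this assumed form, a "minimally impacted" scaling exponent would mean that `p` stays roughly the same across setups while `a` may shift.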

preprint · 2022 · arXiv

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.

preprint · 2022 · arXiv

LaMDA: Language Models for Dialog Applications

We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it yields smaller improvements in safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.

preprint · 2020 · arXiv

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

preprint · 2016 · arXiv

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
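The length-normalization and coverage-penalty idea mentioned in this abstract can be sketched as a beam rescoring function. The abstract does not state the formulas; the form below (a length penalty of ((5 + |Y|) / 6)^alpha and a coverage term summing the log of capped per-source-word attention mass) is an assumption made here for illustration, and the names `length_penalty`, `coverage_penalty`, `rescore`, `alpha`, and `beta` are hypothetical.

```python
import math

def length_penalty(target_length, alpha):
    """Assumed form: ((5 + |Y|) / (5 + 1)) ** alpha; equals 1 when |Y| = 1."""
    return ((5 + target_length) ** alpha) / ((5 + 1) ** alpha)

def coverage_penalty(attention, beta):
    """Penalize source words that received little total attention.

    attention[j][i] is the attention weight of target step j on source word i.
    Assumes every source word has strictly positive total attention
    (true for softmax attention); otherwise log(0) raises ValueError.
    """
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(step[i] for step in attention)
        total += math.log(min(covered, 1.0))  # fully covered words add 0
    return beta * total

def rescore(log_prob, attention, alpha, beta):
    """Length-normalized log-probability plus the coverage term."""
    target_length = len(attention)
    return log_prob / length_penalty(target_length, alpha) + coverage_penalty(attention, beta)

# Made-up example: two target steps, two source words, full coverage.
full = [[1.0, 0.0], [0.0, 1.0]]
score = rescore(-2.0, full, alpha=1.0, beta=0.5)
```

With `alpha = 0` and `beta = 0` the function reduces to the raw log-probability, which makes the two terms easy to ablate separately.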

preprint · 2013 · arXiv

Five Statistical Questions about the Tree of Life

Stochastic modeling of phylogenies raises five questions that have received varying levels of attention from quantitatively inclined biologists. 1) How large do we expect (from the model) the ratio of maximum historical diversity to current diversity to be? 2) From a correct phylogeny of the extant species of a clade, what can we deduce about past speciation and extinction rates? 3) What proportion of extant species are in fact descendants of still-extant ancestral species, and how does this compare with predictions of models? 4) When one moves from trees on species to trees on sets of species (whether traditional higher order taxa or clades from PhyloCode), does one expect trees to become more unbiased as a purely logical consequence of tree structure, without signifying any real biological phenomenon? 5) How do we expect that fluctuation rates for counts of higher order taxa should compare with fluctuation rates for number of species? We present a mathematician's view based on an oversimplified modeling framework in which all these questions can be studied coherently.

preprint · 2012 · arXiv

Birth and death processes on certain random trees: Classification and stationary laws

The main substance of the paper concerns the growth rate and the classification (ergodicity, transience) of a family of random trees. In the basic model, new edges appear according to a Poisson process of parameter $λ$ and leaves can be deleted at a rate $μ$. The main results lay the stress on the famous number $e$. A complete classification of the process is given in terms of the intensity factor $ρ=λ/μ$: it is ergodic if $ρ\leq e^{-1}$, and transient if $ρ>e^{-1}$. There is a phase transition phenomenon: the usual region of null recurrence (in the parameter space) here does not exist. This fact is rare for countable Markov chains with exponentially distributed jumps. Some basic stationary laws are computed, e.g. the number of vertices and the height. Various bounds, limit laws and ergodic-like theorems are obtained, both for the transient and ergodic regimes. In particular, when the system is transient, the height of the tree grows linearly as the time $t\to\infty$, at a rate which is explicitly computed. Some of the results are extended to the so-called multiclass model.
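The classification stated in this abstract reduces to a single threshold test on the intensity factor. A minimal sketch, assuming only what the abstract gives (ergodic when ρ = λ/μ is at most 1/e, transient otherwise); the function name `classify_tree_process` and its string return values are hypothetical.

```python
import math

def classify_tree_process(lam, mu):
    """Classify the basic model by rho = lam / mu, per the stated threshold.

    lam: rate of the Poisson process adding edges (must be positive).
    mu:  rate at which leaves are deleted (must be positive).
    """
    if lam <= 0 or mu <= 0:
        raise ValueError("rates must be positive")
    rho = lam / mu
    # No null-recurrent region: the boundary case rho == 1/e is ergodic.
    return "ergodic" if rho <= math.exp(-1) else "transient"

# 1/3 is below 1/e (about 0.368); 1/2 is above it.
slow_growth = classify_tree_process(1.0, 3.0)
fast_growth = classify_tree_process(1.0, 2.0)
```

The comparison is done in floating point, so values of ρ extremely close to 1/e may land on either side of the threshold.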

preprint · 2007 · arXiv

Explicit enumeration of triangulations with multiple boundaries

We enumerate rooted triangulations of a sphere with multiple holes by the total number of edges and the length of each boundary component. The proof relies on a combinatorial identity due to W.T. Tutte.
