Source author record

Uri Alon

Uri Alon appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Biological Physics Computation and Language Biomolecules Molecular Networks Programming Languages Quantitative Methods Genomics Information Theory math.IT math.OC nlin.CD Populations and Evolution Systems and Control

Catalog footprint

What is connected

13works

14topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Why do Nearest Neighbor Language Models Work?

Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. However recently, retrieval-augmented LMs have shown to improve over standard neural LMs, by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically why k-nearest neighbor language models (kNN-LMs) perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform a careful analysis of the various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate these insights into the model architecture or the training procedure of the standard parametric LM, improving its results without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.

preprint2022arXiv

A Systematic Evaluation of Large Language Models of Code

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.

preprint2022arXiv

How Attentive are Graph Attention Networks?

Graph Attention Networks (GATs) are one of the most popular GNN architectures and are considered as the state-of-the-art architecture for representation learning with graphs. In GAT, every node attends to its neighbors given its own representation as the query. However, in this paper we show that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express: in a controlled problem, we show that static attention hinders GAT from even fitting the training data. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2: a dynamic graph attention variant that is strictly more expressive than GAT. We perform an extensive evaluation and show that GATv2 outperforms GAT across 11 OGB and other benchmarks while we match their parametric costs. Our code is available at https://github.com/tech-srl/how_attentive_are_gats . GATv2 is available as part of the PyTorch Geometric library, the Deep Graph Library, and the TensorFlow GNN library.

preprint2022arXiv

Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over $k$NN-LM (Khandelwal et al., 2020) without hurting perplexity. Our code and trained models are available at https://github.com/neulab/retomaton .

preprint2022arXiv

Oversquashing in GNNs through the lens of information contraction and graph expansion

The quality of signal propagation in message-passing graph neural networks (GNNs) strongly influences their expressivity as has been observed in recent works. In particular, for prediction tasks relying on long-range interactions, recursive aggregation of node features can lead to an undesired phenomenon called "oversquashing". We present a framework for analyzing oversquashing based on information contraction. Our analysis is guided by a model of reliable computation due to von Neumann that lends a new insight into oversquashing as signal quenching in noisy computation graphs. Building on this, we propose a graph rewiring algorithm aimed at alleviating oversquashing. Our algorithm employs a random local edge flip primitive motivated by an expander graph construction. We compare the spectral expansion properties of our algorithm with that of an existing curvature-based non-local rewiring strategy. Synthetic experiments show that while our algorithm in general has a slower rate of expansion, it is overall computationally cheaper, preserves the node degrees exactly and never disconnects the graph.

preprint2021arXiv

On the Bottleneck of Graph Neural Networks and its Practical Implications

Since the proposal of the graph neural network (GNN) by Gori et al. (2005) and Scarselli et al. (2008), one of the major problems in training GNNs was their struggle to propagate information between distant nodes in the graph. We propose a new explanation for this problem: GNNs are susceptible to a bottleneck when aggregating messages across a long path. This bottleneck causes the over-squashing of exponentially growing information into fixed-size vectors. As a result, GNNs fail to propagate messages originating from distant nodes and perform poorly when the prediction task depends on long-range interaction. In this paper, we highlight the inherent problem of over-squashing in GNNs: we demonstrate that the bottleneck hinders popular GNNs from fitting long-range signals in the training data; we further show that GNNs that absorb incoming edges equally, such as GCN and GIN, are more susceptible to over-squashing than GAT and GGNN; finally, we show that prior work, which extensively tuned GNN models of long-range problems, suffers from over-squashing, and that breaking the bottleneck improves their state-of-the-art results without any tuning or additional weights. Our code is available at https://github.com/tech-srl/bottleneck/ .

preprint2020arXiv

Structural Language Models of Code

We address the problem of any-code completion - generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous techniques that have severely restricted the kinds of expressions that can be generated in this task, our approach can generate arbitrary code in any programming language. Our model significantly outperforms both seq2seq and a variety of structured approaches in generating Java and C# code. Our code, data, and trained models are available at http://github.com/tech-srl/slm-code-generation/ . An online demo is available at http://AnyCodeGen.org .

preprint2014arXiv

Evolution of bow-tie architectures in biology

Bow-tie or hourglass structure is a common architectural feature found in biological and technological networks. A bow-tie in a multi-layered structure occurs when intermediate layers have much fewer components than the input and output layers. Examples include metabolism where a handful of building blocks mediate between multiple input nutrients and multiple output biomass components, and signaling networks where information from numerous receptor types passes through a small set of signaling pathways to regulate multiple output genes. Little is known, however, about how bow-tie architectures evolve. Here, we address the evolution of bow-tie architectures using simulations of multi-layered systems evolving to fulfill a given input-output goal. We find that bow-ties spontaneously evolve when two conditions are met: (i) the evolutionary goal is rank deficient, where the rank corresponds to the minimal number of input features on which the outputs depend, and (ii) The effects of mutations on interaction intensities between components are described by product rule - namely the mutated element is multiplied by a random number. Product-rule mutations are more biologically realistic than the commonly used sum-rule mutations that add a random number to the mutated element. These conditions robustly lead to bow-tie structures. The minimal width of the intermediate network layers (the waist or knot of the bow-tie) equals the rank of the evolutionary goal. These findings can help explain the presence of bow-ties in diverse biological systems, and can also be relevant for machine learning applications that employ multi-layered networks.

preprint2013arXiv

Entrainment of noise-induced and limit cycle oscillators under weak noise

Theoretical models that describe oscillations in biological systems are often either a limit cycle oscillator, where the deterministic nonlinear dynamics gives sustained periodic oscillations, or a noise-induced oscillator, where a fixed point is linearly stable with complex eigenvalues and addition of noise gives oscillations around the fixed point with fluctuating amplitude. We investigate how each class of model behaves under the external periodic forcing, taking the well-studied van der Pol equation as an example. We find that, when the forcing is additive, the noise-induced oscillator can show only one-to-one entrainment to the external frequency, in contrast to the limit cycle oscillator which is known to entrain to any ratio. When the external forcing is multiplicative, on the other hand, the noise-induced oscillator can show entrainment to a few ratios other than one-to-one, while the limit cycle oscillator shows entrain to any ratio. The noise blurs the entrainment in general, but clear entrainment regions for limit cycles can be identified as long as the noise is not too strong.

preprint2013arXiv

Mutation Rules and the Evolution of Sparseness and Modularity in Biological Systems

Biological systems exhibit two structural features on many levels of organization: sparseness, in which only a small fraction of possible interactions between components actually occur; and modularity - the near decomposability of the system into modules with distinct functionality. Recent work suggests that modularity can evolve in a variety of circumstances, including goals that vary in time such that they share the same subgoals (modularly varying goals), or when connections are costly. Here, we studied the origin of modularity and sparseness focusing on the nature of the mutation process, rather than on connection cost or variations in the goal. We use simulations of evolution with different mutation rules. We found that commonly used sum-rule mutations, in which interactions are mutated by adding random numbers, do not lead to modularity or sparseness except for in special situations. In contrast, product-rule mutations in which interactions are mutated by multiplying by random numbers - a better model for the effects of biological mutations - led to sparseness naturally. When the goals of evolution are modular, in the sense that specific groups of inputs affect specific groups of outputs, product-rule mutations also lead to modular structure; sum-rule mutations do not. Product-rule mutations generate sparseness and modularity because they tend to reduce interactions, and to keep small interaction terms small.

preprint2010arXiv

Coding limits on the number of transcription factors

Transcription factor proteins bind specific DNA sequences to control the expression of genes. They contain DNA binding domains which belong to several super-families, each with a specific mechanism of DNA binding. The total number of transcription factors encoded in a genome increases with the number of genes in the genome. Here, we examined the number of transcription factors from each super-family in diverse organisms. We find that the number of transcription factors from most super-families appears to be bounded. For example, the number of winged helix factors does not generally exceed 300, even in very large genomes. The magnitude of the maximal number of transcription factors from each super-family seems to correlate with the number of DNA bases effectively recognized by the binding mechanism of that super-family. Coding theory predicts that such upper bounds on the number of transcription factors should exist, in order to minimize cross-binding errors between transcription factors. This theory further predicts that factors with similar binding sequences should tend to have similar biological effect, so that errors based on mis-recognition are minimal. We present evidence that transcription factors with similar binding sequences tend to regulate genes with similar biological functions, supporting this prediction. The present study suggests limits on the transcription factor repertoire of cells, and suggests coding constraints that might apply more generally to the mapping between binding sites and biological function.

preprint2010arXiv

Rules for biological regulation based on error minimization

The control of gene expression involves complex mechanisms that show large variation in design. For example, genes can be turned on either by the binding of an activator (positive control) or the unbinding of a repressor (negative control). What determines the choice of mode of control for each gene? This study proposes rules for gene regulation based on the assumption that free regulatory sites are exposed to nonspecific binding errors, whereas sites bound to their cognate regulators are protected from errors. Hence, the selected mechanisms keep the sites bound to their designated regulators for most of the time, thus minimizing fitness-reducing errors. This offers an explanation of the empirically demonstrated Savageau demand rule: Genes that are needed often in the natural environment tend to be regulated by activators, and rarely needed genes tend to be regulated by repressors; in both cases, sites are bound for most of the time, and errors are minimized. The fitness advantage of error minimization appears to be readily selectable. The present approach can also generate rules for multi-regulator systems. The error-minimization framework raises several experimentally testable hypotheses. It may also apply to other biological regulation systems, such as those involving protein-protein interactions.

preprint2010arXiv

Symmetry invariance for adapting biological systems

We study in this paper certain properties of the responses of dynamical systems to external inputs. The motivation arises from molecular systems biology. and, in particular, the recent discovery of an important transient property, related to Weber's law in psychophysics: "fold-change detection" in adapting systems, the property that scale uncertainty does not affect responses. FCD appears to play an important role in key signaling transduction mechanisms in eukaryotes, including the ERK and Wnt pathways, as well as in E.coli and possibly other prokaryotic chemotaxis pathways. In this paper, we provide further theoretical results regarding this property. Far more generally, we develop a necessary and sufficient characterization of adapting systems whose transient behaviors are invariant under the action of a set (often, a group) of symmetries in their sensory field. A particular instance is FCD, which amounts to invariance under the action of the multiplicative group of positive real numbers. Our main result is framed in terms of a notion which extends equivariant actions of compact Lie groups. Its proof relies upon control theoretic tools, and in particular the uniqueness theorem for minimal realizations.

Uri Alon

What is connected

Connect this record

See the researcher in context

Building this map preview

13 published item(s)

Why do Nearest Neighbor Language Models Work?

A Systematic Evaluation of Large Language Models of Code

How Attentive are Graph Attention Networks?

Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

Oversquashing in GNNs through the lens of information contraction and graph expansion

On the Bottleneck of Graph Neural Networks and its Practical Implications

Structural Language Models of Code

Evolution of bow-tie architectures in biology

Entrainment of noise-induced and limit cycle oscillators under weak noise

Mutation Rules and the Evolution of Sparseness and Modularity in Biological Systems

Coding limits on the number of transcription factors

Rules for biological regulation based on error minimization

Symmetry invariance for adapting biological systems