Researcher profile

William Bialek

Published work

6 published items

Preprint, 2025, arXiv

Large language models and the entropy of English

We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$ characters, implying that there are direct dependencies or interactions across these distances. A corollary is that there are small but significant correlations between characters at these separations, as we show from the data independent of models. The distribution of code lengths reveals an emergent certainty about an increasing fraction of characters at large $N$. Over the course of model training, we observe different dynamics at long and short context lengths, suggesting that long-ranged structure is learned only gradually. Our results constrain efforts to build statistical physics models of LLMs or language itself.
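
As a rough illustration of the quantity discussed in this abstract, the sketch below estimates the conditional entropy of a character given the $N$ preceding characters. It is not the paper's method: the paper uses LLMs, whereas this sketch swaps in a simple count-based (plug-in) estimator, which is only usable for very small $N$ because the counts become too sparse at long context lengths. The function name `conditional_entropy` and the toy input are assumptions made for illustration.

```python
from collections import Counter
import math


def conditional_entropy(text, n):
    """Plug-in estimate (in bits) of the entropy of one character
    given the n characters that precede it.

    Illustrative only: a count-based estimator, not an LLM.
    """
    joint = Counter()    # counts of (context, next character)
    context = Counter()  # counts of each context
    for i in range(n, len(text)):
        ctx = text[i - n:i]
        joint[(ctx, text[i])] += 1
        context[ctx] += 1

    total = sum(context.values())
    h = 0.0
    for (ctx, _ch), count in joint.items():
        # p(ctx, ch) * log2 p(ch | ctx), summed with a minus sign
        h -= (count / total) * math.log2(count / context[ctx])
    return h


# Toy usage (hypothetical input): a strictly alternating string is
# unpredictable with no context and fully predictable with one character.
toy_text = "ab" * 200
entropies = [conditional_entropy(toy_text, n) for n in range(4)]
```

On real text, the paper's observation corresponds to this curve continuing to fall as the context grows, which a count-based estimator cannot track reliably once contexts become rare.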

Preprint, 2020, arXiv

Exploring a strongly non-Markovian animal behavior

A freely walking fly visits roughly 100 stereotyped states in a strongly non-Markovian sequence. To explore these dynamics, we develop a generalization of the information bottleneck method, compressing the large number of behavioral states into a more compact description that maximally preserves the correlations between successive states. Surprisingly, preserving these short time correlations with a compression into just two states captures the long ranged correlations seen in the raw data. Having reduced the behavior to a binary sequence, we describe the distribution of these sequences by an Ising model with pairwise interactions, which is the maximum entropy model that matches the two-point correlations. Matching the correlation function at longer and longer times drives the resulting model toward the Ising model with inverse square interactions and near zero magnetic field. The emergence of this statistical physics problem from the analysis of real data on animal behavior is unexpected.

Preprint, 2020, arXiv

Transcription-dependent spatial organization of a gene locus

There is growing appreciation that gene function is connected to the dynamic structure of the chromosome. Here we explore the interplay between three-dimensional structure and transcriptional activity at the single cell level. We show that inactive loci are spatially more compact than active ones, and that within active loci the enhancer driving transcription is closest to the promoter. On the other hand, even this shortest distance is too long to support direct physical contact between the enhancer-promoter pair when the locus is transcriptionally active. Artificial manipulation of genomic separations between enhancers and the promoter produces changes in physical distance and transcriptional activity, recapitulating the correlation seen in wild-type embryos, but disruption of topological domain boundaries has no effect. Our results suggest a complex interdependence between transcription and the spatial organization of cis-regulatory elements.

Preprint, 2019, arXiv

Information costs in the control of protein synthesis

Efficient protein synthesis depends on the availability of charged tRNA molecules. With 61 different codons, shifting the balance among the tRNA abundances can lead to large changes in the protein synthesis rate. Previous theoretical work has asked about the optimization of these abundances, and there is some evidence that regulatory mechanisms bring cells close to this optimum, on average. We formulate the tradeoff between the precision of control and the efficiency of synthesis, asking for the maximum entropy distribution of tRNA abundances consistent with a desired mean rate of protein synthesis. Our analysis, using data from E. coli, indicates that reasonable synthesis rates are consistent only with rather low entropies, so that the cell's regulatory mechanisms must encode a large amount of information about the "correct" tRNA abundances.

Preprint, 2018, arXiv

Optimal local estimates of visual motion in a natural environment

Many organisms, from flies to humans, use visual signals to estimate their motion through the world. To explore the motion estimation problem, we have constructed a camera/gyroscope system that allows us to sample, at high temporal resolution, the joint distribution of input images and rotational motions during a long walk in the woods. From these data we construct the optimal estimator of velocity based on spatial and temporal derivatives of image intensity in small patches of the visual world. Over the bulk of the naturally occurring dynamic range, the optimal estimator exhibits the same systematic errors seen in neural and behavioral responses, including the confounding of velocity and contrast. These results suggest that apparent errors of sensory processing may reflect an optimal response to the physical signals in the environment.

Preprint, 2018, arXiv

The statistical mechanics of Twitter

We build models for the distribution of social states in Twitter communities. States can be defined by the participation vs silence of individuals in conversations that surround key words, and we approximate the joint distribution of these binary variables using the maximum entropy principle, finding the least structured models that match the mean probability of individuals tweeting and their pairwise correlations. These models provide very accurate, quantitative descriptions of higher order structure in these social networks. The parameters of these models seem poised close to critical surfaces in the space of possible models, and we observe scaling behavior of the data under coarse-graining. These results suggest that simple models, grounded in statistical physics, may provide a useful point of view on the larger data sets now emerging from complex social systems.
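
The maximum entropy construction described in this abstract can be sketched for a handful of binary variables. The code below is a minimal illustration, not the authors' implementation: it fits fields `h` and couplings `J` of a pairwise model so that the model's means and pairwise moments match those of the data, using exact enumeration of all states and plain gradient ascent. The function names, the learning rate, the step count, and the toy samples are all assumptions; exact enumeration is only feasible for very small systems.

```python
import itertools
import math


def model_moments(h, J, n):
    """Exact means <s_i> and pair moments <s_i s_j> of the model
    P(s) ~ exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j), with s_i in {0, 1}."""
    states = list(itertools.product([0, 1], repeat=n))
    weights = []
    for s in states:
        log_w = sum(h[i] * s[i] for i in range(n))
        log_w += sum(J[i][j] * s[i] * s[j]
                     for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(log_w))
    z = sum(weights)
    means = [sum(w * s[i] for w, s in zip(weights, states)) / z
             for i in range(n)]
    pairs = [[sum(w * s[i] * s[j] for w, s in zip(weights, states)) / z
              for j in range(n)] for i in range(n)]
    return means, pairs


def fit_maxent(samples, steps=10000, lr=0.5):
    """Gradient ascent on the log-likelihood: nudge each parameter by the
    gap between the data moment and the model moment it controls."""
    n, m = len(samples[0]), len(samples)
    data_means = [sum(s[i] for s in samples) / m for i in range(n)]
    data_pairs = [[sum(s[i] * s[j] for s in samples) / m
                   for j in range(n)] for i in range(n)]
    h = [0.0] * n
    J = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        means, pairs = model_moments(h, J, n)
        for i in range(n):
            h[i] += lr * (data_means[i] - means[i])
            for j in range(i + 1, n):
                J[i][j] += lr * (data_pairs[i][j] - pairs[i][j])
    return h, J


# Hypothetical toy data: 1 = participates, 0 = silent, for three individuals.
toy_samples = [
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0), (1, 0, 0),
    (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 1, 1), (0, 1, 1), (1, 0, 1),
]
h, J = fit_maxent(toy_samples)
```

Once fitted, such a model can be tested the way the abstract describes, by asking how well it predicts statistics it was not constrained to match, such as three-way correlations.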