Source author record

Yunhua Zhou

Yunhua Zhou appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

math.DS, Artificial Intelligence, Computation and Language

Catalog footprint

What is connected

8 works

3 topics

4 close collaborators

Preprint, 2026, arXiv

How to Set the Batch Size for Large-Scale Pre-training?

The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms fail to align with new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised E(S) relationship tailored for WSD scheduler, characterizing the trade-off between training data consumption E and steps S during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building upon these properties, we propose a dynamic Batch Size Scheduler. Extensive experiments demonstrate that our revised formula precisely captures the dynamics of large-scale pre-training, and the resulting scheduling strategy significantly enhances both training efficiency and final model quality.
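For readers unfamiliar with the Warmup-Stable-Decay schedule named in this abstract, the following is a minimal sketch of its three phases. The function name `wsd_lr`, its parameters, and the choice of linear warmup and linear decay are assumptions made for illustration; the paper may use a different decay shape, and this sketch does not reproduce its revised E(S) relationship or its batch size scheduler.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, final_lr=0.0):
    """Learning rate at `step` under a Warmup-Stable-Decay (WSD) schedule.

    Illustrative assumption: both ramps are linear. Real training setups
    often use other decay shapes (for example cosine or exponential).
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases must fit in total_steps")

    decay_start = total_steps - decay_steps

    # Phase 1 (warmup): ramp linearly from 0 up to the peak learning rate.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Phase 3 (decay): ramp linearly from the peak down to final_lr.
    if decay_steps > 0 and step >= decay_start:
        progress = (step - decay_start) / decay_steps
        return peak_lr + (final_lr - peak_lr) * progress

    # Phase 2 (stable): hold the peak learning rate constant.
    return peak_lr


if __name__ == "__main__":
    # Hypothetical configuration: 1000 steps, 100 warmup, 200 decay.
    for s in (0, 50, 100, 500, 800, 900, 1000):
        print(s, wsd_lr(s, 1000, 1e-3, 100, 200))
```

The long constant phase is what distinguishes WSD from a schedule that decays throughout training, which is the setting the abstract says the original Critical Batch Size analysis does not match.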

Preprint, 2026, arXiv

How to Set the Learning Rate for Large-Scale Pre-training?

Optimal configuration of the learning rate (LR) is a fundamental yet formidable challenge in large-scale pre-training. Given the stringent trade-off between training costs and model performance, the pivotal question is whether the optimal LR can be accurately extrapolated from low-cost experiments. In this paper, we formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a Scaling Law for search factor, effectively reducing the search complexity from O(n^3) to O(n*C_D*C_η) via predictive modeling. Within the Transfer Paradigm, we extend the principles of $μ$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons. By pushing the boundaries of existing hyperparameter research in terms of scale, we conduct a comprehensive comparison between these two paradigms. Our empirical results challenge the scalability of the widely adopted $μ$Transfer in large-scale pre-training scenarios. Furthermore, we provide a rigorous analysis through the dual lenses of training stability and feature learning to elucidate the underlying reasons why module-wise parameter tuning underperforms in large-scale settings. This work offers systematic practical guidelines and a fresh theoretical perspective for optimizing industrial-level pre-training.

Preprint, 2026, arXiv

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.

Preprint, 2020, arXiv

Unstable Topological Pressure for Partially Hyperbolic Diffeomorphisms with Sub-additive Potentials

In this paper, we introduce the unstable topological pressure for C^1-smooth partially hyperbolic diffeomorphisms with sub-additive potentials. Moreover, without any additional assumption, we have established the expected variational principle which connects this unstable topological pressure and the unstable measure theoretic entropy, as well as the corresponding Lyapunov exponent.

Preprint, 2014, arXiv

Quasi-Shadowing and Quasi-Stability for Dynamically Coherent Partially Hyperbolic Diffeomorphisms

Let $f$ be a partially hyperbolic diffeomorphism. $f$ is said to have the quasi-shadowing property if for any pseudo orbit $\{x_k\}_{k\in \mathbb{Z}}$, there is a sequence $\{y_k\}_{k\in \mathbb{Z}}$ tracing it in which $y_{k+1}$ lies in the local center leaf of $f(y_k)$ for any $k\in \mathbb{Z}$. $f$ is called topologically quasi-stable if for any homeomorphism $g$ $C^0$-close to $f$, there exist a continuous map $π$ and a motion $τ$ along the center foliation such that $π\circ g=τ\circ f\circπ$. In this paper we prove that if $f$ is dynamically coherent then it has the quasi-shadowing and topological quasi-stability properties.

Preprint, 2013, arXiv

Generic Continuity of Metric Entropy for Volume-preserving Diffeomorphisms

Let $M$ be a compact manifold and $\text{Diff}^1_m(M)$ be the set of $C^1$ volume-preserving diffeomorphisms of $M$. We prove that there is a residual subset $\mathcal {R}\subset \text{Diff}^1_m(M)$ such that each $f\in \mathcal{R}$ is a continuity point of the map $g\to h_m(g)$ from $\text{Diff}^1_m(M)$ to $\mathbb{R}$, where $h_m(g)$ is the metric entropy of $g$ with respect to volume measure $m$.

Preprint, 2011, arXiv

The local $C^1$-density of stable ergodicity

The center bundle of a conservative partially hyperbolic diffeomorphism $f$ is called robustly non-hyperbolic if any conservative diffeomorphism which is $C^1$-close to $f$ has non-hyperbolic center bundle. In this paper, we prove that stable ergodicity is $C^1$-dense among conservative partially hyperbolic systems with robustly non-hyperbolic center bundle.

Preprint, 2007, arXiv

Topological entropies of equivalent smooth flows

Two flows defined on a smooth manifold are equivalent if there exists a homeomorphism of the manifold that sends each orbit of one flow onto an orbit of the other flow while preserving the time orientation. The topological entropy of a flow is defined as the entropy of its time-1 map. While topological entropy is an invariant for equivalent homeomorphisms, finite non-zero topological entropy for a flow cannot be an invariant because its value is affected by time reparameterization. However, 0 and $\infty$ topological entropy are invariants for equivalent flows without fixed points. In equivalent flows with fixed points there exists a counterexample, constructed by Ohno, showing that neither 0 nor $\infty$ topological entropy is preserved by equivalence. The two flows constructed by Ohno are suspensions of a transitive subshift and thus are not differentiable. Note that a differentiable flow on a compact manifold cannot have $\infty$ entropy. These facts led Ohno in 1980 to ask the following: "Is 0 topological entropy an invariant for equivalent differentiable flows?" In this paper, we construct two equivalent $C^\infty$ smooth flows with a singularity, one of which has positive topological entropy while the other has zero topological entropy. This gives a negative answer to Ohno's question in the class $C^\infty$.
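The abstract's remark that a finite non-zero entropy value is affected by time reparameterization can be seen already for a constant rescaling of time. The sketch below uses the standard identity $h(\phi_t)=|t|\,h(\phi_1)$ for the time-$t$ maps of a flow on a compact metric space; the symbols $\phi$, $\psi$ and $c$ are notation introduced here for illustration and are not taken from the paper.

$$
\psi_t := \phi_{ct}\quad (c>0)
\qquad\Longrightarrow\qquad
h(\psi) = h(\psi_1) = h(\phi_c) = c\,h(\phi_1) = c\,h(\phi).
$$

The flows $\phi$ and $\psi$ have the same orbits with the same time orientation, so they are equivalent via the identity map, yet their entropies differ whenever $0<h(\phi)<\infty$ and $c\neq 1$. The values $0$ and $\infty$ are unchanged by this rescaling, which is consistent with their being the only candidates for invariance discussed in the abstract.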
