Source author record

Yang Bai

Yang Bai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation and Language Artificial Intelligence Methodology

Catalog footprint

What is connected

3works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Efficient Context Scaling with LongCat ZigZag Attention

We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention models into sparse versions with rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.

preprint2026arXiv

Multi-transport Distributional Regression

We study distribution-on-distribution regression problems in which a response distribution depends on multiple distributional predictors. Such settings arise naturally in applications where the outcome distribution is driven by several heterogeneous distributional sources, yet remain challenging due to the nonlinear geometry of the Wasserstein space. We propose an intrinsic regression framework that aggregates predictor-specific transported distributions through a weighted Fréchet mean in the Wasserstein space. The resulting model admits multiple distributional predictors, assigns interpretable weights quantifying their relative contributions, and defines a flexible regression operator that is invariant to auxiliary construction choices, such as the selection of a reference distribution. From a theoretical perspective, we establish identifiability of the induced regression operator and derive asymptotic guarantees for its estimation under a predictive Wasserstein semi-norm, which directly characterizes convergence of the composite prediction map. Extensive simulation studies and a real data application demonstrate the improved predictive performance and interpretability of the proposed approach compared with existing Wasserstein regression methods.

preprint2026arXiv

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain task. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.

Yang Bai

What is connected

Connect this record

See the researcher in context

Building this map preview

3 published item(s)

Efficient Context Scaling with LongCat ZigZag Attention

Multi-transport Distributional Regression

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation