Researcher profile

Shuai Zhao

Shuai Zhao contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
22 works
0 followers
12 topics
4 close collaborators

Published work

22 published items

preprint · 2026 · arXiv

EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation

Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and also trustworthy long-horizon video generation.

preprint · 2026 · arXiv

Global Context Compression with Interleaved Vision-Text Transformation

Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This has motivated earlier work that renders the Transformer's input into images for prefilling, which reduces the number of tokens through visual encoding and thereby alleviates the quadratically growing attention computation. However, this partial compression fails to save computational or memory costs during token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both the prefilling and inference stages. To this end, we propose VIST2, a novel Transformer that interleaves input text chunks with their visual encodings, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPs. Our code and datasets will be made public to support further studies.
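As a rough illustration of why fewer tokens reduce attention cost, the sketch below compares the number of query-key score pairs before and after a 4x reduction in sequence length. It assumes the standard n * n pairwise cost of full self-attention; the function names and the 8192-token context are hypothetical and not taken from the paper. The 74% FLOPs reduction reported in the abstract covers the whole model, so it is not expected to match this attention-only figure.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key score pairs in full self-attention (n * n)."""
    return num_tokens * num_tokens


def compressed_length(num_tokens: int, ratio: int) -> int:
    """Token count after compressing by an integer ratio (ceiling division)."""
    return -(-num_tokens // ratio)


original = 8192                               # hypothetical context length
compressed = compressed_length(original, 4)   # 4x compression ratio from the abstract
saving = attention_pairs(original) / attention_pairs(compressed)
print(compressed, saving)
```

Under these assumptions the attention-score count falls by the square of the compression ratio, which is why token reduction matters most for long contexts.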

preprint · 2026 · arXiv

Testing a Linear Relation: Short-Range Correlations and the EMC Effect for Gluons and Quarks in Nuclei

In this work, we focus on the possible linear relation between short-range correlations (SRCs) and the EMC effect for partons in nuclei. First, we test a linear relationship pertaining to gluons in bound nuclei; it is manifested as a correlation between the slope of the reduced cross section ratio in deep inelastic scattering (DIS) and the cross section of sub-threshold $J/ψ$ photoproduction. For comparison, the results from four different global analysis groups of nuclear parton distribution functions (nPDFs) are utilized. These results show a good linear correlation between the gluons in bound nuclei and the slope of the reduced cross section ratio, consistent with the possible presence of nuclear effects in the gluon distributions. Second, we investigate the linear relationship of quarks in the proton-induced Drell-Yan process. The corresponding results for quarks show strong sensitivity to the parameterization forms adopted by the different groups. These findings enhance our understanding of the substructure in bound nuclei and provide a valuable reference for future global fitting of nPDFs.

preprint · 2026 · arXiv

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strictly on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.
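To make the idea of token-level entropy flow concrete, the sketch below computes the Shannon entropy of each token's next-token distribution before and after an update, then totals the increases and decreases separately. This is only an illustration of the bookkeeping described in the abstract, not the OPEFO algorithm: no rescaling is performed, and the function names and logit values are hypothetical.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_flow(before: List[List[float]], after: List[List[float]]) -> Dict[str, float]:
    """Total per-token entropy increases and decreases between two snapshots."""
    increase = decrease = 0.0
    for b, a in zip(before, after):
        delta = token_entropy(a) - token_entropy(b)
        if delta > 0:
            increase += delta
        else:
            decrease += -delta
    return {"increase": increase, "decrease": decrease}


# Hypothetical logits for two token positions, before and after an update.
before = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
after = [[3.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
flow = entropy_flow(before, after)
print(flow)
```

In this toy example the first position sharpens much more than the second flattens, so the decreasing total dominates, which is the kind of imbalance the abstract attributes to entropy collapse.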

preprint · 2022 · arXiv

A Roadmap for Big Model

With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks has become a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and in the application of BMs in many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory & Interpretability, Commonsense Reasoning, Reliability & Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. For each topic, we clearly summarize the current studies and propose some future research directions. At the end of this paper, we discuss the further development of BMs from a more general view.

preprint · 2022 · arXiv

Analyzing Modality Robustness in Multimodal Sentiment Analysis

Building robust multimodal models is crucial for achieving reliable deployment in the wild. Despite its importance, less attention has been paid to identifying and improving the robustness of Multimodal Sentiment Analysis (MSA) models. In this work, we hope to address that by (i) proposing simple diagnostic checks for modality robustness in a trained multimodal model, with which we find MSA models to be highly sensitive to a single modality, creating issues in their robustness; and (ii) analyzing well-known robust training strategies to alleviate these issues. Critically, we observe that robustness can be achieved without compromising the original performance. We hope our extensive study (performed across five models and two benchmark datasets) and proposed procedures will make robustness an integral component of MSA research. Our diagnostic checks and robust training solutions are simple to implement and available at https://github.com/declare-lab/MSA-Robustness.

preprint · 2022 · arXiv

Exploring Entity Interactions for Few-Shot Relation Learning (Student Abstract)

Few-shot relation learning refers to inferring facts for relations with a limited number of observed triples. Existing metric-learning methods for this problem mostly neglect entity interactions within and between triples. In this paper, we explore this kind of fine-grained semantic meaning and propose our model TransAM. Specifically, we serialize reference entities and query entities into a sequence and apply a transformer structure with local-global attention to capture both intra- and inter-triple entity interactions. Experiments on two public benchmark datasets, NELL-One and Wiki-One, in the 1-shot setting demonstrate the effectiveness of TransAM.

preprint · 2022 · arXiv

Inverse moment of the $B$-meson quasi distribution amplitude

We study the structure of the inverse moment (IM) of quasi distributions, taking the $B$-meson quasi distribution amplitude (quasi-DA) as an example. Based on a one-loop calculation, we derive the renormalization group equation and the velocity evolution equation for the first IM of the quasi-DA. We find that, in the large velocity limit, the first IM of the $B$-meson quasi-DA can be factorized into the IM as well as logarithmic moments of the light-cone distribution amplitude (LCDA), accompanied by short-distance coefficients. Our results can be useful either in understanding the patterns of perturbative matching in Large Momentum Effective Theory or in evaluating the inverse moment of the $B$-meson LCDA on the lattice.

preprint · 2022 · arXiv

Latent Heterogeneous Graph Network for Incomplete Multi-View Learning

Multi-view learning has progressed rapidly in recent years. Although many previous studies assume that each instance appears in all views, it is common in real-world applications for instances to be missing from some views, resulting in incomplete multi-view data. To tackle this problem, we propose a novel Latent Heterogeneous Graph Network (LHGN) for incomplete multi-view learning, which aims to use multiple incomplete views as fully as possible in a flexible manner. By learning a unified latent representation, a trade-off between consistency and complementarity among different views is implicitly realized. To explore the complex relationship between samples and latent representations, a neighborhood constraint and a view-existence constraint are proposed, for the first time, to construct a heterogeneous graph. Finally, to avoid any inconsistencies between training and test phase, a transductive learning technique is applied based on graph learning for classification tasks. Extensive experimental results on real-world datasets demonstrate the effectiveness of our model over existing state-of-the-art approaches.

preprint · 2022 · arXiv

Learning Personalized Representations using Graph Convolutional Network

Generating representations that precisely reflect customers' behavior is an important task for providing a personalized skill routing experience in Alexa. Currently, the Dynamic Routing (DR) team, which is responsible for routing Alexa traffic to providers or skills, relies on two features as personal signals: the absolute traffic count and the normalized traffic count of every skill usage per customer. Neither of them considers the network-based structure of interactions between customers and skills, which contains richer information about customer preferences. In this work, we first build a heterogeneous edge-attributed graph based on customers' past interactions with the invoked skills, in which the user requests (utterances) are modeled as edges. Then we propose a graph convolutional network (GCN) based model, namely the Personalized Dynamic Routing Feature Encoder (PDRFE), that generates personalized customer representations learned from the built graph. Compared with existing models, PDRFE is able to further capture contextual information in the graph convolutional function. The performance of our proposed model is evaluated on a downstream task, defect prediction, which predicts the defect label from the learned embeddings of customers and their triggered skills. We observe up to 41% improvement on the cross entropy metric for our proposed models compared to the baselines.

preprint · 2022 · arXiv

Parton distribution function for topological charge at one loop

We present results for the $gg$-part of the one-loop corrections to the recently introduced "topological charge" GPD $\widetilde F(x,q^2)$. In particular, we give an expression for its evolution kernel. To enforce strict compliance with the gauge invariance requirements, we have used on-shell states for external gluons, and have obtained identical results in both Feynman and light-cone gauges. No "zero mode" $δ(x)$ terms were found for the twist-4 gluon GPD $\widetilde F(x,q^2)$.

preprint · 2022 · arXiv

SCALoss: Side and Corner Aligned Loss for Bounding Box Regression

Bounding box regression is an important component in object detection. Recent work achieves promising performance by optimizing the Intersection over Union (IoU). However, IoU-based loss has a vanishing gradient problem in the case of low-overlapping bounding boxes, and the model could easily ignore these simple cases. In this paper, we propose Side Overlap (SO) loss, which maximizes the side overlap of two bounding boxes and puts more penalty on low-overlapping bounding box cases. In addition, to speed up convergence, the Corner Distance (CD) is added to the objective function. Combining the Side Overlap and Corner Distance, we get a new regression objective function, Side and Corner Align Loss (SCALoss). The SCALoss is well correlated with IoU loss, which also benefits the evaluation metric, but produces more penalty for low-overlapping cases. It can serve as a comprehensive similarity measure, leading to better localization performance and faster convergence speed. Experiments on COCO, PASCAL VOC, and LVIS benchmarks show that SCALoss can bring consistent improvement and outperform $\ell_n$ loss and IoU-based loss with popular object detectors such as YOLOV3, SSD, and Faster-RCNN. Code is available at: https://github.com/Turoad/SCALoss.
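The motivation in this abstract can be seen in the extreme case of boxes that do not overlap at all: plain IoU is zero regardless of how far apart the boxes are, so it gives the optimizer no signal about which prediction is closer. The sketch below shows only that property of standard IoU; it does not implement the Side Overlap or Corner Distance terms, and the box coordinates are hypothetical.

```python
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2), assuming x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


target = (0.0, 0.0, 2.0, 2.0)
near_miss = (3.0, 0.0, 5.0, 2.0)     # one unit away from the target
far_miss = (30.0, 0.0, 32.0, 2.0)    # far from the target
overlapping = (1.0, 0.0, 3.0, 2.0)   # shares half of its area with the target

print(iou(target, near_miss), iou(target, far_miss), iou(target, overlapping))
```

Both misses score the same, so a loss built only on IoU cannot prefer the near miss over the far one; a distance-aware term is one way to restore that preference.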

preprint · 2022 · arXiv

Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training

The width of a neural network matters since increasing the width will necessarily increase the model capacity. However, the performance of a network does not improve linearly with the width and soon gets saturated. In this case, we argue that increasing the number of networks (ensemble) can achieve better accuracy-efficiency trade-offs than purely increasing the width. To prove it, one large network is divided into several small ones regarding its parameters and regularization components. Each of these small networks has a fraction of the original one's parameters. We then train these small networks together and make them see various views of the same data to increase their diversity. During this co-training process, networks can also learn from each other. As a result, small networks can achieve better ensemble performance than the large one with few or no extra parameters or FLOPs, i.e., achieving better accuracy-efficiency trade-offs. Small networks can also achieve faster inference speed than the large one by concurrent running. All of the above shows that the number of networks is a new dimension of model scaling. We validate our argument with 8 different neural architectures on common benchmarks through extensive experiments. The code is available at https://github.com/FreeformRobotics/Divide-and-Co-training.

preprint · 2022 · arXiv

WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text reference (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are always fed a combination of multiple public datasets to meet the demand for large-scale training data. However, due to the unevenness of the data distribution, including size, task type and quality, using a mixture of multiple datasets for model training can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total. Specifically, about 600 million pairs are collected from multiple webpages in which image and caption present weak correlation, and the other 50 million strongly related image-text pairs are collected from some high-quality graphic websites. We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training. In addition, we trained both an understanding and a generation vision-language (VL) model to test the dataset's effectiveness. The results show that WuDaoMM can be applied as an efficient dataset for VLPMs, especially for models in the text-to-image generation task. The data is released at https://data.wudaoai.cn

preprint · 2020 · arXiv

$B$-meson light-cone distribution amplitude from the Euclidean quantity

A new method for the model-independent determination of the light-cone distribution amplitude (LCDA) of the $B$-meson in heavy quark effective theory (HQET) is proposed by combining the large momentum effective theory (LaMET) and the numerical simulation technique on the Euclidean lattice. We demonstrate the autonomous scale dependence of the non-local quasi-HQET operator with the aid of the auxiliary field approach, and further determine the perturbative matching coefficient entering the hard-collinear factorization formula for the $B$-meson quasi-distribution amplitude at the one-loop accuracy. These results will be crucial to explore the partonic structure of heavy-quark hadrons in the static limit and to improve the theory description of exclusive $B$-meson decay amplitudes based upon perturbative QCD factorization theorems.

preprint · 2020 · arXiv

Adversarial-Learned Loss for Domain Adaptation

Recently, remarkable progress has been made in learning transferable representations across domains. Previous works in domain adaptation are mainly based on two techniques: domain-adversarial learning and self-training. However, domain-adversarial learning only aligns feature distributions between domains but does not consider whether the target features are discriminative. On the other hand, self-training utilizes the model predictions to enhance the discrimination of target features, but it is unable to explicitly align domain distributions. In order to combine the strengths of these two methods, we propose a novel method called Adversarial-Learned Loss for Domain Adaptation (ALDA). We first analyze the pseudo-label method, a typical self-training method. Nevertheless, there is a gap between pseudo-labels and the ground truth, which can cause incorrect training. Thus we introduce the confusion matrix, which is learned in an adversarial manner in ALDA, to reduce the gap and align the feature distributions. Finally, a new loss function is auto-constructed from the learned confusion matrix, which serves as the loss for unlabeled target samples. Our ALDA outperforms state-of-the-art approaches on four standard domain adaptation datasets. Our code is available at https://github.com/ZJULearning/ALDA.

preprint · 2020 · arXiv

Heavy quark expansion for heavy-light light-cone operators

We generalize the celebrated heavy quark expansion to nonlocal QCD operators. By taking nonlocal heavy-light current on the light-cone as an example, we confirm that the collinear singularities are common between QCD operator and the corresponding operator in heavy quark effective theory (HQET), at the leading power of $1/M$ expansion. Based on a perturbative calculation in operator form at one-loop level, a factorization formula linking QCD and HQET operators is investigated and the matching coefficient is determined. The matching between QCD and HQET light-cone distribution amplitudes (LCDAs) as well as other momentum distributions of hadron can be derived as a consequence.

preprint · 2020 · arXiv

Multi-Objective Parameter-less Population Pyramid for Solving Industrial Process Planning Problems

Evolutionary methods are effective tools for obtaining high-quality results when solving hard practical problems. Linkage learning may increase their effectiveness. One of the state-of-the-art methods that employ linkage learning is the Parameter-less Population Pyramid (P3). P3 is dedicated to solving single-objective problems in discrete domains. Recent research shows that P3 is highly competitive when addressing problems with so-called overlapping blocks, which are typical for practical problems. In this paper, we consider a multi-objective industrial process planning problem that arises from practice and is NP-hard. To handle it, we propose a multi-objective version of P3. The extensive research shows that our proposition outperforms the competing methods for the considered practical problem and typical multi-objective benchmarks.

preprint · 2020 · arXiv

Phase-Matching Quantum Cryptographic Conferencing

Quantum cryptographic conferencing (QCC) holds promise for distributing information-theoretic secure keys among multiple users over long distances. Limited by the fragility of the Greenberger-Horne-Zeilinger (GHZ) state, QCC networks based on directly distributing GHZ states over long distances still face major challenges. Two other potential approaches are measurement device independent QCC and conference key agreement with single-photon interference, which were proposed based on the post-selection of GHZ states and the post-selection of the W state, respectively. However, implementations of the former protocol are still heavily constrained by the transmission rate $η$ of optical channels and the complexity of the setups for post-selecting GHZ states. Meanwhile, the latter protocol cannot be cast as a measurement device independent prepare-and-measure scheme. Combining the idea of post-selecting the GHZ state with recently proposed twin-field quantum key distribution protocols, we report a QCC protocol based on weak coherent state interference, named phase-matching quantum cryptographic conferencing, which is immune to all detector side-channel attacks. The proposed protocol can improve the key generation rate from $\mathrm{O}(η^N)$ to $\mathrm{O}(η^{N-1})$ compared with the measurement device independent QCC protocols. Meanwhile, it can be easily scaled up to multiple parties due to its simple setup.
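The scaling claim at the end of this abstract can be read numerically: moving from $\mathrm{O}(η^N)$ to $\mathrm{O}(η^{N-1})$ improves the rate by a factor of order $1/η$. The sketch below evaluates only the leading power of $η$ and ignores all constant prefactors and protocol details; the transmittance value, party count and function name are hypothetical.

```python
def rate_scaling(eta: float, num_parties: int, exponent_offset: int = 0) -> float:
    """Leading-order scaling eta ** (N - exponent_offset), without prefactors."""
    return eta ** (num_parties - exponent_offset)


eta = 0.01   # hypothetical channel transmission rate
n = 3        # hypothetical number of parties

mdi_scaling = rate_scaling(eta, n)                 # O(eta^N)
phase_matching_scaling = rate_scaling(eta, n, 1)   # O(eta^(N-1))
gain = phase_matching_scaling / mdi_scaling        # of order 1 / eta

print(mdi_scaling, phase_matching_scaling, gain)
```

Because the gain grows as the channel becomes lossier, the improvement matters most at long distance, where $η$ is small.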

preprint · 2020 · arXiv

Study on the atmospheric pressure homogeneous discharge in air assisted with a floating carbon fibre microelectrode

Based on the consideration of increasing the number of initial electrons and providing an appropriate distribution of electric field strength for discharge space, a method of adopting the wire-cylindrical type electrode structure with a floating carbon fibre electrode to achieve atmospheric pressure homogeneous discharge (APHD) in air is proposed in this paper. Studies of the electrode characteristics show that this structure can make full use of the microdischarge process of the carbon fibre microelectrode and the good discharge effect of the helical electrode to provide plenty of initial electrons for the discharge space. Besides, the non-uniform electric field distribution with gradual change is conducive to the slow growth of electron avalanches in this structure. Thus, the initial discharge voltage can be reduced and the formation of filamentary discharge channels can be inhibited, which provides theoretical possibilities for the homogeneous discharge in atmospheric air. Experiments show that a three-dimensional uniform discharge phenomenon can be realized under a 6.0 kV applied voltage in a PFA tube of 6 mm inner diameter, displaying good uniformity and large scale.

preprint · 2020 · arXiv

To be Tough or Soft: Measuring the Impact of Counter-Ad-blocking Strategies on User Engagement

Fast-growing ad-blocker usage results in a large revenue decrease for ad-supported websites. Facing this problem, many online publishers choose either to cooperate with ad-blocker software companies to show acceptable ads or to build a wall that requires users to whitelist the site for content access. However, there is a lack of studies on the impact of these two counter-ad-blocking strategies on user behavior. To address this issue, we conduct a randomized field experiment on the website of Forbes Media, a major US media publisher. The ad-blocker users are divided into a treatment group, which receives the wall strategy, and a control group, which receives the acceptable ads strategy. We utilize the difference-in-differences method to estimate the causal effects. Our study shows that the wall strategy has an overall negative impact on user engagement. However, it has no statistically significant effect on highly engaged users, as they would view the pages no matter what strategy is used. It has a big impact on low-engagement users, who have no loyalty to the site. Our study also shows that revisiting behavior decreases over time, but the ratio of session whitelisting increases over time as the remaining users have relatively high loyalty and high engagement. The paper concludes with a discussion of managerial insights for publishers when determining counter-ad-blocking strategies.
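The difference-in-differences method named in this abstract compares the change in the treatment group with the change in the control group over the same period. The sketch below shows the basic two-group, two-period estimator on made-up numbers; the page-view figures and function name are hypothetical and are not data from the study, which may use a regression form with additional controls.

```python
from statistics import mean
from typing import Sequence


def diff_in_diff(
    treat_pre: Sequence[float],
    treat_post: Sequence[float],
    ctrl_pre: Sequence[float],
    ctrl_post: Sequence[float],
) -> float:
    """(change in treatment mean) minus (change in control mean)."""
    treat_change = mean(treat_post) - mean(treat_pre)
    ctrl_change = mean(ctrl_post) - mean(ctrl_pre)
    return treat_change - ctrl_change


# Hypothetical page views per user, before and after the strategies go live.
wall_pre = [10.0, 12.0, 14.0]
wall_post = [7.0, 9.0, 11.0]
ads_pre = [10.0, 11.0, 12.0]
ads_post = [9.0, 10.0, 11.0]

effect = diff_in_diff(wall_pre, wall_post, ads_pre, ads_post)
print(effect)
```

Subtracting the control group's change removes trends that affect both groups, such as a general decline in revisits over time, so the remainder is attributed to the wall strategy under the usual parallel-trends assumption.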

preprint · 2020 · arXiv

Value-driven Manufacturing Planning using Cloud-based Evolutionary Optimisation

This paper considers manufacturing planning and scheduling of manufacturing orders whose value decreases over time. The value decrease is modelled with a so-called value curve. Two genetic-algorithm-based methods for multi-objective optimisation have been proposed, implemented and deployed to a cloud. The first proposed method allocates and schedules manufacturing of all the ordered elements optimising both the makespan and the total value, whereas the second method selects only the profitable orders for manufacturing. The proposed evolutionary optimisation has been performed for a set of real-world-inspired manufacturing orders. Both the methods yield a similar total value, but the latter method leads to a shorter makespan.