Source author record

Siqi Zhang

Siqi Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, astro-ph.HE, Computation and Language, Computer Vision, Machine Learning

Catalog footprint

What is connected

3 works

5 topics

4 close collaborators

preprint · 2026 · arXiv

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
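
The third finding, that KD combined with the language modeling loss outperforms KD alone, can be illustrated with a minimal single-token sketch. This is not the paper's implementation: the function names, the mixing weight `kd_weight`, the `temperature` parameter, and the choice of forward KL divergence are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def lm_loss(student_logits, target_id):
    """Cross-entropy of the student against the ground-truth next token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_id])


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) over the vocabulary.

    Using forward KL is an assumption; the abstract does not name the divergence.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


def combined_loss(student_logits, teacher_logits, target_id,
                  kd_weight=0.5, temperature=1.0):
    """Weighted sum of the KD term and the LM term (hypothetical weighting)."""
    kd = kd_loss(student_logits, teacher_logits, temperature)
    lm = lm_loss(student_logits, target_id)
    return kd_weight * kd + (1.0 - kd_weight) * lm


# Toy vocabulary of four tokens; the ground-truth token is index 2.
student = [0.1, 0.4, 1.2, -0.3]
teacher = [0.0, 0.2, 2.0, -1.0]
loss = combined_loss(student, teacher, target_id=2, kd_weight=0.5)
```

Setting `kd_weight=1.0` recovers the KD-only objective and `kd_weight=0.0` recovers plain language modeling, so the two regimes compared in the abstract are the endpoints of this one hypothetical knob.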

preprint · 2024 · arXiv

On using the counting method to constrain the anisotropy of kilonova radiation

A large number of binary neutron star (BNS) mergers are expected to be detected by gravitational wave (GW) detectors, and the electromagnetic (EM) counterparts (e.g., kilonovae) of a fraction of these mergers may be detected in multiple bands by large-area survey telescopes. For a given number of BNS mergers detected by their GW signals, the expected number of EM counterparts that can be detected by a survey with given selection criteria depends on the kilonova properties, including the anisotropy. In this paper, we investigate whether the anisotropy of kilonova radiation and the kilonova model can be constrained statistically by the counting method, i.e., using the numbers of BNS mergers detected via GW and multi-band EM signals. Adopting simple models for the BNS mergers and afterglows, and a simple two-component (blue and red) model for kilonovae, we generate mock samples of GW-detected BNS mergers and their associated kilonovae and afterglows detected in multiple bands. Assuming a set of criteria for searching for the EM counterparts, we simulate the observations of these counterparts and obtain the EM-observed samples in different bands. With the numbers of BNS mergers detected by GW detectors and by EM survey telescopes in different bands, we show that the anisotropy of kilonova radiation and the kilonova model can be well constrained using Bayesian analysis. Our results suggest that the anisotropy of kilonova radiation may be demographically and globally constrained simply by using the numbers of BNS mergers detected by GW detectors and by multi-band EM survey telescopes.
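
The core of the counting idea can be sketched as a grid-based Bayesian update on the fraction of GW-detected mergers that also yield an EM detection. This is a heavily simplified illustration, not the paper's analysis: it assumes a single band, a binomial likelihood, and a flat prior, and the function names and example counts are hypothetical. Turning the detection fraction into a constraint on anisotropy would additionally require a kilonova model and survey selection criteria, which the sketch does not attempt.

```python
import math


def detection_fraction_posterior(n_gw, n_em, grid_size=1000):
    """Posterior over the EM detection fraction p on a midpoint grid.

    Assumes n_em ~ Binomial(n_gw, p) with a flat prior on p. The binomial
    coefficient does not depend on p, so it is dropped before normalizing.
    """
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_like = [
        n_em * math.log(p) + (n_gw - n_em) * math.log(1.0 - p)
        for p in grid
    ]
    peak = max(log_like)
    weights = [math.exp(v - peak) for v in log_like]
    total = sum(weights)
    posterior = [w / total for w in weights]
    return grid, posterior


def posterior_mean(grid, posterior):
    """Mean of a discrete posterior defined on the grid."""
    return sum(p * w for p, w in zip(grid, posterior))


# Hypothetical counts: 50 GW-detected mergers, 10 with a detected kilonova.
grid, posterior = detection_fraction_posterior(n_gw=50, n_em=10)
mean_fraction = posterior_mean(grid, posterior)
```

With a flat prior the exact posterior is a Beta distribution with mean (n_em + 1) / (n_gw + 2), which gives a simple check on the grid approximation.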

preprint · 2022 · arXiv

A Dynamic 3D Spontaneous Micro-expression Database: Establishment and Evaluation

Micro-expressions are spontaneous, unconscious facial movements that reveal people's true inner emotions and have great potential in fields related to psychological testing. Since the face is a deformable 3D object, an expression produces spatial deformation of the face, but the available databases are limited to 2D videos and lack a description of the 3D spatial information of micro-expressions. Therefore, we propose a new micro-expression database containing 2D video sequences and 3D point cloud sequences. The database includes 373 micro-expression sequences, and these samples were classified using an objective method based on the facial action coding system, as well as a non-objective method that combines video content and participants' self-reports. We extracted 2D and 3D features using local binary patterns on three orthogonal planes (LBP-TOP) and curvature algorithms, respectively, and evaluated the classification accuracies of these two features and their fusion with leave-one-subject-out (LOSO) and 10-fold cross-validation. Further, we applied various neural network algorithms to classify the database; the results show that fusing 3D features with 2D features improves classification accuracy over using 2D features alone. The database offers original and cropped micro-expression samples, which will facilitate research on the 3D spatio-temporal features of micro-expressions.
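
The leave-one-subject-out (LOSO) protocol mentioned in the abstract holds out every sample from one participant per fold, so that no subject appears in both the training and test sets. A minimal sketch of the split logic follows; the function name and the toy subject labels are hypothetical and not taken from the database.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids[i] is the subject label of sample i. Each fold tests on all
    samples of exactly one subject and trains on the rest.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy example: six samples from three hypothetical subjects.
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(subject_ids))
```

Because the split is driven by subject labels rather than sample indices, fold sizes vary with the number of samples each subject contributed, unlike the fixed-size folds of 10-fold cross-validation.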