Researcher profile

Zhengsu Chen

Zhengsu Chen contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2026arXiv

Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention has received increasing attention as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lead to the loss of critical information. To address this issue, we propose \textbf{P}unctuation-aware \textbf{H}ybrid \textbf{S}parse \textbf{A}ttention \textbf{(PHSA)}, a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios; Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. Specifically, for the 0.6B-parameter model with 32k-token input sequences, PHSA can reduce the information loss by 10.8\% at a sparsity ratio of 97.3\%.

preprint2021arXiv

A Survey on Incorporating Domain Knowledge into Deep Learning for Medical Image Analysis

Although deep learning models like CNNs have achieved great success in medical image analysis, the small size of medical datasets remains a major bottleneck in this area. To address this problem, researchers have started looking for external information beyond current available medical datasets. Traditional approaches generally leverage the information from natural images via transfer learning. More recent works utilize the domain knowledge from medical doctors, to create networks that resemble how medical doctors are trained, mimic their diagnostic patterns, or focus on the features or areas they pay particular attention to. In this survey, we summarize the current progress on integrating medical domain knowledge into deep learning models for various tasks, such as disease diagnosis, lesion, organ and abnormality detection, lesion and organ segmentation. For each task, we systematically categorize different kinds of medical domain knowledge that have been utilized and their corresponding integrating methods. We also provide current challenges and directions for future research.

preprint2020arXiv

Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio

Automatic designing computationally efficient neural networks has received much attention in recent years. Existing approaches either utilize network pruning or leverage the network architecture search methods. This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs, so that under each network configuration, one can estimate the FLOPs utilization ratio (FUR) for each layer and use it to determine whether to increase or decrease the number of channels on the layer. Note that FUR, like the gradient of a non-linear function, is accurate only in a small neighborhood of the current network. Hence, we design an iterative mechanism so that the initial network undergoes a number of steps, each of which has a small `adjusting rate' to control the changes to the network. The computational overhead of the entire search process is reasonable, i.e., comparable to that of re-training the final model from scratch. Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach, which consistently outperforms the pruning counterpart. The code is available at https://github.com/danczs/NetworkAdjustment.

preprint2020arXiv

SelectScale: Mining More Patterns from Images via Selective and Soft Dropout

Convolutional neural networks (CNNs) have achieved remarkable success in image recognition. Although the internal patterns of the input images are effectively learned by the CNNs, these patterns only constitute a small proportion of useful patterns contained in the input images. This can be attributed to the fact that the CNNs will stop learning if the learned patterns are enough to make a correct classification. Network regularization methods like dropout and SpatialDropout can ease this problem. During training, they randomly drop the features. These dropout methods, in essence, change the patterns learned by the networks, and in turn, forces the networks to learn other patterns to make the correct classification. However, the above methods have an important drawback. Randomly dropping features is generally inefficient and can introduce unnecessary noise. To tackle this problem, we propose SelectScale. Instead of randomly dropping units, SelectScale selects the important features in networks and adjusts them during training. Using SelectScale, we improve the performance of CNNs on CIFAR and ImageNet.

preprint2020arXiv

Weight-Sharing Neural Architecture Search: A Battle to Shrink the Optimization Gap

Neural architecture search (NAS) has attracted increasing attentions in both academia and industry. In the early age, researchers mostly applied individual search methods which sample and evaluate the candidate architectures separately and thus incur heavy computational overheads. To alleviate the burden, weight-sharing methods were proposed in which exponentially many architectures share weights in the same super-network, and the costly training procedure is performed only once. These methods, though being much faster, often suffer the issue of instability. This paper provides a literature review on NAS, in particular the weight-sharing methods, and points out that the major challenge comes from the optimization gap between the super-network and the sub-architectures. From this perspective, we summarize existing approaches into several categories according to their efforts in bridging the gap, and analyze both advantages and disadvantages of these methodologies. Finally, we share our opinions on the future directions of NAS and AutoML. Due to the expertise of the authors, this paper mainly focuses on the application of NAS to computer vision problems and may bias towards the work in our group.