Researcher profile

Yan Feng

Yan Feng contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
15topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2026arXiv

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find that a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then subsequently redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23$\times$ faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.

preprint2025arXiv

UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored on large data scale. In this work, we analysis the unified structure with a concise model, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. A common assumption in unified vision-language models is that adding generation will naturally strengthen understanding. However, this is not always true at scale. At 200M+ pretraining samples, generation helps understanding only when it operates at the semantic level, i.e. when the model learns to autoregress high-level visual representations inside the LLM. Once pixel-level objectives (e.g., diffusion losses) directly interfere with the LLM, understanding performance often degrades. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. Unified generation-understanding demonstrates a superior scaling trend compared to understanding alone, revealing a more effective way to learn vision-only knowledge directive from vision modality rather than captioning to text. (3) Autoregression on Input Embedding is effective to capture visual details. Compared to the commonly-used vision encoder, make visual autoregression on input embedding shows less cumulative error and is modality independent, which can be extend to all modalities. The learned semantic representations capture visual information such as objects, locations, shapes, and colors; further enable pixel-level image generation.

preprint2023arXiv

Generalizable Black-Box Adversarial Attack with Meta Learning

In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.

preprint2022arXiv

Knowledge Distillation with the Reused Teacher Classifier

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years, generally with elaborately designed knowledge representations, which in turn increase the difficulty of model development and interpretation. In contrast, we empirically show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap. We directly reuse the discriminative classifier from the pre-trained teacher model for student inference and train a student encoder through feature alignment with a single $\ell_2$ loss. In this way, the student model is able to achieve exactly the same performance as the teacher model provided that their extracted features are perfectly aligned. An additional projector is developed to help the student encoder match with the teacher classifier, which renders our technique applicable to various teacher and student architectures. Extensive experiments demonstrate that our technique achieves state-of-the-art results at the modest cost of compression ratio due to the added projector.

preprint2022arXiv

Multi-Scale Architectures Matter: On the Adversarial Robustness of Flow-based Lossless Compression

As a probabilistic modeling technique, the flow-based model has demonstrated remarkable potential in the field of lossless compression \cite{idf,idf++,lbb,ivpf,iflow},. Compared with other deep generative models (eg. Autoregressive, VAEs) \cite{bitswap,hilloc,pixelcnn++,pixelsnail} that explicitly model the data distribution probabilities, flow-based models perform better due to their excellent probability density estimation and satisfactory inference speed. In flow-based models, multi-scale architecture provides a shortcut from the shallow layer to the output layer, which significantly reduces the computational complexity and avoid performance degradation when adding more layers. This is essential for constructing an advanced flow-based learnable bijective mapping. Furthermore, the lightweight requirement of the model design in practical compression tasks suggests that flows with multi-scale architecture achieve the best trade-off between coding complexity and compression efficiency.

preprint2022arXiv

sqSGD: Locally Private and Communication Efficient Federated Learning

Federated learning (FL) is a technique that trains machine learning models from decentralized data sources. We study FL under local notions of privacy constraints, which provides strong protection against sensitive data disclosures via obfuscating the data before leaving the client. We identify two major concerns in designing practical privacy-preserving FL algorithms: communication efficiency and high-dimensional compatibility. We then develop a gradient-based learning algorithm called \emph{sqSGD} (selective quantized stochastic gradient descent) that addresses both concerns. The proposed algorithm is based on a novel privacy-preserving quantization scheme that uses a constant number of bits per dimension per client. Then we improve the base algorithm in three ways: first, we apply a gradient subsampling strategy that simultaneously offers better training performance and smaller communication costs under a fixed privacy budget. Secondly, we utilize randomized rotation as a preprocessing step to reduce quantization error. Thirdly, an adaptive gradient norm upper bound shrinkage strategy is adopted to improve accuracy and stabilize training. Finally, the practicality of the proposed framework is demonstrated on benchmark datasets. Experiment results show that sqSGD successfully learns large models like LeNet and ResNet with local privacy constraints. In addition, with fixed privacy and communication level, the performance of sqSGD significantly dominates that of various baseline algorithms.

preprint2022arXiv

The crossed-sine wavefront sensor: first tests and results

The crossed-sine wavefront sensor (WFS) is a pupil plane wavefront sensor that measures the first derivatives of the wavefront. It is made by three main components: a gradient transmission filter (GTF) built from a product of sine functions rotated by 45 degrees around the optical axis, a 2x2 mini-lens array (MLA) at the focus of the tested optical system and a detector array located on a plane conjugated to the pupil. The basic principle consists in acquiring four pupil images simultaneously, each image being observed from different points located behind the GTF. After the simulation work which demonstrated the wavefront reconstruction capability, we are now in the phase of implementation of the prototype in the lab. The crossed-sine WFS could achieve a simultaneous high spatial resolution at the pupil of the tested optics and absolute measurement accuracy comparable to that attained by laser-interferometers. In this paper we introduce seven customized phase masks and make measurements of them.First tests and resultsare demonstrated, based on which we explore the performance of our crossed-sine WFS and make comparisons with that of the laser-interferomete

preprint2021arXiv

Spectral compression by phase doubling in second harmonic generation

In second harmonic generation, the phase of the optical field is doubled which has important implication. Here the phase doubling effect is utilized to solve a long-standing challenge in power scaling of single frequency laser. When a (-π/2, π/2) binary phase modulation is applied to a single frequency seed laser to broaden the spectrum and suppress the stimulated Brillouin scattering in high power fiber amplifier, the second harmonic of the phase-modulated laser will return to single frequency, because the (-π/2, π/2) modulation is doubled to (-π, π) for the second harmonic. A compression rate as high as 95% is demonstrated in the experiment limited by the electronic bandwidth of the setup, which can be improved with optimized devices.

preprint2021arXiv

Spectral Variability Augmented Sparse Unmixing of Hyperspectral Images

Spectral unmixing (SU) expresses the mixed pixels existed in hyperspectral images as the product of endmember and abundance, which has been widely used in hyperspectral imagery analysis. However, the influence of light, acquisition conditions and the inherent properties of materials, results in that the identified endmembers can vary spectrally within a given image (construed as spectral variability). To address this issue, recent methods usually use a priori obtained spectral library to represent multiple characteristic spectra of the same object, but few of them extracted the spectral variability explicitly. In this paper, a spectral variability augmented sparse unmixing model (SVASU) is proposed, in which the spectral variability is extracted for the first time. The variable spectra are divided into two parts of intrinsic spectrum and spectral variability for spectral reconstruction, and modeled synchronously in the SU model adding the regular terms restricting the sparsity of abundance and the generalization of the variability coefficient. It is noted that the spectral variability library and the intrinsic spectral library are all constructed from the In-situ observed image. Experimental results over both synthetic and real-world data sets demonstrate that the augmented decomposition by spectral variability significantly improves the unmixing performance than the decomposition only by spectral library, as well as compared to state-of-the-art algorithms.

preprint2020arXiv

Adversarial Attack on Deep Product Quantization Network for Image Retrieval

Deep product quantization network (DPQN) has recently received much attention in fast image retrieval tasks due to its efficiency of encoding high-dimensional visual features especially when dealing with large-scale datasets. Recent studies show that deep neural networks (DNNs) are vulnerable to input with small and maliciously designed perturbations (a.k.a., adversarial examples). This phenomenon raises the concern of security issues for DPQN in the testing/deploying stage as well. However, little effort has been devoted to investigating how adversarial examples affect DPQN. To this end, we propose product quantization adversarial generation (PQ-AG), a simple yet effective method to generate adversarial examples for product quantization based retrieval systems. PQ-AG aims to generate imperceptible adversarial perturbations for query images to form adversarial queries, whose nearest neighbors from a targeted product quantizaiton model are not semantically related to those from the original queries. Extensive experiments show that our PQ-AQ successfully creates adversarial examples to mislead targeted product quantization retrieval models. Besides, we found that our PQ-AG significantly degrades retrieval performance in both white-box and black-box settings.

preprint2020arXiv

Experimental demonstration of a dusty plasma ratchet rectification and its reversal

The naturally persistent flow of hundreds of dust particles is experimentally achieved in a dusty plasma system with the asymmetric sawteeth of gears on the electrode. It is also demonstrated that the direction of the dust particle flowcan be controlled by changing the plasma conditions of the gas pressure or the plasma power. Numerical simulations of dust particles with the ion drag inside the asymmetric sawteeth verify the experimental observations of the flow rectification of dust particles. Both experiments and simulations suggest that the asymmetric potential and the collective effect are the twokeys in this dusty plasma ratchet.With the nonequilibrium ion drag, the dust flow along the asymmetric orientation of this electric potential of the ratchet can be reversed by changing the balance height of dust particles using different plasma conditions.

preprint2020arXiv

Fast Adaptively Weighted Matrix Factorization for Recommendation with Implicit Feedback

Recommendation from implicit feedback is a highly challenging task due to the lack of the reliable observed negative data. A popular and effective approach for implicit recommendation is to treat unobserved data as negative but downweight their confidence. Naturally, how to assign confidence weights and how to handle the large number of the unobserved data are two key problems for implicit recommendation models. However, existing methods either pursuit fast learning by manually assigning simple confidence weights, which lacks flexibility and may create empirical bias in evaluating user's preference; or adaptively infer personalized confidence weights but suffer from low efficiency. To achieve both adaptive weights assignment and efficient model learning, we propose a fast adaptively weighted matrix factorization (FAWMF) based on variational auto-encoder. The personalized data confidence weights are adaptively assigned with a parameterized neural network (function) and the network can be inferred from the data. Further, to support fast and stable learning of FAWMF, a new specific batch-based learning algorithm fBGD has been developed, which trains on all feedback data but its complexity is linear to the number of observed data. Extensive experiments on real-world datasets demonstrate the superiority of the proposed FAWMF and its learning algorithm fBGD.

preprint2020arXiv

High order tensor moments of random vectors

A random vector $\bx\in \R^n$ is a vector whose coordinates are all random variables. A random vector is called a Gaussian vector if it follows Gaussian distribution. These terminology can also be extended to a random (Gaussian) matrix and random (Gaussian) tensor. The classical form of an $k$-order moment (for any positive integer $k$) of a random vector $\bx\in \R^n$ is usually expressed in a matrix form of size $n\times n^{k-1}$ generated from the $k$th derivative of the characteristic function or the moment generating function of $\bx$ , and the expression of an $k$-order moment is very complicate even for a standard normal distributed vector. With the tensor form, we can simplify all the expressions related to high order moments. The main purpose of this paper is to introduce the high order moments of a random vector in tensor forms and the high order moments of a standard normal distributed vector. Finally we present an expression of high order moments of a random vector that follows a Gaussian distribution.

preprint2020arXiv

Toward Adversarial Robustness via Semi-supervised Robust Training

Adversarial examples have been shown to be the severe threat to deep neural networks (DNNs). One of the most effective adversarial defense methods is adversarial training (AT) through minimizing the adversarial risk $R_{adv}$, which encourages both the benign example $x$ and its adversarially perturbed neighborhoods within the $\ell_{p}$-ball to be predicted as the ground-truth label. In this work, we propose a novel defense method, the robust training (RT), by jointly minimizing two separated risks ($R_{stand}$ and $R_{rob}$), which is with respect to the benign example and its neighborhoods respectively. The motivation is to explicitly and jointly enhance the accuracy and the adversarial robustness. We prove that $R_{adv}$ is upper-bounded by $R_{stand} + R_{rob}$, which implies that RT has similar effect as AT. Intuitively, minimizing the standard risk enforces the benign example to be correctly predicted, and the robust risk minimization encourages the predictions of the neighbor examples to be consistent with the prediction of the benign example. Besides, since $R_{rob}$ is independent of the ground-truth label, RT is naturally extended to the semi-supervised mode ($i.e.$, SRT), to further enhance the adversarial robustness. Moreover, we extend the $\ell_{p}$-bounded neighborhood to a general case, which covers different types of perturbations, such as the pixel-wise ($i.e.$, $x + δ$) or the spatial perturbation ($i.e.$, $ AX + b$). Extensive experiments on benchmark datasets not only verify the superiority of the proposed SRT method to state-of-the-art methods for defensing pixel-wise or spatial perturbations separately, but also demonstrate its robustness to both perturbations simultaneously. The code for reproducing main results is available at \url{https://github.com/THUYimingLi/Semi-supervised_Robust_Training}.