Source author record

Xiaofeng Zhang

Xiaofeng Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Artificial Intelligence, Computer Vision, Machine Learning, Computation and Language, cond-mat.mtrl-sci, hep-ph, hep-th, Information Retrieval, Social and Information Networks

Catalog footprint

What is connected

12 works

9 topics

4 close collaborators

preprint, 2026, arXiv

CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts -- where missed cues can stereotype communities and undermine usability. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit (stated) and implicit (unstated, implied by the prompt's cultural context) cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we show that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, provide a concrete testbed, and outline actionable directions for developing culturally informed T2I models and metrics that improve global usability.

preprint, 2026, arXiv

Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCOT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves relative performance gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.

preprint, 2026, arXiv

GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model's internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.

preprint, 2026, arXiv

MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing

Graph Neural Networks (GNNs) suffer from over-squashing in deep message passing, where information from exponentially growing neighborhoods is compressed into fixed-dimensional representations. We show that this issue becomes a distinct failure mode in multi-label graphs: neighboring nodes often share only limited labels while differing across many irrelevant ones, causing predictive signals to be diluted by noisy label information. To address this challenge, we propose the Multi-Label Graph Information Bottleneck (MLGIB), which formulates multi-label message passing as constrained information transmission under irrelevant label noise. MLGIB balances expressiveness and robustness by preserving predictive label signals while suppressing irrelevant noise. Specifically, it constructs a Markovian dependence space and derives tractable variational bounds, where the lower bound maximizes mutual information with target labels and the upper bound constrains redundant source information. These bounds lead to an end-to-end label-aware message-passing architecture. Extensive experiments on multiple benchmarks demonstrate consistent improvements over existing methods, validating the effectiveness and generality of the proposed framework.

preprint, 2026, arXiv

Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the size of the instruction tuning dataset often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.

preprint, 2022, arXiv

DAGNN: Demand-aware Graph Neural Networks for Session-based Recommendation

Session-based recommendation has been widely adopted by online video and e-commerce websites. Most existing approaches are designed to discover underlying interests or preferences from anonymous session data. This ignores the fact that these sequential behaviors usually reflect the session user's potential demand, i.e., a semantic-level factor, and estimating the underlying demands from a session is challenging. To address this issue, this paper proposes demand-aware graph neural networks (DAGNN). In particular, a demand modeling component is designed to first extract the session demand, and the underlying multiple demands of each session are estimated using the global demand matrix. Then, the demand-aware graph neural network is designed to extract the session demand graph and learn the demand-aware item embeddings for the later recommendations. A mutual information loss is further designed to enhance the quality of the learnt embeddings. Extensive experiments are conducted on several real-world datasets, and the proposed model achieves state-of-the-art (SOTA) performance.

preprint, 2022, arXiv

NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

This paper reviews the NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video. In this challenge, we proposed the LDV 2.0 dataset, which includes the LDV dataset (240 videos) and 95 additional videos. The challenge includes three tracks. Track 1 aims at enhancing videos compressed by HEVC at a fixed QP. Track 2 and Track 3 target both the super-resolution and quality enhancement of HEVC compressed video. They require x2 and x4 super-resolution, respectively. The three tracks attracted more than 600 registrations in total. In the test phase, 8 teams, 8 teams and 12 teams submitted final results to Tracks 1, 2 and 3, respectively. The proposed methods and solutions gauge the state of the art of super-resolution and quality enhancement of compressed video. The proposed LDV 2.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge (including open-sourced code) is at https://github.com/RenYang-home/NTIRE22_VEnh_SR.

preprint, 2022, arXiv

The mass-degenerate SM-like Higgs and anomaly of $(g-2)_μ$ in $μ$-term extended NMSSM

We chose the $μ$-term extended next-to-minimal supersymmetric standard model ($μ$NMSSM) for this work, and the phenomenological research is based on the assumption of a double Higgs resonance state as the Standard Model (SM)-like Higgs, considering the recent $(g-2)_μ$ result. The study also takes into account a variety of experimental results, including direct detection of dark matter (DM) and search results for sparticles at the Large Hadron Collider (LHC). We study the characteristics of DM under the constraints of direct detection experiments. Following that, we concentrate on the properties of the mass-degenerate SM-like Higgs bosons and on explaining the anomaly of $(g-2)_μ$. We conclude that the anomaly of $(g-2)_μ$ can be explained in the scenario with two mass-degenerate SM-like Higgs bosons, and there are samples that meet all current constraints and outperform the SM in fitting Higgs data.

preprint, 2021, arXiv

VDPC: Variational Density Peak Clustering Algorithm

The widely applied density peak clustering (DPC) algorithm makes an intuitive cluster formation assumption that cluster centers are often surrounded by data points with lower local density and far away from other data points with higher local density. However, this assumption is often problematic when identifying clusters with lower density, because they may easily be merged into other clusters with higher density. As a result, DPC may not be able to identify clusters with variational density. To address this issue, we propose a variational density peak clustering (VDPC) algorithm, which is designed to systematically and autonomously perform the clustering task on datasets with various types of density distributions. Specifically, we first propose a novel method to identify the representatives among all data points and construct initial clusters based on the identified representatives for further analysis of the clusters' properties. Furthermore, we divide all data points into different levels according to their local density and propose a unified clustering framework by combining the advantages of both DPC and DBSCAN. Thus, all the identified initial clusters spreading across different density levels are systematically processed to form the final clusters. To evaluate the effectiveness of the proposed VDPC algorithm, we conduct extensive experiments using 20 datasets including eight synthetic, six real-world and six image datasets. The experimental results show that VDPC outperforms two classical algorithms (i.e., DPC and DBSCAN) and four state-of-the-art extended DPC algorithms.

preprint, 2020, arXiv

Heterogeneous-Temporal Graph Convolutional Networks: Make the Community Detection Much Better

Community detection has long been an important yet challenging task in the analysis of complex networks, with a focus on detecting topological structures of graph data. Essentially, real-world graph data contains various features, node types and edge types which dynamically vary over time, and this invalidates most existing community detection approaches. To cope with these issues, this paper proposes the heterogeneous-temporal graph convolutional networks (HTGCN) to detect communities from heterogeneous and temporal graphs. In particular, we first design a heterogeneous GCN component to acquire feature representations for each heterogeneous graph at each time step. Then, a residual compressed aggregation component is proposed to represent "dynamic" features for "varying" communities, which are then aggregated with "static" features extracted from the current graph. Extensive experiments are conducted on two real-world datasets, i.e., DBLP and IMDB. The promising results demonstrate that the proposed HTGCN is superior to both benchmark and state-of-the-art approaches, e.g., GCN, GAT, GNN, LGNN, HAN and STAR, with respect to a number of evaluation criteria.

preprint, 2020, arXiv

Structure Matters: Towards Generating Transferable Adversarial Images

Recent works on adversarial examples for image classification focus on directly modifying pixels with minor perturbations. The small perturbation requirement is imposed to ensure that the generated adversarial examples are natural and realistic to humans, which, however, puts a curb on the attack space, thus limiting the attack ability and transferability, especially for systems protected by a defense mechanism. In this paper, we propose the novel concepts of structure patterns and structure-aware perturbations that relax the small perturbation constraint while still keeping images natural. The key idea of our approach is to allow perceptible deviation in adversarial examples while keeping structure patterns that are central to a human classifier. Built upon these concepts, we propose a structure-preserving attack (SPA) for generating natural adversarial examples with extremely high transferability. Empirical results on the MNIST and CIFAR10 datasets show that SPA exhibits strong attack ability in both the white-box and black-box settings, even when defenses are applied. Moreover, with the integration of the PGD or CW attack, its attack ability escalates sharply under the white-box setting, without losing the outstanding transferability inherited from SPA.

preprint, 2014, arXiv

Surface atomic diffusion processes observed at milliseconds time resolution using environmental TEM

Significant progress has been made in spatial resolution using environmental transmission electron microscopes (ETEM), which now enable atomic-resolution visualization of structural transformations under variable temperature and gas environments close to the materials' real operating conditions. Structural transformations are observed by recording images or diffraction patterns at various time intervals using a video camera or by taking snapshots using electron pulses. While time resolution of 15 ns has been reported using pulsed electron beams, the time interval that can be recorded by this technique is currently very limited. For longer recording, however, time resolution inside the ETEM has long been limited by electron cameras to ~1/30 seconds. Using the recently developed direct electron detection technology, we have significantly improved the time resolution of ETEM to 2.5 ms (milliseconds) for a full frame or 0.625 ms for a quarter frame.
