Researcher profile

Yijun Yang

Yijun Yang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2026arXiv

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

preprint2026arXiv

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, yet it remains unclear how dynamic adversarial fine-tuning changes the internal carriers of refusal. We study one 7B backbone under supervised fine-tuning (SFT) and under Robust Refusal Dynamic Defense (R2D2), a HarmBench-style adversarial fine-tuning procedure that repeatedly refreshes harmful training cases with current jailbreak attacks. Our protocol aligns fixed-source HarmBench, StrongREJECT, and XSTest with a five-anchor refusal-geometry suite, causal interventions, and a sparse adaptive stress test. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints, but that regime coincides with maximal XSTest refusal and complete failure on a benign-utility audit. Later checkpoints partially recover benign utility while partially reopening attack success. Sparse adaptive attacks sharpen the same frontier: step~50 remains closed under both adaptive GCG and AutoDAN, whereas adaptive GCG ASR rises to 0.415 at step~250 and 0.613 at step~500. Geometrically, R2D2 preserves a late-layer admissible carrier through step~100 and relocates the best admissible carrier to an early layer by step~250; SFT relocates earlier while remaining less robust. Effective rank remains near 1.24, and SFT exhibits larger principal-angle drift despite worse robustness. Causal interventions show that late-stage R2D2 behavior is controlled by a low-dimensional but utility-coupled carrier. These results support a geometry-reorganization account along a robustness--utility frontier.

preprint2026arXiv

TextLDM: Language Modeling with Continuous Latent Diffusion

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.

preprint2026arXiv

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.

preprint2022arXiv

Be Your Own Neighborhood: Detecting Adversarial Example by the Neighborhood Relations Built on Self-Supervised Learning

Deep Neural Networks (DNNs) have achieved excellent performance in various fields. However, DNNs' vulnerability to Adversarial Examples (AE) hinders their deployments to safety-critical applications. This paper presents a novel AE detection framework, named BEYOND, for trustworthy predictions. BEYOND performs the detection by distinguishing the AE's abnormal relation with its augmented versions, i.e. neighbors, from two prospects: representation similarity and label consistency. An off-the-shelf Self-Supervised Learning (SSL) model is used to extract the representation and predict the label for its highly informative representation capacity compared to supervised learning models. For clean samples, their representations and predictions are closely consistent with their neighbors, whereas those of AEs differ greatly. Furthermore, we explain this observation and show that by leveraging this discrepancy BEYOND can effectively detect AEs. We develop a rigorous justification for the effectiveness of BEYOND. Furthermore, as a plug-and-play model, BEYOND can easily cooperate with the Adversarial Trained Classifier (ATC), achieving the state-of-the-art (SOTA) robustness accuracy. Experimental results show that BEYOND outperforms baselines by a large margin, especially under adaptive attacks. Empowered by the robust relation net built on SSL, we found that BEYOND outperforms baselines in terms of both detection ability and speed. Our code will be publicly available.

preprint2022arXiv

Insight-HXMT Study of the Inner Accretion Disk in the Black Hole Candidate EXO 1846--031

We study the spectral evolution of the black hole candidate EXO 1846$-$031 during its 2019 outburst, in the 1--150 keV band,with the {\it {Hard X-ray Modulation Telescope}}. The continuum spectrum is well modelled with an absorbed disk-blackbody plus cutoff power-law, in the hard, intermediate and soft states. In addition, we detect an $\approx$6.6 keV Fe emission line in the hard intermediate state. Throughout the soft intermediate and soft states, the fitted inner disk radius remains almost constant; we suggest that it has settled at the innermost stable circular orbit (ISCO). However, in the hard and hard intermediate states, the apparent inner radius was unphysically small (smaller than ISCO), even after accounting for the Compton scattering of some of the disk photons by the corona in the fit. We argue that this is the result of a high hardening factor, $f_{\rm col}\approx2.0-2.7$, in the early phases of outburst evolution, well above the canonical value of 1.7 suitable to a steady disk. We suggest that the inner disk radius was close to ISCO already in the low/hard state. Furthermore, we propose that this high value of hardening factor in the relatively hard state is probably caused by the additional illuminating of the coronal irradiation onto the disk. Additionally, we estimate the spin parameter with the continuum-fitting method, over a range of plausible black hole masses and distances. We compare our results with the spin measured with the reflection-fitting method and find that the inconsistency of the two results is partly caused by the different choices of $f_{\rm col}$.

preprint2022arXiv

MixDefense: A Defense-in-Depth Framework for Adversarial Example Detection Based on Statistical and Semantic Analysis

Machine learning with deep neural networks (DNNs) has become one of the foundation techniques in many safety-critical systems, such as autonomous vehicles and medical diagnosis systems. DNN-based systems, however, are known to be vulnerable to adversarial examples (AEs) that are maliciously perturbed variants of legitimate inputs. While there has been a vast body of research to defend against AE attacks in the literature, the performances of existing defense techniques are still far from satisfactory, especially for adaptive attacks, wherein attackers are knowledgeable about the defense mechanisms and craft AEs accordingly. In this work, we propose a multilayer defense-in-depth framework for AE detection, namely MixDefense. For the first layer, we focus on those AEs with large perturbations. We propose to leverage the `noise' features extracted from the inputs to discover the statistical difference between natural images and tampered ones for AE detection. For AEs with small perturbations, the inference result of such inputs would largely deviate from their semantic information. Consequently, we propose a novel learning-based solution to model such contradictions for AE detection. Both layers are resilient to adaptive attacks because there do not exist gradient propagation paths for AE generation. Experimental results with various AE attack methods on image classification datasets show that the proposed MixDefense solution outperforms the existing AE detection techniques by a considerable margin.

preprint2022arXiv

Out-of-Distribution Detection with Semantic Mismatch under Masking

This paper proposes a novel out-of-distribution (OOD) detection framework named MoodCat for image classifiers. MoodCat masks a random portion of the input image and uses a generative model to synthesize the masked image to a new image conditioned on the classification result. It then calculates the semantic difference between the original image and the synthesized one for OOD detection. Compared to existing solutions, MoodCat naturally learns the semantic information of the in-distribution data with the proposed mask and conditional synthesis strategy, which is critical to identifying OODs. Experimental results demonstrate that MoodCat outperforms state-of-the-art OOD detection solutions by a large margin.

preprint2022arXiv

What You See is Not What the Network Infers: Detecting Adversarial Examples Based on Semantic Contradiction

Adversarial examples (AEs) pose severe threats to the applications of deep neural networks (DNNs) to safety-critical domains, e.g., autonomous driving. While there has been a vast body of AE defense solutions, to the best of our knowledge, they all suffer from some weaknesses, e.g., defending against only a subset of AEs or causing a relatively high accuracy loss for legitimate inputs. Moreover, most existing solutions cannot defend against adaptive attacks, wherein attackers are knowledgeable about the defense mechanisms and craft AEs accordingly. In this paper, we propose a novel AE detection framework based on the very nature of AEs, i.e., their semantic information is inconsistent with the discriminative features extracted by the target DNN model. To be specific, the proposed solution, namely ContraNet, models such contradiction by first taking both the input and the inference result to a generator to obtain a synthetic output and then comparing it against the original input. For legitimate inputs that are correctly inferred, the synthetic output tries to reconstruct the input. On the contrary, for AEs, instead of reconstructing the input, the synthetic output would be created to conform to the wrong label whenever possible. Consequently, by measuring the distance between the input and the synthetic output with metric learning, we can differentiate AEs from legitimate inputs. We perform comprehensive evaluations under various AE attack scenarios, and experimental results show that ContraNet outperforms existing solutions by a large margin, especially under adaptive attacks. Moreover, our analysis shows that successful AEs that can bypass ContraNet tend to have much-weakened adversarial semantics. We have also shown that ContraNet can be easily combined with adversarial training techniques to achieve further improved AE defense capabilities.

preprint2021arXiv

The distance between the weights of the neural network is meaningful

In the application of neural networks, we need to select a suitable model based on the problem complexity and the dataset scale. To analyze the network's capacity, quantifying the information learned by the network is necessary. This paper proves that the distance between the neural network weights in different training stages can be used to estimate the information accumulated by the network in the training process directly. The experiment results verify the utility of this method. An application of this method related to the label corruption is shown at the end.