Researcher profile

Jia Jia

Jia Jia contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
9 works
0 followers
11 topics
4 close collaborators

Published work

9 published items

preprint, 2026, arXiv

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

preprint, 2026, arXiv

Sheaf stable pairs on projective surfaces and birational geometry

We study the moduli space of higher-rank marginally stable pairs $(E, s := (s_1, \dots, s_r))$ consisting of a torsion-free coherent sheaf $E$ of rank $r$ and $r$ sections $(s_1, \dots, s_r)$ on a smooth projective surface. Having fixed the Chern character of $E$, the resulting moduli space is isomorphic to a subscheme of the Quot scheme parametrising quotient sheaves of appropriate Chern character. We establish a connection between the moduli space of higher-rank stable pairs and the stable minimal models induced by the sheaf $E$ and sections $s_i$ and the relative lc model of the base surface, and use the birational geometry of minimal models to analyse in detail the components of the fibre of the Hilbert-Chow morphism from the moduli space to the Hilbert scheme of effective Cartier divisors on the base surface.

preprint, 2026, arXiv

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.

preprint, 2023, arXiv

Surjective endomorphisms of projective surfaces: the existence of infinitely many dense orbits

Let $f \colon X \to X$ be a surjective endomorphism of a normal projective surface. When $\operatorname{deg} f \geq 2$, applying an (iteration of) $f$-equivariant minimal model program (EMMP), we determine the geometric structure of $X$. Using this, we extend the second author's result to singular surfaces to the extent that either $X$ has an $f$-invariant non-constant rational function, or $f$ has infinitely many Zariski-dense forward orbits; this result is also extended to the adelic topology (which is finer than the Zariski topology).

preprint, 2022, arXiv

Equivariant Kähler model for Fujiki's class

Let $X$ be a compact complex manifold in Fujiki's class $\mathcal{C}$, i.e., admitting a big $(1,1)$-class $[\alpha]$. Consider $\text{Aut}(X)$ the group of biholomorphic automorphisms and $\text{Aut}_{[\alpha]}(X)$ the subgroup of automorphisms preserving the class $[\alpha]$ via pullback. We show that $X$ admits an $\text{Aut}_{[\alpha]}(X)$-equivariant Kähler model: there is a bimeromorphic holomorphic map $\sigma \colon \widetilde{X}\to X$ from a Kähler manifold $\widetilde{X}$ such that $\text{Aut}_{[\alpha]}(X)$ lifts holomorphically via $\sigma$. There are several applications. We show that $\text{Aut}_{[\alpha]}(X)$ is a Lie group with only finitely many components. This generalizes an early result of Lieberman and Fujiki on the Kähler case. We also show that every torsion subgroup of $\text{Aut}(X)$ is almost abelian, and $\text{Aut}(X)$ is finite if it is a torsion group.

preprint, 2022, arXiv

Towards Cross-speaker Reading Style Transfer on Audiobook Dataset

Cross-speaker style transfer aims to extract the speech style of a given reference speech so that it can be reproduced in the timbre of arbitrary target speakers. Existing methods on this topic have explored utilizing utterance-level style labels to perform style transfer via either global-scale or local-scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Thus, properly transferring the reading style across different speakers remains a challenging task. This paper introduces a chunk-wise multi-scale cross-speaker style model to capture both the global genre and the local prosody in audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experimental results confirm that the model manages to transfer a given reading style to new target speakers. With the support of local prosody and global genre type predictors, the potential of the proposed method in multi-speaker audiobook generation is further revealed.

preprint, 2021, arXiv

Potential density of projective varieties having an int-amplified endomorphism

We consider the potential density of rational points on an algebraic variety defined over a number field $K$, i.e., the property that the set of rational points of $X$ becomes Zariski dense after a finite field extension of $K$. For a non-uniruled projective variety with an int-amplified endomorphism, we show that it always satisfies potential density. When a rationally connected variety admits an int-amplified endomorphism, we prove that there exists some rational curve with a Zariski dense forward orbit, assuming the Zariski dense orbit conjecture in lower dimensions. As an application, we prove the potential density for projective varieties with int-amplified endomorphisms in dimension $\leq 3$. We also study the existence of densely many rational points with the maximal arithmetic degree over a sufficiently large number field.

preprint, 2020, arXiv

ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit

Dance and music are two highly correlated artistic forms. Synthesizing dance motions has attracted much attention recently. Most previous works conduct music-to-dance synthesis by directly mapping music to human skeleton keypoints. Human choreographers, by contrast, design dance motions from music in a two-stage manner: they first devise multiple choreographic action units (CAUs), each with a series of dance motions, and then arrange the CAU sequence according to the rhythm, melody and emotion of the music. Inspired by this, we systematically study this two-stage choreography approach and construct a dataset to incorporate such choreography knowledge. Based on the constructed dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to imitate the human choreography procedure. Our framework first devises a CAU prediction model to learn the mapping relationship between music and CAU sequences. Afterwards, we devise a spatial-temporal inpainting model to convert the CAU sequence into continuous dance motions. Experimental results demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in terms of CAU BLEU score and 1.59 in terms of user study score).

preprint, 2020, arXiv

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Generating 3D speech-driven talking heads has received more and more attention in recent years. Recent approaches mainly have the following limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phonetic posteriorgrams (PPG). In this way, our method doesn't need handcrafted features and is more robust to noise compared to recent approaches. Furthermore, our method can support multilingual speech as input by building a universal phoneme space. As far as we know, our model is the first to support multilingual/mixlingual speech as input with convincing results. Objective and subjective experiments have shown that our model can generate high-quality animations given speech from unseen languages or speakers and is robust to noise.