Source author record

Jia Jia

Jia Jia appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

math.AG, Computer Vision, Machine Learning, Multimedia, eess.AS, math.DS, math.NT, Sound, Artificial Intelligence, Computation and Language, Human-Computer Interaction, math.CV, math.GR, Social and Information Networks

Catalog footprint

What is connected

11 works

14 topics

4 close collaborators


preprint · 2026 · arXiv

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

preprint · 2026 · arXiv

Sheaf stable pairs on projective surfaces and birational geometry

We study the moduli space of higher rank marginally stable pairs $(E, s := (s_1, \dots, s_r))$ consisting of a torsion-free coherent sheaf $E$ of rank $r$ and $r$ sections $(s_1, \dots, s_r)$ on a smooth projective surface. Having fixed the Chern character of $E$, the resulting moduli space is isomorphic to some subscheme of the Quot-scheme parametrising quotient sheaves of appropriate Chern character. We establish a connection between the moduli space of higher rank stable pairs and stable minimal models induced by the sheaf $E$ and sections $s_i$ and the relative lc model of the base surface, and use the birational geometry of minimal models to analyse in detail the components of the fibre of the Hilbert-Chow morphism from the moduli space to the Hilbert scheme of effective Cartier divisors on the base surface.

preprint · 2026 · arXiv

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.

preprint · 2023 · arXiv

Surjective endomorphisms of projective surfaces: the existence of infinitely many dense orbits

Let $f \colon X \to X$ be a surjective endomorphism of a normal projective surface. When $\operatorname{deg} f \geq 2$, applying an (iteration of) $f$-equivariant minimal model program (EMMP), we determine the geometric structure of $X$. Using this, we extend the second author's result to singular surfaces to the extent that either $X$ has an $f$-invariant non-constant rational function, or $f$ has infinitely many Zariski-dense forward orbits; this result is also extended to Adelic topology (which is finer than Zariski topology).

preprint · 2022 · arXiv

Equivariant Kähler model for Fujiki's class

Let $X$ be a compact complex manifold in Fujiki's class $\mathcal{C}$, i.e., admitting a big $(1,1)$-class $[α]$. Consider $\text{Aut}(X)$ the group of biholomorphic automorphisms and $\text{Aut}_{[α]}(X)$ the subgroup of automorphisms preserving the class $[α]$ via pullback. We show that $X$ admits an $\text{Aut}_{[α]}(X)$-equivariant Kähler model: there is a bimeromorphic holomorphic map $σ\colon \widetilde{X}\to X$ from a Kähler manifold $\widetilde{X}$ such that $\text{Aut}_{[α]}(X)$ lifts holomorphically via $σ$. There are several applications. We show that $\text{Aut}_{[α]}(X)$ is a Lie group with only finitely many components. This generalizes an early result of Lieberman and Fujiki on the Kähler case. We also show that every torsion subgroup of $\text{Aut}(X)$ is almost abelian, and $\text{Aut}(X)$ is finite if it is a torsion group.

preprint · 2022 · arXiv

Towards Cross-speaker Reading Style Transfer on Audiobook Dataset

Cross-speaker style transfer aims to extract the speech style of a given reference speech so that it can be reproduced in the timbre of arbitrary target speakers. Existing methods have used utterance-level style labels to perform style transfer via either global or local scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Thus, properly transferring the reading style across different speakers remains a challenging task. This paper introduces a chunk-wise multi-scale cross-speaker style model to capture both the global genre and the local prosody in audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experimental results confirm that the model manages to transfer a given reading style to new target speakers. With the support of a local prosody and global genre type predictor, the potential of the proposed method in multi-speaker audiobook generation is further revealed.

preprint · 2021 · arXiv

Potential density of projective varieties having an int-amplified endomorphism

We consider the potential density of rational points on an algebraic variety defined over a number field $K$, i.e., the property that the set of rational points of $X$ becomes Zariski dense after a finite field extension of $K$. For a non-uniruled projective variety with an int-amplified endomorphism, we show that it always satisfies potential density. When a rationally connected variety admits an int-amplified endomorphism, we prove that there exists some rational curve with a Zariski dense forward orbit, assuming the Zariski dense orbit conjecture in lower dimensions. As an application, we prove the potential density for projective varieties with int-amplified endomorphisms in dimension $\leq 3$. We also study the existence of densely many rational points with the maximal arithmetic degree over a sufficiently large number field.

preprint · 2020 · arXiv

ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit

Dance and music are two highly correlated artistic forms. Synthesizing dance motions has attracted much attention recently. Most previous works perform music-to-dance synthesis by directly mapping music to human skeleton keypoints. In contrast, human choreographers design dance motions from music in two stages: they first devise multiple choreographic action units (CAUs), each with a series of dance motions, and then arrange the CAU sequence according to the rhythm, melody and emotion of the music. Inspired by this, we systematically study this two-stage choreography approach and construct a dataset that incorporates such choreography knowledge. Based on the constructed dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to imitate the human choreography procedure. Our framework first uses a CAU prediction model to learn the mapping between music and CAU sequences. Afterwards, a spatial-temporal inpainting model converts the CAU sequence into continuous dance motions. Experimental results demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in terms of CAU BLEU score and 1.59 in terms of user study score).

preprint · 2020 · arXiv

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Generating 3D speech-driven talking heads has received more and more attention in recent years. Recent approaches have two main limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phonetic posteriorgrams (PPG). As a result, our method does not need handcrafted features and is more robust to noise than recent approaches. Furthermore, our method can support multilingual speech as input by building a universal phoneme space. As far as we know, our model is the first to support multilingual/mixlingual speech as input with convincing results. Objective and subjective experiments show that our model can generate high-quality animations given speech from unseen languages or speakers and is robust to noise.

preprint · 2016 · arXiv

Study on Feature Subspace of Archetypal Emotions for Speech Emotion Recognition

Feature subspace selection is an important part of speech emotion recognition. Most studies are devoted to finding a single feature subspace for representing all emotions. However, some studies have indicated that the features associated with different emotions are not exactly the same. Hence, traditional methods may fail to distinguish some of the emotions with just one global feature subspace. In this work, we propose a new divide-and-conquer approach to solve the problem. First, a feature subspace is constructed for every pair of distinct emotions (emotion pair). Binary classifiers are then trained on these feature subspaces, one per pair. The final emotion recognition result is derived by the voting and competition method. Experimental results demonstrate that the proposed method achieves better results than the traditional multi-classification method.
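The pairwise scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the threshold classifiers, the feature indices, the emotion labels and the tie-breaking rule are all hypothetical stand-ins, and only plain majority voting is shown because the abstract does not specify the "competition" step.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical type: a pairwise classifier maps a feature vector
# to one of the two emotions it was built for.
PairClassifier = Callable[[Sequence[float]], str]


def make_threshold_classifier(low: str, high: str, index: int, threshold: float) -> PairClassifier:
    """Toy stand-in for a trained binary classifier.

    It looks at a single feature (its own "subspace") and returns
    `high` when that feature exceeds the threshold, otherwise `low`.
    """
    def classify(features: Sequence[float]) -> str:
        return high if features[index] > threshold else low
    return classify


def predict_by_pairwise_voting(
    features: Sequence[float],
    emotions: Sequence[str],
    classifiers: Dict[Tuple[str, str], PairClassifier],
) -> str:
    """Return the emotion that wins the most pairwise decisions."""
    votes = Counter()
    for pair in combinations(emotions, 2):
        votes[classifiers[pair](features)] += 1
    # Assumption: ties go to the emotion listed first in `emotions`.
    return max(emotions, key=lambda emotion: votes[emotion])


# Illustrative labels and classifiers; each pair uses its own feature index.
EMOTIONS = ["angry", "happy", "sad"]
CLASSIFIERS = {
    ("angry", "happy"): make_threshold_classifier("angry", "happy", 0, 0.5),
    ("angry", "sad"): make_threshold_classifier("sad", "angry", 1, 0.5),
    ("happy", "sad"): make_threshold_classifier("sad", "happy", 0, 0.5),
}

if __name__ == "__main__":
    print(predict_by_pairwise_voting([0.9, 0.2], EMOTIONS, CLASSIFIERS))
```

Each emotion pair gets its own classifier and its own slice of the features, so a feature that separates one pair well does not have to be useful for the others.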

preprint · 2014 · arXiv

Modeling Emotion Influence from Images in Social Networks

Images have become an important and prevalent way to express users' activities, opinions and emotions. In a social network, individual emotions may be influenced by others, in particular by close friends. We focus on understanding how users embed emotions into the images they upload to social websites and how social influence plays a role in changing users' emotions. We first verify the existence of emotion influence in image networks, and then propose a probabilistic factor graph based emotion influence model to answer the question of "who influences whom". Employing a real network from Flickr as experimental data, we study the effectiveness of factors in the proposed model with in-depth data analysis. Our experiments also show that our model, by incorporating the emotion influence, can significantly improve the accuracy (+5%) for predicting emotions from images. Finally, a case study is used as anecdotal evidence to further demonstrate the effectiveness of the proposed model.
