Source author record

Haoyuan Li

Haoyuan Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Computer Vision, Computation and Language, Artificial Intelligence, Biological Physics, cond-mat.mtrl-sci, cond-mat.quant-gas, cond-mat.stat-mech, Databases, eess.AS, Machine Learning, physics.data-an, physics.ins-det, physics.optics, Sound

Catalog footprint

What is connected

9 works

14 topics

4 close collaborators

preprint 2026 arXiv

Unified Personalized Understanding, Generating and Editing

Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a "one-size-fits-all" paradigm and struggle to model user-specific concepts (e.g., generate a photo of <maeve>) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge. We present OmniPersona, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior. To systematically evaluate unified personalization, we propose OmniPBench, extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.

preprint 2025 arXiv

R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.

preprint 2024 arXiv

Hard X-ray Generation and Detection of Nanometer-Scale Localized Coherent Acoustic Wave Packets in SrTiO$_3$ and KTaO$_3$

Using an all x-ray pump and probe scattering experiment, we demonstrate that the absorption of femtosecond x-ray pulses can excite quasi-spherical high-wavevector coherent acoustic phonon wave packets. The time- and momentum-resolved diffuse scattering signal is consistent with strain pulses induced by the rapid electron cascade dynamics following photoionization at uncorrelated excitation centers. We quantify key parameters of this process, including the localization size of the strain wave packet and the energy absorption efficiency, which are determined by the photoelectron and Auger electron cascade dynamics, as well as the electron-phonon interaction. In particular, we obtain the localization size of the observed strain wave packet to be 1.5 and 2.5 nm for bulk SrTiO$_3$ and KTaO$_3$ single crystals, even though there are no nanoscale structures or light-intensity patterns that would ordinarily be required to generate acoustic waves of wavelengths much shorter than the penetration depth. In GaAs and GaP, by contrast, we do not observe a signal above background. The results provide crucial information on x-ray matter interactions, shedding light on the mechanism of x-ray energy deposition and on the study of high-wavevector acoustic phonons and thermal transport at the nanoscale.

preprint 2022 arXiv

Generation of highly mutually coherent hard x-ray pulse pairs with an amplitude-splitting delay line

Beam splitters and delay lines are among the key building blocks of modern-day optical laser technologies. Progress in x-ray free electron laser source development and applications over the past decade calls for their counterparts operating in the Angstrom wavelength regime. Recent efforts in x-ray optics development have demonstrated relatively stable delay lines that most often adopt the division-of-wavefront approach for the beam splitting and recombination configuration. However, the two recombined beams have yet to achieve sufficient mutual coherence to enable applications such as interferometry, correlation spectroscopy, and nonlinear spectroscopy. We present the first experimental realization of the generation of highly mutually coherent pulse pairs using an amplitude-split delay line design based on transmission grating beam splitters and channel-cut crystal optic delay lines. The performance of the prototype system was analyzed in the context of x-ray coherent scattering and correlation spectroscopy, where we obtained nearly identical high-contrast speckle patterns from both branches. In addition, we show a high level of dynamical stability during continuous delay scans, a capability essential for high-sensitivity ultrafast measurements.

preprint 2022 arXiv

Video-Guided Curriculum Learning for Spoken Video Grounding

In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions. Compared with using text, employing audio requires the model to directly exploit the useful phonemes and syllables related to the video from raw speech. Moreover, we randomly add environmental noises to this speech audio, further increasing the difficulty of this task and better simulating real applications. To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy for the audio pre-training process, which makes use of vital visual perceptions to help understand the spoken language and suppress the external noise. Considering that during inference the model cannot obtain ground truth video segments, we design a curriculum strategy that gradually shifts the input video from the ground truth to the entire video content during pre-training. Finally, the model can learn how to extract critical visual information from the entire video clip to help understand the spoken language. In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet, which is named the ActivityNet Speech dataset. Extensive experiments demonstrate that our proposed video-guided curriculum learning can facilitate the pre-training process to obtain a mutual audio encoder, significantly promoting the performance of spoken video grounding tasks. Moreover, we prove that in the case of noisy sound, our model outperforms the method that grounds video with ASR transcripts, further demonstrating the effectiveness of our curriculum strategy.
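The curriculum described in this abstract, shifting the input video from the ground-truth segment to the entire video during pre-training, can be illustrated with a minimal sketch. The function name `curriculum_window`, the `progress` parameter, and the linear widening rule are all assumptions made for illustration; the abstract does not specify the actual schedule used by the authors.

```python
def curriculum_window(gt_start, gt_end, video_len, progress):
    """Return the (start, end) span of video shown to the model.

    Hypothetical schedule, not taken from the paper:
      progress = 0.0 -> exactly the ground-truth segment
      progress = 1.0 -> the entire video [0, video_len]
    The span widens linearly in between.
    """
    if not 0.0 <= gt_start <= gt_end <= video_len:
        raise ValueError("expected 0 <= gt_start <= gt_end <= video_len")
    # Clamp progress so out-of-range values cannot shrink or overshoot the span.
    p = min(max(progress, 0.0), 1.0)
    start = gt_start * (1.0 - p)                 # moves toward 0
    end = gt_end + (video_len - gt_end) * p      # moves toward video_len
    return start, end


if __name__ == "__main__":
    # Ground truth is seconds 10-20 of a 60-second clip.
    for step in (0.0, 0.5, 1.0):
        print(step, curriculum_window(10.0, 20.0, 60.0, step))
```

In a training loop, `progress` would typically be the fraction of pre-training completed, so early steps see only the relevant segment and late steps see the full clip, as the model would at inference time.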

preprint 2020 arXiv

An advanced workflow for single particle imaging with the limited data at an X-ray free-electron laser

An improved analysis for single particle imaging (SPI) experiments using limited data is presented here. Results are based on a study of bacteriophage PR772 performed at the AMO instrument at the Linac Coherent Light Source (LCLS) as part of the SPI initiative. Existing methods were modified to cope with the shortcomings of the experimental data: inaccessibility of information from half of the detector and a small fraction of single hits. The general SPI analysis workflow was upgraded with expectation-maximization based classification of diffraction patterns and mode decomposition in the final virus structure determination step. The presented processing pipeline allowed us to determine the three-dimensional structure of the bacteriophage PR772 without symmetry constraints with a spatial resolution of 6.9 nm. The obtained resolution was limited by the scattering intensity during the experiment and the relatively small number of single hits.

preprint 2020 arXiv

Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding

Understanding intrinsic patterns and predicting spatiotemporal characteristics of cities require a comprehensive representation of urban neighborhoods. Existing works relied on either inter- or intra-region connectivities to generate neighborhood representations but failed to fully utilize the informative yet heterogeneous data within neighborhoods. In this work, we propose Urban2Vec, an unsupervised multi-modal framework which incorporates both street view imagery and point-of-interest (POI) data to learn neighborhood embeddings. Specifically, we use a convolutional neural network to extract visual features from street view images while preserving geospatial similarity. Furthermore, we model each POI as a bag-of-words containing its category, rating, and review information. Analogous to document embedding in natural language processing, we establish the semantic similarity between a neighborhood ("document") and the words from its surrounding POIs in the vector space. By jointly encoding visual, textual, and geospatial information into the neighborhood representation, Urban2Vec can achieve performance better than baseline models and comparable to fully-supervised methods in downstream prediction tasks. Extensive experiments on three U.S. metropolitan areas also demonstrate the model's interpretability, generalization capability, and value in neighborhood similarity analysis.

preprint 2016 arXiv

Many-body localization in Ising models with random long-range interactions

We theoretically investigate the many-body localization phase transition in a one-dimensional Ising spin chain with random long-range spin-spin interactions, $V_{ij}\propto\left|i-j\right|^{-\alpha}$, where the exponent of the interaction range $\alpha$ can be tuned from zero to infinitely large. By using exact diagonalization, we calculate the half-chain entanglement entropy and the energy spectral statistics and use them to characterize the phase transition towards the many-body localization phase at infinite temperature and at sufficiently large disorder strength. We perform finite-size scaling to extract the critical disorder strength and the critical exponent of the divergent localization length. With increasing $\alpha$, the critical exponent experiences a sharp increase at about $\alpha=1$ and then gradually decreases to a value found earlier in a disordered short-ranged interacting spin chain. For $\alpha<1$, we find that the system is mostly localized and the increase in the disorder strength may drive a transition between two many-body localized phases. In contrast, for $\alpha>1$, the transition is from a thermalized phase to the many-body localization phase. Our predictions could be experimentally tested with an ion-trap quantum emulator with programmable random long-range interactions, or with randomly distributed Rydberg atoms or polar molecules in lattices.
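To make the power-law form of the couplings concrete, the following minimal sketch builds a symmetric coupling matrix whose entries decay as the inverse distance raised to the interaction-range exponent. The function name `random_long_range_couplings`, the uniform amplitudes in [-1, 1], and the seeded generator are assumptions for illustration only; the abstract states the power-law decay but not the disorder distribution.

```python
import random


def random_long_range_couplings(n_sites, alpha, seed=0):
    """Build a symmetric matrix V with V[i][j] = J_ij / |i - j|**alpha.

    J_ij is drawn uniformly from [-1, 1]; this distribution is an
    assumption of the sketch, not something stated in the abstract.
    Diagonal entries are zero (no self-interaction).
    """
    rng = random.Random(seed)
    V = [[0.0] * n_sites for _ in range(n_sites)]
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            amplitude = rng.uniform(-1.0, 1.0)
            V[i][j] = V[j][i] = amplitude / abs(i - j) ** alpha
    return V


if __name__ == "__main__":
    # alpha = 0 gives distance-independent (all-to-all) couplings;
    # large alpha approaches nearest-neighbour couplings.
    for row in random_long_range_couplings(4, alpha=1.0):
        print(["%+.3f" % v for v in row])
```

Setting `alpha` to zero removes the distance dependence entirely, while increasing it suppresses all but the shortest-range terms, which mirrors the tuning range described in the abstract.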

preprint 2014 arXiv

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox

To support complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused heavily on the design of systems to support training complex models on large datasets. Unfortunately, the design of these systems largely ignores a critical component of the overall analytics process: the deployment and serving of models at scale. In this work, we present Velox, a new component of the Berkeley Data Analytics Stack. Velox is a data management system for facilitating the next steps in real-world, large-scale analytics pipelines: online model management, maintenance, and serving. Velox provides end-user applications and services with a low-latency, intuitive interface to models, transforming the raw statistical models currently trained using existing offline large-scale compute frameworks into full-blown, end-to-end data products capable of recommending products, targeting advertisements, and personalizing web content. To provide up-to-date results for these complex models, Velox also facilitates lightweight online model maintenance and selection (i.e., dynamic weighting). In this paper, we describe the challenges and architectural considerations required to achieve this functionality, including the abilities to span online and offline systems, to adaptively adjust model materialization strategies, and to exploit inherent statistical properties such as model error tolerance, all while operating at "Big Data" scale.