Source author record

Jialu Li

Jialu Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Computation and Language, Computer Vision, eess.AS, Sound, astro-ph.GA, astro-ph.SR, Machine Learning

Catalog footprint

What is connected

6 works

8 topics

4 close collaborators

preprint · 2026 · arXiv

SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion

Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.

preprint · 2022 · arXiv

CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on object-matching) from different environments. Our environment-agnostic visual representation can mitigate the environment bias induced by low-level visual information. Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation. Furthermore, we show that our learned language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks, and present detailed qualitative and quantitative generalization and grounding analysis. Our code is available at https://github.com/jialuli-luka/CLEAR

preprint · 2022 · arXiv

EnvEdit: Environment Editing for Vision-and-Language Navigation

In Vision-and-Language Navigation (VLN), an agent needs to navigate through the environment based on natural language instructions. Due to limited available data for agent training and finite diversity in navigation environments, it is challenging for the agent to generalize to new, unseen environments. To address this problem, we propose EnvEdit, a data augmentation method that creates new environments by editing existing environments, which are used to train a more generalizable agent. Our augmented environments can differ from the seen environments in three diverse aspects: style, object appearance, and object classes. Training on these edit-augmented environments prevents the agent from overfitting to existing environments and helps generalize better to new, unseen environments. Empirically, on both the Room-to-Room and the multi-lingual Room-Across-Room datasets, we show that our proposed EnvEdit method gets significant improvements in all metrics on both pre-trained and non-pre-trained VLN agents, and achieves the new state-of-the-art on the test leaderboard. We further ensemble the VLN agents augmented on different edited environments and show that these edit methods are complementary. Code and data are available at https://github.com/jialuli-luka/EnvEdit

preprint · 2022 · arXiv

High-Resolution M-band Spectroscopy of CO towards the Massive Young Stellar Binary W3 IRS5

We present in this paper the results of high spectral resolution ($R$=88,100) spectroscopy at 4.7 $\mu$m with iSHELL/IRTF of hot molecular gas close to the massive binary protostar W3 IRS5. The binary was spatially resolved and the spectra of the two sources (MIR1 and MIR2) were obtained simultaneously for the first time. Hundreds of $^{12}$CO $\nu$=0-1, $\nu$=1-2 lines, and $\nu$=0-1 transitions of the isotopes of $^{12}$CO were detected in absorption, and are blue-shifted compared to the cloud velocity $v_{LSR}=-$38 km/s. We decompose and identify kinematic components from the velocity profiles, and apply rotation diagram and curve of growth analyses to determine their physical properties. Temperatures and column densities of the identified components range from 30$-$700 K and 10$^{21}-$10$^{22}$ cm$^{-2}$, respectively. Our curve of growth analyses consider two scenarios. One assumes a foreground slab with a partial covering factor, which reproduces the absorption of most of the components well. The other assumes a circumstellar disk with an outward decreasing temperature in the vertical direction, and reproduces the absorption of all the hot components. We attribute the physical origins of the identified components to the foreground envelope ($<$100 K), post-J-shock regions (200$-$300 K), and clumpy structures on the circumstellar disks ($\sim$600 K). We propose that the components with a J-shock origin are akin to water maser spots in the same region, and complement the physical information of water masers along the direction of their movements.
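The rotation diagram analysis named in this abstract can be illustrated with a minimal sketch. It assumes the textbook optically thin, LTE form, in which ln(N_J/g_J) falls linearly with level energy E_J/k and the slope equals -1/T. The function name and the synthetic data points are hypothetical and are not taken from the paper.

```python
def fit_rotation_diagram(energies_K, ln_n_over_g):
    """Least-squares line through (E/k, ln(N/g)) points.

    Returns (temperature_K, intercept). Under the optically thin LTE
    assumption the slope is -1/T and the intercept is ln(N_total / Q(T)).
    """
    n = len(energies_K)
    mean_x = sum(energies_K) / n
    mean_y = sum(ln_n_over_g) / n
    sxx = sum((x - mean_x) ** 2 for x in energies_K)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(energies_K, ln_n_over_g))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -1.0 / slope, intercept


# Synthetic, noise-free points generated from an assumed 300 K component.
# The energies are illustrative values, not real CO level energies.
true_T = 300.0
true_intercept = 40.0
energies = [0.0, 50.0, 150.0, 400.0, 900.0]
values = [true_intercept - e / true_T for e in energies]

T_fit, b_fit = fit_rotation_diagram(energies, values)
print(f"fitted temperature: {T_fit:.1f} K, intercept: {b_fit:.2f}")
```

With real data the points scatter around the line, and saturated lines bend it, which is one reason a curve of growth analysis is applied alongside the rotation diagram.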

preprint · 2022 · arXiv

Visualizations of Complex Sequences of Family-Infant Vocalizations Using Bag-of-Audio-Words Approach Based on Wav2vec 2.0 Features

In the U.S., approximately 15-17% of children 2-8 years of age are estimated to have at least one diagnosed mental, behavioral or developmental disorder. However, such disorders often go undiagnosed, and the ability to evaluate and treat disorders in the first years of life is limited. To analyze infant developmental changes, previous studies have shown that advanced ML models excel at classifying infant and/or parent vocalizations collected using cell phones, video, or audio-only recording devices such as LENA. In this study, we pilot test the audio component of a new infant wearable multi-modal device that we have developed called LittleBeats (LB). The LB audio pipeline is advanced in that it provides reliable labels for both speaker diarization and vocalization classification tasks, compared with other platforms that only record audio and/or provide speaker diarization labels. We leverage wav2vec 2.0 to obtain superior and more nuanced results with the LB family audio stream. We use a bag-of-audio-words method with wav2vec 2.0 features to create high-level visualizations to understand family-infant vocalization interactions. We demonstrate that our high-quality visualizations capture major types of family vocalization interactions, in categories indicative of mental, behavioral, and developmental health, for both labeled and unlabeled LB audio.
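As a rough illustration of the bag-of-audio-words idea mentioned in this abstract, the sketch below assigns each frame-level feature vector to its nearest codeword and summarizes a segment as a normalized histogram of those assignments. The codebook here is a fixed, hypothetical toy. In practice it would typically be learned, for example with k-means over wav2vec 2.0 frame embeddings, and the abstract does not specify the authors' exact procedure.

```python
from collections import Counter


def nearest_codeword(frame, codebook):
    """Index of the codebook vector closest to `frame` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
    )


def bag_of_audio_words(frames, codebook):
    """Normalized histogram of codeword assignments for one audio segment."""
    if not frames:
        return [0.0] * len(codebook)
    counts = Counter(nearest_codeword(f, codebook) for f in frames)
    return [counts.get(k, 0) / len(frames) for k in range(len(codebook))]


# Toy 2-D "features"; real frame embeddings are much higher-dimensional.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]]

histogram = bag_of_audio_words(frames, codebook)
print(histogram)
```

Because every segment maps to a vector of the same length regardless of its duration, the histograms can be compared or projected to two dimensions for visualization.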

preprint · 2020 · arXiv

Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?

Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models, differing in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings. Synchronous models achieve a lower error rate in the joint phone+tone tier, but asynchronous training results in a lower tone error rate.
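To make the synchronous versus asynchronous distinction concrete, the sketch below applies the standard CTC collapsing rule (merge consecutive repeats, then drop blanks) to one joint phone+tone tier and to two separate tiers. The label inventory, the blank symbol, and the frame sequences are hypothetical, and the four model variants tested in the paper are not reproduced here.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to a label sequence.

    Consecutive repeats are merged first, then blanks are removed, so a
    repeated label survives only if a blank separates its occurrences.
    """
    return [label for label, _ in groupby(path) if label != blank]


# Synchronous view: a single tier whose labels carry phone and tone together.
joint_path = ["_", "ma1", "ma1", "_", "ma3", "_"]

# Asynchronous view: phones and tones are emitted on independent tiers,
# so their frame alignments need not coincide.
phone_path = ["m", "m", "a", "_", "m", "a"]
tone_path = ["_", "1", "1", "_", "_", "3"]

joint_labels = ctc_collapse(joint_path)
phone_labels = ctc_collapse(phone_path)
tone_labels = ctc_collapse(tone_path)
print(joint_labels, phone_labels, tone_labels)
```

In the two-tier case each tier is collapsed independently, so a tone label can be emitted at a different frame from the phone it accompanies.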
