Source author record

Zijian Chen

Zijian Chen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision, Computation and Language, eess.AS, eess.SP, Machine Learning, Other Quantitative Biology, q-fin.ST, Sound

Catalog footprint

What is connected

6 works

8 topics

4 close collaborators


preprint, 2026, arXiv

GeoR-Bench: Evaluating Geoscience Visual Reasoning

Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results for 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7% overall strict accuracy, while the best open-source models reach only 10.3%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.

preprint, 2026, arXiv

KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?

While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks, such as complex diagrammatic interpretation, it remains an open question whether they possess fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities (Concentration, Tracking, Discrimination, Memory, Spatial, and Closure) that children aged 6 to 7 already possess, and it comprises 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. Results indicate that while human children achieve a near-perfect average score of 95.32, the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a "Scaling Law Paradox": simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.

preprint, 2026, arXiv

Near-Field Sparse Bayesian Channel Estimation and Tracking for XL-IRS-Aided Wideband mmWave Systems

The rapid development of 6G systems demands advanced technologies to boost network capacity and spectral efficiency, particularly in the context of intelligent reflecting surfaces (IRS)-aided millimeter-wave (mmWave) communications. A key challenge here is obtaining accurate channel state information (CSI), especially with extremely large IRS (XL-IRS), due to near-field propagation, high-dimensional wideband cascaded channels, and the passive nature of the XL-IRS. In addition, most existing CSI acquisition methods fail to leverage the spatio-temporal sparsity inherent in the channel, resulting in suboptimal estimation performance. To address these challenges, we consider an XL-IRS-aided wideband multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) system and propose an efficient channel estimation and tracking (CET) algorithm. Specifically, a unified near-field cascaded channel representation model is presented first, and a hierarchical spatio-temporal sparse prior is then constructed to capture two-dimensional (2D) block sparsity in the polar domain, one-dimensional (1D) clustered sparsity in the angle-delay domain, and temporal correlations across different channel estimation frames. Based on these priors, a tensor-based sparse CET (TS-CET) algorithm is proposed that integrates tensor-based orthogonal matching pursuit (OMP) with particle-based variational Bayesian inference (VBI) and message passing. Simulation results demonstrate that the TS-CET framework significantly improves the estimation accuracy and reduces the pilot overhead as compared to existing benchmark methods.

preprint, 2025, arXiv

PriceSeer: Evaluating Large Language Models in Real-Time Stock Prediction

Stock prediction, a subject closely related to people's investment activities in fully dynamic and live environments, has been widely studied. Current large language models (LLMs) have shown remarkable potential in various domains, exhibiting expert-level performance through advanced reasoning and contextual understanding. In this paper, we introduce PriceSeer, a live, dynamic, and data-uncontaminated benchmark specifically designed for LLMs performing stock prediction tasks. Specifically, PriceSeer includes 110 U.S. stocks from 11 industrial sectors, with each containing 249 historical data points. Our benchmark implements both internal and external information expansion, where LLMs receive extra financial indicators, news, and fake news to perform stock price prediction. We evaluate six cutting-edge LLMs under different prediction horizons, demonstrating their potential in generating investment strategies after obtaining accurate price predictions for different sectors. Additionally, we provide analyses of LLMs' suboptimal performance in long-term predictions, including the vulnerability to fake news and specific industries. The code and evaluation data will be open-sourced at https://github.com/BobLiang2113/PriceSeer.

preprint, 2022, arXiv

Embedding of Functional Human Brain Networks on a Sphere

Human brain activity is often measured using the blood-oxygen-level dependent (BOLD) signals obtained through functional magnetic resonance imaging (fMRI). The strength of connectivity between brain regions is then measured as a Pearson correlation matrix. As the number of brain regions increases, the dimension of the matrix increases, and it becomes extremely cumbersome even to visualize and quantify such weighted complete networks. To remedy the problem, we propose to embed brain networks onto a sphere, which is a Riemannian manifold with constant positive curvature. The Matlab code for the spherical embedding is given in https://github.com/laplcebeltrami/sphericalMDS.
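One way to see why a sphere is a natural target for correlation networks: if each regional time series is centered and scaled to unit length, the Pearson correlation between two regions equals the inner product of the corresponding unit vectors, so every region becomes a point on a unit sphere. The Python sketch below checks this identity on synthetic data. It is an illustration only, not the authors' embedding (their Matlab code is at the link above); the function name `to_unit_sphere` and the random signals are assumptions made for the example.

```python
import numpy as np


def to_unit_sphere(signals: np.ndarray) -> np.ndarray:
    """Center each row (one region's time series) and scale it to unit length.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    centered = signals - signals.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms


# Synthetic stand-in for BOLD signals: 5 regions x 200 time points (assumed sizes).
rng = np.random.default_rng(0)
signals = rng.standard_normal((5, 200))

points = to_unit_sphere(signals)      # each row now has unit Euclidean norm
gram = points @ points.T              # pairwise inner products (cosines of angles)
corr = np.corrcoef(signals)           # Pearson correlation matrix of the rows

print(np.allclose(gram, corr))
```

The points produced here still live in a high-dimensional space (one coordinate per time point). Reducing them to a low-dimensional sphere for visualization is the part the paper addresses and is not attempted in this sketch.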

preprint, 2022, arXiv

WaBERT: A Low-resource End-to-end Model for Spoken Language Understanding and Speech-to-BERT Alignment

Historically, lower-level tasks such as automatic speech recognition (ASR) and speaker identification have been the main focus in the speech field. Interest has recently been growing in higher-level spoken language understanding (SLU) tasks, such as sentiment analysis (SA). However, improving performance on SLU tasks remains a big challenge. There are two main methods for SLU tasks: (1) the two-stage method, which uses a speech model to transcribe speech to text and then uses a language model to produce the results of downstream tasks; and (2) the one-stage method, which simply fine-tunes a pre-trained speech model for the downstream tasks. The first method loses emotional cues such as intonation and introduces recognition errors during the ASR process, while the second lacks the necessary language knowledge. In this paper, we propose Wave BERT (WaBERT), a novel end-to-end model that combines the speech model and the language model for SLU tasks. WaBERT is based on pre-trained speech and language models, so training from scratch is not needed. We also keep most parameters of WaBERT frozen during training. With WaBERT, audio-specific information and language knowledge are integrated in a short, low-resource training process, improving results on the dev set of the SLUE SA task by 1.15% in recall and 0.82% in F1 score. Additionally, we modify the serial Continuous Integrate-and-Fire (CIF) mechanism to achieve monotonic alignment between the speech and text modalities.
