Researcher profile

Fan Jiang

Fan Jiang contributes to research discovery and scholarly infrastructure.

Researcher - Affiliation not imported - Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging - Verification L1 - Unclaimed author
10 works
0 followers
10 topics
4 close collaborators

Published work

10 published items

preprint - 2026 - arXiv

Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing 3-14× more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.
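
The abstract does not describe the routing mechanism, so the following is only a generic sketch of top-k expert routing, a common way for sparse MoE layers to activate a small fraction of their parameters per token. The function name `top_k_route`, the example logits, and the choice of `k` are illustrative assumptions, not details taken from Marco-MoE.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns a dict mapping expert index -> mixing weight, where the
    weights of the selected experts are renormalized to sum to 1.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Numerically stable softmax over all experts.
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the k most probable experts; the rest stay inactive.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}


# Hypothetical gate scores for a token over four experts.
weights = top_k_route([0.0, 2.0, 1.0, -1.0], k=2)
print(weights)
```

With many experts and a small `k`, only the selected experts' feed-forward parameters run for a given token, which is how a model can keep its activated parameter count far below its total parameter count.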

preprint - 2026 - arXiv

SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms

Big data platforms are widely used in modern enterprises, and an in-production intelligent assistant is increasingly important to help users quickly find actionable guidance and reduce operational burden. While recent LLM+RAG assistants provide a natural interface, they face practical challenges in real deployments: limited scenario coverage across both general consultation and domain-specific troubleshooting workflows, inefficient knowledge access due to inadequate multi-hop retrieval and flat knowledge organization, and high maintenance cost because escalated tickets are unstructured and hard to convert into assistant improvements and reusable SOPs. In this paper, we present SiriusHelper, a deployed intelligent assistant for big data platforms. SiriusHelper serves as a unified online assistant that automatically identifies user intent and routes queries to the right handling path, including dedicated expert workflows for specialized scenarios (e.g., SQL execution diagnosis). To support complex troubleshooting, SiriusHelper combines a DeepSearch-driven mechanism with a priority-based hierarchical knowledge base to enable multi-hop retrieval without context overload, thus improving answer reliability and latency. To reduce expert overhead, SiriusHelper further introduces automated ticket understanding and SOP distillation: it diagnoses the assistant failure reason (e.g., missing knowledge or wrong routing) and extracts domain-specific SOPs to continuously enrich the knowledge base. Experiments and online deployment on Tencent Big Data platform show that SiriusHelper outperforms representative alternatives and reduces online ticket volume by 20.8%.

preprint - 2023 - arXiv

ESPRIT-Oriented Precoder Design for mmWave Channel Estimation

We consider the problem of ESPRIT-oriented precoder design for beamspace angle-of-departure (AoD) estimation in downlink mmWave multiple-input single-output communications. Standard precoders (i.e., directional/sum beams) yield poor performance in AoD estimation, while Cramér-Rao bound-optimized precoders undermine the so-called shift invariance property (SIP) of ESPRIT. To tackle this issue, the problem of designing ESPRIT-oriented precoders is formulated to jointly optimize over the precoding matrix and the SIP-restoring matrix of ESPRIT. We develop an alternating optimization approach that updates these two matrices under unit-modulus constraints for analog beamforming architectures. Simulation results demonstrate the validity of the proposed approach while providing valuable insights on the beampatterns of the ESPRIT-oriented precoders.

preprint - 2022 - arXiv

Doppler Exploitation in Bistatic mmWave Radio SLAM

Networks in 5G and beyond utilize millimeter wave (mmWave) radio signals, large bandwidths, and large antenna arrays, which bring opportunities in jointly localizing the user equipment and mapping the propagation environment, termed as simultaneous localization and mapping (SLAM). Existing approaches mainly rely on delays and angles, and ignore the Doppler, although it contains geometric information. In this paper, we study the benefits of exploiting Doppler in SLAM through deriving the posterior Cramér-Rao bounds (PCRBs) and formulating the extended Kalman-Poisson multi-Bernoulli sequential filtering solution with Doppler as one of the involved measurements. Both theoretical PCRB analysis and simulation results demonstrate the efficacy of utilizing Doppler.

preprint - 2022 - arXiv

Doppler-Enabled Single-Antenna Localization and Mapping Without Synchronization

Radio localization is a key enabler for joint communication and sensing in the fifth/sixth generation (5G/6G) communication systems. With the help of multipath components (MPCs), localization and mapping tasks can be done with a single base station (BS) and single unsynchronized user equipment (UE) if both of them are equipped with an antenna array. However, the antenna array at the UE side increases the hardware and computational cost, preventing localization functionality. In this work, we show that with Doppler estimation and MPCs, localization and mapping tasks can be performed even with a single-antenna mobile UE. Furthermore, we show that the localization and mapping performance will improve and then saturate at a certain level with an increased UE speed. Both theoretical Cramér-Rao bound analysis and simulation results show the potential of localization under mobility and the effectiveness of the proposed localization algorithm.

preprint - 2022 - arXiv

Optimal Spatial Signal Design for mmWave Positioning under Imperfect Synchronization

We consider the problem of spatial signal design for multipath-assisted mmWave positioning under limited prior knowledge on the user's location and clock bias. We propose an optimal robust design and, based on the low-dimensional precoder structure under perfect prior knowledge, a codebook-based heuristic design with optimized beam power allocation. Through numerical results, we characterize different position-error-bound (PEB) regimes with respect to clock bias uncertainty and show that the proposed low-complexity codebook-based designs outperform the conventional directional beam codebook and achieve near-optimal PEB performance for both analog and digital architectures.

preprint - 2022 - arXiv

SADN: Learned Light Field Image Compression with Spatial-Angular Decorrelation

Light field images have become one of the most promising media types for immersive video applications. In this paper, we propose a novel end-to-end spatial-angular-decorrelated network (SADN) for high-efficiency light field image compression. Unlike existing methods that exploit either spatial or angular consistency in the light field image, SADN decouples the angular and spatial information by dilation convolution and stride convolution in spatial-angular interaction, and performs feature fusion to compress spatial and angular information jointly. To train a stable and robust algorithm, we build a large-scale dataset consisting of 7549 light field images. The proposed method provides 2.137 times and 2.849 times higher compression efficiency relative to H.266/VVC and H.265/HEVC inter coding, respectively. It also outperforms the end-to-end image compression networks by an average of 79.6% bitrate saving with much higher subjective quality and light field consistency.

preprint - 2020 - arXiv

A Hidden Markov Model Based Unsupervised Algorithm for Sleep/Wake Identification Using Actigraphy

Actigraphy is widely used in sleep studies but lacks a universal unsupervised algorithm for sleep/wake identification. In this study, we propose a Hidden Markov Model (HMM) based unsupervised algorithm that can automatically and effectively infer sleep/wake states. It is an individualized data-driven approach that analyzes actigraphy from each individual separately to learn activity characteristics and then separate sleep and wake states. We used Actiwatch and polysomnography (PSG) data from 43 individuals in the Multi-Ethnic Study of Atherosclerosis to evaluate the performance of our method. Epoch-by-epoch comparisons were made between our HMM algorithm and that embedded in the Actiwatch software (AS). The percent agreement between HMM and PSG was 85.7%, and that between AS and PSG was 84.7%. Positive predictive values for sleep epochs were 85.6% and 84.6% for HMM and AS, respectively, and 95.5% and 85.6% for wake epochs. Both methods have similar performance and tend to overestimate sleep and underestimate wake compared to PSG. Our HMM approach is able to quantify the variability in activity counts, which allows us to differentiate relatively active and sedentary individuals: individuals with higher estimated variabilities tend to show more frequent sedentary behaviors. In conclusion, our unsupervised data-driven HMM algorithm achieves slightly better performance than the commonly used algorithm in the Actiwatch software. HMM can help expand the application of actigraphy in large-scale studies and in cases where intrusive PSG is hard to acquire or unavailable. In addition, the estimated HMM parameters can characterize individual activity patterns that can be utilized for further analysis.
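
The evaluation metrics named in this abstract (epoch-by-epoch percent agreement and positive predictive value) can be illustrated with a minimal sketch. The label encoding (1 for sleep, 0 for wake), the function names, and the toy sequences are assumptions for illustration only; they are not the study's data or code.

```python
SLEEP, WAKE = 1, 0  # assumed label encoding for this sketch


def percent_agreement(predicted, reference):
    """Fraction of epochs where the algorithm and the reference agree."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and equally long")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(predicted)


def positive_predictive_value(predicted, reference, label):
    """Among epochs the algorithm scored as `label`, the fraction that
    the reference also scored as `label`."""
    refs_for_label = [r for p, r in zip(predicted, reference) if p == label]
    if not refs_for_label:
        raise ValueError("label was never predicted")
    return sum(r == label for r in refs_for_label) / len(refs_for_label)


# Toy example: five epochs scored by an algorithm and by a reference (e.g. PSG).
algo = [SLEEP, SLEEP, WAKE, WAKE, SLEEP]
psg = [SLEEP, WAKE, WAKE, WAKE, SLEEP]

print(percent_agreement(algo, psg))                 # 4 of 5 epochs agree
print(positive_predictive_value(algo, psg, SLEEP))  # 2 of 3 predicted-sleep epochs
print(positive_predictive_value(algo, psg, WAKE))   # 2 of 2 predicted-wake epochs
```

In this toy case the algorithm scores one true wake epoch as sleep, which is the kind of error that lowers the sleep PPV and matches the abstract's remark that both methods tend to overestimate sleep.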

preprint - 2020 - arXiv

Asymptotic expansion for the transition densities of stochastic differential equations driven by the gamma processes

In this paper, inspired by the asymptotic expansion methodology developed by Li (2013b) and Li and Chen (2016), we propose a Taylor-type approximation for the transition densities of stochastic differential equations (SDEs) driven by gamma processes, a special type of Lévy process. After representing the transition density as a conditional expectation of the Dirac delta function acting on the solution of the related SDE, we present the key technical method for calculating the expectation of multiple stochastic integrals conditional on the gamma process. To numerically test the efficiency of our method, we examine the pure jump Ornstein-Uhlenbeck (OU) model and its extensions to two jump-diffusion models. For each model, the maximum relative error between our approximated transition density and the benchmark density obtained by the inverse Fourier transform of the characteristic function is sufficiently small, which shows the efficiency of our approximation method.

preprint - 2020 - arXiv

Estimation of genome size using k-mer frequencies from corrected long reads

Third-generation long-read sequencing technologies, such as PacBio and Nanopore, have great advantages over second-generation Illumina sequencing in de novo assembly studies. However, due to their inherently low base accuracy, third-generation sequencing data cannot be used for k-mer counting and estimating genomic profiles based on k-mer frequencies. Thus, in current genome projects, second-generation data is also necessary for accurately determining genome size and other genomic characteristics. We show that corrected third-generation data can be used to count k-mer frequencies and estimate genome size reliably, in place of second-generation data. Therefore, future genome projects can depend on only one sequencing technology to finish both assembly and k-mer analysis, which will greatly decrease sequencing cost in both time and money. Moreover, we present a fast lightweight tool, kmerfreq, and use it to perform all the k-mer counting tasks in this work. We have demonstrated that corrected third-generation sequencing data can be used to estimate genome size and developed a new open-source C/C++ k-mer counting tool, kmerfreq, which is freely available at https://github.com/fanagislab/kmerfreq.
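
As a rough illustration of estimating genome size from k-mer frequencies, the sketch below uses the common rule of thumb that genome size is approximately the total number of k-mers divided by the peak k-mer depth. It is not the kmerfreq implementation; the function names, the `min_depth` cutoff for discarding likely error k-mers, and the synthetic error-free reads are all assumptions made for the example.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as total k-mers / peak depth.

    k-mers seen fewer than `min_depth` times are treated as likely
    sequencing errors and ignored (an assumed, simplistic filter).
    """
    # Histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(kmer_counts.values())
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Synthetic demo: a random 1000-base "genome" covered 20 times without errors.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(1000))
reads = [genome] * 20
print(estimate_genome_size(count_kmers(reads, k=15)))
```

On this toy input the estimate equals the number of k-mer positions in the sequence (length minus k plus 1), so it sits slightly below the true length. Real data adds sequencing errors, repeats, and heterozygosity, which is why a depth histogram is normally inspected before trusting the peak.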