Researcher profile

GuoLiang Li

GuoLiang Li contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
14 works
0 followers
10 topics
4 close collaborators

Published work

14 published items

preprint, 2026, arXiv

Markovian Pre-Trained Transformer for Next-Item Recommendation

We introduce the Markovian Pre-trained Transformer (MPT) for next-item recommendation, a transferable model fully pre-trained on synthetic Markov chains, yet capable of achieving state-of-the-art performance by fine-tuning a lightweight adaptor. This counterintuitive success stems from the observation of the 'Markovian' nature: advanced sequential recommenders coincidentally rely on the latest interaction to make predictions, while the historical interactions serve mainly as auxiliary cues for inferring the user's general, non-sequential identity. This characteristic necessitates the capabilities of a universal recommendation model to effectively summarize the user sequence, with particular emphasis on the latest interaction. MPT inherently has the potential to be universal and transferable. On the one hand, when trained to predict the next state of Markov chains, it acquires the capabilities to estimate transition probabilities from the context (one adaptive manner for summarizing sequences) and attend to the last state to ensure accurate state transitions. On the other hand, unlike the heterogeneous interaction data, an unlimited amount of controllable Markov chains is available to boost the model capacity. We conduct extensive experiments on five public datasets from three distinct platforms to validate the superiority of Markovian pre-training over traditional recommendation pre-training and recent language pre-training paradigms.
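
As a rough illustration of the idea in the abstract above (estimate transition probabilities from the context, then rely on the last state), the sketch below uses a plain count-based estimator instead of a Transformer. The function names, the add-one smoothing, and the toy sequence are assumptions made for this example and are not taken from MPT.

```python
from collections import Counter


def estimate_transitions(context, n_states, smoothing=1.0):
    """Estimate a row-stochastic transition matrix from one observed sequence."""
    # Count how often each (current state, next state) pair occurs in the context.
    counts = Counter(zip(context, context[1:]))
    matrix = []
    for s in range(n_states):
        # Additive smoothing keeps unseen transitions at a nonzero probability.
        row = [counts[(s, t)] + smoothing for t in range(n_states)]
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix


def predict_next(context, n_states):
    """Predict the next state from the row that belongs to the last observed state."""
    matrix = estimate_transitions(context, n_states)
    last_row = matrix[context[-1]]
    return max(range(n_states), key=lambda t: last_row[t])


# Hypothetical toy chain that cycles 0 -> 1 -> 2 -> 0.
sequence = [0, 1, 2, 0, 1, 2, 0, 1]
print(predict_next(sequence, n_states=3))
```

The whole context is used only to estimate the matrix, while the prediction itself reads a single row, which mirrors the split between summarizing the sequence and attending to the latest interaction.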

preprint, 2026, arXiv

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin' Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.

preprint, 2026, arXiv

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.

preprint, 2025, arXiv

Introduction to the Chinese Space Station Survey Telescope (CSST)

The Chinese Space Station Survey Telescope (CSST) is an upcoming Stage-IV sky survey telescope, distinguished by its large field of view (FoV), high image quality, and multi-band observation capabilities. It can simultaneously conduct precise measurements of the Universe by performing multi-color photometric imaging and slitless spectroscopic surveys. The CSST is equipped with five scientific instruments, i.e. Multi-band Imaging and Slitless Spectroscopy Survey Camera (SC), Multi-Channel Imager (MCI), Integral Field Spectrograph (IFS), Cool Planet Imaging Coronagraph (CPI-C), and THz Spectrometer (TS). Using these instruments, CSST is expected to make significant contributions and discoveries across various astronomical fields, including cosmology, galaxies and active galactic nuclei (AGN), the Milky Way and nearby galaxies, stars, exoplanets, Solar System objects, astrometry, and transients and variable sources. This review aims to provide a comprehensive overview of the CSST instruments, observational capabilities, data products, and scientific potential.

preprint, 2022, arXiv

About One-point Statistics of the Ratio of Two Fourier-transformed Cosmic Fields and an Application

The Fourier transformation is an effective and efficient operation of Gaussianization at the one-point level. Using a set of N-body simulation data, we verified that the one-point distribution functions of the dark matter momentum divergence and density fields closely follow complex Gaussian distributions. The one-point distribution function of the quotient of two complex Gaussian variables is introduced and studied. Statistical theories are then applied to model one-point statistics of the growth of individual Fourier modes of the dark matter density field, which can be obtained from the ratio of two Fourier-transformed cosmic fields. Our simulation results proved that the models based on the Gaussian approximation are impressively accurate, and our analysis revealed many interesting aspects of the growth of the dark matter density fluctuation in Fourier space.

preprint, 2022, arXiv

Accelerating Edge Intelligence via Integrated Sensing and Communication

Realizing edge intelligence consists of sensing, communication, training, and inference stages. Conventionally, the sensing and communication stages are executed sequentially, which results in an excessive amount of dataset generation and uploading time. This paper proposes to accelerate edge intelligence via integrated sensing and communication (ISAC). As such, the sensing and communication stages are merged so as to make the best use of the wireless signals for the dual purpose of dataset generation and uploading. However, ISAC also introduces additional interference between sensing and communication functionalities. To address this challenge, this paper proposes a classification error minimization formulation to design the ISAC beamforming and time allocation. The globally optimal solution is derived via the rank-1 guaranteed semidefinite relaxation, and performance analysis is performed to quantify the ISAC gain over that of conventional edge intelligence. Simulation results are provided to verify the effectiveness of the proposed ISAC-assisted edge intelligence system. Interestingly, we find that ISAC is always beneficial when the duration of generating a sample is more than the duration of uploading a sample. Otherwise, the ISAC gain can vanish or even be negative. Nevertheless, we still derive a sufficient condition under which a positive ISAC gain is feasible.

preprint, 2022, arXiv

An Improved GPU-Based Ray-Shooting Code For Gravitational Microlensing

We present an improved inverse ray-shooting (IRS) code based on GPUs for generating microlensing magnification maps. In addition to introducing GPUs for acceleration, we focus our efforts on two aspects: (i) A standard circular lens plane is replaced by a rectangular one to reduce the number of unnecessary lenses that result from an extremely prolate rectangular image plane. (ii) An interpolation method is applied in our implementation, which achieves a significant acceleration when dealing with the large number of lenses and light rays required by high-resolution maps. With these changes, we have greatly reduced the running time while maintaining high accuracy: the speed has been increased by about 100 times compared with ordinary GPU-based IRS code and GPU-D code when handling a large number of lenses. For high-resolution maps of up to $10000^2$ pixels, which require almost $10^{11}$ light rays, the running time can also be reduced by two orders of magnitude.
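
For orientation, basic inverse ray-shooting can be sketched on the CPU in a few lines. This is only a minimal illustration, assuming point-mass lenses, lengths in Einstein-radius units, and no external shear or smooth matter; it includes neither the GPU acceleration, the rectangular lens plane, nor the interpolation described in the abstract above. The function and parameter names are hypothetical.

```python
def deflect(x, y, lenses):
    """Map an image-plane ray (x, y) to the source plane for point-mass lenses.

    lenses is a list of (lens_x, lens_y, mass) tuples. Returns None when the
    ray sits exactly on a lens, where the deflection is undefined.
    """
    sx, sy = x, y
    for lx, ly, mass in lenses:
        dx, dy = x - lx, y - ly
        r2 = dx * dx + dy * dy
        if r2 == 0.0:
            return None
        # Point-mass deflection: mass * (x - x_lens) / |x - x_lens|^2
        sx -= mass * dx / r2
        sy -= mass * dy / r2
    return sx, sy


def magnification_map(lenses, image_half, source_half, n_pixels, rays_per_side):
    """Shoot a regular grid of rays and count the rays landing in each source pixel."""
    counts = [[0] * n_pixels for _ in range(n_pixels)]
    step = 2.0 * image_half / rays_per_side
    pixel = 2.0 * source_half / n_pixels
    for i in range(rays_per_side):
        for j in range(rays_per_side):
            # Rays start at the centers of the image-plane grid cells.
            x = -image_half + (i + 0.5) * step
            y = -image_half + (j + 0.5) * step
            hit = deflect(x, y, lenses)
            if hit is None:
                continue
            col = int((hit[0] + source_half) // pixel)
            row = int((hit[1] + source_half) // pixel)
            if 0 <= row < n_pixels and 0 <= col < n_pixels:
                counts[row][col] += 1
    return counts
```

Dividing each pixel count by the count obtained with an empty lens list gives an estimate of the magnification. The cost grows with the number of rays times the number of lenses, which is why large lens populations and high-resolution maps dominate the running time.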

preprint, 2022, arXiv

Forecast of observing time delay of the strongly lensed quasars with Muztagh-Ata 1.93m telescope

As a completely independent method, the measurement of time delays of strongly lensed quasars (TDSL) is crucial to resolving the Hubble tension. Extensive monitoring is required but has so far been limited to a small sample of strongly lensed quasars. Together with several partner institutes, Beijing Normal University is constructing a 1.93m reflector telescope at the Muztagh-Ata site in west China, which has world-class observing conditions. The telescope will be equipped with both a three-channel imager/photometer, which covers the $3500-11000$ Angstrom wavelength band, and a low-medium resolution ($λ/δλ=500/2000/7500$) spectrograph. In this paper, we investigate the capability of the Muztagh-Ata 1.93m telescope in measuring time delays of strongly lensed quasars. We generate mock strongly lensed quasar systems and light curves with microlensing effects based on five known strongly lensed quasars, i.e., RX J1131-1231, HE 0435-1223, PG 1115+080, WFI 2033-4723 and SDSS 1206+4332. In particular, RX J1131-1231 is generated with lens modeling in this work. Owing to insufficient information, we simulate the other 4 systems with the public data without lens modeling. According to the simulations, for RX J1131-like systems (wide variation in time delay between images) the TDSL measurement can be achieved with a precision of about $Δt=0.5$ day with a campaign length of 4 seasons and a 1-day cadence. This accuracy is comparable to that of the upcoming TDCOSMO project, and it would improve further with a longer campaign and a higher cadence. As a result, the capability of the Muztagh-Ata 1.93m telescope allows it to join the network of TDSL observatories. It will enrich the database of strongly lensed quasar observations and make more precise measurements of time delays, especially considering the unique coordinates of the site.

preprint, 2022, arXiv

Post-Newtonian parameters of ghost-free parity-violating gravities

We investigate the slow-motion and weak-field approximation of the general ghost-free parity-violating (PV) theory of gravity in the parametrized post-Newtonian (PPN) framework and derive the perturbative field equations, which are modified by the PV terms of this theory. The complete PPN parameters are obtained by solving the perturbative field equations. We find that all the PPN parameters are exactly the same as those in general relativity, except for an extra parameter $κ$, which is caused by the new curl-type term in the gravitomagnetic sector of the metric in this theory. We calculate the precession effects of gyroscopes in this theory and constrain the model parameters by the observations of the Gravity Probe B experiment.

preprint, 2022, arXiv

Tolerance For the Pixelation Effect in Shear Measurement

Images taken by space telescopes typically have a superb spatial resolution, but a relatively poor sampling rate due to the finite CCD pixel size. Beyond the Nyquist limit, it becomes uncertain how much the pixelation effect may affect the accuracy of galaxy shape measurement. It is timely to study this issue given that a number of space-based large-scale weak lensing surveys are planned. Using the Fourier_Quad method, we quantify the shear recovery error as a function of the sampling factor Q, i.e., the ratio between the FWHM of the point-spread-function (PSF) and the pixel size of the CCD, for different PSFs and galaxies of different sizes and noise levels. We show that sub-percent-level accuracy in shear recovery is achievable with single-exposure images for $Q\lesssim 2$. The conclusion holds for galaxies much smaller than the PSF, and those with a significant level of noise.

preprint, 2020, arXiv

Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms

Asking questions is one of the most crucial pedagogical techniques used by teachers in class. It not only offers open-ended discussions between teachers and students to exchange ideas but also provokes deeper student thought and critical analysis. Providing teachers with such pedagogical feedback will remarkably help teachers improve their overall teaching quality over time in classrooms. Therefore, in this work, we build an end-to-end neural framework that automatically detects questions from teachers' audio recordings. Compared with traditional methods, our approach not only avoids cumbersome feature engineering, but also adapts to the task of multi-class question detection in real education scenarios. By incorporating multi-task learning techniques, we are able to strengthen the understanding of semantic relations among different types of questions. We conducted extensive experiments on the question detection tasks in a real-world online classroom dataset and the results demonstrate the superiority of our model in terms of various evaluation metrics.

preprint, 2020, arXiv

Relational Data Synthesis using Generative Adversarial Networks: A Design Space Exploration

The proliferation of big data has brought an urgent demand for privacy-preserving data publishing. Traditional solutions to this demand have limitations in effectively balancing the tradeoff between privacy and utility of the released data. Thus, the database community and machine learning community have recently studied a new problem of relational data synthesis using generative adversarial networks (GAN) and proposed various algorithms. However, these algorithms are not compared under the same framework and thus it is hard for practitioners to understand GAN's benefits and limitations. To bridge the gaps, we conduct the most comprehensive experimental study to date that investigates applying GAN to relational data synthesis. We introduce a unified GAN-based framework and define a space of design solutions for each component in the framework, including neural network architectures and training strategies. We conduct extensive experiments to explore the design space and compare with traditional data synthesis approaches. Through these experiments, we find that GAN is very promising for relational data synthesis, and provide guidance for selecting appropriate design solutions. We also point out limitations of GAN and identify future research directions.

preprint, 2020, arXiv

Temporal Network Representation Learning via Historical Neighborhoods Aggregation

Network embedding is an effective method to learn low-dimensional representations of nodes, which can be applied to various real-life applications such as visualization, node classification, and link prediction. Although significant progress has been made on this problem in recent years, several important challenges remain, such as how to properly capture temporal information in evolving networks. In practice, most networks are continually evolving. Some networks only add new edges or nodes such as authorship networks, while others support removal of nodes or edges such as internet data routing. If patterns exist in the changes of the network structure, we can better understand the relationships between nodes and the evolution of the network, which can be further leveraged to learn node representations with more meaningful information. In this paper, we propose the Embedding via Historical Neighborhoods Aggregation (EHNA) algorithm. More specifically, we first propose a temporal random walk that can identify relevant nodes in historical neighborhoods which have impact on edge formations. Then we apply a deep learning model which uses a custom attention mechanism to induce node embeddings that directly capture temporal information in the underlying feature representation. We perform extensive experiments on a range of real-world datasets, and the results demonstrate the effectiveness of our new approach in the network reconstruction task and the link prediction task.
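
A generic time-respecting random walk gives a feel for how historical neighborhoods can be collected. The sketch below is an assumption-laden illustration and not the EHNA walk itself: it treats the graph as undirected, picks the next edge uniformly at random, and only enforces that each step uses an edge no newer than the previous one. All names are hypothetical.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Index (u, v, timestamp) edges of an undirected temporal graph by endpoint."""
    adjacency = defaultdict(list)
    for u, v, t in edges:
        adjacency[u].append((v, t))
        adjacency[v].append((u, t))
    return adjacency


def temporal_walk(adjacency, start, start_time, length, rng):
    """Walk backward in time from start, taking at most `length` steps."""
    walk = [(start, start_time)]
    node, now = start, start_time
    for _ in range(length):
        # Only edges that already existed at the current time are eligible.
        candidates = [(nbr, t) for nbr, t in adjacency[node] if t <= now]
        if not candidates:
            break
        node, now = rng.choice(candidates)
        walk.append((node, now))
    return walk


# Hypothetical toy graph with integer timestamps.
edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("a", "c", 4)]
adjacency = build_adjacency(edges)
print(temporal_walk(adjacency, "d", 3, length=5, rng=random.Random(0)))
```

The nodes visited by such walks could then be fed to an aggregation model; a biased choice among candidates, for example one favoring recent edges, would be a natural refinement.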

preprint, 2015, arXiv

NXgraph: An Efficient Graph Processing System on a Single Machine

Recent studies show that graph processing systems on a single machine can achieve competitive performance compared with cluster-based graph processing systems. In this paper, we present NXgraph, an efficient graph processing system on a single machine. With the abstraction of vertex intervals and edge sub-shards, we propose the Destination-Sorted Sub-Shard (DSSS) structure to store a graph. By dividing vertices and edges into intervals and sub-shards, NXgraph ensures graph data access locality and enables fine-grained scheduling. By sorting edges within each sub-shard according to their destination vertices, NXgraph reduces write conflicts among different threads and achieves a high degree of parallelism. Then, three updating strategies, i.e., Single-Phase Update (SPU), Double-Phase Update (DPU), and Mixed-Phase Update (MPU), are proposed in this paper. NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources to fully utilize the memory space and reduce the amount of data transfer. All three strategies exploit a streamlined disk access pattern. Extensive experiments on three real-world graphs and five synthetic graphs show that NXgraph can outperform GraphChi, TurboGraph, VENUS, and GridGraph in various situations. Moreover, NXgraph, running on a single commodity PC, can finish an iteration of PageRank on the Twitter graph with 1.5 billion edges in 2.05 seconds, while PowerGraph, a distributed graph processing system, needs 3.6 seconds to finish the same task.
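
The interval and sub-shard layout described in the abstract above can be pictured with a small in-memory sketch. It assumes equal-sized vertex intervals and a dictionary keyed by (source interval, destination interval); the actual on-disk DSSS format of NXgraph is not given here, and the function name is hypothetical.

```python
def build_sub_shards(edges, n_vertices, n_intervals):
    """Group edges by (source interval, destination interval), sorted by destination."""
    interval_size = -(-n_vertices // n_intervals)  # ceiling division
    shards = {
        (i, j): [] for i in range(n_intervals) for j in range(n_intervals)
    }
    for src, dst in edges:
        key = (src // interval_size, dst // interval_size)
        shards[key].append((src, dst))
    for shard in shards.values():
        # Destination-sorted order places all updates to one vertex next to each other.
        shard.sort(key=lambda edge: (edge[1], edge[0]))
    return shards


# Hypothetical 4-vertex graph split into 2 intervals: {0, 1} and {2, 3}.
edges = [(0, 3), (1, 2), (3, 0), (2, 3), (0, 1), (1, 0)]
shards = build_sub_shards(edges, n_vertices=4, n_intervals=2)
for key in sorted(shards):
    print(key, shards[key])
```

Because every sub-shard touches only one destination interval and its edges are grouped by destination, worker threads that are given different destination ranges write to disjoint vertices, which is one way to read the reduced write conflicts mentioned above.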