Source author record

GuoLiang Li

GuoLiang Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Databases; astro-ph.CO; Artificial Intelligence; astro-ph.IM; Machine Learning; astro-ph.GA; Computation and Language; astro-ph; cond-mat.mes-hall; Distributed, Parallel, and Cluster Computing; eess.SP; gr-qc; Information Retrieval

Catalog footprint

What is connected

26 works

13 topics

4 close collaborators

preprint, 2026, arXiv

Markovian Pre-Trained Transformer for Next-Item Recommendation

We introduce the Markovian Pre-trained Transformer (MPT) for next-item recommendation, a transferable model fully pre-trained on synthetic Markov chains, yet capable of achieving state-of-the-art performance by fine-tuning a lightweight adaptor. This counterintuitive success stems from the observation of the 'Markovian' nature: advanced sequential recommenders coincidentally rely on the latest interaction to make predictions, while the historical interactions serve mainly as auxiliary cues for inferring the user's general, non-sequential identity. This characteristic requires a universal recommendation model to summarize the user sequence effectively, with particular emphasis on the latest interaction. MPT inherently has the potential to be universal and transferable. On the one hand, when trained to predict the next state of Markov chains, it acquires the capabilities to estimate transition probabilities from the context (an adaptive way of summarizing sequences) and to attend to the last state to ensure accurate state transitions. On the other hand, unlike heterogeneous interaction data, an unlimited amount of controllable Markov chains is available to boost the model capacity. We conduct extensive experiments on five public datasets from three distinct platforms to validate the superiority of Markovian pre-training over traditional recommendation pre-training and recent language pre-training paradigms.
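The prediction task described in this abstract (estimate transition probabilities from the context, then condition on the last state) can be illustrated with a simple count-based estimator. This is only a sketch of the Markov next-state task itself, not of MPT or its training procedure; the function names and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequence):
    """Count-based estimate of P(next | current) from one observed sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


def predict_next(sequence):
    """Predict the most likely next state using only the last state of the context."""
    probs = estimate_transitions(sequence)
    last = sequence[-1]
    if last not in probs:  # the last state was never observed with a successor
        return None
    return max(probs[last], key=probs[last].get)


# Toy context (hypothetical): "a" is followed by "b" three times and by "c" once.
context = ["a", "b", "a", "b", "a", "c", "a", "b", "a"]
prediction = predict_next(context)
```

In this toy context the estimator conditions on the final "a" and uses the earlier transitions only to fill in the probability table, which mirrors the division of labour the abstract describes between the latest interaction and the history.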

preprint, 2026, arXiv

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin' Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.

preprint, 2026, arXiv

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.

preprint, 2025, arXiv

Introduction to the Chinese Space Station Survey Telescope (CSST)

The Chinese Space Station Survey Telescope (CSST) is an upcoming Stage-IV sky survey telescope, distinguished by its large field of view (FoV), high image quality, and multi-band observation capabilities. It can simultaneously conduct precise measurements of the Universe by performing multi-color photometric imaging and slitless spectroscopic surveys. The CSST is equipped with five scientific instruments, i.e. Multi-band Imaging and Slitless Spectroscopy Survey Camera (SC), Multi-Channel Imager (MCI), Integral Field Spectrograph (IFS), Cool Planet Imaging Coronagraph (CPI-C), and THz Spectrometer (TS). Using these instruments, CSST is expected to make significant contributions and discoveries across various astronomical fields, including cosmology, galaxies and active galactic nuclei (AGN), the Milky Way and nearby galaxies, stars, exoplanets, Solar System objects, astrometry, and transients and variable sources. This review aims to provide a comprehensive overview of the CSST instruments, observational capabilities, data products, and scientific potential.

preprint, 2022, arXiv

About One-point Statistics of the Ratio of Two Fourier-transformed Cosmic Fields and an Application

The Fourier transformation is an effective and efficient operation of Gaussianization at the one-point level. Using a set of N-body simulation data, we verify that the one-point distribution functions of the dark matter momentum divergence and density fields closely follow complex Gaussian distributions. The one-point distribution function of the quotient of two complex Gaussian variables is introduced and studied. Statistical theories are then applied to model one-point statistics of the growth of individual Fourier modes of the dark matter density field, which can be obtained from the ratio of two Fourier-transformed cosmic fields. Our simulation results show that the models based on the Gaussian approximation are impressively accurate, and our analysis reveals many interesting aspects of the growth of dark matter density fluctuations in Fourier space.

preprint, 2022, arXiv

Accelerating Edge Intelligence via Integrated Sensing and Communication

Realizing edge intelligence consists of sensing, communication, training, and inference stages. Conventionally, the sensing and communication stages are executed sequentially, which results in an excessive amount of dataset generation and uploading time. This paper proposes to accelerate edge intelligence via integrated sensing and communication (ISAC). As such, the sensing and communication stages are merged so as to make the best use of the wireless signals for the dual purpose of dataset generation and uploading. However, ISAC also introduces additional interference between sensing and communication functionalities. To address this challenge, this paper proposes a classification error minimization formulation to design the ISAC beamforming and time allocation. The globally optimal solution is derived via the rank-1 guaranteed semidefinite relaxation, and a performance analysis quantifies the ISAC gain over conventional edge intelligence. Simulation results are provided to verify the effectiveness of the proposed ISAC-assisted edge intelligence system. Interestingly, we find that ISAC is always beneficial when the duration of generating a sample is longer than the duration of uploading a sample. Otherwise, the ISAC gain can vanish or even be negative. Nevertheless, we still derive a sufficient condition under which a positive ISAC gain is feasible.

preprint, 2022, arXiv

An Improved GPU-Based Ray-Shooting Code For Gravitational Microlensing

We present an improved inverse ray-shooting (IRS) code based on GPUs for generating microlensing magnification maps. In addition to introducing GPUs for acceleration, we focus on two aspects: (i) the standard circular lens plane is replaced by a rectangular one, which reduces the number of unnecessary lenses that result from an extremely prolate rectangular image plane; (ii) an interpolation method is applied, which achieves a significant acceleration when dealing with the large numbers of lenses and light rays required by high-resolution maps. With these changes, we greatly reduce the running time while maintaining high accuracy: the speed is increased by about 100 times compared with an ordinary GPU-based IRS code and the GPU-D code when handling a large number of lenses. For high-resolution maps of up to $10000^2$ pixels, which require almost $10^{11}$ light rays, the running time is also reduced by two orders of magnitude.
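For readers unfamiliar with inverse ray shooting, the basic brute-force technique can be sketched in a few lines. The code below is a plain CPU illustration for point-mass lenses in Einstein-radius units, with no external shear or smooth matter; it does not implement the rectangular lens plane, the interpolation step, or the GPU acceleration described in the abstract. All function and parameter names are hypothetical.

```python
def shoot_rays(lenses, image_half_width, rays_per_side, source_half_width, pixels_per_side):
    """Shoot a regular grid of rays from the image plane and bin them in the source plane.

    lenses: list of (x, y, mass) tuples in Einstein-radius units (assumed layout).
    Returns the grid of ray counts plus the ray spacing and source pixel size.
    """
    counts = [[0] * pixels_per_side for _ in range(pixels_per_side)]
    ray_step = 2.0 * image_half_width / rays_per_side
    pixel_size = 2.0 * source_half_width / pixels_per_side
    for i in range(rays_per_side):
        x1 = -image_half_width + (i + 0.5) * ray_step
        for j in range(rays_per_side):
            x2 = -image_half_width + (j + 0.5) * ray_step
            # Lens equation for point masses: y = x - sum_k m_k (x - x_k) / |x - x_k|^2
            y1, y2 = x1, x2
            for lx, ly, mass in lenses:
                dx, dy = x1 - lx, x2 - ly
                r2 = dx * dx + dy * dy
                if r2 == 0.0:
                    continue  # ray exactly on a lens: deflection is undefined, skip it
                y1 -= mass * dx / r2
                y2 -= mass * dy / r2
            col = int((y1 + source_half_width) // pixel_size)
            row = int((y2 + source_half_width) // pixel_size)
            if 0 <= row < pixels_per_side and 0 <= col < pixels_per_side:
                counts[row][col] += 1
    return counts, ray_step, pixel_size


def magnification_map(counts, ray_step, pixel_size):
    """Magnification = rays collected / rays expected per source pixel without lensing."""
    expected = (pixel_size / ray_step) ** 2
    return [[c / expected for c in row] for row in counts]


# Without lenses every ray maps onto itself, so the map is uniformly 1.
counts, ray_step, pixel_size = shoot_rays([], 1.0, 8, 1.0, 4)
unlensed = magnification_map(counts, ray_step, pixel_size)

# A single unit-mass lens at the origin concentrates rays in the source plane.
lensed_counts, lensed_step, lensed_pixel = shoot_rays([(0.0, 0.0, 1.0)], 2.0, 200, 1.0, 4)
```

The cost of this naive loop grows with the number of rays times the number of lenses, which is the scaling that motivates acceleration strategies such as those in the abstract.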

preprint, 2022, arXiv

Forecast of observing time delay of the strongly lensed quasars with Muztagh-Ata 1.93m telescope

As a completely independent method, the measurement of time delays of strongly lensed quasars (TDSL) is crucial for resolving the Hubble tension. Extensive monitoring is required but has so far been limited to a small sample of strongly lensed quasars. Together with several partner institutes, Beijing Normal University is constructing a 1.93m reflector telescope at the Muztagh-Ata site in west China, which has world-class observing conditions. The telescope will be equipped with both a three-channel imager/photometer, which covers the $3500$ to $11000$ Angstrom wavelength band, and a low-medium resolution ($λ/δλ=500/2000/7500$) spectrograph. In this paper, we investigate the capability of the Muztagh-Ata 1.93m telescope in measuring time delays of strongly lensed quasars. We generate mock strongly lensed quasar systems and light curves with microlensing effects based on five known strongly lensed quasars, i.e., RX J1131-1231, HE 0435-1223, PG 1115+080, WFI 2033-4723 and SDSS 1206+4332. In particular, RX J1131-1231 is generated with lens modeling in this work. Owing to a lack of sufficient information, we simulate the other 4 systems from the public data without lens modeling. According to the simulations, for RX J1131-like systems (wide variation in time delay between images) the TDSL measurement can be achieved with a precision of about $Δt=0.5$ day with a campaign length of 4 seasons and a cadence of 1 day. This accuracy is comparable to that of the upcoming TDCOSMO project, and it would improve further with a longer campaign and a higher cadence. As a result, the capability of the Muztagh-Ata 1.93m telescope allows it to join the network of TDSL observatories. It will enrich the database of strongly lensed quasar observations and enable more precise measurements of time delays, especially considering the unique coordinates of the site.

preprint, 2022, arXiv

Post-Newtonian parameters of ghost-free parity-violating gravities

We investigate the slow-motion and weak-field approximation of the general ghost-free parity-violating (PV) theory of gravity in the parametrized post-Newtonian (PPN) framework and derive the perturbative field equations, which are modified by the PV terms of this theory. The complete PPN parameters are obtained by solving the perturbative field equations. We find that all the PPN parameters are exactly the same as those in general relativity, except for an extra parameter $κ$, which is caused by the new curl-type term in the gravitomagnetic sector of the metric in this theory. We calculate the precession effects of gyroscopes in this theory and constrain the model parameters by the observations of the Gravity Probe B experiment.

preprint, 2022, arXiv

Tolerance For the Pixelation Effect in Shear Measurement

Images taken by space telescopes typically have a superb spatial resolution, but a relatively poor sampling rate due to the finite CCD pixel size. Beyond the Nyquist limit, it becomes uncertain how much the pixelation effect may affect the accuracy of galaxy shape measurement. It is timely to study this issue given that a number of space-based large-scale weak lensing surveys are planned. Using the Fourier_Quad method, we quantify the shear recovery error as a function of the sampling factor Q, i.e., the ratio between the FWHM of the point-spread-function (PSF) and the pixel size of the CCD, for different PSFs and galaxies of different sizes and noise levels. We show that sub-percent-level accuracy in shear recovery is achievable with single-exposure images for $Q\lesssim 2$. The conclusion holds for galaxies much smaller than the PSF, and those with a significant level of noise.

preprint, 2020, arXiv

Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms

Asking questions is one of the most crucial pedagogical techniques used by teachers in class. It not only offers open-ended discussions between teachers and students to exchange ideas but also provokes deeper student thought and critical analysis. Providing teachers with such pedagogical feedback will remarkably help teachers improve their overall teaching quality over time in classrooms. Therefore, in this work, we build an end-to-end neural framework that automatically detects questions from teachers' audio recordings. Compared with traditional methods, our approach not only avoids cumbersome feature engineering, but also adapts to the task of multi-class question detection in real education scenarios. By incorporating multi-task learning techniques, we are able to strengthen the understanding of semantic relations among different types of questions. We conducted extensive experiments on the question detection tasks in a real-world online classroom dataset and the results demonstrate the superiority of our model in terms of various evaluation metrics.

preprint, 2020, arXiv

Relational Data Synthesis using Generative Adversarial Networks: A Design Space Exploration

The proliferation of big data has brought an urgent demand for privacy-preserving data publishing. Traditional solutions to this demand have limitations in effectively balancing the tradeoff between privacy and utility of the released data. Thus, the database community and the machine learning community have recently studied a new problem of relational data synthesis using generative adversarial networks (GAN) and proposed various algorithms. However, these algorithms have not been compared under the same framework, and thus it is hard for practitioners to understand GAN's benefits and limitations. To bridge the gaps, we conduct the most comprehensive experimental study so far that investigates applying GAN to relational data synthesis. We introduce a unified GAN-based framework and define a space of design solutions for each component in the framework, including neural network architectures and training strategies. We conduct extensive experiments to explore the design space and compare with traditional data synthesis approaches. Through these experiments, we find that GAN is very promising for relational data synthesis, and we provide guidance for selecting appropriate design solutions. We also point out limitations of GAN and identify future research directions.

preprint, 2020, arXiv

Temporal Network Representation Learning via Historical Neighborhoods Aggregation

Network embedding is an effective method to learn low-dimensional representations of nodes, which can be applied to various real-life applications such as visualization, node classification, and link prediction. Although significant progress has been made on this problem in recent years, several important challenges remain, such as how to properly capture temporal information in evolving networks. In practice, most networks are continually evolving. Some networks only add new edges or nodes such as authorship networks, while others support removal of nodes or edges such as internet data routing. If patterns exist in the changes of the network structure, we can better understand the relationships between nodes and the evolution of the network, which can be further leveraged to learn node representations with more meaningful information. In this paper, we propose the Embedding via Historical Neighborhoods Aggregation (EHNA) algorithm. More specifically, we first propose a temporal random walk that can identify relevant nodes in historical neighborhoods which have impact on edge formations. Then we apply a deep learning model which uses a custom attention mechanism to induce node embeddings that directly capture temporal information in the underlying feature representation. We perform extensive experiments on a range of real-world datasets, and the results demonstrate the effectiveness of our new approach in the network reconstruction task and the link prediction task.

preprint, 2016, arXiv

The Point Spread Function Reconstruction by Using Moffatlets - I

Shear measurement is a crucial task in current and future weak lensing surveys, and the reconstruction of the point spread function (PSF) is one of its essential steps. In this work, we compare three methods, Gaussianlets, Moffatlets, and EMPCA, and quantify their efficiency in PSF reconstruction using four sets of simulated LSST star images. Gaussianlets and Moffatlets are two sets of basis functions whose profiles are based on Gaussian and Moffat functions, respectively. Expectation Maximization (EM) PCA is a statistical method that iteratively finds the principal components of an ensemble of star images. Our tests show that: 1) Moffatlets always perform better than Gaussianlets; 2) EMPCA is more compact and flexible, but the noise in the principal components (PCs) contaminates the size and ellipticity of the PSF, whereas Moffatlets preserve them very well.

preprint, 2015, arXiv

Characterizing localized surface plasmons using electron energy-loss spectroscopy

Electron energy-loss spectroscopy (EELS) offers a window to view nanoscale properties and processes. When performed in a scanning transmission electron microscope, EELS can simultaneously render images of nanoscale objects with sub-nanometer spatial resolution and correlate them with spectroscopic information at a spectral resolution of $\sim 10$ to $100$ meV. Consequently, EELS is a near-perfect tool for understanding the optical and electronic properties of individual plasmonic metal nanoparticles and few-particle assemblies, which are significant in a wide range of fields. This review presents an overview of basic plasmonics and EELS theory and highlights several recent noteworthy experiments involving the electron-beam interrogation of plasmonic metal nanoparticle systems.

preprint, 2015, arXiv

NXgraph: An Efficient Graph Processing System on a Single Machine

Recent studies show that graph processing systems on a single machine can achieve competitive performance compared with cluster-based graph processing systems. In this paper, we present NXgraph, an efficient graph processing system on a single machine. With the abstraction of vertex intervals and edge sub-shards, we propose the Destination-Sorted Sub-Shard (DSSS) structure to store a graph. By dividing vertices and edges into intervals and sub-shards, NXgraph ensures graph data access locality and enables fine-grained scheduling. By sorting edges within each sub-shard according to their destination vertices, NXgraph reduces write conflicts among different threads and achieves a high degree of parallelism. Three updating strategies are then proposed: Single-Phase Update (SPU), Double-Phase Update (DPU), and Mixed-Phase Update (MPU). NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources, so as to fully utilize the memory space and reduce the amount of data transfer. All three strategies exploit a streamlined disk access pattern. Extensive experiments on three real-world graphs and five synthetic graphs show that NXgraph can outperform GraphChi, TurboGraph, VENUS, and GridGraph in various situations. Moreover, NXgraph, running on a single commodity PC, can finish an iteration of PageRank on the Twitter graph with 1.5 billion edges in 2.05 seconds, while PowerGraph, a distributed graph processing system, needs 3.6 seconds to finish the same task.
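The storage layout described in this abstract (vertex intervals, edge sub-shards, edges sorted by destination within each sub-shard) can be sketched in memory as follows. This is an illustrative toy under stated assumptions, not NXgraph's on-disk format: equal-width intervals, the dictionary keyed by (source interval, destination interval), and the plain PageRank step are all assumptions, and none of the SPU, DPU, or MPU strategies is modelled.

```python
def build_sub_shards(edges, num_vertices, num_intervals):
    """Group edges by (source interval, destination interval) and sort each
    group by destination vertex. Equal-width intervals are an assumption."""
    interval_size = -(-num_vertices // num_intervals)  # ceiling division
    shards = {}
    for src, dst in edges:
        key = (src // interval_size, dst // interval_size)
        shards.setdefault(key, []).append((src, dst))
    for shard in shards.values():
        shard.sort(key=lambda edge: (edge[1], edge[0]))  # destination first
    return shards, interval_size


def pagerank_iteration(shards, ranks, out_degree, damping=0.85):
    """One plain PageRank step that visits sub-shards grouped by destination
    interval, so writes to one interval happen together. Vertices with
    out-degree 0 are ignored in this sketch."""
    n = len(ranks)
    new_ranks = [(1.0 - damping) / n] * n
    for key in sorted(shards, key=lambda k: (k[1], k[0])):
        for src, dst in shards[key]:
            new_ranks[dst] += damping * ranks[src] / out_degree[src]
    return new_ranks


# Toy graph (hypothetical): a directed 4-cycle, whose uniform rank is a fixed point.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
shards, interval_size = build_sub_shards(edges, num_vertices=4, num_intervals=2)
ranks = pagerank_iteration(shards, [0.25] * 4, out_degree=[1, 1, 1, 1])
```

Because edges in a sub-shard are ordered by destination, consecutive updates touch the same or neighbouring entries of `new_ranks`, which is the locality property the abstract attributes to the DSSS structure.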

preprint, 2015, arXiv

RFP: A Remote Fetching Paradigm for RDMA-Accelerated Systems

Remote Direct Memory Access (RDMA) is an efficient way to improve the performance of traditional client-server systems. Currently, there are two main design paradigms for RDMA-accelerated systems. The first allows the clients to directly operate on the server's memory and totally bypasses the CPUs at the server side. The second follows the traditional server-reply paradigm, which asks the server to write results back to the clients. However, the first method has to expose the server's memory and needs a tremendous re-design of upper-layer software, which is complex, unsafe, error-prone, and inefficient. The second cannot achieve high input/output operations per second (IOPS), because it employs out-bound RDMA-write at the server side, which is not efficient. We find that the performance of out-bound RDMA-write and in-bound RDMA-read is asymmetric and that the latter is 5 times faster than the former. Based on this observation, we propose a novel design paradigm named Remote Fetching Paradigm (RFP). In RFP, the server is still responsible for processing requests from the clients. However, counter-intuitively, instead of sending results back to the clients through out-bound RDMA-write, the server only writes the results in local memory buffers, and the clients use in-bound RDMA-read to remotely fetch these results. Since in-bound RDMA-read achieves much higher IOPS than out-bound RDMA-write, our model is able to bring higher performance than the traditional models. To demonstrate the effectiveness of RFP, we design and implement an RDMA-accelerated in-memory key-value store following the RFP model. To further improve the IOPS, we propose an optimization mechanism that combines status checking and result fetching. Experimental results show that RFP can improve the IOPS by 160% to 310% against state-of-the-art models for in-memory key-value stores.

preprint, 2014, arXiv

Leveraging Transitive Relations for Crowdsourced Joins

The development of crowdsourced query processing systems has recently attracted significant attention in the database community. A variety of crowdsourced queries have been investigated. In this paper, we focus on the crowdsourced join query, which aims to utilize humans to find all pairs of matching objects from two collections. As a human-only solution is expensive, we adopt a hybrid human-machine approach which first uses machines to generate a candidate set of matching pairs, and then asks humans to label the pairs in the candidate set as either matching or non-matching. Given the candidate pairs, existing approaches publish all pairs for verification to a crowdsourcing platform. However, they neglect the fact that the pairs satisfy transitive relations. As an example, if $o_1$ matches with $o_2$, and $o_2$ matches with $o_3$, then we can deduce that $o_1$ matches with $o_3$ without needing to crowdsource $(o_1, o_3)$. To this end, we study how to leverage transitive relations for crowdsourced joins. We propose a hybrid transitive-relations and crowdsourcing labeling framework which aims to crowdsource the minimum number of pairs to label all the candidate pairs. We prove the optimal labeling order in an ideal setting and propose a heuristic labeling order for practice. We devise a parallel labeling algorithm to efficiently crowdsource the pairs following the order. We evaluate our approaches in both a simulated environment and on a real crowdsourcing platform. Experimental results show that our approaches with transitive relations can save much more money and time than existing methods, with little loss in result quality.
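The deduction in the $o_1$, $o_2$, $o_3$ example can be sketched with a union-find structure that skips any pair whose match already follows from earlier answers. This is a minimal illustration of positive transitivity only; it is not the paper's labeling-order or parallel labeling algorithm, and `ask_crowd`, the class name, and the toy data are hypothetical stand-ins.

```python
class MatchClusters:
    """Union-find over objects; objects in one cluster are known to match."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]  # path halving
            item = self.parent[item]
        return item

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def same(self, a, b):
        return self.find(a) == self.find(b)


def label_pairs(candidate_pairs, ask_crowd):
    """Label pairs in the given order, asking the crowd only when the match
    cannot be deduced from earlier positive answers."""
    clusters = MatchClusters()
    labels, asked = {}, []
    for a, b in candidate_pairs:
        if clusters.same(a, b):
            labels[(a, b)] = True  # deduced by transitivity, no crowd cost
            continue
        answer = ask_crowd(a, b)
        asked.append((a, b))
        labels[(a, b)] = answer
        if answer:
            clusters.union(a, b)
    return labels, asked


# Hypothetical ground truth standing in for crowd workers.
truth = {"o1": "A", "o2": "A", "o3": "A", "o4": "B"}
pairs = [("o1", "o2"), ("o2", "o3"), ("o1", "o3"), ("o3", "o4")]
labels, asked = label_pairs(pairs, lambda a, b: truth[a] == truth[b])
```

In this run the pair `("o1", "o3")` is labeled without being crowdsourced. The number of pairs saved depends on the order in which pairs are processed, which is why the abstract treats the labeling order as a problem in its own right.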

preprint, 2014, arXiv

The Expected Optimal Labeling Order Problem for Crowdsourced Joins and Entity Resolution

In the SIGMOD 2013 conference, we published a paper extending our earlier work on crowdsourced entity resolution to improve crowdsourced join processing by exploiting transitive relationships [Wang et al. 2013]. The VLDB 2014 conference has a paper that follows up on our previous work [Vesdapunt et al., 2014], which points out and corrects a mistake we made in our SIGMOD paper. Specifically, in Section 4.2 of our SIGMOD paper, we defined the "Expected Optimal Labeling Order" (EOLO) problem, and proposed an algorithm for solving it. We incorrectly claimed that our algorithm is optimal. In their paper, Vesdapunt et al. show that the problem is actually NP-Hard, and based on that observation, propose a new algorithm to solve it. In this note, we would like to put the Vesdapunt et al. results in context, something we believe that their paper does not adequately do.

preprint, 2013, arXiv

Gravitational lensing effects on sub-millimetre galaxy counts

We study the effects of gravitational lensing on the number counts of sub-millimetre galaxies. We explore how the magnification cross section depends on halo density profiles, ellipticity, and a cosmological parameter (the power-spectrum normalisation $σ_8$). We show that the ellipticity does not strongly affect the magnification cross section in gravitational lensing, while the halo radial profiles do. Since the baryonic cooling effect is stronger in galaxies than in clusters, galactic haloes are more concentrated. In light of this, we explore a new two-halo-population model in which galaxies are modeled with a singular isothermal sphere profile and clusters with a Navarro, Frenk and White (NFW) profile. We find that the transition mass between the two has modest effects on the lensing probability. The cosmological parameter $σ_8$ alters the abundance of haloes and therefore affects our results. Compared with other methods, our model is simpler and more realistic. We confirm the conclusion of previous works that gravitational lensing is a natural explanation for the number count excess at the bright end.

preprint, 2012, arXiv

Breaking Out The XML MisMatch Trap

In keyword search, when a user cannot get what she wants, query refinement is needed, and the reasons can vary. We first give a thorough categorization of these reasons, then focus on one category of query refinement problem in the context of XML keyword search, where what the user searches for does not exist in the data. We refer to it as the MisMatch problem in this paper. We then propose a practical way to detect the MisMatch problem and generate helpful suggestions for users. Our approach can be viewed as a post-processing step of query evaluation and has three main features: (1) it outputs both the suggested queries and their sample results to the user, helping the user judge whether the MisMatch problem is solved without consuming all query results; (2) it is portable in the sense that it can work with any LCA-based matching semantics and is orthogonal to the choice of result retrieval method; (3) it is lightweight in that it takes up a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. An online XML keyword search engine called XClear that embeds the MisMatch problem detector and suggester has been built.

preprint, 2012, arXiv

Fast Shape Estimation for Galaxies and Stars

Model fitting is frequently used to determine the shape of galaxies and the point spread function, for example, in weak lensing analyses or in morphology studies that probe the evolution of galaxies. However, the number of parameters in the model, as well as the number of objects, are often so large as to limit the use of model fitting for future large surveys. In this article, we propose a set of algorithms to speed up the fitting process. Our approach is divided into three distinct steps: centroiding, ellipticity measurement, and profile fitting. We demonstrate that we can derive the position and ellipticity of an object analytically in the first two steps and thus leave only a small number of parameters to be derived through model fitting. The position, ellipticity, and shape parameters can then be used in constructing orthonormal basis functions such as sérsiclets for better galaxy image reconstruction. We assess the efficiency and accuracy of the algorithms with simulated images. We have not taken into account the deconvolution of the point spread function, which most weak lensing analyses do.

preprint, 2012, arXiv

SEAL: Spatio-Textual Similarity Search

Location-based services (LBS) have become more and more ubiquitous recently. Existing methods focus on finding relevant points-of-interest (POIs) based on users' locations and query keywords. Nowadays, modern LBS applications generate a new kind of spatio-textual data, regions-of-interest (ROIs), containing region-based spatial information and textual description, e.g., mobile user profiles with active regions and interest tags. To satisfy search requirements on ROIs, we study a new research problem, called spatio-textual similarity search: Given a set of ROIs and a query ROI, we find the similar ROIs by considering spatial overlap and textual similarity. Spatio-textual similarity search has many important applications, e.g., social marketing in location-aware social networks. It calls for an efficient search method to support large scales of spatio-textual data in LBS systems. To this end, we introduce a filter-and-verification framework to compute the answers. In the filter step, we generate signatures for the ROIs and the query, and utilize the signatures to generate candidates whose signatures are similar to that of the query. In the verification step, we verify the candidates and identify the final answers. To achieve high performance, we generate effective high-quality signatures, and devise efficient filtering algorithms as well as pruning techniques. Experimental results on real and synthetic datasets show that our method achieves high performance.

preprint, 2011, arXiv

PASS-JOIN: A Partition-based Method for Similarity Joins

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.
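A partition-based filter of the kind this abstract describes can be sketched as follows. The sketch assumes each string is split into tau + 1 near-equal segments, so that by the pigeonhole principle any string within edit distance tau must contain at least one of those segments unchanged. That partition scheme, the brute-force enumeration of all substrings, and every name below are assumptions made for illustration; the paper's substring selection and pruning techniques are not reproduced.

```python
def partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    segments, start = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)  # last segments get the remainder
        segments.append(s[start:start + length])
        start += length
    return segments


def edit_distance(a, b):
    """Standard dynamic-programming edit distance using two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity_join(strings, tau):
    """Self-join: return index pairs (i, j), i < j, with edit distance <= tau."""
    index = {}  # inverted index: segment text -> ids of strings containing that segment
    for sid, s in enumerate(strings):
        for seg in partition(s, tau):
            index.setdefault(seg, set()).add(sid)
    results = set()
    for pid, probe in enumerate(strings):
        # Naive probing: try every substring (including the empty one).
        substrings = {probe[i:j] for i in range(len(probe) + 1)
                      for j in range(i, len(probe) + 1)}
        candidates = set()
        for sub in substrings:
            candidates |= index.get(sub, set())
        for sid in candidates:
            if sid < pid and abs(len(strings[sid]) - len(probe)) <= tau:
                if edit_distance(strings[sid], probe) <= tau:  # verification step
                    results.add((sid, pid))
    return results


words = ["kitten", "sitting", "mitten", "bitten", "apple"]
pairs = similarity_join(words, tau=2)
```

Enumerating every substring of the probe is quadratic in its length; reducing that set to a provably minimal selection is the contribution the abstract claims, and is deliberately left out here.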

preprint2010arXiv

Mass reconstruction by gravitational shear and flexion

Galaxy clusters are considered excellent probes for cosmology. For that purpose, their mass needs to be measured and their structural properties need to be understood. We propose a method for galaxy cluster mass reconstruction which combines information from strong lensing, weak lensing shear and flexion. We extend the weak lensing analysis to the inner parts of the cluster and, in particular, improve the resolution of substructure. We use simulations to show that the method recovers the mass density profiles of the cluster. We find that the weak lensing flexion is sensitive to substructure. After combining the flexion data into the joint weak and strong lensing analysis, we can resolve the cluster properties with substructures.

preprint2007arXiv

Detecting First Star Lyman-$α$ Spheres through Gravitational Telescopes

Lyman-$α$ spheres, i.e. regions around the first stars which are illuminated by Lyman-$α$ photons and show a 21cm absorption feature against the CMB, are smoking guns at the dawn of the reionization epoch. Though the overwhelming radio foreground makes their detection extremely difficult, we point out that strong gravitational lensing can significantly improve their observational feasibility. Since Lyman-$α$ spheres have sizes of ~10", comparable to the caustic size of galaxy clusters, individual images of each strongly lensed Lyman-$α$ sphere often merge together and form single structures in the 21cm sky with irregular shapes. Using high-resolution N-body LCDM simulations, we find that the lensing probability of a magnification larger than 10 is ~$10^{-5}$. This results in $\gtrsim 10^6$ strongly lensed Lyman-$α$ spheres across the sky, which should be the primary targets for first detections of Lyman-$α$ spheres. Although the total radio array collecting area required for their detection is large (~100 km$^2$), a design based on long fixed cylindrical reflectors can significantly reduce the total cost of such an array to the level of the Square Kilometre Array (SKA) and make the detection of these very first objects feasible.

26 published item(s)

Markovian Pre-Trained Transformer for Next-Item Recommendation

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Introduction to the Chinese Space Station Survey Telescope (CSST)

About One-point Statistics of the Ratio of Two Fourier-transformed Cosmic Fields and an Application

Accelerating Edge Intelligence via Integrated Sensing and Communication

An Improved GPU-Based Ray-Shooting Code For Gravitational Microlensing

Forecast of observing time delay of the strongly lensed quasars with Muztagh-Ata 1.93m telescope

Post-Newtonian parameters of ghost-free parity-violating gravities

Tolerance For the Pixelation Effect in Shear Measurement

Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms

Relational Data Synthesis using Generative Adversarial Networks: A Design Space Exploration

Temporal Network Representation Learning via Historical Neighborhoods Aggregation

The Point Spread Function Reconstruction by Using Moffatlets - I

Characterizing localized surface plasmons using electron energy-loss spectroscopy

NXgraph: An Efficient Graph Processing System on a Single Machine

RFP: A Remote Fetching Paradigm for RDMA-Accelerated Systems

Leveraging Transitive Relations for Crowdsourced Joins

The Expected Optimal Labeling Order Problem for Crowdsourced Joins and Entity Resolution

Gravitational lensing effects on sub-millimetre galaxy counts

Breaking Out The XML MisMatch Trap

Fast Shape Estimation for Galaxies and Stars

SEAL: Spatio-Textual Similarity Search

PASS-JOIN: A Partition-based Method for Similarity Joins

Mass reconstruction by gravitational shear and flexion

Detecting First Star Lyman-$α$ Spheres through Gravitational Telescopes