Source author record

Jiayi Liu

Jiayi Liu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision; Machine Learning; Artificial Intelligence; Computation and Language; astro-ph.CO; astro-ph.IM; Distributed, Parallel, and Cluster Computing; gr-qc

Catalog footprint

What is connected

10 works

8 topics

4 close collaborators

preprint, 2026, arXiv

Generating Leakage-Free Benchmarks for Robust RAG Evaluation

Retrieval-augmented generation (RAG) is widely used to augment large language models (LLMs) with external knowledge. However, many benchmark datasets, designed to test RAG performance, comprise many questions that can already be answered from an LLM's parametric memory. This leads to unreliable evaluation. We refer to this phenomenon as knowledge leakage: cases where RAG tasks are solvable without retrieval. This issue worsens over time due to benchmark aging. As benchmarks are reused for training, their contents are increasingly absorbed into model parameters, making them less effective for evaluating retrieval. We introduce SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage and addresses the issue of benchmark aging. Starting from a seed benchmark dataset, SeedRG extracts a reasoning graph from question-context pairs to capture their underlying reasoning structure, and then generates new examples via type-constrained entity replacement. This process produces structurally similar but novel instances that are unlikely to exist in the model's parametric knowledge, while preserving the original reasoning patterns. To ensure quality, we incorporate two verification steps: (1) a reasoning-graph consistency check to maintain task difficulty, and (2) a knowledge-leakage filter to exclude instances answerable without retrieval.
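The knowledge-leakage filter described in this abstract can be illustrated with a minimal sketch. It is not the SeedRG implementation: the abstract does not say how answers are compared, so the normalized exact-match test is an assumption, and `closed_book_answer`, `toy_model`, and the example data are hypothetical stand-ins for an LLM queried without retrieval.

```python
import string
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def filter_leaked(
    examples: List[Dict[str, str]],
    closed_book_answer: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep only examples the model cannot answer without retrieval.

    `closed_book_answer` is a hypothetical callable that returns the
    model's answer when it is given the question alone (no context).
    Normalized exact match is an assumed comparison rule.
    """
    kept = []
    for example in examples:
        guess = closed_book_answer(example["question"])
        if normalize(guess) != normalize(example["answer"]):
            kept.append(example)
    return kept


def toy_model(question: str) -> str:
    """Stand-in for an LLM: it has 'memorized' exactly one fact."""
    memorized = {"What is the capital of France?": "Paris."}
    return memorized.get(question, "unknown")


examples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who founded the fictional firm Veltrax?", "answer": "Mira Solen"},
]
kept = filter_leaked(examples, toy_model)
```

In this toy run the first question is answered from the stand-in model's memory and is dropped, while the question about the invented entity is kept.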

preprint, 2022, arXiv

Exemplar-based Pattern Synthesis with Implicit Periodic Field Network

Synthesis of ergodic, stationary visual patterns is widely applicable in texturing, shape modeling, and digital content creation. The wide applicability of this technique thus requires the pattern synthesis approaches to be scalable, diverse, and authentic. In this paper, we propose an exemplar-based visual pattern synthesis framework that aims to model the inner statistics of visual patterns and generate new, versatile patterns that meet the aforementioned requirements. To this end, we propose an implicit network based on generative adversarial network (GAN) and periodic encoding, thus calling our network the Implicit Periodic Field Network (IPFN). The design of IPFN ensures scalability: the implicit formulation directly maps the input coordinates to features, which enables synthesis of arbitrary size and is computationally efficient for 3D shape synthesis. Learning with a periodic encoding scheme encourages diversity: the network is constrained to model the inner statistics of the exemplar based on spatial latent codes in a periodic field. Coupled with continuously designed GAN training procedures, IPFN is shown to synthesize tileable patterns with smooth transitions and local variations. Last but not least, thanks to both the adversarial training technique and the encoded Fourier features, IPFN learns high-frequency functions that produce authentic, high-quality results. To validate our approach, we present novel experimental results on various applications in 2D texture synthesis and 3D shape synthesis.

preprint, 2022, arXiv

ImageSubject: A Large-scale Dataset for Subject Detection

Images and videos usually contain main subjects: the objects that the photographer wants to highlight. Human viewers can easily identify them, but algorithms often confuse them with other objects. Detecting the main subjects is an important technique to help machines understand the content of images and videos. We present a new dataset with the goal of training models to understand the layout of the objects and the context of the image, and then to find the main subjects among them. This is achieved in three aspects. By gathering images from movie shots created by directors with professional shooting skills, we collect a dataset with strong diversity; specifically, it contains 107,700 images from 21,540 movie shots. We labeled them with bounding boxes for two classes: subject and non-subject foreground object. We present a detailed analysis of the dataset and compare the task with saliency detection and object detection. ImageSubject is the first dataset that tries to localize the subject in an image that the photographer wants to highlight. Moreover, we find that the transformer-based detection model offers the best results among popular model architectures. Finally, we discuss the potential applications and conclude with the importance of the dataset.

preprint, 2022, arXiv

Improving Personality Consistency in Conversation by Persona Extending

Endowing chatbots with a consistent personality plays a vital role in enabling agents to deliver human-like interactions. However, existing personalized approaches commonly generate responses based on static predefined personas given as textual descriptions, which may severely restrict the interactivity between the human and the chatbot, especially when the agent needs to answer a query not covered by the predefined personas. This is the so-called out-of-predefined persona problem (OOP for short). To alleviate the problem, we propose a novel retrieval-to-prediction paradigm consisting of two subcomponents: (1) a Persona Retrieval Model (PRM), which retrieves a persona from a global collection based on a Natural Language Inference (NLI) model, such that the inferred persona is consistent with the predefined personas; and (2) a Posterior-scored Transformer (PS-Transformer), which adopts a persona posterior distribution that further considers the actual personas used in the ground response, maximally mitigating the gap between training and inference. Furthermore, we present a dataset called IT-ConvAI2, which is the first to highlight the OOP problem in personalized dialogue. Extensive experiments on both IT-ConvAI2 and ConvAI2 demonstrate that our proposed model yields considerable improvements in both automatic metrics and human evaluations.

preprint, 2022, arXiv

Tidal effects of dark matter halo around a galactic black hole

We investigate the tidal forces and geodesic deviation motion in the spacetime of a black hole in a galaxy with a dark matter halo. Our results show that the tidal force and geodesic deviation motion depend on the dark matter halo mass and the typical length scale of the galaxy. The effect of the typical length scale of the galaxy on the tidal force is opposite to that of the dark matter mass. The radial tidal force increases with the dark matter mass in the region far from the black hole but decreases in the region near the black hole. The absolute value of the angular tidal force increases monotonically with the dark matter halo mass. Notably, the angular tidal force also depends on the particle's energy, and the effects of dark matter become more pronounced for test particles with high energy, which differs from the usual static black hole spacetimes. We also present the change of the geodesic deviation vector with the dark matter halo mass and the typical length scale of the galaxy under two kinds of initial conditions.

preprint, 2020, arXiv

Improving Model Training by Periodic Sampling over Weight Distributions

In this paper, we explore techniques centered around periodic sampling of model weights that provide convergence improvements on gradient update methods (vanilla SGD, Momentum, Adam) for a variety of vision problems (classification, detection, segmentation). Importantly, our algorithms provide better, faster and more robust convergence and training performance with only a slight increase in computation time. Our techniques are independent of the neural network model, gradient optimization methods or existing optimal training policies and converge in a less volatile fashion with performance improvements that are approximately monotonic. We conduct a variety of experiments to quantify these improvements and identify scenarios where these techniques could be more useful.

preprint, 2020, arXiv

On-Device Machine Learning: An Algorithms and Learning Theory Perspective

The predominant paradigm for using machine learning models on a device is to train a model in the cloud and perform inference using the trained model on the device. However, with an increasing number of smart devices and improved hardware, there is interest in performing model training on the device. Given this surge in interest, a comprehensive survey of the field from a device-agnostic perspective sets the stage for both understanding the state-of-the-art and for identifying open challenges and future avenues of research. However, on-device learning is an expansive field with connections to a large number of related topics in AI and machine learning (including online learning, model adaptation, one/few-shot learning, etc.). Hence, covering such a large number of topics in a single survey is impractical. This survey finds a middle ground by reformulating the problem of on-device learning as resource-constrained learning where the resources are compute and memory. This reformulation allows tools, techniques, and algorithms from a wide variety of research areas to be compared equitably. In addition to summarizing the state-of-the-art, the survey also identifies a number of challenges and next steps for both the algorithmic and theoretical aspects of on-device learning.

preprint, 2020, arXiv

Pruning Algorithms to Accelerate Convolutional Neural Networks for Edge Applications: A Survey

With the general trend of increasing Convolutional Neural Network (CNN) model sizes, model compression and acceleration techniques have become critical for the deployment of these models on edge devices. In this paper, we provide a comprehensive survey on Pruning, a major compression strategy that removes non-critical or redundant neurons from a CNN model. The survey covers the overarching motivation for pruning, different strategies and criteria, their advantages and drawbacks, along with a compilation of major pruning techniques. We conclude the survey with a discussion on alternatives to pruning and current challenges for the model compression community.

preprint, 2016, arXiv

$\mathtt{ComEst}$: a Completeness Estimator of Source Extraction on Astronomical Imaging

The completeness of source detection is critical for analyzing the photometric and spatial properties of the population of interest observed by astronomical imaging. We present a software package $\mathtt{ComEst}$, which calculates the completeness of source detection on charge-coupled device (CCD) images of astronomical observations, especially for the optical and near-infrared (NIR) imaging of galaxies and point sources. The completeness estimator $\mathtt{ComEst}$ is designed for the source finder $\mathtt{SExtractor}$ used on the CCD images saved in the Flexible Image Transport System (FITS) format. Specifically, $\mathtt{ComEst}$ estimates the completeness of the source detection by deriving the detection rate of synthetic point sources and galaxies simulated on the observed CCD images. In order to capture any observational artifacts or noise properties while deriving the completeness, $\mathtt{ComEst}$ directly carries out the detection of simulated sources on the observed images. Given an observed CCD image saved in FITS format, $\mathtt{ComEst}$ derives the completeness of the source detection from end to end as a function of source flux (or magnitude) and CCD position. In addition, $\mathtt{ComEst}$ can also estimate the purity of the source detection by comparing the catalog of the detected sources to the input catalogs of the simulated sources. We run $\mathtt{ComEst}$ on the images from the Blanco Cosmology Survey (BCS) and compare the derived completeness as a function of magnitude to the limiting magnitudes derived by using the signal-to-noise ratio (SNR) and number count histogram of the detected sources. $\mathtt{ComEst}$ is released as a Python package with an easy-to-use syntax and is publicly available at https://github.com/inonchiu/ComEst
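The core quantity in this abstract, completeness as the detection rate of injected synthetic sources per magnitude bin, can be sketched in a few lines of NumPy. This does not use or reproduce the ComEst API: the function name `completeness_by_magnitude`, its arguments, and the toy numbers are all hypothetical, and the source injection and SExtractor detection steps are assumed to have already produced the `recovered` flags.

```python
import numpy as np


def completeness_by_magnitude(injected_mag, recovered, bin_edges):
    """Fraction of injected sources that were detected, per magnitude bin.

    injected_mag : magnitudes of the simulated sources placed on the image
    recovered    : boolean flags, True where the source finder detected them
    bin_edges    : monotonically increasing magnitude bin edges
    Bins with no injected sources are returned as NaN.
    """
    injected_mag = np.asarray(injected_mag, dtype=float)
    recovered = np.asarray(recovered, dtype=bool)

    n_injected, _ = np.histogram(injected_mag, bins=bin_edges)
    n_recovered, _ = np.histogram(injected_mag[recovered], bins=bin_edges)

    completeness = np.full(n_injected.shape, np.nan)
    nonempty = n_injected > 0
    completeness[nonempty] = n_recovered[nonempty] / n_injected[nonempty]
    return completeness


# Toy data (illustrative only): fainter sources are recovered less often.
mags = [20.1, 20.5, 21.2, 21.7, 22.3, 22.8]
found = [True, True, True, False, False, False]
edges = [20.0, 21.0, 22.0, 23.0]
frac = completeness_by_magnitude(mags, found, edges)
```

With these toy inputs the three bins hold two injected sources each, of which two, one, and none are recovered. The same binning could be repeated per CCD region to obtain completeness as a function of position.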

preprint, 2010, arXiv

Noisy weak-lensing convergence peak statistics near clusters of galaxies and beyond

Taking into account noise from intrinsic ellipticities of source galaxies, in this paper, we study the peak statistics in weak-lensing convergence maps around clusters of galaxies and beyond. We emphasize how the noise peak statistics is affected by the density distribution of nearby clusters, and also how cluster-peak signals are changed by the existence of noise. These are the important aspects to be understood thoroughly in weak-lensing analyses for individual clusters as well as in cosmological applications of weak-lensing cluster statistics. We adopt Gaussian smoothing with the smoothing scale $\theta_G=0.5\hbox{ arcmin}$ in our analyses. It is found that the noise peak distribution near a cluster of galaxies depends sensitively on the density profile of the cluster. For a cored isothermal cluster with the core radius $R_c$, the inner region with $R\le R_c$ appears noisy containing on average $\sim 2.4$ peaks with $\nu\ge 5$ for $R_c= 1.7\hbox{ arcmin}$ and the true peak height of the cluster $\nu=5.6$, where $\nu$ denotes the convergence signal to noise ratio. For a NFW cluster of the same mass and the same central $\nu$, the average number of peaks with $\nu\ge 5$ within $R\le R_c$ is $\sim 1.6$. Thus a high peak corresponding to the main cluster can be identified more cleanly in the NFW case. In the outer region with $R_c<R\le 5R_c$, the number of high noise peaks is considerably enhanced in comparison with that of the pure noise case without the nearby cluster. (abridged)
