Source author record

Cho-Yu Jason Chiang

Cho-Yu Jason Chiang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Information Retrieval Machine Learning

Catalog footprint

What is connected

3works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2025arXiv

RAGPart & RAGMask: Retrieval-Stage Defenses Against Corpus Poisoning in Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to enhance large language models (LLMs) with external knowledge, reducing hallucinations and compensating for outdated information. However, recent studies have exposed a critical vulnerability in RAG pipelines corpus poisoning where adversaries inject malicious documents into the retrieval corpus to manipulate model outputs. In this work, we propose two complementary retrieval-stage defenses: RAGPart and RAGMask. Our defenses operate directly on the retriever, making them computationally lightweight and requiring no modification to the generation model. RAGPart leverages the inherent training dynamics of dense retrievers, exploiting document partitioning to mitigate the effect of poisoned points. In contrast, RAGMask identifies suspicious tokens based on significant similarity shifts under targeted token masking. Across two benchmarks, four poisoning strategies, and four state-of-the-art retrievers, our defenses consistently reduce attack success rates while preserving utility under benign conditions. We further introduce an interpretable attack to stress-test our defenses. Our findings highlight the potential and limitations of retrieval-stage defenses, providing practical insights for robust RAG deployments.

preprint2021arXiv

Pareto GAN: Extending the Representational Power of GANs to Heavy-Tailed Distributions

Generative adversarial networks (GANs) are often billed as "universal distribution learners", but precisely what distributions they can represent and learn is still an open question. Heavy-tailed distributions are prevalent in many different domains such as financial risk-assessment, physics, and epidemiology. We observe that existing GAN architectures do a poor job of matching the asymptotic behavior of heavy-tailed distributions, a problem that we show stems from their construction. Additionally, when faced with the infinite moments and large distances between outlier points that are characteristic of heavy-tailed distributions, common loss functions produce unstable or near-zero gradients. We address these problems with the Pareto GAN. A Pareto GAN leverages extreme value theory and the functional properties of neural networks to learn a distribution that matches the asymptotic behavior of the marginal distributions of the features. We identify issues with standard loss functions and propose the use of alternative metric spaces that enable stable and efficient learning. Finally, we evaluate our proposed approach on a variety of heavy-tailed datasets.

preprint2010arXiv

On Optimal Deadlock Detection Scheduling

Deadlock detection scheduling is an important, yet often overlooked problem that can significantly affect the overall performance of deadlock handling. Excessive initiation of deadlock detection increases overall message usage, resulting in degraded system performance in the absence of deadlocks; while insufficient initiation of deadlock detection increases the deadlock persistence time, resulting in an increased deadlock resolution cost in the presence of deadlocks. The investigation of this performance tradeoff, however, is missing in the literature. This paper studies the impact of deadlock detection scheduling on the overall performance of deadlock handling. In particular, we show that there exists an optimal deadlock detection frequency that yields the minimum long-run mean average cost, which is determined by the message complexities of the deadlock detection and resolution algorithms being used, as well as the rate of deadlock formation, denoted as $λ$. For the best known deadlock detection and resolution algorithms, we show that the asymptotically optimal frequency of deadlock detection scheduling that minimizes the overall message overhead is ${\cal O}((λn)^{1/3})$, when the total number $n$ of processes is sufficiently large. Furthermore, we show that in general fully distributed (uncoordinated) deadlock detection scheduling cannot be performed as efficiently as centralized (coordinated) deadlock detection scheduling.