Researcher profile

Ruqing Zhang

Ruqing Zhang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
17works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

17 published item(s)

preprint2022arXiv

A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval

Dense retrieval (DR) has shown promising results in information retrieval. In essence, DR requires high-quality text representations to support effective search in the representation space. Recent studies have shown that pre-trained autoencoder-based language models with a weak decoder can provide high-quality text representations, boosting the effectiveness and few-shot ability of DR models. However, even a weak autoregressive decoder has the bypass effect on the encoder. More importantly, the discriminative ability of learned representations may be limited since each token is treated equally important in decoding the input texts. To address the above problems, in this paper, we propose a contrastive pre-training approach to learn a discriminative autoencoder with a lightweight multi-layer perception (MLP) decoder. The basic idea is to generate word distributions of input text in a non-autoregressive fashion and pull the word distributions of two masked versions of one text close while pushing away from others. We theoretically show that our contrastive strategy can suppress the common words and highlight the representative words in decoding, leading to discriminative representations. Empirical results show that our method can significantly outperform the state-of-the-art autoencoder-based language models and other pre-trained models for dense retrieval.

preprint2022arXiv

A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration

The cross-entropy objective has proved to be an all-purpose training objective for autoregressive language models (LMs). However, without considering the penalization of problematic tokens, LMs trained using cross-entropy exhibit text degeneration. To address this, unlikelihood training has been proposed to reduce the probability of unlikely tokens predicted by LMs. But unlikelihood does not consider the relationship between the label tokens and unlikely token candidates, thus showing marginal improvements in degeneration. We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training and avoids their limitations. The key idea is to teach a LM to generate high probabilities for label tokens and low probabilities of negative candidates. Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields much less repetitive texts, with a higher generation quality than baseline approaches, achieving the new state-of-the-art performance on text degeneration.

preprint2022arXiv

Are Neural Ranking Models Robust?

Recently, we have witnessed the bloom of neural ranking models in the information retrieval (IR) field. So far, much effort has been devoted to developing effective neural ranking models that can generalize well on new data. There has been less attention paid to the robustness perspective. Unlike the effectiveness which is about the average performance of a system under normal purpose, robustness cares more about the system performance in the worst case or under malicious operations instead. When a new technique enters into the real-world application, it is critical to know not only how it works in average, but also how would it behave in abnormal situations. So we raise the question in this work: Are neural ranking models robust? To answer this question, firstly, we need to clarify what we refer to when we talk about the robustness of ranking models in IR. We show that robustness is actually a multi-dimensional concept and there are three ways to define it in IR: 1) The performance variance under the independent and identically distributed (I.I.D.) setting; 2) The out-of-distribution (OOD) generalizability; and 3) The defensive ability against adversarial operations. The latter two definitions can be further specified into two different perspectives respectively, leading to 5 robustness tasks in total. Based on this taxonomy, we build corresponding benchmark datasets, design empirical experiments, and systematically analyze the robustness of several representative neural ranking models against traditional probabilistic ranking models and learning-to-rank (LTR) models. The empirical results show that there is no simple answer to our question. While neural ranking models are less robust against other IR models in most cases, some of them can still win 1 out of 5 tasks. This is the first comprehensive study on the robustness of neural ranking models.

preprint2022arXiv

B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Pre-training and fine-tuning have achieved remarkable success in many downstream natural language processing (NLP) tasks. Recently, pre-training methods tailored for information retrieval (IR) have also been explored, and the latest success is the PROP method which has reached new SOTA on a variety of ad-hoc retrieval benchmarks. The basic idea of PROP is to construct the \textit{representative words prediction} (ROP) task for pre-training inspired by the query likelihood model. Despite its exciting performance, the effectiveness of PROP might be bounded by the classical unigram language model adopted in the ROP task construction process. To tackle this problem, we propose a bootstrapped pre-training method (namely B-PROP) based on BERT for ad-hoc retrieval. The key idea is to use the powerful contextual language model BERT to replace the classical unigram language model for the ROP task construction, and re-train BERT itself towards the tailored objective for IR. Specifically, we introduce a novel contrastive method, inspired by the divergence-from-randomness idea, to leverage BERT's self-attention mechanism to sample representative words from the document. By further fine-tuning on downstream ad-hoc retrieval tasks, our method achieves significant improvements over baselines without pre-training or with other pre-training methods, and further pushes forward the SOTA on a variety of ad-hoc retrieval tasks.

preprint2022arXiv

Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models

Neural ranking models (NRMs) have achieved promising results in information retrieval. NRMs have also been shown to be vulnerable to adversarial examples. A typical Word Substitution Ranking Attack (WSRA) against NRMs was proposed recently, in which an attacker promotes a target document in rankings by adding human-imperceptible perturbations to its text. This raises concerns when deploying NRMs in real-world applications. Therefore, it is important to develop techniques that defend against such attacks for NRMs. In empirical defenses adversarial examples are found during training and used to augment the training set. However, such methods offer no theoretical guarantee on the models' robustness and may eventually be broken by other sophisticated WSRAs. To escape this arms race, rigorous and provable certified defense methods for NRMs are needed. To this end, we first define the \textit{Certified Top-$K$ Robustness} for ranking models since users mainly care about the top ranked results in real-world scenarios. A ranking model is said to be Certified Top-$K$ Robust on a ranked list when it is guaranteed to keep documents that are out of the top $K$ away from the top $K$ under any attack. Then, we introduce a Certified Defense method, named CertDR, to achieve certified top-$K$ robustness against WSRA, based on the idea of randomized smoothing. Specifically, we first construct a smoothed ranker by applying random word substitutions on the documents, and then leverage the ranking property jointly with the statistical property of the ensemble to provably certify top-$K$ robustness. Extensive experiments on two representative web search datasets demonstrate that CertDR can significantly outperform state-of-the-art empirical defense methods for ranking models.

preprint2022arXiv

CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks

Knowledge-intensive language tasks (KILT) usually require a large body of information to provide correct answers. A popular paradigm to solve this problem is to combine a search system with a machine reader, where the former retrieves supporting evidences and the latter examines them to produce answers. Recently, the reader component has witnessed significant advances with the help of large-scale pre-trained generative models. Meanwhile most existing solutions in the search component rely on the traditional ``index-retrieve-then-rank'' pipeline, which suffers from large memory footprint and difficulty in end-to-end optimization. Inspired by recent efforts in constructing model-based IR models, we propose to replace the traditional multi-step search pipeline with a novel single-step generative model, which can dramatically simplify the search process and be optimized in an end-to-end manner. We show that a strong generative retrieval model can be learned with a set of adequately designed pre-training tasks, and be adopted to improve a variety of downstream KILT tasks with further fine-tuning. We name the pre-trained generative retrieval model as CorpusBrain as all information about the corpus is encoded in its parameters without the need of constructing additional index. Empirical results show that CorpusBrain can significantly outperform strong baselines for the retrieval task on the KILT benchmark and establish new state-of-the-art downstream performances. We also show that CorpusBrain works well under zero- and low-resource settings.

preprint2022arXiv

GERE: Generative Evidence Retrieval for Fact Verification

Fact verification (FV) is a challenging task which aims to verify a claim using multiple evidential sentences from trustworthy corpora, e.g., Wikipedia. Most existing approaches follow a three-step pipeline framework, including document retrieval, sentence retrieval and claim verification. High-quality evidences provided by the first two steps are the foundation of the effective reasoning in the last step. Despite being important, high-quality evidences are rarely studied by existing works for FV, which often adopt the off-the-shelf models to retrieve relevant documents and sentences in an "index-retrieve-then-rank" fashion. This classical approach has clear drawbacks as follows: i) a large document index as well as a complicated search process is required, leading to considerable memory and computational overhead; ii) independent scoring paradigms fail to capture the interactions among documents and sentences in ranking; iii) a fixed number of sentences are selected to form the final evidence set. In this work, we propose GERE, the first system that retrieves evidences in a generative fashion, i.e., generating the document titles as well as evidence sentence identifiers. This enables us to mitigate the aforementioned technical issues since: i) the memory and computational cost is greatly reduced because the document index is eliminated and the heavy ranking process is replaced by a light generative process; ii) the dependency between documents and that between sentences could be captured via sequential generation process; iii) the generative formulation allows us to dynamically select a precise set of relevant evidences for each claim. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines, with both time-efficiency and memory-efficiency.

preprint2022arXiv

Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models

Neural ranking models (NRMs) have become one of the most important techniques in information retrieval (IR). Due to the limitation of relevance labels, the training of NRMs heavily relies on negative sampling over unlabeled data. In general machine learning scenarios, it has shown that training with hard negatives (i.e., samples that are close to positives) could lead to better performance. Surprisingly, we find opposite results from our empirical studies in IR. When sampling top-ranked results (excluding the labeled positives) as negatives from a stronger retriever, the performance of the learned NRM becomes even worse. Based on our investigation, the superficial reason is that there are more false negatives (i.e., unlabeled positives) in the top-ranked results with a stronger retriever, which may hurt the training process; The root is the existence of pooling bias in the dataset constructing process, where annotators only judge and label very few samples selected by some basic retrievers. Therefore, in principle, we can formulate the false negative issue in training NRMs as learning from labeled datasets with pooling bias. To solve this problem, we propose a novel Coupled Estimation Technique (CET) that learns both a relevance model and a selection model simultaneously to correct the pooling bias for training NRMs. Empirical results on three retrieval benchmarks show that NRMs trained with our technique can achieve significant gains on ranking effectiveness against other baseline strategies.

preprint2022arXiv

PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models

Neural ranking models (NRMs) have shown remarkable success in recent years, especially with pre-trained language models. However, deep neural models are notorious for their vulnerability to adversarial examples. Adversarial attacks may become a new type of web spamming technique given our increased reliance on neural information retrieval models. Therefore, it is important to study potential adversarial attacks to identify vulnerabilities of NRMs before they are deployed. In this paper, we introduce the Word Substitution Ranking Attack (WSRA) task against NRMs, which aims to promote a target document in rankings by adding adversarial perturbations to its text. We focus on the decision-based black-box attack setting, where the attackers cannot directly get access to the model information, but can only query the target model to obtain the rank positions of the partial retrieved list. This attack setting is realistic in real-world search engines. We propose a novel Pseudo Relevance-based ADversarial ranking Attack method (PRADA) that learns a surrogate model based on Pseudo Relevance Feedback (PRF) to generate gradients for finding the adversarial perturbations. Experiments on two web search benchmark datasets show that PRADA can outperform existing attack strategies and successfully fool the NRM with small indiscernible perturbations of text.

preprint2022arXiv

Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction

Dense retrieval has shown promising results in many information retrieval (IR) related tasks, whose foundation is high-quality text representation learning for effective search. Some recent studies have shown that autoencoder-based language models are able to boost the dense retrieval performance using a weak decoder. However, we argue that 1) it is not discriminative to decode all the input texts and, 2) even a weak decoder has the bypass effect on the encoder. Therefore, in this work, we introduce a novel contrastive span prediction task to pre-train the encoder alone, but still retain the bottleneck ability of the autoencoder. % Therefore, in this work, we propose to drop out the decoder and introduce a novel contrastive span prediction task to pre-train the encoder alone. The key idea is to force the encoder to generate the text representation close to its own random spans while far away from others using a group-wise contrastive loss. In this way, we can 1) learn discriminative text representations efficiently with the group-wise contrastive learning over spans and, 2) avoid the bypass effect of the decoder thoroughly. Comprehensive experiments over publicly available retrieval benchmark datasets show that our approach can outperform existing pre-training methods for dense retrieval significantly.

preprint2022arXiv

Pre-training Methods in Information Retrieval

The core of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list to respond to the user's information need. In recent years, the resurgence of deep learning has greatly advanced this field and leads to a hot topic named NeuIR (i.e., neural information retrieval), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and huge model size, pre-trained models can learn universal language representations from massive textual data, which are beneficial to the ranking task of IR. Recently, a large number of works, which are dedicated to the application of PTMs in IR, have been introduced to promote the retrieval performance. Considering the rapid progress of this direction, this survey aims to provide a systematic review of pre-training methods in IR. To be specific, we present an overview of PTMs applied in different components of an IR system, including the retrieval component, the re-ranking component, and other components. In addition, we also introduce PTMs specifically designed for IR, and summarize available datasets as well as benchmark leaderboards. Moreover, we discuss some open challenges and highlight several promising directions, with the hope of inspiring and facilitating more works on these topics for future research.

preprint2022arXiv

Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval

Pre-training and fine-tuning have achieved significant advances in the information retrieval (IR). A typical approach is to fine-tune all the parameters of large-scale pre-trained models (PTMs) on downstream tasks. As the model size and the number of tasks increase greatly, such approach becomes less feasible and prohibitively expensive. Recently, a variety of parameter-efficient tuning methods have been proposed in natural language processing (NLP) that only fine-tune a small number of parameters while still attaining strong performance. Yet there has been little effort to explore parameter-efficient tuning for IR. In this work, we first conduct a comprehensive study of existing parameter-efficient tuning methods at both the retrieval and re-ranking stages. Unlike the promising results in NLP, we find that these methods cannot achieve comparable performance to full fine-tuning at both stages when updating less than 1\% of the original model parameters. More importantly, we find that the existing methods are just parameter-efficient, but not learning-efficient as they suffer from unstable training and slow convergence. To analyze the underlying reason, we conduct a theoretical analysis and show that the separation of the inserted trainable modules makes the optimization difficult. To alleviate this issue, we propose to inject additional modules alongside the \acp{PTM} to make the original scattered modules connected. In this way, all the trainable modules can form a pathway to smooth the loss surface and thus help stabilize the training process. Experiments at both retrieval and re-ranking stages show that our method outperforms existing parameter-efficient methods significantly, and achieves comparable or even better performance over full fine-tuning.

preprint2021arXiv

A Linguistic Study on Relevance Modeling in Information Retrieval

Relevance plays a central role in information retrieval (IR), which has received extensive studies starting from the 20th century. The definition and the modeling of relevance has always been critical challenges in both information science and computer science research areas. Along with the debate and exploration on relevance, IR has already become a core task in many real-world applications, such as Web search engines, question answering systems, conversational bots, and so on. While relevance acts as a unified concept in all these retrieval tasks, the inherent definitions are quite different due to the heterogeneity of these tasks. This raises a question to us: Do these different forms of relevance really lead to different modeling focuses? To answer this question, in this work, we conduct an empirical study on relevance modeling in three representative IR tasks, i.e., document retrieval, answer retrieval, and response retrieval. Specifically, we attempt to study the following two questions: 1) Does relevance modeling in these tasks really show differences in terms of natural language understanding (NLU)? We employ 16 linguistic tasks to probe a unified retrieval model over these three retrieval tasks to answer this question. 2) If there do exist differences, how can we leverage the findings to enhance the relevance modeling? We proposed three intervention methods to investigate how to leverage different modeling focuses of relevance to improve these IR tasks. We believe the way we study the problem as well as our findings would be beneficial to the IR community.

preprint2021arXiv

Learning to Truncate Ranked Lists for Information Retrieval

Ranked list truncation is of critical importance in a variety of professional information retrieval applications such as patent search or legal search. The goal is to dynamically determine the number of returned documents according to some user-defined objectives, in order to reach a balance between the overall utility of the results and user efforts. Existing methods formulate this task as a sequential decision problem and take some pre-defined loss as a proxy objective, which suffers from the limitation of local decision and non-direct optimization. In this work, we propose a global decision based truncation model named AttnCut, which directly optimizes user-defined objectives for the ranked list truncation. Specifically, we take the successful transformer architecture to capture the global dependency within the ranked list for truncation decision, and employ the reward augmented maximum likelihood (RAML) for direct optimization. We consider two types of user-defined objectives which are of practical usage. One is the widely adopted metric such as F1 which acts as a balanced objective, and the other is the best F1 under some minimal recall constraint which represents a typical objective in professional search. Empirical results over the Robust04 and MQ2007 datasets demonstrate the effectiveness of our approach as compared with the state-of-the-art baselines.

preprint2020arXiv

Continual Domain Adaptation for Machine Reading Comprehension

Machine reading comprehension (MRC) has become a core component in a variety of natural language processing (NLP) applications such as question answering and dialogue systems. It becomes a practical challenge that an MRC model needs to learn in non-stationary environments, in which the underlying data distribution changes over time. A typical scenario is the domain drift, i.e. different domains of data come one after another, where the MRC model is required to adapt to the new domain while maintaining previously learned ability. To tackle such a challenge, in this work, we introduce the \textit{Continual Domain Adaptation} (CDA) task for MRC. So far as we know, this is the first study on the continual learning perspective of MRC. We build two benchmark datasets for the CDA task, by re-organizing existing MRC collections into different domains with respect to context type and question type, respectively. We then analyze and observe the catastrophic forgetting (CF) phenomenon of MRC under the CDA setting. To tackle the CDA task, we propose several BERT-based continual learning MRC models using either regularization-based methodology or dynamic-architecture paradigm. We analyze the performance of different continual learning MRC models under the CDA task and show that the proposed dynamic-architecture based model achieves the best performance.

preprint2020arXiv

Match$^2$: A Matching over Matching Model for Similar Question Identification

Community Question Answering (CQA) has become a primary means for people to acquire knowledge, where people are free to ask questions or submit answers. To enhance the efficiency of the service, similar question identification becomes a core task in CQA which aims to find a similar question from the archived repository whenever a new question is asked. However, it has long been a challenge to properly measure the similarity between two questions due to the inherent variation of natural language, i.e., there could be different ways to ask a same question or different questions sharing similar expressions. To alleviate this problem, it is natural to involve the existing answers for the enrichment of the archived questions. Traditional methods typically take a one-side usage, which leverages the answer as some expanded representation of the corresponding question. Unfortunately, this may introduce unexpected noises into the similarity computation since answers are often long and diverse, leading to inferior performance. In this work, we propose a two-side usage, which leverages the answer as a bridge of the two questions. The key idea is based on our observation that similar questions could be addressed by similar parts of the answer while different questions may not. In other words, we can compare the matching patterns of the two questions over the same answer to measure their similarity. In this way, we propose a novel matching over matching model, namely Match$^2$, which compares the matching patterns between two question-answer pairs for similar question identification. Empirical experiments on two benchmark datasets demonstrate that our model can significantly outperform previous state-of-the-art methods on the similar question identification task.

preprint2020arXiv

Query Understanding via Intent Description Generation

Query understanding is a fundamental problem in information retrieval (IR), which has attracted continuous attention through the past decades. Many different tasks have been proposed for understanding users' search queries, e.g., query classification or query clustering. However, it is not that precise to understand a search query at the intent class/cluster level due to the loss of many detailed information. As we may find in many benchmark datasets, e.g., TREC and SemEval, queries are often associated with a detailed description provided by human annotators which clearly describes its intent to help evaluate the relevance of the documents. If a system could automatically generate a detailed and precise intent description for a search query, like human annotators, that would indicate much better query understanding has been achieved. In this paper, therefore, we propose a novel Query-to-Intent-Description (Q2ID) task for query understanding. Unlike those existing ranking tasks which leverage the query and its description to compute the relevance of documents, Q2ID is a reverse task which aims to generate a natural language intent description based on both relevant and irrelevant documents of a given query. To address this new task, we propose a novel Contrastive Generation model, namely CtrsGen for short, to generate the intent description by contrasting the relevant documents with the irrelevant documents given a query. We demonstrate the effectiveness of our model by comparing with several state-of-the-art generation models on the Q2ID task. We discuss the potential usage of such Q2ID technique through an example application.