Source author record

Sinong Wang

Sinong Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computation and Language, Machine Learning, Artificial Intelligence, Information Theory, math.IT, Networking and Internet Architecture, Computer Science and Game Theory, Cryptography and Security

Catalog footprint

What is connected

12 works

8 topics

4 close collaborators


preprint, 2022, arXiv

BayesFormer: Transformer with Uncertainty Estimation

The Transformer has become ubiquitous due to its dominant performance in various NLP and image processing tasks. However, there is little understanding of how to generate mathematically grounded uncertainty estimates for Transformer architectures. Models equipped with such uncertainty estimates can typically improve predictive performance, make networks robust, avoid over-fitting, and be used as an acquisition function in active learning. In this paper, we introduce BayesFormer, a Transformer model with dropouts designed by Bayesian theory. We propose a new theoretical framework to extend approximate variational inference-based dropout to Transformer-based architectures. Through extensive experiments, we validate the proposed architecture in four paradigms and show improvements across the board: language modeling and classification, long-sequence understanding, machine translation, and acquisition function for active learning.
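The idea of reading uncertainty from dropout can be sketched with generic Monte Carlo dropout: keep dropout active at inference, run several stochastic forward passes, and summarize the spread of the predictions. This is a minimal illustration only, not the BayesFormer architecture; the two-layer network, the function names, and the default `p_drop` and `n_samples` values are all assumptions made for the example.

```python
import numpy as np

def forward(x, W1, W2, p_drop, rng):
    """One stochastic forward pass of a toy two-layer classifier (hypothetical model)."""
    h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop         # dropout stays ON at inference
    h = h * mask / (1.0 - p_drop)                # inverted-dropout rescaling
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax probabilities

def mc_dropout_predict(x, W1, W2, p_drop=0.1, n_samples=50, seed=0):
    """Average n_samples stochastic passes; return mean, variance, and entropy."""
    rng = np.random.default_rng(seed)
    probs = np.stack([forward(x, W1, W2, p_drop, rng) for _ in range(n_samples)])
    mean = probs.mean(axis=0)
    # Entropy of the averaged prediction: one simple uncertainty score that
    # could serve as an acquisition function in active learning.
    entropy = -(mean * np.log(mean + 1e-12)).sum(axis=-1)
    return mean, probs.var(axis=0), entropy

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))      # 4 inputs with 8 features (toy data)
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 3))    # 3 classes
mean, var, entropy = mc_dropout_predict(x, W1, W2)
print(mean.shape, var.shape, entropy.shape)
```

In an active-learning loop, the inputs with the highest entropy would be the ones sent for labeling first.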

preprint, 2022, arXiv

Detection, Disambiguation, Re-ranking: Autoregressive Entity Linking as a Multi-Task Problem

We propose an autoregressive entity linking model that is trained with two auxiliary tasks and learns to re-rank generated samples at inference time. Our proposed novelties address two weaknesses in the literature. First, a recent method proposes to learn mention detection and then entity candidate selection, but relies on predefined sets of candidates. We use encoder-decoder autoregressive entity linking in order to bypass this need, and propose to train mention detection as an auxiliary task instead. Second, previous work suggests that re-ranking could help correct prediction errors. We add a new auxiliary task, match prediction, to learn re-ranking. Without the use of a knowledge base or candidate sets, our model sets a new state of the art on two benchmark datasets of entity linking: COMETA in the biomedical domain, and AIDA-CoNLL in the news domain. We show through ablation studies that each of the two auxiliary tasks increases performance, and that re-ranking is an important factor in the increase. Finally, our low-resource experimental results suggest that performance on the main task benefits from the knowledge learned by the auxiliary tasks, and not just from the additional training data.

preprint, 2022, arXiv

IDPG: An Instance-Dependent Prompt Generation Method

Prompt tuning is a new, efficient NLP transfer learning paradigm that adds a task-specific prompt in each input instance during the model training stage. It freezes the pre-trained language model and only optimizes a few task-specific prompts. In this paper, we propose a conditional prompt generation method to generate prompts for each input instance, referred to as the Instance-Dependent Prompt Generation (IDPG). Unlike traditional prompt tuning methods that use a fixed prompt, IDPG introduces a lightweight and trainable component to generate prompts based on each input sentence. Extensive experiments on ten natural language understanding (NLU) tasks show that the proposed strategy consistently outperforms various prompt tuning baselines and is on par with other efficient transfer learning methods such as Compacter while tuning far fewer model parameters.

preprint, 2021, arXiv

Studying Strategically: Learning to Mask for Closed-book QA

Closed-book question-answering (QA) is a challenging task that requires a model to directly answer questions without access to external knowledge. It has been shown that directly fine-tuning pre-trained language models with (question, answer) examples yields surprisingly competitive performance, which is further improved upon through adding an intermediate pre-training stage between general pre-training and fine-tuning. Prior work used a heuristic during this intermediate stage, whereby named entities and dates are masked, and the model is trained to recover these tokens. In this paper, we aim to learn the optimal masking strategy for the intermediate pre-training stage. We first train our masking policy to extract spans that are likely to be tested, using supervision from the downstream task itself, then deploy the learned policy during intermediate pre-training. Thus, our policy packs task-relevant knowledge into the parameters of a language model. Our approach is particularly effective on TriviaQA, outperforming strong heuristics when used to pre-train BART.

preprint, 2020, arXiv

CLEAR: Contrastive Learning for Sentence Representation

Pre-trained language models have proven their unique powers in capturing implicit language features. However, most pre-training approaches focus on the word-level training objective, while sentence-level objectives are rarely studied. In this paper, we propose Contrastive LEArning for sentence Representation (CLEAR), which employs multiple sentence-level augmentation strategies in order to learn a noise-invariant sentence representation. These augmentations include word and span deletion, reordering, and substitution. Furthermore, we investigate the key reasons that make contrastive learning effective through numerous experiments. We observe that different sentence augmentations during pre-training lead to different performance improvements on various downstream tasks. Our approach is shown to outperform multiple existing methods on both SentEval and GLUE benchmarks.
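The four augmentation families named above (word deletion, span deletion, reordering, substitution) can be sketched on a token list as follows. This is an illustrative sketch, not the CLEAR implementation: the function names, the deletion ratios, the choice to reorder by swapping two tokens, and the toy synonym table are all assumptions.

```python
import random

def delete_words(tokens, ratio, rng):
    """Drop a fraction `ratio` of tokens at randomly chosen positions."""
    n_drop = int(len(tokens) * ratio)
    drop = set(rng.sample(range(len(tokens)), n_drop))
    return [t for i, t in enumerate(tokens) if i not in drop]

def delete_span(tokens, ratio, rng):
    """Drop one contiguous span covering a fraction `ratio` of the tokens."""
    span = int(len(tokens) * ratio)
    start = rng.randrange(len(tokens) - span + 1)
    return tokens[:start] + tokens[start + span:]

def reorder(tokens, rng):
    """Swap two randomly chosen positions (one simple form of reordering)."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def substitute(tokens, synonyms, rng):
    """Replace each token that has an entry in the (hypothetical) synonym table."""
    return [rng.choice(synonyms[t]) if t in synonyms else t for t in tokens]

rng = random.Random(0)
tokens = "a quick brown fox jumps over lazy dogs".split()
toy_synonyms = {"quick": ["fast"], "lazy": ["idle"]}   # toy table for the example

views = [
    delete_words(tokens, 0.25, rng),
    delete_span(tokens, 0.25, rng),
    reorder(tokens, rng),
    substitute(tokens, toy_synonyms, rng),
]
for view in views:
    print(" ".join(view))
```

In a contrastive setup, two augmented views of the same sentence would form a positive pair, and views of other sentences in the batch would act as negatives.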

preprint, 2020, arXiv

Language Models as Fact Checkers?

Recent work has suggested that language models (LMs) store both common-sense and factual knowledge learned from pre-training data. In this paper, we leverage this implicit knowledge to create an effective end-to-end fact checker using solely a language model, without any external knowledge or explicit retrieval components. While previous work on extracting knowledge from LMs has focused on the task of open-domain question answering, to the best of our knowledge, this is the first work to examine the use of language models as fact checkers. In a closed-book setting, we show that our zero-shot LM approach outperforms a random baseline on the standard FEVER task, and that our fine-tuned LM compares favorably with standard baselines. Though we do not ultimately outperform methods which use explicit knowledge bases, we believe our exploration shows that this method is viable and has much room for exploration.

preprint, 2020, arXiv

Linformer: Self-Attention with Linear Complexity

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
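The complexity reduction can be illustrated by projecting the keys and values along the sequence axis from length n down to a fixed length k, so the attention score matrix has shape n x k rather than n x n. The following is a single-head NumPy sketch under stated assumptions: the projection matrices `E` and `F` are random here purely for illustration (in a trained model they would be learned), and the sizes n, d, and k are arbitrary example values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard attention: the score matrix is n x n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # (n, n)
    return softmax(scores) @ V

def projected_attention(Q, K, V, E, F):
    """Attention with keys and values compressed to k rows: scores are n x k."""
    d = Q.shape[-1]
    K_proj = E @ K                            # (k, d)
    V_proj = F @ V                            # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)        # (n, k)
    return softmax(scores) @ V_proj           # (n, d)

rng = np.random.default_rng(0)
n, d, k = 512, 64, 32                         # example sizes (assumed)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(k)  # stand-in for a learned projection
F = rng.standard_normal((k, n)) / np.sqrt(k)  # stand-in for a learned projection

out = projected_attention(Q, K, V, E, F)
print(out.shape)  # (512, 64)
```

With k held fixed, the cost of the score matrix grows linearly in n, which is the source of the $O(n)$ time and space figure quoted in the abstract.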

preprint, 2020, arXiv

To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks

Pretraining NLP models with variants of Masked Language Model (MLM) objectives has recently led to significant improvements on many tasks. This paper examines the benefits of pretrained models as a function of the number of training samples used in the downstream task. On several text classification tasks, we show that as the number of training examples grows into the millions, the accuracy gap between fine-tuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%. Our findings indicate that MLM-based models might reach a diminishing return point as the supervised data size increases significantly.

preprint, 2016, arXiv

Non-additive Security Games

We have investigated the security game under non-additive utility functions.

preprint, 2015, arXiv

Coded Caching with Heterogenous Cache Sizes

We investigate the coded caching scheme under heterogeneous cache sizes.

preprint, 2015, arXiv

The Performance Analysis of Coded Cache in Wireless Fading Channel

The rapid growth of data volume and the accompanying congestion over wireless networks have become critical issues for content providers. A novel technique, termed coded cache, has been proposed to relieve the burden. By creating coded-multicasting opportunities, the coded-cache scheme can provide an extra performance gain over the conventional push technique, which simply pre-stores contents at local caches during the network idle period. However, existing works on the coded caching scheme assume the availability of an error-free shared channel accessible by each user. This paper considers the more realistic scenario where each user may experience different link quality. In this case, the system performance is restricted by the user with the worst channel condition. Corresponding resource allocation schemes aimed at overcoming this obstacle are developed. Specifically, we employ the coded caching scheme in time-division and frequency-division transmission modes and formulate the sub-optimal problems. Power and bandwidth are allocated, respectively, to maximize the system throughput. The simulation results show that the throughput of the technique in the wireless scenario is limited and decreases as the number of users becomes sufficiently large.

preprint, 2014, arXiv

Exploiting the Unexploited of Coded Caching for Wireless Content Distribution: Detailed Theoretical Proofs

Recent studies show that the coded caching technique can facilitate wireless content distribution by reducing the wireless traffic rate during the peak-traffic time, where the contents are partially prefetched to the local cache of mobile devices during the off-peak time. The remaining contents are then jointly coded and delivered in multicast when many content requests are initiated in the peak-traffic time. The requested contents can be recovered from the locally prefetched and multicast data, with requesters experiencing less congestion. However, the benefit of the coded caching scheme is still underestimated, as the potential gain from an appropriate caching distribution is underexploited. In this paper, we propose a theoretical model to minimize the average wireless traffic rate required in coded caching, for which the optimized caching distribution is derived with the content popularity distribution taken into account. In order to improve the computational efficiency of determining the appropriate caching distribution, we transform the objective function from the average wireless traffic rate into the average size of un-prefetched contents. We theoretically show the order optimality of the derived results from both the primal model and the relaxed one. Simulation results show that the coded caching performance can be further improved with the derived caching distribution.
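The prefetch-then-multicast mechanism described above can be illustrated with the smallest textbook case: two users, two files, and each cache holding half of each file. This is a toy sketch with made-up file contents and variable names, not the optimized caching distribution derived in the paper.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two toy files, each split into two halves.
A = b"AAAAaaaa"
B = b"BBBBbbbb"
A1, A2 = A[:4], A[4:]
B1, B2 = B[:4], B[4:]

# Off-peak placement: each user prefetches a different half of every file.
cache_user1 = {"A1": A1, "B1": B1}
cache_user2 = {"A2": A2, "B2": B2}

# Peak-time delivery: user 1 requests A, user 2 requests B.
# A single multicast message serves both users at once.
coded = xor_bytes(A2, B1)

# Each user cancels the half it already holds to recover the half it lacks.
user1_A = cache_user1["A1"] + xor_bytes(coded, cache_user1["B1"])
user2_B = xor_bytes(coded, cache_user2["A2"]) + cache_user2["B2"]

uncoded_bytes = len(A2) + len(B1)   # sending both missing halves separately
coded_bytes = len(coded)            # sending one XOR-coded half
print(user1_A == A, user2_B == B, uncoded_bytes, coded_bytes)  # True True 8 4
```

The coded message is half the size of the two uncoded transmissions it replaces, which is the kind of multicast gain the abstract refers to.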
