Researcher profile

Ahmed Hassan Awadallah

Ahmed Hassan Awadallah contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
16works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

16 published item(s)

preprint2022arXiv

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.

preprint2022arXiv

Knowledge Infused Decoding

Pre-trained language models (LMs) have been shown to memorize a substantial amount of knowledge from the pre-training corpora; however, they are still limited in recalling factually correct knowledge given a certain context. Hence, they tend to suffer from counterfactual or hallucinatory generation when used in knowledge-intensive natural language generation (NLG) tasks. Recent remedies to this problem focus on modifying either the pre-training or task fine-tuning objectives to incorporate knowledge, which normally require additional costly training or architecture modification of LMs for practical applications. We present Knowledge Infused Decoding (KID) -- a novel decoding algorithm for generative LMs, which dynamically infuses external knowledge into each step of the LM decoding. Specifically, we maintain a local knowledge memory based on the current context, interacting with a dynamically created external knowledge trie, and continuously update the local memory as a knowledge-aware constraint to guide decoding via reinforcement learning. On six diverse knowledge-intensive NLG tasks, task-agnostic LMs (e.g., GPT-2 and BART) armed with KID outperform many task-optimized state-of-the-art models, and show particularly strong performance in few-shot scenarios over seven related knowledge-infusion techniques. Human evaluation confirms KID's ability to generate more relevant and factual language for the input context when compared with multiple baselines. Finally, KID also alleviates exposure bias and provides stable generation quality when generating longer sequences. Code for KID is available at https://github.com/microsoft/KID.

preprint2022arXiv

LiST: Lite Prompted Self-training Makes Parameter-Efficient Few-shot Learners

We present a new method LiST is short for Lite Prompted Self-Training for parameter-efficient fine-tuning of large pre-trained language models (PLMs) for few-shot learning. LiST improves over recent methods that adopt prompt-based fine-tuning (FN) using two key techniques. The first is the use of self-training to leverage large amounts of unlabeled data for prompt-based FN in few-shot settings. We use self-training in conjunction with meta-learning for re-weighting noisy pseudo-prompt labels. Self-training is expensive as it requires updating all the model parameters repetitively. Therefore, we use a second technique for light-weight fine-tuning where we introduce a small number of task-specific parameters that are fine-tuned during self-training while keeping the PLM encoder frozen. Our experiments show that LiST can effectively leverage unlabeled data to improve the model performance for few-shot learning. Additionally, the fine-tuning is efficient as it only updates a small percentage of parameters and the overall model footprint is reduced since several tasks can share a common PLM encoder as backbone. A comprehensive study on six NLU tasks demonstrate LiST to improve by 35% over classic fine-tuning and 6% over prompt-based FN with 96% reduction in number of trainable parameters when fine-tuned with no more than 30 labeled examples from each task. With only 14M tunable parameters, LiST outperforms GPT-3 in-context learning by 33% on few-shot NLU tasks.

preprint2022arXiv

Pathologies of Pre-trained Language Models in Few-shot Fine-tuning

Although adapting pre-trained language models with few examples has shown promising performance on text classification, there is a lack of understanding of where the performance gain comes from. In this work, we propose to answer this question by interpreting the adaptation behavior using post-hoc explanations from model predictions. By modeling feature statistics of explanations, we discover that (1) without fine-tuning, pre-trained models (e.g. BERT and RoBERTa) show strong prediction bias across labels; (2) although few-shot fine-tuning can mitigate the prediction bias and demonstrate promising prediction performance, our analysis shows models gain performance improvement by capturing non-task-related features (e.g. stop words) or shallow data patterns (e.g. lexical overlaps). These observations alert that pursuing model performance with fewer examples may incur pathological prediction behavior, which requires further sanity check on model predictions and careful design in model evaluations in few-shot fine-tuning.

preprint2022arXiv

Structure-Grounded Pretraining for Text-to-SQL

Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set Spider-Realistic based on Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. STRUG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, STRUG achieves similar performance on Spider, and outperforms all baselines on more realistic sets. The Spider-Realistic dataset is available at https://doi.org/10.5281/zenodo.5205322.

preprint2022arXiv

WALNUT: A Benchmark on Semi-weakly Supervised Learning for Natural Language Understanding

Building machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been proven valuable when large amount of labeled data is unavailable or expensive to obtain. Existing works studying weak supervision for NLU either mostly focus on a specific task or simulate weak supervision signals from ground-truth labels. It is thus hard to compare different approaches and evaluate the benefit of weak supervision without access to a unified and systematic benchmark with diverse tasks and real-world weak labeling rules. In this paper, we propose such a benchmark, named WALNUT (semi-WeAkly supervised Learning for Natural language Understanding Testbed), to advocate and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks with different types, including document-level and token-level prediction tasks. WALNUT is the first semi-weakly supervised learning benchmark for NLU, where each task contains weak labels generated by multiple real-world weak sources, together with a small set of clean labels. We conduct baseline evaluations on WALNUT to systematically evaluate the effectiveness of various weak supervision methods and model architectures. Our results demonstrate the benefit of weak supervision for low-resource NLU tasks and highlight interesting patterns across tasks. We expect WALNUT to stimulate further research on methodologies to leverage weak supervision more effectively. The benchmark and code for baselines are available at \url{aka.ms/walnut_benchmark}.

preprint2020arXiv

An Empirical Study of Software Exceptions in the Field using Search Logs

Software engineers spend a substantial amount of time using Web search to accomplish software engineering tasks. Such search tasks include finding code snippets, API documentation, seeking help with debugging, etc. While debugging a bug or crash, one of the common practices of software engineers is to search for information about the associated error or exception traces on the internet. In this paper, we analyze query logs from a leading commercial general-purpose search engine (GPSE) such as Google, Yahoo! or Bing to carry out a large scale study of software exceptions. To the best of our knowledge, this is the first large scale study to analyze how Web search is used to find information about exceptions. We analyzed about 1 million exception related search queries from a random sample of 5 billion web search queries. To extract exceptions from unstructured query text, we built a novel and high-performance machine learning model with a F1-score of 0.82. Using the machine learning model, we extracted exceptions from raw queries and performed popularity, effort, success, query characteristic and web domain analysis. We also performed programming language-specific analysis to give a better view of the exception search behavior. These techniques can help improve existing methods, documentation and tools for exception analysis and prediction. Further, similar techniques can be applied for APIs, frameworks, etc.

preprint2020arXiv

Analyzing Web Search Behavior for Software Engineering Tasks

Web search plays an integral role in software engineering (SE) to help with various tasks such as finding documentation, debugging, installation, etc. In this work, we present the first large-scale analysis of web search behavior for SE tasks using the search query logs from Bing, a commercial web search engine. First, we use distant supervision techniques to build a machine learning classifier to extract the SE search queries with an F1 score of 93%. We then perform an analysis on one million search sessions to understand how software engineering related queries and sessions differ from other queries and sessions. Subsequently, we propose a taxonomy of intents to identify the various contexts in which web search is used in software engineering. Lastly, we analyze millions of SE queries to understand the distribution, search metrics and trends across these SE search intents. Our analysis shows that SE related queries form a significant portion of the overall web search traffic. Additionally, we found that there are six major intent categories for which web search is used in software engineering. The techniques and insights can not only help improve existing tools but can also inspire the development of new tools that aid in finding information for SE related tasks.

preprint2020arXiv

Detecting Fake News with Weak Social Supervision

Limited labeled data is becoming the largest bottleneck for supervised learning systems. This is especially the case for many real-world tasks where large scale annotated examples are either too expensive to acquire or unavailable due to privacy or data access constraints. Weak supervision has shown to be a good means to mitigate the scarcity of annotated data by leveraging weak labels or injecting constraints from heuristic rules and/or external knowledge sources. Social media has little labeled data but possesses unique characteristics that make it suitable for generating weak supervision, resulting in a new type of weak supervision, i.e., weak social supervision. In this article, we illustrate how various aspects of social media can be used to generate weak social supervision. Specifically, we use the recent research on fake news detection as the use case, where social engagements are abundant but annotated examples are scarce, to show that weak social supervision is effective when facing the little labeled data problem. This article opens the door for learning with weak social supervision for other emerging tasks.

preprint2020arXiv

Distilling BERT into Simple Neural Networks with Unlabeled Transfer Data

Recent advances in pre-training huge models on large amounts of text through self supervision have obtained state-of-the-art results in various natural language processing tasks. However, these huge and expensive models are difficult to use in practise for downstream tasks. Some recent efforts use knowledge distillation to compress these models. However, we see a gap between the performance of the smaller student models as compared to that of the large teacher. In this work, we leverage large amounts of in-domain unlabeled transfer data in addition to a limited amount of labeled training instances to bridge this gap for distilling BERT. We show that simple RNN based student models even with hard distillation can perform at par with the huge teachers given the transfer set. The student performance can be further improved with soft distillation and leveraging teacher intermediate representations. We show that our student models can compress the huge teacher by up to 26x while still matching or even marginally exceeding the teacher performance in low-resource settings with small amount of labeled data. Additionally, for the multilingual extension of this work with XtremeDistil (Mukherjee and Hassan Awadallah, 2020), we demonstrate massive distillation of multilingual BERT-like teacher models by upto 35x in terms of parameter compression and 51x in terms of latency speedup for batch inference while retaining 95% of its F1-score for NER over 41 languages.

preprint2020arXiv

Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer

Multilingual representations embed words from many languages into a single semantic space such that words with similar meanings are close to each other regardless of the language. These embeddings have been widely used in various settings, such as cross-lingual transfer, where a natural language processing (NLP) model trained on one language is deployed to another language. While the cross-lingual transfer techniques are powerful, they carry gender bias from the source to target languages. In this paper, we study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications. We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations from both the intrinsic and extrinsic perspectives. Experimental results show that the magnitude of bias in the multilingual representations changes differently when we align the embeddings to different target spaces and that the alignment direction can also have an influence on the bias in transfer learning. We further provide recommendations for using the multilingual word representations for downstream tasks.

preprint2020arXiv

Learning with Weak Supervision for Email Intent Detection

Email remains one of the most frequently used means of online communication. People spend a significant amount of time every day on emails to exchange information, manage tasks and schedule events. Previous work has studied different ways for improving email productivity by prioritizing emails, suggesting automatic replies or identifying intents to recommend appropriate actions. The problem has been mostly posed as a supervised learning problem where models of different complexities were proposed to classify an email message into a predefined taxonomy of intents or classes. The need for labeled data has always been one of the largest bottlenecks in training supervised models. This is especially the case for many real-world tasks, such as email intent classification, where large scale annotated examples are either hard to acquire or unavailable due to privacy or data access constraints. Email users often take actions in response to intents expressed in an email (e.g., setting up a meeting in response to an email with a scheduling request). Such actions can be inferred from user interaction logs. In this paper, we propose to leverage user actions as a source of weak supervision, in addition to a limited set of annotated examples, to detect intents in emails. We develop an end-to-end robust deep neural network model for email intent identification that leverages both clean annotated data and noisy weak supervision along with a self-paced learning mechanism. Extensive experiments on three different intent detection tasks show that our approach can effectively leverage the weakly supervised data to improve intent detection in emails.

preprint2020arXiv

Leveraging Multi-Source Weak Social Supervision for Early Detection of Fake News

Social media has greatly enabled people to participate in online activities at an unprecedented rate. However, this unrestricted access also exacerbates the spread of misinformation and fake news online which might cause confusion and chaos unless being detected early for its mitigation. Given the rapidly evolving nature of news events and the limited amount of annotated data, state-of-the-art systems on fake news detection face challenges due to the lack of large numbers of annotated training instances that are hard to come by for early detection. In this work, we exploit multiple weak signals from different sources given by user and content engagements (referred to as weak social supervision), and their complementary utilities to detect fake news. We jointly leverage the limited amount of clean data along with weak signals from social engagements to train deep neural networks in a meta-learning framework to estimate the quality of different weak instances. Experiments on realworld datasets demonstrate that the proposed framework outperforms state-of-the-art baselines for early detection of fake news without using any user engagements at prediction time.

preprint2020arXiv

Smart To-Do : Automatic Generation of To-Do Items from Emails

Intelligent features in email service applications aim to increase productivity by helping people organize their folders, compose their emails and respond to pending tasks. In this work, we explore a new application, Smart-To-Do, that helps users with task management over emails. We introduce a new task and dataset for automatically generating To-Do items from emails where the sender has promised to perform an action. We design a two-stage process leveraging recent advances in neural text generation and sequence-to-sequence learning, obtaining BLEU and ROUGE scores of 0:23 and 0:63 for this task. To the best of our knowledge, this is the first work to address the problem of composing To-Do items from emails.

preprint2020arXiv

Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback

We study the task of semantic parse correction with natural language feedback. Given a natural language utterance, most semantic parsing systems pose the problem as one-shot translation where the utterance is mapped to a corresponding logical form. In this paper, we investigate a more interactive scenario where humans can further interact with the system by providing free-form natural language feedback to correct the system when it generates an inaccurate interpretation of an initial utterance. We focus on natural language to SQL systems and construct, SPLASH, a dataset of utterances, incorrect SQL interpretations and the corresponding natural language feedback. We compare various reference models for the correction task and show that incorporating such a rich form of feedback can significantly improve the overall semantic parsing accuracy while retaining the flexibility of natural language interaction. While we estimated human correction accuracy is 81.5%, our best model achieves only 25.1%, which leaves a large gap for improvement in future research. SPLASH is publicly available at https://aka.ms/Splash_dataset.

preprint2020arXiv

Uncertainty-aware Self-training for Text Classification with Few Labels

Recent success of large-scale pre-trained language models crucially hinge on fine-tuning them on large amounts of labeled data for the downstream task, that are typically expensive to acquire. In this work, we study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck by making use of large-scale unlabeled data for the target task. Standard self-training mechanism randomly samples instances from the unlabeled pool to pseudo-label and augment labeled data. In this work, we propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network leveraging recent advances in Bayesian deep learning. Specifically, we propose (i) acquisition functions to select instances from the unlabeled pool leveraging Monte Carlo (MC) Dropout, and (ii) learning mechanism leveraging model confidence for self-training. As an application, we focus on text classification on five benchmark datasets. We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation can perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labeled instances with an aggregate accuracy of 91% and improving by upto 12% over baselines.