Researcher profile

Yi Mao

Yi Mao contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 (Emerging). Verification L1. Unclaimed author.
16 works
0 followers
7 topics
4 close collaborators

Published work

16 published items

Preprint, 2022, arXiv

A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation

Large pretrained generative models like GPT-3 often suffer from hallucinating non-existent or incorrect content, which undermines their potential merits in real applications. Existing work usually attempts to detect these hallucinations based on a corresponding oracle reference at a sentence or document level. However, ground-truth references may not be readily available for many free-form text generation applications, and sentence- or document-level detection may fail to provide the fine-grained signals that would prevent fallacious content in real time. As a first step toward addressing these issues, we propose a novel token-level, reference-free hallucination detection task and an associated annotated dataset named HaDes (HAllucination DEtection dataSet). To create this dataset, we first perturb a large number of text segments extracted from English-language Wikipedia, and then verify these with crowd-sourced annotations. To mitigate label imbalance during annotation, we utilize an iterative model-in-loop strategy. We conduct comprehensive data analyses and create multiple baseline models.

Preprint, 2022, arXiv

An End-to-End Dialogue Summarization System for Sales Calls

Summarizing sales calls is a routine task performed manually by salespeople. We present a production system which combines generative models fine-tuned for the customer-agent setting with a human-in-the-loop user experience for an interactive summary curation process. We address challenging aspects of the dialogue summarization task in a real-world setting, including long input dialogues, content validation, lack of labeled data and quality evaluation. We show how GPT-3 can be leveraged as an offline data labeler to handle training data scarcity and accommodate privacy constraints in an industrial setting. Experiments show significant improvements by our models in tackling the summarization and content validation tasks on public datasets.

Preprint, 2022, arXiv

Estimation of HII Bubble Size Distribution from 21cm Power Spectrum with Artificial Neural Networks

The bubble size distribution of ionized hydrogen (H II) regions probes the morphology of H II bubbles during reionization. Conventionally, the H II bubble size distribution can be derived from the tomographic imaging data of the redshifted 21 cm signal from the epoch of reionization, which, however, is observationally challenging even for the upcoming large radio interferometer arrays. Given that these interferometers promise to measure the 21 cm power spectrum accurately, we propose a new method, based on artificial neural networks (ANN), to reconstruct the H II bubble size distribution from the 21 cm power spectrum. We demonstrate that the reconstruction from the 21 cm power spectrum can be almost as accurate as that directly measured from the imaging data, with a fractional error $\lesssim 10\%$, even with thermal noise at the sensitivity level of the Square Kilometre Array. Nevertheless, the reconstruction implicitly exploits the modelling in reionization simulations, and hence the recovered H II bubble size distribution is not a summary statistic independent of the power spectrum, and should be used only as an indicator for understanding H II bubble morphology and its evolution.

Preprint, 2022, arXiv

Implicit Likelihood Inference of Reionization Parameters from the 21 cm Power Spectrum

The first measurements of the 21 cm brightness temperature power spectrum from the epoch of reionization will very likely be achieved in the near future by radio interferometric array experiments such as the Hydrogen Epoch of Reionization Array (HERA) and the Square Kilometre Array (SKA). Standard MCMC analyses use an explicit likelihood approximation to infer the reionization parameters from the 21 cm power spectrum. In this paper, we present a new Bayesian inference of the reionization parameters where the likelihood is implicitly defined through forward simulations using density estimation likelihood-free inference (DELFI). Realistic effects including thermal noise and foreground avoidance are also applied to the mock observations from HERA and the SKA. We demonstrate that this method recovers accurate posterior distributions for the reionization parameters, and outperforms the standard MCMC analysis in terms of the location and size of credible parameter regions. With a processing time of minutes once the network is trained, this technique is a promising approach for the scientific interpretation of future 21 cm power spectrum observation data. Our code, 21cmDELFI-PS, is publicly available.

Preprint, 2022, arXiv

Knowledge-Grounded Dialogue Generation with a Unified Knowledge Representation

Knowledge-grounded dialogue systems are challenging to build due to the lack of training data and heterogeneous knowledge sources. Existing systems perform poorly on unseen topics due to limited topics covered in the training data. In addition, heterogeneous knowledge sources make it challenging for systems to generalize to other tasks because knowledge sources in different knowledge representations require different knowledge encoders. To address these challenges, we present PLUG, a language model that homogenizes different knowledge sources to a unified knowledge representation for knowledge-grounded dialogue generation tasks. PLUG is pre-trained on a dialogue generation task conditioned on a unified essential knowledge representation. It can generalize to different downstream knowledge-grounded dialogue generation tasks with a few training examples. The empirical evaluation on two benchmarks shows that our model generalizes well across different knowledge-grounded tasks. It can achieve comparable performance with state-of-the-art methods under a fully-supervised setting and significantly outperforms other methods in zero-shot and few-shot settings.

Preprint, 2022, arXiv

OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering

The information in tables can be an important complement to text, making table-based question answering (QA) systems of great value. The intrinsic complexity of handling tables often adds an extra burden to both model design and data annotation. In this paper, we aim to develop a simple table-based QA model with minimal annotation effort. Motivated by the fact that table-based QA requires both alignment between questions and tables and the ability to perform complicated reasoning over multiple table elements, we propose an omnivorous pretraining approach that consumes both natural and synthetic data to endow models with these respective abilities. Specifically, given freely available tables, we leverage retrieval to pair them with relevant natural sentences for mask-based pretraining, and synthesize NL questions by converting SQL sampled from tables for pretraining with a QA loss. We perform extensive experiments in both few-shot and full settings, and the results clearly demonstrate the superiority of our model OmniTab, with the best multitasking approach achieving an absolute gain of 16.2% and 2.7% in 128-shot and full settings respectively, also establishing a new state-of-the-art on WikiTableQuestions. Detailed ablations and analyses reveal different characteristics of natural and synthetic data, shedding light on future directions in omnivorous pretraining. Code, pretraining data, and pretrained models are available at https://github.com/jzbjyb/OmniTab.

Preprint, 2022, arXiv

Theoretical Models of the Atomic Hydrogen Content in Dark Matter Halos

Atomic hydrogen (H I) gas, mostly residing in dark matter halos after cosmic reionization, is the fuel for star formation. Its relation to the properties of the host halo is the key to understanding the cosmic H I distribution. In this work, we propose a flexible, empirical model of the H I-halo relation. In this model, while the H I mass depends primarily on the mass of the host halo, there is also a secondary dependence on other halo properties. We apply our model to the observational data of the Arecibo Legacy Fast ALFA Survey (ALFALFA), and find that it can successfully fit the cosmic H I abundance ($\Omega_{\rm HI}$), the average H I-halo mass relation $\langle M_{\rm HI}|M_{\rm h}\rangle$, and the H I clustering. The best fit to the ALFALFA data rejects with high confidence the model with no secondary halo dependence of H I mass and the model with secondary dependence on the halo spin parameter ($\lambda$), and shows strong dependence on halo formation time ($a_{1/2}$) and halo concentration ($c_{\rm vir}$). In an attempt to explain these findings from the perspective of hydrodynamical simulations, we find that the IllustrisTNG simulation confirms the dependence of H I mass on secondary halo parameters. However, the IllustrisTNG results show strong dependence on $\lambda$ and weak dependence on $c_{\rm vir}$ and $a_{1/2}$, and also predict a much larger value of H I clustering on large scales than observations. This discrepancy between the simulation and observation calls for improvements in understanding the H I-halo relation from both theoretical and observational sides.

Preprint, 2021, arXiv

Antisymmetric Cross-correlation between H I and CO Line Intensity Maps as a New Probe of Cosmic Reionization

Intensity mapping of the H I 21 cm line and of the CO 2.61 mm line from the epoch of reionization have emerged as powerful, complementary probes of the high-redshift Universe. However, both maps and their cross-correlation are dominated by foregrounds. We propose a new analysis by which the signal is unbiased by foregrounds, i.e. it can be measured without foreground mitigation. We construct the antisymmetric part of the two-point cross-correlation between intensity maps of the H I 21 cm line and the CO 2.61 mm line, which arises because the statistical fluctuations of the two fields evolve differently in time. We show that the sign of this new signal can distinguish, model-independently, whether inside-out reionization happens during some interval of time. More importantly, within the framework of the excursion set model of reionization, we demonstrate that the slope of the dipole of the H I-CO cross-power spectrum at large scales is linear in the rate of change of the global neutral fraction of hydrogen, in a manner independent of reionization parameters, until the slope levels out near the end of reionization; this trend might, however, depend on the framework of reionization modelling. The H I-CO dipole may be a smoking-gun probe for the speed of reionization, or a "standard speedometer" for cosmic reionization. Observations of this new signal will unveil the global reionization history from the midpoint to near the completion of reionization.

Preprint, 2021, arXiv

Robust Intensity Mapping Analysis against Foregrounds for the Epoch of Reionization

Intensity mapping of the HI 21 cm line and of the CO 2.61 mm line from the epoch of reionization have emerged as powerful, complementary probes of the high-redshift Universe. However, both maps and their cross-correlation are dominated by foregrounds. We propose a new analysis by which the signal is unbiased by foregrounds, i.e. it can be measured without foreground mitigation. We construct the antisymmetric part of the HI-CO cross-correlation, which arises because the statistical fluctuations of the two fields evolve differently in time. We show that the sign of this new signal can distinguish, model-independently, whether inside-out reionization happens during some interval of time.

Preprint, 2021, arXiv

Simulation-Based Inference of Reionization Parameters From 3D Tomographic 21 cm Lightcone Images

Tomographic three-dimensional 21 cm images from the epoch of reionization contain a wealth of information about the reionization of the intergalactic medium by astrophysical sources. Conventional power spectrum analysis cannot exploit the full information in the 21 cm data because the 21 cm signal is highly non-Gaussian due to reionization patchiness. We perform a Bayesian inference of the reionization parameters where the likelihood is implicitly defined through forward simulations using density estimation likelihood-free inference (DELFI). We adopt a trained 3D Convolutional Neural Network (CNN) to compress the 3D image data into informative summaries (DELFI-3D CNN). We show that this method recovers accurate posterior distributions for the reionization parameters. Our approach outperforms earlier analysis based on two-dimensional 21 cm images. In contrast, an MCMC analysis of the 3D lightcone-based 21 cm power spectrum alone and using a standard explicit likelihood approximation results in less accurate credible parameter regions than inferred by the DELFI-3D CNN, both in terms of the location and shape of the contours. Our proof-of-concept study implies that the DELFI-3D CNN can effectively exploit more information in the 3D 21 cm images than a 2D CNN or power spectrum analysis. This technique can be readily extended to include realistic effects and is therefore a promising approach for the scientific interpretation of future 21 cm observation data.

Preprint, 2020, arXiv

Conditional Self-Attention for Query-based Summarization

Self-attention mechanisms have achieved great success on a variety of NLP tasks due to their flexibility in capturing dependencies between arbitrary positions in a sequence. For problems such as query-based summarization (Qsumm) and knowledge graph reasoning, where each input sequence is associated with an extra query, explicitly modeling such conditional contextual dependencies can lead to a more accurate solution; these dependencies, however, cannot be captured by existing self-attention mechanisms. In this paper, we propose conditional self-attention (CSA), a neural network module designed for conditional dependency modeling. CSA works by adjusting the pairwise attention between input tokens in a self-attention module with the matching score of the inputs to the given query. Thereby, the contextual dependencies modeled by CSA will be highly relevant to the query. We further study variants of CSA defined by different types of attention. Experiments on the Debatepedia and HotpotQA benchmark datasets show that CSA consistently outperforms the vanilla Transformer and previous models for the Qsumm problem.
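
The idea in this abstract, adjusting pairwise token attention by each token's matching score against the query, can be illustrated with a small NumPy sketch. This is not the authors' formulation: the dot-product matching score, the multiplicative reweighting, and the function names `softmax` and `conditional_self_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_self_attention(tokens, query):
    """Illustrative query-conditioned self-attention (assumed form).

    tokens: array of shape (T, d), one row per input token.
    query:  array of shape (d,), the conditioning query.
    Returns (output, weights) with shapes (T, d) and (T, T).
    """
    d = tokens.shape[-1]
    # Assumed matching score: softmax over scaled token-query dot products.
    match = softmax(tokens @ query / np.sqrt(d))            # (T,)
    # Ordinary scaled dot-product self-attention weights.
    plain = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)  # (T, T)
    # Reweight each attended token by its query relevance, then renormalize
    # so every row is again a probability distribution.
    weights = plain * match[None, :]
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, weights

# Three orthogonal tokens; the query strongly matches the first one.
tokens = np.eye(3)
query = np.array([10.0, 0.0, 0.0])
output, weights = conditional_self_attention(tokens, query)
```

In this toy setup every token ends up attending mostly to the first token, because it is the only one that matches the query; with a zero query the reweighting is uniform and the module reduces to plain self-attention.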

Preprint, 2020, arXiv

Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning

In this work, we aim to equip pre-trained language models with structured knowledge. We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs. Building upon entity-level masked language models, our first contribution is an entity masking scheme that exploits relational knowledge underlying the text. This is fulfilled by using a linked knowledge graph to select informative entities and then masking their mentions. In addition, we use knowledge graphs to obtain distractors for the masked entities, and propose a novel distractor-suppressed ranking objective which is optimized jointly with the masked language model. In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training, to inject language models with structured knowledge via learning from raw text. It is more efficient than retrieval-based methods that perform entity linking and integration during finetuning and inference, and generalizes more effectively than methods that directly learn from concatenated graph triples. Experiments show that our proposed model achieves improved performance on five benchmark datasets, including question answering and knowledge base completion tasks.

Preprint, 2020, arXiv

Ly$\alpha$ forest power spectrum as an emerging window into the epoch of reionization and cosmic dawn

Conventional wisdom was that thermal relics from the epoch of reionization (EOR) would vanish swiftly. Recently, however, it was shown that these relics can survive to lower redshifts ($z \sim 2$) than previously thought, due to gas at mean density being heated to $T \sim 3 \times 10^4$ K by reionization, which is inhomogeneous, and shocks. Given the high sensitivities of upcoming Ly$\alpha$ forest surveys, this effect will be a novel broadband systematic for cosmological application. From the astrophysical point of view, however, the imprint of inhomogeneous reionization can shed light on the EOR and cosmic dawn. We utilize a hybrid method -- which includes two different simulation codes capable of handling the huge dynamical range -- to show the impact of patchy reionization on the Ly$\alpha$ forest and its dependence on different astrophysical scenarios. We found statistically significant deviations in the 1D Ly$\alpha$ power spectrum at $k = 0.14$ cMpc$^{-1}$ that range from $\sim 1\%$ at $z = 2$ up to almost $\sim 20\%$ at $z = 4$. The deviations in the 3D Ly$\alpha$ power spectrum, at the same wavenumber, are large and range from a few per cent at $z = 2$ up to $\sim 50\%$ at $z = 4$, although these deviations ignore the effect of He II reionization and AGN feedback at $z<4$. By exploiting different $k$-dependence of power spectrum among various astrophysical scenarios, the effect of patchy reionization on the Ly$\alpha$ forest power spectrum can open a new window into cosmic reionization and possibly cosmic dawn.

Preprint, 2020, arXiv

The Breakdown Scale of HI Bias Linearity

The 21 cm intensity mapping experiments promise to obtain the large-scale distribution of HI gas at the post-reionization epoch. In order to reveal the underlying matter density fluctuations from the HI mapping, it is important to understand how HI gas traces the matter density distribution. Both nonlinear halo clustering and nonlinear effects modulating HI gas in halos may determine the scale below which the HI bias deviates from linearity. We employ three approaches to generate the mock HI density from a large-scale N-body simulation at low redshifts, and demonstrate that the assumption of HI linearity is valid at the scale corresponding to the first peak of baryon acoustic oscillations, but breaks down at $k \gtrsim 0.1\,h\, {\rm Mpc}^{-1}$. The nonlinear effects of halo clustering and HI content modulation counteract each other at small scales, and their competition results in a model-dependent "sweet-spot" redshift near $z = 1$ where the HI bias is scale-independent down to small scales. We also find that the linear HI bias scales approximately linearly with redshift for $z\le 3$.

Preprint, 2019, arXiv

Testing the scale-dependent hemispherical asymmetry with the 21-cm power spectrum from the epoch of reionization

Hemispherical power asymmetry has emerged as a new challenge to the cosmology of the early universe. While the cosmic microwave background (CMB) measurements indicated the asymmetry amplitude $A \simeq 0.07$ at the CMB scale $k_{\rm CMB}\simeq 0.0045\,{\rm Mpc}^{-1}$, the high-redshift quasar observations found no significant deviation from statistical isotropy. This conflict can be reconciled in some scale-dependent asymmetry models. We put forward a new parameterization of the scale-dependent asymmetric power spectrum, inspired by a multi-speed inflation model. The 21-cm power spectrum from the epoch of reionization can be used to constrain the scale-dependent hemispherical asymmetry. We demonstrate that an optimum, multi-frequency observation by the Square Kilometre Array (SKA) Phase 2 can impose a constraint on the amplitude of the power asymmetry anomaly at the level of $\Delta A \simeq 0.2$ at $0.056 \lesssim k_{\rm 21cm} \lesssim 0.15 \,{\rm Mpc}^{-1}$. This limit may be further improved by an order of magnitude, to $\Delta A \simeq 0.01$, with a cosmic variance limited experiment such as the Omniscope.

Preprint, 2019, arXiv

The impact of inhomogeneous subgrid clumping on cosmic reionization

Cosmic reionization was driven by the imbalance between early sources and sinks of ionizing radiation, both of which were dominated by small-scale structure and are thus usually treated in cosmological reionization simulations by subgrid modelling. The recombination rate of intergalactic hydrogen is customarily boosted by a subgrid clumping factor, $\langle n^2\rangle/\langle n\rangle^2$, which corrects for unresolved fluctuations in gas density $n$ on scales below the grid-spacing of coarse-grained simulations. We investigate in detail the impact of this inhomogeneous subgrid clumping on reionization and its observables, as follows: (1) Previous attempts generally underestimated the clumping factor because of insufficient mass resolution. We perform a high-resolution $N$-body simulation that resolves haloes down to the pre-reionization Jeans mass to derive the time-dependent, spatially-varying local clumping factor and a fitting formula for its correlation with local overdensity. (2) We then perform a large-scale $N$-body and radiative transfer simulation that accounts for this inhomogeneous subgrid clumping by applying this clumping factor-overdensity correlation. Boosting recombination significantly slows the expansion of ionized regions, which delays completion of reionization and suppresses 21 cm power spectra on large scales in the later stages of reionization. (3) We also consider a simplified prescription in which the globally-averaged, time-evolving clumping factor from the same high-resolution $N$-body simulation is applied uniformly to all cells in the reionization simulation, instead. Observables computed with this model agree fairly well with those from the inhomogeneous clumping model, e.g. predicting 21 cm power spectra to within 20% error, suggesting it may be a useful approximation.
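
The clumping factor defined in this abstract is simple to compute from sampled densities. The sketch below is a minimal illustration and not the paper's pipeline: the function name `clumping_factor` and the toy density arrays are assumptions. Because recombination scales with the square of the gas density, a coarse cell that only knows its mean density underestimates the rate by exactly this factor.

```python
import numpy as np

def clumping_factor(density):
    """Return <n^2> / <n>^2 for an array of gas densities n.

    A perfectly uniform gas gives 1; any unresolved structure gives > 1.
    """
    n = np.asarray(density, dtype=float)
    return np.mean(n ** 2) / np.mean(n) ** 2

# Hypothetical sub-cell densities inside one coarse simulation cell.
uniform = np.ones(1000)
# Same mean density (1.0), but 10% of the volume sits in dense clumps.
clumpy = np.concatenate([np.full(900, 0.5), np.full(100, 5.5)])

c_uniform = clumping_factor(uniform)
c_clumpy = clumping_factor(clumpy)
```

Both toy cells have the same mean density, yet the clumpy one has a clumping factor above 3, so a simulation that ignores subgrid structure would underestimate its recombination rate by that factor.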