Researcher profile

Wenhao Liu

Wenhao Liu contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, Unclaimed author
22 works
0 followers
12 topics
4 close collaborators

Published work

22 published items

preprint, 2026, arXiv

Benchmark^2: Systematic Evaluation of LLM Benchmarks

The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
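The third metric in the abstract above flags instances where a stronger model fails but a weaker model in the same family succeeds. A minimal sketch of that idea follows; it is not the paper's exact formula. The function names, the `results` layout, and the assumption that `family_order` lists models from weakest to strongest are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def inverted_instances(results: Dict[str, List[bool]],
                       family_order: List[str]) -> List[int]:
    """Return indices of benchmark items with a capability inversion.

    results maps a model name to per-item correctness (hypothetical layout).
    family_order lists models of one family from weakest to strongest
    (an assumption of this sketch, not something the abstract specifies).
    """
    n_items = len(results[family_order[0]])
    flagged = []
    for i in range(n_items):
        outcomes = [results[model][i] for model in family_order]
        # An inversion: some weaker model is correct while a stronger one is not.
        inverted = any(
            outcomes[w] and not outcomes[s]
            for w in range(len(outcomes))
            for s in range(w + 1, len(outcomes))
        )
        if inverted:
            flagged.append(i)
    return flagged


def deviation_rate(results: Dict[str, List[bool]],
                   family_order: List[str]) -> float:
    """Fraction of items flagged as inverted (an illustrative summary only)."""
    n_items = len(results[family_order[0]])
    return len(inverted_instances(results, family_order)) / n_items
```

Items flagged this way are candidates for manual review or removal when building a reduced test set; how the paper aggregates them into a score is not stated in the abstract.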

preprint, 2025, arXiv

Atomic-scale spin sensing of a 2D $d$-wave altermagnet via helical tunneling

Altermagnetism simultaneously possesses nonrelativistic spin responses and zero net magnetization, thus combining advantages of ferromagnetism and antiferromagnetism. This superiority originates from its unique dual feature, i.e., opposite-magnetic sublattices in real space and alternating spin polarization in momentum space enforced by the same crystal symmetry. Therefore, the determination of an altermagnetic order and its unique spin response inherently necessitates atomic-scale spin-resolved measurements in real and momentum spaces, an experimental milestone yet to be achieved. Here, by using the helical edge (hinge) modes of a higher-order topological insulator as the spin sensor, we realize spin-resolved scanning tunneling microscopy, which enables us to pin down the dual-space feature of a layered $d$-wave altermagnet, KV$_2$Se$_2$O. In real space, atomic-registered mapping demonstrates the checkerboard antiferromagnetic order together with density-wave lattice modulation, and in momentum space, spin-resolved spectroscopic imaging provides a direct visualization of $d$-wave spin splitting of the band structure. Critically, using this new topology-guaranteed spin filter we directly reveal the unidirectional, spin-polarized quasiparticle excitations originating from the crystal symmetry-paired X and Y valleys around opposite magnetic sublattices simultaneously -- the unique spin response for $d$-wave altermagnetism. Our experiments establish a solid basis for the exploration and utilization of altermagnetism in layered materials and further facilitate access to atomic-scale spin sensing and manipulation of 2D quantum materials.

preprint, 2025, arXiv

Observation of robust one-dimensional edge channels in a three-dimensional quantum spin Hall insulator

Topologically protected edge channels show prospects for quantum devices. They have been found experimentally in two-dimensional (2D) quantum spin Hall insulators (QSHIs), weak topological insulators and higher-order topological insulators (HOTIs), but the number of materials realizing these topologies is still quite limited. Here, we provide evidence for topological edge states within a novel topology named three-dimensional (3D) QSHIs. Its topology originates solely from a nonzero $S_z$ spin Chern number for each $k_z$ plane of the crystal and is realized in bulk $α$-Bi$_4$I$_4$ with trivial symmetry indicators, as we show by density functional theory calculations. We experimentally observe the related edge states at each type of monolayer and bilayer step of this material by scanning tunneling microscopy. Consistently, the edge states are neither interrupted nor backscattered by defects at the step edges, corroborating their helical character as expected from the nontrivial topology. Furthermore, two individual edge channels are directly observed at bilayer steps without visible interaction gap opening, demonstrating the robustness of these edge modes against vertical stacking. Our results establish $α$-Bi$_4$I$_4$ as the first material realization of a 3D QSHI whose definition goes beyond the scope of topological symmetry indicators, and provide a pathway for realizing nearly-quantized spin Hall conductivity per unit cell in a bulk crystal.

preprint, 2024, arXiv

The Dust Attenuation Scaling Relation of Star-Forming Galaxies in the EAGLE Simulations

Dust attenuation in star-forming galaxies (SFGs), as parameterized by the infrared excess (IRX $\equiv L_{\rm IR}/L_{\rm UV}$), is found to be tightly correlated with star formation rate (SFR), metallicity and galaxy size, following a universal IRX relation up to $z=3$. This scaling relation can provide a fundamental constraint for theoretical models to reconcile galaxy star formation, chemical enrichment, and structural evolution across cosmic time. We attempt to reproduce the universal IRX relation over $0.1\leq z\leq 2.5$ using the EAGLE hydrodynamical simulations and examine sensitive parameters in determining galaxy dust attenuation. Our findings show that while the predicted universal IRX relation from EAGLE approximately aligns with observations at $z\leq 0.5$, noticeable disparities arise at different stellar masses and higher redshifts. Specifically, we investigate how modifying various galaxy parameters can affect the predicted universal IRX relation in comparison to the observed data. We demonstrate that the simulated gas-phase metallicity is the critical quantity for the shape of the predicted universal IRX relation. We find that the influence of the infrared luminosity and infrared excess is less important while galaxy size has virtually no significant effect. Overall, the EAGLE simulations are not able to replicate some of the observed characteristics between IRX and galaxy parameters of SFGs, emphasizing the need for further investigation and testing for our current state-of-the-art theoretical models.

preprint, 2022, arXiv

A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis

Sentiment analysis is an important task in natural language processing. In recent works, pre-trained language models are often used to achieve state-of-the-art results, especially when training data is scarce. It is common to fine-tune on the downstream task, usually by adding task-specific layers on top of the model. In this paper, we focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities. In particular, we are interested in few-shot settings. We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention (GPT2 is used unless stated otherwise). This way, the model learns to accomplish the tasks via language generation without the need of training task-specific layers. Our evaluation results on the single-task polarity prediction show that our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in few-shot and full-shot settings. More importantly, our generative approach significantly reduces the model variance caused by low-resource data. We further demonstrate that the proposed generative language model can handle joint and multi-task settings, unlike previous work. We observe that the proposed sequence generation method achieves further improved performances on polarity prediction when the model is trained via joint and multi-task settings. Further evaluation on similar sentiment analysis datasets, SST-2, SST- and OOS intent detection validates the superiority and noise robustness of generative language model in few-shot settings.

preprint, 2022, arXiv

A three-stage magnetic phase transition revealed in ultrahigh-quality van der Waals magnet CrSBr

van der Waals (vdW) magnets are receiving ever-growing attention nowadays due to their significance in both fundamental research on low-dimensional magnetism and potential applications in spintronic devices. High crystalline quality of vdW magnets is key for maintaining intrinsic magnetic and electronic properties, especially when exfoliated down to the 2D limit. Here, ultrahigh-quality air-stable vdW CrSBr crystals are synthesized using the direct vapor-solid synthesis method. The high single crystallinity and spatial homogeneity have been thoroughly evidenced at length scales from sub-mm to atomic resolution by X-ray diffraction, second harmonic generation, and scanning transmission electron microscopy. More importantly, specific heat measurements of these ultrahigh quality CrSBr crystals show three thermodynamic anomalies at 185K, 156K, and 132K, revealing a stage-by-stage development of the magnetic order upon cooling, which is also corroborated with the magnetization and transport results. Our ultrahigh-quality CrSBr can further be exfoliated down to monolayers and bilayers easily, paving the way to integrate them into heterostructures for spintronic and magneto-optoelectronic applications.

preprint, 2022, arXiv

CaPE: Contrastive Parameter Ensembling for Reducing Hallucination in Abstractive Summarization

Hallucination is a known issue for neural abstractive summarization models. Recent work suggests that the degree of hallucination may depend on errors in the training data. In this work, we propose a new method called Contrastive Parameter Ensembling (CaPE) to use training data more effectively, utilizing variations in noise in training samples to reduce hallucination. We first select clean and noisy subsets from the training data using different automatic factual metrics. Then, we fine-tune a base summarization model, which is trained on all training samples, on the clean (noisy) subset to obtain an \textit{expert} (\textit{anti-expert}) model. Finally, we adjust the parameters of base model by the difference between parameters of the \textit{expert} and \textit{anti-expert} models, steering the base model towards the \textit{expert} model and away from the \textit{anti-expert} model. Experimental results show that CaPE improves performance across different automatic factual metrics and human evaluation, with the maximum improvement of 16.69\% and 15.78\% on summary-level dependency-arc entailment accuracy for the XSUM and CNN/DM datasets. The improvement in factual performance does not degrade the performance on other metrics of informativeness such as ROUGE.
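The final step described in the abstract above, shifting the base model's parameters by the difference between the expert and anti-expert parameters, can be sketched as follows. This is an illustrative reading, not the authors' implementation: the scaling coefficient `alpha`, the function name, and the use of plain floats in place of weight tensors are assumptions of this sketch.

```python
from typing import Dict


def contrastive_ensemble(base: Dict[str, float],
                         expert: Dict[str, float],
                         anti_expert: Dict[str, float],
                         alpha: float) -> Dict[str, float]:
    """Move base parameters toward the expert and away from the anti-expert.

    Each dict maps a parameter name to its value. Plain floats stand in for
    weight tensors here; alpha is a hypothetical scaling coefficient.
    """
    return {
        name: base[name] + alpha * (expert[name] - anti_expert[name])
        for name in base
    }


# Toy usage with a single scalar "weight".
adjusted = contrastive_ensemble(
    base={"w": 1.0},
    expert={"w": 1.5},
    anti_expert={"w": 0.5},
    alpha=0.5,
)
```

With `alpha` set to zero the base model is returned unchanged, so the coefficient controls how strongly the contrast between the two fine-tuned models is applied.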

preprint, 2022, arXiv

Chandra view of Abell 407: the central compact group of galaxies and the interaction between the radio AGN and the ICM

Abell 407 (A407) is a unique galaxy cluster hosting a central compact group of nine galaxies (named as 'Zwicky's Nonet'; G1 - G9 in this work) within a 30 kpc radius region. The cluster core also hosts a luminous radio active galactic nucleus (AGN), 4C 35.06 with helically twisted jets extending over 200 kpc. With a 44 ks Chandra observation of A407, we characterize the X-ray properties of its intracluster medium (ICM) and central galaxies. The mean X-ray temperature of A407 is 2.7 keV and the $M_{200}$ is $1.9 \times 10^{14} {M_{\odot}}$. We suggest that A407 has a weak cool core at $r < 60$ kpc scales and at its very center, $< 1$-2 kpc radius, a small galaxy corona associated with the strong radio AGN. We also conclude that the AGN 4C 35.06 host galaxy is most likely G3. We suggest that the central group of galaxies is undergoing a 'slow merge' procedure. The range of the merging time-scale is $0.3\sim2.3$ Gyr and the stellar mass of the future brightest cluster galaxy (BCG) will be $7.4\times10^{11} M_{\odot}$. We find that the regions which overlap with the radio jets have higher temperature and metallicity. This is consistent with AGN feedback activity. The central entropy is higher than that for other clusters, which may be due to the AGN feedback and/or merging activity. With all these facts, we suggest that A407 is a unique and rare system in the local universe that could help us to understand the formation of a massive BCG.

preprint, 2022, arXiv

Converse: A Tree-Based Modular Task-Oriented Dialogue System

Creating a system that can have meaningful conversations with humans to help accomplish tasks is one of the ultimate goals of Artificial Intelligence (AI). It has defined the meaning of AI since the beginning. A lot has been accomplished in this area recently, with voice assistant products entering our daily lives and chat bot systems becoming commonplace in customer service. At first glance there seems to be no shortage of options for dialogue systems. However, the frequently deployed dialogue systems today seem to all struggle with a critical weakness - they are hard to build and harder to maintain. At the core of the struggle is the need to script every single turn of interactions between the bot and the human user. This makes the dialogue systems more difficult to maintain as the tasks become more complex and more tasks are added to the system. In this paper, we propose Converse, a flexible tree-based modular task-oriented dialogue system. Converse uses an and-or tree structure to represent tasks and offers powerful multi-task dialogue management. Converse supports task dependency and task switching, which are unique features compared to other open-source dialogue frameworks. At the same time, Converse aims to make the bot building process easy and simple, for both professional and non-professional software developers. The code is available at https://github.com/salesforce/Converse.
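The and-or tree mentioned in the abstract above can be illustrated with a small sketch: an AND node is complete when all of its children are complete, and an OR node when at least one is. The `TaskNode` class, `is_complete` function, and task names below are hypothetical and do not reflect the actual API of the Converse repository.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TaskNode:
    """One node of an and-or task tree (hypothetical structure)."""
    name: str
    kind: str = "leaf"  # one of "leaf", "and", "or"
    children: List["TaskNode"] = field(default_factory=list)


def is_complete(node: TaskNode, done: Set[str]) -> bool:
    """Check whether a task is satisfied given the set of finished leaf tasks."""
    if node.kind == "leaf":
        return node.name in done
    child_states = [is_complete(child, done) for child in node.children]
    if node.kind == "and":
        return all(child_states)
    if node.kind == "or":
        return any(child_states)
    raise ValueError(f"unknown node kind: {node.kind}")


# Toy task: checking an order needs identity verification (by email OR phone)
# AND an order id.
verify_user = TaskNode("verify_user", "or", [
    TaskNode("verify_by_email"),
    TaskNode("verify_by_phone"),
])
check_order = TaskNode("check_order", "and", [
    verify_user,
    TaskNode("get_order_id"),
])
```

Because subtrees such as `verify_user` are ordinary nodes, they can be shared between tasks, which is one plausible way a tree representation avoids scripting every dialogue turn by hand.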

preprint, 2022, arXiv

DialFact: A Benchmark for Fact-Checking in Dialogue

Fact-checking is an essential tool to mitigate the spread of misinformation and disinformation. We introduce the task of fact-checking in dialogue, which is a relatively unexplored area. We construct DialFact, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DialFact: 1) Verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) Evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) Claim verification task predicts a dialogue response to be supported, refuted, or not enough information. We found that existing fact-checking models trained on non-dialogue data like FEVER fail to perform well on our task, and thus, we propose a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue. We point out unique challenges in DialFact such as handling the colloquialisms, coreferences and retrieval ambiguities in the error analysis to shed light on future research in this direction.

preprint, 2022, arXiv

Exploring Neural Models for Query-Focused Summarization

Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. While recently released datasets, such as QMSum or AQuaMuSe, facilitate research efforts in QFS, the field lacks a comprehensive study of the broad space of applicable modeling methods. In this paper we conduct a systematic exploration of neural approaches to QFS, considering two general classes of methods: two-stage extractive-abstractive solutions and end-to-end models. Within those categories, we investigate existing models and explore strategies for transfer learning. We also present two modeling extensions that achieve state-of-the-art performance on the QMSum dataset, up to a margin of 3.38 ROUGE-1, 3.72 ROUGE-2, and 3.28 ROUGE-L when combined with transfer learning strategies. Results from human evaluation suggest that the best models produce more comprehensive and factually consistent summaries compared to a baseline model. Code and checkpoints are made publicly available: https://github.com/salesforce/query-focused-sum.

preprint, 2022, arXiv

MixQG: Neural Question Generation with Mixed Answer Types

Asking good questions is an essential ability for both human and machine intelligence. However, existing neural question generation approaches mainly focus on the short factoid type of answers. In this paper, we propose a neural question generator, MixQG, to bridge this gap. We combine 9 question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstractive answers, to train a single generative model. We show with empirical results that our model outperforms existing work in both seen and unseen domains and can generate questions with different cognitive levels when conditioned on different answer types. Our code is released and well-integrated with the Huggingface library to facilitate various downstream applications.

preprint, 2022, arXiv

Open Vocabulary Object Detection with Pseudo Bounding-Box Labels

Despite great progress in object detection, most existing methods work only on a limited set of object categories, due to the tremendous human effort needed for bounding-box annotations of training data. To alleviate the problem, recent open vocabulary and zero-shot detection methods attempt to detect novel object categories beyond those seen during training. They achieve this goal by training on a pre-defined set of base categories to induce generalization to novel objects. However, their potential is still constrained by the small set of base categories available for training. To enlarge the set of base classes, we propose a method to automatically generate pseudo bounding-box annotations of diverse objects from large-scale image-caption pairs. Our method leverages the localization ability of pre-trained vision-language models to generate pseudo bounding-box labels and then directly uses them for training object detectors. Experimental results show that our method outperforms the state-of-the-art open vocabulary detector by 8% AP on COCO novel categories, by 6.3% AP on PASCAL VOC, by 2.3% AP on Objects365 and by 2.8% AP on LVIS. Code is available at https://github.com/salesforce/PB-OVD.

preprint, 2022, arXiv

QAConv: Question Answering on Informative Conversations

This paper introduces QAConv, a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations, including business emails, panel discussions, and work channels. Unlike open-domain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,608 QA pairs from 10,259 selected conversations with both human-written and machine-generated questions. We use a question generator and a dialogue summarizer as auxiliary tools to collect and recommend questions. The dataset has two testing scenarios: chunk mode and full mode, depending on whether the grounded partial conversation is provided or retrieved. Experimental results show that state-of-the-art pretrained QA systems have limited zero-shot performance and tend to predict our questions as unanswerable. Our dataset provides a new training and evaluation testbed to facilitate QA on conversations research.

preprint, 2022, arXiv

QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization

Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost.

preprint, 2022, arXiv

Quiz Design Task: Helping Teachers Create Quizzes with Automated Question Generation

Question generation (QGen) models are often evaluated with standardized NLG metrics that are based on n-gram overlap. In this paper, we measure whether these metric improvements translate to gains in a practical setting, focusing on the use case of helping teachers automate the generation of reading comprehension quizzes. In our study, teachers building a quiz receive question suggestions, which they can either accept or refuse with a reason. Even though we find that recent progress in QGen leads to a significant increase in question acceptance rates, there is still large room for improvement, with the best model having only 68.4% of its questions accepted by the ten teachers who participated in our study. We then leverage the annotations we collected to analyze standard NLG metrics and find that model performance has reached projected upper-bounds, suggesting new automatic metrics are needed to guide QGen research forward.

preprint, 2022, arXiv

Structure Extraction in Task-Oriented Dialogues with Slot Clustering

Extracting structure information from dialogue data can help us better understand user and system behaviors. In task-oriented dialogues, dialogue structure has often been considered as transition graphs among dialogue states. However, annotating dialogue states manually is expensive and time-consuming. In this paper, we propose a simple yet effective approach for structure extraction in task-oriented dialogues. We first detect and cluster possible slot tokens with a pre-trained model to approximate dialogue ontology for a target domain. Then we track the status of each identified token group and derive a state transition structure. Empirical results show that our approach outperforms unsupervised baseline models by far in dialogue structure extraction. In addition, we show that data augmentation based on extracted structures enriches the surface formats of training data and can achieve a significant performance boost in dialogue response generation.

preprint, 2022, arXiv

Submillimetre galaxies in two massive protoclusters at z = 2.24: witnessing the enrichment of extreme starbursts in the outskirts of HAE density peaks

Submillimetre galaxies represent a rapid growth phase of both star formation and massive galaxies. Mapping SMGs in galaxy protoclusters provides key insights into where and how these extreme starbursts take place in connections with the assembly of the large-scale structure in the early Universe. We search for SMGs at 850$\,μm$ using JCMT/SCUBA-2 in two massive protoclusters at $z=2.24$, BOSS1244 and BOSS1542, and detect 43 and 54 sources with $S_{850}>4\,$mJy at the $4σ$ level within an effective area of 264$\,$arcmin$^2$, respectively. We construct the intrinsic number counts and find that the abundance of SMGs is $2.0\pm0.3$ and $2.1\pm0.2$ times that of the general fields, confirming that BOSS1244 and BOSS1542 contain a higher fraction of dusty galaxies with strongly enhanced star formation. The volume densities of the SMGs are estimated to be $\sim15-$30 times the average, significantly higher than the overdensity factor ($\sim 6$) traced by H$α$ emission-line galaxies (HAEs). More importantly, we discover a prominent offset between the spatial distributions of the two populations in these two protoclusters -- SMGs are mostly located around the high-density regions of HAEs, and few are seen inside these regions. This finding may have revealed for the first time the occurrence of violent star formation enhancement in the outskirts of the HAE density peaks, likely driven by the boosting of gas supplies and/or starburst triggering events. Meanwhile, the lack of SMGs inside the most overdense regions at $z\sim2$ implies a transition to the environment disfavouring extreme starbursts.

preprint, 2022, arXiv

Systematic biases in determining dust attenuation curves through galaxy SED fitting

While the slope of the dust attenuation curve ($δ$) is found to correlate with effective dust attenuation ($A_V$) as obtained through spectral energy distribution (SED) fitting, it remains unknown how the fitting degeneracies shape this relation. We examine the degeneracy effects by fitting SEDs of a sample of local star-forming galaxies (SFGs) selected from the Galaxy And Mass Assembly survey, in conjunction with mock galaxy SEDs of known attenuation parameters. A well-designed declining starburst star formation history is adopted to generate model SED templates with intrinsic UV slope ($β_0$) spanning over a reasonably wide range. The best-fitting $β_0$ for our sample SFGs shows a wide coverage, dramatically differing from the limited range of $β_0<-2.2$ for a starburst of constant star formation. Our results show that strong degeneracies between $β_0$, $δ$, and $A_V$ in the SED fitting induce systematic biases leading to a false $A_V$--$δ$ correlation. Our simulation tests reveal that this relationship can be well reproduced even when a flat $A_V$--$δ$ relation is taken to build the input model galaxy SEDs. The variations in best-fitting $δ$ are dominated by the fitting errors. We show that assuming a starburst with constant star formation in SED fitting will result in a steeper attenuation curve, smaller degeneracy errors, and a stronger $A_V$--$δ$ relation. Our findings confirm that the $A_V$--$δ$ relation obtained through SED fitting is likely driven by the systematic biases induced by the fitting degeneracies between $β_0$, $δ$, and $A_V$.

preprint, 2022, arXiv

The Physical Properties of Star-Forming Galaxies with Strong [O III] Lines at z=3.25

We present an analysis of physical properties of 34 [O III] emission-line galaxies (ELGs) at z=3.254$\pm$0.029 in the Extended Chandra Deep Field South (ECDFS). These ELGs are selected from deep narrow H2S(1) and broad Ks imaging of 383 arcmin$^{2}$ obtained with CFHT/WIRCam. We construct spectral energy distributions (SEDs) from U to Ks to derive the physical properties of ELGs. These [O III] ELGs are identified as starburst galaxies with strong [O III] lines of L([O III]) ~ 10$^{42.6}$ - 10$^{44.2}$ erg s$^{-1}$, and have stellar masses of M* ~ 10$^{9.0}$-10$^{10.6}$ M$_\odot$ and star formation rates of ~ 10-210 M$_\odot$ yr$^{-1}$. Our results show that 24% of our sample galaxies are dusty with Av > 1 mag and EW(OIII)$_{rest}$ ~ 70-500 $Å$, which are often missed in optically selected [O III] ELG samples. Their rest-frame UV and optical morphologies from HST/ACS and HST/WFC3 deep imaging reveal that these [O III] ELGs are mostly multiple-component systems (likely mergers) or compact. And 20% of them are nearly invisible in the rest-frame UV owing to heavy dust attenuation. Interestingly, we find that our samples reside in an overdensity consisting of two components: one southeast (SE) with an overdensity factor of $δ_{gal}$ ~ 41 over a volume of 13$^{3}$ cMpc$^{3}$ and the other northwest (NW) with $δ_{gal}$ ~ 38 over a volume of 10$^{3}$ cMpc$^{3}$. The two overdense substructures are expected to be virialized at z=0 with a total mass of ~ 1.1 x 10$^{15}$ M$_\odot$ and ~ 4.8 x 10$^{14}$ M$_\odot$, and probably merge into a Coma-like galaxy cluster.

preprint, 2021, arXiv

Enhanced Superconductivity in the Se-substituted 1T-PdTe$_2$

The two-dimensional transition metal dichalcogenide PdTe$_2$ has recently attracted much attention due to the coexistence of a type-II Dirac semimetal phase and type-I superconductivity. Here we report a 67% enhancement of the superconducting transition temperature in 1T-PdSeTe relative to PdTe$_2$, achieved through partial substitution of Te atoms by Se. The superconductivity has been unambiguously confirmed by magnetization, resistivity and specific heat measurements. 1T-PdSeTe shows type-II superconductivity with large anisotropy and a non-bulk superconducting nature, with a volume fraction of ~20% estimated from magnetic and heat capacity measurements. 1T-PdSeTe expands the family of superconducting transition metal dichalcogenides and thus provides additional insights for understanding superconductivity and topological physics in the 1T-PdTe$_2$ system.

preprint, 2020, arXiv

AGN feedback in the FR II galaxy 3C 220.1

We present results from a deep (174 ks) Chandra observation of the FR-II radio galaxy 3C 220.1, the central brightest cluster galaxy (BCG) of a $kT \sim$ 4 keV cluster at $z=0.61$. The temperature of the hot cluster medium drops from $\sim5.9$ keV to $\sim3.9$ keV at $\sim$ 35 kpc radius, while the temperature at smaller radii may be substantially lower. The central active galactic nucleus (AGN) outshines the whole cluster in X-rays, with a bolometric luminosity of $2.0\times10^{46}$ erg s$^{-1}$ ($\sim10$% of the Eddington rate). The system shows a pair of potential X-ray cavities $\sim35$ kpc east and west of the nucleus. The cavity power is estimated within the range of $1.0\times10^{44}$ erg s$^{-1}$ and $1.7\times10^{45}$ erg s$^{-1}$, from different methods. The X-ray enhancements in the radio lobes could be due to inverse Compton emission, with a total 2-10 keV luminosity of $\sim8.0\times10^{42}$ erg s$^{-1}$. We compare 3C 220.1 with other cluster BCGs, including Cygnus A, as there are few BCGs in rich clusters hosting an FR-II galaxy. We also summarize the jet power of FR-II galaxies from different methods. The comparison suggests that the cavity power of FR-II galaxies likely under-estimates the jet power. The properties of 3C 220.1 suggest that it is at the transition stage from quasar-mode feedback to radio-mode feedback.