Researcher profile

Sheng Yu

Sheng Yu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
12works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

12 published item(s)

preprint2022arXiv

An Accurate Unsupervised Method for Joint Entity Alignment and Dangling Entity Detection

Knowledge graph integration typically suffers from the widely existing dangling entities that cannot find alignment cross knowledge graphs (KGs). The dangling entity set is unavailable in most real-world scenarios, and manually mining the entity pairs that consist of entities with the same meaning is labor-consuming. In this paper, we propose a novel accurate Unsupervised method for joint Entity alignment (EA) and Dangling entity detection (DED), called UED. The UED mines the literal semantic information to generate pseudo entity pairs and globally guided alignment information for EA and then utilizes the EA results to assist the DED. We construct a medical cross-lingual knowledge graph dataset, MedED, providing data for both the EA and DED tasks. Extensive experiments demonstrate that in the EA task, UED achieves EA results comparable to those of state-of-the-art supervised EA baselines and outperforms the current state-of-the-art EA methods by combining supervised EA data. For the DED task, UED obtains high-quality results without supervision.

preprint2022arXiv

Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations

Term clustering is important in biomedical knowledge graph construction. Using similarities between terms embedding is helpful for term clustering. State-of-the-art term embeddings leverage pretrained language models to encode terms, and use synonyms and relation knowledge from knowledge graphs to guide contrastive learning. These embeddings provide close embeddings for terms belonging to the same concept. However, from our probing experiments, these embeddings are not sensitive to minor textual differences which leads to failure for biomedical term clustering. To alleviate this problem, we adjust the sampling strategy in pretraining term embeddings by providing dynamic hard positive and negative samples during contrastive learning to learn fine-grained representations which result in better biomedical term clustering. We name our proposed method as CODER++, and it has been applied in clustering biomedical concepts in the newly released Biomedical Knowledge Graph named BIOS.

preprint2022arXiv

BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model

Pretrained language models have served as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are of critical importance, while understudied. Approaching natural language understanding (NLU) tasks as NLG achieves satisfying performance in the general domain through constrained language generation or language prompting. We emphasize the lack of in-domain generative language models and the unsystematic generative downstream benchmarks in the biomedical domain, hindering the development of the research community. In this work, we introduce the generative language model BioBART that adapts BART to the biomedical domain. We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition. BioBART pretrained on PubMed abstracts has enhanced performance compared to BART and set strong baselines on several tasks. Furthermore, we conduct ablation studies on the pretraining tasks for BioBART and find that sentence permutation has negative effects on downstream tasks.

preprint2022arXiv

BIOS: An Algorithmically Generated Biomedical Knowledge Graph

Biomedical knowledge graphs (BioMedKGs) are essential infrastructures for biomedical and healthcare big data and artificial intelligence (AI), facilitating natural language processing, model development, and data exchange. For decades, these knowledge graphs have been developed via expert curation; however, this method can no longer keep up with today's AI development, and a transition to algorithmically generated BioMedKGs is necessary. In this work, we introduce the Biomedical Informatics Ontology System (BIOS), the first large-scale publicly available BioMedKG generated completely by machine learning algorithms. BIOS currently contains 4.1 million concepts, 7.4 million terms in two languages, and 7.3 million relation triplets. We present the methodology for developing BIOS, including the curation of raw biomedical terms, computational identification of synonymous terms and aggregation of these terms to create concept nodes, semantic type classification of the concepts, relation identification, and biomedical machine translation. We provide statistics on the current BIOS content and perform preliminary assessments of term quality, synonym grouping, and relation extraction. The results suggest that machine learning-based BioMedKG development is a viable alternative to traditional expert curation.

preprint2022arXiv

Efficient Symptom Inquiring and Diagnosis via Adaptive Alignment of Reinforcement Learning and Classification

Medical automatic diagnosis aims to imitate human doctors in real-world diagnostic processes and to achieve accurate diagnoses by interacting with the patients. The task is formulated as a sequential decision-making problem with a series of symptom inquiring steps and the final diagnosis. Recent research has studied incorporating reinforcement learning for symptom inquiring and classification techniques for disease diagnosis, respectively. However, studies on efficiently and effectively combining the two procedures are still lacking. To address this issue, we devise an adaptive mechanism to align reinforcement learning and classification methods using distribution entropy as the medium. Additionally, we created a new dataset for patient simulation to address the lacking of large-scale evaluation benchmarks. The dataset is extracted from the MedlinePlus knowledge base and contains significantly more diseases and more comprehensive symptoms and examination information than existing datasets. Experimental evaluation shows that our method outperforms three current state-of-the-art methods on different datasets by achieving higher medical diagnosis accuracy with fewer inquiring turns.

preprint2022arXiv

Generative Biomedical Entity Linking via Knowledge Base-Guided Pre-training and Synonyms-Aware Fine-tuning

Entities lie in the heart of biomedical natural language understanding, and the biomedical entity linking (EL) task remains challenging due to the fine-grained and diversiform concept names. Generative methods achieve remarkable performances in general domain EL with less memory usage while requiring expensive pre-training. Previous biomedical EL methods leverage synonyms from knowledge bases (KB) which is not trivial to inject into a generative method. In this work, we use a generative approach to model biomedical EL and propose to inject synonyms knowledge in it. We propose KB-guided pre-training by constructing synthetic samples with synonyms and definitions from KB and require the model to recover concept names. We also propose synonyms-aware fine-tuning to select concept names for training, and propose decoder prompt and multi-synonyms constrained prefix tree for inference. Our method achieves state-of-the-art results on several biomedical EL tasks without candidate selection which displays the effectiveness of proposed pre-training and fine-tuning strategies.

preprint2022arXiv

Multimodal Learning on Graphs for Disease Relation Extraction

Objective: Disease knowledge graphs are a way to connect, organize, and access disparate information about diseases with numerous benefits for artificial intelligence (AI). To create knowledge graphs, it is necessary to extract knowledge from multimodal datasets in the form of relationships between disease concepts and normalize both concepts and relationship types. Methods: We introduce REMAP, a multimodal approach for disease relation extraction and classification. The REMAP machine learning approach jointly embeds a partial, incomplete knowledge graph and a medical language dataset into a compact latent vector space, followed by aligning the multimodal embeddings for optimal disease relation extraction. Results: We apply REMAP approach to a disease knowledge graph with 96,913 relations and a text dataset of 1.24 million sentences. On a dataset annotated by human experts, REMAP improves text-based disease relation extraction by 10.0% (accuracy) and 17.2% (F1-score) by fusing disease knowledge graphs with text information. Further, REMAP leverages text information to recommend new relationships in the knowledge graph, outperforming graph-based methods by 8.4% (accuracy) and 10.4% (F1-score). Conclusion: REMAP is a multimodal approach for extracting and classifying disease relationships by fusing structured knowledge and text information. REMAP provides a flexible neural architecture to easily find, access, and validate AI-driven relationships between disease concepts.

preprint2022arXiv

Semi-constraint Optimal Transport for Entity Alignment with Dangling Cases

Entity alignment (EA) merges knowledge graphs (KGs) by identifying the equivalent entities in different graphs, which can effectively enrich knowledge representations of KGs. However, in practice, different KGs often include dangling entities whose counterparts cannot be found in the other graph, which limits the performance of EA methods. To improve EA with dangling entities, we propose an unsupervised method called Semi-constraint Optimal Transport for Entity Alignment in Dangling cases (SoTead). Our main idea is to model the entity alignment between two KGs as an optimal transport problem from one KG's entities to the others. First, we set pseudo entity pairs between KGs based on pretrained word embeddings. Then, we conduct contrastive metric learning to obtain the transport cost between each entity pair. Finally, we introduce a virtual entity for each KG to "align" the dangling entities from the other KGs, which relaxes the optimization constraints and leads to a semi-constraint optimal transport. In the experimental part, we first show the superiority of SoTead on a commonly-used entity alignment dataset. Besides, to analyze the ability for dangling entity detection with other baselines, we construct a medical cross-lingual knowledge graph dataset, MedED, where our SoTead also reaches state-of-the-art performance.

preprint2022arXiv

Sentence Alignment with Parallel Documents Facilitates Biomedical Machine Translation

Objective: Today's neural machine translation (NMT) can achieve near human-level translation quality and greatly facilitates international communications, but the lack of parallel corpora poses a key problem to the development of translation systems for highly specialized domains, such as biomedicine. This work presents an unsupervised algorithm for deriving parallel corpora from document-level translations by using sentence alignment and explores how training materials affect the performance of biomedical NMT systems. Materials and Methods: Document-level translations are mixed to train bilingual word embeddings (BWEs) for the evaluation of cross-lingual word similarity, and sentence distance is defined by combining semantic and positional similarities of the sentences. The alignment of sentences is formulated as an extended earth mover's distance problem. A Chinese-English biomedical parallel corpus is derived with the proposed algorithm using bilingual articles from UpToDate and translations of PubMed abstracts, which is then used for the training and evaluation of NMT. Results: On two manually aligned translation datasets, the proposed algorithm achieved accurate sentence alignment in the 1-to-1 cases and outperformed competing algorithms in the many-to-many cases. The NMT model fine-tuned on biomedical data significantly improved the in-domain translation quality (zh-en: +17.72 BLEU; en-zh: +17.02 BLEU). Both the size of the training data and the combination of different corpora can significantly affect the model's performance. Conclusion: The proposed algorithm relaxes the assumption for sentence alignment and effectively generates accurate translation pairs that facilitate training high quality biomedical NMT models.

preprint2022arXiv

SpinQ Triangulum: a commercial three-qubit desktop quantum computer

SpinQ Triangulum is the second generation of the desktop quantum computers designed and manufactured by SpinQ Technology. SpinQ's desktop quantum computer series, based on room temperature NMR spectrometer, provide light-weighted, cost-effective and maintenance-free quantum computing platforms that aim to provide real-device experience for quantum computing education for K-12 and college level. These platforms also feature quantum control design capabilities for studying quantum control and quantum noise. Compared with the first generation product, the two-qubit SpinQ Gemini, Triangulum features a three-qubit QPU, smaller dimensions (61 * 33 * 56 cm^3) and lighter (40 kg). Furthermore, the magnetic field is more stable and the performance of quantum control is more accurate. This paper introduces the system design of Triangulum and its new features. As an example of performing quantum computing tasks, we present the implementation of the Harrow-Hassidim-Lloyd (HHL) algorithm on Triangulum, demonstrating Triangulum's capability of undertaking complex quantum computing tasks. SpinQ will continue to develop desktop quantum computing platform with more qubits. Meanwhile, a simplified version of SpinQ Gemini, namely Gemini Mini (https://www.spinq.cn/products#geminiMini-anchor) , has been recently realised. Gemini Mini is much more portable (20* 35 * 26 cm^3, 14 kg) and affordable for most K-12 schools around the world.

preprint2021arXiv

SpinQ Gemini: a desktop quantum computer for education and research

SpinQ Gemini is a commercial desktop quantum computer designed and manufactured by SpinQ Technology. It is an integrated hardware-software system. The first generation product with two qubits was launched in January 2020. The hardware is based on NMR spectrometer, with permanent magnets providing $\sim 1$ T magnetic field. SpinQ Gemini operates under room temperature ($0$-$30^{\circ}$C), highlighting its lightweight (55 kg with a volume of $70\times 40 \times 80$ cm$^3$), cost-effective (under $50$k USD), and maintenance-free. SpinQ Gemini aims to provide real-device experience for quantum computing education for K-12 and at the college level. It also features quantum control design capabilities that benefit the researchers studying quantum control and quantum noise. Since its first launch, SpinQ Gemini has been shipped to institutions in Canada, Taiwan and Mainland China. This paper introduces the system of design of SpinQ Gemini, from hardware to software. We also demonstrate examples for performing quantum computing tasks on SpinQ Gemini, including one task for a variational quantum eigensolver of a two-qubit Heisenberg model. The next generations of SpinQ quantum computing devices will adopt models of more qubits, advanced control functions for researchers with comparable cost, as well as simplified models for much lower cost (under $5$k USD) for K-12 education. We believe that low-cost portable quantum computer products will facilitate hands-on experience for teaching quantum computing at all levels, well-prepare younger generations of students and researchers for the future of quantum technologies.

preprint2020arXiv

High-throughput relation extraction algorithm development associating knowledge articles and electronic health records

Objective: Medical relations are the core components of medical knowledge graphs that are needed for healthcare artificial intelligence. However, the requirement of expert annotation by conventional algorithm development processes creates a major bottleneck for mining new relations. In this paper, we present Hi-RES, a framework for high-throughput relation extraction algorithm development. We also show that combining knowledge articles with electronic health records (EHRs) significantly increases the classification accuracy. Methods: We use relation triplets obtained from structured databases and semistructured webpages to label sentences from target corpora as positive training samples. Two methods are also provided for creating improved negative samples by combining positive samples with naïve negative samples. We propose a common model that summarizes sentence information using large-scale pretrained language models and multi-instance attention, which then joins with the concept embeddings trained from the EHRs for relation prediction. Results: We apply the Hi-RES framework to develop classification algorithms for disorder-disorder relations and disorder-location relations. Millions of sentences are created as training data. Using pretrained language models and EHR-based embeddings individually provides considerable accuracy increases over those of previous models. Joining them together further tremendously increases the accuracy to 0.947 and 0.998 for the two sets of relations, respectively, which are 10-17 percentage points higher than those of previous models. Conclusion: Hi-RES is an efficient framework for achieving high-throughput and accurate relation extraction algorithm development.