Researcher profile

Yefeng Zheng

Yefeng Zheng contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
57 works
0 followers
10 topics
4 close collaborators

Published work

57 published item(s)

preprint | 2026 | arXiv

Attention Needs to Focus: A Unified Perspective on Attention Allocation

The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.
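
The "Attention Underload" failure mode follows from the softmax sum constraint: the weights must total 1 even when no key is relevant. The sketch below illustrates only that constraint. The abstract does not define Elastic-Softmax, so `softmax_with_null_slot` is a hypothetical stand-in (a softmax with one extra implicit logit fixed at 0), not the paper's function, and the score values are invented for illustration.

```python
import math


def softmax(scores):
    """Standard softmax: the weights are positive and always sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_with_null_slot(scores):
    """Softmax with one extra implicit logit fixed at 0 (hypothetical stand-in).

    The extra slot absorbs probability mass, so the returned weights can sum
    to less than 1 when every score is low. This is NOT Elastic-Softmax.
    """
    peak = max(max(scores), 0.0)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps) + math.exp(0.0 - peak)
    return [e / total for e in exps]


irrelevant = [-4.0, -4.1, -3.9]  # assumed scores: no key matches the query
relevant = [6.0, -4.0, -4.0]     # assumed scores: one key clearly matches

print(sum(softmax(irrelevant)))                 # approximately 1, by construction
print(sum(softmax_with_null_slot(irrelevant)))  # small total mass
print(softmax_with_null_slot(relevant)[0])      # still close to 1
```

With the standard softmax, the irrelevant case still hands out a full unit of attention, which is the kind of forced allocation the abstract associates with attention sink.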

preprint | 2026 | arXiv

Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench

While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting, widely considered to enhance reasoning, paradoxically decouples linguistic generation from visual evidence in PET, producing clinically fluent but factually ungrounded diagnoses. To resolve this, we propose Atomic Visual Alignment (AVA), a simple fine-tuning strategy that enforces the mastery of low-level functional perception prior to high-level diagnostic reasoning. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%. Code and data are available at https://github.com/yezanting/PET-Bench.

preprint | 2026 | arXiv

VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation

Deploying multi-sequence magnetic resonance imaging (MRI) segmentation models to new clinical environments is challenging due to variations in scanners and acquisition protocols. Although existing TTA methods handle basic per-modality shifts, they often fail under a fundamental dual-shift problem, as their adaptation signals fail to capture modality-interaction shifts that disrupt inter-sequence consistency. To address this, we propose Variance-gated Inter-Sequence Test-time Adaptation (VISTA), a source-free framework that tackles modality-interaction shifts. First, we design an Inter-Sequence Intervention Generator (ISIG) that generates a set of consistency probes by swapping low-frequency spectra and entropy-localized patches across sequences, preserving anatomical semantics while challenging inter-sequence dependencies. Second, we introduce Cross-View Disagreement-Aware Pseudo Labeling (CDPL), which establishes a voxel-wise reliability metric using cross-view disagreement variance to dynamically gate self-training and enforce interventional consistency, encouraging the network to rely on robust anatomical semantics. Extensive experiments adapting from standard adult MRI (BraTS-GLI-Pre) to African low-field (BraTS-SSA) and pediatric (BraTS-PED) cohorts show improved performance over competing methods under clinical shifts, achieving absolute Dice improvements of +1.89% (SSA) and +2.82% (PED) over the source model. The code is available at https://github.com/dzp2095/VISTA.
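
The variance-gating idea in CDPL can be sketched as follows. This is a minimal illustration, not the VISTA implementation: it assumes each view yields one foreground probability per voxel, uses the population variance across views as the disagreement measure, and applies a hard threshold. The function name, the threshold value and the hard (rather than soft) gate are all assumptions.

```python
from statistics import mean, pvariance


def gated_pseudo_labels(view_probs, var_threshold=0.01):
    """Return one (pseudo_label, keep) pair per voxel.

    view_probs: list of views, each a list of per-voxel foreground
    probabilities. A voxel is kept for self-training only when the views
    agree, i.e. the variance across views is at most var_threshold
    (a hypothetical value).
    """
    results = []
    for probs in zip(*view_probs):        # iterate voxel by voxel
        label = int(mean(probs) >= 0.5)   # consensus label across views
        keep = pvariance(probs) <= var_threshold
        results.append((label, keep))
    return results


views = [
    [0.90, 0.20, 0.60],
    [0.92, 0.10, 0.10],
    [0.88, 0.15, 0.90],
]
print(gated_pseudo_labels(views))
# [(1, True), (0, True), (1, False)]
```

The third voxel gets a foreground label from the mean, but the views disagree strongly, so it is excluded from self-training.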

preprint | 2026 | arXiv

VoxShield: Protecting 3D Medical Datasets from Unauthorized Training via Frequency-Aware Inter-Slice Disruption

The release of public 3D medical image segmentation (MIS) datasets accelerates clinical research but simultaneously heightens risks of unauthorized AI model training. While Unlearnable Examples (UE) offer protection by injecting imperceptible perturbations to prevent effective model learning, existing methods primarily target 2D scenarios. They neglect the volumetric spatial correlations and inter-slice anatomical consistency inherent in 3D medical volumes, which serve as critical learning priors for 3D segmentation networks. To bridge this gap, we propose VoxShield, a UE framework that explicitly targets the volumetric inductive biases of 3D networks. Our core insight is that by systematically dismantling the cross-slice continuity that 3D architectures rely on, we can fundamentally impair their spatial aggregation process. Specifically, we introduce an Inter-Slice Frequency Consistency Disruption mechanism that maximizes the spectral divergence between adjacent slices, injecting structural incoherence along the $z$-axis. Complementing this structural attack, a Semantic Prediction Disruption module is incorporated. By maximizing the $\ell_1$ divergence between clean and perturbed logits, it forces the injected noise to penetrate the entire network and corrupt the final semantic mapping. Experiments on BraTS19 and FLARE21 demonstrate that VoxShield successfully degrades 3D segmentation performance, reducing the DSC from 80.0% to near 0.0% and from 88.6% to 6.8%, respectively. All protections are achieved with minimal perturbation ($ε=4/255$) to preserve high visual fidelity. The code is available at https://github.com/KK266299/VoxShield.
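
To make "spectral divergence between adjacent slices" concrete, the sketch below measures the mean L1 distance between the magnitude spectra of neighbouring slices. It only measures the quantity; it does not optimize a perturbation, and the abstract does not state which divergence VoxShield uses, so the L1 distance on magnitudes, the naive DFT and all names here are assumptions chosen for a dependency-free illustration.

```python
import cmath


def dft2_magnitude(img):
    """Magnitude of the 2D discrete Fourier transform (naive, tiny inputs only)."""
    h, w = len(img), len(img[0])
    mags = []
    for u in range(h):
        row = []
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(angle)
            row.append(abs(acc))
        mags.append(row)
    return mags


def inter_slice_spectral_gap(volume):
    """Mean L1 distance between magnitude spectra of adjacent slices."""
    spectra = [dft2_magnitude(s) for s in volume]
    gaps = []
    for a, b in zip(spectra, spectra[1:]):
        diffs = [abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
        gaps.append(sum(diffs) / len(diffs))
    return sum(gaps) / len(gaps)


# A smooth toy slice, and a copy with a checkerboard pattern added to it.
smooth = [[float(x + y) for x in range(4)] for y in range(4)]
noisy = [[smooth[y][x] + (1.0 if (x + y) % 2 else -1.0) for x in range(4)]
         for y in range(4)]

print(inter_slice_spectral_gap([smooth, smooth, smooth]))  # identical slices
print(inter_slice_spectral_gap([smooth, noisy, smooth]))   # disrupted continuity
```

Identical slices give a gap of zero, while the checkerboard slice puts energy at a high frequency that its neighbours lack, which raises the gap.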

preprint | 2024 | arXiv

Dynamically Masked Discriminator for Generative Adversarial Networks

Training Generative Adversarial Networks (GANs) remains a challenging problem. The discriminator trains the generator by learning the distribution of real/generated data. However, the distribution of generated data changes throughout the training process, which is difficult for the discriminator to learn. In this paper, we propose a novel method for GANs from the viewpoint of online continual learning. We observe that the discriminator model, trained on historically generated data, often slows down its adaptation to changes in newly arriving generated data, which accordingly decreases the quality of generated results. By treating the generated data in training as a stream, we propose to detect whether the discriminator slows down the learning of new knowledge in generated data, so that we can explicitly force the discriminator to learn new knowledge quickly. In particular, we propose a new discriminator, which automatically detects its retardation and then dynamically masks its features, such that the discriminator can adaptively learn the temporally varying distribution of generated data. Experimental results show our method outperforms the state-of-the-art approaches.

preprint | 2023 | arXiv

A New Perspective to Boost Vision Transformer for Medical Image Classification

The Transformer has achieved impressive success in various computer vision tasks. However, most existing studies need to pretrain the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, and such a dataset is usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement from ImageNet-pretrained weights degrades significantly when the weights are transferred to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach specifically for medical image classification with the Transformer backbone. Our BOLT consists of two networks, namely online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network representation of the same patch embedding tokens with a different perturbation. To get the most out of the Transformer on limited medical data, we propose an auxiliary difficulty ranking task. The Transformer is required to identify which branch (i.e., online/target) is processing the more difficult perturbed tokens. Overall, the Transformer learns to distill the transformation-invariant features from the perturbed tokens to simultaneously achieve difficulty measurement and maintain the consistency of self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading and diabetic retinopathy grading. The experimental results validate the superiority of our BOLT for medical image classification, compared to ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.

preprint | 2022 | arXiv

All-Around Real Label Supervision: Cyclic Prototype Consistency Learning for Semi-supervised Medical Image Segmentation

Semi-supervised learning has substantially advanced medical image segmentation since it alleviates the heavy burden of acquiring costly expert-examined annotations. In particular, consistency-based approaches have attracted more attention for their superior performance, wherein the real labels are only utilized to supervise their paired images via a supervised loss while the unlabeled images are exploited by enforcing perturbation-based "unsupervised" consistency without explicit guidance from those real labels. However, intuitively, the expert-examined real labels contain more reliable supervision signals. Observing this, we ask an unexplored but interesting question: can we exploit the unlabeled data via explicit real label supervision for semi-supervised training? To this end, we discard the previous perturbation-based consistency but absorb the essence of non-parametric prototype learning. Based on the prototypical network, we then propose a novel cyclic prototype consistency learning (CPCL) framework, which is constructed by a labeled-to-unlabeled (L2U) prototypical forward process and an unlabeled-to-labeled (U2L) backward process. These two processes synergistically enhance the segmentation network by encouraging more discriminative and compact features. In this way, our framework turns previous "unsupervised" consistency into new "supervised" consistency, obtaining the "all-around real label supervision" property of our method. Extensive experiments on brain tumor segmentation from MRI and kidney segmentation from CT images show that our CPCL can effectively exploit the unlabeled data and outperform other state-of-the-art semi-supervised medical image segmentation methods.
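
The labeled-to-unlabeled direction builds on prototypical networks, which can be sketched generically: average the labeled features of each class into a prototype, then assign each unlabeled feature to the most similar prototype. This is the textbook prototype recipe, not the CPCL code. Masked average pooling, cosine similarity, and every name and number below are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def class_prototype(features, mask, cls):
    """Masked average pooling: mean feature over pixels labeled `cls`."""
    selected = [f for f, m in zip(features, mask) if m == cls]
    dim = len(features[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def segment_with_prototypes(features, prototypes):
    """Label each pixel feature with the class of its most similar prototype."""
    return [max(prototypes, key=lambda c: cosine(f, prototypes[c]))
            for f in features]


# Toy labeled image: four pixel features with a ground-truth mask.
labeled_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labeled_mask = [1, 1, 0, 0]
prototypes = {c: class_prototype(labeled_feats, labeled_mask, c) for c in (0, 1)}

# Toy unlabeled image: predictions are driven by the real labels above.
unlabeled_feats = [[0.8, 0.2], [0.2, 0.7]]
print(segment_with_prototypes(unlabeled_feats, prototypes))  # [1, 0]
```

The backward (U2L) process would run the same steps in the other direction, building prototypes from the predicted unlabeled mask and checking the result against the real labels.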

preprint | 2022 | arXiv

Conquering Data Variations in Resolution: A Slice-Aware Multi-Branch Decoder Network

Fully convolutional neural networks have made promising progress in joint liver and liver tumor segmentation. Instead of following the debates over 2D versus 3D networks (for example, pursuing the balance between large-scale 2D pretraining and 3D context), in this paper, we identify the wide variation in the ratio between intra- and inter-slice resolutions as a crucial obstacle to performance. To tackle the mismatch between the intra- and inter-slice information, we propose a slice-aware 2.5D network that emphasizes extracting discriminative features utilizing not only in-plane semantics but also out-of-plane coherence for each separate slice. Specifically, we present a slice-wise multi-input multi-output architecture to instantiate such a design paradigm, which contains a Multi-Branch Decoder (MD) with a Slice-centric Attention Block (SAB) for learning slice-specific features and a Densely Connected Dice (DCD) loss to regularize the inter-slice predictions to be coherent and continuous. Based on the aforementioned innovations, we achieve state-of-the-art results on the MICCAI 2017 Liver Tumor Segmentation (LiTS) dataset. In addition, we test our model on the ISBI 2019 Segmentation of THoracic Organs at Risk (SegTHOR) dataset, and the results demonstrate the robustness and generalizability of the proposed method in other segmentation tasks.

preprint | 2022 | arXiv

Deep Convolutional Neural Networks for Molecular Subtyping of Gliomas Using Magnetic Resonance Imaging

Knowledge of molecular subtypes of gliomas can provide valuable information for tailored therapies. This study aimed to investigate the use of deep convolutional neural networks (DCNNs) for noninvasive glioma subtyping with radiological imaging data according to the new taxonomy announced by the World Health Organization in 2016. Methods: A DCNN model was developed for the prediction of the five glioma subtypes based on a hierarchical classification paradigm. This model used three parallel, weight-sharing, deep residual learning networks to process 2.5-dimensional input of trimodal MRI data, including T1-weighted, T1-weighted with contrast enhancement, and T2-weighted images. A data set comprising 1,016 real patients was collected for evaluation of the developed DCNN model. The predictive performance was evaluated via the area under the curve (AUC) from the receiver operating characteristic analysis. For comparison, the performance of a radiomics-based approach was also evaluated. Results: The AUCs of the DCNN model for the four classification tasks in the hierarchical classification paradigm were 0.89, 0.89, 0.85, and 0.66, respectively, as compared to 0.85, 0.75, 0.67, and 0.59 of the radiomics approach. Conclusion: The results showed that the developed DCNN model can predict glioma subtypes with promising performance, given sufficient, non-ill-balanced training data.

preprint | 2022 | arXiv

Deformer: Towards Displacement Field Learning for Unsupervised Medical Image Registration

Recently, deep-learning-based approaches have been widely studied for the deformable image registration task. However, most efforts directly map the composite image representation to a spatial transformation through a convolutional neural network, ignoring its limited ability to capture spatial correspondence. On the other hand, although the Transformer can better characterize spatial relationships with its attention mechanism, its long-range dependency may be harmful to the registration task, where voxels that are too far apart are unlikely to be corresponding pairs. In this study, we propose a novel Deformer module along with a multi-scale framework for the deformable image registration task. The Deformer module is designed to facilitate the mapping from image representation to spatial transformation by formulating the displacement vector prediction as the weighted summation of several bases. With the multi-scale framework to predict the displacement fields in a coarse-to-fine manner, superior performance can be achieved compared with traditional and learning-based approaches. Comprehensive experiments on two public datasets are conducted to demonstrate the effectiveness of the proposed Deformer module as well as the multi-scale framework.
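
The phrase "weighted summation of several bases" can be illustrated for a single voxel. The sketch below assumes the weights come from a softmax over per-basis scores and that the bases are fixed 3D vectors. The abstract specifies neither, so the normalization, the basis values and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def displacement_from_bases(scores, bases):
    """Displacement vector as a softmax-weighted sum of basis vectors.

    scores: one score per basis (assumed to be predicted by the network).
    bases:  list of 3D basis displacement vectors (assumed values).
    """
    weights = softmax(scores)
    dim = len(bases[0])
    return [sum(w * b[d] for w, b in zip(weights, bases)) for d in range(dim)]


bases = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(displacement_from_bases([0.0, 0.0, 0.0], bases))   # equal mix of the bases
print(displacement_from_bases([10.0, 0.0, 0.0], bases))  # dominated by the first basis
```

Under this assumed normalization the output is a convex combination of the bases, so its magnitude is bounded by the largest basis, which is one way to keep predicted displacements local.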

preprint | 2022 | arXiv

Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation

Research into Few-shot Semantic Segmentation (FSS) has attracted great attention, with the goal of segmenting target objects in a query image given only a few annotated support images of the target class. A key to this challenging task is to fully utilize the information in the support images by exploiting fine-grained correlations between the query and support images. However, most existing approaches either compressed the support information into a few class-wise prototypes, or used partial support information (e.g., only foreground) at the pixel level, causing non-negligible information loss. In this paper, we propose Dense pixel-wise Cross-query-and-support Attention weighted Mask Aggregation (DCAMA), where both foreground and background support information are fully exploited via multi-level pixel-wise correlations between paired query and support features. Implemented with the scaled dot-product attention in the Transformer architecture, DCAMA treats every query pixel as a token, computes its similarities with all support pixels, and predicts its segmentation label as an additive aggregation of all the support pixels' labels -- weighted by the similarities. Based on the unique formulation of DCAMA, we further propose efficient and effective one-pass inference for n-shot segmentation, where pixels of all support images are collected for the mask aggregation at once. Experiments show that our DCAMA significantly advances the state of the art on standard FSS benchmarks of PASCAL-5i, COCO-20i, and FSS-1000, e.g., with 3.1%, 9.7%, and 3.6% absolute improvements in 1-shot mIoU over previous best records. Ablation studies also verify the design of DCAMA.
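
The aggregation described above maps directly onto scaled dot-product attention, with query pixels as queries, support pixels as keys and the support mask as values. The sketch below shows one level with toy two-dimensional features. It is a minimal illustration rather than the DCAMA code, and the feature values are invented.

```python
import math


def aggregate_mask(query_feats, support_feats, support_labels):
    """Predict a soft label per query pixel from the labels of support pixels.

    Each query pixel attends to every support pixel (foreground and
    background alike) and takes the attention-weighted sum of their labels.
    """
    dim = len(query_feats[0])
    predictions = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in support_feats]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        predictions.append(sum(e / total * y
                               for e, y in zip(exps, support_labels)))
    return predictions


support_feats = [[4.0, 0.0], [0.0, 4.0]]  # one foreground, one background pixel
support_labels = [1.0, 0.0]
query_feats = [[4.0, 0.0], [0.0, 4.0]]

print(aggregate_mask(query_feats, support_feats, support_labels))
```

For n-shot inference the same function applies unchanged: concatenating the pixels and labels of all support images into `support_feats` and `support_labels` gives the one-pass aggregation the abstract mentions.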

preprint | 2022 | arXiv

DFTR: Depth-supervised Fusion Transformer for Salient Object Detection

Automated salient object detection (SOD) plays an increasingly crucial role in many computer vision applications. By reformulating the depth information as supervision rather than as input, depth-supervised convolutional neural networks (CNN) have achieved promising results on both RGB and RGB-D SOD scenarios with the merits of no requirements for extra depth networks and depth inputs in the inference stage. This paper, for the first time, seeks to expand the applicability of depth supervision to the Transformer architecture. Specifically, we develop a Depth-supervised Fusion TRansformer (DFTR), to further improve the accuracy of both RGB and RGB-D SOD. The proposed DFTR involves three primary features: 1) DFTR, to the best of our knowledge, is the first pure Transformer-based model for depth-supervised SOD; 2) A multi-scale feature aggregation (MFA) module is proposed to fully exploit the multi-scale features encoded by the Swin Transformer in a coarse-to-fine manner; 3) To enable bidirectional information flow across different streams of features, a novel multi-stage feature fusion (MFF) module is further integrated into our DFTR with the emphasis on salient regions at different network learning stages. We extensively evaluate the proposed DFTR on ten benchmarking datasets. Experimental results show that our DFTR consistently outperforms the existing state-of-the-art methods for both RGB and RGB-D SOD tasks. The code and model will be made publicly available.

preprint | 2022 | arXiv

Domain Adaptation Meets Zero-Shot Learning: An Annotation-Efficient Approach to Multi-Modality Medical Image Segmentation

Due to the lack of properly annotated medical data, exploring the generalization capability of the deep model is becoming a public concern. Zero-shot learning (ZSL) has emerged in recent years to equip the deep model with the ability to recognize unseen classes. However, existing studies mainly focus on natural images, which utilize linguistic models to extract auxiliary information for ZSL. It is impractical to apply the natural image ZSL solutions directly to medical images, since the medical terminology is very domain-specific, and it is not easy to acquire linguistic models for the medical terminology. In this work, we propose a new paradigm of ZSL specifically for medical images utilizing cross-modality information. We make three main contributions with the proposed paradigm. First, we extract the prior knowledge about the segmentation targets, called relation prototypes, from the prior model and then propose a cross-modality adaptation module to pass the prototypes on to the zero-shot model. Second, we propose a relation prototype awareness module to make the zero-shot model aware of information contained in the prototypes. Last but not least, we develop an inheritance attention module to recalibrate the relation prototypes to enhance the inheritance process. The proposed framework is evaluated on two public cross-modality datasets including a cardiac dataset and an abdominal dataset. Extensive experiments show that the proposed framework significantly outperforms the state of the art.

preprint | 2022 | arXiv

Double-Uncertainty Guided Spatial and Temporal Consistency Regularization Weighting for Learning-based Abdominal Registration

In order to tackle the difficulty associated with the ill-posed nature of the image registration problem, regularization is often used to constrain the solution space. For most learning-based registration approaches, the regularization usually has a fixed weight and only constrains the spatial transformation. Such convention has two limitations: (i) Besides the laborious grid search for the optimal fixed weight, the regularization strength of a specific image pair should be associated with the content of the images, thus the "one value fits all" training scheme is not ideal; (ii) Only spatially regularizing the transformation may neglect some informative clues related to the ill-posedness. In this study, we propose a mean-teacher based registration framework, which incorporates an additional temporal consistency regularization term by encouraging the teacher model's prediction to be consistent with that of the student model. More importantly, instead of searching for a fixed weight, the teacher enables automatically adjusting the weights of the spatial regularization and the temporal consistency regularization by taking advantage of the transformation uncertainty and appearance uncertainty. Extensive experiments on the challenging abdominal CT-MRI registration show that our training strategy can promisingly advance the original learning-based method in terms of efficient hyperparameter tuning and a better tradeoff between accuracy and smoothness.
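
In the standard mean-teacher formulation, the teacher's weights are an exponential moving average (EMA) of the student's weights over training steps, which is what gives the consistency term its temporal character. The sketch below shows only that update on a flat list of weights. The decay value and function name are assumptions, and the uncertainty-based weighting of the two regularizers is not reproduced because the abstract does not give its formula.

```python
def ema_update(teacher, student, decay=0.99):
    """One mean-teacher step: move teacher weights toward the student's.

    teacher, student: flat lists of floats standing in for model parameters.
    decay: EMA momentum (0.99 is a hypothetical value).
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


teacher = [1.0, 0.0]
student = [0.0, 1.0]

# A single step moves the teacher slightly toward the student.
print(ema_update(teacher, student, decay=0.9))

# With a fixed student, repeated steps converge to the student's weights.
for _ in range(200):
    teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

Because the teacher changes slowly, its prediction acts as a temporally smoothed target for the student during training.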

preprint | 2022 | arXiv

Face Completion with Semantic Knowledge and Collaborative Adversarial Learning

Unlike a conventional background inpainting approach that infers a missing area from image patches similar to the background, face completion requires semantic knowledge about the target object for realistic outputs. Current image inpainting approaches utilize generative adversarial networks (GANs) to achieve such semantic understanding. However, in adversarial learning, the semantic knowledge is learned implicitly and hence good semantic understanding is not always guaranteed. In this work, we propose a collaborative adversarial learning approach to face completion to explicitly induce the training process. Our method is formulated under a novel generative framework called collaborative GAN (collaGAN), which allows better semantic understanding of a target object through collaborative learning of multiple tasks including face completion, landmark detection, and semantic segmentation. Together with the collaGAN, we also introduce an inpainting concentrated scheme such that the model emphasizes more on inpainting instead of autoencoding. Extensive experiments show that the proposed designs are indeed effective and collaborative adversarial learning provides better feature representations of the faces. In comparison with other generative image inpainting models and single task learning methods, our solution produces superior performances on all tasks.

preprint | 2022 | arXiv

FedMed-ATL: Misaligned Unpaired Brain Image Synthesis via Affine Transform Loss

Completely aligned and paired multi-modal neuroimaging data have proved effective in the diagnosis of brain diseases. However, collecting a full set of well-aligned and paired data is impractical because of difficulties such as high cost, long acquisition time, image corruption, and privacy issues. Previously, misaligned unpaired neuroimaging data (termed MUD) were generally treated as noisy labels. However, such noisy-label-based methods perform poorly when the misaligned data are severely distorted, for example when the rotation angle differs. In this paper, we propose a novel federated self-supervised learning method (FedMed) for brain image synthesis. An affine transform loss (ATL) is formulated to make use of severely distorted images without violating privacy legislation for the hospital. We then introduce a new data augmentation procedure for self-supervised training and feed it into three auxiliary heads, namely auxiliary rotation, auxiliary translation and auxiliary scaling heads. The proposed method demonstrates advanced performance in the quality of the synthesized results under a severely misaligned and unpaired data setting, as well as better stability than other GAN-based algorithms. The proposed method also reduces the demand for deformable registration while encouraging the use of misaligned and unpaired data. Experimental results verify the outstanding performance of our learning paradigm compared to other state-of-the-art approaches.

preprint | 2022 | arXiv

Finding Influential Instances for Distantly Supervised Relation Extraction

Distant supervision (DS) is a strong way to expand the datasets for enhancing relation extraction (RE) models but often suffers from high label noise. Current works based on attention, reinforcement learning, or GANs are black-box models, so they provide neither a meaningful interpretation of sample selection in DS nor stability across different domains. In contrast, this work proposes a novel model-agnostic instance sampling method for DS based on the influence function (IF), namely REIF. Our method identifies favorable/unfavorable instances in the bag based on IF, then performs dynamic instance sampling. We design a fast influence sampling algorithm that reduces the computational complexity from $\mathcal{O}(mn)$ to $\mathcal{O}(1)$, and analyze its robustness with respect to the selected sampling function. Experiments show that by simply sampling the favorable instances during training, REIF is able to outperform a series of baselines that have complicated architectures. We also demonstrate that REIF can support interpretable instance selection.

preprint | 2022 | arXiv

Learning Shape Priors by Pairwise Comparison for Robust Semantic Segmentation

Semantic segmentation is important in medical image analysis. Inspired by the strong ability of traditional image analysis techniques in capturing shape priors and inter-subject similarity, many deep learning (DL) models have been recently proposed to exploit such prior information and have achieved robust performance. However, these two types of important prior information are usually studied separately in existing models. In this paper, we propose a novel DL model to model both types of priors within a single framework. Specifically, we introduce an extra encoder into the classic encoder-decoder structure to form a Siamese structure for the encoders, where one of them takes a target image as input (the image-encoder), and the other concatenates a template image and its foreground regions as input (the template-encoder). The template-encoder encodes the shape priors and appearance characteristics of each foreground class in the template image. A cosine similarity based attention module is proposed to fuse the information from both encoders, to utilize both types of prior information encoded by the template-encoder and model the inter-subject similarity for each foreground class. Extensive experiments on two public datasets demonstrate that our proposed method can produce superior performance to competing methods.

preprint | 2022 | arXiv

MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation

Developing conversational agents to interact with patients and provide primary clinical advice has attracted increasing attention due to its huge application potential, especially during the COVID-19 pandemic. However, the training of end-to-end neural-based medical dialogue systems is restricted by an insufficient quantity of medical dialogue corpora. In this work, we make the first attempt to build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG, with more than 17K conversations collected from the online health consultation community. Five different categories of entities, including diseases, symptoms, attributes, tests, and medicines, are annotated in each conversation of MedDG as additional labels. To push forward future research on building expert-sensitive medical dialogue systems, we propose two kinds of medical dialogue tasks based on the MedDG dataset. One is next entity prediction and the other is doctor response generation. To gain a clear understanding of these two medical dialogue tasks, we implement several state-of-the-art benchmarks, and design two dialogue models that further take the predicted entities into account. Experimental results show that the pre-trained language models and other baselines struggle on both tasks with poor performance on our dataset, and that the response quality can be enhanced with the help of auxiliary entity information. In human evaluation, the simple retrieval model outperforms several state-of-the-art generative models, indicating that there remains substantial room for improvement in generating medically meaningful responses.

preprint | 2022 | arXiv

mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation

Accurate brain tumor segmentation from Magnetic Resonance Imaging (MRI) is desirable to joint learning of multimodal images. However, in clinical practice, it is not always possible to acquire a complete set of MRIs, and the problem of missing modalities causes severe performance degradation in existing multimodal segmentation methods. In this work, we present the first attempt to exploit the Transformer for multimodal brain tumor segmentation that is robust to any combinatorial subset of available modalities. Concretely, we propose a novel multimodal Medical Transformer (mmFormer) for incomplete multimodal learning with three main components: the hybrid modality-specific encoders that bridge a convolutional encoder and an intra-modal Transformer for both local and global context modeling within each modality; an inter-modal Transformer to build and align the long-range correlations across modalities for modality-invariant features with global semantics corresponding to tumor region; a decoder that performs a progressive up-sampling and fusion with the modality-invariant features to generate robust segmentation. Besides, auxiliary regularizers are introduced in both encoder and decoder to further enhance the model's robustness to incomplete modalities. We conduct extensive experiments on the public BraTS 2018 dataset for brain tumor segmentation. The results demonstrate that the proposed mmFormer outperforms the state-of-the-art methods for incomplete multimodal brain tumor segmentation on almost all subsets of incomplete modalities, especially by an average 19.07% improvement of Dice on tumor segmentation with only one available modality. The code is available at https://github.com/YaoZhang93/mmFormer.

preprint2022arXiv

Multi-modal Contrastive Representation Learning for Entity Alignment

Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs, which consist of structural triples and images associated with entities. Most previous works focus on how to utilize and encode information from different modalities, while it is not trivial to leverage multi-modal knowledge in entity alignment because of the modality heterogeneity. In this paper, we propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model, to obtain effective joint representations for multi-modal entity alignment. Different from previous works, MCLEA considers task-oriented modality and models the inter-modal relationships for each entity representation. In particular, MCLEA firstly learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions. Extensive experimental results show that MCLEA outperforms state-of-the-art baselines on public datasets under both supervised and unsupervised settings.
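
The contrastive learning step mentioned above can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the MCLEA objective itself: the function name `info_nce`, the cosine-similarity scoring, and the `temperature` value are all illustrative choices that do not come from the abstract. Each anchor embedding is paired with one positive embedding (for example, the same entity seen in another knowledge graph or modality), and all other positives in the batch act as negatives.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Average InfoNCE loss over a batch of embedding pairs.

    anchors[i] and positives[i] form the positive pair; every other
    positives[j] (j != i) is treated as a negative for anchors[i].
    Names and defaults are hypothetical, chosen for illustration only.
    """
    def normalize(vec):
        # Scale to unit length so the dot product is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    loss = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)
```

Correctly aligned pairs should give a loss close to zero, while mismatched pairs give a large loss, which is the signal that pulls equivalent entities together in the embedding space.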

preprint2022arXiv

PAC-Bayes Information Bottleneck

Understanding the source of the superior generalization ability of NNs remains one of the most important problems in ML research. There have been a series of theoretical works trying to derive non-vacuous bounds for NNs. Recently, the compression of information stored in weights (IIW) is proved to play a key role in NNs generalization based on the PAC-Bayes theorem. However, no solution of IIW has ever been provided, which builds a barrier for further investigation of the IIW's property and its potential in practical deep learning. In this paper, we propose an algorithm for the efficient approximation of IIW. Then, we build an IIW-based information bottleneck on the trade-off between accuracy and information complexity of NNs, namely PIB. From PIB, we can empirically identify the fitting to compressing phase transition during NNs' training and the concrete connection between the IIW compression and the generalization. Besides, we verify that IIW is able to explain NNs in broad cases, e.g., varying batch sizes, over-parameterization, and noisy labels. Moreover, we propose an MCMC-based algorithm to sample from the optimal weight posterior characterized by PIB, which fulfills the potential of IIW in enhancing NNs in practice.

preprint2022arXiv

Poisoning Semi-supervised Federated Learning via Unlabeled Data: Attacks and Defenses

Semi-supervised Federated Learning (SSFL) has recently drawn much attention due to its practical consideration, i.e., the clients may only have unlabeled data. In practice, these SSFL systems implement semi-supervised training by assigning a "guessed" label to the unlabeled data near the labeled data to convert the unsupervised problem into a fully supervised problem. However, the inherent properties of such semi-supervised training techniques create a new attack surface. In this paper, we discover and reveal a simple yet powerful poisoning attack against SSFL. Our attack exploits the natural characteristics of semi-supervised learning to poison the model by poisoning unlabeled data. Specifically, the adversary only needs to insert a small number of maliciously crafted unlabeled samples (e.g., only 0.1% of the dataset) to degrade model performance and cause misclassification. Extensive case studies have shown that our attacks are effective on different datasets and common semi-supervised learning methods. To mitigate the attacks, we propose a defense, i.e., a minimax optimization-based client selection strategy, to enable the server to select the clients who hold the correct label information and high-quality updates. Our defense further employs a quality-based aggregation rule to strengthen the contributions of the selected updates. Evaluations under different attack conditions show that the proposed defense can effectively alleviate such unlabeled poisoning attacks. Our study unveils the vulnerability of SSFL to unlabeled poisoning attacks and provides the community with potential defense methods.

preprint2022arXiv

Robust Representation via Dynamic Feature Aggregation

Deep convolutional neural network (CNN) based models are vulnerable to adversarial attacks. One possible reason is that the embedding space of a CNN-based model is sparse, resulting in a large space for the generation of adversarial samples. In this study, we propose a method, denoted as Dynamic Feature Aggregation, to compress the embedding space with a novel regularization. In particular, the convex combination of two samples is regarded as the pivot for aggregation. In the embedding space, the selected samples are guided to be similar to the representation of the pivot. On the other hand, to avoid the trivial solution of such regularization, the last fully-connected layer of the model is replaced by an orthogonal classifier, in which the embedding codes for different classes are processed orthogonally and separately. With the regularization and the orthogonal classifier, a more compact embedding space can be obtained, which accordingly improves the model's robustness against adversarial attacks. An average accuracy of 56.91% is achieved by our method on CIFAR-10 against various attack methods, which significantly surpasses a solid baseline (Mixup) by a margin of 37.31%. More surprisingly, empirical results show that the proposed method can also achieve state-of-the-art performance for out-of-distribution (OOD) detection, due to the learned compact feature space. An F1 score of 0.937 is achieved by the proposed method when adopting CIFAR-10 as the in-distribution (ID) dataset and LSUN as the OOD dataset. Code is available at https://github.com/HaozheLiu-ST/DynamicFeatureAggregation.

preprint2022arXiv

Seg4Reg+: Consistency Learning between Spine Segmentation and Cobb Angle Regression

Automated methods for Cobb angle estimation are of high demand for scoliosis assessment. Existing methods typically calculate the Cobb angle from landmark estimation, or simply combine the low-level task (e.g., landmark detection and spine segmentation) with the Cobb angle regression task, without fully exploring the benefits from each other. In this study, we propose a novel multi-task framework, named Seg4Reg+, which jointly optimizes the segmentation and regression networks. We thoroughly investigate both local and global consistency and knowledge transfer between each other. Specifically, we propose an attention regularization module leveraging class activation maps (CAMs) from image-segmentation pairs to discover additional supervision in the regression network, and the CAMs can serve as a region-of-interest enhancement gate to facilitate the segmentation task in turn. Meanwhile, we design a novel triangle consistency learning to train the two networks jointly for global optimization. The evaluations performed on the public AASCE Challenge dataset demonstrate the effectiveness of each module and superior performance of our model to the state-of-the-art methods.

preprint2022arXiv

Simultaneous Alignment and Surface Regression Using Hybrid 2D-3D Networks for 3D Coherent Layer Segmentation of Retina OCT Images

Automated surface segmentation of retinal layers is important and challenging in analyzing optical coherence tomography (OCT). Recently, many deep learning based methods have been developed for this task and yield remarkable performance. However, due to the large spatial gap and potential mismatch between the B-scans of OCT data, all of them are based on 2D segmentation of individual B-scans, which may lose the continuity information across the B-scans. In addition, the 3D surface of the retinal layers can provide more diagnostic information, which is crucial in quantitative image analysis. In this study, a novel framework based on hybrid 2D-3D convolutional neural networks (CNNs) is proposed to obtain continuous 3D retinal layer surfaces from OCT. The 2D features of individual B-scans are extracted by an encoder consisting of 2D convolutions. These 2D features are then used to produce the alignment displacement field and layer segmentation by two 3D decoders, which are coupled via a spatial transformer module. The entire framework is trained end-to-end. To the best of our knowledge, this is the first study that attempts 3D retinal layer segmentation in volumetric OCT images based on CNNs. Experiments on a publicly available dataset show that our framework achieves superior results to state-of-the-art 2D methods in terms of both layer segmentation accuracy and cross-B-scan 3D continuity, thus offering more clinical value than previous works.

preprint2022arXiv

Tell Me How to Survey: Literature Review Made Simple with Automatic Reading Path Generation

Recent years have witnessed dramatic growth in paper volume, with plenty of new research papers published every day, especially in the area of computer science. How to glean papers worth reading from the massive literature, in order to do a quick survey or keep up with the latest advances on a specific research topic, has become a challenging task. Existing academic search engines such as Google Scholar return relevant papers by individually calculating the relevance between each paper and the query. However, such systems usually omit the prerequisite chains of a research topic and cannot form a meaningful reading path. In this paper, we introduce a new task named Reading Path Generation (RPG), which aims at automatically producing a path of papers to read for a given query. To serve as a research benchmark, we further propose SurveyBank, a dataset consisting of a large number of survey papers in the field of computer science as well as their citation relationships. Each survey paper contains key phrases extracted from its title and multi-level reading lists inferred from its references. Furthermore, we propose a graph-optimization-based approach for reading path generation which takes the relationships between papers into account. Extensive evaluations demonstrate that our approach outperforms other baselines. A Real-time Reading Path Generation System (RePaGer) has also been implemented with our designed model. To the best of our knowledge, we are the first to target this important research problem. The source code of the RePaGer system and the SurveyBank dataset can be found here.

preprint2021arXiv

Deep Symmetric Adaptation Network for Cross-modality Medical Image Segmentation

Unsupervised domain adaptation (UDA) methods have shown their promising performance in the cross-modality medical image segmentation tasks. These typical methods usually utilize a translation network to transform images from the source domain to target domain or train the pixel-level classifier merely using translated source images and original target images. However, when there exists a large domain shift between source and target domains, we argue that this asymmetric structure could not fully eliminate the domain gap. In this paper, we present a novel deep symmetric architecture of UDA for medical image segmentation, which consists of a segmentation sub-network, and two symmetric source and target domain translation sub-networks. To be specific, based on two translation sub-networks, we introduce a bidirectional alignment scheme via a shared encoder and private decoders to simultaneously align features 1) from source to target domain and 2) from target to source domain, which helps effectively mitigate the discrepancy between domains. Furthermore, for the segmentation sub-network, we train a pixel-level classifier using not only original target images and translated source images, but also original source images and translated target images, which helps sufficiently leverage the semantic information from the images with different styles. Extensive experiments demonstrate that our method has remarkable advantages compared to the state-of-the-art methods in both cross-modality Cardiac and BraTS segmentation tasks.

preprint2021arXiv

Enquire One's Parent and Child Before Decision: Fully Exploit Hierarchical Structure for Self-Supervised Taxonomy Expansion

Taxonomy is a hierarchically structured knowledge graph that plays a crucial role in machine intelligence. The taxonomy expansion task aims to find a position for a new term in an existing taxonomy to capture the emerging knowledge in the world and keep the taxonomy dynamically updated. Previous taxonomy expansion solutions neglect valuable information brought by the hierarchical structure and evaluate the correctness of merely an added edge, which downgrade the problem to node-pair scoring or mini-path classification. In this paper, we propose the Hierarchy Expansion Framework (HEF), which fully exploits the hierarchical structure's properties to maximize the coherence of expanded taxonomy. HEF makes use of taxonomy's hierarchical structure in multiple aspects: i) HEF utilizes subtrees containing most relevant nodes as self-supervision data for a complete comparison of parental and sibling relations; ii) HEF adopts a coherence modeling module to evaluate the coherence of a taxonomy's subtree by integrating hypernymy relation detection and several tree-exclusive features; iii) HEF introduces the Fitting Score for position selection, which explicitly evaluates both path and level selections and takes full advantage of parental relations to interchange information for disambiguation and self-correction. Extensive experiments show that by better exploiting the hierarchical structure and optimizing taxonomy's coherence, HEF vastly surpasses the prior state-of-the-art on three benchmark datasets by an average improvement of 46.7% in accuracy and 32.3% in mean reciprocal rank.
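
The mean reciprocal rank (MRR) reported above can be computed as follows. This is a minimal sketch of the plain textbook definition; the function name and the input layout (one ranked candidate list per query term) are assumptions for illustration, and the paper may use a scaled or otherwise adapted variant that the abstract does not describe.

```python
def mean_reciprocal_rank(ranked_candidates, gold_answers):
    """Plain MRR: average of 1 / rank of the correct answer per query.

    ranked_candidates[i] is the ranked list of candidate positions for
    query i (best first); gold_answers[i] is the correct one. A query
    whose answer is missing from its list contributes 0.
    """
    if len(ranked_candidates) != len(gold_answers) or not gold_answers:
        raise ValueError("need one non-empty ranked list per gold answer")

    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold_answers):
        if answer in candidates:
            rank = candidates.index(answer) + 1  # ranks are 1-based
            total += 1.0 / rank
    return total / len(gold_answers)
```

For example, if the correct parent node is ranked first, second, and third for three query terms, the MRR is (1 + 1/2 + 1/3) / 3.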

preprint2021arXiv

Ensembled ResUnet for Anatomical Brain Barriers Segmentation

Accurate segmentation of brain structures could be helpful for glioma and radiotherapy planning. However, due to the visual and anatomical differences between modalities, the accurate segmentation of brain structures is challenging. To address this problem, we first construct a residual-block-based U-shape network with a deep encoder and a shallow decoder, which trades off framework performance and efficiency. Then, we introduce the Tversky loss to address the class imbalance between the different foreground classes and the background class. Finally, a model ensemble strategy is utilized to remove outliers and further boost performance.
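
The Tversky loss mentioned above can be sketched for a single foreground class as follows. This is a minimal illustration of the standard Tversky index, TP / (TP + alpha * FP + beta * FN), and not the authors' implementation; the function name, the flat-list inputs, and the default `alpha` and `beta` values are assumptions, since the abstract does not state the weights used.

```python
def tversky_loss(probs, targets, alpha=0.5, beta=0.5, eps=1e-7):
    """Return 1 - Tversky index for one foreground class.

    probs   : flat sequence of predicted foreground probabilities in [0, 1]
    targets : flat sequence of 0/1 ground-truth labels, same length
    alpha   : weight on false positives (hypothetical default)
    beta    : weight on false negatives (hypothetical default)
    """
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")

    # Soft counts, so the loss stays differentiable in a real framework.
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))

    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index
```

With alpha = beta = 0.5 the index reduces to the Dice coefficient. Raising beta above alpha penalizes missed foreground voxels more heavily than false alarms, which is why this family of losses is commonly used when small structures are outnumbered by background.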

preprint2021arXiv

Lifelong Learning based Disease Diagnosis on Clinical Notes

Current deep learning based disease diagnosis systems usually suffer from catastrophic forgetting, i.e., directly fine-tuning the disease diagnosis model on new tasks usually leads to an abrupt decay of performance on previous tasks. What is worse, the trained diagnosis system is fixed once deployed, yet collecting training data that covers enough diseases is infeasible, which inspires us to develop a lifelong learning diagnosis system. In this work, we propose to adopt attention to combine medical entities and context, embedding episodic memory and consolidation to retain knowledge, such that the learned model is capable of adapting to sequential disease-diagnosis tasks. Moreover, we establish a new benchmark, named Jarvis-40, which contains clinical notes collected from various hospitals. Our experiments show that the proposed method can achieve state-of-the-art performance on the proposed benchmark.

preprint2021arXiv

MixSearch: Searching for Domain Generalized Medical Image Segmentation Architectures

Considering the scarcity of medical data, most datasets in medical image analysis are an order of magnitude smaller than those of natural images. However, most Network Architecture Search (NAS) approaches in medical images focused on specific datasets and did not take into account the generalization ability of the learned architectures on unseen datasets as well as different domains. In this paper, we address this point by proposing to search for generalizable U-shape architectures on a composited dataset that mixes medical images from multiple segmentation tasks and domains creatively, which is named MixSearch. Specifically, we propose a novel approach to mix multiple small-scale datasets from multiple domains and segmentation tasks to produce a large-scale dataset. Then, a novel weaved encoder-decoder structure is designed to search for a generalized segmentation network in both cell-level and network-level. The network produced by the proposed MixSearch framework achieves state-of-the-art results compared with advanced encoder-decoder networks across various datasets.

preprint2021arXiv

Online Disease Self-diagnosis with Inductive Heterogeneous Graph Convolutional Networks

We propose a Healthcare Graph Convolutional Network (HealGCN) to offer a disease self-diagnosis service for online users based on Electronic Healthcare Records (EHRs). This paper focuses on two main challenges in online disease diagnosis: (1) serving cold-start users via graph convolutional networks and (2) handling scarce clinical descriptions via a symptom retrieval system. To this end, we first organize the EHR data into a heterogeneous graph that is capable of modeling complex interactions among users, symptoms and diseases, and tailor the graph representation learning towards disease diagnosis with an inductive learning paradigm. Then, we build a disease self-diagnosis system with a corresponding EHR Graph-based Symptom Retrieval System (GraphRet) that can search and provide a list of relevant alternative symptoms by tracing the predefined meta-paths. GraphRet helps enrich the seed symptom set through the EHR graph when users provide only scarce descriptions, hence yielding better diagnosis accuracy. Finally, we validate the superiority of our model on a large-scale EHR dataset.

preprint2021arXiv

Stabilized Medical Image Attacks

Convolutional Neural Networks (CNNs) have advanced existing medical systems for automatic disease diagnosis. However, adversarial attacks, to which CNNs are vulnerable, pose a threat to these systems. Inaccurate diagnosis results have a negative influence on human healthcare. There is a need to investigate potential adversarial attacks in order to robustify deep medical diagnosis systems. On the other hand, there are several modalities of medical images (e.g., CT, fundus, and endoscopic images), each of which is significantly different from the others. This makes it more challenging to generate adversarial perturbations for different types of medical images. In this paper, we propose an image-based medical adversarial attack method to consistently produce adversarial perturbations on medical images. The objective function of our method consists of a loss deviation term and a loss stabilization term. The loss deviation term increases the divergence between the CNN prediction of an adversarial example and its ground truth label. Meanwhile, the loss stabilization term ensures similar CNN predictions of this example and its smoothed input. From the perspective of the whole sequence of iterations for perturbation generation, the proposed loss stabilization term exhaustively searches the perturbation space to smooth the single spot for local optimum escape. We further analyze the KL-divergence of the proposed loss function and find that the loss stabilization term makes the perturbations update towards a fixed objective spot while deviating from the ground truth. This stabilization ensures that the proposed medical attack is effective for different types of medical images while producing perturbations with small variance. Experiments on several medical image analysis benchmarks, including the recent COVID-19 dataset, show the stability of the proposed method.

preprint2020arXiv

A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging

Segmentation of cardiac images, particularly late gadolinium-enhanced magnetic resonance imaging (LGE-MRI) widely used for visualizing diseased cardiac structures, is a crucial first step for clinical diagnosis and treatment. However, direct segmentation of LGE-MRIs is challenging due to their attenuated contrast. Since most clinical studies have relied on manual and labor-intensive approaches, automatic methods are of high interest, particularly optimized machine learning approaches. To address this, we organized the "2018 Left Atrium Segmentation Challenge" using 154 3D LGE-MRIs, currently the world's largest cardiac LGE-MRI dataset, and associated labels of the left atrium segmented by three medical experts, ultimately attracting the participation of 27 international teams. In this paper, extensive analysis of the submitted algorithms using technical and biological metrics was performed through subgroup analysis and hyper-parameter analysis, offering an overall picture of the major design choices of convolutional neural networks (CNNs) and practical considerations for achieving state-of-the-art left atrium segmentation. Results show that the top method achieved a Dice score of 93.2% and a mean surface-to-surface distance of 0.7 mm, significantly outperforming the prior state-of-the-art. In particular, our analysis demonstrated that double, sequentially used CNNs, in which a first CNN is used for automatic region-of-interest localization and a subsequent CNN is used for refined regional segmentation, achieved far better results than traditional methods and pipelines containing single CNNs. This large-scale benchmarking study makes a significant step towards much-improved segmentation methods for cardiac LGE-MRIs, and will serve as an important benchmark for evaluating and comparing future works in the field.
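
The Dice score used to rank methods above measures the overlap between a predicted mask and the expert label as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that standard definition on flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the challenge's evaluation code.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two flat binary masks of equal length.

    Each mask is a sequence of 0/1 (or falsy/truthy) voxel labels.
    Returns a value in [0, 1]; 1.0 means perfect overlap.
    """
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")

    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)

    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total
```

A reported Dice score of 93.2% corresponds to a return value of 0.932 on this scale.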

preprint2020arXiv

A Macro-Micro Weakly-supervised Framework for AS-OCT Tissue Segmentation

Primary angle closure glaucoma (PACG) is the leading cause of irreversible blindness among Asian people. Early detection of PACG is essential, so as to provide timely treatment and minimize the vision loss. In the clinical practice, PACG is diagnosed by analyzing the angle between the cornea and iris with anterior segment optical coherence tomography (AS-OCT). The rapid development of deep learning technologies provides the feasibility of building a computer-aided system for the fast and accurate segmentation of cornea and iris tissues. However, the application of deep learning methods in the medical imaging field is still restricted by the lack of enough fully-annotated samples. In this paper, we propose a novel framework to segment the target tissues accurately for the AS-OCT images, by using the combination of weakly-annotated images (majority) and fully-annotated images (minority). The proposed framework consists of two models which provide reliable guidance for each other. In addition, uncertainty guided strategies are adopted to increase the accuracy and stability of the guidance. Detailed experiments on the publicly available AGE dataset demonstrate that the proposed framework outperforms the state-of-the-art semi-/weakly-supervised methods and has a comparable performance as the fully-supervised method. Therefore, the proposed method is demonstrated to be effective in exploiting information contained in the weakly-annotated images and has the capability to substantively relieve the annotation workload.

preprint2020arXiv

Comparing to Learn: Surpassing ImageNet Pretraining on Radiographs By Comparing Image Representations

In the deep learning era, pretrained models play an important role in medical image analysis, in which ImageNet pretraining has been widely adopted as the best approach. However, it is undeniable that there exists an obvious domain gap between natural images and medical images. To bridge this gap, we propose a new pretraining method which learns from 700k radiographs given no manual annotations. We call our method Comparing to Learn (C2L) because it learns robust features by comparing different image representations. To verify the effectiveness of C2L, we conduct comprehensive ablation studies and evaluate it on different tasks and datasets. The experimental results on radiographs show that C2L can significantly outperform ImageNet pretraining and previous state-of-the-art approaches. Code and models are available.

preprint2020arXiv

Cross-denoising Network against Corrupted Labels in Medical Image Segmentation with Domain Shift

Deep convolutional neural networks (DCNNs) have contributed many breakthroughs in segmentation tasks, especially in the field of medical imaging. However, domain shift and corrupted annotations, which are two common problems in medical imaging, dramatically degrade the performance of DCNNs in practice. In this paper, we propose a novel robust cross-denoising framework using two peer networks to address domain shift and corrupted label problems with a peer-review strategy. Specifically, each network performs as a mentor, mutually supervised to learn from reliable samples selected by the peer network to combat corrupted labels. In addition, a noise-tolerant loss is proposed to encourage the network to capture the key location and filter the discrepancy under various noise-contaminant labels. To further reduce the accumulated error, we introduce a class-imbalanced cross learning using most confident predictions at the class-level. Experimental results on REFUGE and Drishti-GS datasets for optic disc (OD) and optic cup (OC) segmentation demonstrate the superior performance of our proposed approach to the state-of-the-art methods.

preprint2020arXiv

Crossover-Net: Leveraging the Vertical-Horizontal Crossover Relation for Robust Segmentation

Robust segmentation for non-elongated tissues in medical images is hard to realize due to the large variation of the shape, size, and appearance of these tissues in different patients. In this paper, we present an end-to-end trainable deep segmentation model termed Crossover-Net for robust segmentation in medical images. Our proposed model is inspired by an insightful observation: during segmentation, the representation from the horizontal and vertical directions can provide different local appearance and orthogonality context information, which helps enhance the discrimination between different tissues by simultaneously learning from these two directions. Specifically, by converting the segmentation task to a pixel/voxel-wise prediction problem, firstly, we originally propose a cross-shaped patch, namely crossover-patch, which consists of a pair of (orthogonal and overlapped) vertical and horizontal patches, to capture the orthogonal vertical and horizontal relation. Then, we develop the Crossover-Net to learn the vertical-horizontal crossover relation captured by our crossover-patches. To achieve this goal, for learning the representation on a typical crossover-patch, we design a novel loss function to (1) impose the consistency on the overlap region of the vertical and horizontal patches and (2) preserve the diversity on their non-overlap regions. We have extensively evaluated our method on CT kidney tumor, MR cardiac, and X-ray breast mass segmentation tasks. Promising results are achieved according to our extensive evaluation and comparison with the state-of-the-art segmentation models.

preprint2020arXiv

Deep Image Clustering with Category-Style Representation

Deep clustering which adopts deep neural networks to obtain optimal representations for clustering has been widely studied recently. In this paper, we propose a novel deep image clustering framework to learn a category-style latent representation in which the category information is disentangled from image style and can be directly used as the cluster assignment. To achieve this goal, mutual information maximization is applied to embed relevant information in the latent representation. Moreover, augmentation-invariant loss is employed to disentangle the representation into category part and style part. Last but not least, a prior distribution is imposed on the latent representation to ensure the elements of the category vector can be used as the probabilities over clusters. Comprehensive experiments demonstrate that the proposed approach outperforms state-of-the-art methods significantly on five public datasets.

preprint2020arXiv

Difficulty-aware Glaucoma Classification with Multi-Rater Consensus Modeling

Medical images are generally labeled by multiple experts before the final ground-truth labels are determined. Consensus or disagreement among experts regarding individual images reflects the gradeability and difficulty levels of the image. However, when being used for model training, only the final ground-truth label is utilized, while the critical information contained in the raw multi-rater gradings regarding the image being an easy/hard case is discarded. In this paper, we aim to take advantage of the raw multi-rater gradings to improve the deep learning model performance for the glaucoma classification task. Specifically, a multi-branch model structure is proposed to predict the most sensitive, the most specific, and a balanced fused result for the input images. In order to encourage the sensitivity branch and specificity branch to generate consistent results for consensus labels and opposite results for disagreement labels, a consensus loss is proposed to constrain the output of the two branches. Meanwhile, the consistency/inconsistency between the prediction results of the two branches implies the image being an easy/hard case, which is further utilized to encourage the balanced fusion branch to concentrate more on the hard cases. Compared with models trained only with the final ground-truth labels, the proposed method using multi-rater consensus information has achieved superior performance, and it is also able to estimate the difficulty levels of individual input images when making the prediction.

preprint2020arXiv

Distractor-Aware Neuron Intrinsic Learning for Generic 2D Medical Image Classifications

Medical image analysis benefits Computer Aided Diagnosis (CADx). A fundamental analyzing approach is the classification of medical images, which serves for skin lesion diagnosis, diabetic retinopathy grading, and cancer classification on histological images. When learning these discriminative classifiers, we observe that the convolutional neural networks (CNNs) are vulnerable to distractor interference. This is due to the similar sample appearances from different categories (i.e., small inter-class distance). Existing attempts select distractors from input images by empirically estimating their potential effects to the classifier. The essences of how these distractors affect CNN classification are not known. In this paper, we explore distractors from the CNN feature space via proposing a neuron intrinsic learning method. We formulate a novel distractor-aware loss that encourages large distance between the original image and its distractor in the feature space. The novel loss is combined with the original classification loss to update network parameters by back-propagation. Neuron intrinsic learning first explores distractors crucial to the deep classifier and then uses them to robustify CNN inherently. Extensive experiments on medical image benchmark datasets indicate that the proposed method performs favorably against the state-of-the-art approaches.

preprint2020arXiv

Generative Adversarial Networks for Video-to-Video Domain Adaptation

Endoscopic videos from multiple centres often have different imaging conditions, e.g., color and illumination, so models trained on one domain usually fail to generalize well to another. Domain adaptation is one of the potential solutions to address the problem. However, few existing works have focused on the translation of video-based data. In this work, we propose a novel generative adversarial network (GAN), namely VideoGAN, to transfer video-based data across different domains. As the frames of a video may have similar content and imaging conditions, the proposed VideoGAN has an X-shape generator to preserve the intra-video consistency during translation. Furthermore, a loss function, namely color histogram loss, is proposed to tune the color distribution of each translated frame. Two colonoscopic datasets from different centres, i.e., CVC-Clinic and ETIS-Larib, are adopted to evaluate the performance of domain adaptation of our VideoGAN. Experimental results demonstrate that the adapted colonoscopic video generated by our VideoGAN can significantly boost the segmentation accuracy, i.e., an improvement of 5%, of colorectal polyps on multicentre datasets. As our VideoGAN is a general network architecture, we also evaluate its performance with the CamVid driving video dataset on the cloudy-to-sunny translation task. Comprehensive experiments show that the domain gap can be substantially narrowed by our VideoGAN.

preprint2020arXiv

GREEN: a Graph REsidual rE-ranking Network for Grading Diabetic Retinopathy

The automatic grading of diabetic retinopathy (DR) facilitates medical diagnosis for both patients and physicians. Existing research formulates DR grading as an image classification problem. As the stages/categories of DR correlate with each other, the relationship between different classes cannot be explicitly described via a one-hot label because it is empirically estimated by different physicians with different outcomes. This class correlation prevents existing networks from achieving effective classification. In this paper, we propose a Graph REsidual rE-ranking Network (GREEN) to introduce a class dependency prior into the original image classification network. The class dependency prior is represented by a graph convolutional network with an adjacency matrix. This prior augments the image classification pipeline by re-ranking classification results in a residual aggregation manner. Experiments on the standard benchmarks have shown that GREEN performs favorably against state-of-the-art approaches.

preprint2020arXiv

Instance-aware Self-supervised Learning for Nuclei Segmentation

Due to the wide existence and large morphological variances of nuclei, accurate nuclei instance segmentation is still one of the most challenging tasks in computational pathology. Annotating nuclei instances, which requires experienced pathologists to manually draw the contours, is extremely laborious and expensive, and often results in a shortage of annotated data. Deep learning based segmentation approaches, which rely heavily on the quantity of training data, struggle to fully demonstrate their capacity in this area. In this paper, we propose a novel self-supervised learning framework to deeply exploit the capacity of widely-used convolutional neural networks (CNNs) on the nuclei instance segmentation task. The proposed approach involves two sub-tasks (i.e., scale-wise triplet learning and count ranking), which enable neural networks to implicitly leverage the prior knowledge of nuclei size and quantity, and accordingly mine instance-aware feature representations from the raw data. Experimental results on the publicly available MoNuSeg dataset show that the proposed self-supervised learning approach can remarkably boost the segmentation accuracy of nuclei instances: a new state-of-the-art average Aggregated Jaccard Index (AJI) of 70.63% is achieved by our self-supervised ResUNet-101. To the best of our knowledge, this is the first work focusing on self-supervised learning for instance segmentation.

preprint2020arXiv

Learning and Exploiting Interclass Visual Correlations for Medical Image Classification

Deep neural network-based medical image classifications often use "hard" labels for training, where the probability of the correct category is 1 and those of others are 0. However, these hard targets can make the networks over-confident about their predictions and prone to overfitting the training data, affecting model generalization and adaptation. Studies have shown that label smoothing and softening can improve classification performance. Nevertheless, existing approaches are either non-data-driven or limited in applicability. In this paper, we present the Class-Correlation Learning Network (CCL-Net) to learn interclass visual correlations from given training data, and produce soft labels to help with classification tasks. Instead of letting the network directly learn the desired correlations, we propose to learn them implicitly via distance metric learning of class-specific embeddings with a lightweight plugin CCL block. An intuitive loss based on a geometrical explanation of correlation is designed to bolster learning of the interclass correlations. We further present end-to-end training of the proposed CCL block as a plugin head together with the classification backbone while generating soft labels on the fly. Our experimental results on the International Skin Imaging Collaboration 2018 dataset demonstrate effective learning of the interclass correlations from training data, as well as consistent improvements in performance upon several widely used modern network structures with the CCL block.
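The difference between "hard" and softened targets can be illustrated with a short sketch. It shows plain uniform label smoothing, the non-data-driven baseline the abstract contrasts itself with, and not the learned soft labels of CCL-Net; the function names and the smoothing factor `epsilon` are assumptions made for this example.

```python
import math


def one_hot(index, num_classes):
    """Hard label: probability 1 for the correct class, 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def smooth_labels(hard, epsilon=0.1):
    """Uniform label smoothing: move `epsilon` of the mass to a uniform prior."""
    num_classes = len(hard)
    return [(1.0 - epsilon) * p + epsilon / num_classes for p in hard]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


hard = one_hot(2, 4)
soft = smooth_labels(hard, epsilon=0.1)

# A very confident prediction is rewarded by the hard target but
# penalized by the smoothed target for ignoring the other classes.
confident = [0.001, 0.001, 0.997, 0.001]
loss_hard = cross_entropy(hard, confident)
loss_soft = cross_entropy(soft, confident)
```

Uniform smoothing treats all wrong classes alike, which is the limitation that motivates learning class correlations from data instead.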

preprint2020arXiv

Learning Crisp Edge Detector Using Logical Refinement Network

Edge detection is a fundamental problem in different computer vision tasks. Recently, edge detection algorithms built upon deep learning have achieved satisfying improvements. Although most of them report favorable evaluation scores, they often fail to accurately localize edges and give thick and blurry boundaries. In addition, most of them focus on 2D images, and the challenging 3D edge detection is still under-explored. In this work, we propose a novel logical refinement network for crisp edge detection, which is motivated by the logical relationship between segmentation and edge maps and can be applied to both 2D and 3D images. The network consists of a joint object and edge detection network and a crisp edge refinement network, which predicts more accurate, clearer and thinner high-quality binary edge maps without any post-processing. Extensive experiments are conducted on the 2D nuclei images from the Kaggle 2018 Data Science Bowl and a private set of 3D microscopy images of a monkey brain, which show outstanding performance compared with state-of-the-art methods.

preprint2020arXiv

Leveraging Undiagnosed Data for Glaucoma Classification with Teacher-Student Learning

Recently, deep learning has been applied to the glaucoma classification task with performance comparable to that of human experts. However, a well-trained deep learning model demands a large quantity of properly labeled data, which is relatively expensive since the accurate labeling of glaucoma requires years of specialist training. To alleviate this problem, we propose a glaucoma classification framework which takes advantage of not only the properly labeled images, but also undiagnosed images without glaucoma labels. More specifically, the proposed framework is adapted from the teacher-student learning paradigm. The teacher model encodes the wrapped information of undiagnosed images to a latent feature space, while the student model learns from the teacher through knowledge transfer to improve the glaucoma classification. For the model training procedure, we propose a novel training strategy that simulates real-world teaching practice, named 'Learning To Teach with Knowledge Transfer (L2T-KT)', and establish a 'Quiz Pool' as the teacher's optimization target. Experiments show that the proposed framework is able to utilize the undiagnosed data effectively to improve the glaucoma prediction performance.

preprint2020arXiv

LT-Net: Label Transfer by Learning Reversible Voxel-wise Correspondence for One-shot Medical Image Segmentation

We introduce a one-shot segmentation method to alleviate the burden of manual annotation for medical images. The main idea is to treat one-shot segmentation as a classical atlas-based segmentation problem, where voxel-wise correspondence from the atlas to the unlabelled data is learned. Subsequently, the segmentation label of the atlas can be transferred to the unlabelled data with the learned correspondence. However, since ground truth correspondence between images is usually unavailable, the learning system must be well-supervised to avoid mode collapse and convergence failure. To overcome this difficulty, we resort to forward-backward consistency, which is widely used in correspondence problems, and additionally learn the backward correspondences from the warped atlases back to the original atlas. This cycle-correspondence learning design enables a variety of extra, cycle-consistency-based supervision signals that make the training process stable while also boosting the performance. We demonstrate the superiority of our method over both deep learning-based one-shot segmentation methods and a classical multi-atlas segmentation method via thorough experiments.

preprint2020arXiv

MI^2GAN: Generative Adversarial Network for Medical Image Domain Adaptation using Mutual Information Constraint

Domain shift between medical images from multiple centres is still an open question for the community, and it degrades the generalization performance of deep learning models. Generative adversarial networks (GANs), which synthesize plausible images, are one of the potential solutions to this problem. However, the existing GAN-based approaches are prone to fail at preserving image-objects in image-to-image (I2I) translation, which reduces their practicality on domain adaptation tasks. In this paper, we propose a novel GAN (namely MI$^2$GAN) to maintain image-contents during cross-domain I2I translation. In particular, we disentangle the content features from domain information for both the source and translated images, and then maximize the mutual information between the disentangled content features to preserve the image-objects. The proposed MI$^2$GAN is evaluated on two tasks: polyp segmentation using colonoscopic images and the segmentation of optic disc and cup in fundus images. The experimental results demonstrate that the proposed MI$^2$GAN can not only generate elegant translated images, but also significantly improve the generalization performance of widely used deep learning networks (e.g., U-Net).

preprint2020arXiv

Multi-Modality Generative Adversarial Networks with Tumor Consistency Loss for Brain MR Image Synthesis

Magnetic Resonance (MR) images of different modalities can provide complementary information for clinical diagnosis, but acquiring all modalities is often costly. Most existing methods only focus on synthesizing missing images between two modalities, which limits their robustness and efficiency when multiple modalities are missing. To address this problem, we propose a multi-modality generative adversarial network (MGAN) to synthesize three high-quality MR modalities (FLAIR, T1 and T1ce) from one MR modality, T2, simultaneously. The experimental results show that the quality of the images synthesized by our proposed method is better than that of the images synthesized by the baseline model, pix2pix. Besides, for MR brain image synthesis, it is important to preserve the critical tumor information in the generated modalities, so we further introduce a multi-modality tumor consistency loss to MGAN, yielding TC-MGAN. We use the modalities synthesized by TC-MGAN to boost the tumor segmentation accuracy, and the results demonstrate its effectiveness.

preprint2020arXiv

Multi-Task Neural Networks with Spatial Activation for Retinal Vessel Segmentation and Artery/Vein Classification

Retinal artery/vein (A/V) classification plays a critical role in the clinical biomarker study of how various systemic and cardiovascular diseases affect the retinal vessels. Conventional methods of automated A/V classification are generally complicated and depend heavily on accurate vessel segmentation. In this paper, we propose a multi-task deep neural network with a spatial activation mechanism that is able to segment the full retinal vessels, arteries and veins simultaneously, without requiring vessel segmentation as a prerequisite. The input module of the network integrates the domain knowledge of widely used retinal preprocessing and vessel enhancement techniques. We specially customize the output block of the network with a spatial activation mechanism, which takes advantage of the relatively easier task of vessel segmentation and exploits it to boost the performance of A/V classification. In addition, deep supervision is introduced to the network to help the low-level layers extract more semantic information. The proposed network achieves a pixel-wise accuracy of 95.70% for vessel segmentation and an A/V classification accuracy of 94.50%, which is the state-of-the-art performance for both tasks on the AV-DRIVE dataset. Furthermore, we have also tested the model on the INSPIRE-AVR dataset, where it achieves a skeletal A/V classification accuracy of 91.6%.

preprint2020arXiv

Quality Control of Neuron Reconstruction Based on Deep Learning

Neuron reconstruction is essential to generate an exquisite neuron connectivity map for understanding brain function. Despite the significant amount of effort that has been devoted to automatic reconstruction methods, manual tracing by well-trained human annotators is still necessary. To ensure the quality of reconstructed neurons and provide guidance for annotators to improve their efficiency, we propose a deep learning based quality control method for neuron reconstruction in this paper. By formulating the quality control problem as a binary classification task on each single point, the proposed approach overcomes the technical difficulties resulting from the large image size and complex neuron morphology. It not only provides an evaluation of reconstruction quality, but can also locate exactly where the wrong tracing begins. This work presents one of the first comprehensive studies of whole-brain scale quality control of neuron reconstructions. Five-fold cross-validation experiments on a large dataset demonstrate that the proposed approach can detect 74.7% of errors with only 1.4% false alerts.

preprint2020arXiv

Revisiting Rubik's Cube: Self-supervised Learning with Volume-wise Transformation for 3D Medical Image Segmentation

Deep learning relies heavily on the quantity of annotated data. However, annotating 3D volumetric medical data requires experienced physicians to spend hours or even days on investigation. Self-supervised learning is a potential solution that relaxes the strong requirement for training data by deeply exploiting raw data information. In this paper, we propose a novel self-supervised learning framework for volumetric medical images. Specifically, we propose a context restoration task, i.e., Rubik's cube++, to pre-train 3D neural networks. Different from the existing context-restoration-based approaches, we adopt a volume-wise transformation for context permutation, which encourages the network to better exploit the inherent 3D anatomical information of organs. Compared to the strategy of training from scratch, fine-tuning from the Rubik's cube++ pre-trained weights can achieve better performance in various tasks such as pancreas segmentation and brain tissue segmentation. The experimental results show that our self-supervised learning method can significantly improve the accuracy of 3D deep learning networks on volumetric medical datasets without the use of extra data.

preprint2020arXiv

Self-Loop Uncertainty: A Novel Pseudo-Label for Semi-Supervised Medical Image Segmentation

Witnessing the success of deep neural networks in natural image processing, an increasing number of studies have proposed deep-learning-based frameworks for medical image segmentation. However, since the pixel-wise annotation of medical images is laborious and expensive, the amount of annotated data is usually insufficient to train a neural network well. In this paper, we propose a semi-supervised approach to train neural networks with limited labeled data and a large quantity of unlabeled images for medical image segmentation. A novel pseudo-label (namely self-loop uncertainty), generated by recurrently optimizing the neural network with a self-supervised task, is adopted as the ground truth for the unlabeled images to augment the training set and boost the segmentation accuracy. The proposed self-loop uncertainty can be seen as an approximation of the uncertainty estimation yielded by ensembling multiple models, with a significant reduction in inference time. Experimental results on two publicly available datasets demonstrate the effectiveness of our semi-supervised approach.

preprint2020arXiv

Superpixel-Guided Label Softening for Medical Image Segmentation

Segmentation of objects of interest is one of the central tasks in medical image analysis and is indispensable for quantitative analysis. When developing machine-learning based methods for automated segmentation, manual annotations are usually used as the ground truth that the models learn to mimic. While the bulky parts of the segmentation targets are relatively easy to label, the peripheral areas are often difficult to handle due to ambiguous boundaries, the partial volume effect, etc., and are likely to be labeled with uncertainty. This uncertainty in labeling may, in turn, result in unsatisfactory performance of the trained models. In this paper, we propose superpixel-based label softening to tackle the above issue. Generated by unsupervised over-segmentation, each superpixel is expected to represent a locally homogeneous area. If a superpixel intersects with the annotation boundary, we consider labeling within this area to be uncertain with high probability. Driven by this intuition, we soften labels in this area based on signed distances to the annotation boundary and assign probability values within [0, 1] to them, in contrast to the original "hard", binary labels of either 0 or 1. The softened labels are then used to train the segmentation models together with the hard labels. Experimental results on a brain MRI dataset and an optical coherence tomography dataset demonstrate that this conceptually simple and easy-to-implement method achieves overall superior segmentation performance to baseline and comparison methods for both 3D and 2D medical images.
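The softening step described above can be sketched on a one-dimensional toy "image". This is an illustration under stated assumptions, not the paper's implementation: the superpixel assignment is given by hand rather than produced by over-segmentation, and the logistic mapping from signed distance to probability, along with the `scale` parameter and the function name, are hypothetical choices, since the abstract does not specify the mapping.

```python
import math


def soften_labels_1d(hard, superpixel_ids, scale=1.0):
    """Soften binary labels inside superpixels that straddle the annotation boundary."""
    n = len(hard)
    # A boundary lies halfway between neighbouring pixels with different labels.
    boundaries = [i + 0.5 for i in range(n - 1) if hard[i] != hard[i + 1]]

    # A superpixel intersects the boundary if it contains both label values.
    labels_in = {}
    for sp, y in zip(superpixel_ids, hard):
        labels_in.setdefault(sp, set()).add(y)
    uncertain = {sp for sp, ys in labels_in.items() if len(ys) > 1}

    soft = []
    for i, (y, sp) in enumerate(zip(hard, superpixel_ids)):
        if sp not in uncertain:
            soft.append(float(y))  # keep the hard label elsewhere
            continue
        dist = min(abs(i - b) for b in boundaries)
        signed = dist if y == 1 else -dist  # positive inside the object
        # Assumed mapping: logistic function of the signed distance.
        soft.append(1.0 / (1.0 + math.exp(-signed / scale)))
    return soft


hard = [0, 0, 0, 0, 1, 1, 1, 1]
superpixels = [0, 0, 1, 1, 1, 1, 2, 2]  # superpixel 1 straddles the boundary
soft = soften_labels_1d(hard, superpixels)
```

Pixels in superpixels 0 and 2 keep their hard labels, while the four pixels of superpixel 1 receive values that rise smoothly from below 0.5 to above 0.5 across the boundary.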

preprint2020arXiv

TR-GAN: Topology Ranking GAN with Triplet Loss for Retinal Artery/Vein Classification

Retinal artery/vein (A/V) classification lays the foundation for the quantitative analysis of retinal vessels, which is associated with potential risks of various cardiovascular and cerebral diseases. The topological connection relationship, which has been proved effective in improving the A/V classification performance of conventional graph based methods, has not been exploited by deep learning based methods. In this paper, we propose a Topology Ranking Generative Adversarial Network (TR-GAN) to improve the topological connectivity of the segmented arteries and veins, and to further boost the A/V classification performance. A topology ranking discriminator based on ordinal regression is proposed to rank the topological connectivity level of the ground truth, the generated A/V mask and the intentionally shuffled mask. The ranking loss is further back-propagated to the generator to generate better connected A/V masks. In addition, a topology preserving module with triplet loss is proposed to extract the high-level topological features and to further narrow the feature distance between the predicted A/V mask and the ground truth. The proposed framework effectively increases the topological connectivity of the predicted A/V masks and achieves state-of-the-art A/V classification performance on the publicly available AV-DRIVE dataset.
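For readers unfamiliar with triplet loss, the following sketch shows the standard margin-based formulation on small feature vectors. How TR-GAN assigns the anchor, positive and negative roles is not stated in the abstract; the assignment used here (ground-truth features as anchor, predicted-mask features as positive, shuffled-mask features as negative), the toy vectors and the margin value are assumptions for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D features standing in for high-level topological features.
ground_truth = [0.0, 0.0]  # anchor
predicted = [0.0, 1.0]     # positive: should end up near the anchor
shuffled = [3.0, 4.0]      # negative: should stay far from the anchor

satisfied = triplet_loss(ground_truth, predicted, shuffled)
violated = triplet_loss(ground_truth, shuffled, predicted)
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so only triplets that violate this ordering contribute gradients during training.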