Researcher profile

Shanshan Wang

Shanshan Wang contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Trust 21 - Emerging | Verification L1 | Unclaimed author
31 works
0 followers
15 topics
4 close collaborators

Published work

31 published items

preprint | 2026 | arXiv

MPM-LLM4DSE: Reaching the Pareto Frontier in HLS with Multimodal Learning and LLM-Driven Exploration

High-Level Synthesis (HLS) design space exploration (DSE) seeks Pareto-optimal designs within expansive pragma configuration spaces. To accelerate HLS DSE, graph neural networks (GNNs) are commonly employed as surrogates for HLS tools to predict quality of results (QoR) metrics, while multi-objective optimization algorithms expedite the exploration. However, GNN-based prediction methods may not fully capture the rich semantic features inherent in behavioral descriptions, and conventional multi-objective optimization algorithms often do not explicitly account for domain-specific knowledge about how pragma directives influence QoR. To address these limitations, this paper proposes the MPM-LLM4DSE framework, which incorporates a multimodal prediction model (MPM) that simultaneously fuses features from behavioral descriptions and control and data flow graphs. Furthermore, the framework employs a large language model (LLM) as an optimizer, accompanied by a tailored prompt engineering methodology. This methodology incorporates an analysis of pragma impact on QoR to guide the LLM in generating high-quality configurations (LLM4DSE). Experimental results demonstrate that our multimodal predictive model significantly outperforms the state-of-the-art ProgSG by up to 10.25×. Furthermore, in DSE tasks, the proposed LLM4DSE achieves an average performance gain of 39.90% over prior methods, validating the effectiveness of our prompting methodology. Code and models are available at https://github.com/wslcccc/MPM-LLM4DSE.
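
A design is Pareto-optimal when no other design is at least as good on every QoR metric and strictly better on at least one. The minimal Python sketch below applies that dominance test to a list of already-evaluated designs. The (latency, resource) tuples and the helper names are hypothetical, both objectives are assumed to be minimized, and neither the MPM predictor nor the LLM optimizer is modeled.

```python
from typing import List, Tuple

Design = Tuple[float, float]  # hypothetical (latency, resource utilization), both minimized


def dominates(q: Design, p: Design) -> bool:
    """True if q is no worse than p in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(q, p)) and q != p


def pareto_front(points: List[Design]) -> List[Design]:
    """Keep only the non-dominated designs, sorted and de-duplicated."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))


designs = [(100, 0.5), (80, 0.7), (120, 0.4), (90, 0.8), (100, 0.6)]
print(pareto_front(designs))  # [(80, 0.7), (100, 0.5), (120, 0.4)]
```

In a surrogate-based DSE loop the QoR values would come from a predictor rather than from running the HLS tool; the dominance test itself stays the same.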

preprint | 2026 | arXiv

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured, interpretable reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, which extends the CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning and grounding for reliable, context-aware surgical assistance.

preprint | 2024 | arXiv

AID-DTI: Accelerating High-fidelity Diffusion Tensor Imaging with Detail-Preserving Model-based Deep Learning

Deep learning has shown great potential in accelerating diffusion tensor imaging (DTI). Nevertheless, existing methods tend to suffer from Rician noise and loss of detail when reconstructing DTI-derived parametric maps, especially when sparsely sampled q-space data are used. This paper proposes a novel method, AID-DTI (Accelerating hIgh fiDelity Diffusion Tensor Imaging), to enable fast and accurate DTI with only six measurements. AID-DTI is equipped with a newly designed Singular Value Decomposition (SVD)-based regularizer, which can effectively capture fine details while suppressing noise during network training. Experimental results on Human Connectome Project (HCP) data consistently demonstrate that the proposed method estimates DTI parameter maps with fine-grained details and outperforms three state-of-the-art methods both quantitatively and qualitatively.
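
Six diffusion-weighted directions are the minimum needed to determine the six independent elements of the symmetric diffusion tensor. Two of the most common DTI-derived parametric maps are fractional anisotropy (FA) and mean diffusivity (MD); the abstract does not say which maps AID-DTI targets, so treating FA and MD as examples is an assumption. The sketch below shows how both are computed per voxel from the tensor eigenvalues; the function name and the example eigenvalues are illustrative.

```python
import math
from typing import Tuple


def dti_scalar_maps(eigvals: Tuple[float, float, float]) -> Tuple[float, float]:
    """Return (FA, MD) for one voxel given the three diffusion-tensor eigenvalues."""
    l1, l2, l3 = eigvals
    md = (l1 + l2 + l3) / 3.0  # mean diffusivity
    norm = math.sqrt(l1 ** 2 + l2 ** 2 + l3 ** 2)
    if norm == 0.0:
        return 0.0, md  # empty tensor: FA defined as 0 here
    spread = math.sqrt((l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2)
    fa = math.sqrt(0.5) * spread / norm  # in [0, 1] for non-negative eigenvalues
    return fa, md


# Hypothetical white-matter-like voxel (eigenvalues in mm^2/s)
fa, md = dti_scalar_maps((1.7e-3, 0.3e-3, 0.3e-3))
print(f"FA={fa:.2f}, MD={md:.2e}")
```

Because FA is a ratio of eigenvalue differences, noise that perturbs the smaller eigenvalues can bias it noticeably, which is one reason noise suppression matters when only six measurements are available.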

preprint | 2024 | arXiv

LESEN: Label-Efficient deep learning for Multi-parametric MRI-based Visual Pathway Segmentation

Recent research has shown the potential of deep learning in multi-parametric MRI-based visual pathway (VP) segmentation. However, obtaining labeled data for training is laborious and time-consuming. Therefore, it is crucial to develop effective algorithms for settings with limited labeled samples. In this work, we propose a label-efficient deep learning method with self-ensembling (LESEN). LESEN incorporates supervised and unsupervised losses, enabling the student and teacher models to learn mutually from each other and forming a self-ensembling mean teacher framework. Additionally, we introduce a reliable unlabeled sample selection (RUSS) mechanism to further enhance LESEN's effectiveness. Our experiments on the Human Connectome Project (HCP) dataset demonstrate the superior performance of our method compared with state-of-the-art techniques, advancing multimodal VP segmentation for comprehensive analysis in clinical and research settings. The implementation code will be available at: https://github.com/aldiak/Semi-Supervised-Multimodal-Visual-Pathway-Delineation.
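
In the standard mean teacher scheme (Tarvainen and Valpola), the teacher's weights are an exponential moving average (EMA) of the student's, and an unsupervised consistency loss pulls the student's predictions on unlabeled data toward the teacher's. The Python sketch below shows those two pieces under that assumption; LESEN's exact losses, its weighting schedule, and the RUSS selection step are not described in the abstract and are not modeled. The flat parameter dictionary and all names are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical flat parameter store: name -> values


def ema_update(teacher: Params, student: Params, alpha: float = 0.99) -> Params:
    """Mean-teacher step: each teacher weight becomes alpha * teacher + (1 - alpha) * student."""
    return {
        name: [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def consistency_loss(student_probs: List[float], teacher_probs: List[float]) -> float:
    """Unsupervised term: mean squared difference between the two models' predictions."""
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / len(student_probs)


def total_loss(supervised: float, consistency: float, weight: float) -> float:
    """Supervised loss on labeled data plus weighted consistency loss on unlabeled data."""
    return supervised + weight * consistency


teacher = {"conv1": [0.0, 1.0]}
student = {"conv1": [1.0, 1.0]}  # pretend this came out of a gradient step
teacher = ema_update(teacher, student, alpha=0.9)
print(teacher)
```

Only the student receives gradients; the teacher changes solely through `ema_update`, which is what makes it a temporal ensemble of past students.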

preprint | 2024 | arXiv

MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning

Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs. However, the number of image-text pairs in medical datasets is usually orders of magnitude smaller than in natural-image datasets. Moreover, medical image-text pairs often involve numerous complex fine-grained correspondences. This paper aims to improve data efficiency by introducing many-to-many local relationship modeling to capture denser supervision. More specifically, we propose a Medical Language-Image Pre-training (MLIP) framework, which exploits limited medical image-text data more efficiently through patch-sentence matching. Furthermore, we introduce a masked contrastive learning strategy with semantic integrity estimation to reduce redundancy in images while preserving the underlying semantics. Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
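
Contrastive language-image pre-training of the kind MLIP builds on is usually trained with a symmetric InfoNCE loss: in a batch of matched pairs, each image must pick out its own text among all texts, and vice versa. The sketch below shows only that generic CLIP-style loss over whole-sample embeddings; MLIP's patch-sentence matching, masking, and semantic integrity estimation are not modeled, and the temperature `tau = 0.07` and all function names are assumptions.

```python
import math
from typing import List

Vec = List[float]


def cosine(u: Vec, v: Vec) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_cross_entropy_diag(logits: List[Vec]) -> float:
    """Row-wise softmax cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(img: List[Vec], txt: List[Vec], tau: float = 0.07) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss for n matched pairs."""
    n = len(img)
    logits = [[cosine(img[i], txt[j]) / tau for j in range(n)] for i in range(n)]
    transposed = [[logits[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (mean_cross_entropy_diag(logits) + mean_cross_entropy_diag(transposed))


images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[0.9, 0.1], [0.1, 0.9]]
texts_swapped = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(images, texts_matched), info_nce(images, texts_swapped))
```

With few pairs per batch, each image sees few negatives; this scarcity is the data-efficiency problem the abstract motivates with local, many-to-many matching.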

preprint | 2024 | arXiv

Modality Exchange Network for Retinogeniculate Visual Pathway Segmentation

Accurate segmentation of the retinogeniculate visual pathway (RGVP) aids in the diagnosis and treatment of visual disorders by identifying disruptions or abnormalities within the pathway. However, the complex anatomical structure and connectivity of the RGVP make accurate segmentation challenging. In this study, we propose a novel Modality Exchange Network (ME-Net) that effectively utilizes multi-modal magnetic resonance (MR) imaging information to enhance RGVP segmentation. ME-Net makes two main contributions. First, we introduce an effective multi-modal soft-exchange technique. Specifically, we design a channel and spatially mixed attention module to exchange modality information between T1-weighted and fractional anisotropy MR images. Second, we propose a cross-fusion module that further enhances the fusion of information between the two modalities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches in RGVP segmentation performance.

preprint | 2024 | arXiv

Simultaneous q-Space Sampling Optimization and Reconstruction for Fast and High-fidelity Diffusion Magnetic Resonance Imaging

Diffusion Magnetic Resonance Imaging (dMRI) plays a crucial role in the noninvasive investigation of tissue microstructural properties and structural connectivity in the in vivo human brain. However, to effectively capture the intricate characteristics of water diffusion in various directions and at various scales, comprehensive q-space sampling is required. Unfortunately, this requirement leads to long scan times, limiting the clinical applicability of dMRI. To address this challenge, we propose SSOR, a Simultaneous q-Space sampling Optimization and Reconstruction framework. We jointly optimize a subset of q-space samples, using a continuous representation of spherical harmonic functions, and a reconstruction network. Additionally, we integrate the unique properties of dMRI in both the q-space and image domains by applying ℓ1-norm and total-variation regularization. Experiments conducted on HCP data demonstrate that SSOR performs well both quantitatively and qualitatively and is robust to noise.
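
As a rough illustration of the two regularizers named above, the sketch below evaluates an ℓ1 penalty on a coefficient vector and an anisotropic total-variation penalty on a 2D image, then combines them with weights. Applying the ℓ1 term to q-space (e.g. spherical harmonic) coefficients and TV to image slices, the weights `lam_l1` and `lam_tv`, and the anisotropic TV variant are all assumptions; the abstract does not give SSOR's exact formulation.

```python
from typing import List


def l1_norm(coeffs: List[float]) -> float:
    """Sum of absolute values; promotes sparse coefficients."""
    return sum(abs(c) for c in coeffs)


def anisotropic_tv(img: List[List[float]]) -> float:
    """Sum of absolute vertical and horizontal neighbor differences; favors piecewise-smooth images."""
    rows, cols = len(img), len(img[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < cols:
                tv += abs(img[i][j + 1] - img[i][j])
    return tv


def regularizer(coeffs: List[float], img: List[List[float]], lam_l1: float = 1e-3, lam_tv: float = 1e-2) -> float:
    """Weighted sum added to a reconstruction data-fidelity loss (hypothetical weights)."""
    return lam_l1 * l1_norm(coeffs) + lam_tv * anisotropic_tv(img)


print(regularizer([1.0, -2.0, 0.5], [[0.0, 0.0], [0.0, 1.0]], lam_l1=1.0, lam_tv=1.0))
```

In training, this value would be added to the data-consistency loss so that the optimizer trades fidelity to the acquired samples against sparsity and smoothness.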

preprint | 2022 | arXiv

Artificial Intelligence Enabled Spectral Reconfigurable Fiber Laser

Combining artificial intelligence with lasers provides powerful ways to form smart light sources with groundbreaking functions. Here, a Raman fiber laser (RFL) with reconfigurable and programmable spectra over an ultra-wide bandwidth is developed based on spectral-spatial manipulation of light in multimode fiber (MMF). The proposed fiber laser uses nonlinear gain from cascaded stimulated Raman scattering, random distributed feedback from Rayleigh scattering, and point feedback from an MMF-based smart spectral filter. Through wavefront shaping controlled by a genetic algorithm, light of selected wavelength(s) can be focused in the MMF, forming the filter that, together with the active part of the laser, actively shapes the output spectrum with a high degree of freedom. We achieved arbitrary spectral shaping of the cascaded RFL (e.g., continuously tunable single-wavelength and multi-wavelength lasing with customizable linewidth, mode separation, and power distribution) from the 1st- to the 3rd-order Stokes emission by adjusting the pump power and auto-optimizing the smart filter. Our research uses artificial-intelligence-controlled light manipulation in a fiber platform with multiple eigenmodes and nonlinear gain, mapping spatial control into the spectral domain and extending the linear control of light in MMF to active light emission, which is of great significance for applications in optical communication, sensing, and spectroscopy.

preprint | 2022 | arXiv

Collective behavior in the North Rhine-Westphalia motorway network

To understand the dynamics on complex networks, measuring correlations is indispensable. In a motorway network, it is not sufficient to collect information on fluxes and velocities on all individual links, i.e., the parts of the freeways between ramps and motorway interchanges. The interdependencies and mutual connections are also of considerable interest. We analyze correlations in the complete motorway network of North Rhine-Westphalia, the most populous state in Germany. We view the motorway network as a complex system consisting of road sections that interact via the motion of vehicles, implying structures in the corresponding correlation matrices. In particular, we focus on collective behavior, i.e., coherent motion in the whole network or in large parts of it. To this end, we study the eigenvalue and eigenvector statistics and identify significant sections in the motorway network. We find collective behavior in these significant sections and further explore its causes. We show that collectivity throughout the network cannot be directly related to the traffic states (free, synchronous, and congested) of Kerner's three-phase theory. Hence, the degree of collectivity provides a new, complementary observable for characterizing the motorway network.
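
A common way to look for collective behavior in correlation matrices is to check whether the largest eigenvalue rises far above what uncorrelated data would produce, and whether its eigenvector has components of the same sign on many sections. The sketch below builds a Pearson correlation matrix from per-section time series, finds the top eigenpair by power iteration, and computes the Marchenko-Pastur upper edge as a random-matrix reference. Using that particular reference, and all names here, are illustrative assumptions rather than the paper's stated procedure.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def correlation_matrix(series: List[List[float]]) -> Matrix:
    """Pearson correlation between every pair of equally long time series."""
    t = len(series[0])
    z = []
    for x in series:
        mu = sum(x) / t
        sd = math.sqrt(sum((v - mu) ** 2 for v in x) / t)
        z.append([(v - mu) / sd for v in x])  # assumes no constant series
    n = len(z)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / t for j in range(n)] for i in range(n)]


def largest_eigenpair(c: Matrix, iters: int = 500) -> Tuple[float, List[float]]:
    """Power iteration; a correlation matrix is positive semidefinite, so this targets its largest eigenvalue."""
    n = len(c)
    v = [float(i + 1) for i in range(n)]  # non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v  # start vector was annihilated by the matrix
        v = [x / norm for x in w]
    cv = [sum(c[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, cv)), v  # Rayleigh quotient


def mp_upper_edge(n_series: int, n_samples: int) -> float:
    """Marchenko-Pastur upper edge for N uncorrelated series of length T."""
    return (1.0 + math.sqrt(n_series / n_samples)) ** 2


# Hypothetical flows on three road sections over four time steps
flows = [[1, 2, 3, 4], [2, 4, 6, 8], [0, 1, 2, 3]]
lam, vec = largest_eigenpair(correlation_matrix(flows))
print(lam, vec, mp_upper_edge(3, 4))
```

With real data, eigenvalues well above `mp_upper_edge(N, T)` are candidates for genuine structure, and sections with large components in the corresponding eigenvectors are candidates for "significant sections".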

preprint | 2022 | arXiv

DASP: Defect and Dopant ab-initio Simulation Package

To perform automated calculations of defect and dopant properties in semiconductors and insulators, we developed a software package, the Defect and Dopant ab-initio Simulation Package (DASP), which is composed of four modules for calculating: (i) elemental chemical potentials, (ii) defect (dopant) formation energies and transition energy levels, (iii) defect and carrier densities, and (iv) carrier dynamics properties of high-density defects. DASP uses the materials genome database to quickly determine competing secondary phases and calculate the energy above the convex hull when computing the elemental chemical potentials that stabilize compound semiconductors, so it can perform high-throughput prediction of the thermodynamic stability of multinary compounds. DASP calls ab initio software to perform total-energy, structural-relaxation, and electronic-structure calculations of defect supercells with different structural configurations and charge states; from these, the defect formation energies and transition energy levels are calculated, and corrections for electrostatic potential alignment and image-charge interaction can be included. DASP can then calculate the equilibrium densities of defects and of electron and hole carriers, as well as the Fermi level, in semiconductors under different chemical potential conditions and different growth/working temperatures. For high-density defects, DASP can calculate carrier dynamics properties such as the photoluminescence (PL) spectrum, defect-related radiative and non-radiative carrier capture cross sections, and the recombination lifetime of non-equilibrium carriers.
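
For orientation, the quantities that module (ii) reports are usually written as below. This is the standard textbook form, not necessarily DASP's exact convention: $n_i$ is the number of atoms of species $i$ added to ($n_i > 0$) or removed from ($n_i < 0$) the supercell, $\mu_i$ is its chemical potential, $E_F$ is measured from the valence band maximum $E_{\mathrm{VBM}}$, and $E_{\mathrm{corr}}$ stands in for whichever potential-alignment and image-charge corrections are applied.

```latex
\Delta H_f(\alpha, q) = E(\alpha, q) - E(\mathrm{host}) - \sum_i n_i \mu_i + q\,(E_{\mathrm{VBM}} + E_F) + E_{\mathrm{corr}}

\varepsilon(q/q') = \frac{\Delta H_f(\alpha, q;\, E_F = 0) - \Delta H_f(\alpha, q';\, E_F = 0)}{q' - q}
```

The second line follows from the first: each charge state's formation energy is linear in $E_F$ with slope $q$, so setting $\Delta H_f(\alpha, q) = \Delta H_f(\alpha, q')$ and solving for $E_F$ gives the transition level.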

preprint | 2022 | arXiv

Expert Knowledge-guided Geometric Representation Learning for Magnetic Resonance Imaging-based Glioma Grading

Radiomics and deep learning have become popular for automatic glioma grading. Radiomics can extract hand-crafted features that quantitatively describe expert knowledge of glioma grades, and deep learning is powerful in extracting a large number of high-throughput features that facilitate the final classification. However, the performance of existing methods can still be improved, as their complementary strengths have not been sufficiently investigated and integrated. Furthermore, lesion maps are usually needed for the final prediction at the testing phase, which is cumbersome. In this paper, we propose an expert knowledge-guided geometric representation learning (ENROL) framework. Geometric manifolds of hand-crafted features and learned features are constructed to mine the implicit relationship between deep learning and radiomics, and thereby to uncover the shared, essential representation of glioma grades. With a specially designed manifold discrepancy measurement, the grading model can exploit the input image data and expert knowledge more effectively in the training phase and removes the need for lesion segmentation maps at the testing phase. The proposed framework is flexible with respect to the deep learning architecture used. Three different architectures have been evaluated and five models compared, and the results show that our framework consistently yields promising results.

preprint | 2022 | arXiv

K-space and Image Domain Collaborative Energy based Model for Parallel MRI Reconstruction

Decreasing magnetic resonance (MR) image acquisition times can potentially make MR examinations more accessible. Prior work, including deep learning models, has been devoted to solving the problem of long MRI acquisition times. Recently, deep generative models have shown great potential in terms of algorithm robustness and usage flexibility. Nevertheless, none of the existing schemes can be learned from or applied to k-space measurements directly. Furthermore, how well deep generative models work in a hybrid domain is also worth investigating. In this work, by taking advantage of deep energy-based models, we propose a k-space and image domain collaborative generative model to comprehensively estimate the MR data from under-sampled measurements. Experimental comparisons with state-of-the-art methods demonstrated that the proposed hybrid method achieves lower reconstruction error and is more stable under different acceleration factors.

preprint | 2022 | arXiv

Mining Function Homology of Bot Loaders from Honeypot Logs

Self-contained loaders are widely adopted in botnets for injecting loading commands and spawning new bots. While researchers can dissect bot clients to obtain various information about botnets, the cloud-based and self-contained design of loaders effectively prevents researchers from understanding the loaders' evolution and variation using classic methods. The decoupled nature of bot loaders also dramatically reduces the feasibility of investigating relationships among clients and infrastructures. In this paper, we propose a text-based method to investigate and analyze details of bot loaders using honeypots. We leverage high-interaction honeypots to collect request logs and define eight families of bot loaders based on the results of agglomerative clustering. At the function level, we push our study further to explore their homological relationships based on similarity analysis of request logs using sequence alignment techniques. This further exploration reveals that the released code of Mirai keeps spawning new generations of botnets on both the client and the server side. This paper uncovers the homology of active botnet infrastructures, providing a new perspective on finding covert relationships among cybercrimes. Bot loaders are investigated precisely at the function level, yielding new insight that helps researchers identify botnet infrastructures and track their evolution over time.
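
The abstract does not name the alignment algorithm. As one plausible reading, the sketch below scores two request logs with Needleman-Wunsch global alignment over command tokens, so that sessions sharing the same steps in the same order score highly even when individual commands differ. The scoring values, the function name, and the example commands are hypothetical.

```python
from typing import List


def global_alignment_score(a: List[str], b: List[str], match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch score between two token sequences (higher means more similar)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]


# Hypothetical command sequences extracted from two honeypot sessions
log_a = ["cd /tmp", "wget", "chmod +x", "./bot"]
log_b = ["cd /tmp", "tftp", "chmod +x", "./bot"]
print(global_alignment_score(log_a, log_b))  # 2
```

A pairwise score like this could be turned into a distance (for example, by normalizing by sequence length) and fed to agglomerative clustering, though the paper's actual features and distance are not stated.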

preprint | 2022 | arXiv

Multi-Weight Respecification of Scan-specific Learning for Parallel Imaging

Parallel imaging is widely used in magnetic resonance imaging as an acceleration technology. Traditional linear reconstruction methods in parallel imaging often suffer from noise amplification. Recently, a non-linear robust artificial-neural-network for k-space interpolation (RAKI) has exhibited superior noise resilience compared with linear methods. However, RAKI performs poorly at high acceleration rates and needs a large amount of autocalibration signal as training samples. To tackle these issues, we propose a multi-weight method, named MW-RAKI, that applies multiple weighting matrices to the undersampled data. Enforcing multiple weighting matrices on the measurements can effectively reduce the influence of noise and increase the data constraints. Furthermore, we incorporate the strategy of multiple weighting matrices into a residual version of RAKI, forming MW-rRAKI. Experimental comparisons with alternative methods demonstrated noticeably better reconstruction performance, particularly at high acceleration rates.

preprint | 2022 | arXiv

Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

Pre-trained language models (PLMs) have demonstrated their effectiveness on a broad range of information retrieval and natural language processing tasks. As the core part of a PLM, multi-head self-attention is appealing for its ability to jointly attend to information from different positions. However, researchers have found that PLMs always exhibit fixed attention patterns regardless of the input (e.g., excessively attending to [CLS] or [SEP]), which we argue might neglect important information in other positions. In this work, we propose a simple yet effective attention guiding mechanism that improves the performance of PLMs by encouraging attention towards established goals. Specifically, we propose two kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG). The former explicitly encourages diversity among multiple self-attention heads so that they jointly attend to information from different representation subspaces, while the latter encourages self-attention to attend to as many different positions of the input as possible. We conduct experiments with multiple general pre-trained models (i.e., BERT, ALBERT, and RoBERTa) and domain-specific pre-trained models (i.e., BioBERT, ClinicalBERT, BlueBERT, and SciBERT) on three benchmark datasets (i.e., MultiNLI, MedNLI, and Cross-genre-IR). Extensive experimental results demonstrate that the proposed MDG and PDG bring stable performance improvements on all datasets with high efficiency and low cost.

preprint | 2022 | arXiv

Rethinking the optimization process for self-supervised model-driven MRI reconstruction

Recovering high-quality images from undersampled measurements is critical for accelerated MRI reconstruction. Recently, various supervised deep learning-based MRI reconstruction methods have been developed. Despite their promising performance, these methods require fully sampled reference data, whose acquisition is resource-intensive and time-consuming. Self-supervised learning has emerged as a promising solution to alleviate the reliance on fully sampled datasets. However, existing self-supervised methods suffer from reconstruction errors due to the insufficient constraint enforced on the non-sampled data points and the error accumulation that occurs during the iterative image reconstruction process of model-driven deep learning reconstructions. To address these challenges, we propose K2Calibrate, a K-space adaptation strategy for self-supervised model-driven MR reconstruction optimization. By iteratively calibrating the learned measurements, K2Calibrate can reduce the network's reconstruction deterioration caused by statistically dependent noise. Extensive experiments have been conducted on the open-source FastMRI dataset, and K2Calibrate achieves better results than five state-of-the-art methods. The proposed K2Calibrate is plug-and-play and can be easily integrated with different model-driven deep learning reconstruction methods.

preprint | 2022 | arXiv

SelfCoLearn: Self-supervised collaborative learning for accelerating dynamic MR imaging

Recently, deep learning has been extensively investigated for accelerating dynamic magnetic resonance (MR) imaging, with encouraging progress. However, without fully sampled reference data for training, current approaches may have limited ability to recover fine details or structures. To address this challenge, this paper proposes a self-supervised collaborative learning framework (SelfCoLearn) for accurate dynamic MR image reconstruction from undersampled k-space data. The proposed framework is equipped with three important components, namely, dual-network collaborative learning, re-undersampling data augmentation, and a specially designed co-training loss. The framework is flexible enough to be integrated with both data-driven networks and model-based iterative unrolled networks. Our method has been evaluated on an in vivo dataset and compared with four state-of-the-art methods. Results show that our method has strong capabilities in capturing essential and inherent representations for direct reconstruction from undersampled k-space data and thus enables high-quality and fast dynamic MR imaging.

preprint | 2022 | arXiv

Spatial Correlation Analysis of Traffic Flow on Parallel Motorways in Germany

Using the widely adopted method of correlation matrix analysis, this study reveals the change of traffic states on parallel motorways in North Rhine-Westphalia, Germany. Based on time series of traffic flow and velocity, we carry out a quantitative analysis of correlations and reveal a high level of strongly positive traffic flow correlation and rich structural features in the corresponding correlation matrices. The strong correlation is mainly ascribed to the daily time evolution of traffic flow during rush hours and non-rush hours. In terms of free flow and congestion, the structural features capture the average traffic situation derived from our data. Furthermore, the structural features in the correlation matrices for individual time periods corroborate our results from the correlation matrices covering a whole day. The average correlations in traffic flows and velocities over all pairs of sections disclose the traffic behavior during each individual time period. Our contribution uncovers the potential of correlation analysis for studying traffic networks as complex systems.

preprint | 2022 | arXiv

Specificity-Preserving Federated Learning for MR Image Reconstruction

Federated learning (FL) can be used to improve data privacy and efficiency in magnetic resonance (MR) image reconstruction by enabling multiple institutions to collaborate without needing to aggregate local data. However, the domain shift caused by different MR imaging protocols can substantially degrade the performance of FL models. Recent FL techniques tend to solve this by enhancing the generalization of the global model, but they ignore the domain-specific features, which may contain important information about the device properties and be useful for local reconstruction. In this paper, we propose a specificity-preserving FL algorithm for MR image reconstruction (FedMRI). The core idea is to divide the MR reconstruction model into two parts: a globally shared encoder to obtain a generalized representation at the global level, and a client-specific decoder to preserve the domain-specific properties of each client, which is important for collaborative reconstruction when the clients have unique distributions. This scheme is then executed in the frequency space and the image space respectively, allowing generalized representations and client-specific properties to be explored simultaneously in different spaces. Moreover, to further boost the convergence of the globally shared encoder when a domain shift is present, a weighted contrastive regularization is introduced to directly correct any deviation between the client and server during optimization. Extensive experiments demonstrate that FedMRI's reconstructions are the closest to the ground truth for multi-institutional data and that it outperforms state-of-the-art FL methods.
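
The encoder/decoder split can be illustrated with one server round in which only the encoder is averaged across clients, weighted by local sample counts in the FedAvg style, while each client keeps its own decoder untouched. This is a minimal sketch under that assumption; the weighted contrastive regularization and the dual frequency/image-space execution are not modeled, and the parameter layout and function names are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical flat parameter store: name -> values


def aggregate_shared_encoder(encoders: List[Params], num_samples: List[int]) -> Params:
    """Sample-weighted average of the clients' encoder parameters (FedAvg-style)."""
    total = sum(num_samples)
    return {
        name: [
            sum(enc[name][k] * n for enc, n in zip(encoders, num_samples)) / total
            for k in range(len(encoders[0][name]))
        ]
        for name in encoders[0]
    }


def sync_clients(clients: List[Dict[str, Params]], num_samples: List[int]) -> List[Dict[str, Params]]:
    """Broadcast the averaged encoder; client-specific decoders never leave the client."""
    global_encoder = aggregate_shared_encoder([c["encoder"] for c in clients], num_samples)
    for c in clients:
        c["encoder"] = {name: list(values) for name, values in global_encoder.items()}
    return clients


clients = [
    {"encoder": {"conv": [1.0, 2.0]}, "decoder": {"out": [9.0]}},
    {"encoder": {"conv": [3.0, 4.0]}, "decoder": {"out": [-9.0]}},
]
print(sync_clients(clients, num_samples=[1, 3]))
```

Keeping the decoders out of aggregation is what lets each site retain its protocol-specific behavior while still benefiting from a shared representation.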

preprint | 2022 | arXiv

Universal Generative Modeling for Calibration-free Parallel MR Imaging

The integration of compressed sensing and parallel imaging (CS-PI) provides a robust mechanism for accelerating MRI acquisitions. However, most such strategies require the explicit formation of either coil sensitivity profiles or a cross-coil correlation operator, and as a result, reconstruction corresponds to solving a challenging bilinear optimization problem. In this work, we present an unsupervised deep learning framework for calibration-free parallel MRI, coined universal generative modeling for parallel imaging (UGM-PI). More precisely, we exploit the merits of both the wavelet transform and an adaptive iteration strategy in a unified framework. We train a powerful noise conditional score network by forming a wavelet tensor as the network input during the training phase. Experimental results on both physical phantom and in vivo datasets indicate that the proposed method is comparable and even superior to state-of-the-art CS-PI reconstruction approaches.

preprint | 2022 | arXiv

Variable Augmented Network for Invertible MR Coil Compression

Using a large number of coils can enhance the signal-to-noise ratio and improve imaging performance in parallel imaging. Nevertheless, the growing number of coils also aggravates data storage costs and slows reconstruction, especially for some iterative reconstructions. Coil compression addresses these issues by generating fewer virtual coils. In this work, a novel variable augmentation network for invertible coil compression, termed VAN-ICC, is presented. It utilizes the inherent reversibility of normalizing flow-based models for high-precision compression and invertible recovery. By applying variable augmentation to image/k-space variables from multiple coils, VAN-ICC trains invertible networks by finding an invertible and bijective function that maps the original data to its compressed counterpart and vice versa. Experiments conducted on both fully sampled and under-sampled data verified the effectiveness and flexibility of VAN-ICC. Quantitative and qualitative comparisons with traditional non-deep-learning approaches demonstrated that VAN-ICC achieves much better compression performance. Additionally, its performance is insensitive to the number of virtual coils.
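
Normalizing-flow networks are invertible by construction because they are built from layers whose inverse is available in closed form. The sketch below shows one such building block, an additive coupling layer in the style of NICE, chosen here purely for illustration: half of the vector passes through unchanged and the other half is shifted by an arbitrary function of the first half, so the shift can be subtracted again exactly. The shift function and names are hypothetical, and VAN-ICC's actual architecture and variable augmentation are not modeled.

```python
import math
from typing import Callable, List

Vec = List[float]


def shift_net(x1: Vec) -> Vec:
    """Hypothetical stand-in for a neural network; it never needs to be inverted itself."""
    return [math.tanh(2.0 * v) + 0.5 * v * v for v in x1]


def coupling_forward(x: Vec, m: Callable[[Vec], Vec] = shift_net) -> Vec:
    """y1 = x1, y2 = x2 + m(x1)."""
    assert len(x) % 2 == 0, "this sketch splits the vector into equal halves"
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [a + b for a, b in zip(x2, m(x1))]


def coupling_inverse(y: Vec, m: Callable[[Vec], Vec] = shift_net) -> Vec:
    """x1 = y1, x2 = y2 - m(y1): exact inverse because y1 equals x1."""
    assert len(y) % 2 == 0, "this sketch splits the vector into equal halves"
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [a - b for a, b in zip(y2, m(y1))]


x = [0.3, -1.2, 2.0, 0.7]
y = coupling_forward(x)
print(y, coupling_inverse(y))
```

Because such layers preserve dimensionality, an invertible map cannot by itself shrink the number of coils; presumably this is why VAN-ICC pairs the flow with variable augmentation, although the abstract does not spell out the mechanism.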

preprint | 2021 | arXiv

A coarse-to-fine framework for unsupervised multi-contrast MR image deformable registration with dual consistency constraint

Multi-contrast magnetic resonance (MR) image registration is useful in the clinic for fast and accurate imaging-based disease diagnosis and treatment planning. Nevertheless, the efficiency and performance of existing registration algorithms can still be improved. In this paper, we propose a novel unsupervised learning-based framework to achieve accurate and efficient multi-contrast MR image registration. Specifically, an end-to-end coarse-to-fine network architecture consisting of affine and deformable transformations is designed to improve robustness and achieve end-to-end registration. Furthermore, a dual consistency constraint and a new prior knowledge-based loss function are developed to enhance registration performance. The proposed method has been evaluated on a clinical dataset containing 555 cases, with encouraging results. Compared with commonly used registration methods, including VoxelMorph, SyN, and LT-Net, the proposed method achieves better registration performance, with a Dice score of 0.8397 in identifying stroke lesions. With regard to registration speed, our method is about 10 times faster than the most competitive method, SyN (Affine), when tested on a CPU. Moreover, we show that our method still performs well on more challenging tasks in which scanning information is lacking, demonstrating high robustness for clinical application.
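
The Dice score quoted above measures overlap between two binary masks, here presumably the lesion mask carried through the estimated deformation and the reference lesion mask: twice the intersection divided by the total foreground of both masks. The minimal sketch below uses small hypothetical 2D masks in place of 3D volumes.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = lesion


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * inter / total


warped_lesion = [[1, 1], [0, 0]]     # hypothetical mask after registration
reference_lesion = [[1, 0], [0, 0]]  # hypothetical ground-truth mask
print(dice_score(warped_lesion, reference_lesion))
```

A value of 1.0 means perfect overlap and 0.0 means none.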

preprint | 2021 | arXiv

A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis

This paper introduces a curated dataset of urban scenes for audio-visual scene analysis, consisting of carefully selected and recorded material. The data were recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and are openly available. We also present a case study on audio-visual scene recognition and show that joint modeling of the audio and visual modalities brings a significant performance gain compared with state-of-the-art unimodal systems. Our approach obtained 84.8% accuracy, compared with 75.8% for the audio-only and 68.4% for the video-only equivalent systems.

preprint2021arXiv

Homotopic Gradients of Generative Density Priors for MR Image Reconstruction

Deep learning, particularly generative modeling, has recently shown great potential to speed up image reconstruction from reduced measurements. Whereas existing generative models often optimize the density priors directly, this work takes advantage of denoising score matching and proposes homotopic gradients of generative density priors (HGGDP) for magnetic resonance imaging (MRI) reconstruction. More precisely, to tackle the low-dimensional manifold and low-data-density region issues of the generative density prior, we estimate the target gradients in a higher-dimensional space. In the training phase, we train a more powerful noise conditional score network by forming a high-dimensional tensor as the network input, and we inject additional artificial noise into the embedding space. In the reconstruction stage, a homotopy method is employed to pursue the density prior and thereby boost reconstruction performance. Experimental results demonstrate the high reconstruction accuracy of HGGDP: with only 10% of the k-space data, it still generates images whose quality is comparable to standard MRI reconstruction from fully sampled data.
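
HGGDP builds on noise conditional score networks, whose samples are typically drawn with annealed Langevin dynamics: Langevin steps are run at a sequence of decreasing noise levels, each level warm-starting the next. The sketch below shows only that generic sampler on a 1D toy problem. The trained network is replaced by the analytic score of a Gaussian (an assumption made so the example is self-contained), the step-size rule and noise schedule are common NCSN choices rather than HGGDP's settings, and the high-dimensional tensor input and k-space data-consistency step are not modeled.

```python
import math
import random
import statistics


def gaussian_score(x, sigma, mu=2.0, s=0.5):
    # Stand-in for a trained score network: the exact score d/dx log p_sigma(x)
    # of the toy data distribution N(mu, s^2) perturbed by N(0, sigma^2) noise.
    return -(x - mu) / (s * s + sigma * sigma)


def annealed_langevin(score, sigmas, steps_per_level=100, eps=2e-5, x0=0.0, rng=None):
    """Annealed Langevin dynamics: Langevin steps at each noise level, largest level first."""
    rng = rng or random.Random(0)
    x = x0
    for sigma in sigmas:
        # Step size shrinks with the noise level (schedule popularised by NCSN).
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            x += 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
    return x


# Geometric noise schedule from 10 down to 0.01.
sigmas = [10.0 * (0.001 ** (i / 9)) for i in range(10)]
samples = [annealed_langevin(gaussian_score, sigmas, rng=random.Random(seed)) for seed in range(200)]
print(statistics.mean(samples), statistics.stdev(samples))
```

The sample mean should land near mu = 2 and the spread near s = 0.5. Starting from large noise lets the chain find the high-density region quickly; the later, smaller levels then refine the samples.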

preprint2020arXiv

Bounding boxes for weakly supervised segmentation: Global constraints get close to full supervision

We propose a novel weakly supervised segmentation method based on several global constraints derived from box annotations. In particular, we bring a classical tightness prior into a deep learning setting by imposing a set of constraints on the network outputs. This powerful topological prior prevents solutions from shrinking excessively by requiring every horizontal or vertical line within the bounding box to contain at least one pixel of the foreground region. Furthermore, we integrate our deep tightness prior with a global background emptiness constraint, which guides training with information from outside the bounding box. We demonstrate experimentally that such a global constraint is much more powerful than standard cross-entropy for the background class. The resulting optimization problem is challenging, as it takes the form of a large set of inequality constraints on the outputs of deep networks. We solve it with a sequence of unconstrained losses based on a recent powerful extension of the log-barrier method, which is well known in the context of interior-point methods. This accommodates standard stochastic gradient descent (SGD) for training deep networks while avoiding computationally expensive and unstable Lagrangian dual steps and projections. Extensive experiments on two public datasets and applications (prostate and brain lesions) demonstrate that the synergy between our global tightness and emptiness priors yields very competitive performance, approaching full supervision and significantly outperforming DeepCut. Furthermore, our approach removes the need for computationally expensive proposal generation. Our code is shared anonymously.
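
To make the constraints concrete, the sketch below turns a 2D foreground-probability map and one box into penalties. It assumes the log-barrier extension is the one of Kervadec et al., in which a constraint z ≤ 0 is penalized by -(1/t)log(-z) when z ≤ -1/t² and by a linear continuation t·z - (1/t)log(1/t²) + 1/t otherwise. As simplifying assumptions, the tightness prior uses single-pixel-wide rows and columns, all constraints are weighted equally, and t is fixed instead of being increased over training.

```python
import math


def log_barrier_ext(z, t):
    """Extended log-barrier penalty for a constraint z <= 0 with barrier parameter t."""
    if z <= -1.0 / (t * t):
        # Standard log-barrier deep inside the feasible region
        return -math.log(-z) / t
    # Linear extension: finite and differentiable even when the constraint is violated
    return t * z - math.log(1.0 / (t * t)) / t + 1.0 / t


def box_loss(probs, box, t=5.0):
    """Tightness + emptiness penalties for a 2D foreground-probability map and one box.

    box = (r0, c0, r1, c1) covers rows r0..r1-1 and columns c0..c1-1.
    """
    r0, c0, r1, c1 = box
    loss = 0.0
    # Tightness: every row and every column crossing the box holds >= 1 unit of foreground
    for r in range(r0, r1):
        loss += log_barrier_ext(1.0 - sum(probs[r][c0:c1]), t)
    for c in range(c0, c1):
        loss += log_barrier_ext(1.0 - sum(probs[r][c] for r in range(r0, r1)), t)
    # Emptiness: total foreground mass outside the box must be <= 0
    outside = sum(
        p
        for r, row in enumerate(probs)
        for c, p in enumerate(row)
        if not (r0 <= r < r1 and c0 <= c < c1)
    )
    return loss + log_barrier_ext(outside, t)


box = (1, 1, 3, 3)
tight = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
empty = [[0.0] * 4 for _ in range(4)]
print(box_loss(tight, box) < box_loss(empty, box))  # True
```

Because the linear branch keeps the penalty finite for violated constraints, the sum can be minimized directly with SGD, with no dual variables to update.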

preprint2020arXiv

Parameter-Transferred Wasserstein Generative Adversarial Network (PT-WGAN) for Low-Dose PET Image Denoising

Given the widespread use of positron emission tomography (PET) in clinical practice, the potential risk of PET-associated radiation dose to patients needs to be minimized. However, as the radiation dose is reduced, the resulting images may suffer from noise and artifacts that compromise diagnostic performance. In this paper, we propose a parameter-transferred Wasserstein generative adversarial network (PT-WGAN) for low-dose PET image denoising. The contributions of this paper are twofold: i) a PT-WGAN framework is designed to denoise low-dose PET images without compromising structural details, and ii) a task-specific initialization based on transfer learning is developed to train PT-WGAN using trainable parameters transferred from a pretrained model, which significantly improves the training efficiency of PT-WGAN. Experimental results on clinical data show that the proposed network suppresses image noise more effectively while better preserving image fidelity than recently published state-of-the-art methods. We make our code available at https://github.com/90n9-yu/PT-WGAN.
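
Contribution ii) initializes PT-WGAN from a pretrained model. In a framework such as PyTorch this is usually done by copying entries of a state_dict; the framework-free sketch below shows the idea with dictionaries of nested lists. Parameters whose name and shape match are copied, and the rest keep their fresh initialization. The matching rule and all parameter names are hypothetical, not taken from the released PT-WGAN code.

```python
import copy


def shape_of(tensor):
    """Shape of a nested-list tensor, e.g. [[1, 2, 3], [4, 5, 6]] -> (2, 3)."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(shape)


def transfer_parameters(pretrained, target):
    """Copy every pretrained tensor whose name and shape match into `target` (in place).

    Returns the transferred names; all other entries keep their fresh initialisation.
    """
    transferred = []
    for name, value in pretrained.items():
        if name in target and shape_of(target[name]) == shape_of(value):
            target[name] = copy.deepcopy(value)  # avoid sharing storage with the source
            transferred.append(name)
    return transferred


# Hypothetical parameter names: the convolution matches, the output layer does not.
pretrained = {"conv1.weight": [[0.1, 0.2], [0.3, 0.4]], "conv1.bias": [0.0, 0.1], "out.weight": [[1.0]]}
generator = {"conv1.weight": [[0.0, 0.0], [0.0, 0.0]], "conv1.bias": [0.0, 0.0], "out.weight": [[0.0, 0.0]]}
print(transfer_parameters(pretrained, generator))  # ['conv1.weight', 'conv1.bias']
```

Copying by name and shape lets the shared layers start from pretrained values while layers that differ between tasks train from scratch, which is one plausible source of the faster training the abstract reports.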

preprint2020arXiv

Self-adaptive Re-weighted Adversarial Domain Adaptation

Existing adversarial domain adaptation methods mainly consider the marginal distribution, which may lead to either under-transfer or negative transfer. To address this problem, we present a self-adaptive re-weighted adversarial domain adaptation approach that enhances domain alignment from the perspective of the conditional distribution. To promote positive transfer and combat negative transfer, we reduce the weight of the adversarial loss for well-aligned features while increasing the adversarial force for poorly aligned ones, with alignment measured by the conditional entropy. Additionally, a triplet loss leveraging source samples and pseudo-labeled target samples is employed on the confusing domain. This metric loss keeps intra-class sample pairs closer than inter-class pairs, achieving class-level alignment. In this way, highly accurate pseudo-labeled target samples and semantic alignment can be captured simultaneously in the co-training process. Our method achieves a low joint error of the ideal source and target hypothesis, so the expected target error can be upper bounded following Ben-David's theorem. Empirical evidence demonstrates that the proposed model outperforms state-of-the-art methods on standard domain adaptation datasets.
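
The two ingredients named above can be sketched in a few lines. The abstract does not give the exact re-weighting formula, so the weight below (prediction entropy normalized to [0, 1]) is a hypothetical choice that only reproduces the stated behavior: confident, presumably well-aligned samples get a small adversarial weight and uncertain ones a large weight. The triplet loss is the standard hinge form on Euclidean distances.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def adversarial_weight(probs):
    # Hypothetical weighting: normalised entropy in [0, 1]. Confident predictions
    # (treated as well aligned) get a small weight, uncertain ones a large weight.
    return entropy(probs) / math.log(len(probs))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: the same-class pair must be closer than the different-class pair by `margin`."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)


print(adversarial_weight([0.98, 0.01, 0.01]) < adversarial_weight([0.4, 0.3, 0.3]))  # True
print(triplet_loss((0.0, 0.0), (2.0, 0.0), (1.0, 0.0)))  # 2.0
```

In the second call the positive is farther from the anchor than the negative, so the loss is positive and its gradient would pull the same-class pair together.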

preprint2019arXiv

CLCI-Net: Cross-Level fusion and Context Inference Networks for Lesion Segmentation of Chronic Stroke

Segmenting stroke lesions from T1-weighted MR images is of great value for large-scale stroke rehabilitation neuroimaging analyses. However, the task poses major challenges, such as the wide range of stroke lesion scales and the similarity of tissue intensities. The widely used encoder-decoder convolutional neural network, although highly successful in medical image segmentation, may fail to address these challenges because it makes insufficient use of multi-scale features and context information. To address these challenges, this paper proposes a Cross-Level fusion and Context Inference Network (CLCI-Net) for chronic stroke lesion segmentation from T1-weighted MR images. Specifically, a Cross-Level feature Fusion (CLF) strategy is developed to make full use of features at different scales across different levels. By extending Atrous Spatial Pyramid Pooling (ASPP) with CLF, we enrich the multi-scale features to handle different lesion sizes. In addition, convolutional long short-term memory (ConvLSTM) is employed to infer context information and thus capture fine structures, addressing the intensity similarity issue. The proposed approach was evaluated on an open-source dataset, the Anatomical Tracings of Lesions After Stroke (ATLAS), and the results show that our network outperforms five state-of-the-art methods. We make our code and models available at https://github.com/YH0517/CLCI_Net.
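
ASPP handles multiple lesion sizes by running atrous (dilated) convolutions with several dilation rates in parallel over the same features, so each branch sees a different context size at the same parameter cost. The 1D sketch below shows a dilated convolution and how the dilation rate enlarges the span a 3-tap kernel covers. The rates 6/12/18 are a common DeepLab choice used here as an example, not necessarily CLCI-Net's; the CLF fusion and ConvLSTM parts are not modeled.

```python
def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel with the given dilation rate: k + (k - 1)(d - 1)."""
    return k + (k - 1) * (dilation - 1)


def atrous_conv1d(x, kernel, dilation=1):
    """'Valid' 1D atrous (dilated) convolution, written as cross-correlation as in CNN layers."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(x) - span + 1)
    ]


print(atrous_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], dilation=2))  # [9, 12, 15]
# ASPP-style branches: the same 3-tap kernel at several rates
print([effective_kernel_size(3, d) for d in (1, 6, 12, 18)])  # [3, 13, 25, 37]
```

Every branch keeps three weights, yet the widest one aggregates information over 37 positions, which is why ASPP is a cheap way to cover small and large lesions at once.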

preprint2019arXiv

D-UNet: a dimension-fusion U shape network for chronic stroke lesion segmentation

Assessing the location and extent of lesions caused by chronic stroke is critical for medical diagnosis, surgical planning, and prognosis. In recent years, with the rapid development of 2D and 3D convolutional neural networks (CNNs), the encoder-decoder structure has shown great potential in medical image segmentation. However, 2D CNNs ignore the 3D information of medical images, while 3D CNNs have high computational resource demands. This paper proposes a new architecture called dimension-fusion-UNet (D-UNet), which combines 2D and 3D convolutions in the encoding stage. The proposed architecture achieves better segmentation performance than 2D networks while requiring significantly less computation time than 3D networks. Furthermore, to alleviate the imbalance between positive and negative samples during network training, we propose a new loss function called Enhance Mixing Loss (EML), which adds a weighted focal coefficient and combines two traditional loss functions. The proposed method has been tested on the ATLAS dataset and compared with three state-of-the-art methods. The results demonstrate that the proposed method achieves the best performance, with DSC = 0.5349 ± 0.2763 and precision = 0.6331 ± 0.295.
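
The abstract says EML adds a weighted focal coefficient to a combination of two traditional losses but does not give the formula. As an illustration only, the sketch below pairs a binary focal loss with a soft Dice loss, a common way to counter foreground/background imbalance; mixed_loss and its weighting are hypothetical and may differ from the published EML.

```python
import math


def focal_bce(preds, targets, gamma=2.0, eps=1e-7):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma to down-weight easy voxels."""
    total = 0.0
    for p, y in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(preds)


def soft_dice_loss(preds, targets, smooth=1.0):
    """1 - soft Dice overlap between predicted probabilities and a binary mask."""
    inter = sum(p * y for p, y in zip(preds, targets))
    return 1.0 - (2.0 * inter + smooth) / (sum(preds) + sum(targets) + smooth)


def mixed_loss(preds, targets, focal_weight=1.0):
    # Hypothetical combination in the spirit of EML; the published formula may differ.
    return focal_weight * focal_bce(preds, targets) + soft_dice_loss(preds, targets)


# An easy, correctly classified voxel: plain BCE (gamma=0) vs focal (gamma=2)
print(round(focal_bce([0.9], [1], gamma=0.0), 5), round(focal_bce([0.9], [1]), 5))  # 0.10536 0.00105
```

The focal factor shrinks the contribution of the many easy background voxels by about 100x here, while the Dice term depends on overlap rather than voxel counts, so neither is dominated by the majority class.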

preprint2019arXiv

X-Net: Brain Stroke Lesion Segmentation Based on Depthwise Separable Convolution and Long-range Dependencies

The incidence of brain stroke has increased rapidly in the past few years. To help specialists with lesion measurement and treatment planning, automatic segmentation methods are critically needed in clinical practice. Recently, deep learning approaches and methods for contextual information extraction have been applied to many image segmentation tasks. However, their performance is limited by insufficient training of a large number of parameters, and they sometimes fail to capture long-range dependencies. To address these issues, we propose X-Net, a depthwise separable convolution-based network with a non-local operation, the Feature Similarity Module (FSM), that captures long-range dependencies. The adopted depthwise convolution reduces the network size, while the developed FSM provides more effective, dense contextual information extraction and thus facilitates better segmentation. X-Net was evaluated on the open Anatomical Tracings of Lesions After Stroke (ATLAS) dataset and achieved superior performance compared with six other state-of-the-art approaches. We make our code and models available at https://github.com/Andrewsher/X-Net.
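
The size reduction from depthwise separable convolution follows from a simple weight count: a standard k x k convolution needs k·k·C_in·C_out weights, whereas a depthwise k x k convolution (one filter per input channel) followed by a 1 x 1 pointwise convolution needs k·k·C_in + C_in·C_out. The layer sizes below are arbitrary example values, not X-Net's, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k convolution mapping c_in to c_out channels (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```

The saving grows with the number of output channels, since the expensive k·k spatial filtering is done once per input channel rather than once per input-output channel pair.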

preprint2018arXiv

Object Activity Scene Description, Construction and Recognition

Action recognition is a critical task for social robots to meaningfully engage with their environment, and 3D human skeleton-based action recognition has been an attractive research area in recent years. Although existing approaches are good at recognizing individual actions, recognizing a group of actions in an activity scene remains a great challenge. To tackle this problem, we first partition the scene into several primitive actions (PAs) based on a motion attention mechanism. The primitive actions are then described by the trajectory vectors of the corresponding joints. Finally, motivated by text classification based on word embeddings, we employ a convolutional neural network (CNN) to recognize activity scenes, treating the motion of joints as the "words" of an activity. Experimental results on a dataset of human activity scenes show the efficiency of the proposed approach.
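
The "joint motion as words" idea can be sketched under two assumptions not stated in the abstract: each "word" is the concatenated displacement of all joints between consecutive frames, and the CNN follows the usual text-CNN pattern of sliding a filter over the word sequence and keeping the maximum response. Both function names are hypothetical, and the motion-attention partitioning into primitive actions is not modeled.

```python
def trajectory_words(frames):
    """Turn a skeleton sequence into per-step 'words': concatenated joint displacement vectors.

    frames: list of frames, each a list of (x, y, z) joint positions in a fixed joint order.
    """
    words = []
    for prev, curr in zip(frames, frames[1:]):
        word = []
        for (x0, y0, z0), (x1, y1, z1) in zip(prev, curr):
            word.extend((x1 - x0, y1 - y0, z1 - z0))
        words.append(word)
    return words


def conv_max_feature(words, filt):
    """One text-CNN filter: slide `filt` (width x dim) over the word sequence, keep the max response."""
    width = len(filt)
    if len(words) < width:
        raise ValueError("sequence is shorter than the filter")
    responses = [
        sum(f * w for i in range(width) for f, w in zip(filt[i], words[t + i]))
        for t in range(len(words) - width + 1)
    ]
    return max(responses)


frames = [
    [(0, 0, 0), (1, 1, 1)],
    [(1, 0, 0), (1, 2, 1)],
    [(1, 0, 2), (1, 2, 1)],
]
words = trajectory_words(frames)
print(words)  # [[1, 0, 0, 0, 1, 0], [0, 0, 2, 0, 0, 0]]
print(conv_max_feature(words, [[0, 0, 1, 0, 0, 0]]))  # 2
```

Max pooling over time makes each filter respond to a motion pattern wherever it occurs in the sequence, much as a text CNN detects a key phrase regardless of its position in a sentence.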