Source author record

Cuntai Guan

Cuntai Guan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Computer Vision eess.IV Artificial Intelligence eess.SP Computation and Language Human-Computer Interaction Multimedia Neurons and Cognition Quantitative Methods

Catalog footprint

What is connected

15works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression Detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.

preprint2026arXiv

SPD Matrix Learning for Neuroimaging Analysis: Perspectives, Methods, and Challenges

Neuroimaging provides essential tools for characterizing brain activity by quantifying connectivity strength between remote regions, using different modalities that capture different aspects of connectivity. Yet, decoding meaningful neural signatures must contend with modality-specific challenges, including measurement noise, spatial and temporal distortions, heterogeneous acquisition protocols, and limited sample sizes. A unifying perspective emerges when these data are expressed through symmetric positive definite (SPD)-valued representations: across neuroimaging modalities, SPD-valued representations naturally give rise to SPD matrices that capture dependencies between sensors or brain regions. Endowing the SPD space with Riemannian metrics equips it with a non-Euclidean geometric structure, enabling principled statistical modeling and machine learning on the resulting manifold. This review consolidates machine learning methodologies that operate on the SPD manifold under a unified framework termed SPD matrix learning. SPD matrix learning brings conceptual clarity across multiple modalities, establishes continuity with decades of geometric statistics in neuroimaging, and positions SPD modeling as a methodological bridge between classical analysis and emerging AI-driven paradigms. We show that (i) modeling on the SPD manifold is mathematically natural and numerically stable, preserving symmetry and positive definiteness while avoiding degeneracies inherent to Euclidean embeddings; (ii) SPD matrix learning extends a broad family of established geometric statistical tools used across neuroimaging; and (iii) SPD matrix learning integrates new-generation AI technologies, driving a new class of neuroimaging problems that were previously out of reach. Taken together, SPD matrix learning offers a principled and forward-looking framework for next-generation neuroimaging analytics.

preprint2022arXiv

ADAST: Attentive Cross-domain EEG-based Sleep Staging Framework with Iterative Self-Training

Sleep staging is of great importance in the diagnosis and treatment of sleep disorders. Recently, numerous data-driven deep learning models have been proposed for automatic sleep staging. They mainly train the model on a large public labeled sleep dataset and test it on a smaller one with subjects of interest. However, they usually assume that the train and test data are drawn from the same distribution, which may not hold in real-world scenarios. Unsupervised domain adaption (UDA) has been recently developed to handle this domain shift problem. However, previous UDA methods applied for sleep staging have two main limitations. First, they rely on a totally shared model for the domain alignment, which may lose the domain-specific information during feature extraction. Second, they only align the source and target distributions globally without considering the class information in the target domain, which hinders the classification performance of the model while testing. In this work, we propose a novel adversarial learning framework called ADAST to tackle the domain shift problem in the unlabeled target domain. First, we develop an unshared attention mechanism to preserve the domain-specific features in both domains. Second, we design an iterative self-training strategy to improve the classification performance on the target domain via target domain pseudo labels. We also propose dual distinct classifiers to increase the robustness and quality of the pseudo labels. The experimental results on six cross-domain scenarios validate the efficacy of our proposed framework and its advantage over state-of-the-art UDA methods. The source code is available at https://github.com/emadeldeen24/ADAST.

preprint2022arXiv

Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3

We propose a cross-modal co-attention model for continuous emotion recognition using visual-audio-linguistic information. The model consists of four blocks. The visual, audio, and linguistic blocks are used to learn the spatial-temporal features of the multi-modal input. A co-attention block is designed to fuse the learned features with the multi-head co-attention mechanism. The visual encoding from the visual block is concatenated with the attention feature to emphasize the visual information. To make full use of the data and alleviate over-fitting, cross-validation is carried out on the training and validation set. The concordance correlation coefficient (CCC) centering is used to merge the results from each fold. The achieved CCC on the test set is $0.520$ for valence and $0.602$ for arousal, which significantly outperforms the baseline method with the corresponding CCC of 0.180 and 0.170 for valence and arousal, respectively. The code is available at https://github.com/sucv/ABAW3.

preprint2022arXiv

MIN2Net: End-to-End Multi-Task Learning for Subject-Independent Motor Imagery EEG Classification

Advances in the motor imagery (MI)-based brain-computer interfaces (BCIs) allow control of several applications by decoding neurophysiological phenomena, which are usually recorded by electroencephalography (EEG) using a non-invasive technique. Despite great advances in MI-based BCI, EEG rhythms are specific to a subject and various changes over time. These issues point to significant challenges to enhance the classification performance, especially in a subject-independent manner. To overcome these challenges, we propose MIN2Net, a novel end-to-end multi-task learning to tackle this task. We integrate deep metric learning into a multi-task autoencoder to learn a compact and discriminative latent representation from EEG and perform classification simultaneously. This approach reduces the complexity in pre-processing, results in significant performance improvement on EEG classification. Experimental results in a subject-independent manner show that MIN2Net outperforms the state-of-the-art techniques, achieving an F1-score improvement of 6.72%, and 2.23% on the SMR-BCI, and OpenBMI datasets, respectively. We demonstrate that MIN2Net improves discriminative information in the latent representation. This study indicates the possibility and practicality of using this model to develop MI-based BCI applications for new users without the need for calibration.

preprint2022arXiv

MT-UDA: Towards Unsupervised Cross-modality Medical Image Segmentation with Limited Source Labels

The success of deep convolutional neural networks (DCNNs) benefits from high volumes of annotated data. However, annotating medical images is laborious, expensive, and requires human expertise, which induces the label scarcity problem. Especially when encountering the domain shift, the problem becomes more serious. Although deep unsupervised domain adaptation (UDA) can leverage well-established source domain annotations and abundant target domain data to facilitate cross-modality image segmentation and also mitigate the label paucity problem on the target domain, the conventional UDA methods suffer from severe performance degradation when source domain annotations are scarce. In this paper, we explore a challenging UDA setting - limited source domain annotations. We aim to investigate how to efficiently leverage unlabeled data from the source and target domains with limited source annotations for cross-modality image segmentation. To achieve this, we propose a new label-efficient UDA framework, termed MT-UDA, in which the student model trained with limited source labels learns from unlabeled data of both domains by two teacher models respectively in a semi-supervised manner. More specifically, the student model not only distills the intra-domain semantic knowledge by encouraging prediction consistency but also exploits the inter-domain anatomical information by enforcing structural consistency. Consequently, the student model can effectively integrate the underlying knowledge beneath available data resources to mitigate the impact of source label scarcity and yield improved cross-modality segmentation performance. We evaluate our method on MM-WHS 2017 dataset and demonstrate that our approach outperforms the state-of-the-art methods by a large margin under the source-label scarcity scenario.

preprint2022arXiv

OCTAve: 2D en face Optical Coherence Tomography Angiography Vessel Segmentation in Weakly-Supervised Learning with Locality Augmentation

While there have been increased researches using deep learning techniques for the extraction of vascular structure from the 2D en face OCTA, for such approach, it is known that the data annotation process on the curvilinear structure like the retinal vasculature is very costly and time consuming, albeit few tried to address the annotation problem. In this work, we propose the application of the scribble-base weakly-supervised learning method to automate the pixel-level annotation. The proposed method, called OCTAve, combines the weakly-supervised learning using scribble-annotated ground truth augmented with an adversarial and a novel self-supervised deep supervision. Our novel mechanism is designed to utilize the discriminative outputs from the discrimination layer of a UNet-like architecture where the Kullback-Liebler Divergence between the aggregate discriminative outputs and the segmentation map predicate is minimized during the training. This combined method leads to the better localization of the vascular structure as shown in our experiments. We validate our proposed method on the large public datasets i.e., ROSE, OCTA-500. The segmentation performance is compared against both state-of-the-art fully-supervised and scribble-based weakly-supervised approaches. The implementation of our work used in the experiments is located at [LINK].

preprint2022arXiv

TSception: Capturing Temporal Dynamics and Spatial Asymmetry from EEG for Emotion Recognition

The high temporal resolution and the asymmetric spatial activations are essential attributes of electroencephalogram (EEG) underlying emotional processes in the brain. To learn the temporal dynamics and spatial asymmetry of EEG towards accurate and generalized emotion recognition, we propose TSception, a multi-scale convolutional neural network that can classify emotions from EEG. TSception consists of dynamic temporal, asymmetric spatial, and high-level fusion layers, which learn discriminative representations in the time and channel dimensions simultaneously. The dynamic temporal layer consists of multi-scale 1D convolutional kernels whose lengths are related to the sampling rate of EEG, which learns the dynamic temporal and frequency representations of EEG. The asymmetric spatial layer takes advantage of the asymmetric EEG patterns for emotion, learning the discriminative global and hemisphere representations. The learned spatial representations will be fused by a high-level fusion layer. Using more generalized cross-validation settings, the proposed method is evaluated on two publicly available datasets DEAP and MAHNOB-HCI. The performance of the proposed network is compared with prior reported methods such as SVM, KNN, FBFgMDM, FBTSC, Unsupervised learning, DeepConvNet, ShallowConvNet, and EEGNet. TSception achieves higher classification accuracies and F1 scores than other methods in most of the experiments. The codes are available at https://github.com/yi-ding-cs/TSception

preprint2021arXiv

DSAL: Deeply Supervised Active Learning from Strong and Weak Labelers for Biomedical Image Segmentation

Image segmentation is one of the most essential biomedical image processing problems for different imaging modalities, including microscopy and X-ray in the Internet-of-Medical-Things (IoMT) domain. However, annotating biomedical images is knowledge-driven, time-consuming, and labor-intensive, making it difficult to obtain abundant labels with limited costs. Active learning strategies come into ease the burden of human annotation, which queries only a subset of training data for annotation. Despite receiving attention, most of active learning methods generally still require huge computational costs and utilize unlabeled data inefficiently. They also tend to ignore the intermediate knowledge within networks. In this work, we propose a deep active semi-supervised learning framework, DSAL, combining active learning and semi-supervised learning strategies. In DSAL, a new criterion based on deep supervision mechanism is proposed to select informative samples with high uncertainties and low uncertainties for strong labelers and weak labelers respectively. The internal criterion leverages the disagreement of intermediate features within the deep learning network for active sample selection, which subsequently reduces the computational costs. We use the proposed criteria to select samples for strong and weak labelers to produce oracle labels and pseudo labels simultaneously at each active learning iteration in an ensemble learning manner, which can be examined with IoMT Platform. Extensive experiments on multiple medical image datasets demonstrate the superiority of the proposed method over state-of-the-art active learning methods.

preprint2021arXiv

Federated Transfer Learning for EEG Signal Classification

The success of deep learning (DL) methods in the Brain-Computer Interfaces (BCI) field for classification of electroencephalographic (EEG) recordings has been restricted by the lack of large datasets. Privacy concerns associated with EEG signals limit the possibility of constructing a large EEG-BCI dataset by the conglomeration of multiple small ones for jointly training machine learning models. Hence, in this paper, we propose a novel privacy-preserving DL architecture named federated transfer learning (FTL) for EEG classification that is based on the federated learning framework. Working with the single-trial covariance matrix, the proposed architecture extracts common discriminative information from multi-subject EEG data with the help of domain adaptation techniques. We evaluate the performance of the proposed architecture on the PhysioNet dataset for 2-class motor imagery classification. While avoiding the actual data sharing, our FTL approach achieves 2% higher classification accuracy in a subject-adaptive analysis. Also, in the absence of multi-subject data, our architecture provides 6% better accuracy compared to other state-of-the-art DL architectures.

preprint2021arXiv

Hierarchical Consistency Regularized Mean Teacher for Semi-supervised 3D Left Atrium Segmentation

Deep learning has achieved promising segmentation performance on 3D left atrium MR images. However, annotations for segmentation tasks are expensive, costly and difficult to obtain. In this paper, we introduce a novel hierarchical consistency regularized mean teacher framework for 3D left atrium segmentation. In each iteration, the student model is optimized by multi-scale deep supervision and hierarchical consistency regularization, concurrently. Extensive experiments have shown that our method achieves competitive performance as compared with full annotation, outperforming other state-of-the-art semi-supervised segmentation methods.

preprint2020arXiv

Brain MRI-based 3D Convolutional Neural Networks for Classification of Schizophrenia and Controls

Convolutional Neural Network (CNN) has been successfully applied on classification of both natural images and medical images but not yet been applied to differentiating patients with schizophrenia from healthy controls. Given the subtle, mixed, and sparsely distributed brain atrophy patterns of schizophrenia, the capability of automatic feature learning makes CNN a powerful tool for classifying schizophrenia from controls as it removes the subjectivity in selecting relevant spatial features. To examine the feasibility of applying CNN to classification of schizophrenia and controls based on structural Magnetic Resonance Imaging (MRI), we built 3D CNN models with different architectures and compared their performance with a handcrafted feature-based machine learning approach. Support vector machine (SVM) was used as classifier and Voxel-based Morphometry (VBM) was used as feature for handcrafted feature-based machine learning. 3D CNN models with sequential architecture, inception module and residual module were trained from scratch. CNN models achieved higher cross-validation accuracy than handcrafted feature-based machine learning. Moreover, testing on an independent dataset, 3D CNN models greatly outperformed handcrafted feature-based machine learning. This study underscored the potential of CNN for identifying patients with schizophrenia using 3D brain MR images and paved the way for imaging-based individual-level diagnosis and prognosis in psychiatric disorders.

preprint2020arXiv

Enhancing the Extraction of Interpretable Information for Ischemic Stroke Imaging from Deep Neural Networks

We implement a visual interpretability method Layer-wise Relevance Propagation (LRP) on top of 3D U-Net trained to perform lesion segmentation on the small dataset of multi-modal images provided by ISLES 2017 competition. We demonstrate that LRP modifications could provide more sensible visual explanations to an otherwise highly noise-skewed saliency map. We also link amplitude of modified signals to useful information content. High amplitude localized signals appear to constitute the noise that undermines the interpretability capacity of LRP. Furthermore, mathematical framework for possible analysis of function approximation is developed by analogy.

preprint2020arXiv

Towards a Fast Steady-State Visual Evoked Potentials (SSVEP) Brain-Computer Interface (BCI)

Steady-state visual evoked potentials (SSVEP) brain-computer interface (BCI) provides reliable responses leading to high accuracy and information throughput. But achieving high accuracy typically requires a relatively long time window of one second or more. Various methods were proposed to improve sub-second response accuracy through subject-specific training and calibration. Substantial performance improvements were achieved with tedious calibration and subject-specific training; resulting in the user's discomfort. So, we propose a training-free method by combining spatial-filtering and temporal alignment (CSTA) to recognize SSVEP responses in sub-second response time. CSTA exploits linear correlation and non-linear similarity between steady-state responses and stimulus templates with complementary fusion to achieve desirable performance improvements. We evaluated the performance of CSTA in terms of accuracy and Information Transfer Rate (ITR) in comparison with both training-based and training-free methods using two SSVEP data-sets. We observed that CSTA achieves the maximum mean accuracy of 97.43$\pm$2.26 % and 85.71$\pm$13.41 % with four-class and forty-class SSVEP data-sets respectively in sub-second response time in offline analysis. CSTA yields significantly higher mean performance (p<0.001) than the training-free method on both data-sets. Compared with training-based methods, CSTA shows 29.33$\pm$19.65 % higher mean accuracy with statistically significant differences in time window less than 0.5 s. In longer time windows, CSTA exhibits either better or comparable performance though not statistically significantly better than training-based methods. We show that the proposed method brings advantages of subject-independent SSVEP classification without requiring training while enabling high target recognition performance in sub-second response time.

preprint2020arXiv

TSception: A Deep Learning Framework for Emotion Detection Using EEG

In this paper, we propose a deep learning framework, TSception, for emotion detection from electroencephalogram (EEG). TSception consists of temporal and spatial convolutional layers, which learn discriminative representations in the time and channel domains simultaneously. The temporal learner consists of multi-scale 1D convolutional kernels whose lengths are related to the sampling rate of the EEG signal, which learns multiple temporal and frequency representations. The spatial learner takes advantage of the asymmetry property of emotion responses at the frontal brain area to learn the discriminative representations from the left and right hemispheres of the brain. In our study, a system is designed to study the emotional arousal in an immersive virtual reality (VR) environment. EEG data were collected from 18 healthy subjects using this system to evaluate the performance of the proposed deep learning network for the classification of low and high emotional arousal states. The proposed method is compared with SVM, EEGNet, and LSTM. TSception achieves a high classification accuracy of 86.03%, which outperforms the prior methods significantly (p<0.05). The code is available at https://github.com/deepBrains/TSception

Cuntai Guan

What is connected

Connect this record

See the researcher in context

Building this map preview

15 published item(s)

DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

SPD Matrix Learning for Neuroimaging Analysis: Perspectives, Methods, and Challenges

ADAST: Attentive Cross-domain EEG-based Sleep Staging Framework with Iterative Self-Training

Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3

MIN2Net: End-to-End Multi-Task Learning for Subject-Independent Motor Imagery EEG Classification

MT-UDA: Towards Unsupervised Cross-modality Medical Image Segmentation with Limited Source Labels

OCTAve: 2D en face Optical Coherence Tomography Angiography Vessel Segmentation in Weakly-Supervised Learning with Locality Augmentation

TSception: Capturing Temporal Dynamics and Spatial Asymmetry from EEG for Emotion Recognition

DSAL: Deeply Supervised Active Learning from Strong and Weak Labelers for Biomedical Image Segmentation

Federated Transfer Learning for EEG Signal Classification

Hierarchical Consistency Regularized Mean Teacher for Semi-supervised 3D Left Atrium Segmentation

Brain MRI-based 3D Convolutional Neural Networks for Classification of Schizophrenia and Controls

Enhancing the Extraction of Interpretable Information for Ischemic Stroke Imaging from Deep Neural Networks

Towards a Fast Steady-State Visual Evoked Potentials (SSVEP) Brain-Computer Interface (BCI)

TSception: A Deep Learning Framework for Emotion Detection Using EEG