Researcher profile

Seong-Whan Lee

Seong-Whan Lee contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Trust 21 (Emerging), Verification L1, unclaimed author
40 works
0 followers
10 topics
4 close collaborators

Published work

40 published items

preprint 2026 arXiv

ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

Video moment retrieval is the task of retrieving specific segments of a video corresponding to a given text query. Recent studies have sought to improve multimodal alignment through visual-linguistic similarity learning at the snippet level and transformer-based temporal boundary regression. However, existing models calculate similarity at the snippet level and ignore the relationships between multiple answer segments that correspond to a single query. As a result, they are easily influenced by visually similar segments in the surrounding context and struggle to exclude segments irrelevant to the query. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss for explicitly learning the semantic relationship between answer segments. ClipTBP also predicts accurate temporal boundaries by applying both a main boundary loss and an auxiliary boundary loss. ClipTBP consistently improves performance when applied to various existing models and demonstrates more robust boundary prediction performance even in ambiguous query scenarios.

preprint 2026 arXiv

Compositional Meta-Learning for Mitigating Task Heterogeneity in Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) approximate solutions of partial differential equations (PDEs) by embedding physical laws into the loss function. In parameterized PDE families, variations in coefficients or boundary/initial conditions define distinct tasks. This makes training individual PINNs for each task computationally prohibitive, while cross-task transfer can be sensitive to task heterogeneity. While meta-learning can reduce retraining cost, existing methods often rely on a single global initialization and may suffer from negative transfer, particularly under feature-scarce coordinate inputs and limited training-task availability. We propose the Learning-Affinity Adaptive Modular Physics-Informed Neural Network (LAM-PINN), a compositional framework that leverages task-specific learning dynamics. LAM-PINN combines PDE parameters with learning-affinity metrics from brief transfer sessions to construct a task representation and cluster tasks even with coordinate-only inputs. It decomposes the model into cluster-specialized subnetworks and a shared meta network, and learns routing weights to selectively reuse modules instead of relying on a single global initialization. Across three PDE benchmarks, LAM-PINN achieves an average 19.7-fold reduction in mean squared error (MSE) on unseen tasks using only 10% of the training iterations required by conventional PINNs. These results indicate its effectiveness for generalization to unseen configurations within bounded design spaces of parameterized PDE families in resource-constrained engineering settings.
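The module-reuse idea in the abstract above can be illustrated with a minimal sketch. It is not the LAM-PINN implementation: the names `softmax` and `route` are hypothetical, each "module" is a plain Python function standing in for a subnetwork, and the routing rule assumed here is a generic softmax-weighted mixture of module outputs.

```python
import math

def softmax(logits):
    """Convert routing logits into non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, modules, logits):
    """Blend module outputs with softmax routing weights (illustrative only)."""
    weights = softmax(logits)
    return sum(w * f(x) for w, f in zip(weights, modules))

# Hypothetical stand-ins: two cluster-specialized modules and one shared module.
modules = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: x * x]

# Equal logits give a plain average of the module outputs;
# a large logit makes the mixture follow one module almost exclusively.
print(route(3.0, modules, [0.0, 0.0, 0.0]))
print(route(3.0, modules, [10.0, 0.0, 0.0]))
```

In a trained system the logits would be learned per task; here they are fixed by hand purely to show how the weighting behaves.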

preprint 2023 arXiv

Towards Voice Reconstruction from EEG during Imagined Speech

Translating imagined speech from human brain activity into voice is a challenging and absorbing research issue that can provide new means of human communication via brain signals. Endeavors toward reconstructing speech from brain activity have shown their potential using invasive measures of spoken speech data, but have faced challenges in reconstructing imagined speech. In this paper, we propose NeuroTalk, which converts non-invasive brain signals of imagined speech into the user's own voice. Our model was trained with spoken speech EEG, which was generalized to adapt to the domain of imagined speech, thus allowing natural correspondence between the imagined speech and the voice as a ground truth. In our framework, an automatic speech recognition decoder contributed to decomposing the phonemes of generated speech, thereby displaying the potential of voice reconstruction from unseen words. Our results imply the potential of speech synthesis from human EEG signals, not only from spoken speech but also from the brain signals of imagined speech.

preprint 2022 arXiv

Decoding Neural Correlation of Language-Specific Imagined Speech using EEG Signals

Speech impairments due to cerebral lesions and degenerative disorders can be devastating. For people with severe speech deficits, imagined speech in the brain-computer interface has been a promising hope for reconstructing the neural signals of speech production. However, studies in the EEG-based imagined speech domain still have some limitations due to high variability in spatial and temporal information and a low signal-to-noise ratio. In this paper, we investigated the neural signals of two groups of native speakers performing two tasks in different languages, English and Chinese. Our assumption was that English, a non-tonal and phonogram-based language, would show spectral differences in neural computation compared to Chinese, a tonal and ideogram-based language. The results showed a significant difference in the relative power spectral density between English and Chinese in specific frequency band groups. Also, the spatial evaluation of Chinese native speakers in the theta band was distinctive during the imagination task. Hence, this paper suggests the key spectral and spatial information of word imagination in a specific language when decoding the neural signals of speech.

preprint 2022 arXiv

Factorization Approach for Sparse Spatio-Temporal Brain-Computer Interface

Recently, advanced technologies have shown unlimited potential in solving various problems with a large amount of data. However, these technologies have yet to show competitive performance in brain-computer interfaces (BCIs), which deal with brain signals. Brain signals are difficult to collect in large quantities; in particular, the amount of information is sparse in spontaneous BCIs. In addition, we conjecture that high spatial and temporal similarities between tasks increase the prediction difficulty. We define this problem as the sparse condition. To solve this, a factorization approach is introduced to allow the model to obtain distinct representations from latent space. To this end, we propose two feature extractors: a class-common module, which is trained through adversarial learning and acts as a generator, and a class-specific module, which uses a loss function generated from classification so that features are extracted with traditional methods. To minimize the latent space shared by the class-common and class-specific features, the model is trained under an orthogonal constraint. As a result, EEG signals are factorized into two separate latent spaces. Evaluations were conducted on a single-arm motor imagery dataset. From the results, we demonstrated that factorizing the EEG signal allows the model to extract rich and decisive features under the sparse condition.

preprint 2022 arXiv

Few-Shot Object Detection with Proposal Balance Refinement

Few-shot object detection has gained significant attention in recent years as it has the potential to greatly reduce the reliance on large amounts of manually annotated bounding boxes. While most existing few-shot object detection literature primarily focuses on bounding box classification by obtaining feature embeddings that are as discriminative as possible, we emphasize the necessity of handling the lack of intersection-over-union (IoU) variations induced by a biased distribution of novel samples. In this paper, we analyze the IoU imbalance that is caused by the relatively high number of low-quality region proposals, and reveal that it plays a critical role in improving few-shot learning capabilities. The well-known two-stage fine-tuning technique causes insufficient quality and quantity of the novel positive samples, which hinders the effective object detection of unseen novel classes. To alleviate this issue, we present a few-shot object detection model with proposal balance refinement, a simple yet effective approach to learning object proposals using an auxiliary sequential bounding box refinement process. This process enables the detector to be optimized on the various IoU scores through additional novel class samples. To fully exploit our sequential stage architecture, we revise the fine-tuning strategy and expose the Region Proposal Network to the novel classes in order to provide increased learning opportunities for the region-of-interest (RoI) classifiers and regressors. Our extensive assessments on PASCAL VOC and COCO demonstrate that our framework substantially outperforms other existing few-shot object detection approaches.
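Since the abstract above centers on intersection-over-union (IoU) between region proposals and ground-truth boxes, a short sketch of the standard IoU computation may help. The function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for illustration; the abstract does not specify a box format.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A proposal set dominated by low values of this score is the kind of imbalance the abstract describes.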

preprint 2022 arXiv

HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers

Temporal action localization (TAL) is the task of identifying a set of actions in a video, which involves localizing the start and end frames and classifying each action instance. Existing methods have addressed this task by using predefined anchor windows or heuristic bottom-up boundary-matching strategies, which are major bottlenecks in inference time. Additionally, the main challenge is the inability to capture long-range actions due to a lack of global contextual information. In this paper, we present a novel anchor-free framework, referred to as HTNet, which predicts a set of <start time, end time, class> triplets from a video based on a Transformer architecture. After the prediction of coarse boundaries, we refine them through a background feature sampling (BFS) module and hierarchical Transformers, which enables our model to aggregate global contextual information and effectively exploit the inherent semantic relationships in a video. We demonstrate how our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets: THUMOS14 and ActivityNet 1.3.

preprint 2022 arXiv

Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection

Video Anomaly Detection (VAD) has traditionally been tackled with two main methodologies: the reconstruction-based approach and the prediction-based one. As reconstruction-based methods learn to generalize the input image, the model merely learns an identity function, which gives rise to the so-called generalizing issue. On the other hand, since prediction-based methods learn to predict a future frame given several previous frames, they are less sensitive to the generalizing issue. However, it is still uncertain whether the model can learn the spatio-temporal context of a video. Our intuition is that understanding the spatio-temporal context of a video plays a vital role in VAD, as it provides precise information on how the appearance of an event in a video clip changes. Hence, to fully exploit the context information for anomaly detection in videos, we designed a transformer model with three different contextual prediction streams: masked, whole and partial. By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video, which leads to a high reconstruction error for abnormal cases that do not fit the learned context. To verify the effectiveness of our approach, we assess our model on the public benchmark datasets UCSD Pedestrian 2, CUHK Avenue and ShanghaiTech, and evaluate the performance with the anomaly score metric of reconstruction error. The results demonstrate that our proposed approach achieves competitive performance compared to existing video anomaly detection methods.
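The abstract above scores anomalies by reconstruction error but does not give the exact formula. The sketch below shows one common way to turn per-frame errors into scores, under explicit assumptions: frames are flat lists of pixel values, the error is mean squared error, and scores are min-max normalized over a clip. The function names are hypothetical and this is not the paper's metric.

```python
def frame_errors(frames, predictions):
    """Mean squared error per frame; each frame is a flat list of pixel values."""
    return [
        sum((p - q) ** 2 for p, q in zip(frame, pred)) / len(frame)
        for frame, pred in zip(frames, predictions)
    ]

def anomaly_scores(errors):
    """Min-max normalize errors to [0, 1]; a higher score means more anomalous."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        # All frames are equally well predicted, so none stands out.
        return [0.0 for _ in errors]
    return [(e - lo) / (hi - lo) for e in errors]

# Toy clip: the last frame is predicted poorly and gets the highest score.
frames = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]]
predictions = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.0]]
print(anomaly_scores(frame_errors(frames, predictions)))
```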

preprint 2022 arXiv

Neural Architecture Adaptation for Object Detection by Searching Channel Dimensions and Mapping Pre-trained Parameters

Most object detection frameworks use backbone architectures originally designed for image classification, conventionally with pre-trained parameters on ImageNet. However, image classification and object detection are essentially different tasks and there is no guarantee that the optimal backbone for classification is also optimal for object detection. Recent neural architecture search (NAS) research has demonstrated that automatically designing a backbone specifically for object detection helps improve the overall accuracy. In this paper, we introduce a neural architecture adaptation method that can optimize the given backbone for detection purposes, while still allowing the use of pre-trained parameters. We propose to adapt both the micro- and macro-architecture by searching for specific operations and the number of layers, in addition to the output channel dimensions of each block. It is important to find the optimal channel depth, as it greatly affects the feature representation capability and computation cost. We conduct experiments with our searched backbone for object detection and demonstrate that our backbone outperforms both manually designed and searched state-of-the-art backbones on the COCO dataset.

preprint 2022 arXiv

OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos

Although many approaches for multi-human pose estimation in videos have shown profound results, they require densely annotated data, which entails excessive manual labor. Furthermore, occlusion and motion blur inevitably lead to poor estimation performance. To address these problems, we propose a method that leverages an attention mask for occluded joints and encodes temporal dependency between frames using transformers. First, our framework composes different combinations of sparsely annotated frames that denote the track of the overall joint movement. We propose an occlusion attention mask from these combinations that enables encoding occlusion-aware heatmaps as a semi-supervised task. Second, the proposed temporal encoder employs a transformer architecture to effectively aggregate the temporal relationship and keypoint-wise attention from each time step and accurately refine the target frame's final pose estimation. We achieve state-of-the-art pose estimation results on the PoseTrack2017 and PoseTrack2018 datasets and demonstrate the robustness of our approach to occlusion and motion blur in sparsely annotated video data.

preprint 2022 arXiv

Prototype-based Domain Generalization Framework for Subject-Independent Brain-Computer Interfaces

Brain-computer interfaces (BCIs) are challenging to use in practice due to the inter/intra-subject variability of electroencephalography (EEG). A BCI system, in general, necessitates a calibration technique to obtain subject/session-specific data in order to tune the model each time the system is utilized. This issue is acknowledged as a key hindrance to BCI, and a new strategy based on domain generalization has recently evolved to address it. In light of this, we have concentrated on developing an EEG classification framework that can be applied directly to data from unknown domains (i.e., subjects), using only data previously acquired from separate subjects. For this purpose, in this paper, we propose a framework that employs the open-set recognition technique as an auxiliary task to learn subject-specific style features from the source dataset while helping the shared feature extractor map the features of the unseen target dataset as a new unseen domain. Our aim is to impose cross-instance style invariance in the same domain and reduce the open space risk on the potential unseen subject in order to improve the generalization ability of the shared feature extractor. Our experiments showed that using the domain information as an auxiliary network increases the generalization performance.

preprint 2022 arXiv

Style-Guided Domain Adaptation for Face Presentation Attack Detection

Domain adaptation (DA) or domain generalization (DG) for face presentation attack detection (PAD) has attracted attention recently for its robustness against unseen attack scenarios. Existing DA/DG-based PAD methods, however, have not yet fully explored the domain-specific style information that can provide knowledge regarding attack styles (e.g., materials, background, illumination and resolution). In this paper, we introduce a novel Style-Guided Domain Adaptation (SGDA) framework for inference-time adaptive PAD. Specifically, Style-Selective Normalization (SSN) is proposed to explore the domain-specific style information within the high-order feature statistics. The proposed SSN enables the adaptation of the model to the target domain by reducing the style difference between the target and the source domains. Moreover, we carefully design Style-Aware Meta-Learning (SAML) to boost the adaptation ability, which simulates the inference-time adaptation with a style selection process on a virtual test domain. In contrast to previous domain adaptation approaches, our method requires neither additional auxiliary models (e.g., domain adaptors) nor the unlabeled target domain during training, which makes our method more practical for the PAD task. For evaluation, we use the public datasets MSU-MFSD, CASIA-FASD, OULU-NPU and Idiap REPLAYATTACK. In most assessments, the results demonstrate a notable performance gap compared to the conventional DA/DG-based PAD methods.

preprint 2022 arXiv

Toward Imagined Speech based Smart Communication System: Potential Applications on Metaverse Conditions

The metaverse provides an alternative platform for human interaction in the virtual world. Since a virtual platform imposes few restrictions on changing the surrounding environments or the appearance of the avatars, it can serve as a platform that reflects human thoughts or even dreams, at least in the metaverse world. When it is merged with current brain-computer interface (BCI) technology, which enables system control via brain signals, a new paradigm of human interaction through the mind may be established in metaverse conditions. Recent BCI systems aim to provide user-friendly and intuitive means of communication using brain signals. Imagined speech has become an alternative neuro-paradigm for communicative BCI since it relies directly on a person's speech production process, rather than using speech-unrelated neural activity as the means of communication. In this paper, we propose a brain-to-speech (BTS) system for real-world smart communication using brain signals. We also demonstrate imagined speech based smart home control through communication with a virtual assistant, which can be one of the future applications of a brain-metaverse system. We performed pseudo-online analysis using imagined speech electroencephalography data of nine subjects to investigate the potential use of a virtual BTS system in the real world. Average accuracies of 46.54% (chance level = 7.7%) and 75.56% (chance level = 50%) were obtained in the thirteen-class and binary pseudo-online analyses, respectively. Our results support the potential of imagined speech based smart communication to be applied in the metaverse world.
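The chance levels quoted in the abstract above follow from the number of classes. The snippet below reproduces that arithmetic, assuming balanced classes (an assumption; the abstract does not state the class distribution). The helper name `chance_level` is hypothetical.

```python
def chance_level(num_classes):
    """Chance accuracy in percent for a balanced num_classes-way task."""
    return 100.0 / num_classes

# Thirteen-class and binary settings from the abstract.
print(round(chance_level(13), 1))  # 7.7
print(chance_level(2))             # 50.0
```

Against these baselines, the reported 46.54% and 75.56% are roughly 6.1 and 1.5 times chance, respectively.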

preprint 2021 arXiv

Counterfactual Explanation Based on Gradual Construction for Deep Networks

To understand the black-box characteristics of deep networks, counterfactual explanation, which deduces not only the important features of an input space but also how those features should be modified to classify the input as a target class, has gained increasing interest. The patterns that deep networks have learned from a training dataset can be grasped by observing the feature variation among various classes. However, current approaches perform the feature modification to increase the classification probability for the target class irrespective of the internal characteristics of deep networks. This often leads to unclear explanations that deviate from real-world data distributions. To address this problem, we propose a counterfactual explanation method that exploits the statistics learned from a training dataset. Specifically, we gradually construct an explanation by iterating over masking and composition steps. The masking step aims to select an important feature from the input data to be classified as a target class. Meanwhile, the composition step aims to optimize the previously selected feature by ensuring that its output score is close to the logit space of the training data that are classified as the target class. Experimental results show that our method produces human-friendly interpretations on various classification datasets and verify that such interpretations can be achieved with fewer feature modifications.

preprint 2021 arXiv

Decoding Event-related Potential from Ear-EEG Signals based on Ensemble Convolutional Neural Networks in Ambulatory Environment

Recently, research on practical brain-computer interfaces has been actively carried out, especially in ambulatory environments. However, electroencephalography (EEG) signals are distorted by movement artifacts and electromyography signals when users are moving, which makes it hard to recognize human intention. In addition, as hardware issues are also challenging, ear-EEG has been developed for practical brain-computer interfaces and has been widely used. In this paper, we propose ensemble-based convolutional neural networks for an ambulatory environment and analyze the visual event-related potential responses in scalp- and ear-EEG in terms of statistical analysis and brain-computer interface performance. The brain-computer interface performance deteriorated by 3-14% when walking fast at 1.6 m/s. The proposed method achieved an average area under the curve of 0.728. The proposed method is also robust to the ambulatory environment and to imbalanced data.
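The abstract above reports performance as area under the curve (AUC), a metric that is less sensitive to class imbalance than raw accuracy. As a minimal sketch, assuming the curve is the ROC curve and labels are binary (1 for target, 0 for non-target), AUC can be computed as the fraction of positive-negative pairs that are ranked correctly. The function name `roc_auc` is hypothetical and the quadratic loop is chosen for clarity, not speed.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Imbalanced toy data: one target among four non-targets.
print(roc_auc([1, 0, 0, 0, 0], [0.7, 0.9, 0.3, 0.2, 0.1]))  # 0.75
```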

preprint 2021 arXiv

Human Interaction Recognition Framework based on Interacting Body Part Attention

Human activity recognition in videos has been widely studied and has recently gained significant advances with deep learning approaches; however, it remains a challenging task. In this paper, we propose a novel framework that simultaneously considers both implicit and explicit representations of human interactions by fusing information from the local image region where the interaction actively occurs, primitive motion with the posture of each individual subject's body parts, and the co-occurrence of overall appearance change. Human interactions change depending on how the body parts of each human interact with the other. The proposed method captures the subtle difference between different interactions using interacting body part attention. Semantically important body parts that interact with other objects are given more weight during feature representation. The combined feature of interacting body part attention-based individual representation and the co-occurrence descriptor of the full-body appearance change is fed into long short-term memory to model the temporal dynamics in a single framework. We validate the effectiveness of the proposed method on four widely used public datasets, where it outperforms competing state-of-the-art methods.

preprint 2021 arXiv

Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concepts, such as the relationships between various objects. The limited use of object categories combined with their relationships or simple question embedding is insufficient for representing complex scenes and explaining decisions. To address this limitation, we propose the use of text expressions generated for images, because such expressions have few structural constraints and can provide richer descriptions of images. The generated expressions can be incorporated with visual features and question embedding to obtain the question-relevant answer. A joint-embedding multi-head attention network is also proposed to model three different information modalities with co-attention. We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction. The quality of the generated expressions was also evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Experimental results demonstrate the effectiveness of the proposed method and reveal that it outperformed all of the competing methods in terms of both quantitative and qualitative results.

preprint 2021 arXiv

Weakly Supervised Thoracic Disease Localization via Disease Masks

To enable a deep learning-based system to be used in the medical domain as a computer-aided diagnosis system, it is essential to not only classify diseases but also present the locations of the diseases. However, collecting instance-level annotations for various thoracic diseases is expensive. Therefore, weakly supervised localization methods have been proposed that use only image-level annotation. While the previous methods presented the disease location as the most discriminative part for classification, this causes a deep network to localize wrong areas for indistinguishable X-ray images. To solve this issue, we propose a spatial attention method using disease masks that describe the areas where diseases mainly occur. We then apply the spatial attention to find the precise disease area by highlighting the highest probability of disease occurrence. Meanwhile, the various sizes, rotations and noise in chest X-ray images make generating the disease masks challenging. To reduce the variation among images, we employ an alignment module to transform an input X-ray image into a generalized image. Through extensive experiments on the NIH-Chest X-ray dataset with eight kinds of diseases, we show that the proposed method results in superior localization performances compared to state-of-the-art methods.

preprint 2020 arXiv

A novel approach to classify natural grasp actions by estimating muscle activity patterns from EEG signals

Developing electroencephalogram (EEG) based brain-computer interface (BCI) systems is challenging. In this study, we analyzed natural grasp actions from EEG. Ten healthy subjects participated in this experiment. They executed and imagined three sustained grasp actions. We proposed a novel approach that estimates muscle activity patterns from EEG signals to improve the overall classification accuracy. For implementation, we recorded EEG and electromyogram (EMG) signals simultaneously. Using the similarity between the pattern estimated from EEG signals and the activity pattern from EMG signals yielded higher classification accuracy than competitive methods. As a result, we obtained average classification accuracies of 63.89 (±7.54)% for actual movement and 46.96 (±15.30)% for motor imagery. These are 21.59% and 5.66% higher than the results of the competitive model, respectively. This result is encouraging, and the proposed method could potentially be used in future applications, such as BCI-driven robot control for handling various daily use objects.

preprint 2020 arXiv

A Novel Online Action Detection Framework from Untrimmed Video Streams

Online temporal action localization from an untrimmed video stream is a challenging problem in computer vision. It is challenging because i) in an untrimmed video stream, more than one action instance may appear, including background scenes, and ii) in online settings, only past and current information is available. Therefore, temporal priors, such as the average action duration of training data, which have been exploited by previous action detection methods, are not suitable for this task because of the high intra-class variation in human actions. We propose a novel online action detection framework that considers actions as a set of temporally ordered subclasses and leverages a future frame generation network to cope with the limited information issue associated with the problem outlined above. Additionally, we augment our data by varying the lengths of videos to allow the proposed method to learn about the high intra-class variation in human actions. We evaluate our method using two benchmark datasets, THUMOS'14 and ActivityNet, for an online temporal action localization scenario and demonstrate that the performance is comparable to state-of-the-art methods that have been proposed for offline settings.

preprint 2020 arXiv

A Two-Stream Symmetric Network with Bidirectional Ensemble for Aerial Image Matching

In this paper, we propose a novel method to precisely match two aerial images that were obtained in different environments via a two-stream deep network. By internally augmenting the target image, the two-stream network operates on three input images and reflects the additional augmented pair in the training. As a result, the training process of the deep network is regularized and the network becomes robust to the variance of aerial images. Furthermore, we introduce an ensemble method that is based on the bidirectional network, which is motivated by the isomorphic nature of the geometric transformation. We obtain two global transformation parameters without any additional network or parameters, which alleviate asymmetric matching results and enable significant improvement in performance by fusing the two outcomes. For the experiment, we adopt aerial images from Google Earth and the International Society for Photogrammetry and Remote Sensing (ISPRS). To quantitatively assess our result, we apply the probability of correct keypoints (PCK) metric, which measures the degree of matching. The qualitative and quantitative results show a sizable performance gap compared to the conventional methods for matching aerial images. All code and our trained model, as well as the dataset, are available online.
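The probability of correct keypoints (PCK) metric mentioned in the abstract above can be sketched as follows. This uses a common definition in which a keypoint counts as correct when it lands within `alpha * max(height, width)` pixels of its target; the threshold rule, the default `alpha`, and the function name `pck` are assumptions, since the abstract does not give them.

```python
import math

def pck(predicted, target, height, width, alpha=0.05):
    """Fraction of predicted keypoints within alpha * max(height, width) of the target.

    predicted and target are equal-length lists of (x, y) pixel coordinates.
    """
    threshold = alpha * max(height, width)
    correct = sum(
        1
        for (px, py), (tx, ty) in zip(predicted, target)
        if math.hypot(px - tx, py - ty) <= threshold
    )
    return correct / len(target)

# Toy example on a 100x200 image: the threshold is 0.05 * 200 = 10 pixels.
predicted = [(10.0, 10.0), (50.0, 50.0), (150.0, 80.0)]
target = [(12.0, 10.0), (50.0, 53.0), (100.0, 80.0)]
print(pck(predicted, target, height=100, width=200))  # 2 of 3 keypoints are close enough
```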

preprint 2020 arXiv

Assessment of Unconsciousness for Memory Consolidation Using EEG Signals

The assessment of consciousness and unconsciousness is a challenging issue in modern neuroscience. Consciousness is closely related to memory consolidation in that memory is a critical component of conscious experience. So far, many studies have been reported on memory consolidation during consciousness, but there is little research on memory consolidation during unconsciousness. Therefore, we aim to assess unconsciousness in terms of memory consolidation using electroencephalogram signals. In particular, we used the unconscious state during a nap, because sleep is the only state in which consciousness disappears under normal physiological conditions. Seven participants performed two memory tasks (word-pairs and visuo-spatial) before and after the nap to assess memory consolidation during unconsciousness. As a result, spindle power in the central, parietal and occipital regions during unconsciousness was positively correlated with the performance of location memory. In terms of memory performance, there were also negative correlations between delta connectivity and word-pairs memory, alpha connectivity and location memory, and spindle connectivity and word-pairs memory. We additionally observed a significant relationship between unconsciousness and brain changes during memory recall before and after the nap. These findings could help present new insights into the assessment of unconsciousness by exploring the relationship with memory consolidation.

preprint2020arXiv

Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder

In recent works, flow-based neural vocoders have shown significant improvement in real-time speech generation tasks. The sequence of invertible flow operations allows the model to convert samples from a simple distribution to audio samples. However, training a continuous density model on discrete audio data can degrade model performance due to the topological difference between the latent and actual distributions. To resolve this problem, we propose audio dequantization methods in a flow-based neural vocoder for high fidelity audio generation. Data dequantization is a well-known method in image generation but has not yet been studied in the audio domain. For this reason, we implement various audio dequantization methods in a flow-based neural vocoder and investigate the effect on the generated audio. We conduct various objective performance assessments and subjective evaluation to show that audio dequantization can improve audio generation quality. In our experiments, audio dequantization produces waveform audio with better harmonic structure and fewer digital artifacts.
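The abstract above does not say which dequantization schemes were tried. As a minimal sketch of the general idea, the code below applies uniform dequantization, a scheme commonly used in image generative modelling: noise drawn from [0, 1) is added to each integer sample so the data fill a continuous range. The 16-bit depth and the names `uniform_dequantize` and `requantize` are assumptions for this example, not the paper's implementation.

```python
import math
import random

def uniform_dequantize(samples, bits=16, rng=None):
    """Map signed integer samples to continuous values in [-1, 1) by adding
    uniform noise in [0, 1) before scaling (assumed scheme)."""
    rng = rng or random.Random()
    half = 2 ** (bits - 1)
    return [(s + rng.random()) / half for s in samples]

def requantize(values, bits=16):
    """Inverse mapping: scale back and floor to recover the integer bin."""
    half = 2 ** (bits - 1)
    return [math.floor(v * half) for v in values]

# Hypothetical 16-bit samples covering the extremes of the range.
pcm = [-32768, -1, 0, 1, 32767]
continuous = uniform_dequantize(pcm, rng=random.Random(0))
recovered = requantize(continuous)
```

Because the noise stays inside one quantization bin, flooring the scaled value returns the original integer, so the continuous model sees no information that was not already in the discrete data.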

preprint2020arXiv

Classification of High-Dimensional Motor Imagery Tasks based on An End-to-end role assigned convolutional neural network

A brain-computer interface (BCI) provides a direct communication pathway between a user and external devices. The electroencephalogram (EEG) motor imagery (MI) paradigm is widely used in non-invasive BCI to obtain encoded signals containing the user's intention of movement execution. However, EEG has intricate and non-stationary properties, resulting in insufficient decoding performance. By imagining numerous movements of a single arm, decoding performance can be improved without artificial command matching. In this study, we collected intuitive EEG data containing nine different types of single-arm movements from 9 subjects. We propose an end-to-end role assigned convolutional neural network (ERA-CNN) which considers discriminative features of each upper limb region by adopting the principle of a hierarchical CNN architecture. The proposed model outperforms previous methods on 3-class, 5-class and two different types of 7-class classification tasks. Hence, we demonstrate the possibility of decoding user intention by using only EEG signals with robust performance using an ERA-CNN.

preprint2020arXiv

Classification of Imagined Speech Using Siamese Neural Network

Imagined speech is spotlighted as a new trend in the brain-machine interface due to its application as an intuitive communication tool. However, previous studies have shown low classification performance; therefore, its use in real life is not feasible. In addition, no suitable method to analyze it has been found. Recently, deep learning algorithms have been applied to this paradigm. However, due to the small amount of data, the increase in classification performance is limited. To tackle these issues, in this study, we proposed an end-to-end framework using a Siamese neural network encoder, which learns the discriminant features by considering the distance between classes. The imagined words (e.g., arriba (up), abajo (down), derecha (right), izquierda (left), adelante (forward), and atrás (backward)) were classified using the raw electroencephalography (EEG) signals. We obtained a 6-class classification accuracy of 31.40% for imagined speech, which significantly outperformed other methods. This was possible because the Siamese neural network, which increases the distance between dissimilar samples while decreasing the distance between similar samples, was used. In this regard, our method can learn discriminant features from a small dataset. The proposed framework would help to increase the classification performance of imagined speech for a small amount of data and implement an intuitive communication system.
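The abstract above describes pulling similar samples together and pushing dissimilar ones apart, but it does not name the loss used. One standard way to express that objective is the pairwise contrastive loss with a margin, sketched below on plain embedding vectors. The choice of this loss, the `margin` value, and the function name are assumptions for illustration.

```python
import math

def contrastive_loss(a, b, same_class, margin=1.0):
    """Pairwise contrastive loss on two embedding vectors.
    Same-class pairs are penalized by their squared distance;
    different-class pairs are penalized only if closer than the margin."""
    distance = math.dist(a, b)
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# Hypothetical 2-D embeddings that are 5 units apart.
anchor, other = [0.0, 0.0], [3.0, 4.0]
loss_same = contrastive_loss(anchor, other, same_class=True)
loss_far = contrastive_loss(anchor, other, same_class=False, margin=1.0)
loss_near = contrastive_loss(anchor, other, same_class=False, margin=6.0)
```

A different-class pair that is already farther apart than the margin contributes no loss, which keeps training focused on pairs that are still confusable.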

preprint2020arXiv

Decoding of Grasp Motions from EEG Signals Based on a Novel Data Augmentation Strategy

Electroencephalogram (EEG) based brain-computer interface (BCI) systems are useful tools for clinical purposes like neural prostheses. In this study, we collected EEG signals related to grasp motions. Five healthy subjects participated in this experiment. They executed and imagined five sustained-grasp actions. We proposed a novel data augmentation method that increases the amount of training data using labels obtained from electromyogram (EMG) signal analysis. For implementation, we recorded EEG and EMG simultaneously. Applying the data augmentation to the original EEG data yielded higher classification accuracy than competing methods. As a result, we obtained an average classification accuracy of 52.49% for motor execution (ME) and 40.36% for motor imagery (MI). These are 9.30% and 6.19% higher, respectively, than the results of the comparable methods. Moreover, the proposed method could minimize the need for the calibration session, which reduces the practicality of most BCIs. This result is encouraging, and the proposed method could potentially be used in future applications such as BCI-driven robot control for handling various daily use objects.

preprint2020arXiv

Decoding of Intuitive Visual Motion Imagery Using Convolutional Neural Network under 3D-BCI Training Environment

In this study, we adopted visual motion imagery, which is a more intuitive brain-computer interface (BCI) paradigm, for decoding the intuitive user intention. We developed a 3-dimensional BCI training platform and applied it to assist the user in performing more intuitive imagination in the visual motion imagery experiment. The experimental tasks were selected based on movements that we commonly use in daily life, such as picking up a phone, opening a door, eating food, and pouring water. Nine subjects participated in our experiment. We presented statistical evidence that visual motion imagery has a high correlation from the prefrontal and occipital lobes. In addition, we selected the most appropriate electroencephalography channels using a functional connectivity approach for visual motion imagery decoding and proposed a convolutional neural network architecture for classification. As a result, the averaged classification performance of the proposed architecture for 4 classes from 16 channels was 67.50% across all subjects. This result is encouraging, and it shows the possibility of developing a BCI-based device control system for practical applications such as neuroprosthesis and a robotic arm.

preprint2020arXiv

Decoding Visual Recognition of Objects from EEG Signals based on Attention-Driven Convolutional Neural Network

The ability to perceive and recognize objects is fundamental for interaction with the external environment. Studies that investigate these abilities and their relationship with brain activity changes have been increasing due to the possible application in an intuitive brain-machine interface (BMI). In addition, the distinctive patterns when presenting different visual stimuli that make data differentiable enough to be classified have been studied. However, the reported classification accuracy is still low, or the techniques employed for obtaining brain signals are impractical to use in real environments. In this study, we aim to decode electroencephalography (EEG) signals depending on the provided visual stimulus. Subjects were presented with 72 photographs belonging to 6 different semantic categories. We classified 6 categories and 72 exemplars according to visual stimuli using EEG signals. In order to achieve a high classification accuracy, we proposed an attention-driven convolutional neural network and compared our results with conventional methods used for classifying EEG signals. We reported an accuracy of 50.37% and 26.75% for 6-class and 72-class, respectively. These results statistically outperformed other conventional methods. This was possible because of the application of the attention network using human visual pathways. Our findings showed that EEG signals can be differentiated when subjects are presented with visual stimuli of different semantic categories and at an exemplar level with a high classification accuracy; this demonstrates the viability of applying the approach in a real-world BMI.

preprint2020arXiv

End-to-End Automatic Sleep Stage Classification Using Spectral-Temporal Sleep Features

Sleep disorder is one of many neurological diseases that can greatly affect the quality of daily life. It is very burdensome to manually classify the sleep stages to detect sleep disorders. Therefore, automatic sleep stage classification techniques are needed. However, previous automatic sleep scoring methods using raw signals still show low classification performance. In this study, we proposed an end-to-end automatic sleep staging framework based on optimal spectral-temporal sleep features using the Sleep-EDF dataset. The input data were modified using a bandpass filter and then applied to a convolutional neural network model. For five-stage sleep classification, the classification performance was 85.6% and 91.1% using the raw input data and the proposed input, respectively. This result also shows the highest performance compared to conventional studies using the same dataset. The proposed framework has shown high performance by using optimal features associated with each sleep stage, which may help to find new features in the automatic sleep stage method.

preprint2020arXiv

Few-Shot Learning with Geometric Constraints

In this article, we consider the problem of few-shot learning for classification. We assume a network trained for base categories with a large number of training examples, and we aim to add novel categories to it that have only a few, e.g., one or five, training examples. This is a challenging scenario because: 1) high performance is required in both the base and novel categories; and 2) training the network for the new categories with a few training examples can contaminate the feature space trained well for the base categories. To address these challenges, we propose two geometric constraints to fine-tune the network with a few training examples. The first constraint enables features of the novel categories to cluster near the category weights, and the second maintains the weights of the novel categories far from the weights of the base categories. By applying the proposed constraints, we extract discriminative features for the novel categories while preserving the feature space learned for the base categories. Using public data sets for few-shot learning that are subsets of ImageNet, we demonstrate that the proposed method outperforms prevalent methods by a large margin.

preprint2020arXiv

Few-Shot Object Detection via Knowledge Transfer

Conventional methods for object detection usually require substantial amounts of training data and annotated bounding boxes. If there are only a few training data and annotations, the object detectors easily overfit and fail to generalize. This exposes the practical weakness of the object detectors. On the other hand, humans can easily master new reasoning rules with only a few demonstrations using previously learned knowledge. In this paper, we introduce few-shot object detection via knowledge transfer, which aims to detect objects from a few training examples. Central to our method is prototypical knowledge transfer with an attached meta-learner. The meta-learner takes support set images that include the few examples of the novel categories and base categories, and predicts prototypes that represent each category as a vector. Then, the prototypes reweight each RoI (Region-of-Interest) feature vector from a query image to remodel the R-CNN predictor heads. To facilitate the remodeling process, we predict the prototypes under a graph structure, which propagates information of the correlated base categories to the novel categories with explicit guidance of prior knowledge that represents correlations among categories. Extensive experiments on the PASCAL VOC dataset verify the effectiveness of the proposed method.

preprint2020arXiv

Gradual Relation Network: Decoding Intuitive Upper Extremity Movement Imaginations Based on Few-Shot EEG Learning

A brain-computer interface (BCI) is a communication tool that connects users and external devices. In a real-time BCI environment, a calibration procedure is particularly necessary for each user and each session. This procedure consumes a significant amount of time that hinders the application of a BCI system in a real-world scenario. To avoid this problem, we adopt a metric-based few-shot learning approach for decoding intuitive upper-extremity movement imagination (MI) using a gradual relation network (GRN) that can gradually consider the combination of temporal and spectral groups. We acquired the MI data of the upper-arm, forearm, and hand associated with intuitive upper-extremity movement from 25 subjects. The grand average multiclass classification results under offline analysis were 42.57%, 55.60%, and 80.85% in 1-, 5-, and 25-shot settings, respectively. In addition, we could demonstrate the feasibility of intuitive MI decoding using the few-shot approach in real-time robotic arm control scenarios. Five participants could achieve a success rate of 78% in the drinking task. Hence, we demonstrated not only the feasibility of online robotic arm control with shortened calibration time by focusing on human body parts, but also the accommodation of various untrained intuitive MI decoding based on the proposed GRN.

preprint2020arXiv

Mel-spectrogram augmentation for sequence to sequence voice conversion

When training a sequence-to-sequence voice conversion model, we need to handle the issue of an insufficient number of speech pairs that consist of the same utterance. This study experimentally investigated the effects of Mel-spectrogram augmentation on training the sequence-to-sequence voice conversion (VC) model from scratch. For Mel-spectrogram augmentation, we adopted the policies proposed in SpecAugment. In addition, we proposed new policies (i.e., frequency warping, loudness and time length control) for more data variations. Moreover, to find the appropriate hyperparameters of augmentation policies without training the VC model, we proposed a hyperparameter search strategy and a new metric for reducing experimental cost, namely deformation per deteriorating ratio. We compared the effect of these Mel-spectrogram augmentation methods based on various sizes of training set and augmentation policies. In the experimental results, the time axis warping based policies (i.e., time length control and time warping) showed better performance than other policies. These results indicate that the use of Mel-spectrogram augmentation is beneficial for training the VC model.

preprint2020arXiv

Prediction of Event Related Potential Speller Performance Using Resting-State EEG

An event-related potential (ERP) speller can be utilized in device control and communication for locked-in or severely injured patients. However, problems such as inter-subject performance instability and ERP-illiteracy are still unresolved. Therefore, it is necessary to predict classification performance before performing an ERP speller in order to use it efficiently. In this study, we investigated the correlations with ERP speller performance using resting-state EEG recorded before an ERP speller. Specifically, we used spectral power and functional connectivity according to four brain regions and five frequency bands. As a result, the delta power in the frontal region and functional connectivity in the delta, alpha, and gamma bands were significantly correlated with the ERP speller performance. Also, we predicted the ERP speller performance using EEG features in the resting-state. These findings may contribute to investigating ERP-illiteracy and considering the appropriate alternatives for each user.

preprint2020arXiv

Prediction of Memory Retrieval Performance Using Ear-EEG Signals

Many studies have explored brain signals during the performance of a memory task to predict later remembered items. However, prediction methods are still poorly used in real life and are not practical due to the use of electroencephalography (EEG) recorded from the scalp. Ear-EEG has been recently used to measure brain signals due to its flexibility when applying it to real-world environments. In this study, we attempt to predict whether a shown stimulus is going to be remembered or forgotten using ear-EEG and compare its performance with scalp-EEG. Our results showed that there was no significant difference between ear-EEG and scalp-EEG. In addition, a higher prediction accuracy was obtained using a convolutional neural network (pre-stimulus: 74.06%, on-going stimulus: 69.53%) when compared to other baseline methods. These results showed that it is possible to predict the performance of a memory task using ear-EEG signals and that it could be used for predicting memory retrieval in a practical brain-computer interface.

preprint2020arXiv

Reconstructing ERP Signals Using Generative Adversarial Networks for Mobile Brain-Machine Interface

Practical brain-machine interfaces have been widely studied to accurately detect human intentions using brain signals in the real world. However, electroencephalography (EEG) signals are distorted by artifacts such as walking and head movement, so the recorded signals may be dominated by large-amplitude artifacts rather than the desired EEG signals. Due to these artifacts, accurately detecting human intention in the mobile environment is challenging. In this paper, we proposed a reconstruction framework based on generative adversarial networks using the event-related potentials (ERP) during walking. We used a pre-trained convolutional encoder to represent latent variables and reconstructed ERP through a generative model whose shape is similar to the opposite of the encoder. Finally, the ERP was classified using the discriminative model to demonstrate the validity of our proposed framework. As a result, the reconstructed signals had important components such as N200 and P300 similar to ERP during standing. The accuracy of reconstructed EEG was similar to raw noisy EEG signals during walking. The signal-to-noise ratio of reconstructed EEG was significantly increased as 1.3. The loss of the generative model was 0.6301, which is comparatively low, indicating that the generative model trained well. The reconstructed ERP consequently showed an improvement in classification performance during walking through the effects of noise reduction. The proposed framework could help recognize human intention based on the brain-machine interface even in the mobile environment.

preprint2020arXiv

Self-Augmentation: Generalizing Deep Networks to Unseen Classes for Few-Shot Learning

Few-shot learning aims to classify unseen classes with a few training examples. While recent works have shown that standard mini-batch training with a carefully designed training strategy can improve generalization ability for unseen classes, well-known problems in deep networks such as memorizing training statistics have been less explored for few-shot learning. To tackle this issue, we propose self-augmentation that consolidates self-mix and self-distillation. Specifically, we exploit a regional dropout technique called self-mix, in which a patch of an image is substituted into other values in the same image. Then, we employ a backbone network that has auxiliary branches with its own classifier to enforce knowledge sharing. Lastly, we present a local representation learner to further exploit a few training examples for unseen classes. Experimental results show that the proposed method outperforms the state-of-the-art methods for prevalent few-shot benchmarks and improves the generalization ability.
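The abstract above describes self-mix as substituting a patch of an image with other values from the same image. A minimal sketch of that operation on a 2-D list follows. The uniform random choice of source and target regions, the fixed patch size, and the function name `self_mix` are assumptions for this example; the paper's exact sampling scheme is not given in the abstract.

```python
import random

def self_mix(image, patch_h, patch_w, rng=None):
    """Return a copy of `image` in which one patch is overwritten with the
    values of another, randomly chosen patch from the same image."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Top-left corners of the target (overwritten) and source (copied) patches.
    ty = rng.randint(0, height - patch_h)
    tx = rng.randint(0, width - patch_w)
    sy = rng.randint(0, height - patch_h)
    sx = rng.randint(0, width - patch_w)
    mixed = [row[:] for row in image]
    for dy in range(patch_h):
        for dx in range(patch_w):
            # Read from the untouched original so overlapping patches are safe.
            mixed[ty + dy][tx + dx] = image[sy + dy][sx + dx]
    return mixed

# Hypothetical 4 x 4 single-channel image with distinct pixel values.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
mixed = self_mix(image, patch_h=2, patch_w=2, rng=random.Random(0))
```

Unlike dropout-style masking with zeros, every substituted value comes from the same image, so the pixel statistics of the input are preserved.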

preprint2020arXiv

Spatio-Temporal Dynamics of Visual Imagery for Intuitive Brain-Computer Interface

Visual imagery is an intuitive brain-computer interface paradigm, referring to the emergence of the visual scene. Despite its convenience, analysis of its intrinsic characteristics is limited. In this study, we demonstrate the effect of time interval and channel selection on the decoding performance of multi-class visual imagery. We divided the epoch into time intervals of 0-1 s and 1-2 s and performed six-class classification in three different brain regions: whole brain, visual cortex, and prefrontal cortex. Across the time intervals, the 0-1 s group showed an average classification accuracy of 24.2%, which was significantly higher than the 1-2 s group in the prefrontal cortex. Across the three regions, the classification accuracy of the prefrontal cortex was significantly higher than that of the visual cortex in the 0-1 s interval group, implying cognitive arousal during the visual imagery. This finding would provide crucial information for improving the decoding performance.

preprint2020arXiv

Three-Stream Fusion Network for First-Person Interaction Recognition

First-person interaction recognition is a challenging task because of unstable video conditions resulting from the camera wearer's movement. For human interaction recognition from a first-person viewpoint, this paper proposes a three-stream fusion network with two main parts: three-stream architecture and three-stream correlation fusion. The three-stream architecture captures the characteristics of the target appearance, target motion, and camera ego-motion. Meanwhile, the three-stream correlation fusion combines the feature map of each of the three streams to consider the correlations among the target appearance, target motion, and camera ego-motion. The fused feature vector is robust to the camera movement and compensates for the noise of the camera ego-motion. Short-term intervals are modeled using the fused feature vector, and a long short-term memory (LSTM) model considers the temporal dynamics of the video. We evaluated the proposed method on two public benchmark datasets to validate the effectiveness of our approach. The experimental results show that the proposed fusion method successfully generated a discriminative feature vector, and our network outperformed all competing activity recognition methods in first-person videos where considerable camera ego-motion occurs.

preprint2020arXiv

Towards Brain-Computer Interfaces for Drone Swarm Control

A noninvasive brain-computer interface (BCI) decodes brain signals to understand user intention. Recent advances have been made in BCI-based drone control systems as the demand for drone control increases. In particular, drone swarm control based on brain signals could benefit various sectors such as military service or industrial disaster response. This paper presents a prototype of a brain swarm interface system for a variety of scenarios using a visual imagery paradigm. We designed an experimental environment that could acquire brain signals under a drone swarm control simulator environment. Through the system, we collected the electroencephalogram (EEG) signals with respect to four different scenarios. Seven subjects participated in our experiment, and we evaluated classification performance using a basic machine learning algorithm. The grand average classification accuracy is higher than the chance level accuracy. Hence, we could confirm the feasibility of the drone swarm control system based on EEG signals for performing high-level tasks.