Researcher profile

Jee-weon Jung

Jee-weon Jung contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
13works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

13 published item(s)

preprint2022arXiv

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

The performance of spoofing countermeasure systems depends fundamentally upon the use of sufficiently representative training data. With this usually being limited, current solutions typically lack generalisation to attacks encountered in the wild. Strategies to improve reliability in the face of uncontrolled, unpredictable attacks are hence needed. We report in this paper our efforts to use self-supervised learning in the form of a wav2vec 2.0 front-end with fine tuning. Despite initial base representations being learned using only bona fide data and no spoofed data, we obtain the lowest equal error rates reported in the literature for both the ASVspoof 2021 Logical Access and Deepfake databases. When combined with data augmentation,these results correspond to an improvement of almost 90% relative to our baseline system.

preprint2022arXiv

Baseline Systems for the First Spoofing-Aware Speaker Verification Challenge: Score and Embedding Fusion

Deep learning has brought impressive progress in the study of both automatic speaker verification (ASV) and spoofing countermeasures (CM). Although solutions are mutually dependent, they have typically evolved as standalone sub-systems whereby CM solutions are usually designed for a fixed ASV system. The work reported in this paper aims to gauge the improvements in reliability that can be gained from their closer integration. Results derived using the popular ASVspoof2019 dataset indicate that the equal error rate (EER) of a state-of-the-art ASV system degrades from 1.63% to 23.83% when the evaluation protocol is extended with spoofed trials.%subjected to spoofing attacks. However, even the straightforward integration of ASV and CM systems in the form of score-sum and deep neural network-based fusion strategies reduce the EER to 1.71% and 6.37%, respectively. The new Spoofing-Aware Speaker Verification (SASV) challenge has been formed to encourage greater attention to the integration of ASV and CM systems as well as to provide a means to benchmark different solutions.

preprint2022arXiv

Pushing the limits of raw waveform speaker recognition

In recent years, speaker recognition systems based on raw waveform inputs have received increasing attention. However, the performance of such systems are typically inferior to the state-of-the-art handcrafted feature-based counterparts, which demonstrate equal error rates under 1% on the popular VoxCeleb1 test set. This paper proposes a novel speaker recognition model based on raw waveform inputs. The model incorporates recent advances in machine learning and speaker verification, including the Res2Net backbone module and multi-layer feature aggregation. Our best model achieves an equal error rate of 0.89%, which is competitive with the state-of-the-art models based on handcrafted features, and outperforms the best model based on raw waveform inputs by a large margin. We also explore the application of the proposed model in the context of self-supervised learning framework. Our self-supervised model outperforms single phase-based existing works in this line of research. Finally, we show that self-supervised pre-training is effective for the semi-supervised scenario where we only have a small set of labelled training data, along with a larger set of unlabelled examples.

preprint2022arXiv

SASV 2022: The First Spoofing-Aware Speaker Verification Challenge

The first spoofing-aware speaker verification (SASV) challenge aims to integrate research efforts in speaker verification and anti-spoofing. We extend the speaker verification scenario by introducing spoofed trials to the usual set of target and impostor trials. In contrast to the established ASVspoof challenge where the focus is upon separate, independently optimised spoofing detection and speaker verification sub-systems, SASV targets the development of integrated and jointly optimised solutions. Pre-trained spoofing detection and speaker verification models are provided as open source and are used in two baseline SASV solutions. Both models and baselines are freely available to participants and can be used to develop back-end fusion approaches or end-to-end solutions. Using the provided common evaluation protocol, 23 teams submitted SASV solutions. When assessed with target, bona fide non-target and spoofed non-target trials, the top-performing system reduces the equal error rate of a conventional speaker verification system from 23.83% to 0.13%. SASV challenge results are a testament to the reliability of today's state-of-the-art approaches to spoofing detection and speaker verification.

preprint2022arXiv

SASV Challenge 2022: A Spoofing Aware Speaker Verification Challenge Evaluation Plan

ASV (automatic speaker verification) systems are intrinsically required to reject both non-target (e.g., voice uttered by different speaker) and spoofed (e.g., synthesised or converted) inputs. However, there is little consideration for how ASV systems themselves should be adapted when they are expected to encounter spoofing attacks, nor when they operate in tandem with CMs (spoofing countermeasures), much less how both systems should be jointly optimised. The goal of the first SASV (spoofing-aware speaker verification) challenge, a special sesscion in ISCA INTERSPEECH 2022, is to promote development of integrated systems that can perform ASV and CM simultaneously.

preprint2021arXiv

DcaseNet: An integrated pretrained deep neural network for detecting and classifying acoustic scenes and events

Although acoustic scenes and events include many related tasks, their combined detection and classification have been scarcely investigated. We propose three architectures of deep neural networks that are integrated to simultaneously perform acoustic scene classification, audio tagging, and sound event detection. The first two architectures are inspired by human cognitive processes. The first architecture resembles the short-term perception for scene classification of adults, who can detect various sound events that are then used to identify the acoustic scene. The second architecture resembles the long-term learning of babies, being also the concept underlying self-supervised learning. Babies first observe the effects of abstract notions such as gravity and then learn specific tasks using such perceptions. The third architecture adds a few layers to the second one that solely perform a single task before its corresponding output layer. We aim to build an integrated system that can serve as a pretrained model to perform the three abovementioned tasks. Experiments on three datasets demonstrate that the proposed architecture, called DcaseNet, can be either directly used for any of the tasks while providing suitable results or fine-tuned to improve the performance of one task. The code and pretrained DcaseNet weights are available at https://github.com/Jungjee/DcaseNet.

preprint2021arXiv

Graph Attention Networks for Speaker Verification

This work presents a novel back-end framework for speaker verification using graph attention networks. Segment-wise speaker embeddings extracted from multiple crops within an utterance are interpreted as node representations of a graph. The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score. We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks. After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using affine transform, followed by a readout operation resulting in a scalar similarity score. To enable successful adaptation for speaker verification, we propose techniques such as separating trainable weights for attention map calculations between segment-wise speaker embeddings from different utterances. The effectiveness of the proposed framework is validated using three different speaker embedding extractors trained with different architectures and objective functions. Experimental results demonstrate consistent improvement over various baseline back-end classifiers, with an average equal error rate improvement of 20% over the cosine similarity back-end without test time augmentation.

preprint2020arXiv

A study on the role of subsidiary information in replay attack spoofing detection

In this study, we analyze the role of various categories of subsidiary information in conducting replay attack spoofing detection: `Room Size', `Reverberation', `Speaker-to-ASV distance, `Attacker-to-Speaker distance', and `Replay Device Quality'. As a means of analyzing subsidiary information, we use two frameworks to either subtract or include a category of subsidiary information to the code extracted from a deep neural network. For subtraction, we utilize an adversarial process framework which makes the code orthogonal to the basis vectors of the subsidiary information. For addition, we utilize the multi-task learning framework to include subsidiary information to the code. All experiments are conducted using the ASVspoof 2019 physical access scenario with the provided meta data. Through the analysis of the result of the two approaches, we conclude that various categories of subsidiary information does not reside enough in the code when the deep neural network is trained for binary classification. Explicitly including various categories of subsidiary information through the multi-task learning framework can help improve performance in closed set condition.

preprint2020arXiv

Acoustic Scene Classification using Audio Tagging

Acoustic scene classification systems using deep neural networks classify given recordings into pre-defined classes. In this study, we propose a novel scheme for acoustic scene classification which adopts an audio tagging system inspired by the human perception mechanism. When humans identify an acoustic scene, the existence of different sound events provides discriminative information which affects the judgement. The proposed framework mimics this mechanism using various approaches. Firstly, we employ three methods to concatenate tag vectors extracted using an audio tagging system with an intermediate hidden layer of an acoustic scene classification system. We also explore the multi-head attention on the feature map of an acoustic scene classification system using tag vectors. Experiments conducted on the detection and classification of acoustic scenes and events 2019 task 1-a dataset demonstrate the effectiveness of the proposed scheme. Concatenation and multi-head attention show a classification accuracy of 75.66 % and 75.58 %, respectively, compared to 73.63 % accuracy of the baseline. The system with the proposed two approaches combined demonstrates an accuracy of 76.75 %.

preprint2020arXiv

Capturing scattered discriminative information using a deep architecture in acoustic scene classification

Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data could be vital clues. However, these details are less noticeable and are easily removed using conventional non-linear activations (e.g. ReLU). Furthermore, making design choices to emphasize trivial details can easily lead to overfitting if the system is not sufficiently generalized. In this study, based on the analysis of the ASC task's characteristics, we investigate various methods to capture discriminative information and simultaneously mitigate the overfitting problem. We adopt a max feature map method to replace conventional non-linear activations in a deep neural network, and therefore, we apply an element-wise comparison between different filters of a convolution layer's output. Two data augment methods and two deep architecture modules are further explored to reduce overfitting and sustain the system's discriminative power. Various experiments are conducted using the detection and classification of acoustic scenes and events 2020 task1-a dataset to validate the proposed methods. Our results show that the proposed system consistently outperforms the baseline, where the single best performing system has an accuracy of 70.4% compared to 65.1% of the baseline.

preprint2020arXiv

Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms

Recent advances in deep learning have facilitated the design of speaker verification systems that directly input raw waveforms. For example, RawNet extracts speaker embeddings from raw waveforms, which simplifies the process pipeline and demonstrates competitive performance. In this study, we improve RawNet by scaling feature maps using various methods. The proposed mechanism utilizes a scale vector that adopts a sigmoid non-linear function. It refers to a vector with dimensionality equal to the number of filters in a given feature map. Using a scale vector, we propose to scale the feature map multiplicatively, additively, or both. In addition, we investigate replacing the first convolution layer with the sinc-convolution layer of SincNet. Experiments performed on the VoxCeleb1 evaluation dataset demonstrate the effectiveness of the proposed methods, and the best performing system reduces the equal error rate by half compared to the original RawNet. Expanded evaluation results obtained using the VoxCeleb1-E and VoxCeleb-H protocols marginally outperform existing state-of-the-art systems.

preprint2020arXiv

Segment Aggregation for short utterances speaker verification using raw waveforms

Most studies on speaker verification systems focus on long-duration utterances, which are composed of sufficient phonetic information. However, the performances of these systems are known to degrade when short-duration utterances are inputted due to the lack of phonetic information as compared to the long utterances. In this paper, we propose a method that compensates for the performance degradation of speaker verification for short utterances, referred to as "segment aggregation". The proposed method adopts an ensemble-based design to improve the stability and accuracy of speaker verification systems. The proposed method segments an input utterance into several short utterances and then aggregates the segment embeddings extracted from the segmented inputs to compose a speaker embedding. Then, this method simultaneously trains the segment embeddings and the aggregated speaker embedding. In addition, we also modified the teacher-student learning method for the proposed method. Experimental results on different input duration using the VoxCeleb1 test set demonstrate that the proposed technique improves speaker verification performance by about 45.37% relatively compared to the baseline system with 1-second test utterance condition.

preprint2020arXiv

Self-supervised pre-training with acoustic configurations for replay spoofing detection

Constructing a dataset for replay spoofing detection requires a physical process of playing an utterance and re-recording it, presenting a challenge to the collection of large-scale datasets. In this study, we propose a self-supervised framework for pretraining acoustic configurations using datasets published for other tasks, such as speaker verification. Here, acoustic configurations refer to the environmental factors generated during the process of voice recording but not the voice itself, including microphone types, place and ambient noise levels. Specifically, we select pairs of segments from utterances and train deep neural networks to determine whether the acoustic configurations of the two segments are identical. We validate the effectiveness of the proposed method based on the ASVspoof 2019 physical access dataset utilizing two well-performing systems. The experimental results demonstrate that the proposed method outperforms the baseline approach by 30%.