Source author record

Nam Soo Kim

Nam Soo Kim appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

eess.AS Computation and Language Sound Machine Learning eess.SP hep-lat

Catalog footprint

What is connected

10works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Deep learning study on the Dirac eigenvalue spectrum of staggered quarks

We study the chirality of staggered quarks on the Dirac eigenvalue spectrum using deep learning (DL) techniques. The Kluberg-Stern method to construct staggered bilinear operators conserves continuum property such as recursion relations, uniqueness of chirality, and Ward identities, which leads to a unique and characteristic pattern (we call it "leakage pattern (LP)") in the matrix elements of the chirality operator sandwiched between two quark eigenstates of staggered Dirac operator. DL analysis gives $99.4(2)\%$ accuracy on normal gauge configurations and $0.998$ AUC (Area Under ROC Curve) for classifying non-zero mode octets in the Dirac eigenvalue spectrum. It confirms that the leakage pattern is universal on normal gauge configurations. The multi-layer perceptron (MLP) method turns out to be the best DL model for our study on the LP.

preprint2022arXiv

Speech Intention Understanding in a Head-final Language: A Disambiguation Utilizing Intonation-dependency

For a large portion of real-life utterances, the intention cannot be solely decided by either their semantic or syntactic characteristics. Although not all the sociolinguistic and pragmatic information can be digitized, at least phonetic features are indispensable in understanding the spoken language. Especially in head-final languages such as Korean, sentence-final prosody has great importance in identifying the speaker's intention. This paper suggests a system which identifies the inherent intention of a spoken utterance given its transcript, in some cases using auxiliary acoustic features. The main point here is a separate distinction for cases where discrimination of intention requires an acoustic cue. Thus, the proposed classification system decides whether the given utterance is a fragment, statement, question, command, or a rhetorical question/command, utilizing the intonation-dependency coming from the head-finality. Based on an intuitive understanding of the Korean language that is engaged in the data annotation, we construct a network which identifies the intention of a speech, and validate its utility with the test sentences. The system, if combined with up-to-date speech recognizers, is expected to be flexibly inserted into various language understanding modules.

preprint2022arXiv

StyleKQC: A Style-Variant Paraphrase Corpus for Korean Questions and Commands

Paraphrasing is often performed with less concern for controlled style conversion. Especially for questions and commands, style-variant paraphrasing can be crucial in tone and manner, which also matters with industrial applications such as dialog systems. In this paper, we attack this issue with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Utilizing manually generated natural language queries on six daily topics, we expand the corpus to formal and informal sentences by human rewriting and transferring. We verify the validity and industrial applicability of our approach by checking the adequate classification and inference performance that fit with conventional fine-tuning approaches, at the same time proposing a supervised formality transfer task.

preprint2021arXiv

Continuous Monitoring of Blood Pressure with Evidential Regression

Photoplethysmogram (PPG) signal-based blood pressure (BP) estimation is a promising candidate for modern BP measurements, as PPG signals can be easily obtained from wearable devices in a non-invasive manner, allowing quick BP measurement. However, the performance of existing machine learning-based BP measuring methods still fall behind some BP measurement guidelines and most of them provide only point estimates of systolic blood pressure (SBP) and diastolic blood pressure (DBP). In this paper, we present a cutting-edge method which is capable of continuously monitoring BP from the PPG signal and satisfies healthcare criteria such as the Association for the Advancement of Medical Instrumentation (AAMI) and the British Hypertension Society (BHS) standards. Furthermore, the proposed method provides the reliability of the predicted BP by estimating its uncertainty to help diagnose medical condition based on the model prediction. Experiments on the MIMIC II database verify the state-of-the-art performance of the proposed method under several metrics and its ability to accurately represent uncertainty in prediction.

preprint2021arXiv

Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition

Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for better user experience. However, a common limitation of the conventional softmax-based online attention approaches is that they introduce an additional hyperparameter related to the length of the attention window, requiring multiple trials of model training for tuning the hyperparameter. In order to deal with this problem, we propose a novel softmax-free attention method and its modified formulation for online attention, which does not need any additional hyperparameter at the training phase. Through a number of ASR experiments, we demonstrate the tradeoff between the latency and performance of the proposed online attention technique can be controlled by merely adjusting a threshold at the test phase. Furthermore, the proposed methods showed competitive performance to the conventional global and online attentions in terms of word-error-rates (WERs).

preprint2020arXiv

Disentangled speaker and nuisance attribute embedding for robust speaker verification

Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by the nuisance attributes. The proposed framework was compared with the conventional deep learning-based embedding methods using the RSR2015 and VoxCeleb1 dataset. Experimental results show that the proposed approach can extract speaker embeddings robust to channel and emotional variability.

preprint2020arXiv

Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation

For multi-channel speech recognition, speech enhancement techniques such as denoising or dereverberation are conventionally applied as a front-end processor. Deep learning-based front-ends using such techniques require aligned clean and noisy speech pairs which are generally obtained via data simulation. Recently, several joint optimization techniques have been proposed to train the front-end without parallel data within an end-to-end automatic speech recognition (ASR) scheme. However, the ASR objective is sub-optimal and insufficient for fully training the front-end, which still leaves room for improvement. In this paper, we propose a novel approach which incorporates flow-based density estimation for the robust front-end using non-parallel clean and noisy speech. Experimental results on the CHiME-4 dataset show that the proposed method outperforms the conventional techniques where the front-end is trained only with ASR objective.

preprint2020arXiv

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Speech is one of the most effective means of communication and is full of information that helps the transmission of utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized end-to-end structures that preserve the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal upon the performance on Fluent Speech Command, an English SLU benchmark. Thereby, we experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.

preprint2020arXiv

Text Matters but Speech Influences: A Computational Analysis of Syntactic Ambiguity Resolution

Analyzing how human beings resolve syntactic ambiguity has long been an issue of interest in the field of linguistics. It is, at the same time, one of the most challenging issues for spoken language understanding (SLU) systems as well. As syntactic ambiguity is intertwined with issues regarding prosody and semantics, the computational approach toward speech intention identification is expected to benefit from the observations of the human language processing mechanism. In this regard, we address the task with attentive recurrent neural networks that exploit acoustic and textual features simultaneously and reveal how the modalities interact with each other to derive sentence meaning. Utilizing a speech corpus recorded on Korean scripts of syntactically ambiguous utterances, we revealed that co-attention frameworks, namely multi-hop attention and cross-attention, show significantly superior performance in disambiguating speech intention. With further analysis, we demonstrate that the computational models reflect the internal relationship between auditory and linguistic processes.

preprint2020arXiv

WaveNODE: A Continuous Normalizing Flow for Speech Synthesis

In recent years, various flow-based generative models have been proposed to generate high-fidelity waveforms in real-time. However, these models require either a well-trained teacher network or a number of flow steps making them memory-inefficient. In this paper, we propose a novel generative model called WaveNODE which exploits a continuous normalizing flow for speech synthesis. Unlike the conventional models, WaveNODE places no constraint on the function used for flow operation, thus allowing the usage of more flexible and complex functions. Moreover, WaveNODE can be optimized to maximize the likelihood without requiring any teacher network or auxiliary loss terms. We experimentally show that WaveNODE achieves comparable performance with fewer parameters compared to the conventional flow-based vocoders.

Nam Soo Kim

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Deep learning study on the Dirac eigenvalue spectrum of staggered quarks

Speech Intention Understanding in a Head-final Language: A Disambiguation Utilizing Intonation-dependency

StyleKQC: A Style-Variant Paraphrase Corpus for Korean Questions and Commands

Continuous Monitoring of Blood Pressure with Evidential Regression

Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition

Disentangled speaker and nuisance attribute embedding for robust speaker verification

Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Text Matters but Speech Influences: A Computational Analysis of Syntactic Ambiguity Resolution

WaveNODE: A Continuous Normalizing Flow for Speech Synthesis