Source author record

Hiroshi Sato

Hiroshi Sato appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

eess.AS Sound math.AG astro-ph.HE eess.SP Machine Learning physics.chem-ph

Catalog footprint

What is connected

12 works

7 topics

4 close collaborators


preprint, 2022, arXiv

How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR

It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with single-channel speech enhancement (SE). In this paper, we investigate the causes of ASR performance degradation by decomposing the SE errors using orthogonal projection-based decomposition (OPD). OPD decomposes the SE errors into noise and artifact components. The artifact component is defined as the SE error signal that cannot be represented as a linear combination of speech and noise sources. We propose manually scaling the error components to analyze their impact on ASR. We experimentally identify the artifact component as the main cause of performance degradation, and we find that mitigating the artifact can greatly improve ASR performance. Furthermore, we demonstrate that the simple observation adding (OA) technique (i.e., adding a scaled version of the observed signal to the enhanced speech) can monotonically increase the signal-to-artifact ratio under a mild condition. Accordingly, we experimentally confirm that OA improves ASR performance for both simulated and real recordings. The findings of this paper provide a better understanding of the influence of SE errors on ASR and open the door to future research on novel approaches for designing effective single-channel SE front-ends for ASR.
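As a rough illustration of the observation adding (OA) idea described in the abstract above, the following Python sketch adds a scaled copy of the observed signal to the enhanced signal, sample by sample. The function name `observation_adding`, the `scale` parameter, and the sample values are assumptions made for illustration; they are not taken from the paper, which does not specify an implementation here.

```python
def observation_adding(enhanced, observed, scale):
    """Return enhanced + scale * observed, computed sample by sample.

    `enhanced` and `observed` are equal-length sequences of float samples.
    `scale` is a hypothetical mixing weight; how to choose it is not
    covered by this sketch.
    """
    if len(enhanced) != len(observed):
        raise ValueError("signals must have the same length")
    return [e + scale * o for e, o in zip(enhanced, observed)]


# Toy signals (made-up values, chosen only to make the arithmetic easy to follow).
enhanced = [0.5, -0.25, 0.0]
observed = [1.0, 0.5, -1.0]
mixed = observation_adding(enhanced, observed, 0.5)
print(mixed)
```

With `scale` set to 0 the output is just the enhanced signal, so the weight controls how much of the unprocessed observation is mixed back in.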

preprint, 2022, arXiv

Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition

The combination of a deep neural network (DNN)-based speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end is a widely used approach to implement overlapping speech recognition. However, the SE front-end generates processing artifacts that can degrade the ASR performance. We previously found that such performance degradation can occur even under fully overlapping conditions, depending on the signal-to-interference ratio (SIR) and signal-to-noise ratio (SNR). To mitigate the degradation, we introduced a rule-based method to switch the ASR input between the enhanced and observed signals, which showed promising results. However, the rule's optimality was unclear because it was heuristically designed and based only on SIR and SNR values. In this work, we propose a DNN-based switching method that directly estimates whether ASR will perform better on the enhanced or observed signals. We also introduce soft-switching that computes a weighted sum of the enhanced and observed signals for ASR input, with weights given by the switching model's output posteriors. The proposed learning-based switching showed performance comparable to that of rule-based oracle switching. The soft-switching further improved the ASR performance and achieved a relative character error rate reduction of up to 23% as compared with the conventional method.
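The difference between hard switching and soft-switching described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the switching model is reduced to a single hypothetical number `p_enhanced` (its posterior that ASR performs better on the enhanced signal), and the function names, the 0.5 threshold, and the sample values are invented for the example rather than taken from the paper.

```python
def hard_switch(enhanced, observed, p_enhanced, threshold=0.5):
    """Pick one signal: enhanced if the posterior reaches the threshold."""
    return list(enhanced) if p_enhanced >= threshold else list(observed)


def soft_switch(enhanced, observed, p_enhanced):
    """Weighted sum of the two signals, with weights p and 1 - p."""
    if not 0.0 <= p_enhanced <= 1.0:
        raise ValueError("p_enhanced must be a probability in [0, 1]")
    if len(enhanced) != len(observed):
        raise ValueError("signals must have the same length")
    w_observed = 1.0 - p_enhanced
    return [p_enhanced * e + w_observed * o for e, o in zip(enhanced, observed)]


# Toy signals and a made-up posterior, for illustration only.
enhanced = [1.0, 0.0]
observed = [0.0, 2.0]
soft = soft_switch(enhanced, observed, 0.75)
hard = hard_switch(enhanced, observed, 0.25)
print(soft, hard)
```

In this sketch, a posterior of exactly 0 or 1 makes the soft version return the observed or the enhanced signal unchanged, so it reduces to a hard choice at the extremes.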

preprint, 2022, arXiv

Listen only to me! How well can target speech extraction handle false alarms?

Target speech extraction (TSE) extracts the speech of a target speaker in a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. There has been much progress in extraction performance following the recent development of neural networks for speech enhancement and separation. Most studies have focused on processing mixtures where the target speaker is actively speaking. However, the target speaker is sometimes silent in practice, i.e., an inactive speaker (IS). A typical TSE system will tend to output a signal in IS cases, causing false alarms. This is a severe problem for the practical deployment of TSE systems. This paper aims to better understand how well TSE systems can handle IS cases. We consider two approaches to deal with IS: (1) training a system to directly output zero signals, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset and reveal their pros and cons.

preprint, 2022, arXiv

Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., a characteristics mismatch between the target speech and an enrollment utterance. While most conventional approaches focus on improving average performance given a set of enrollment utterances, here we propose to guarantee the worst performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure the robustness towards enrollment variations. We also introduce a novel training scheme that aims at directly optimizing the worst-case performance by focusing on training with difficult enrollment cases where extraction does not perform well. In addition, we investigate the effectiveness of auxiliary speaker identification loss (SI-loss) as another way to improve robustness over enrollments. Experimental validation reveals the effectiveness of both worst-enrollment target training and SI-loss training to improve robustness against enrollment variations, by increasing speaker discriminability.

preprint, 2021, arXiv

Multimodal Attention Fusion for Target Speaker Extraction

Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently, audio-visual target speaker extraction has been proposed, which extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers more stable performance than single-modality methods on simulated data, neither its adaptation to realistic situations nor its evaluation on real recorded mixtures has been fully explored. One of the major issues in handling realistic situations is how to make the system robust to clue corruption, because in real recordings both clues may not be equally reliable, e.g., visual clues may be affected by occlusions. In this work, we propose a novel attention mechanism for multi-modal fusion, together with training methods, that effectively captures the reliability of the clues and gives more weight to the more reliable ones. Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data. Moreover, we also record an audio-visual dataset of simultaneous speech with realistic visual clue corruption and show that audio-visual target speaker extraction with our proposals successfully works on real data.

preprint, 2021, arXiv

Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

Although recent advances in deep learning technology improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate "processing artifacts" as a side effect of the enhancement, which degrades ASR performance. For example, it is well known that single-channel noise reduction for non-speech noise (non-overlapping speech) often does not improve ASR. Likewise, the processing artifacts may also be detrimental to ASR in some conditions when processing overlapping speech with a separation/extraction method, although it is usually believed that separation/extraction improves ASR. In order to answer the question "Do we always have to separate/extract speech from mixtures?", we analyze ASR performance on observed and enhanced speech at various noise and interference conditions, and show that speech enhancement degrades ASR under some conditions even for overlapping speech. Based on these findings, we propose a simple switching algorithm between observed and enhanced speech based on the estimated signal-to-interference ratio and signal-to-noise ratio. We demonstrated experimentally that such a simple switching mechanism can improve recognition performance when processing artifacts are detrimental to ASR.

preprint, 2020, arXiv

Observation of exotic water in hydrophilic nanospace of porous coordination polymers

A fundamental understanding of water confined in porous coordination polymers (PCPs) is important not only for applications such as gas storage and separation, but also for exploring confinement effects in nanoscale spaces. Here, we report the observation of an exotic water in the well-designed hydrophilic nanopores of PCPs. Single-crystal X-ray diffraction found that the nanoconfined water has an ordered structure that is characteristic of ices, but infrared spectroscopy revealed a significant number of broken hydrogen bonds, which is characteristic of liquids. We found that its structural properties are quite similar to those of solid-liquid supercritical water predicted in hydrophobic nanospace at extremely high pressure. Our results will open up not only new potential applications of exotic water in PCPs to control chemical reactions but also experimental systems to clarify the existence of solid-liquid critical points.

preprint, 2020, arXiv

Toric Fano manifolds of dimension at most eight with positive second Chern characters

We show that any toric Fano manifold of dimension at most eight with the positive second Chern character is isomorphic to the projective space by using polymake.

preprint, 2018, arXiv

Notes on toric varieties from Mori theoretic viewpoint, II

We give new estimates of lengths of extremal rays of birational type for toric varieties. We can see that our new estimates are the best by constructing some examples explicitly. As applications, we discuss the nefness and pseudo-effectivity of adjoint bundles of projective toric varieties. We also treat some generalizations of Fujita's freeness and very ampleness for toric varieties.

preprint, 2012, arXiv

Suzaku investigation into the nature of the nearest ultraluminous X-ray source, M33 X-8

The X-ray spectrum of the nearest ultraluminous X-ray source, M33 X-8, obtained by Suzaku during 2010 January 11-13, was closely analyzed to examine its nature. It is, by far, the only data with the highest signal statistic in the 0.4-10 keV range. Despite being able to reproduce the X-ray spectrum, Comptonization of the disk photons failed to give a physically meaningful solution. A modified version of the multi-color disk model, in which the dependence of the disk temperature on the radius is described as r^(-p) with p being a free parameter, can also approximate the spectrum. From this model, the innermost disk temperature and bolometric luminosity were obtained as T_in = 2.00 (+0.06/-0.05) keV and L_disk = 1.36 x 10^39 (cos i)^(-1) ergs/s, respectively, where i is the disk inclination. A small temperature gradient of p = 0.535 (+0.004/-0.005), together with the high disk temperature, is regarded as a signature of the slim accretion disk model, suggesting that M33 X-8 was accreting at a high mass accretion rate. With a correction factor for the slim disk taken into account, the innermost disk radius, R_in = 81.9 (+5.9/-6.5) (cos i)^(-0.5) km, corresponds to a black hole mass of M ~ 10 M_sun (cos i)^(-0.5). Accordingly, the bolometric disk luminosity is estimated to be about 80 (cos i)^(-0.5)% of the Eddington limit. A numerically calculated slim disk spectrum was found to reach a similar result. Thus, an extremely super-Eddington luminosity is not required to explain the nature of M33 X-8. This conclusion is utilized to argue for the existence of intermediate mass black holes with M > 100 M_sun radiating at sub/trans-Eddington luminosity among ultraluminous X-ray sources with L_disk > 10^(40) ergs/s.

preprint, 2011, arXiv

The numerical class of a surface on a toric manifold

In this paper, we give a method to describe the numerical class of a torus invariant surface on a projective toric manifold. As applications, we can classify toric 2-Fano manifolds of Picard number 2 or of dimension at most 4.

preprint, 2004, arXiv

Introduction to the toric Mori theory

The main purpose of this paper is to give a simple and non-combinatorial proof of the toric Mori theory. Here, the toric Mori theory means the (log) Minimal Model Program (MMP, for short) for toric varieties. We minimize the arguments on fans and their decompositions. We recommend this paper to the following people: (A) those who are uncomfortable with manipulating fans and their decompositions, (B) those who are familiar with toric geometry but not with the MMP. People in the category (A) will be relieved from tedious combinatorial arguments in several problems. Those in the category (B) will discover the potential of the toric Mori theory. As applications, we treat the Zariski decomposition on toric varieties and the partial resolutions of non-degenerate hypersurface singularities. By these applications, the reader will learn to use the toric Mori theory.
