Source author record

Tomohiro Tanaka

Tomohiro Tanaka appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computation and Language, Machine Learning, astro-ph.GA, astro-ph.SR, eess.AS, Sound, eess.SP

Catalog footprint

What is connected

11 works

7 topics

4 close collaborators


preprint · 2022 · arXiv

Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models

Self-supervised learning (SSL) is seen as a very promising approach with high performance for several speech downstream tasks. Since SSL models generally have so many parameters that training and inference require a lot of memory and computation, it is desirable to produce compact SSL models without significant performance degradation by applying compression methods such as knowledge distillation (KD). Although the KD approach is able to shrink the depth and/or width of SSL model structures, there has been little research on how varying the depth and width impacts the internal representation of the small-footprint model. This paper provides an empirical study that addresses the question. We investigate the performance on SUPERB while varying the structure and KD methods so as to keep the number of parameters constant; this allows us to analyze the contribution of the representation introduced by varying the model architecture. Experiments demonstrate that a certain depth is essential for solving content-oriented tasks (e.g. automatic speech recognition) accurately, whereas a certain width is necessary for achieving high performance on several speaker-oriented tasks (e.g. speaker identification). Based on these observations, we identify, for SUPERB, a more compressed model with better performance than those of previous studies.
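The constant-parameter comparison described in this abstract can be illustrated with a rough parameter count. The sketch below is not the paper's configuration: it assumes a plain Transformer encoder in which each layer costs about 4·d² parameters for attention plus 2·r·d² for a feed-forward block with expansion ratio r, and it ignores biases, layer norms and any convolutional front end. The names `encoder_params` and `width_for_budget` are hypothetical.

```python
import math


def encoder_params(num_layers, d_model, ffn_ratio=4):
    """Approximate weight count of a Transformer encoder stack.

    Assumption: 4*d^2 for the attention projections (Q, K, V, output)
    plus 2*ffn_ratio*d^2 for the two feed-forward matrices per layer.
    """
    per_layer = 4 * d_model**2 + 2 * ffn_ratio * d_model**2
    return num_layers * per_layer


def width_for_budget(budget, num_layers, ffn_ratio=4):
    """Largest d_model whose stack of num_layers layers fits the budget."""
    per_layer_coeff = 4 + 2 * ffn_ratio
    return int(math.sqrt(budget / (num_layers * per_layer_coeff)))


if __name__ == "__main__":
    budget = encoder_params(num_layers=12, d_model=768)
    for layers in (12, 6, 3):
        d = width_for_budget(budget, layers)
        print(layers, d, encoder_params(layers, d))
```

Under these assumptions, a 12-layer stack of width 768 and a 3-layer stack of width 1536 have the same approximate count, which is the kind of deep-versus-wide pair that a fixed budget allows one to compare.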

preprint · 2022 · arXiv

Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the target speaker's voice. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., the characteristics mismatch between target speech and an enrollment utterance. While most conventional approaches focus on improving average performance given a set of enrollment utterances, here we propose to guarantee the worst performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure the robustness towards enrollment variations. We also introduce a novel training scheme that aims at directly optimizing the worst-case performance by focusing on training with difficult enrollment cases where extraction does not perform well. In addition, we investigate the effectiveness of auxiliary speaker identification loss (SI-loss) as another way to improve robustness over enrollments. Experimental validation reveals the effectiveness of both worst-enrollment target training and SI-loss training to improve robustness against enrollment variations, by increasing speaker discriminability.
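To make the contrast between average and worst-case evaluation concrete, here is a minimal sketch. It assumes that the worst-enrollment SDR for one mixture is the minimum SDR over the outputs obtained with different enrollment utterances; the abstract does not give the exact definition, so this reading is an assumption. The SDR used here is a plain energy ratio in dB, not necessarily the variant used in the paper, and the function names are hypothetical.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB (no scale or filter invariance)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    distortion = reference - estimate
    num = np.sum(reference**2) + eps
    den = np.sum(distortion**2) + eps
    return 10.0 * np.log10(num / den)


def worst_and_average_sdr(reference, estimates_per_enrollment):
    """Score one mixture under several enrollments.

    estimates_per_enrollment: one extracted signal per enrollment utterance.
    Returns (worst SDR, average SDR) in dB.
    """
    scores = [sdr_db(reference, est) for est in estimates_per_enrollment]
    return float(min(scores)), float(np.mean(scores))


if __name__ == "__main__":
    ref = np.array([1.0, 0.0, -1.0, 0.0])
    # Two toy "extractions": one close to the reference, one heavily attenuated.
    estimates = [0.9 * ref, 0.5 * ref]
    worst, average = worst_and_average_sdr(ref, estimates)
    print(f"worst={worst:.2f} dB, average={average:.2f} dB")
```

The average can look acceptable while a single poorly matched enrollment drags the worst case down, which is the gap the metric is meant to expose.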

preprint · 2021 · arXiv

Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals to reflect the speech characteristics during training. Audio-visual speech separation is a technique to estimate the individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separation mainly train the separation model on the audio-only loss, which reflects the distance between the source signals and the separated signals. However, conventional losses do not reflect the characteristics of the speech signals, including the speaker's characteristics and phonetic information, which leads to distortion or remaining noise. To address this problem, we propose the cross-modal correspondence (CMC) loss, which is based on the cooccurrence of the speech signal and the visual signal. Since the visual signal is not affected by background noise and contains speaker and phonetic information, using the CMC loss enables the audio-visual speech separation model to remove noise while preserving the speech characteristics. Experimental results demonstrate that the proposed method learns the cooccurrence on the basis of CMC loss, which improves separation performance.

preprint · 2021 · arXiv

End-to-End Automatic Speech Recognition with Deep Mutual Learning

This paper is the first study to apply deep mutual learning (DML) to end-to-end ASR models. In DML, multiple models are trained simultaneously and collaboratively by mimicking each other throughout the training process, which helps to attain the global optimum and prevent models from making over-confident predictions. While previous studies applied DML to simple multi-class classification problems, there are no studies that have used it on more complex sequence-to-sequence mapping problems. For this reason, this paper presents a method to apply DML to state-of-the-art Transformer-based end-to-end ASR models. In particular, we propose to combine DML with recent representative training techniques, i.e., label smoothing, scheduled sampling, and SpecAugment, each of which is essential for powerful end-to-end ASR models. We expect that these training techniques work well with DML because DML has complementary characteristics. We experimented with two setups for Japanese ASR tasks: large-scale modeling and compact modeling. We demonstrate that DML improves the ASR performance of both modeling setups compared with conventional learning methods including knowledge distillation. We also show that combining DML with the existing training techniques effectively improves ASR performance.
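The "mimicking each other" idea can be sketched for the simple classification case. The code below assumes the commonly used two-model form of deep mutual learning, in which each model minimizes its own cross-entropy plus a KL term that pulls its predictions toward its peer's; how the paper extends this to sequence-to-sequence ASR is not stated in the abstract and is not reproduced here. The `weight` parameter and function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def dml_losses(logits_a, logits_b, targets, weight=1.0):
    """Per-model losses for two mutually learning classifiers.

    loss_a = CE(targets, p_a) + weight * KL(p_b || p_a)
    loss_b = CE(targets, p_b) + weight * KL(p_a || p_b)
    In real training, the peer distribution inside each KL term is
    usually treated as a constant when taking gradients.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    idx = np.arange(len(targets))
    ce_a = -np.log(p_a[idx, targets]).mean()
    ce_b = -np.log(p_b[idx, targets]).mean()
    kl_b_a = (p_b * (np.log(p_b) - np.log(p_a))).sum(axis=-1).mean()
    kl_a_b = (p_a * (np.log(p_a) - np.log(p_b))).sum(axis=-1).mean()
    return ce_a + weight * kl_b_a, ce_b + weight * kl_a_b


if __name__ == "__main__":
    targets = np.array([0, 3])
    uniform = np.zeros((2, 4))
    confident = np.array([[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]])
    print(dml_losses(uniform, confident, targets))
```

When both models output the same distribution, the KL terms vanish and each loss reduces to ordinary cross-entropy; the mimicry term only acts where the peers disagree.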

preprint · 2021 · arXiv

Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation

We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing in which each utterance is independently transcribed. On the other hand, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utterance boundaries, handle sequences of utterances such as discourses and conversations well. However, the transformer architecture, which has recently achieved state-of-the-art ASR performance among utterance-level ASR systems, has not yet been introduced into large-context ASR systems. We can expect that the transformer architecture can be leveraged to effectively capture not only input speech contexts but also long-range sequential contexts beyond utterance boundaries. Therefore, this paper proposes a hierarchical transformer-based large-context E2E-ASR model that combines the transformer architecture with hierarchical encoder-decoder based large-context modeling. In addition, in order to enable the proposed model to use long-range sequential contexts, we also propose a large-context knowledge distillation that distills the knowledge from a pre-trained large-context language model in the training phase. We evaluate the effectiveness of the proposed model and proposed training method on Japanese discourse ASR tasks.

preprint · 2021 · arXiv

Large-Context Conversational Representation Learning: Self-Supervised Learning for Conversational Documents

This paper presents a novel self-supervised learning method for handling conversational documents consisting of transcribed text of human-to-human conversations. One of the key technologies for understanding conversational documents is utterance-level sequential labeling, where labels are estimated from the documents in an utterance-by-utterance manner. The main issue with utterance-level sequential labeling is the difficulty of collecting labeled conversational documents, as manual annotations are very costly. To deal with this issue, we propose large-context conversational representation learning (LC-CRL), a self-supervised learning method specialized for conversational documents. A self-supervised learning task in LC-CRL involves the estimation of an utterance using all the surrounding utterances based on large-context language modeling. In this way, LC-CRL enables us to effectively utilize unlabeled conversational documents and thereby enhances the utterance-level sequential labeling. The results of experiments on scene segmentation tasks using contact center conversational datasets demonstrate the effectiveness of the proposed method.

preprint · 2021 · arXiv

MAPGN: MAsked Pointer-Generator Network for sequence-to-sequence pre-training

This paper presents a self-supervised learning method for pointer-generator networks to improve spoken-text normalization. Spoken-text normalization, which converts spoken-style text into style-normalized text, is becoming an important technology for improving subsequent processing such as machine translation and summarization. The most successful spoken-text normalization method to date is sequence-to-sequence (seq2seq) mapping using pointer-generator networks that possess a copy mechanism from an input sequence. However, these models require a large amount of paired data of spoken-style text and style-normalized text, and it is difficult to prepare such a volume of data. In order to construct a spoken-text normalization model from the limited paired data, we focus on self-supervised learning, which can utilize unpaired text data to improve seq2seq models. Unfortunately, conventional self-supervised learning methods do not assume that pointer-generator networks are utilized. Therefore, we propose a novel self-supervised learning method, MAsked Pointer-Generator Network (MAPGN). The proposed method can effectively pre-train the pointer-generator network by learning to fill masked tokens using the copy mechanism. Our experiments demonstrate that MAPGN is more effective for pointer-generator networks than the conventional self-supervised learning methods in two spoken-text normalization tasks.

preprint · 2015 · arXiv

Dense Clumps and Candidates for Molecular Outflows in W40

We report results of the CO(J=3-2) and HCO+(J=4-3) observations of the W40 HII region with the ASTE 10 m telescope (HPBW~22 arcsec) to search for molecular outflows and dense clumps. We found that the velocity field in the region is highly complex, consisting of at least four distinct velocity components at V_LSR ~ 3, 5, 7, and 10 km/s. The ~7 km/s component represents the systemic velocity of cold gas surrounding the entire region, and causes heavy absorption in the CO spectra over the velocity range 6 < V_LSR < 9 km/s. The ~5 and ~10 km/s components exhibit high CO temperature (>40 K) and are found mostly around the HII region, suggesting that these components are likely to be tracing dense gas interacting with the expanding shell around the HII region. Based on the CO data, we identified 13 regions of high velocity gas which we interpret as candidate outflow lobes. Using the HCO+ data, we also identified six clumps and estimated their physical parameters. On the basis of the ASTE data and near-infrared images from 2MASS, we present an updated three-dimensional model of this region. In order to investigate molecular outflows in W40, the SiO (J=1-0, v=0) emission line and some other emission lines at 40 GHz were also observed with the 45 m telescope at the Nobeyama Radio Observatory, but they were not detected at the present sensitivity.

preprint · 2014 · arXiv

Cluster Formation Triggered by Filament Collisions in Serpens South

The Serpens South infrared dark cloud consists of several filamentary ridges, some of which fragment into dense clumps. On the basis of CCS ($J_N=4_3-3_2$), HC$_3$N ($J=5-4$), N$_2$H$^+$ ($J=1-0$), and SiO ($J=2-1, v=0$) observations, we investigated the kinematics and chemical evolution of these filamentary ridges. We find that CCS is extremely abundant along the main filament in the protocluster clump. We emphasize that Serpens South is the first cluster-forming region where extremely strong CCS emission is detected. The CCS-to-N$_2$H$^+$ abundance ratio is estimated to be about 0.5 toward the protocluster clump, whereas it is about 3 in the other parts of the main filament. We identify six dense ridges with different $V_{\rm LSR}$. These ridges appear to converge toward the protocluster clump, suggesting that the collisions of these ridges may have triggered cluster formation. The collisions presumably happened within a few $\times 10^5$ yr because CCS is abundant only for such a short time. The short lifetime agrees with the fact that the number fraction of Class I objects, whose typical lifetime is $0.4 \times 10^5$ yr, is extremely high, about 70 percent, in the protocluster clump. In the northern part, two ridges appear to have partially collided, forming a V-shaped clump. In addition, we detected strong bipolar SiO emission that is due to the molecular outflow blowing out of the protostellar clump, as well as extended weak SiO emission that may originate from the filament collisions.

preprint · 2014 · arXiv

High abundance ratio of $^{13}$CO to C$^{18}$O toward photon-dominated regions in the Orion-A giant molecular cloud

Aims. We derive physical properties such as the optical depths and the column densities of $^{13}$CO and C$^{18}$O to investigate the relationship between the far ultraviolet (FUV) radiation and the abundance ratios between $^{13}$CO and C$^{18}$O. Method. We have carried out wide-field (0.4 deg$^2$) observations with an angular resolution of 25.8 arcsec ($\sim$ 0.05 pc) in $^{13}$CO ($J$=1--0) and C$^{18}$O ($J$=1--0) toward the Orion-A giant molecular cloud using the Nobeyama 45 m telescope in the on-the-fly mode. Results. Overall distributions and velocity structures of the $^{13}$CO and C$^{18}$O emissions are similar to those of the $^{12}$CO ($J$=1--0) emission. The optical depths of the $^{13}$CO and C$^{18}$O emission lines are estimated to be 0.05 $<$ $\tau_{\rm ^{13}CO}$ $<$ 1.54 and 0.01 $<$ $\tau_{\rm C^{18}O}$ $<$ 0.18, respectively. The column densities of the $^{13}$CO and C$^{18}$O emission lines are estimated to be 0.2 $\times$ 10$^{16}$ $<$ $N_{\rm ^{13}CO}$ $<$ 3.7 $\times$ 10$^{17}$ cm$^{-2}$ and 0.4 $\times$ 10$^{15}$ $<$ $N_{\rm C^{18}O}$ $<$ 3.5 $\times$ 10$^{16}$ cm$^{-2}$, respectively. The abundance ratios between $^{13}$CO and C$^{18}$O, $X_{\rm ^{13}CO}$/$X_{\rm C^{18}O}$, are found to be 5.7 to 33.0. The mean value of $X_{\rm ^{13}CO}$/$X_{\rm C^{18}O}$ in the nearly edge-on photon-dominated regions is found to be 16.47 $\pm$ 0.10, which is about three times the solar system value of 5.5. The mean value of $X_{\rm ^{13}CO}$/$X_{\rm C^{18}O}$ in the other regions is found to be 12.29 $\pm$ 0.02. The difference in the abundance ratio is most likely due to the selective FUV photodissociation of C$^{18}$O.

preprint · 2013 · arXiv

The Dynamical State of The Serpens South Filamentary Infrared Dark Cloud

We present the results of N$_2$H$^+$ ($J=1-0$) observations toward Serpens South, the nearest cluster-forming, infrared dark cloud. The physical quantities are derived by fitting the hyperfine structure of N$_2$H$^+$. The Herschel and 1.1-mm continuum maps show that a pc-scale filament fragments into three clumps with radii of $0.1-0.2$ pc and masses of $40-230M_\odot$. We find that the clumps contain smaller-scale ($\sim 0.04$ pc) structures, i.e., dense cores. We identify 70 cores by applying CLUMPFIND to the N$_2$H$^+$ data cube. In the central cluster-forming clump, the excitation temperature and line-width tend to be large, presumably due to protostellar outflow feedback and stellar radiation. However, for all the clumps, the virial ratios are evaluated to be $0.1-0.3$, indicating that the internal motions play only a minor role in the clump support. The clumps exhibit not free-fall but low-velocity infall, and thus the clumps should be supported by additional forces. The most promising force is the globally-ordered magnetic field observed toward this region. We propose that the Serpens South filament was close to magnetically-critical and ambipolar diffusion triggered the cluster formation. We find that the northern clump, which shows no active star formation, has a mass and radius comparable to the central cluster-forming clump, and therefore, it is a likely candidate for a "pre-protocluster clump". The initial condition for cluster formation is likely to be a magnetically-supported clump of cold, quiescent gas. This appears to contradict the accretion-driven turbulence scenario, for which the turbulence in the clumps is maintained by the accretion flow.
