Source author record

Xiaofei Li

Xiaofei Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher Unclaimed source record

Sound eess.AS math.AP Machine Learning math-ph math.MP

Catalog footprint

What is connected

12 works

6 topics

4 close collaborators

preprint 2023 arXiv

Fine-tune the pretrained ATST model for sound event detection

Sound event detection (SED) often suffers from the data deficiency problem. The recent baseline system in the DCASE2023 challenge task 4 leverages large pretrained self-supervised learning (SelfSL) models to mitigate this restriction, where the pretrained models help to produce more discriminative features for SED. However, the pretrained models are treated as frozen feature extractors in the challenge baseline system and in most of the challenge submissions, and fine-tuning of the pretrained models has rarely been studied. In this work, we study a fine-tuning method for pretrained models in SED. We first introduce ATST-Frame, our newly proposed SelfSL model, to the SED system. ATST-Frame was designed specifically for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performance on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame using both (in-domain) unlabelled and labelled SED data. Our experiments show that the proposed method overcomes the overfitting problem when fine-tuning the large pretrained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.

preprint 2022 arXiv

Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization

The direct-path relative transfer function (DP-RTF) is the ratio between the direct-path acoustic transfer functions of two microphone channels. Though the DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to learn the DP-RTF with deep neural networks for robust binaural sound source localization. A DP-RTF learning network is designed to regress the binaural sensor signals to a real-valued representation of the DP-RTF. It consists of a branched convolutional neural network module that separately extracts the inter-channel magnitude and phase patterns, and a convolutional recurrent neural network module for joint feature learning. To better exploit the speech spectra to aid DP-RTF estimation, a monaural speech enhancement network is used to recover the direct-path spectrograms from the noisy ones. The enhanced spectrograms are stacked onto the noisy spectrograms to form the input of the DP-RTF learning network. We train a single DP-RTF learning network using many different binaural arrays to enable the generalization of DP-RTF learning across arrays. This avoids time-consuming training data collection and network retraining for a new array, which is very useful in practical applications. Experimental results on both simulated and real-world data show the effectiveness of the proposed method for direction of arrival (DOA) estimation in noisy and reverberant environments, and a good generalization ability to unseen binaural arrays.
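
As a rough illustration of the definition above, the sketch below computes a DP-RTF for an idealized free-field direct path in which each channel's direct-path transfer function is a pure delay with attenuation. The function names, the delay-and-gain model, the numeric values, and the stacking of real and imaginary parts as a "real-valued representation" are assumptions made for illustration; they are not taken from the paper.

```python
import cmath
import math

def direct_path_atf(freq_hz, delay_s, gain):
    """Idealized direct-path transfer function: attenuation plus a pure delay."""
    return gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)

def dp_rtf(freq_hz, delay_ref, gain_ref, delay_other, gain_other):
    """DP-RTF: ratio of the other channel's direct-path ATF to the reference channel's."""
    return (direct_path_atf(freq_hz, delay_other, gain_other)
            / direct_path_atf(freq_hz, delay_ref, gain_ref))

def real_valued(rtf):
    """One possible real-valued encoding (assumed): stacked real and imaginary parts."""
    return [rtf.real, rtf.imag]

# Hypothetical example: the source is closer to the reference channel.
rtf = dp_rtf(1000.0, delay_ref=0.0100, gain_ref=1.0,
             delay_other=0.0103, gain_other=0.8)
level_difference_db = 20 * math.log10(abs(rtf))  # inter-channel level difference
phase_difference = cmath.phase(rtf)              # inter-channel phase difference (wrapped)
features = real_valued(rtf)
```

In this toy model the magnitude of the ratio carries the level cue and its phase carries the time-delay cue, which is why the ratio can serve as a localization feature.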

preprint 2022 arXiv

Multi-channel Narrow-band Deep Speech Separation with Full-band Permutation Invariant Training

This paper addresses the problem of multi-channel multi-speech separation based on deep learning techniques. In the short-time Fourier transform domain, we propose an end-to-end narrow-band network that directly takes as input the multi-channel mixture signals of one frequency, and outputs the separated signals of this frequency. In a narrow band, the spatial information (or inter-channel difference) can discriminate well between speakers at different positions. This information is used intensively in many narrow-band speech separation methods, such as beamforming and clustering of spatial vectors. The proposed network is trained to learn a rule that automatically exploits this information and performs speech separation. Such a rule should be valid for any frequency, hence the network is shared by all frequencies. In addition, a full-band permutation invariant training criterion is proposed to solve the frequency permutation problem encountered by most narrow-band methods. Experiments show that, by focusing on deeply learning the narrow-band information, the proposed method outperforms the oracle beamforming method and the state-of-the-art deep learning based method.
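
The idea of a full-band permutation invariant criterion can be sketched as follows: the speaker permutation is chosen once for the whole band, using a loss summed over all frequencies, so that every frequency of one output is tied to the same speaker. The function name, the nested-list data layout, and the squared-error loss are assumptions for illustration; the abstract does not specify the loss used.

```python
from itertools import permutations

def full_band_pit_loss(estimates, targets):
    """estimates/targets: [speaker][frequency][frame] lists of complex STFT values.

    Returns (loss, best_permutation). A single permutation is selected for the
    whole band, so the outputs cannot swap speakers from one frequency to the next.
    """
    n_spk = len(targets)
    best = None
    for perm in permutations(range(n_spk)):
        # Squared error summed over ALL frequencies and frames (assumed loss).
        loss = 0.0
        for out_idx, tgt_idx in enumerate(perm):
            for est_f, tgt_f in zip(estimates[out_idx], targets[tgt_idx]):
                loss += sum(abs(e - t) ** 2 for e, t in zip(est_f, tgt_f))
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best

# Toy data: two speakers, two frequencies, two frames; outputs are swapped.
targets = [
    [[1 + 0j, 2 + 0j], [3 + 0j, 4 + 0j]],
    [[-1 + 0j, -2 + 0j], [-3 + 0j, -4 + 0j]],
]
estimates = [targets[1], targets[0]]
loss, perm = full_band_pit_loss(estimates, targets)
```

A per-frequency criterion would instead pick a permutation independently for each frequency, which is what gives rise to the frequency permutation problem mentioned above.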

preprint 2022 arXiv

Multichannel Speech Separation with Narrow-band Conformer

This work proposes a multichannel speech separation method with narrow-band Conformer (named NBC). The network is trained to learn to automatically exploit narrow-band speech separation information, such as spatial vector clustering of multiple speakers. Specifically, in the short-time Fourier transform (STFT) domain, the network processes each frequency independently, and is shared by all frequencies. For one frequency, the network inputs the STFT coefficients of multichannel mixture signals, and predicts the STFT coefficients of separated speech signals. Clustering of spatial vectors shares a similar principle with the self-attention mechanism in the sense of computing the similarity of vectors and then aggregating similar vectors. Therefore, Conformer would be especially suitable for the present problem. Experiments show that the proposed narrow-band Conformer achieves better speech separation performance than other state-of-the-art methods by a large margin.
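
The analogy drawn above between clustering and self-attention (compute similarities between vectors, then aggregate similar ones) can be illustrated with a bare dot-product self-attention. This is a simplified sketch without the learned projections, multiple heads, or convolution modules of a Conformer, and the input vectors are hypothetical stand-ins for per-frame spatial vectors at one frequency.

```python
import math

def self_attention(vectors):
    """Dot-product self-attention without learned projections (simplification)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this vector to every vector (scaled dot product).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        # Softmax, shifted by the maximum score for numerical stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Aggregate: a weighted average dominated by the most similar vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs

# Two frames point in one direction, one frame in another (hypothetical values).
frames = [[10.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
mixed = self_attention(frames)
```

Each output stays close to the mean of the frames that resemble its query, which behaves like a soft cluster assignment followed by averaging.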

preprint 2022 arXiv

RCT: Random Consistency Training for Semi-supervised Sound Event Detection

Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem without requiring any extra annotation budget. This paper studies several core modules of SSL and introduces a random consistency training (RCT) strategy. First, a self-consistency loss is proposed to fuse with the teacher-student model to stabilize the training. Second, a hard mixup data augmentation is proposed to account for the additive property of sounds. Third, a random augmentation scheme is applied to flexibly combine different types of data augmentations. Experiments show that the proposed strategy outperforms other widely used strategies.
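
One plausible reading of a mixup that respects the additive property of sounds is sketched below: two clips are summed sample by sample, and the mixture is labelled with the union of their events. The function name, the data layout, and the exact mixing rule are assumptions; the abstract does not give the formulation.

```python
def hard_mixup(clip_a, labels_a, clip_b, labels_b):
    """Add two waveforms sample by sample and take the union of their event labels."""
    mixed_clip = [a + b for a, b in zip(clip_a, clip_b)]
    # Labels are 0/1 per event class; an event is present if it is in either clip.
    mixed_labels = [max(la, lb) for la, lb in zip(labels_a, labels_b)]
    return mixed_clip, mixed_labels

# Hypothetical three-sample clips with three event classes.
clip, labels = hard_mixup([0.25, -0.5, 0.5], [1, 0, 0],
                          [0.25, 0.25, -0.5], [0, 0, 1])
```

Unlike the usual interpolating mixup, this keeps the labels binary, on the reasoning that overlapping sound events remain fully present in the sum.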

preprint 2022 arXiv

SRP-DNN: Learning Direct-Path Phase Difference for Multiple Moving Sound Source Localization

Multiple moving sound source localization in real-world scenarios remains a challenging issue due to interaction between sources, time-varying trajectories, distorted spatial cues, etc. In this work, we propose to use deep learning techniques to learn competing and time-varying direct-path phase differences for localizing multiple moving sound sources. A causal convolutional recurrent neural network is designed to extract the direct-path phase difference sequence from the signals of each microphone pair. To avoid the assignment ambiguity and the problem of an uncertain output dimension encountered when simultaneously predicting multiple targets, the learning target is designed in a weighted sum format, which encodes source activity in the weight and direct-path phase differences in the summed value. The learned direct-path phase differences for all microphone pairs can be directly used to construct the spatial spectrum according to the formulation of steered response power (SRP). This deep neural network (DNN) based SRP method is referred to as SRP-DNN. The locations of sources are estimated by iteratively detecting and removing the dominant source from the spatial spectrum, which reduces the interaction between sources. Experimental results on both simulated and real-world data show the superiority of the proposed method in the presence of noise and reverberation.
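
The weighted-sum target and the iterative detect-and-remove step can be sketched for a single microphone pair as follows. The frequency grid, the candidate time differences of arrival (TDOAs), the sign convention of the phase term, and the subtraction-based removal rule are all assumptions for illustration; the paper's own formulation sums over all microphone pairs and may differ in detail.

```python
import cmath
import math

FREQS_HZ = [100.0 * k for k in range(1, 41)]          # hypothetical frequency grid
CANDIDATE_TDOAS = [i * 5e-5 for i in range(-10, 11)]  # hypothetical candidate TDOAs (s)

def weighted_sum_target(tdoas, activities):
    """Per-frequency target: activity-weighted sum of direct-path phase terms."""
    return [sum(a * cmath.exp(-2j * math.pi * f * t)
                for t, a in zip(tdoas, activities))
            for f in FREQS_HZ]

def spatial_spectrum(target):
    """SRP-style spectrum: steer toward each candidate and average over frequency."""
    return [sum((r * cmath.exp(2j * math.pi * f * tau)).real
                for r, f in zip(target, FREQS_HZ)) / len(FREQS_HZ)
            for tau in CANDIDATE_TDOAS]

def detect_and_remove(target, n_sources):
    """Pick the dominant peak, subtract its contribution, repeat (assumed rule)."""
    residual = list(target)
    found = []
    for _ in range(n_sources):
        spectrum = spatial_spectrum(residual)
        peak = max(range(len(spectrum)), key=spectrum.__getitem__)
        weight, tau = spectrum[peak], CANDIDATE_TDOAS[peak]
        found.append(peak)
        residual = [r - weight * cmath.exp(-2j * math.pi * f * tau)
                    for r, f in zip(residual, FREQS_HZ)]
    return found

# Two active sources with different activity weights (hypothetical values).
target = weighted_sum_target([CANDIDATE_TDOAS[4], CANDIDATE_TDOAS[14]], [1.0, 0.6])
peaks = detect_and_remove(target, n_sources=2)
```

Because the target is a single fixed-size sum, its dimension does not depend on the number of active sources, and removing the dominant source first keeps its sidelobes from masking the weaker one.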

preprint 2021 arXiv

Semi-supervised Sound Event Detection using Random Augmentation and Consistency Regularization

Sound event detection is a core module for acoustic environmental analysis. Semi-supervised learning techniques allow the dataset to be scaled up substantially without increasing the annotation budget, and have recently attracted considerable research attention. In this work, we study two advanced semi-supervised learning techniques for sound event detection. Data augmentation is important for the success of recent deep learning systems. This work studies the audio-signal random augmentation method, which provides an augmentation strategy that can handle a large number of different audio transformations. In addition, consistency regularization is widely adopted in recent state-of-the-art semi-supervised learning methods; it exploits the unlabelled data by constraining the predictions for different transformations of one sample to be identical to the prediction for this sample. This work finds that, for semi-supervised sound event detection, consistency regularization is an effective strategy, and that the best performance is achieved when it is combined with the MeanTeacher model.
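
A minimal sketch of the two ingredients named above: a consistency loss that penalizes disagreement between the prediction for a transformed sample and the prediction for the original, and the MeanTeacher-style weight update. The function names, the mean-squared-error form of the loss, and the decay value are assumptions; the abstract does not specify them.

```python
def ema_update(teacher_weights, student_weights, decay=0.999):
    """MeanTeacher-style update: the teacher is an exponential moving average of the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]

def consistency_loss(probs_transformed, probs_original):
    """Mean squared difference between predictions for a transformed and an original sample."""
    diffs = [(p - q) ** 2 for p, q in zip(probs_transformed, probs_original)]
    return sum(diffs) / len(diffs)

# Hypothetical per-class probabilities and weights.
loss = consistency_loss([0.5, 0.0], [1.0, 0.0])
teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.5)
```

The loss needs no labels, which is how unlabelled clips contribute to training.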

preprint 2020 arXiv

Neutral inclusions, weakly neutral inclusions, and an over-determined problem for confocal ellipsoids

An inclusion is said to be neutral to uniform fields if, upon insertion into a homogeneous medium with a uniform field, it does not perturb the uniform field at all. It is said to be weakly neutral if it perturbs the uniform field mildly. Such inclusions are of interest in relation to invisibility cloaking and effective medium theory. There have been some attempts lately to construct, or to show the existence of, such inclusions in the form of a core-shell structure or a single inclusion with an imperfect bonding parameter attached to its boundary. The purpose of this paper is to review recent progress in such attempts. We also discuss the over-determined problem for confocal ellipsoids, which is closely related to the neutral inclusion, and its equivalent formulation in terms of Newtonian potentials. The main body of this paper consists of reviews of known results, but some new results are also included.
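
For the simplest core-shell case, a concentric disk or ball with isotropic conductivities, neutrality holds when the surrounding medium has the conductivity given by the classical coated-sphere (Hashin) formula. The sketch below evaluates that formula; the function and parameter names are hypothetical, and this textbook case is offered as background rather than as a result of the paper.

```python
def neutral_matrix_conductivity(core, shell, radius_ratio, dim=2):
    """Matrix conductivity for which a concentric core-shell disk/ball is neutral.

    core, shell: isotropic conductivities; radius_ratio: inner radius / outer radius.
    """
    f = radius_ratio ** dim  # volume fraction of the core within the coated inclusion
    return shell + dim * f * shell * (core - shell) / (
        dim * shell + (1.0 - f) * (core - shell))

# Hypothetical values: core four times as conductive as the shell, radius ratio 1/2.
sigma_m = neutral_matrix_conductivity(core=4.0, shell=1.0, radius_ratio=0.5, dim=2)
```

The limiting cases behave as expected: with no contrast between core and shell the result is the shell conductivity, and with a vanishing shell it is the core conductivity.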

preprint 2020 arXiv

Polarization tensor vanishing structure of general shape: Existence for small perturbations of balls

The polarization tensor is a geometric quantity associated with a domain. It is a signature of the existence of a small inclusion inside a domain and is used in the small volume expansion method to reconstruct small inclusions from boundary measurements. In this paper, we consider the question of polarization tensor vanishing structures of general shape. The only known examples of polarization tensor vanishing structures are concentric disks and balls. We prove, by the implicit function theorem on Banach spaces, that a small perturbation of a ball can be enclosed by a domain so that the resulting inclusion of core-shell structure becomes polarization tensor vanishing. The boundary of the enclosing domain is given by a sphere perturbed by spherical harmonics of degree zero and two. This is a continuation of the earlier work \cite{KLS2D} for two dimensions.

preprint 2016 arXiv

Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization

This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated by using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an inter-frame spectral subtraction algorithm is proposed, which enables the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severely adverse acoustic conditions, and outperforms state-of-the-art localization methods under most of the acoustic conditions.
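
The role of the auto- and cross-power spectral densities can be seen in a heavily simplified setting: with no noise, no reverberation, and a single multiplicative transfer-function ratio per frequency, the ratio is recovered as the cross-PSD divided by the auto-PSD, both averaged over frames. This sketch omits the convolutive transfer function model and the inter-frame spectral subtraction described above; the function name and the test signal are hypothetical.

```python
def estimate_rtf(frames_ch1, frames_ch2):
    """Estimate the transfer-function ratio channel 2 / channel 1 at one frequency.

    frames_ch1, frames_ch2: complex STFT coefficients of one frequency over frames.
    """
    # Cross-PSD between the channels and auto-PSD of the reference, summed over frames.
    cross_psd = sum(x2 * x1.conjugate() for x1, x2 in zip(frames_ch1, frames_ch2))
    auto_psd = sum(abs(x1) ** 2 for x1 in frames_ch1)
    return cross_psd / auto_psd

# Noise-free toy signal: channel 2 is channel 1 scaled by a fixed complex ratio.
true_ratio = 0.5 - 0.5j
channel_1 = [1 + 1j, 2 - 1j, -0.5 + 0.25j]
channel_2 = [true_ratio * x for x in channel_1]
estimate = estimate_rtf(channel_1, channel_2)
```

With additive noise the two spectral densities become biased, which is the situation the spectral subtraction step is meant to address.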

preprint 2014 arXiv

A New Model for Solving Narrow Escape Problem in Domain with Long Neck

The narrow escape problem arises in deriving the asymptotic expansion of the solution of an inhomogeneous mixed Dirichlet-Neumann boundary value problem. In this paper, we mainly deal with the narrow escape problem in a smooth domain connected to a long neck, a dendritic-spine-shaped domain that is of some significance in biology. Owing to the special geometry of the dendritic spine, we develop a new model for solving this narrow escape problem, the Neumann-Robin Boundary Model. This model transforms the singular spine domain into a smooth spine head domain by imposing a Robin boundary condition on the connection between the spine head and the neck. We rigorously derive the high-order asymptotic expansion of the Neumann-Robin Boundary Model and apply it to the solution of the narrow escape problem in a dendritic-spine-shaped domain. Our results show that the asymptotic expansion of the Neumann-Robin Boundary Model can be easily applied to the narrow escape problem for any smooth spine head domain with a straight spine neck. Numerical simulations show close agreement between the results of our Neumann-Robin Boundary Model and the original escape problem. We also obtain some results for the case of a non-straight long spine neck by considering the curvature of the spine neck.

preprint 2013 arXiv

Bounds on the size of an inclusion using the translation method for two-dimensional complex conductivity

The size estimation problem in electrical impedance tomography is considered when the conductivity is a complex number and the body is two-dimensional. Upper and lower bounds on the volume fraction of the unknown inclusion embedded in the body are derived in terms of two pairs of voltage and current data measured on the boundary of the body. These bounds are derived using the translation method. We also provide numerical examples to show that these bounds are quite tight and stable under measurement noise.
