Source author record

Zbynek Koldovsky

Zbynek Koldovsky appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Catalog footprint

What is connected

4works
6topics
4close collaborators

Actions

Connect this record

Log in to claim

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2022arXiv

Target Speech Extraction: Independent Vector Extraction Guided by Supervised Speaker Identification

This manuscript proposes a novel robust procedure for the extraction of a speaker of interest (SOI) from a mixture of audio sources. The estimation of the SOI is performed via independent vector extraction (IVE). Since the blind IVE cannot distinguish the target source by itself, it is guided towards the SOI via frame-wise speaker identification based on deep learning. Still, an incorrect speaker can be extracted due to guidance failings, especially when processing challenging data. To identify such cases, we propose a criterion for non-intrusively assessing the estimated speaker. It utilizes the same model as the speaker identification, so no additional training is required. When incorrect extraction is detected, we propose a ``deflation'' step in which the incorrect source is subtracted from the mixture and, subsequently, another attempt to extract the SOI is performed. The process is repeated until successful extraction is achieved. The proposed procedure is experimentally tested on artificial and real-world datasets containing challenging phenomena: source movements, reverberation, transient noise, or microphone failures. The method is compared with state-of-the-art blind algorithms as well as with current fully supervised deep learning-based methods.

preprint2019arXiv

Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates

This work addresses the problem of block-online processing for multi-channel speech enhancement. Such processing is vital in scenarios with moving speakers and/or when very short utterances are processed, e.g., in voice assistant scenarios. We consider several variants of a system that performs beamforming supported by DNN-based voice activity detection (VAD) followed by post-filtering. The speaker is targeted through estimating relative transfer functions between microphones. Each block of the input signals is processed independently in order to make the method applicable in highly dynamic environments. Owing to the short length of the processed block, the statistics required by the beamformer are estimated less precisely. The influence of this inaccuracy is studied and compared to the processing regime when recordings are treated as one block (batch processing). The experimental evaluation of the proposed method is performed on large datasets of CHiME-4 and on another dataset featuring moving target speaker. The experiments are evaluated in terms of objective and perceptual criteria (such as signal-to-interference ratio (SIR) or perceptual evaluation of speech quality (PESQ), respectively). Moreover, word error rate (WER) achieved by a baseline automatic speech recognition system is evaluated, for which the enhancement method serves as a front-end solution. The results indicate that the proposed method is robust with respect to short length of the processed block. Significant improvements in terms of the criteria and WER are observed even for the block length of 250 ms.

preprint2015arXiv

Spatial Source Subtraction Based on Incomplete Measurements of Relative Transfer Function

Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a long-standing challenge of audio signal processing. In this paper we apply a novel strategy based on ideas of Compressed Sensing. Relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained through finding its sparsest representation in the time-domain: that is, through computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method such as the non-stationarity-based estimator by Gannot et al. or through Blind Source Separation. Second, frequencies are determined for which the RTF estimate appears to be accurate. Third, the RTF is reconstructed through solving a weighted $\ell_1$ convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted. It has been shown that the proposed method is capable of improving many conventional estimators used as the first step in most situations.

preprint2012arXiv

Cramer-Rao-Induced Bounds for CANDECOMP/PARAFAC tensor decomposition

This paper presents a Cramer-Rao lower bound (CRLB) on the variance of unbiased estimates of factor matrices in Canonical Polyadic (CP) or CANDECOMP/PARAFAC (CP) decompositions of a tensor from noisy observations, (i.e., the tensor plus a random Gaussian i.i.d. tensor). A novel expression is derived for a bound on the mean square angular error of factors along a selected dimension of a tensor of an arbitrary dimension. The expression needs less operations for computing the bound, O(NR^6), than the best existing state-of-the art algorithm, O(N^3R^6) operations, where N and R are the tensor order and the tensor rank. Insightful expressions are derived for tensors of rank 1 and rank 2 of arbitrary dimension and for tensors of arbitrary dimension and rank, where two factor matrices have orthogonal columns. The results can be used as a gauge of performance of different approximate CP decomposition algorithms, prediction of their accuracy, and for checking stability of a given decomposition of a tensor (condition whether the CRLB is finite or not). A novel expression is derived for a Hessian matrix needed in popular damped Gauss-Newton method for solving the CP decomposition of tensors with missing elements. Beside computing the CRLB for these tensors the expression may serve for design of damped Gauss-Newton algorithm for the decomposition.