Source author record

Alberto N. Escalante-B.

Alberto N. Escalante-B. appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

eess.AS Machine Learning Sound Computer Vision eess.SP Artificial Intelligence

Catalog footprint

What is connected

6works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering

Complex-valued processing has brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram, while complex masks (CM) are usually preferred over real-valued masks due to their ability to modify the phase. Recent work proposed to use a complex filter instead of a point-wise multiplication with a mask. This allows to incorporate information from previous and future time steps exploiting local correlations within each frequency band. In this work, we propose DeepFilterNet, a two stage speech enhancement framework utilizing deep filtering. First, we enhance the spectral envelope using ERB-scaled gains modeling the human frequency perception. The second stage employs deep filtering to enhance the periodic components of speech. Additionally to taking advantage of perceptual properties of speech, we enforce network sparsity via separable convolutions and extensive grouping in linear and recurrent layers to design a low complexity architecture. We further show that our two stage deep filtering approach outperforms complex masks over a variety of frequency resolutions and latencies and demonstrate convincing performance compared to other state-of-the-art models.

preprint2022arXiv

DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices for Full-Band Audio

Deep learning-based speech enhancement has seen huge improvements and recently also expanded to full band audio (48 kHz). However, many approaches have a rather high computational complexity and require big temporal buffers for real time usage e.g. due to temporal convolutions or attention. Both make those approaches not feasible on embedded devices. This work further extends DeepFilterNet, which exploits harmonic structure of speech allowing for efficient speech enhancement (SE). Several optimizations in the training procedure, data augmentation, and network structure result in state-of-the-art SE performance while reducing the real-time factor to 0.04 on a notebook Core-i5 CPU. This makes the algorithm applicable to run on embedded devices in real-time. The DeepFilterNet framework can be obtained under an open source license.

preprint2020arXiv

CLC: Complex Linear Coding for the DNS 2020 Challenge

Complex-valued processing brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the noise reduction process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram. Complex masks (CM) usually outperform real-valued masks due to their ability to modify the phase. Recent work proposed to use a complex linear combination of coefficients called complex linear coding (CLC) instead of a point-wise multiplication with a mask. This allows to incorporate information from previous and optionally future time steps which results in superior performance over mask-based enhancement for certain noise conditions. In fact, the linear combination enables to model quasi-steady properties like the spectrum within a frequency band. In this work, we apply CLC to the Deep Noise Suppression (DNS) challenge and propose CLC as an alternative to traditional mask-based processing, e.g. used by the baseline. We evaluated our models using the provided test set and an additional validation set with real-world stationary and non-stationary noises. Based on the published test set, we outperform the baseline w.r.t. the scale independent signal distortion ratio (SI-SDR) by about 3dB.

preprint2020arXiv

Lightweight Online Noise Reduction on Embedded Devices using Hierarchical Recurrent Neural Networks

Deep-learning based noise reduction algorithms have proven their success especially for non-stationary noises, which makes it desirable to also use them for embedded devices like hearing aids (HAs). This, however, is currently not possible with state-of-the-art methods. They either require a lot of parameters and computational power and thus are only feasible using modern CPUs. Or they are not suitable for online processing, which requires constraints like low-latency by the filter bank and the algorithm itself. In this work, we propose a mask-based noise reduction approach. Using hierarchical recurrent neural networks, we are able to drastically reduce the number of neurons per layer while including temporal context via hierarchical connections. This allows us to optimize our model towards a minimum number of parameters and floating-point operations (FLOPs), while preserving noise reduction quality compared to previous work. Our smallest network contains only 5k parameters, which makes this algorithm applicable on embedded devices. We evaluate our model on a mixture of EUROM and a real-world noise database and report objective metrics on unseen noise.

preprint2016arXiv

Improved graph-based SFA: Information preservation complements the slowness principle

Slow feature analysis (SFA) is an unsupervised-learning algorithm that extracts slowly varying features from a multi-dimensional time series. A supervised extension to SFA for classification and regression is graph-based SFA (GSFA). GSFA is based on the preservation of similarities, which are specified by a graph structure derived from the labels. It has been shown that hierarchical GSFA (HGSFA) allows learning from images and other high-dimensional data. The feature space spanned by HGSFA is complex due to the composition of the nonlinearities of the nodes in the network. However, we show that the network discards useful information prematurely before it reaches higher nodes, resulting in suboptimal global slowness and an under-exploited feature space. To counteract these problems, we propose an extension called hierarchical information-preserving GSFA (HiGSFA), where information preservation complements the slowness-maximization goal. We build a 10-layer HiGSFA network to estimate human age from facial photographs of the MORPH-II database, achieving a mean absolute error of 3.50 years, improving the state-of-the-art performance. HiGSFA and HGSFA support multiple-labels and offer a rich feature space, feed-forward training, and linear complexity in the number of samples and dimensions. Furthermore, HiGSFA outperforms HGSFA in terms of feature slowness, estimation accuracy and input reconstruction, giving rise to a promising hierarchical supervised-learning approach.

preprint2015arXiv

Theoretical Analysis of the Optimal Free Responses of Graph-Based SFA for the Design of Training Graphs

Slow feature analysis (SFA) is an unsupervised learning algorithm that extracts slowly varying features from a time series. Graph-based SFA (GSFA) is a supervised extension that can solve regression problems if followed by a post-processing regression algorithm. A training graph specifies arbitrary connections between the training samples. The connections in current graphs, however, only depend on the rank of the involved labels. Exploiting the exact label values makes further improvements in estimation accuracy possible. In this article, we propose the exact label learning (ELL) method to create a graph that codes the desired label explicitly, so that GSFA is able to extract a normalized version of it directly. The ELL method is used for three tasks: (1) We estimate gender from artificial images of human faces (regression) and show the advantage of coding additional labels, particularly skin color. (2) We analyze two existing graphs for regression. (3) We extract compact discriminative features to classify traffic sign images. When the number of output features is limited, a higher classification rate is obtained compared to a graph equivalent to nonlinear Fisher discriminant analysis. The method is versatile, directly supports multiple labels, and provides higher accuracy compared to current graphs for the problems considered.