Source author record

Jesper Jensen

Jesper Jensen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

eess.AS Machine Learning Information Theory math.IT Sound Computer Vision eess.IV

Catalog footprint

What is connected

8works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2021arXiv

Audio-Visual Speech Inpainting with Deep Learning

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.

preprint2021arXiv

Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

This paper considers speech enhancement of signals picked up in one noisy environment which must be presented to a listener in another noisy environment. Recently, it has been shown that an optimal solution to this problem requires the consideration of the noise sources in both environments jointly. However, the existing optimal mutual information based method requires a complicated system model that includes natural speech variations, and relies on approximations and assumptions of the underlying signal distributions. In this paper, we propose to use a simpler signal model and optimize speech intelligibility based on the Approximated Speech Intelligibility Index (ASII). We derive a closed-form solution to the joint far- and near-end speech enhancement problem that is independent of the marginal distribution of signal coefficients, and that achieves similar performance to existing work. In addition, we do not need to model or optimize for natural speech variations.

preprint2020arXiv

Exploring Filterbank Learning for Keyword Spotting

Despite their great performance over the years, handcrafted speech features are not necessarily optimal for any particular speech application. Consequently, with greater or lesser success, optimal filterbank learning has been studied for different speech processing tasks. In this paper, we fill in a gap by exploring filterbank learning for keyword spotting (KWS). Two approaches are examined: filterbank matrix learning in the power spectral domain and parameter learning of a psychoacoustically-motivated gammachirp filterbank. Filterbank parameters are optimized jointly with a modern deep residual neural network-based KWS back-end. Our experimental results reveal that, in general, there are no statistically significant differences, in terms of KWS accuracy, between using a learned filterbank and handcrafted speech features. Thus, while we conclude that the latter are still a wise choice when using modern KWS back-ends, we also hypothesize that this could be a symptom of information redundancy, which opens up new research possibilities in the field of small-footprint KWS.

preprint2020arXiv

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, only little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous if the receiver is the human auditory system. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally overlooked in the literature. Also, we found that waveform matching performance metrics must be used with caution as they in certain situations can fail completely. Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.

preprint2020arXiv

Vocoder-Based Speech Synthesis from Silent Videos

Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.

preprint2010arXiv

n-Channel Asymmetric Entropy-Constrained Multiple-Description Lattice Vector Quantization

This paper is about the design and analysis of an index-assignment (IA) based multiple-description coding scheme for the n-channel asymmetric case. We use entropy constrained lattice vector quantization and restrict attention to simple reconstruction functions, which are given by the inverse IA function when all descriptions are received or otherwise by a weighted average of the received descriptions. We consider smooth sources with finite differential entropy rate and MSE fidelity criterion. As in previous designs, our construction is based on nested lattices which are combined through a single IA function. The results are exact under high-resolution conditions and asymptotically as the nesting ratios of the lattices approach infinity. For any n, the design is asymptotically optimal within the class of IA-based schemes. Moreover, in the case of two descriptions and finite lattice vector dimensions greater than one, the performance is strictly better than that of existing designs. In the case of three descriptions, we show that in the limit of large lattice vector dimensions, points on the inner bound of Pradhan et al. can be achieved. Furthermore, for three descriptions and finite lattice vector dimensions, we show that the IA-based approach yields, in the symmetric case, a smaller rate loss than the recently proposed source-splitting approach.

preprint2006arXiv

n-Channel Entropy-Constrained Multiple-Description Lattice Vector Quantization

In this paper we derive analytical expressions for the central and side quantizers which, under high-resolutions assumptions, minimize the expected distortion of a symmetric multiple-description lattice vector quantization (MD-LVQ) system subject to entropy constraints on the side descriptions for given packet-loss probabilities. We consider a special case of the general n-channel symmetric multiple-description problem where only a single parameter controls the redundancy tradeoffs between the central and the side distortions. Previous work on two-channel MD-LVQ showed that the distortions of the side quantizers can be expressed through the normalized second moment of a sphere. We show here that this is also the case for three-channel MD-LVQ. Furthermore, we conjecture that this is true for the general n-channel MD-LVQ. For given source, target rate and packet-loss probabilities we find the optimal number of descriptions and construct the MD-LVQ system that minimizes the expected distortion. We verify theoretical expressions by numerical simulations and show in a practical setup that significant performance improvements can be achieved over state-of-the-art two-channel MD-LVQ by using three-channel MD-LVQ.

preprint2005arXiv

n-Channel Asymmetric Multiple-Description Lattice Vector Quantization

We present analytical expressions for optimal entropy-constrained multiple-description lattice vector quantizers which, under high-resolutions assumptions, minimize the expected distortion for given packet-loss probabilities. We consider the asymmetric case where packet-loss probabilities and side entropies are allowed to be unequal and find optimal quantizers for any number of descriptions in any dimension. We show that the normalized second moments of the side-quantizers are given by that of an $L$-dimensional sphere independent of the choice of lattices. Furthermore, we show that the optimal bit-distribution among the descriptions is not unique. In fact, within certain limits, bits can be arbitrarily distributed.

Jesper Jensen

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

Audio-Visual Speech Inpainting with Deep Learning

Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

Exploring Filterbank Learning for Keyword Spotting

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

Vocoder-Based Speech Synthesis from Silent Videos

n-Channel Asymmetric Entropy-Constrained Multiple-Description Lattice Vector Quantization

n-Channel Entropy-Constrained Multiple-Description Lattice Vector Quantization

n-Channel Asymmetric Multiple-Description Lattice Vector Quantization