Deep neural networks have been widely applied in the field of environmental sound classification. However, due to the scarcity of carefully labeled data, their training process suffers from over-fitting. Data augmentation is a technique that alleviates this issue. It augments the training set with synthetic data that are created by modifying some parameters of the real data. However, not all kinds...
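As a concrete illustration of the augmentation idea, the sketch below creates a synthetic training clip from a real one via a random circular time shift plus additive noise at a target SNR. The function name, parameters, and values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def augment_waveform(x, rng, shift_max=1000, noise_snr_db=20.0):
    """Synthesize a training example from a real clip: random circular
    time shift plus Gaussian noise at a target SNR (both parameters are
    illustrative)."""
    shift = int(rng.integers(-shift_max, shift_max + 1))
    y = np.roll(x, shift)
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
augmented = augment_waveform(clip, rng)
```

Each call yields a different synthetic variant of the same clip, which is the mechanism by which augmentation enlarges a small labeled set.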
A previously proposed iterative filter design procedure, referred to as Iterative DFT-domain Inversion (IDI), is applied in the context of multizone sound reproduction in reverberant environments. The IDI approach aims at iteratively solving a least-squares problem, where the true reproduction error in the time domain is considered rather than narrowband errors in the frequency domain. In this paper,...
The application of time-frequency-domain techniques in spatial audio is relatively new, as first attempts were published about 15 years ago. A common property of the techniques is that the sound field is captured with multiple microphones, and its properties are analyzed for each time instance and individually for different frequency bands. These properties can be described by a set of parameters...
Many machine learning tasks have been shown solvable with impressive levels of success given large amounts of training data and computational power. For the problems which lack data sufficient to achieve high performance, methods for transfer learning can be applied. These refer to performing the new task while having prior knowledge of the nature of the data, gained by first performing a different...
As part of the 2016 public evaluation challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016), the second task focused on evaluating sound event detection systems using synthetic mixtures of office sounds. This task, which follows the ‘Event Detection-Office Synthetic’ task of DCASE 2013, studies the behaviour of tested algorithms when facing controlled levels of audio...
This paper targets a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary data source. Previous studies on sound classification are commonly based on cross-validation on a single dataset, without considering training-recognition mismatch. In our study, two experimental setups are used: a matched training-recognition condition and a mismatched training-recognition...
This paper deals with the problem of audio source separation. To handle the complex and ill-posed nature of the problems of audio source separation, the current state-of-the-art approaches employ deep neural networks to obtain instrumental spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet),...
Multichannel audio enhancement and source separation traditionally attempt to isolate a single source and remove all background noise. In listening enhancement applications, however, a portion of the background sources should be retained to preserve the listener's spatial awareness. We describe a time-varying spatial filter designed to apply a different gain to each sound source with minimal distortion...
In this paper, we propose a new time-frequency mask method for computational auditory scene analysis (CASA) based on convex optimization of the binary mask. In the proposed method, the pitch estimation and segment segregation in conventional CASA are completely replaced by the convex optimization of speech power. Considering the cross-correlation between the power spectra of noisy speech and noise...
Audio source separation is the act of isolating sound sources in an audio scene. One application of source separation is singing voice extraction. In this work, we present a novel approach for music/voice separation that uses the 2D Fourier Transform (2DFT). Our approach leverages how periodic patterns manifest in the 2D Fourier Transform and is connected to research in biological auditory systems...
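The core intuition — repeating (periodic) structure in a spectrogram concentrates into peaks in its 2D Fourier transform — can be sketched as follows: keeping only the strongest 2DFT coefficients gives a repeating "background" estimate, and the residual serves as the foreground. The `keep_frac` threshold is a hypothetical knob for illustration, not the paper's actual peak-picking rule:

```python
import numpy as np
from scipy.signal import stft

def separate_2dft(spec_mag, keep_frac=0.01):
    """Split a magnitude spectrogram into a repeating (background) part
    and a residual (foreground) part by keeping only the strongest
    coefficients of its 2D Fourier transform."""
    F = np.fft.fft2(spec_mag)
    mags = np.abs(F)
    thresh = np.quantile(mags, 1.0 - keep_frac)
    bg = np.real(np.fft.ifft2(np.where(mags >= thresh, F, 0)))
    bg = np.clip(bg, 0.0, None)              # magnitudes cannot be negative
    fg = np.clip(spec_mag - bg, 0.0, None)   # residual = foreground estimate
    return bg, fg

# toy mixture of two steady tones, analyzed with an STFT
t = np.arange(8000) / 8000.0
mix = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1234 * t)
_, _, Z = stft(mix, fs=8000, nperseg=256)
bg, fg = separate_2dft(np.abs(Z))
```

In a real music/voice scenario the accompaniment's repetition lands in the peaks (background) while the less periodic vocal line remains in the residual.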
Natural conversations are spontaneous exchanges involving two or more people speaking in an intermittent manner. Therefore, one expects such conversations to have intervals where some of the speakers are silent. Yet, most (multichannel) audio source separation (MASS) methods consider the sound sources to be emitting continuously over the total duration of the processed mixture. In this paper we propose...
In this article, we present a target-speaker-dependent speech enhancement system to enhance a specific target talker in the presence of real-life background noises. The proposed system uses a multi-channel processing stage to produce a noise reference signal. This noise reference signal is then used not only to compute the residual noise statistics from the multichannel stage output, but also to...
This paper introduces a new method for single-channel denoising that sheds new light on classical early work on this topic from the 1970s and 1980s on Wiener filtering and spectral subtraction. Both operating in the short-time Fourier transform domain, these methods consist of estimating the power spectral density (PSD) of the noise during speech-free periods. Then, the clean speech signal...
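A minimal spectral-subtraction sketch in that spirit: the noise PSD is estimated from a speech-free segment, subtracted from the noisy power spectrum, and the result is resynthesized with the noisy phase. The spectral floor is an illustrative safeguard against negative power, not part of the paper's method:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, noise_only, fs=16000, nperseg=512, floor=0.01):
    """Classic spectral subtraction: noise PSD from a speech-free segment,
    subtracted bin-wise; noisy phase reused for resynthesis."""
    _, _, N = stft(noise_only, fs=fs, nperseg=nperseg)
    noise_psd = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    clean_power = np.maximum(np.abs(Y) ** 2 - noise_psd, floor * np.abs(Y) ** 2)
    S_hat = np.sqrt(clean_power) * np.exp(1j * np.angle(Y))
    _, x_hat = istft(S_hat, fs=fs, nperseg=nperseg)
    return x_hat

rng = np.random.default_rng(1)
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)          # 1 s clean "speech" stand-in
noise = 0.3 * rng.normal(size=2 * fs)        # first half: speech-free reference
noisy = clean + noise[fs:]
enhanced = spectral_subtraction(noisy, noise[:fs], fs=fs)
```

Subtracting the average noise power brings the output energy back toward that of the clean signal, at the cost of the musical-noise artifacts these classical methods are known for.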
This paper addresses the problem of audio source separation from (possibly under-determined) multichannel convolutive mixtures. We propose a separation method based on the convolutive transfer function (CTF) in the short-time Fourier transform domain. For strongly reverberant signals, the CTF is a much more appropriate model than the widely-used multiplicative transfer function approximation. An Expectation-Maximization...
In this work we propose novel joint and sequential multimodal approaches for the task of single channel audio source separation in videos. This is done within the popular non-negative matrix factorization framework using information about the sounding object's motion. Specifically, we present methods that utilize non-negative least squares formulation to couple motion and audio information. The proposed...
The ability to separate speech from non-stationary background disturbances using only a single channel of information has increased significantly with the adoption of deep learning techniques. In these approaches, a time-frequency mask that recovers clean speech from noisy mixtures is learned from data. Recurrent neural networks are particularly well-suited to this sequential prediction task, with...
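The training target such mask-estimation networks commonly regress, the ideal ratio mask (IRM), can be computed and applied as below. This is the oracle computation for illustration; in the approaches described, an RNN predicts this mask from noisy features:

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Oracle time-frequency mask: the fraction of each bin's magnitude
    attributable to speech. A network is trained to predict this from
    the noisy spectrogram."""
    return clean_mag / (clean_mag + noise_mag + eps)

clean = np.array([[1.0, 0.0], [0.0, 2.0]])   # toy magnitude spectrograms
noise = np.array([[0.0, 3.0], [1.0, 0.0]])
mask = ideal_ratio_mask(clean, noise)
recovered = mask * (clean + noise)           # mask applied to the mixture
```

Because the mask is bounded in [0, 1], it is a well-behaved regression target; applying it to the mixture magnitude recovers the clean bins where speech and noise are disjoint.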
Deep neural networks (DNN) have been successfully employed for the problem of monaural sound source separation achieving state-of-the-art results. In this paper, we propose using convolutional recurrent neural network (CRNN) architecture for tackling this problem. We focus on a scenario where low algorithmic delay (< 10 ms) is paramount, and relatively little training data is available. We show...
We propose an efficient method to estimate source power spectral densities (PSDs) in a multi-source reverberant environment using a spherical microphone array. The proposed method utilizes the spatial correlation between the spherical harmonics (SH) coefficients of a sound field to estimate source PSDs. The use of the spatial cross-correlation of the SH coefficients allows us to employ the method...
This paper addresses the problem of joint wideband localization and acquisition of acoustic sources. The source locations as well as acquisition of the original source signals are obtained in a joint fashion by solving a sparse recovery problem. Spatial sparsity is enforced by discretizing the acoustic scene into a grid of predefined dimensions. In practice, energy leakage from the source location...
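The grid-based sparse-recovery step can be illustrated with a greedy solver such as orthogonal matching pursuit (OMP) over a narrowband steering dictionary. The array geometry, grid, and solver here are illustrative stand-ins for the paper's actual wideband joint formulation:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick k dictionary columns
    (grid points) and re-fit their amplitudes by least squares."""
    residual = y.copy()
    support = []
    for _ in range(k):
        idx = int(np.argmax(np.abs(A.conj().T @ residual)))
        support.append(idx)
        amps, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ amps
    x = np.zeros(A.shape[1], dtype=complex)
    x[support] = amps
    return x, sorted(support)

# toy narrowband setup: 8-mic half-wavelength ULA, 32 candidate directions
M, G = 8, 32
u = np.linspace(-1.0, 1.0, G, endpoint=False)        # sin(angle) grid
A = np.exp(-1j * np.pi * np.outer(np.arange(M), u))  # steering dictionary
y = A[:, [4, 20]] @ np.array([1.0, 0.8])             # two active grid points
x_hat, support = omp(A, y, k=2)
```

The recovered support gives the source locations on the grid and the fitted amplitudes give the acquired source signals, which is the "joint" aspect of the sparse formulation; energy leakage arises when a true source falls between grid points, unlike in this idealized on-grid example.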