The development of new speech enhancement techniques is a continuous process aimed at combating the impairment of speech signals by various acoustic environmental influences. In this contribution we propose a new two-stage speech enhancement algorithm that exploits the source-filter model to decompose a denoised target signal; specifically, we manipulate the excitation signal in the cepstral domain. The...
This paper addresses our new online meeting recognition prototype, which works even in noisy environments. For speech enhancement, we employ a mask-based minimum variance distortionless response (MVDR) beamformer, which has recently been shown to be a successful front-end for a state-of-the-art deep neural network (DNN)-based automatic speech recognition (ASR) system. To ensure more accurate and computationally...
Accurate estimation of the Direction of Arrival (DOA) of a sound source is an important prerequisite for a wide range of acoustic signal processing applications. However, in enclosed environments, early reflections and late reverberation often lead to localization errors. Recent work demonstrated that improved robustness against reverberation can be achieved by clustering only the DOAs from direct-path...
By introducing nonholonomic constraints, the nonholonomic natural gradient algorithm overcomes the shortcomings of the traditional natural gradient algorithm. Namely, it still works well when the source signal amplitude changes rapidly over time or is equal to zero for a certain period of time. In addition, selecting a different estimating function at different stages can get the...
With the popularity of mobile terminal equipment, voice communication is becoming more and more frequent, and speech recognition is being applied in a growing number of scenarios. These trends place higher demands on the accuracy of speech recognition, so enhancing speech as effectively as possible is becoming more and more important. At present, a lot of research has been done on the preprocessing...
In this paper, we present a comparative study of digital speech processing on Bangla speech signals. We represent the oral characteristics of the Bangla alphabet in terms of pitch and formants. We worked with both vowels and consonants to show their difference in practical use. We take oral speech signals as voice recordings and extract phonemes to analyze in both the time and frequency domains. Both male and female...
A new efficient measure for predicting estimation accuracy is proposed and successfully applied to multistream-based unsupervised adaptation of ASR systems to address data uncertainty when the ground-truth is unknown. The proposed measure is an extension of the M-measure, which predicts confidence in the output of a probability estimator by measuring the divergences of probability estimates spaced...
In the field of phonetics, voice onset time (VOT) is a major parameter of human speech defining linguistic contrasts in voicing. In this article, a landmark-based method of automatic VOT estimation in acoustic signals is presented. The proposed technique is based on a combination of two landmark detection procedures for release burst onset and glottal activity detection. Robust release burst detection...
The vocabulary is a vital component of automatic speech recognition (ASR) systems. For a specific Chinese speech recognition task, using a large general vocabulary not only leads to a much longer decoding time, but also hurts recognition accuracy. In this paper, we propose an unsupervised algorithm to select task-specific words from a large general vocabulary. The out-of-vocabulary (OOV) rate...
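For readers unfamiliar with the metric, the out-of-vocabulary rate mentioned above is conventionally the fraction of running word tokens in a text that are missing from the recognizer's vocabulary. The following is a minimal sketch under that assumption; the function name `oov_rate` and the toy data are hypothetical, and the input is assumed to be already segmented into words (which for Chinese requires a separate segmentation step not shown here). It is not the selection algorithm proposed in the paper.

```python
from typing import Iterable, List


def oov_rate(tokens: List[str], vocabulary: Iterable[str]) -> float:
    """Return the fraction of tokens that are not in the vocabulary.

    `tokens` is assumed to be an already word-segmented text.
    An empty token list yields 0.0 by convention.
    """
    if not tokens:
        return 0.0
    vocab = set(vocabulary)  # set membership keeps the lookup O(1)
    missing = sum(1 for token in tokens if token not in vocab)
    return missing / len(tokens)


# Toy example (hypothetical data): one of four tokens is out of vocabulary.
example_tokens = ["a", "b", "c", "a"]
example_vocab = ["a", "b"]
print(oov_rate(example_tokens, example_vocab))
```

A smaller task-specific vocabulary usually raises this rate, so a selection method has to balance vocabulary size against OOV coverage.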
Users interact with mobile apps with certain intents such as finding a restaurant. Some intents and their corresponding activities are complex and may involve multiple apps; for example, a restaurant app, a messenger app and a calendar app may be needed to plan a dinner with friends. However, activities may be quite personal and third-party developers would not be building apps to specifically handle...
We address the problem of estimating the Fujisaki model parameters for F0 synthesis. For this, we propose the use of a very efficient search and optimization method termed ‘direct search’ (Hooke and Jeeves, 1961), which belongs to the class of derivative-free unconstrained optimization methods, in the sense that it is applicable to non-linear optimization problems which are not amenable to...
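To illustrate the derivative-free idea behind the direct-search method cited above, here is a minimal sketch of a Hooke and Jeeves style pattern search. It is applied to a toy quadratic, not to the Fujisaki model; the function name `hooke_jeeves`, its parameters, and the objective are assumptions made for illustration and do not reproduce the authors' implementation.

```python
from typing import Callable, List, Tuple

Objective = Callable[[List[float]], float]


def hooke_jeeves(f: Objective, x0: List[float], step: float = 1.0,
                 shrink: float = 0.5, tol: float = 1e-6,
                 max_iter: int = 10000) -> Tuple[List[float], float]:
    """Minimize f without derivatives using exploratory and pattern moves."""

    def explore(start: List[float], f_start: float, h: float):
        # Exploratory move: try +h and -h along each coordinate in turn,
        # keeping any change that lowers the objective.
        x, fx = list(start), f_start
        for i in range(len(x)):
            for delta in (h, -h):
                trial = list(x)
                trial[i] += delta
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx = trial, f_trial
                    break
        return x, fx

    base, f_base = list(x0), f(x0)
    h = step
    for _ in range(max_iter):
        if h < tol:
            break
        x, fx = explore(base, f_base, h)
        if fx < f_base:
            # Pattern move: extrapolate along the improving direction and
            # explore around the extrapolated point while it keeps helping.
            while True:
                pattern = [2.0 * xi - bi for xi, bi in zip(x, base)]
                base, f_base = x, fx
                x_new, f_new = explore(pattern, f(pattern), h)
                if f_new < f_base:
                    x, fx = x_new, f_new
                else:
                    break
        else:
            h *= shrink  # no improvement at this scale: refine the step
    return base, f_base


def toy_objective(p: List[float]) -> float:
    # Hypothetical smooth test function with its minimum at (1, -2).
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


best_point, best_value = hooke_jeeves(toy_objective, [0.0, 0.0])
print(best_point, best_value)
```

Only function evaluations are used, which is why this family of methods suits objectives whose gradients are unavailable or unreliable.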
Forensic Voice Comparison (FVC) is increasingly using the likelihood ratio (LR) to indicate whether the evidence supports the prosecution (same-speaker) or defense (different-speaker) hypothesis. In addition to supporting one hypothesis, the LR provides a theoretically founded estimate of the relative strength of its support. Despite this nice theoretical aspect, the LR accepts some practical...
The term “World Englishes” describes the current state of English, and one of its main characteristics is a large diversity of pronunciation, called accents. In our previous studies, we developed several techniques to realize effective clustering and visualization of this diversity. For this purpose, the accent gap between two speakers has to be quantified independently of extra-linguistic factors...
We describe a method of lexicon expansion to tackle variations of spontaneous speech. These variations of utterances are found widely in programs such as conversations and talk shows and are typically observed as unintelligible utterances with a high speech rate. Unlike read speech in news programs, these variations often severely degrade automatic speech recognition (ASR) performance. Then, these variations...
In this paper, we propose a novel noise masking method based on Computational Auditory Scene Analysis (CASA) that uses an adaptive factor. Although CASA has succeeded in the fields of speech separation and speech enhancement to some extent, the fixed thresholds used for segregation and labeling heavily affect the processing performance. Focusing on this issue, the proposed method utilizes the Normalized...
In this paper, we propose a frequency-domain speech enhancement algorithm with phase estimation, in which speech is modeled by a Gaussian mixture model (GMM) in the log-spectral domain and two closed-form log-spectral amplitude estimators for speech and noise are derived directly by using a Mixture-Maximum (MIXMAX) model. Because the accurate estimation of speech phase could help to reduce...
Automatic speech recognition (ASR) is commonly used these days. Current ASR systems perform well in ideal environments; however, they do not perform well in realistic noisy environments. For robust ASR, ETSI has standardized the Advanced Front-End (AFE), which adopts a two-stage iterative Wiener filter (IWF) to realize speech enhancement as the front-end of ASR. In the ETSI AFE, FFT is used to estimate...
Voice-pathology detection from a subject's voice is a promising technology for pre-diagnosis of larynx diseases. Glottal source estimation in particular plays a very important role in voice-pathology analysis. For more accurate estimation of the spectral envelope and glottal source of the pathological voice, we propose a method that can automatically generate the topology of the glottal source Hidden...
Nonnegative matrix factorization (NMF) is a matrix factorization technique that can find meaningful latent nonnegative components. However, since the objective function is non-convex, the source separation performance can degrade when the iterative update of the basis matrix gets stuck in a poor local minimum. Most research updates the basis iteratively to minimize a certain objective function with...
This paper proposes a system to convert neutral speech to emotional speech with controlled intensity of emotions. Most previous research on the synthesis of emotional voices used statistical or concatenative methods that can synthesize emotions in categorical emotional states such as joy, anger, sadness, etc. While humans sometimes enhance or relieve emotional states and intensity during daily life,...
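As a reminder of what the likelihood ratio expresses, the sketch below computes LR = p(E | same speaker) / p(E | different speakers) for a single comparison score. The Gaussian score models, their parameters, and the names `gaussian_pdf` and `likelihood_ratio` are illustrative assumptions only; practical FVC systems estimate and calibrate these distributions from data, and the abstract does not specify how.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood_ratio(score: float,
                     same_mean: float, same_std: float,
                     diff_mean: float, diff_std: float) -> float:
    """LR > 1 supports the same-speaker hypothesis, LR < 1 the
    different-speaker hypothesis; the magnitude indicates the strength
    of that support."""
    numerator = gaussian_pdf(score, same_mean, same_std)
    denominator = gaussian_pdf(score, diff_mean, diff_std)
    return numerator / denominator


# Hypothetical score models: same-speaker scores ~ N(2, 1),
# different-speaker scores ~ N(-2, 1).
for score in (-1.0, 0.0, 1.0):
    print(score, likelihood_ratio(score, 2.0, 1.0, -2.0, 1.0))
```

A score that is equally likely under both models gives an LR of 1, meaning the evidence supports neither hypothesis.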