Good speaker recognition systems should identify the speaker irrespective of what is spoken, including the non-speech sounds that are often produced during natural conversations. In this work, the inclusion of breath sounds in the training phase of speaker recognition is analyzed using the popular Gaussian mixture model-universal background model (GMM-UBM) and deep neural network (DNN) based systems...
Dysarthria is a motor speech impairment, often characterized by speech that is largely unintelligible to human listeners. Assessing the severity level of dysarthria provides insight into the progression of the underlying condition and is essential for planning therapy, as well as for improving automatic dysarthric speech recognition. In this paper, we propose a non-linguistic manner...
Automatic emotion recognition in spontaneous speech is an important part of a human-computer interaction system. However, emotion identification in spontaneous speech is difficult because the emotions expressed by the speaker are often not as prominent as in acted speech. In this paper, we propose a spontaneous speech emotion recognition framework that makes use of the associated...
Use of error-correcting codes (ECC) in a multiclass audio emotion recognition problem is proposed to improve emotion recognition accuracy. We visualize the emotion recognition system as a noisy communication channel, thus motivating the use of ECC. We assume the emotion recognition process consists of an audio feature extractor followed by an artificial neural network (ANN) for emotion classification...
We propose the use of error-correcting codes (ECC) in a multi-class audio emotion recognition scenario to improve emotion recognition accuracy in speech. In this paper, we visualize the emotion recognition system as a noisy communication channel, thus motivating the use of ECC in the emotion recognition process. We assume the emotion recognition process consists of an audio...
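Neither of the two ECC abstracts above includes implementation detail, but the core idea they describe — treating classifier errors as channel noise that a redundant class code can correct — can be sketched. The 7-bit codebook and emotion labels below are illustrative assumptions, not values from the papers:

```python
import numpy as np

# Hypothetical 7-bit codewords for 4 emotion classes (illustrative only):
# rows are chosen so every pair differs in 4 bit positions.
CODEBOOK = np.array([
    [0, 0, 0, 0, 0, 0, 0],   # neutral
    [1, 1, 1, 1, 0, 0, 0],   # happy
    [1, 1, 0, 0, 1, 1, 0],   # sad
    [1, 0, 1, 0, 1, 0, 1],   # angry
])

def decode(bits):
    """Map a (possibly corrupted) bit vector to the nearest codeword's class."""
    dists = np.sum(CODEBOOK != np.asarray(bits), axis=1)  # Hamming distances
    return int(np.argmin(dists))

# One bit flipped by the "noisy channel" (one misfiring binary classifier)
# is still decoded to the intended class:
print(decode([1, 1, 1, 0, 0, 0, 0]))  # -> 1 ("happy") despite one bit error
```

In the papers' setting, each bit of the codeword would be predicted by the ANN; because the minimum pairwise Hamming distance here is 4, any single bit error is corrected by nearest-codeword decoding.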
In this paper we show that knowledge of the statistics of the noise contaminating a signal leads to a better choice of filter for removing that noise. Specifically, we show theoretically that additive white Gaussian noise (AWGN) contaminating a signal is best filtered using a Gaussian filter mask whose parameters are related to the noise statistics of the AWGN. The main contribution of the paper...
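The abstract stops before the derivation, but the effect it relies on — a Gaussian mask averaging away AWGN while passing a slowly varying signal — is easy to demonstrate. The signal, noise level, and mask parameters below are arbitrary illustrative choices, not the paper's:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1-D Gaussian mask, normalized to unit sum."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(500) / 100.0)   # slowly varying signal
noisy = clean + rng.normal(0.0, 0.3, clean.size)     # AWGN, sigma_n = 0.3

filtered = np.convolve(noisy, gaussian_kernel(sigma=2.0, radius=6), mode="same")

# Convolving with the mask shrinks the noise variance (by the sum of squared
# mask weights) while barely attenuating the slowly varying signal:
print(np.mean((noisy - clean)**2) > np.mean((filtered - clean)**2))  # True
```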
Speech is one of the most popular parameters used to identify a speaker from her spoken phrase. Feature extraction from speech is a necessary first step in a speaker identification process. Traditionally, computation of Mel Frequency Cepstral Coefficient (MFCC) features uses a Hamming window as a preprocessing step to reduce spectral leakage. However, the Hamming window results in reasonable side lobes...
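As context for the windowing step this abstract revisits, the traditional Hamming-window preprocessing of MFCC extraction can be sketched as follows. The frame and hop sizes are the common 25 ms / 10 ms choice at 16 kHz, not values taken from the paper:

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a signal into overlapping frames and apply a Hamming window,
    the usual preprocessing before the FFT stage of MFCC extraction."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)  # tapers frame edges to reduce spectral leakage
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    return frames * window

# 1 s of a toy 440 Hz tone at 16 kHz; 25 ms frames with a 10 ms hop
sr = 16000
t = np.arange(sr) / sr
frames = frame_and_window(np.sin(2 * np.pi * 440 * t))
print(frames.shape)  # (98, 400)
```

The paper's point of departure is precisely this `np.hamming` choice; alternative windows trade off main-lobe width against side-lobe level.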
In this work, excitation source features are explored for classifying infant cries. The types of infant cries considered are hunger, pain, and wet-diaper. The source features explored are the epoch interval contour (EIC), the epoch strength contour (ESC), epoch sharpness, and the slopes of the EIC and ESC. Gaussian Mixture Models (GMM)...
In this paper we explore the performance of multilingual speaker recognition systems developed on the IITKGP-MLILSC speech corpus. Closed-set speaker identification and speaker verification experiments are individually conducted on 13 widely spoken Indian languages. In particular, we focus on the effect of language mismatch in the speaker recognition performance of individual languages and all languages...
The basic goal of this work is to develop a Consonant-Vowel Recognition System (CVRS) for determining the sequence of Consonant-Vowel (CV) units present in a given speech utterance. In this work, we focus on developing CVRSs for two Indian languages, namely Bengali and Odia. This framework for developing CVRSs can be extended to any Indian language. We have developed two separate CVRSs for Bengali...
Voice-based call centers enable customers to query for information by speaking to agents in the call center. Most often these call conversations are recorded for analysis, with the intent of identifying things that can help the call center serve the customer better. Today, the recorded conversations are analyzed by humans who listen to the calls, which...
It is a well-known fact that the majority of rural India earns its livelihood from agriculture and farming. Although India is a net exporter of various agricultural products, the farmer, who happens to be the primary producer, has remained information-poor, which puts him at a disadvantage. With little or no knowledge of prices at the markets, farmers have no leverage to negotiate better prices for their...
Continuous density hidden Markov models (CD-HMMs) are doubly stochastic processes which are extensively used in speech and image signal processing. Especially in the case of isolated spoken word recognition systems, the spoken words are usually modeled using HMMs. While CD-HMMs are in extensive use, to most of the speech community the HMMs remain abstract, in the sense that there has been no good way of visualizing...
The rate at which we speak has a bearing on comprehensibility and has become important with mushrooming call center operations. An optimal speaking rate is one that is neither too fast nor too slow: speech that is too fast becomes unintelligible, while speech that is too slow makes the conversation tedious. Speaking rate definitely varies depending on the emotional state...
Speaker change detection is a necessary first step in several applications. In this paper, we propose an unsupervised two-pass algorithm for speaker change detection in conversational speech. A Generalized Likelihood Ratio (GLR) metric is used in the first pass to coarsely identify speaker change points; during the second pass, these candidate change points are finely analyzed, assuming that the initial...
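The abstract does not define its GLR metric in detail; the following is a minimal single-Gaussian version of the GLR distance commonly used for speaker change detection, with synthetic feature vectors standing in for real speech segments:

```python
import numpy as np

def glr_distance(x, y):
    """GLR distance between two feature sequences, each modelled by a single
    full-covariance Gaussian with maximum-likelihood parameters. Large values
    suggest the two segments were produced by different speakers."""
    z = np.vstack([x, y])          # pooled segment under the "same speaker" hypothesis
    def half_n_logdet(seg):        # data-dependent part of the Gaussian log-likelihood
        cov = np.cov(seg, rowvar=False)
        return 0.5 * len(seg) * np.linalg.slogdet(cov)[1]
    return half_n_logdet(z) - half_n_logdet(x) - half_n_logdet(y)

rng = np.random.default_rng(1)
same = glr_distance(rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5)))
diff = glr_distance(rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5)))
print(diff > same)  # the mean-shifted pair scores much higher
```

In a two-pass scheme like the one described, a sliding pair of windows would be scored this way in the first pass, and local maxima of the distance kept as candidate change points.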
The potential use of non-linear speech features has not been investigated for music analysis, although other commonly used speech features like Mel Frequency Cepstral Coefficients (MFCC) and pitch have been used extensively. In this paper, we assume an audio signal to be a sum of modulated sinusoids and then use the energy separation algorithm to decompose the audio into amplitude and frequency modulation...
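The snippet does not specify which energy separation variant is used; a standard discrete one (DESA-1, built on the Teager-Kaiser energy operator) can be sketched as follows. The test signal is a toy pure tone, not audio from the paper:

```python
import numpy as np

def tkeo(x):
    """Teager-Kaiser energy operator: psi(n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1]**2 - x[:-2] * x[2:]

def desa1(x):
    """DESA-1 energy separation: recover the instantaneous amplitude and
    frequency (radians/sample) of an AM-FM signal from the Teager energies
    of the signal and of its first difference."""
    y = np.diff(x)                     # y(n) = x(n) - x(n-1)
    px, py = tkeo(x), tkeo(y)
    s = py[:-1] + py[1:]               # psi[y(n)] + psi[y(n+1)]
    pxm = px[1:-1]                     # psi[x(n)] aligned with s
    cosw = 1.0 - s / (4.0 * pxm)
    omega = np.arccos(np.clip(cosw, -1.0, 1.0))   # instantaneous frequency
    amp = np.sqrt(pxm / (1.0 - cosw**2))          # instantaneous amplitude
    return amp, omega

# Sanity check on a pure tone: amplitude 1.5, frequency 0.2 rad/sample
x = 1.5 * np.cos(0.2 * np.arange(1000) + 0.3)
amp, omega = desa1(x)
print(np.median(amp), np.median(omega))  # approximately 1.5 and 0.2
```

On a constant-amplitude tone the recovery is exact up to rounding; for music, the signal would first be decomposed into the modulated sinusoids the abstract assumes, with DESA-1 applied per component.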