In machine recognition, the benefit of utilizing multiple sources of evidence lies in the combination schemes employed. In speaker verification (SV) tasks, the score-level combination scheme is widely used. Score-level combination provides notable improvements in overall performance, but only when the evidence from different features is complementary in nature. It is conjectured that collectively contributed...
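The score-level scheme described above can be sketched as a weighted sum of normalized scores from two systems. The scores, the min-max normalization, and the weight `alpha` below are illustrative assumptions, not details from the paper; in practice the weight would be tuned on a development set.

```python
def minmax_norm(scores):
    """Map a list of raw scores to [0, 1] so systems are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(scores_a, scores_b, alpha=0.6):
    """Weighted-sum fusion of two normalized score lists."""
    a = minmax_norm(scores_a)
    b = minmax_norm(scores_b)
    return [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]

# Example: scores for four verification trials from two feature streams.
fused = fuse([1.2, -0.3, 0.8, 2.0], [0.1, 0.4, 0.9, 0.7])
```

Fusion only helps when the two streams make different errors, which is exactly the complementarity condition the abstract points to.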
In this paper, a text-independent speaker recognition system based on Gaussian mixture models (GMMs) was developed, with a specific focus on the use of a voice activity detection (VAD) algorithm in training and testing. At the training level, a modified expectation-maximization (EM) algorithm is used. It is less prone to becoming trapped in a local maximum and therefore has a better chance of converging...
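The paper's specific EM modification is not described in this snippet; as a baseline for comparison, a standard two-component 1-D GMM fit by EM can be sketched as below, with means initialized at the data extremes as one simple way to reduce the risk of collapsing onto a single mode. The data are synthetic.

```python
import math
import random

def em_gmm_1d(xs, n_iter=50):
    """Fit a two-component 1-D GMM with standard EM."""
    mu = [min(xs), max(xs)]      # init means at the extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: per-sample component responsibilities
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return w, mu, var

rng = random.Random(42)
data = [rng.gauss(0, 1) for _ in range(200)] + \
       [rng.gauss(5, 1) for _ in range(200)]
w, mu, var = em_gmm_1d(data)   # means should approach 0 and 5
```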
Voice applications often require the ability to make user-friendly responses by judging the user or user type from an extremely short utterance, such as a single word. However, performance is expected to degrade as the utterance length decreases. In this paper, we examine the performance of speaker identification for extremely short utterances of less than two seconds and then study the...
Binaural features, namely the interaural level difference and the interaural phase difference, have proved very effective for training deep neural networks (DNNs) to generate time-frequency masks for target speech extraction in speech-speech mixtures. However, the effectiveness of binaural features is reduced in the more common speech-noise scenarios, since the noise may overshadow the speech in adverse conditions...
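As a small illustration of the two binaural cues named above, the interaural level difference (ILD) and interaural phase difference (IPD) of a single time-frequency bin can be computed from the complex left/right STFT values. The bin values below are made up for illustration.

```python
import cmath
import math

def ild_db(left, right, eps=1e-12):
    """Interaural level difference of one T-F bin, in dB."""
    return 20 * math.log10((abs(left) + eps) / (abs(right) + eps))

def ipd(left, right):
    """Interaural phase difference of one T-F bin, in (-pi, pi]."""
    return cmath.phase(left * right.conjugate())

# Toy STFT values: the left channel is twice as strong and in phase.
left, right = complex(1.0, 0.5), complex(0.5, 0.25)
level_diff = ild_db(left, right)   # about +6 dB
phase_diff = ipd(left, right)      # about 0 rad
```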
This paper targets a generalized vocal-mode classifier (speech/singing) that works on audio data from an arbitrary source. Previous studies on sound classification are commonly based on cross-validation using a single dataset, without considering training-recognition mismatch. In our study, two experimental setups are used: a matched training-recognition condition and a mismatched training-recognition...
The enhancement of speech degraded by the non-stationary noise types that typify real-world conditions has remained a challenging problem for several decades. However, the recent use of data-driven methods for this task has brought great performance improvements. In this paper, we develop a speech enhancement framework based on the extreme learning machine. Experimental results show that the proposed...
An accurate dialect identification technique helps improve the speech recognition systems that exist in most present-day electronic devices, and it is also expected to enable new services in the fields of e-health and telemedicine, which are especially important for older and homebound people. The accuracy of a dialect identification system is highly dependent on its speech corpora. Therefore,...
Emotional databases can be classified into spontaneous and simulated emotions. Spontaneous emotions can be identified based on two parameters, 1) arousal and 2) valence, represented in a two-dimensional plane. Arousal measures how calming or exciting the information is, whereas valence measures the positive or negative affectivity of the information. The objective of this paper is to predict the arousal...
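A minimal sketch of the two-dimensional arousal-valence plane just described: a (valence, arousal) point in [-1, 1]^2 maps to one of four quadrants. The emotion labels here are conventional illustrative examples, not categories taken from the paper.

```python
def quadrant(valence, arousal):
    """Map a point in the valence-arousal plane to a quadrant label."""
    if valence >= 0:
        return "excited" if arousal >= 0 else "calm"
    return "angry" if arousal >= 0 else "sad"

label = quadrant(0.7, 0.8)   # positive valence, high arousal
```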
In this paper, Transcriber, a tool that automatically transcribes interviews in Indonesian using speech-to-text and speaker diarization technology, is described. Its main features are generating interview transcriptions automatically and providing an option to group the transcript by speaker when required. Transcriber is designed to work in two modes that give users the freedom to provide...
In the presence of environmental noise, speaker verification systems inevitably see a decrease in performance. This paper proposes (1) the use of two parallel classifiers, (2) feature enhancement based on blind signal-to-noise ratio (SNR) estimation, and (3) fusion to improve the performance of speaker verification systems. The two classifiers are based on Gaussian mixture models and the partial least-squares...
The paper proposes the use of only mostly voiced speech (MVS) for speaker verification (SV). The speech is partitioned into an MVS part and a non-MVS part by a simple machine classifier. SV experiments were conducted with a standard Gaussian mixture model (GMM) with universal background model (UBM) system and a GMM with a computationally improved individual background model (IBM) system. They demonstrate...
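The decision rule commonly underlying GMM-UBM verification is the average log-likelihood ratio between the speaker model and the background model over the test frames. The 1-D toy models and frame values below are assumptions for illustration, not figures from the paper.

```python
import math

def gmm_logpdf(x, components):
    """Log-likelihood of a scalar frame under a 1-D GMM, with
    components given as (weight, mean, variance) tuples."""
    probs = [w / math.sqrt(2 * math.pi * v)
             * math.exp(-(x - m) ** 2 / (2 * v))
             for w, m, v in components]
    return math.log(sum(probs))

def llr_score(frames, speaker_gmm, background_gmm):
    """Average log-likelihood ratio over the test frames."""
    return sum(gmm_logpdf(x, speaker_gmm) - gmm_logpdf(x, background_gmm)
               for x in frames) / len(frames)

ubm = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]       # toy background model
spk = [(0.5, 0.2, 0.8), (0.5, 3.8, 0.8)]       # toy adapted speaker model
score = llr_score([0.1, 3.9, 0.3], spk, ubm)   # > 0 favors the speaker
```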
In this paper, telephone conversation test results are reported. The main goal of the research is to derive a quality assessment model for today's Voice over Internet Protocol (VoIP) communication, including the influence of end-point terminals with their internal signal processing. For this reason, two different terminals were used during the test, and as wide a range of impairments as possible was simulated,...
Good speaker recognition systems should identify the speaker irrespective of what is spoken, including non-speech sounds that are often produced during natural conversations. In this work, the inclusion of breath sounds in the training phase of the speaker recognition is analyzed using the popular Gaussian mixture model-universal background model (GMM-UBM) and deep neural network (DNN) based systems...
This paper aims to compare the Linear Predictive Cepstral Coefficients (LPCC) method, the Mel-frequency Cepstral Coefficients (MFCC) method, their concatenation (LPCC-MFCC), and a newly proposed feature fusion approach based on this concatenation with normalization by the respective averages, Linear Predictive and Mel-frequency Cepstral Coefficients (LMACC), through applying a multi-layer...
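A rough sketch of the fusion idea, assuming per-dimension average (mean) normalization of each stream followed by frame-wise concatenation; the tiny two-frame `mfcc`/`lpcc` arrays are placeholders, since real coefficients would come from a feature extractor.

```python
def mean_normalize(frames):
    """Subtract each dimension's average across frames."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

def concat_features(mfcc, lpcc):
    """Normalize both streams, then concatenate frame by frame."""
    a, b = mean_normalize(mfcc), mean_normalize(lpcc)
    return [fa + fb for fa, fb in zip(a, b)]

mfcc = [[1.0, 2.0], [3.0, 4.0]]       # 2 frames x 2 placeholder coefficients
lpcc = [[0.5, 0.1], [1.5, 0.3]]
fused = concat_features(mfcc, lpcc)   # 2 frames x 4 dimensions
```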
This paper compares the use of signal to noise ratio (SNR)-dependent and SNR-independent mixtures of probabilistic linear discriminant analysis (PLDA) versus conventional PLDA, under multi-noise and multi-SNR conditions for a small-set speaker verification system. Results indicate that conventional PLDA is more robust under multi-SNR conditions. The effect of the testing speech length is also examined...
The performance of speech emotion classifiers degrades greatly when the training conditions do not match the testing conditions. This problem is observed in cross-corpora evaluations, even when the corpora are similar. The lack of generalization is particularly problematic when the emotion classifiers are used in real applications. This study addresses this problem by combining active learning (AL)...
When emotion recognition systems are used in new domains, classification performance usually drops due to mismatches between training and testing conditions. Annotating new data in the new domain is expensive and time-consuming. Therefore, it is important to design strategies that efficiently use a limited amount of new data to improve the robustness of the classification system. The use of...
Dialect can be defined as a variety of a language that is distinguished from other varieties of the same language by pronunciation, grammar and vocabulary. The process of recognizing such dialects is called Dialect Identification. Kamrupi, although a dialect of the Assamese language, is spoken both in Assam (Kamrup district) and North Bengal. In this paper, we describe a method to identify not just...
This paper describes a method for Speech Emotion Recognition (SER) using a Deep Neural Network (DNN) architecture with convolutional, pooling and fully connected layers. We used a 3-class subset (angry, neutral, sad) of the German corpus (Berlin Database of Emotional Speech) containing 271 labeled recordings with a total length of 783 seconds. Raw audio data were standardized so that every audio file has zero mean...
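The per-file standardization step mentioned above (zero mean, and typically unit variance as well) can be sketched as follows; the short synthetic sample list stands in for raw audio.

```python
import math

def standardize(samples):
    """Scale one audio file to zero mean and unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    std = math.sqrt(var) or 1.0   # guard against a constant (silent) file
    return [(s - mean) / std for s in samples]

audio = [0.2, -0.1, 0.4, 0.3, -0.5]   # synthetic stand-in for raw samples
z = standardize(audio)
```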
This special plenary session will celebrate Prof. McCluskey (who passed away in 2016) through three keynote speeches by world-renowned scholars on the next wave of pioneering innovations, starting with a memorial speech by Prof. Jacob Abraham of the University of Texas at Austin.