For the i-vector model, probabilistic linear discriminant analysis (PLDA) is a normalization approach that performs well for speaker verification. However, it requires a large amount of development data, which is costly to obtain in many cases. Unsupervised adaptation is a possible remedy: it uses unlabeled data to adapt the PLDA scattering matrices to the target domain. In this paper, ‘local training’ approach...
Model-based approaches to Speaker Verification (SV), such as Joint Factor Analysis (JFA), i-vector and relevance Maximum-a-Posteriori (MAP), have been shown to provide state-of-the-art performance for text-dependent systems with fixed phrases. The performance of i-vector and JFA models has been further enhanced by estimating posteriors from a Deep Neural Network (DNN) instead of a Gaussian Mixture Model (GMM)...
Performance often degrades dramatically in text-independent short-utterance speaker recognition (SUSR). This paper presents a combination approach to SUSR tasks with two phonetic-aware systems: one is the DNN-based i-vector system and the other is our recently proposed subregion-based GMM-UBM system. The former employs phone posteriors to construct an i-vector model in which the shared statistics...
This work focuses on Emirati speaker verification systems in neutral talking environments based on each of First-Order Hidden Markov Models (HMM1s), Second-Order Hidden Markov Models (HMM2s), and Third-Order Hidden Markov Models (HMM3s) as classifiers. These systems have been evaluated on our collected Emirati speech database, which comprises 25 male and 25 female Emirati speakers, using Mel-Frequency...
The popular i-vector model represents speakers as low-dimensional continuous vectors (i-vectors), and hence it is a way of continuous speaker embedding. In this paper, we investigate binary speaker embedding, which transforms i-vectors to binary vectors (codes) by a hash function. We start from locality sensitive hashing (LSH), a simple binarization approach where binary codes are derived from a set...
Diagnosis and monitoring of Parkinson's disease pose a number of challenges, as there is no definitive biomarker despite the broad range of symptoms. Research is ongoing to produce objective measures that can either diagnose Parkinson's or act as an objective decision support tool. Recent research on speech-based measures has demonstrated promising results. This study aims to investigate the characteristics...
Obstructive sleep apnea (OSA) is a common sleep-related breathing disorder. Previous studies associated OSA with anatomical abnormalities of the upper respiratory tract that may be reflected in the acoustic characteristics of speech. We tested the hypothesis that the speech signal carries essential information that can assist in early assessment of OSA severity by estimating the apnea-hypopnea index (AHI)...
In this paper we present a new database of speech recordings in Spanish. The database contains recordings of 54 native Spanish speakers. It is suitable for developing and testing better Speaker Verification systems. The recording procedure, equipment and speech tasks are detailed. Experiments using the GMM-UBM speaker verification methodology were performed. The methodology...
Emotional interaction plays an important role in human-computer interaction domains. One of the major limitations in the study of emotional interaction is the lack of databases. This paper describes a database for emotional interactions of the elderly. The database was collected with audio and video from sixteen actors (8 female and 8 male) in daily conversations of TV series, which covers seven type...
Gesticulation, together with speech, is an important part of natural and affective human-human interaction. Analysis of gesticulation and speech is expected to help in designing more natural human-computer interaction (HCI) systems. We build the JestKOD database, which consists of speech and motion capture recordings of dyadic interactions. In this paper we describe our annotation efforts and present...
In this paper, a speaker-dependent cohort selection method for T-norm score normalization is proposed in the context of text-dependent speaker verification. The goal of the proposed technique is to find a set of cohort speakers who are close to the target speaker. In order to properly select the subset of speakers for the normalization, a distance between each target speaker model and the available normalization...
Speaker voice characteristics are an important aspect of forensic phonetics. Previous studies have suggested that not all features present in the speech signal are equally important for speaker discrimination, and it is well known that some subsets of phonemes are more informative than others. However, most of these studies have concerned a whole group of speakers, without taking into account the...
Automatic speech recognition is one of the challenging areas in the field of speech signal processing. Automatic speech recognition technology converts a speech signal into text. This paper presents the implementation of an isolated Kannada word recognizer using Vector Quantization (VQ) and Fuzzy C-Means (FCM) techniques. The paper compares and contrasts the recognition accuracies of FCM and k-means techniques...
In this paper, a CART-based pause duration prediction model has been developed for the Malayalam language. Prosodic features such as pause durations and syllable prolongations play an important role in making the speech output from a Text To Speech (TTS) system more intelligible. An analysis of the various factors that affect pause duration in Malayalam has not been conducted to date. Here,...
In this paper we present a movie summarization system and investigate what constitutes a high-quality movie summary in terms of user experience evaluation. We propose state-of-the-art audio, visual and text techniques for the detection of perceptually salient events from movies. The evaluation of such computational models is usually based on the comparison of the similarity between the system-detected...
This paper presents work on the use of segmental modelling and phonetic features for phoneme-based speech recognition. The motivation for the work is to lessen the effects of the IID assumption in HMM-based recognition. The use of phonetic features which are derived across the duration of a phonetic segment is discussed. In conjunction with the use of these features, a hybrid phoneme model is introduced...
In this paper, we present the first step of a project that is able to perform both speech and singing synthesis controlled in real time. Our aim is to develop a flexible application allowing performers to produce complex and versatile singing - as well as speech - articulations. Thus, we have adapted an existing speech synthesizer, the MBROLA software, to real-time singing constraints. The work presented...
Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames...
This paper describes the development, implementation and validation of an automatic speaker recognition system on an iPad tablet. A score normalization approach, referred to as Nearest Neighbor Normalization (3N), is applied in order to improve the baseline speaker verification system. The system is evaluated on the MOBIO corpus and results show an absolute improvement of the HTER by more than 4% when...
In this paper, we present the details of a phonotactic language identification (LID) system developed for five Indian languages: English (Indian), Hindi, Malayalam, Tamil and Kannada. Since there are no publicly available speech databases for English, Malayalam and Kannada, we developed the database for each of the target languages by downloading the audio files from YouTube videos and removing the...