In order to train neural networks (NNs) for text-to-speech (TTS) synthesis, phonetic segmentation must be performed. The most accurate segmentation is performed manually, but the process of creating manual alignments is costly and time-consuming, so automatic procedures are preferable. In this paper, a simple alignment method based on models trained during hidden Markov model (HMM)-based TTS system...
This paper presents a voice transformation model that uses pitch data and feed-forward neural networks on Line Spectral Frequencies. The aim of this work is to transform a speech signal produced by a source speaker by modifying voice-individuality parameters such that it appears to be spoken by a chosen target speaker, without modifying the message content. Most of the previous...
Speech uttered by human beings contains information about the speaker, the language and the content. The language of an utterance can easily be identified by extracting language-specific information from it. Identifying the language of speech is known as Language Identification (LID). Identifying the language from speech is helpful in its translation, speech recognition and speech-activated automatic...
One of the difficulties in sung speech recognition is the small distance in acoustic space between phonemes in sung speech. Therefore, we considered clustering the speech by pitch (fundamental frequency, F0) to create a larger distance between the phonemes. In addition, we considered a two-stage training method for the DNN-HMM: the first stage is trained using conventional acoustic features...
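The pitch-based clustering idea above can be sketched minimally as follows, assuming per-frame F0 values have already been extracted. The band boundaries and band names here are illustrative assumptions, not values from the paper:

```python
def cluster_frames_by_pitch(f0_values, boundaries=(200.0, 350.0)):
    """Assign each frame index to a pitch band so that separate
    acoustic models could be trained per band.
    Frames with F0 <= 0 are treated as unvoiced."""
    low, high = boundaries
    clusters = {"unvoiced": [], "low": [], "mid": [], "high": []}
    for i, f0 in enumerate(f0_values):
        if f0 <= 0:
            clusters["unvoiced"].append(i)
        elif f0 < low:
            clusters["low"].append(i)
        elif f0 < high:
            clusters["mid"].append(i)
        else:
            clusters["high"].append(i)
    return clusters

# Hypothetical per-frame F0 track (Hz); 0.0 marks an unvoiced frame.
frames = [0.0, 180.0, 240.0, 400.0, 220.0]
print(cluster_frames_by_pitch(frames))
```

In practice each cluster's frames would feed a band-specific acoustic model, which is what creates the larger inter-phoneme distance the abstract refers to.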
Model-based approaches to Speaker Verification (SV), such as Joint Factor Analysis (JFA), i-vector and relevance Maximum-a-Posteriori (MAP), have been shown to provide state-of-the-art performance for text-dependent systems with fixed phrases. The performance of i-vector and JFA models has been further enhanced by estimating posteriors with a Deep Neural Network (DNN) instead of a Gaussian Mixture Model (GMM)...
Different modes of vibration of the vocal folds contribute significantly to voice quality. Neutral-mode phonation, used in the modal voice, is the reference against which the other modes, called non-modal phonations, can be contrastively described. This paper investigates the impact of non-modal phonation on phonological posteriors, the probabilities of phonological features inferred from the...
State-of-the-art approaches to text-to-speech (TTS) synthesis, such as unit selection and HMM synthesis, are data-driven: they use a prerecorded corpus of natural speech to build a voice. This paper investigates the influence of the size of the speech corpus on five different perceptual quality dimensions. Six German unit selection voices were created based on subsets of different sizes...
For acoustic modeling, DNNs have become popular due to the substantial performance improvements observed in many automatic speech recognition (ASR) tasks. Typically, DNNs with deep (many layers) and wide (many hidden units per layer) architectures are chosen in order to achieve good gains. An issue with such approaches is the explosion in the number of learnable parameters. Thus,...
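The parameter explosion is easy to quantify: a fully-connected network with layer sizes n_0, ..., n_L has sum(n_{i-1} * n_i + n_i) weights and biases. A quick illustration with hypothetical layer sizes of a typical ASR net (not figures from the paper):

```python
def dnn_param_count(layer_sizes):
    """Total weights plus biases of a fully-connected network
    whose layer widths are given in order, input to output."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 440 inputs, six hidden layers of 2048 units, 9000 output senones.
sizes = [440, 2048, 2048, 2048, 2048, 2048, 2048, 9000]
print(dnn_param_count(sizes))  # 40325928, i.e. roughly 40 million
```

Doubling the hidden width roughly quadruples the hidden-to-hidden terms, which is why wide-and-deep architectures grow so quickly.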
In this paper, we analyze the feasibility of using a single well-resourced language, English, as the source language for multilingual techniques in the context of a Stacked Bottle-Neck tandem system. The effect of the amount of data and the number of tied states in the source language on the performance of the ported system is evaluated, together with different porting strategies. Generally, increasing the data amount and level-of-detail...
Automatic speech recognition (ASR) of code-switching speech requires careful handling of unexpected language switches that may occur in a single utterance. In this paper, we investigate the feasibility of using multilingually trained deep neural networks (DNNs) for the ASR of Frisian speech containing code-switches to Dutch, with the aim of building a robust recognizer that can handle this phenomenon...
This paper presents a text-dependent speaker verification system using Mel-Frequency Cepstral Coefficients (MFCC) and a Support Vector Machine (SVM). The MFCC technique is used to extract characteristic features from the voice recordings spoken by the user, and the SVM is used to classify all the speaker and impostor models. A Malay spoken-digit database is utilized for the training...
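The MFCC-plus-SVM pipeline can be sketched as below. This is a minimal illustration assuming scikit-learn is available; the random vectors stand in for utterance-level MFCC features, which in the paper would come from the real Malay digit recordings:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for 13-dimensional MFCC feature vectors:
# one cluster for the target speaker, one for impostors.
target = rng.normal(loc=0.0, scale=1.0, size=(40, 13))
impostor = rng.normal(loc=3.0, scale=1.0, size=(40, 13))

X = np.vstack([target, impostor])
y = np.array([1] * 40 + [0] * 40)  # 1 = target speaker, 0 = impostor

clf = SVC(kernel="rbf").fit(X, y)

# Verify a new utterance-level feature vector drawn near the target.
probe = rng.normal(loc=0.0, scale=1.0, size=(1, 13))
decision = "accepted" if clf.predict(probe)[0] == 1 else "rejected"
print(decision)
```

Real systems would extract MFCCs from audio (e.g. with a dedicated feature-extraction library), pool them per utterance, and tune the SVM kernel and its parameters on held-out data.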
Building synthetic child voices is considered a difficult task due to the challenges associated with data collection. As a result, speaker adaptation in conjunction with Hidden Markov Model (HMM)-based synthesis has become prevalent in this domain because the approach caters for limited amounts of data. An initial average voice model is trained using data from multiple speakers and adapted to resemble...
As emotion recognition from speech has matured to a degree where it becomes suitable for real-life applications, it is time to develop techniques for matching different types of emotional data with multi-dimensional and category-based annotations. The categorical approach is usually applied to acted ‘full blown’ emotions, and multi-dimensional annotation is often preferred for spontaneous real...
A Phonetic Engine (PE) is a system used to determine the sequence of phones in a spoken utterance. The International Phonetic Alphabet (IPA) is used to transcribe the speech database. This work focuses on developing a multilingual PE for four Indian languages, namely Bengali, Hindi, Urdu and Telugu; the approach can be extended to any number of languages. For developing the PE, read speech...
This paper designs a software system for the smartphone platform. The purpose of this system is to provide a reasonable method for evaluating the English accent of non-native speakers, based on phoneme recognition and fluency assessment, taking advantage of a Hidden Markov Model (HMM). Meanwhile, this paper uses a neural network algorithm to combine the objective scoring with experts' scoring to increase the accuracy...
In this paper, a novel sparse representation over learned and exemplar dictionaries is explored to estimate the speech information in stressed speech. Stressed speech contains both speech and stress information. Acoustic variabilities induced by the presence of stress information degrade the performance of speech recognition systems. In this work, the acoustic variabilities...
In this paper we compare two state-of-the-art speech synthesis techniques (corpus- and HMM-based) in terms of expressive speech synthesis. Two corpora were composed with different speaking styles (broadcast news and literature reading) from the same female speaker. Our aim was to determine to what extent the different technologies reproduce these styles. The corpora and the synthetic expressive speech...
Classification of speech signals is one of the most vital problems in speech perception and spoken-word recognition. Although there have been many studies on the classification of speech signals, the results are still limited. In this paper, we propose an image-based approach to speech signal classification based on the combination of Local Naïve Bayes Nearest Neighbor (LNBNN) and Scale-invariant...
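The core Naïve Bayes Nearest Neighbor principle underlying LNBNN can be sketched as follows. This simplified version omits the SIFT descriptor extraction and LNBNN's local per-descriptor refinement: it classifies a bag of query descriptors by summing, for each class, each descriptor's squared distance to its nearest stored descriptor of that class. The class labels and random descriptors are hypothetical:

```python
import numpy as np

def nbnn_classify(query_descriptors, class_descriptors):
    """Naive Bayes Nearest Neighbor: choose the class minimising the
    sum over query descriptors of the squared distance to the nearest
    stored descriptor of that class."""
    scores = {}
    for label, descs in class_descriptors.items():
        # Pairwise squared distances, shape (n_query, n_class_descriptors).
        d = ((query_descriptors[:, None, :] - descs[None, :, :]) ** 2).sum(-1)
        scores[label] = d.min(axis=1).sum()
    return min(scores, key=scores.get)

rng = np.random.default_rng(1)
classes = {
    "yes": rng.normal(0.0, 1.0, size=(30, 8)),
    "no": rng.normal(4.0, 1.0, size=(30, 8)),
}
query = rng.normal(0.0, 1.0, size=(10, 8))  # drawn near class "yes"
label = nbnn_classify(query, classes)
print(label)
```

In the image-based setting described above, the descriptors would be SIFT features computed from spectrogram images of the speech signal rather than random vectors.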
The hybrid speech synthesis system, which combines the hidden Markov model with the unit selection method, has recently been widely used and researched in both industry and academia due to its naturalness and expressiveness. However, the target duration, which is used to control the duration of the selected candidates, is still predicted via the state-based duration model, whose performance is far from satisfactory...
Automatic Speech Recognition (ASR) is the process of converting human speech, in the form of an acoustic waveform, into text. In this paper we discuss building an automatic speech recognition system for Telugu news. A Telugu speech database is prepared along with transcriptions and a pronunciation dictionary. The Telugu speech files were collected from Telugu TV news channels. Most of the selected...