In low-resource Automatic Speech Recognition (ASR), one usually resorts to Statistical Machine Translation (SMT) techniques to learn transformation rules that refine the grapheme lexicon. This poses two challenges. One is to generate grapheme sequences from the training data as the targets, which are paired with the original transcripts to train SMT models; the other is to effectively prune the learned...
Recent research on the TIMIT database suggests that longer-length acoustic units are better suited for modelling pronunciation variation and long-term temporal dependencies in speech than traditional phoneme-length units, yielding substantial improvements in recognition accuracy [9]. In this paper, we investigate whether similar improvements can be gained on another database, viz. excerpts from novels...
In this paper, in order to properly evaluate the relative importance of priors and observed data in the Bayesian framework, we propose an extended Gaussian mixture model (EGMM) and design the corresponding learning inference algorithms. First, we define the likelihood function of the EGMM and then propose the variational learning algorithm for this EGMM. Moreover, the proposed model and approach are...
In this paper, we analyze three QoE-based speech quality evaluation models: PESQ, NPESQ and POLQA. PESQ (Perceptual Evaluation of Speech Quality) is a well-known objective speech quality assessment method for speech QoE evaluation, standardized as ITU-T Recommendation P.862. The NPESQ (New Perceptual Evaluation of Speech Quality) model is a new objective QoE model for evaluating the speech...
In this paper, we first described the automatic Spoken Chinese Test (SCT). With a large amount of native and non-native data collected for SCT, different training strategies for acoustic modeling were investigated. Evaluations were performed on native as well as non-native datasets. We discovered that directly combining native and non-native data to train acoustic models did not work well, and the...
None of the features commonly utilised in automatic emotion classification systems completely disassociate emotion-specific information from speaker-specific information. Consequently, this speaker-specific variability adversely affects the performance of the emotion classification system and in existing systems is frequently mitigated by some form of speaker normalisation. Speaker adaptation offers...
Recently, much work has been performed on CBIR (content-based image retrieval) that treats images as a single data domain. However, in our highly digitized society, information is being supplied in multiple domains where the data is linked across domains. For example, a web site contains images, but it may also contain text, hyperlinks, documents, sound files, movies, and other domains of data....
This paper describes a Part of Speech (POS) tagger that has been developed for Romanian Text-to-Speech purposes. In our Text-to-Speech (TTS) system, the Part of Speech tagger is used to disambiguate the pronunciation of some homograph words, determine the semantic links between words, phrase breaks and intonation phrase boundaries and eventually design the intonation curves. The paper focuses on the...
Local business voice search is a popular application for mobile phones, where hands-free interaction and speed are critical to users. However, speech recognition accuracy is still not satisfactory when the number of businesses and locations is extended nationwide. For mobile users, searching a local business directory is often related to the fulfillment of specific tasks “on-the-move”, such as finding...
An acoustic-phonetics based word-independent technique which uses syllable context for classifying the lexical syllable stress of spoken English words is presented. Nucleus based clustering is remarkably successful in moving from word-dependent syllable stress classification which is intrinsically not scalable to word-independent classification. This however is not possible without an inherent drop...
Over the last few decades speech recognition has evolved and matured enough to be used in commercial applications. The applications include automatic dictation software, voice dialling, voice controlled navigation and simple data entry. Automatic Speech Recognition (ASR) deals with automatic conversion of acoustic signals of an utterance into text. In this work speech recognition system for Tamil...
The privacy of voice over IP (VoIP) systems is achieved by compressing and encrypting the sampled data. This paper investigates in detail the leakage of information from Skype, a widely used VoIP application. In this research, it has been demonstrated, using the dynamic time warping (DTW) algorithm, that sentences can be identified with an accuracy of 60%. The results can be further improved by...
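As a rough illustration of the DTW step mentioned in this abstract, the following is a minimal sketch of textbook dynamic time warping used for nearest-template matching. It is not the authors' implementation: the abstract does not state which features are compared, so the numeric sequences, the absolute-difference local cost, and the helper names `dtw_distance` and `nearest_template` are all assumptions made for this example.

```python
def dtw_distance(a, b):
    """Cumulative DTW cost between two 1-D numeric sequences (hypothetical features)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_template(query, templates):
    """Return the label of the template sequence with the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))
```

Because DTW allows one element to align with several, a sequence and a time-stretched copy of it have zero cost, which is why it suits comparing utterances of differing duration.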
In this paper, a hierarchical structure is proposed for automatic gender identification (AGI). In this structure, two clustering techniques are used. The first technique is divisive clustering, which divides the speakers of each gender into several classes of speakers. The second clustering technique is agglomerative clustering for creating a hierarchical structure. Feature reduction is done by SOAP feature...
This paper addresses the problem of language modeling for LVCSR of Cantonese-English code-mixing utterances spoken in daily communications. In the absence of a sufficient amount of code-mixing text data, translation-based and semantics-based mappings are applied to n-grams to better estimate the probability of low-frequency and unseen mixed-language n-gram events. In the translation-based mapping scheme,...
Prosodic structure prediction plays a crucial role in the prosodic annotation of speech synthesis corpora as well as in improving the naturalness of synthesized speech. The paper studies Tibetan prosodic structure with Tibetan speech characteristics. Having analyzed a variety of variables that have an impact on Tibetan prosodic boundaries, we obtain syllable boundary grammatical information, prosodic...
This paper demonstrates the potential of theoretically motivated learning methods in solving the problem of non-intrusive quality estimation, for which the state of the art is represented by the ITU-T P.563 standard. To construct our estimator, we adopt the speech features from P.563, while we use a different mapping of features to form quality estimates. In contrast to P.563, which assumes distortion-classes...
This paper studies the influence of n-gram language models in the recognition of sung phonemes and words. We train uni-, bi-, and trigram language models for phonemes and bi- and trigrams for words. The word-level language model is estimated from a textual lyrics database. In the recognition we use a hidden Markov model based phonetic recognizer adapted to singing voice. The models were tested on...
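To make the n-gram idea in this abstract concrete, here is a minimal sketch of a bigram model over phoneme sequences. It is an illustration only, not the paper's setup: the add-one smoothing, the `<s>`/`</s>` boundary markers, and the function names `train_bigram` and `bigram_prob` are assumptions introduced for this example.

```python
from collections import Counter


def train_bigram(sequences):
    """Count context and bigram occurrences over phoneme sequences."""
    contexts, bigrams = Counter(), Counter()
    vocab = set()
    for seq in sequences:
        # Assumed boundary markers so that start/end transitions are modelled.
        tokens = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)


def bigram_prob(prev, cur, contexts, bigrams, vocab_size):
    """P(cur | prev) with add-one smoothing (an assumed smoothing choice)."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
```

In a recognizer, such probabilities would weight competing phoneme hypotheses; unseen transitions keep a small non-zero probability thanks to the smoothing.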
The number of vehicles on the road as well as the human drive time is increasing significantly. Many drivers increasingly attempt to multitask while driving, including eating, drinking, entertainment control, etc. A relatively new domain has emerged over the last 5 years focused on increased technology in the vehicle based on: GPS navigation systems, traffic, weather warning systems, advanced...
Grapheme-to-phoneme (G2P) conversion plays an important role in speech synthesis. The main difficulty facing Chinese G2P conversion is that many Chinese characters are polyphonic, having more than one pronunciation. A Chinese G2P system must be able to pick the correct pronunciation from among several candidates. Contextual information on neighboring characters such as character n-grams, phonetic...
We measure the effects of a weak language model, estimated from as little as 100k words of text, on unsupervised acoustic model training and then explore the best method of using word confidences to estimate n-gram counts for unsupervised language model training. Even with 100k words of text and 10 hours of training data, unsupervised acoustic modeling is robust, with 50% of the gain recovered when...