This paper proposes a method for extracting commentators' speech from the audio stream of live sports games. First, a two-pass metric-based audio segmentation module is developed to split the audio stream into short segments with homogeneous acoustic features. Then a model-based classification module is adopted to extract the speech segments. For robust audio classification, various...
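To make the idea of metric-based segmentation concrete, here is a minimal single-pass sketch: a distance is computed between the windows on either side of each frame, and peaks above a threshold are reported as segment boundaries. The abstract does not state which metric or which two passes the authors use, so the symmetric Kullback-Leibler divergence between 1-D Gaussians, the function names, the window length, and the threshold are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def gaussian_distance(left, right, floor=1e-6):
    """Symmetric KL divergence between 1-D Gaussians fitted to two windows."""
    m1, m2 = mean(left), mean(right)
    # Floor the variances so constant windows do not cause division by zero.
    v1 = max(pvariance(left), floor)
    v2 = max(pvariance(right), floor)
    d = (m1 - m2) ** 2
    return 0.5 * ((v1 + d) / v2 + (v2 + d) / v1) - 1.0


def find_change_points(features, window=4, threshold=5.0):
    """Return frame indices where the left/right window distance peaks.

    features is a list of scalar values, one per frame (hypothetical input;
    a real system would use multi-dimensional acoustic features).
    """
    scores = {}
    for i in range(window, len(features) - window + 1):
        scores[i] = gaussian_distance(features[i - window:i],
                                      features[i:i + window])
    changes = []
    for i, score in scores.items():
        left = scores.get(i - 1, float("-inf"))
        right = scores.get(i + 1, float("-inf"))
        # Keep only local maxima above the threshold (assumed decision rule).
        if score > threshold and score > left and score >= right:
            changes.append(i)
    return changes
```

The returned indices split the stream into acoustically homogeneous segments, which a separate classifier could then label as speech or non-speech.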
In this paper, we present an approach that uses articulatory features (AFs) derived from spectral features for speech emotion recognition. We also investigate the combination of AFs and spectral features. Systems based on AFs only and on combined spectral-articulatory features are tested on the CASIA Mandarin emotional corpus. Experimental results show that AFs alone are not suitable for speech emotion...
This paper presents our Mandarin pronunciation quality assessment system for the Putonghua Shuiping Kaoshi (PSK) examination and investigates some measures to improve the assessment accuracy. In this paper, a selective speaker adaptation method is studied. In the adaptation module, we select well-pronounced speech as the adaptation data, and adopt Maximum Likelihood Linear Regression (MLLR) to...
In this paper, we study how to generate in-domain data for statistical language model adaptation in a Chinese voice search dialogue system. Given a limited amount of in-domain data, we use unsupervised clustering to induce semantic classes and structures from the first part of the test data. These structures are further augmented with domain information to generate a large amount of in-domain data. Lastly...
This paper proposes a novel system that automatically determines the sports type of a sports game by performing keyword spotting on short fragments (around 10 minutes) of the game. In this system, we first develop an audio segmentation module as a front-end to separate announcers' speech efficiently from the complex sports audio stream. Then we employ speech recognition technology on these speech...
Modern lifestyles have increased the risk of pathological voice problems, so therapy for people with voice disorders is attracting more attention. Meanwhile, acoustic features have been widely used in the therapy of people with voice disorders. Classifying speakers as normal or pathological is also an auxiliary therapy operation. MFCC has been proved to be a useful feature with traditional classifier...
In this paper we develop an approach to automatic, data-driven generation of pronunciation dictionaries for keyword spotting (KWS) systems. In practical applications, KWS tasks often have to deal with keywords whose pronunciations cannot be found in the dictionary. To solve this problem, we study how to derive pronunciations automatically from speech samples of keywords. Recognized sequences from...
The great success of the Minimum Phone Error (MPE) training criterion in monolingual large vocabulary continuous speech recognition (LVCSR) tasks motivates us to apply it to bilingual LVCSR systems. In this paper, in conjunction with previous bilingual phoneme inventory construction techniques, we comprehensively investigate the performance of MPE/fMPE on various Mandarin-English...
Voice search is a technology that enables users to access information using spoken queries. The automatic speech recognizer (ASR) is one of the key modules of a voice search system. However, the high error rate of state-of-the-art large vocabulary continuous speech recognition (LVCSR) is the bottleneck for most voice search systems. In this paper, we first build a baseline system using language...
This paper presents an improvement to confidence measure estimation based on posterior probabilities on lattices in speech recognition. We observe that nontarget regions, i.e. the non-speech parts of a spoken utterance, of different lengths may lead to different levels of over-optimistic confidence measures. This may be problematic in obtaining a consistent rejection performance at the same...
This study introduces some research activities on expressive speech recognition and conversion. A database consisting of five kinds of speech emotions (i.e. happiness, sadness, surprise, anger and neutral) is used. In addition to traditional features such as MFCC, PLP, and pitch, a new feature extraction method based on Fisher's F-ratio is proposed and reported. In...
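As an illustration of the criterion named above, the following minimal sketch computes a Fisher-style F-ratio for one feature dimension: the variance of the class means divided by the average within-class variance. The abstract does not give the exact formulation, so this textbook definition, the function name `f_ratio`, and the toy values are assumptions rather than the authors' method.

```python
from statistics import mean, pvariance


def f_ratio(samples_by_class):
    """Fisher-style F-ratio for a single feature dimension.

    samples_by_class maps a class label (e.g. an emotion) to the list of
    feature values observed for that class.
    """
    class_means = [mean(values) for values in samples_by_class.values()]
    # Spread of the class means: large when the classes are well separated.
    between = pvariance(class_means)
    # Average spread inside each class: small when each class is compact.
    within = mean([pvariance(values) for values in samples_by_class.values()])
    if within == 0:
        raise ValueError("within-class variance is zero; F-ratio is undefined")
    return between / within


# Toy values (hypothetical): two well-separated classes give a high ratio.
toy = {"happiness": [1.0, 2.0, 3.0], "sadness": [5.0, 6.0, 7.0]}
print(f_ratio(toy))
```

A higher ratio suggests the feature separates the classes better, so dimensions can be ranked by this value before classification.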
This paper presents a novel bilingual model modification approach to improve nonnative speech recognition accuracy when accented pronunciation variations occur. Each state of the baseline nonnative acoustic model is modified with several candidate states from the auxiliary acoustic model, which is trained on the speakers' native language. State mapping criterion and n-best candidates are investigated,...
In this paper, a novel statistical method based on conditional random fields (CRF) is proposed for hierarchical prosody structure prediction, which is a key module in speech synthesis systems. We discuss in detail how to build the prosody models for Mandarin Chinese using conditional random fields, including corpus preparation, feature selection, feature template design, model training and evaluation...
Maximum likelihood linear regression (MLLR) is a widely used technique for speaker adaptation in large vocabulary speech recognition systems. Recently, using MLLR transforms as features for SVM-based speaker recognition tasks has been proposed, achieving performance comparable to that obtained with cepstral features. In this paper, we focus on calculating the transforms based on a GMM universal background...
Eigenvoice speaker adaptation has been shown to be effective in recent years. In this paper, we propose to use eigenvoice coefficients as features for speaker recognition. We use a simplified version of probabilistic subspace adaptation (PSA) to estimate eigenvoice coefficients, and the coefficients are concatenated to construct supervectors of support vector machines. This approach significantly...
In this paper, a synchronous method based on a state graph is proposed to calculate the evaluation feature for automatic scoring in computer-assisted language learning (CALL). The posterior probabilities of states are selected as the main feature. The scores of hypothesized phonemes and words are estimated using the information of the corresponding states. Traditional systems use two passes and two different...
This paper examines the system combination issue for syllable-confusion-network (SCN)-based Chinese spoken term detection (STD). System combination for STD usually leads to improvements in accuracy but suffers from increased index size or complicated index structure. This paper explores methods for efficient combination of a word-based system and a syllable-based system while keeping the compactness...
In this paper, we present a new modeling approach for speaker recognition, which uses a kind of novel phonotactic information as the feature for SVM modeling. Gaussian mixture models (GMMs) have been proven extremely successful for text-independent speaker recognition. The GMM universal background model (UBM) is a speaker-independent model, each component of which can be considered to be modeling...
In order to alleviate the limitation of the "state output probability conditional independence" assumption held by hidden Markov models (HMMs) in speech recognition, a discriminative semi-parametric trajectory model was proposed in recent years, in which both means and variances in the acoustic models are modeled as time-varying variables. The time-varying information is modeled as a weighted...
For a reading tutor, the reference content which the reader reads is known beforehand. This a priori information is very important in automatic detection of reading miscues. This paper proposes two methods to incorporate the reference information into the LVCSR framework to improve the performance of miscue detection. Both methods tune the n-gram language model (LM) probabilities dynamically in...