For speech recognition systems, an improved multi-base neural network speech recognition model has been proposed to address the long learning time and slow convergence rate of deep neural networks. However, the improved model introduces a large number of parameters during training, which leaves the model over-fitted on the test set, resulting in the deterioration of generalization ability and the...
This paper presents a review of a few notable speech recognition models reported in the last decade. First, the models are categorized into sparse models, learning models and domain-specific models. Subsequently, the characteristics of the models are examined in terms of speech constraints, algorithmic constraints and performance constraints. The performance of these models reported in...
This paper introduces a new back-end classifier for a speech recognition system that is based on artificial life (ALife). The ALife species being used for classification purposes are called wains, which were developed using the Créatúr framework. The speech recognition task used in the evaluation of the new classifier is that of isolated digit recognition. Performance of the proposed back-end classifier...
Speech recognition is widely applied to speech-to-text and speech-to-emotion tasks, to make gadgets and computers easier to use, or to help people with hearing disabilities. Feature extraction is one of the significant steps affecting the performance of speech recognition, so the features must be selected properly. In this paper, we analyze feature extraction that can have good performance for Indonesian speech...
Speaker identification (SID) in cochannel speech, where two speakers are talking simultaneously over a single recording channel, is a challenging problem. Previous studies address this problem in the anechoic environment under the Gaussian mixture model (GMM) framework. On the other hand, cochannel SID in reverberant conditions has not been addressed. This paper studies cochannel SID in both anechoic...
We propose the prediction-adaptation-correction RNN (PAC-RNN), in which a correction DNN estimates the state posterior probability based on both the current frame and the prediction made on the past frames by a prediction DNN. The result from the main DNN is fed back to the prediction DNN to make better predictions for the future frames. In the PAC-RNN, we can consider that, given the new, current...
The recognition of contact names in mobile-device voice commands is a challenging problem. Some of the difficulties include potentially infinite vocabularies, low probability of contact tokens in the language model (LM), increased false triggering of contact voice commands when none are spoken, and very large and noisy contact name lists. In this paper we suggest solutions for each of these difficulties.
Hidden Markov Models (HMMs) are one of the most important techniques to model and classify sequential data. Maximum Likelihood (ML) and (parametric and non-parametric) Bayesian estimation of the HMM parameters suffer from local maxima, and on massive datasets they can be especially time-consuming. In this paper, we extend the spectral learning of HMMs, a moment matching learning technique free from...
The presence of Lombard Effect in speech is proven to have severe effects on the performance of speech systems, especially speaker recognition. Varying kinds of Lombard speech are produced by speakers under influence of varying noise types [1]. This study proposes a high-accuracy classifier using deep neural networks for detecting various kinds of Lombard speech against neutral speech, independent...
Recent research on the TIMIT database suggests that longer-length acoustic units are better suited for modelling pronunciation variation and long-term temporal dependencies in speech than traditional phoneme-length units, yielding substantial improvements in recognition accuracy [9]. In this paper, we investigate whether similar improvements can be gained on another database, viz. excerpts from novels...
This paper presents a speaker-based Language Independent Isolated Speech Recognition System (LIISRS). The most popular feature extraction technique, Mel Frequency Cepstral Coefficients (MFCC), is used for training the system. Representative specific features are identified using the K-Means algorithm. The distortion measure is calculated using the Euclidean distance function. Pitch contour characteristics are used...
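The K-Means and Euclidean-distortion steps mentioned in this abstract resemble a classic vector-quantization classifier. The following is a minimal sketch under stated assumptions, not the paper's implementation: random 13-dimensional vectors stand in for MFCC frames, and the names `kmeans_codebook`, `average_distortion`, `class_a` and `class_b` are hypothetical.

```python
import numpy as np

def kmeans_codebook(features, k, iters=20, seed=0):
    """Plain K-Means: return k centroids (a codebook) for the feature frames."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Euclidean distance from every frame to every centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def average_distortion(features, codebook):
    """Mean Euclidean distance from each frame to its nearest codeword."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

# Synthetic stand-ins for MFCC frames (assumption: 13 coefficients per frame).
rng = np.random.default_rng(0)
train_a = rng.normal(0.0, 1.0, size=(300, 13))
train_b = rng.normal(5.0, 1.0, size=(300, 13))
codebooks = {
    "class_a": kmeans_codebook(train_a, k=8),
    "class_b": kmeans_codebook(train_b, k=8),
}

# Classify an unseen utterance by the codebook with the lowest distortion.
test_utterance = rng.normal(5.0, 1.0, size=(50, 13))
scores = {name: average_distortion(test_utterance, cb) for name, cb in codebooks.items()}
best = min(scores, key=scores.get)
print(best)
```

In a real system the synthetic arrays would be replaced by MFCC frames extracted from audio; the abstract does not say how many clusters are used or how pitch contour features enter the decision, so those details are left out here.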
In this paper, we propose a two-stage phone recognition system using articulatory and spectral features. In the first stage, articulatory features are predicted from spectral features using FeedForward Neural Networks (FFNNs). In the second stage, phone recognition is carried out using the predicted articulatory features and spectral features together. FFNNs and Hidden Markov Models are explored for...
The goal of this work is to improve phone recognition accuracy using a combination of source and system features. As speech is produced by exciting a time-varying vocal tract system with a time-varying excitation, we want to explore both the source and system components of the speech production system for phone recognition. The excitation source information is derived by processing the linear prediction residual of...
Unsupervised speaker adaptation of a Deep Neural Network (DNN) is investigated for lecture transcription tasks, in which a single speaker gives a long speech and thus speaker adaptation is important. The proposed method selects speakers similar to the test data (test speaker) from the training database, which are used for retraining the baseline DNN. Several speaker characteristic features are defined...
Research on speech/music classification of digital audio has been both popular in academia and increasingly utilized in industry. Most of the usual methods use carefully hand-crafted features with Gaussian Mixture Models. To get the best performance, some of the features necessitate a long latency due to look-ahead, and/or a long onset error. This paper aims to take a different approach to the problem...
Speech recognition systems are based on either a parametric or a non-parametric approach. Parametric systems such as HMMs have been the dominant technology for speech recognition in the past decade. Despite many advancements and enhancements in the design of these systems, key problems such as long-term temporal dependence have not yet been solved. Recently due to availability...
Computer assisted language learning (CALL) and, more specifically, computer assisted pronunciation training (CAPT) have received considerable attention in recent years. CAPT allows continuous feedback to the learner without requiring the sole attention of the teacher; it facilitates self study and encourages interactive use of the language in preference to rote learning. One of the important processes...
In the past decade a lot of research has gone into Automatic Speech Emotion Recognition (SER). The primary objective of SER is to improve the man-machine interface. It can also be used to monitor the psychophysiological state of a person in lie detectors. Recently, speech emotion recognition has also found applications in medicine and forensics. In this paper 7 emotions are recognized using pitch...
In speech recognition systems, Mel Frequency Cepstrum Coefficients (MFCC) feature extraction is an important process. It has also been widely used in many applications. In this paper, we present the conventional MFCC feature extraction method and propose two novel versions of the MFCC method that combine the PCA technique with conventional MFCC feature extraction. Finally, these three...
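The abstract does not specify how PCA is combined with MFCC extraction. One common arrangement, shown here purely as an illustrative sketch, is to project already-computed MFCC frames onto their leading principal components. The function `pca_reduce`, the synthetic `mfcc_like` matrix and the choice of 6 components are all assumptions, not details from the paper.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature frames onto their top principal components via SVD."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(features) - 1)
    return centered @ components.T, components, explained_variance

# Synthetic stand-in for an MFCC matrix: 200 frames x 13 coefficients,
# with variance that decreases across coefficients (assumption).
rng = np.random.default_rng(0)
mfcc_like = rng.normal(size=(200, 13)) * np.linspace(3.0, 0.5, 13)

reduced, components, variance = pca_reduce(mfcc_like, n_components=6)
print(reduced.shape)
```

The reduced frames are decorrelated and lower-dimensional, which is the usual motivation for applying PCA to cepstral features before classification.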
We survey evidence (orthographic, distributional, phonological and psycholinguistic) in favor of a model of Arabic speech sounds based on the CV unit and extensive use of the silent sukuun vowel. We then construct a small-vocabulary multi-speaker CV HMM similar to the phonemic HMMs based on tied triphones that are widely used in speech recognizers for English and other European languages. Using experimental...