Deep Neural Networks (DNNs) are currently the dominant technique in English and Chinese speech recognition. However, Tibetan speech recognition research started late and still relies mainly on Hidden Markov Models (HMMs). In this paper, we show that replacing Gaussian Mixture Models (GMMs) with DNNs improves a Tibetan Lhasa dialect speech recognition system. The system contains seven layers of features...
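In a hybrid DNN-HMM system of the kind this abstract describes, the DNN replaces the GMM by emitting state posteriors, which are divided by state priors to give scaled likelihoods for the HMM decoder. A minimal sketch with toy dimensions and random weights (illustrative only, not the paper's actual seven-layer system):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dnn_state_posteriors(x, weights, biases):
    """Forward pass of a small MLP: ReLU hidden layers, softmax output
    over HMM states (the role the GMM likelihoods used to play)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)                # ReLU hidden layer
    return softmax(h @ weights[-1] + biases[-1])      # state posteriors

rng = np.random.default_rng(0)
dims = [39, 64, 64, 10]    # MFCC input -> 2 hidden layers -> 10 states (toy sizes)
Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]

post = dnn_state_posteriors(rng.standard_normal((5, 39)), Ws, bs)
priors = np.full(10, 0.1)
scaled_lik = post / priors  # "scaled likelihoods" passed to the HMM decoder
```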
In this paper, we continue our work on linear least squares based adaptation (LLS) for deep neural networks. We show that our previously proposed algorithm is a special case of an optimization algorithm called Alternating Direction Method of Multipliers (ADMM). We demonstrate that the adaptation algorithm can improve the performance on various deep neural networks including the bidirectional long...
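The snippet does not give details of the ADMM connection, but ADMM itself can be illustrated on a standard problem. A sketch of ADMM for the lasso (a textbook example, not the paper's adaptation algorithm):

```python
import numpy as np

def soft_threshold(v, k):
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM for the lasso: min (1/2)||Ax-b||^2 + lam*||z||_1  s.t. x = z."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.inv(AtA + rho * np.eye(n))   # cached solve for the x-update
    for _ in range(iters):
        x = L @ (Atb + rho * (z - u))          # x-update: ridge-like solve
        z = soft_threshold(x + u, lam / rho)   # z-update: proximal map of l1
        u = u + x - z                          # scaled dual update
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = admm_lasso(A, b, lam=0.1)
```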
Long short-term memory (LSTM) recurrent neural network based language models are known to improve speech recognition performance. However, significant effort is required to optimize network structures and training configurations. In this study, we automate the development process using evolutionary algorithms. In particular, we apply the covariance matrix adaptation-evolution strategy (CMA-ES), which...
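CMA-ES itself adapts a full covariance matrix; a much-simplified (mu/mu, lambda) evolution strategy conveys the basic sample-rank-recombine loop it builds on. A toy sketch with a fixed step-size schedule and no covariance adaptation (illustrative only):

```python
import numpy as np

def simple_es(f, x0, sigma=1.0, lam=20, mu=5, iters=100, seed=0):
    """A much-simplified (mu/mu, lambda) evolution strategy.
    Real CMA-ES additionally adapts a full covariance matrix."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    for _ in range(iters):
        pop = x + sigma * rng.standard_normal((lam, x.size))  # sample offspring
        fit = np.array([f(p) for p in pop])
        elite = pop[np.argsort(fit)[:mu]]     # keep the mu best
        x = elite.mean(axis=0)                # recombine by averaging
        sigma *= 0.97                         # crude step-size schedule
    return x

sphere = lambda v: float(np.sum(v * v))
best = simple_es(sphere, x0=[3.0, -2.0, 1.0])
```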
A deep neural network (DNN) is trained with mini-batch optimization based on the stochastic gradient descent algorithm. Such stochastic learning suffers from instability in parameter updates and may easily become trapped in a local optimum. This study addresses the stability of stochastic learning by reducing the variance of gradients in the optimization procedure. We upgrade the optimization from the...
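One standard way to reduce gradient variance in stochastic learning is SVRG, which corrects each stochastic gradient with a periodically recomputed full gradient. A sketch on a least-squares toy problem (illustrative; the snippet does not specify the paper's exact method):

```python
import numpy as np

def svrg(A, b, lr=0.01, epochs=20, seed=0):
    """SVRG: each stochastic gradient is corrected by the same sample's
    gradient at a snapshot plus the full gradient at that snapshot."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = A.T @ (A @ w_snap - b) / n        # full gradient at snapshot
        for _ in range(n):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ w - b[i])             # stochastic grad at w
            gi_snap = A[i] * (A[i] @ w_snap - b[i])   # same sample at snapshot
            w = w - lr * (gi - gi_snap + full_grad)   # variance-reduced step
    return w

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 5))
w_true = np.arange(1.0, 6.0)
b = A @ w_true
w_hat = svrg(A, b)
```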
This paper describes an enhancement strategy based on several perceptual-assessment criteria for dereverberation algorithms. The complete procedure is applied to an algorithm for reverberant speech enhancement based on single-channel blind spectral subtraction. This enhancement was implemented by combining different quality measures, namely the so-called QAreverb, the speech-to-reverberation modulation...
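Single-channel spectral subtraction, the base algorithm mentioned here, can be sketched in its simplest oracle form, where the noise magnitude spectrum is given (real systems estimate it from non-speech frames):

```python
import numpy as np

def spectral_subtract(noisy, noise_est, alpha=1.0, floor=0.01):
    """Spectral subtraction: subtract an estimated noise magnitude
    spectrum, keep the noisy phase, and floor the result."""
    spec = np.fft.rfft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = np.abs(np.fft.rfft(noise_est))
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 1024, endpoint=False)
clean = np.sin(2 * np.pi * 50 * t)       # toy "speech": a 50 Hz tone
noise = 0.3 * rng.standard_normal(1024)
noisy = clean + noise
enhanced = spectral_subtract(noisy, noise)
```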
Linear Discriminant Analysis (LDA) has been applied successfully to speech recognition tasks, improving accuracy and robustness against some types of noise. However, it is well known that LDA suffers from some weaknesses if the distributions are not unimodal or when the means of the distributions are shared. In this paper, we propose to take advantage of the nonlinear discriminant properties of the...
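For reference, two-class Fisher LDA computes the projection w = Sw^{-1}(m1 - m0); it is exactly this reliance on class means and pooled scatter that breaks down for multimodal or shared-mean classes. A minimal sketch:

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Two-class Fisher LDA: w = Sw^{-1}(m1 - m0), maximizing
    between-class over within-class scatter."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(4)
X0 = rng.standard_normal((200, 3))                          # class 0 at origin
X1 = rng.standard_normal((200, 3)) + np.array([3.0, 0, 0])  # class 1 shifted
w = fisher_lda_direction(X0, X1)
```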
We propose strategies for a state-of-the-art keyword search (KWS) system developed by the SINGA team in the context of the 2014 NIST Open Keyword Search Evaluation (OpenKWS14) using conversational Tamil provided by the IARPA Babel program. To tackle low-resource challenges and the rich morphological nature of Tamil, we present highlights of our current KWS system, including: (1) Submodular optimization...
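Submodular optimization for data or keyword selection is typically solved greedily, which carries a (1 - 1/e) approximation guarantee for monotone submodular objectives. A toy coverage example (not the paper's actual objective):

```python
def greedy_max_coverage(sets, k):
    """Greedy maximization of a monotone submodular coverage function:
    repeatedly pick the set with the largest marginal gain."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(range(len(sets)),
                   key=lambda i: len(sets[i] - covered))
        if len(sets[best] - covered) == 0:
            break                       # no remaining gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, k=2)
```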
This article proposes and evaluates a Gaussian Mixture Model (GMM) represented as the last layer of a Deep Neural Network (DNN) architecture and jointly optimized with all previous layers using Asynchronous Stochastic Gradient Descent (ASGD). The resulting “Deep GMM” architecture was investigated with special attention to the following issues: (1) The extent to which joint optimization improves over...
In this paper we present an investigation of sequence-discriminative training of deep neural networks for automatic speech recognition. We evaluate different sequence-discriminative training criteria (MMI and MPE) and optimization algorithms (including SGD and Rprop) using the RASR toolkit. Further, we compare the training of the whole network with that of the output layer only. Technical details...
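Rprop, one of the optimization algorithms compared, adapts a per-parameter step size from the sign of successive gradients and ignores the gradient magnitude. A minimal variant on a toy quadratic (no weight backtracking):

```python
import numpy as np

def rprop_step(grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=1.0):
    """One Rprop update: grow the step when the gradient sign repeats,
    shrink it on a sign change; move by -sign(grad) * step."""
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)
    return -np.sign(grad) * step, step

# minimize f(w) = ||w||^2 with Rprop
w = np.array([4.0, -3.0])
step = np.full(2, 0.1)
prev_grad = np.zeros(2)
for _ in range(60):
    grad = 2 * w
    delta, step = rprop_step(grad, prev_grad, step)
    w = w + delta
    prev_grad = grad
```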
This paper proposes a statistical methodology based on evolving Fuzzy-rule-based (FRB) classifiers to develop dialog managers for spoken dialog systems. The dialog managers developed by means of our proposal select the next system action by considering a set of dynamic rules that are automatically obtained by means of the application of the FRB classification process. Our approach has the main advantage...
The discriminative optimization of decoding networks is important for minimizing speech recognition error. Recently, several methods have been reported that optimize decoding networks by extending weighted finite state transducer (WFST)-based decoding processes to a linear classification process. In this paper, we model decoding processes by using conditional random fields (CRFs). Since the maximum...
With the aim of achieving a computationally efficient optimization of kernel-based probabilistic models for various problems, such as sequential pattern recognition, we have already developed the kernel gradient matching pursuit method as an approximation technique for kernel-based classification. The conventional kernel gradient matching pursuit method approximates the optimal parameter vector by...
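Matching pursuit in its generic (non-kernel) form greedily selects the dictionary atom most correlated with the current residual. A sketch on a random unit-norm dictionary (illustrative only):

```python
import numpy as np

def matching_pursuit(D, y, n_atoms):
    """Greedy matching pursuit: at each step pick the atom most
    correlated with the residual and subtract its contribution."""
    residual = y.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual            # atoms assumed unit-norm
        i = np.argmax(np.abs(corr))
        coeffs[i] += corr[i]
        residual -= corr[i] * D[:, i]
    return coeffs, residual

rng = np.random.default_rng(5)
D = rng.standard_normal((30, 12))
D /= np.linalg.norm(D, axis=0)           # normalize atoms
y = 2.0 * D[:, 3] - 1.5 * D[:, 8]        # sparse combination of two atoms
coeffs, residual = matching_pursuit(D, y, n_atoms=10)
```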
Automatic speech recognition (ASR) is an enabling technology for a wide range of information processing applications including speech translation, voice search (i.e., information retrieval with speech input), and conversational understanding. In these speech-centric applications, the output of ASR as “noisy” text is fed into down-stream processing systems to accomplish the designated tasks of translation,...
In agglutinative languages, the selection of lexical units is not obvious. Morpheme units are usually adopted to ensure sufficient coverage, but many morphemes are short, resulting in weak constraints and possible confusion. In this paper, we propose a discriminative approach to selecting lexical entries that directly contribute to ASR error reduction. We define an evaluation function for each word...
Training a fuzzy neural network (FNN) is an optimization task that seeks optimal centers for the membership functions and optimal weights. Traditional training algorithms have drawbacks such as getting stuck in local minima and high computational complexity. This work presents an FNN trained by the artificial bee colony (ABC) optimization algorithm, which has good exploration and exploitation capabilities...
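A simplified artificial bee colony loop on a toy function shows the three phases the algorithm is built from: employed bees perturb food sources, onlookers favor good ones, and scouts replace stale ones. The tournament onlooker selection here is a simplification of the usual fitness-proportional rule:

```python
import random

def abc_minimize(f, dim, bounds, n_food=10, limit=20, iters=200, seed=0):
    """Simplified artificial bee colony minimization, memorizing the
    globally best food source ever seen."""
    rng = random.Random(seed)
    lo, hi = bounds
    foods = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_food)]
    fits = [f(x) for x in foods]
    trials = [0] * n_food
    best_x, best_f = None, float("inf")

    def try_neighbor(i):
        j = rng.randrange(dim)
        k = rng.choice([p for p in range(n_food) if p != i])
        cand = foods[i][:]
        cand[j] += rng.uniform(-1, 1) * (foods[i][j] - foods[k][j])
        cand[j] = min(max(cand[j], lo), hi)
        fc = f(cand)
        if fc < fits[i]:
            foods[i], fits[i], trials[i] = cand, fc, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_food):                  # employed bee phase
            try_neighbor(i)
        for _ in range(n_food):                  # onlooker phase
            i = min(rng.randrange(n_food), rng.randrange(n_food),
                    key=lambda p: fits[p])       # biased toward good sources
            try_neighbor(i)
        for i in range(n_food):                  # scout phase
            if trials[i] > limit:
                foods[i] = [rng.uniform(lo, hi) for _ in range(dim)]
                fits[i], trials[i] = f(foods[i]), 0
        for i in range(n_food):                  # remember the global best
            if fits[i] < best_f:
                best_x, best_f = foods[i][:], fits[i]
    return best_x, best_f

sphere = lambda x: sum(v * v for v in x)
best_x, best_f = abc_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
```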
Speech translation (ST) is an enabling technology for cross-lingual oral communication. A ST system consists of two major components: an automatic speech recognizer (ASR) and a machine translator (MT). Nowadays, most ASR systems are trained and tuned by minimizing word error rate (WER). However, WER counts word errors at the surface level. It does not consider the contextual and syntactic roles of...
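WER, the criterion being questioned here, is the word-level Levenshtein distance (substitutions + deletions + insertions) normalized by the reference length. A standard implementation:

```python
def word_error_rate(ref, hyp):
    """WER via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```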
We introduce the Line Search A-Function (LSAF) technique that generalizes the Extended-Baum Welch technique in order to provide an effective optimization technique for a broader set of functions. We show how LSAF can be applied to functions of various probability density and distribution functions by demonstrating that these probability functions have an A-function. We also show that sparse representation...
This paper presents a novel method for reducing the dimensionality of kernel spaces. Recently, to maintain the convexity of training, log-linear models without mixtures have been used as emission probability density functions in hidden Markov models for automatic speech recognition. In that framework, nonlinearly-transformed high-dimensional features are used to achieve the nonlinear classification...
Log-linear acoustic models have been shown to be competitive with Gaussian mixture models in speech recognition. Their high training time can be reduced by feature selection. We compare a simple univariate feature selection algorithm with ReliefF - an efficient multivariate algorithm. An alternative to feature selection is ℓ1-regularized training, which leads to sparse models. We observe that this...
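ℓ1-regularized training produces sparse models because the proximal step soft-thresholds small weights to exactly zero, an implicit form of feature selection. A sketch with logistic regression on synthetic data (illustrative; not the paper's acoustic model):

```python
import numpy as np

def l1_logreg_prox(X, y, lam=0.05, lr=0.1, iters=500):
    """l1-regularized logistic regression by proximal gradient descent:
    a gradient step on the log-loss, then soft-thresholding, which
    zeroes out uninformative features."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / n
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox of l1
    return w

rng = np.random.default_rng(6)
X = rng.standard_normal((300, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(float)   # only two features matter
w = l1_logreg_prox(X, y)
```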
Metamodels are a technique developed to model a speaker's phoneme confusion matrix and use this information to increase speech recognition accuracy for speakers with disordered or normal speech. Approaches to improving the performance of metamodels, mainly focused on obtaining better estimates of the speaker's confusion matrix, were studied. While some achieved significant improvements,...
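The core idea of a confusion-matrix correction can be sketched as Bayes inversion: given P(recognized | spoken) for a speaker, pick the spoken phoneme maximizing P(spoken | recognized). A toy example with hypothetical probabilities (not the paper's metamodel):

```python
def correct_phoneme(recognized, confusion, priors):
    """Pick the spoken phoneme s maximizing
    P(s | recognized) ∝ P(recognized | s) * P(s)."""
    scores = {s: confusion[s].get(recognized, 0.0) * priors[s]
              for s in confusion}
    return max(scores, key=scores.get)

# hypothetical confusion matrix: this speaker often realizes /t/ as /d/
confusion = {
    "t": {"t": 0.4, "d": 0.6},
    "d": {"d": 0.9, "t": 0.1},
    "s": {"s": 1.0},
}
priors = {"t": 0.5, "d": 0.2, "s": 0.3}
best = correct_phoneme("d", confusion, priors)  # recognizer output was /d/
```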