Hermite functions are an effective tool for improving the resolution of the single-window spectrogram. In this paper, we analyze the Hermite functions in the ambiguity domain and show that the higher order terms can introduce undesirable cross-terms in the multiwindow spectrogram. The optimal number of Hermite functions depends on the location and spread of signal auto-terms in the ambiguity domain...
It has been previously shown that, when both acoustic and articulatory training data are available, it is possible to improve phonetic recognition accuracy by learning acoustic features from this multi-view data with canonical correlation analysis (CCA). In contrast with previous work based on linear or kernel CCA, we use the recently proposed deep CCA, where the functional form of the feature mapping...
The robustness of phoneme classification to white Gaussian noise and pink noise in the acoustic waveform domain is investigated using support vector machines. We focus on the problem of designing kernels which are tuned to the physical properties of speech. For comparison, results are reported for the PLP representation of speech using standard kernels. We show that major improvements can be achieved...
Using the framework of Reproducing Kernel Hilbert Spaces, we develop a new sequence kernel that measures similarity between sequences of observations. We then apply it to a text-independent speaker verification task using the NIST 2004 Speaker Recognition Evaluation database. The results show that incorporating our new sequence kernel in an SVM training architecture not only yields performance significantly...
In this paper, we compare diverse hierarchical and multi-class approaches to the speech/music segmentation task, based on Support Vector Machines combined with median filter post-processing. We show the efficiency of kernel tuning through the novel Kernel Target Alignment criterion. Quantitative results provide an F-measure of 96.9%, which represents an error reduction of about 50% compared to the...
A nonlinear speech signal decomposition based on the Volterra-Wiener functional series is described. A solution to the phoneme recognition problem by means of measuring Wiener kernels is proposed.
In this work, a novel linear transformation on the speech subspace is used to preserve the properties of the speech signal under stress conditions. It is assumed that a subspace, called the speech subspace, exists and contains the properties of the speech signal under both neutral and stress conditions. Therefore, the speech component of stressed speech is determined by linear transformation...
Automatic Speech Emotion Recognition (SER) is a current research topic in the field of Human Computer Interaction (HCI) with a wide range of applications. Speech features such as Mel Frequency Cepstral Coefficients (MFCC) and Mel Energy Spectrum Dynamic Coefficients (MEDC) are extracted from the speech utterance. LIBSVM is used as the classifier to identify different emotional states such as anger,...
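The median filter post-processing mentioned in the abstract above can be illustrated with a minimal sketch. It assumes per-frame classifier decisions encoded as 0 (speech) and 1 (music), an odd window width, and edge replication at the boundaries; the encoding, the function name `median_filter`, and the sample labels are hypothetical and not taken from the paper.

```python
from statistics import median


def median_filter(labels, width=5):
    """Smooth a sequence of frame labels with a sliding median of odd width."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    if not labels:
        return []
    half = width // 2
    # Replicate the first and last labels so every window has exactly `width` items.
    padded = [labels[0]] * half + list(labels) + [labels[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(labels))]


# Hypothetical frame-level decisions: 0 = speech, 1 = music.
frame_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
smoothed = median_filter(frame_labels, width=3)
print(smoothed)
# → [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Isolated single-frame decisions are removed while longer runs survive, which is the usual motivation for this kind of smoothing in segmentation tasks.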
An important task in machine learning and natural language processing is to learn to recognize different types of human speech, including humor, sarcasm, insults, and profanity. In this paper we describe our method to produce test and training data sets to assist in this task. Our test data sets are taken from the domain of free, libre, and open source software (FLOSS) development communities. We...
The Nyström method is an efficient technique for scaling kernel learning to very large data sets with millions of examples. Instead of computing the full kernel matrix, it approximates a kernel learning problem with a linear prediction problem. We propose an ensemble Nyström method for high dimensional prediction of conflict level from speech. The experiments have been conducted on the SSPNet Conflict Corpus,...
Human-machine interaction is one of the most rapidly growing areas of research in the field of information technology. To date, a majority of research in this field has been conducted using unimodal and multimodal systems with asynchronous data. As a result, improper synchronization has become a common problem, due to which the system complexity increases and the system response time...
Steganography is the practice of hiding information so that data remains safe and untouched by eavesdroppers. In this paper we demonstrate a way to transmit data from sender to receiver without it being intercepted by an eavesdropper, using a new steganography technique. We use an audio file to hide our data, as changes made to audio are hard to notice. Audio files in WAV form are represented...
Surgical video recording is widely used in operating rooms for purposes such as analyzing surgical procedures and detecting intraoperative incidents. As a result, many useful operation video records are stored in hospitals. These video records are considered to contain significant information, so there is a need to make use of this video data. In awake craniotomy, which is one of the advanced neurological...
Recently, Voice Activity Detection (VAD) algorithms based on machine learning techniques have shown impressive results in the area of speech recognition. In this paper, we present a case study and discuss the performance of VAD based on Support Vector Machines (SVM) for a Distributed Speech Recognition (DSR) system. In this case study, the speech and non-speech frames are detected from the...
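The idea of replacing a kernel problem with a linear one can be sketched with a plain (non-ensemble) Nyström feature map. This is an illustration under stated assumptions, not the ensemble method of the paper: the RBF kernel, the value of `gamma`, the random data, and the uniformly sampled landmarks are all hypothetical choices. The features `z` are built so that `z @ z.T` approximates the full kernel matrix, and a linear model trained on `z` then stands in for the kernel model.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_features(x, landmarks, gamma=0.5, eps=1e-10):
    """Map x to features z such that z @ z.T equals C W^+ C^T."""
    w = rbf_kernel(landmarks, landmarks, gamma)  # m x m landmark kernel
    c = rbf_kernel(x, landmarks, gamma)          # n x m cross kernel
    vals, vecs = np.linalg.eigh(w)
    keep = vals > eps                            # drop near-singular directions
    w_inv_sqrt = vecs[:, keep] / np.sqrt(vals[keep])
    return c @ w_inv_sqrt                        # n x r, with r <= m


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))                    # hypothetical data
landmarks = x[rng.choice(200, size=40, replace=False)]

z = nystrom_features(x, landmarks)
k_exact = rbf_kernel(x, x)
rel_err = np.linalg.norm(k_exact - z @ z.T) / np.linalg.norm(k_exact)
print(z.shape, rel_err)
```

Only the 200 x 40 cross kernel and the 40 x 40 landmark kernel are computed here, rather than the full 200 x 200 matrix; the exact matrix is formed in the example solely to measure the approximation error.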
This paper proposes a novel approach for automatic estimation of four important traits of speakers, namely age, height, weight and smoking habit, from speech signals. In this method, each utterance is modeled using the i-vector framework which is based on the factor analysis on Gaussian Mixture Model (GMM) mean supervectors, and the Non-negative Factor Analysis (NFA) framework which is based on a...
The effect of channel interference on identification results is prevalent among existing speaker recognition algorithms. To improve the accuracy of the algorithm, the paper uses latent factor analysis (LFA) to deal with the channel factors in the speaker's Gaussian Mixture Model (GMM). In the endpoint detection phase of speaker recognition, the algorithm introduces the...
Speech recognition is an important problem in the pattern recognition research field. In this paper, the kernel ridge regression method is applied to the MFCC feature vectors of the speech dataset available from the IC Design lab at the Faculty of Electricals-Electronics Engineering, University of Technology, Ho Chi Minh City. Experimental results show that the kernel ridge regression method outperforms...
The maximum likelihood linear regression (MLLR) technique is a well-known approach to parameter adaptation in hidden Markov model (HMM)-based systems. In this paper, we propose the maximum penalized likelihood kernel regression (MPLKR) approach as a novel adaptation technique for HMM-based speech synthesis. The proposed algorithm performs a nonlinear regression between the mean vector of the base...
This paper presents an anthropomorphic robotic bear for the exploration of human-robot interaction including verbal and non-verbal communications. This robot is implemented with a hybrid face composed of a mechanical faceplate with 10 DOFs and an LCD-display-equipped mouth. The facial emotions of the bear are designed based on the description of the Facial Action Coding System as well as some animal-like...
An automatic Language Identification (LID) system is designed to recognize a language from a given spoken utterance. The spoken utterances are classified according to a pre-defined set of languages. In this paper, an LID system is designed for two languages, namely English and French. The classification of an audio sample is done by extracting Mel frequency cepstral coefficients (MFCCs)...