Acoustic Event Detection plays an important role in computational acoustic scene analysis. Although overlapping sounds are common in real situations, conventional methods do not address this problem sufficiently. In this paper, we propose a new overlapped acoustic event detection technique that combines a source separation technique, Non-negative Matrix Factorization with shared basis vectors...
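As a rough illustration of how shared basis vectors can enter an NMF-based front end (the abstract is truncated, so the factorization below is an assumed minimal variant and all names are hypothetical), per-event bases and a shared basis can be fixed while only the activations are estimated:

```python
import numpy as np

def nmf_activations(V, W, n_iter=100, eps=1e-9):
    """Estimate activations H >= 0 for a fixed basis W so that V ~= W @ H.

    Multiplicative updates for the Euclidean cost (Lee & Seung).
    V: (freq, time) magnitude spectrogram; W: (freq, n_bases).
    """
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def separate_events(V, W_events, W_shared):
    """Reconstruct each event's contribution to the mixture V.

    W_events: list of (freq, r_k) bases, one per event, learned from
    isolated examples; W_shared: (freq, r_s) basis shared across events.
    """
    W = np.hstack(W_events + [W_shared])
    H = nmf_activations(V, W)
    parts, start = [], 0
    for Wk in W_events:
        r = Wk.shape[1]
        parts.append(Wk @ H[start:start + r])
        start += r
    return parts  # per-event magnitude estimates
```

Components explained by the shared basis are absorbed there instead of being forced into one event's reconstruction, which is the intuition behind sharing bases across overlapping events.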
Recent studies have shown that phase information contains speaker-dependent characteristics and is effective for speaker recognition. In this paper, we summarize a robust phase feature extracted from the Fourier spectrum (including pitch-non-synchronized phase information and pseudo-pitch-synchronized phase information) and its application to speaker recognition for speech with different speaking rates and...
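Raw STFT phase depends on where each analysis frame happens to cut the waveform, so phase features of this family are usually normalized first. A commonly used normalization (this specific form is an assumption here, since the truncated abstract does not spell it out) keeps the phase at a chosen base frequency $\omega_b$ equal to a constant $\theta$ and shifts every other frequency proportionally:

$$\tilde{\phi}(\omega) = \phi(\omega) + \frac{\omega}{\omega_b}\bigl(\theta - \phi(\omega_b)\bigr),$$

with the features then taken as $\cos\tilde{\phi}$ and $\sin\tilde{\phi}$ to avoid $2\pi$ wrapping issues.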
We describe a scheme to translate spoken English lectures into Japanese, consisting of a deep neural network based English automatic speech recognition (ASR) system and an English-to-Japanese phrase-based statistical machine translation (SMT) system. We focus on the adverse influence of speech misrecognition on the translation model. To cope with this influence, we utilized...
Phase information is ignored in almost all voice activity detection (VAD) methods. To exploit the full information in the original signal, this paper proposes a deep neural network (DNN) that uses both magnitude and phase information (that is, a phase-aware DNN) to achieve better VAD performance. Mel-frequency cepstral coefficients (MFCC), power-normalized cepstral coefficients (PNCC), the instantaneous frequency derivative...
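Of the inputs listed, the instantaneous frequency derivative is the least standard. One common way to obtain instantaneous frequency from STFT phase (an assumed formulation, since the abstract is cut off before defining the feature) is sketched below; a derivative of this quantity along time or frequency would give the feature the abstract names.

```python
import numpy as np

def instantaneous_frequency_deviation(stft, hop, n_fft):
    """Per-bin instantaneous frequency from STFT phase.

    stft: complex array (n_fft // 2 + 1, frames); hop: hop size in samples.
    Returns each bin's frame-to-frame phase increment with the expected
    advance of the bin's center frequency removed, wrapped to [-pi, pi).
    A further np.diff along either axis yields a derivative feature.
    """
    phase = np.angle(stft)
    dphase = np.diff(phase, axis=1)              # phase increment per frame
    bins = np.arange(stft.shape[0])[:, None]
    expected = 2 * np.pi * hop * bins / n_fft    # expected advance per bin
    dev = dphase - expected
    return np.mod(dev + np.pi, 2 * np.pi) - np.pi
```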
The detection of human versus spoofed (synthetic or converted) speech has started to receive an increasing amount of attention. In this paper, modified relative phase (MRP) information extracted from the Fourier spectrum is proposed for spoofing speech detection. Because original phase information is almost entirely lost in spoofed speech produced by current synthesis or conversion techniques, some phase information...
One of the difficulties in sung speech recognition is the small distance in acoustic space between phonemes in sung speech. We therefore considered clustering the speech based on pitch (fundamental frequency, F0) to create a larger distance between the phonemes. In addition, we considered a two-stage training method for the DNN-HMM: the first stage is trained using conventional acoustic features...
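The truncated abstract does not give the clustering criterion; a minimal sketch, assuming the simplest variant of bucketing frames by F0 range and training one model per bucket, looks like this (the bucket edges are hypothetical):

```python
import numpy as np

def f0_cluster_ids(f0_per_frame, edges=(130.0, 260.0)):
    """Assign each frame to an F0 bucket (hypothetical 3-way split).

    f0_per_frame: F0 in Hz per frame (0 for unvoiced frames).
    Returns 0/1/2 for low/mid/high F0, and -1 for unvoiced frames.
    """
    f0 = np.asarray(f0_per_frame, dtype=float)
    ids = np.digitize(f0, edges)
    ids[f0 <= 0] = -1
    return ids
```

One acoustic model would then be trained per bucket, with the first-stage model plausibly initializing the second stage in the two-stage training the abstract mentions.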
Deep neural networks (DNNs) have achieved significant success in the field of speech recognition. One of the main advantages of a DNN is automatic feature extraction without human intervention. Therefore, we incorporate a pseudo-filterbank layer at the bottom of the DNN and train the filterbank layer and the following networks jointly, whereas most systems take pre-defined mel-scale filterbanks as...
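A minimal PyTorch sketch of this idea, assuming the pseudo-filterbank layer is a non-negative linear map from the power spectrum trained jointly with the layers above it (the paper's exact architecture and initialization are not given in the truncated abstract):

```python
import torch
import torch.nn as nn

class FilterbankLayer(nn.Module):
    """Learnable pseudo-filterbank: power spectrum -> log filterbank outputs.

    Weights are clamped non-negative so each row remains a valid filter
    shape; a faithful reproduction would initialize them from a mel
    filterbank, while random initialization keeps this sketch self-contained.
    """
    def __init__(self, n_bins=257, n_filters=40):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(n_filters, n_bins) * 0.1)

    def forward(self, power_spec):               # (batch, n_bins)
        w = torch.clamp(self.weight, min=0.0)
        return torch.log(power_spec @ w.t() + 1e-6)

# Joint training: the filterbank layer sits at the bottom of the DNN and
# receives gradients from the acoustic-model layers above it.
model = nn.Sequential(
    FilterbankLayer(),
    nn.Linear(40, 512), nn.ReLU(),
    nn.Linear(512, 2000),                        # e.g., tied-state targets
)
```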
Speech emotion recognition is still a challenging problem despite having been investigated over the last couple of decades. The performance of conventional speech emotion recognition is low, but it may be improved by considering new features and a new annotation method. In this paper, we first use glottal features for speech emotion recognition to improve its performance, because emotions are related...
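The truncated abstract does not say how the glottal features are obtained; a common starting point, sketched here only as an assumed baseline, is to estimate the glottal excitation by LPC inverse filtering and derive features from the residual:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def glottal_residual(y, sr, order=None):
    """Rough glottal-source estimate via LPC inverse filtering.

    Fits an all-pole vocal-tract model to the speech and applies its
    inverse filter; the prediction residual approximates the glottal
    excitation, from which emotion-related features could be derived.
    """
    order = order or int(sr / 1000) + 2      # rule-of-thumb LPC order
    a = librosa.lpc(y, order=order)          # a[0] == 1
    return lfilter(a, [1.0], y)              # inverse-filter residual
```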
This paper describes our scheme to translate spoken English lectures into Japanese, consisting of an English automatic speech recognition (ASR) system that utilizes a deep neural network (DNN) and an English-to-Japanese phrase-based statistical machine translation (SMT) system. We focused on domain adaptation of the acoustic and translation models. For domain adaptation of the translation model, frequently...
Recently, spoken dialog systems using speech recognition technology have become popular. Systems that do not have any specific purpose, such as chat-like dialog systems, are called “non-task-oriented spoken dialog systems”. In this study, we focused on non-task-oriented spoken dialog systems. We have developed a multi-party chat-like dialog system with different preferences between two...
Lyric recognition in singing is challenging because of a number of problems, including a lack of singing databases, superimposed musical instruments, and differing spectral variations. First, we investigated the differences in spectral variation among read speech, spontaneous speech, and sung speech, and found that sung speech was the most difficult to recognize. Next, we considered Japanese lyric...
In speech recognition, it is preferable not to make assumptions about the details of a target user, e.g., specific age and gender. However, speaker independence is one of the factors that degrade ASR performance. In this work, we propose a speaker adaptation method for recognizing short utterances. There have been several studies on speaker-independent DNN-HMMs in which an i-vector is computed, and the additional...
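The truncated sentence points to the standard recipe of feeding an utterance-level i-vector to the network as auxiliary input; a minimal sketch of that recipe (the paper's exact variant is an assumption here) simply appends the i-vector to every frame's feature vector:

```python
import numpy as np

def append_ivector(frames, ivector):
    """Append one utterance-level i-vector to every frame.

    frames: (n_frames, feat_dim), e.g., spliced MFCCs;
    ivector: (ivec_dim,). Returns a (n_frames, feat_dim + ivec_dim)
    array to be used as DNN input.
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.hstack([frames, tiled])
```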
This paper presents a Japanese spoken term detection method for spoken queries using a combination of word-based search and syllable-based N-gram search with in-vocabulary/out-of-vocabulary (IV/OOV) term classification. The N-gram index built from a recognized syllable lattice for OOV terms, which allows for recognition errors such as substitutions, insertions, and deletions, incorporates a distance...
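As a simplified sketch of such an index (using 1-best syllable sequences where the paper indexes a lattice, and with hypothetical names throughout), each syllable N-gram maps to its occurrences, and an OOV query converted to syllables is matched by shared N-grams:

```python
from collections import defaultdict

def build_ngram_index(utterances, n=3):
    """Map each syllable n-gram to its (utterance id, position) occurrences.

    utterances: dict of id -> recognized syllable sequence (1-best here;
    the paper's lattice indexing is not attempted in this sketch).
    """
    index = defaultdict(list)
    for uid, syls in utterances.items():
        for i in range(len(syls) - n + 1):
            index[tuple(syls[i:i + n])].append((uid, i))
    return index

def search(query_syls, index, n=3):
    """Rank utterances by the number of query n-grams they share; a real
    system would also score substitutions, insertions, and deletions."""
    hits = defaultdict(int)
    for i in range(len(query_syls) - n + 1):
        for uid, _ in index.get(tuple(query_syls[i:i + n]), []):
            hits[uid] += 1
    return sorted(hits.items(), key=lambda kv: -kv[1])
```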
We investigated speech recognition methods for mixed speech and music that remove only the music, based on non-negative matrix factorization (NMF). In this paper, we introduce the Euclidean distance of the logarithmic spectrum, D_LOG, as a distance measure for source separation, which may correspond to the distance measure used for speech recognition, and compare it with such traditional distance measures as the Kullback-Leibler...
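Under the abstract's own definition, the separation cost presumably has the form

$$D_{\mathrm{LOG}}(X, WH) = \sum_{f,t} \bigl( \log X_{f,t} - \log [WH]_{f,t} \bigr)^2,$$

to be contrasted with the generalized Kullback-Leibler divergence commonly used in NMF,

$$D_{\mathrm{KL}}(X, WH) = \sum_{f,t} \Bigl( X_{f,t} \log \frac{X_{f,t}}{[WH]_{f,t}} - X_{f,t} + [WH]_{f,t} \Bigr),$$

where $X$ is the mixture spectrogram and $WH$ its NMF approximation; the exact form of $D_{\mathrm{LOG}}$ used in the paper is an inference from its name, since the abstract is truncated.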
This paper presents our scheme to translate spoken English lectures into Japanese, which consists of an English automatic speech recognition (ASR) system that utilizes a deep neural network (DNN) and an English-to-Japanese phrase-based statistical machine translation (SMT) system. We utilized an existing Wall Street Journal corpus for our acoustic model and adapted it with MIT OpenCourseWare lectures...
Sensor networks capable of sensing multimedia data, including audio, are increasingly being used. Unfortunately, public use of such data is not allowed because it contains crucial private information such as person and location names. Person name extraction (PNE), a widely investigated research topic, is an effective technique for resolving this problem. However, there is an important difference...
Japanese is a syllabic language, and we have previously studied syllable-based GMM-HMMs for Japanese speech recognition. In this paper, we investigate the differences in recognition accuracy between phoneme- and syllable-based GMM-HMMs and deep neural network (DNN)-HMMs. First, we present a comparison of syllable-based and phoneme-based DNN-HMMs. Second, we train a tied-state, left-context-dependent syllable DNN-HMM,...
This paper presents our attempt to create an English automatic speech recognition (ASR) system and an English-to-Japanese statistical machine translation (SMT) system. We used MIT OpenCourseWare lectures as our test lecture corpus. A Wall Street Journal (WSJ) corpus adapted with MIT OpenCourseWare lectures was used for our acoustic model. MIT OpenCourseWare lecture transcriptions were utilized to create our language...
We have considered a speech recognition method for mixed sound, consisting of speech and music, that removes only the music, based on vector quantization (VQ) and non-negative matrix factorization (NMF). Instead of the conventional amplitude spectrum distance measure, an MFCC distance measure, which is not affected by pitch, is introduced. For isolated word recognition using a clean speech model, an improvement...
In conventional speaker identification methods based on mel-frequency cepstral coefficients (MFCCs), phase information is ignored. Recent studies have shown that phase information contains speaker-dependent characteristics and that pitch-synchronous phase information is more suitable for speaker identification. In this paper, we verify the effectiveness of pitch-synchronous phase information for speaker...
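For concreteness, the relative-phase normalization given earlier can be applied per frame as below; fixed framing stands in for the pitch-synchronous windowing this abstract describes, so the sketch is simplified and its parameter choices are hypothetical:

```python
import numpy as np

def relative_phase_features(frame, n_fft=512, base_bin=8):
    """Phase features normalized at a base frequency bin (cf. the
    relative-phase formula above, with theta = 0). Pitch-synchronous
    extraction would instead place windows at glottal epochs spaced by
    the pitch period; fixed framing is used here for simplicity.
    """
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    phi = np.angle(spec)
    bins = np.arange(len(phi))
    # Keep the phase at base_bin fixed at 0 and shift every other bin
    # proportionally to its frequency.
    phi_norm = phi - (bins / base_bin) * phi[base_bin]
    return np.cos(phi_norm), np.sin(phi_norm)   # wrap-free representation
```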