Perceptually distinguishing between the Mandarin alveolar nasal coda [n] and the velar nasal coda [ŋ] is difficult for native Japanese speakers learning Chinese as a second language (CSL). Discovering relations between acoustic cues and perceptual responses is important for studying CSL acquisition and computer-aided pronunciation teaching. In order to investigate the influence of nasal coda length on nasal perception...
To better understand the difficulties that second language (L2) learners have in identifying Japanese geminate/singleton consonants, a perceptual factor is newly introduced to compensate for the shortcomings of conventional explanations based solely on acoustic duration differences. To systematically explain serious speech-rate-related errors of geminate/singleton identification in fast/slow speech, loudness...
Focus is usually considered to serve a communicative function in discourse, and each language has its own ways of realizing it. This paper compares focus realization in the Jinan and Taiyuan dialects. It aims to investigate the similarities and differences in focus realization by examining the variations of mean F0, duration and intensity in both focused and unfocused conditions between these...
Context-dependent pronunciation, e.g. of homographs, is a difficult issue in grapheme-to-phoneme conversion (G2P). It degrades accuracy in speech synthesis and speech recognition. However, context-dependent pronunciation is rarely considered when collecting pronunciation corpora for evaluating G2P accuracy. Thus, this paper proposes a context-dependent pronunciation corpus using grapheme-phoneme...
This plenary presents automatic speech recognition (ASR) as a task of artificial intelligence. It covers the basis, the methodology, spectral processing, distance measures for speech, speech segmentation, spectral and temporal variability, the application of Markov models, noise robustness, and language models for ASR.
In order for robots to work alongside humans in a range of domains, they will need to operate with a variety of social dynamics that each context will require. This paper builds on previous work with a parameterized turn-taking model, CADENCE, in which different parameter settings resulted in different social dynamics. In contrast to the static parameter settings of previous work, we now investigate...
An objective of an autonomous sociable robot is to meet the needs and preferences of a human user. However, this can sometimes be at the expense of the robot's own ability to understand social signals produced by the user. In particular, human preferences of distance (proxemics) to the robot can have a significant impact on the performance rates of its automated speech and gesture recognition systems...
Statistical aspects of Marko Cheremshyna's idiolect are one of the main research focuses of the applied linguistics department. They include letter frequency, word length, the number and percentage of words of different parts of speech, the most frequent content words and bigrams, and the frequency of character combinations in the text. In this article we outline the part-of-speech aspect of our research. Some statistic...
Vowel durations are most often utilized in studies addressing specific issues in phonetics. Thus far this has been hampered by a reliance on subjective, labor-intensive manual annotation. Our goal is to build an algorithm for automatic, accurate measurement of vowel duration, where the input to the algorithm is a speech segment that contains one vowel preceded and followed by consonants (CVC). Our algorithm...
Human emotional expression tends to evolve in a structured manner, in the sense that certain emotional evolution patterns, e.g., anger to anger, are more probable than others, e.g., anger to happiness. Furthermore, the perception of an emotional display can be affected by recent emotional displays. Therefore, the emotional content of past and future observations could offer relevant temporal context...
Conversational skills training is becoming popular nowadays but is often hard to obtain due to expense and lack of accessibility. In this paper, we present the idea of an automated conversational skills training assistant, which provides both real-time feedback and a post-conversation summary as the user converses with a virtual agent. Our exploratory effort shows the applicability of this system and significant...
Thanks to its remarkable ability to convey amusement and engagement, laughter is one of the most important social markers in human interactions. Laughing together can help to set up a positive atmosphere and favor the creation of new relationships. This paper presents a data collection of social interaction dialogs involving humor between a human participant and a robot. In this work,...
This paper focuses on using bigrams for topic determination in a speech synthesizer. It explains a modular architecture for the speech synthesizer and the importance of context analysis for customizing and enhancing the quality of synthesized speech. Bigrams carry information about context, and this work shows how to use them to improve the identification of the theme. At...
In this paper, we address the problem of automatic detection of engagement in multi-party Human-Robot Interaction scenarios. The aim is to investigate to what extent we are able to infer the engagement of one of the entities of a group based solely on the cues of the other entities present in the interaction. In a scenario featuring three entities, two participants and a robot, we extract behavioural...
Speech is widely used to express one's emotion, intention, desire, etc. in social network communication, producing abundant internet speech data with different speaking styles. Such data provide a good resource for social multimedia research. However, because different styles are mixed together in internet speech data, classifying such data remains a challenging problem. In previous...
Conventional music coders based on a modified discrete cosine transform (MDCT) suffer greatly when their bit-rate and delay are lowered. In particular, tonal music signals are penalized by short analysis windows, and the variable length coding of the quantized MDCT coefficients demands a significant number of bits for coding the harmonic structure. To solve this issue, the paper proposes a frequency-domain...
Deep neural networks (DNNs) have achieved great success in automatic speech recognition (ASR), and a DNN can be regarded as a joint model combining a nonlinear feature transformation and a log-linear classifier. Recently, DNNs have been adopted as regression models to enhance distorted features in noisy conditions, and the enhanced features are used to improve the performance of DNN-based ASR. Previous work...
Synthetic speech refers to speech signals generated by text-to-speech (TTS) and voice conversion (VC) techniques. It poses a threat to speaker verification (SV) systems, as an attacker may use TTS or VC to synthesize a speaker's voice to deceive the SV system. To address this challenge, we study the detection of synthetic speech using long-term magnitude and phase information of speech. As most of...
In storytelling style, a storyteller generally uses prosodic variations with subtle speech nuances to aid the listeners' comprehension. This is achieved by emphasizing prominent words, using various emotions, mimicking voices and providing appropriate pauses. This work is part of building Story Text-to-Speech (TTS) [1] synthesis systems in Indian languages, which aims at synthesizing...
In the current social, technological and economic context, customers make their decisions based mostly on the opinions of other consumers. On the other hand, companies need quick feedback from their customers in order to adapt to their needs in real time. The effective connection between these two aspects relies on opinion mining tools, which automatically process consumers' reviews and opinions about...