Generating diverse questions for given images is an important task for computational education, entertainment and AI assistants. Different from many conventional prediction techniques is the need for algorithms to generate a diverse set of plausible questions, which we refer to as creativity. In this paper we propose a creative algorithm for visual question generation which combines the advantages...
Image captioning often requires a large set of training image-sentence pairs. In practice, however, acquiring sufficient training pairs is always expensive, making the recent captioning models limited in their ability to describe objects outside of training corpora (i.e., novel objects). In this paper, we present Long Short-Term Memory with Copying Mechanism (LSTM-C) — a new architecture...
Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion,...
Reliable visual features that encode the articulator movements of speakers can dramatically improve the decoding accuracy of automatic speech recognition systems when combined with the corresponding acoustic signals. In this paper, a novel framework is proposed to utilize audio-visual speech not only during decoding but also for training better acoustic models. In this framework, a multi-stream hidden...
Introducing features that better represent the visual information of speakers during the speech production is still an open issue that highly affects the quality of the lip-reading and Audio Visual Speech Recognition (AVSR) tasks. In this paper, three different types of visual features from both the image-based and model-based ones are investigated inside a professional lip reading task. The simple...
Audio-visual speech recognition is a promising approach to tackling the problem of reduced recognition rates under adverse acoustic conditions. However, finding an optimal mechanism for combining multi-modal information remains a challenging task. Various methods are applicable for integrating acoustic and visual information in Gaussian-mixture-model-based speech recognition, e.g., via dynamic stream...
In this paper, we present an expressive visual text to speech system (VTTS) based on a deep neural network (DNN). Given an input text sentence and a set of expression tags, the VTTS is able to produce not only the audio speech, but also the accompanying facial movements. The expressions can either be one of the expressions in the training corpus or a blend of expressions from the training corpus....
Traditional visual speech recognition systems consist of two stages, feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification is very limited. In this work, we present an end-to-end...
Methods for action recognition have evolved considerably over the past years and can now automatically learn and recognize short term actions with satisfactory accuracy. Nonetheless, the recognition of complex activities - compositions of actions and scene objects - is still an open problem due to the complex temporal and composite structure of this category of events. Existing methods focus either...
This paper discusses the problem of one-shot gesture recognition. This is relevant to the field of human-robot interaction, where the user's intentions are indicated through spontaneous gesturing (one shot) to the robot. The novelty of this work consists of learning the process that leads to the creation of a gesture, rather than on the gesture itself. In our case, the context involves the way in which...
In this paper, we propose the deep neural network - switching Kalman filter (DNN-SKF) based frameworks for both single modal and multi-modal continuous affective dimension estimation. The DNN-SKF framework firstly models the complex nonlinear relationship between the input (audio, visual, or lexical) features and the affective dimensions via the non-recurrent DNN, then models the temporal dynamics...
Generating semantic descriptions has drawn increasing attention recently. Describing objects with adaptive adjunct words makes the sentence more informative. In this paper, we focus on the generation of descriptions for images according to the structural words we have generated, such as a tetrad of <object, attribute, activity, scene>. We propose to use a deep machine translation method to generate semantically...
We describe an end-to-end generative approach for the segmentation and recognition of human activities. In this approach, a visual representation based on reduced Fisher Vectors is combined with a structured temporal model for recognition. We show that the statistical properties of Fisher Vectors make them an especially suitable front-end for generative models such as Gaussian mixtures. The system...
In movement analysis frameworks, body pose may often be adequately represented in a simple, low-dimensional, and high-level space, while full body joints' locations constitute excessively redundant and complex information. We propose a method for estimating body pose in such high-level pose spaces, directly from a depth image and without relying on intermediate skeleton-based steps. Our method is...
The current study examines how adequate coordination among different cognitive processes including visual recognition, attention switching, action preparation and generation can be developed via learning of robots by introducing a novel model, the Visuo-Motor Deep Dynamic Neural Network (VMDNN). The proposed model is built on coupling of a dynamic vision network, a motor generation network, and a...
The term “World Englishes” describes the current, real state of English, and one of its main characteristics is a large diversity of pronunciation, called accents. We have developed two techniques: individual-based clustering of the diversity [1, 2] and educationally effective visualization of the diversity [3]. Accent clustering requires a technique to quantify the accent gap between any...
Surveillance systems require advanced algorithms able to make decisions without a human operator or with minimal assistance from human operators. In this paper we propose a novel approach for dynamic topic modeling to detect abnormal behaviour in video sequences. The topic model describes activities and behaviours in the scene assuming behaviour temporal dynamics. The new inference scheme based on...
In this paper we propose an improvement of a human action recognition method that uses a string-based representation and a string edit distance to compare the observed action with reference actions in the training set. In particular, the original improvement is based on a specific formulation of the string edit distance that is more suited to take into account the problems related to noise and to...
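As a rough illustration of the string-based comparison mentioned in the abstract above, the following sketch computes the standard Levenshtein edit distance with unit costs and assigns an observed string the label of the closest reference. This is not the specific edit-distance formulation proposed in that paper, which the truncated abstract does not spell out; the symbol strings, the labels, and the function names `edit_distance` and `classify` are hypothetical.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance with unit insert/delete/substitute costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def classify(observed, references):
    """Return the label whose reference string is closest to the observation.

    `references` maps a label to one reference string (an assumption made
    for brevity; a real training set would hold many strings per label).
    """
    return min(references, key=lambda label: edit_distance(observed, references[label]))


# Hypothetical symbol strings standing in for encoded action sequences.
references = {"wave": "AABBA", "jump": "CCDDC"}
label = classify("AABA", references)
```

A noise-aware variant would typically replace the unit costs with symbol-dependent weights, which is one place where a formulation like the one the abstract describes could differ from this sketch.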
This paper proposes methods of using restricted Boltzmann machines (RBM) to generate the sequence of lip images for visual speech synthesis. The aim of our proposed methods is to alleviate the over-smoothing effect of the conventional hidden Markov model (HMM) based statistical approach for lip synthesis. Two model structures using RBMs to model and generate lip movements are investigated in this...
In this paper, we address an exemplar-based hidden Markov model (HMM) that represents the lip motion activity using visual cues for lipreading. The discriminative visual features, including the geometric shape parameters and contour-constrained spatial histogram, are selected for representing each lip frame. Then, a set of exemplars associated with the HMM is learned jointly to serve as a typical representation...