Convolutional neural network (CNN) based trackers have recently achieved strong tracking performance. Most existing CNN-based trackers treat tracking as either a classification problem or a similarity-search problem. The two formulations have complementary strengths and limitations because of their different supervised objectives. In this paper, we propose a multi-task CNN for visual tracking, not only fully...
We propose ‘Hide-and-Seek’, a weakly-supervised framework that aims to improve object localization in images and action localization in videos. Most existing weakly-supervised methods localize only the most discriminative parts of an object rather than all relevant parts, which leads to suboptimal performance. Our key idea is to hide patches in a training image randomly, forcing the network to seek...
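The core of the Hide-and-Seek idea is simple enough to sketch: divide each training image into a grid and randomly blank out patches so the network cannot rely solely on the most discriminative part. The sketch below is illustrative only; the grid size, hiding probability, and fill value are assumptions, not the paper's exact settings.

```python
import numpy as np

def hide_patches(image, grid=4, p_hide=0.5, fill=0.0, rng=None):
    """Randomly hide grid cells of a training image (Hide-and-Seek sketch).

    Each of the grid x grid cells is replaced by `fill` with probability
    `p_hide`, forcing a localization network to look beyond the most
    discriminative region.
    """
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            if rng.random() < p_hide:
                out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = fill
    return out
```

Note that hiding is applied only at training time; at test time the full image is shown, which is what makes the scheme weakly supervised rather than an architectural change.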
In this paper, we address the problem of spatio-temporal person retrieval from videos using a natural language query, in which we output a tube (i.e., a sequence of bounding boxes) which encloses the person described by the query. For this problem, we introduce a novel dataset consisting of videos containing people annotated with bounding boxes for each second and with five natural language descriptions...
This article shares the results obtained in the teacher-training phase for the analysis, development, and publication of accessible courses using the ATutor learning management platform. This training phase was carried out within the framework of a research project: “Didactic and technological development in teaching scenarios for the training of teachers who welcome diversity: factors for...
We present an unsupervised representation learning approach using videos without semantic labels. We leverage the temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based...
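The sequence-sorting pretext task described above can be sketched as a data-preparation step: sample a few frames in chronological order, apply a random permutation, and ask the network to classify which permutation was applied. The function below is a minimal sketch under assumed names and sampling choices; it is not the paper's exact pipeline.

```python
import itertools
import random

def make_sorting_sample(frames, tuple_len=4, rng=random):
    """Build one training sample for the sequence-sorting pretext task.

    Samples `tuple_len` frames in chronological order, shuffles them with a
    random permutation, and returns the shuffled frames together with the
    permutation index the network must predict (tuple_len! classes).
    """
    idx = sorted(rng.sample(range(len(frames)), tuple_len))
    perms = list(itertools.permutations(range(tuple_len)))
    label = rng.randrange(len(perms))
    perm = perms[label]
    shuffled = [frames[idx[k]] for k in perm]
    return shuffled, label
```

Treating the permutation index as a class label turns temporal coherence into an ordinary classification loss, so no human annotation is needed.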
Recently, many have come to believe that learner-centered approaches such as active learning and cooperative learning improve learning outcomes and are more effective than traditional lectures. Moreover, in addition to paper-based materials such as textbooks, face-to-face co-located communication frequently utilizes digital video and other visual reference...
Understanding the simultaneously very diverse and intricately fine-grained set of possible human actions is a critical open problem in computer vision. Manually labeling training videos is feasible for some action classes but doesn't scale to the full long-tailed distribution of actions. A promising way to address this is to leverage noisy data from web queries to learn new actions, using semi-supervised...
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem – unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) a Watch, Listen, Attend and Spell...
This paper proposes efficient and powerful deep networks for action prediction from partially observed videos containing temporally incomplete action executions. Unlike after-the-fact action recognition, the action prediction task requires action labels to be predicted from these partially observed videos. Our approach exploits abundant sequential context information to enrich the feature representations...
Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances...
This paper presents a novel frame-pair based method for visual object tracking. Instead of adopting two-stream Convolutional Neural Networks (CNNs) to represent each frame, we stack frame pairs as the input, resulting in a single-stream CNN tracker with far fewer parameters. The proposed tracker can learn generic motion patterns of objects from far fewer annotated videos than previous methods. Besides,...
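The frame-pair input described above amounts to concatenating two consecutive frames along the channel axis, so a single-stream CNN sees a 6-channel tensor instead of two separate 3-channel streams. The snippet below is a sketch of that input construction under assumed names; the paper's exact preprocessing (cropping, normalization) may differ.

```python
import numpy as np

def stack_frame_pair(frame_t, frame_tp1):
    """Stack two consecutive HxWx3 frames along the channel axis.

    The resulting HxWx6 tensor lets a single-stream CNN observe motion
    between the frames without a separate optical-flow stream.
    """
    assert frame_t.shape == frame_tp1.shape, "frames must have equal shape"
    return np.concatenate([frame_t, frame_tp1], axis=-1)
```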
The successful deep convolutional neural networks for visual object recognition typically rely on a massive number of training images that are well annotated by class labels or object bounding boxes with great human efforts. Here we explore the use of the geographic metadata, which are automatically retrieved from sensors such as GPS and compass, in weakly-supervised learning techniques for landmark...
We present VideoWhisper, a novel approach for unsupervised video representation learning, in which a video sequence is treated as a self-supervision entity, based on the observation that the sequence encodes video temporal dynamics (e.g., object movement and event evolution). Specifically, for each video sequence, we use a pre-learned visual dictionary to generate a sequence of high-level semantics,...
This paper presents a framework for saliency estimation and fixation prediction in videos. The proposed framework is based on a hierarchical feature representation obtained by stacking convolutional layers of independent subspace analysis (ISA) filters. The feature learning is thus unsupervised and independent of the task. To compute the saliency, we then employ a multiresolution saliency architecture...
Deep visual attention has attracted considerable interest in computer vision over the past years, contributing notably to image classification, image captioning, and action recognition. However, because these models rely wholly or partially on backpropagation (BP) training, they cannot realize the full potential of attention in computational efficiency and focusing accuracy. Our intuition is that the attention mechanism should...
Human activity detection is an important area of research in computer vision. This paper focuses on recognizing activities performed by construction personnel at construction sites. The method uses a bag-of-features (BOF) approach to detect an activity. We consider five types of activities performed at construction sites, namely ladder climbing, brick laying, carpentry work, painting and plastering work...
Automatic transcriptions of consumer-generated multimedia content such as “YouTube” videos still exhibit high word error rates. Such data typically spans a very broad domain, has been recorded in challenging conditions with cheap hardware and a focus on the visual modality, and may have been post-processed or edited.
We present an approach to automatically generating verbal commentaries for tennis games. We introduce a novel application that requires a combination of techniques from computer vision, natural language processing and machine learning. A video sequence is first analysed using state-of-the-art computer vision methods to track the ball, fit the detected edges to the court model, track the players, and...
Over the last decades, visual representations of data have been a commonly used medium to bolster human cognition in the performance evaluation of professional athletes. However, current approaches to these visualizations still build upon the paper-based principles of the initial designs, with solid backgrounds. As a result, these visualizations usually fail to provide explicit information about...
Motivated by the recent advances in human-robot interaction we present a new dataset, a suite of tools to handle it and state-of-the-art work on visual gestures and audio commands recognition. The dataset has been collected with an integrated annotation and acquisition web-interface that facilitates on-the-way temporal ground-truths for fast acquisition. The dataset includes gesture instances in which...