A referring expression is a kind of language expression that is used for referring to particular objects. To make the expression unambiguous, people often use attributes to describe the particular object. In this paper, we explore the role of attributes by incorporating them into both referring expression generation and comprehension. We first train an attribute learning model from visual objects...
Most current saliency detection methods overemphasize local contrast while ignoring the global features of the image. Local comparison can reflect the detailed characteristics of the image, but it cannot reflect the image's overall saliency. In this paper, a saliency detection model combining local and global features is proposed. Firstly, a local feature...
Previous works have suggested the role of scene information in directing gaze. The structure of a scene provides global contextual information that complements local object information in saliency prediction. In this study, we explore how scene envelopes such as openness, depth, and perspective affect visual attention in natural outdoor images. To facilitate this study, an eye tracking dataset is...
Human sketches are unique in being able to capture both the spatial topology of a visual object and its subtle appearance details. Fine-grained sketch-based image retrieval (FG-SBIR) leverages such fine-grained characteristics of sketches to conduct instance-level retrieval of photos. Nevertheless, human sketches are often highly abstract and iconic, resulting in severe misalignments...
As technology advances, computer vision plays an important role in enhancing smart computing systems that help people overcome obstacles in their daily lives. One common and troublesome problem is the limit of human memory, especially when it comes to remembering things such as the location of personal items. It is annoying for people to waste time finding lost items manually by recall or notes. This motivates...
This paper presents a novel strategy for visual SLAM with an enhanced data association method. Hypergraph theory and transformation were incorporated within cooperative visual SLAM. The research presented a synthetic approach to fulfill a cooperative data association and fusion strategy for multiple UAVs equipped with stereo vision cameras and faced with indistinct imaging, where conventional...
Person re-identification (re-id) aims to match people across non-overlapping camera views in a public space. It is a challenging problem because many people captured in surveillance videos wear similar clothes. Consequently, the differences in their appearance are often subtle and only detectable at the right locations and scales. Existing re-id models, particularly the recently proposed deep learning...
Crowd event detection techniques aim at solving real-world surveillance problems, such as detecting crowd anomalies and tracking a specific person in a highly dynamic crowd scene. In this paper, we propose an innovative texture-based analysis method to model crowd dynamics and use it to distinguish crowd behaviours. To describe complicated crowd scenes, homogeneous random features have been deployed...
This paper is concerned with the loop closure detection problem, which is one of the most critical parts of visual Simultaneous Localization and Mapping (SLAM) systems. Most state-of-the-art methods use hand-crafted features and bag-of-visual-words (BoVW) to tackle this problem. Recent developments in deep learning indicate that CNN features significantly outperform hand-crafted features for image...
This work aims to apply visual-attention modeling to attention-based video compression. During our comparison we found that eye-tracking data collected even from a single observer outperforms existing automatic models by a significant margin. Therefore, we offer a semiautomatic approach: using computer-vision algorithms and good initial estimation of eye-tracking data from just one observer to produce...
Visual relations, such as person ride bike and bike next to car, offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations...
We present a novel visual attention tracking technique based on Shared Attention modeling. By considering the viewer as a participant in the activity occurring in the scene, our model learns the loci of attention of the scene actors and uses them to augment image salience. We go beyond image salience: instead of only computing the power of image regions to pull attention, we also consider the strength...
Interpretability of deep neural networks (DNNs) is essential since it enables users to understand the overall strengths and weaknesses of the models, conveys an understanding of how the models will behave in the future, and how to diagnose and correct potential problems. However, it is challenging to reason about what a DNN actually does due to its opaque or black-box nature. To address this issue,...
Predicting the interestingness of media content remains an important but challenging research subject. The difficulty comes first from the fact that, besides being a high-level semantic concept, interestingness is highly subjective and its global definition has not yet been agreed upon. This paper presents the use of up-to-date deep learning techniques for solving the task. We perform experiments with both...
The bilinear convolutional neural network (BCNN) model, the state of the art in fine-grained image recognition, fails to distinguish categories with subtle visual differences. We design a novel BCNN model guided by user click data (C-BCNN) to improve performance by capturing both the visual and semantic content in images. Specifically, to deal with the heavy noise in large-scale click data,...
In this paper, we introduce Key-Value Memory Networks to a multimodal setting and a novel key-addressing mechanism to deal with sequence-to-sequence models. The proposed model naturally decomposes the problem of video captioning into vision and language segments, dealing with them as key-value pairs. More specifically, we learn a semantic embedding (v) corresponding to each frame (k) in the video,...
Convolutional neural networks (CNNs) have drawn increasing interest in visual tracking owing to their power in feature extraction. Most existing CNN-based trackers treat tracking as a classification problem. However, these trackers are sensitive to similar distractors because their CNN models mainly focus on inter-class classification. To address this problem, we use self-structure information of...
We propose a novel online Attentional Recurrent Neural Network (ARNN) model for visual tracking, which exploits the feature maps of a Convolutional Neural Network (CNN) inside a bounding box to identify whether the target is the one that appeared in previous frames. An attention mechanism is adopted both for different parts of targets and for different scales of object features. The former attention model is able...
Because traditional saliency detection methods produce imprecise and vague region boundaries, so that the detected object is not connected, this paper proposes image visual saliency feature extraction based on multi-scale tensor space. The method introduces the tensor space, using multiple low-level image features to construct the tensor space, after reducing dimension the image space structure and...
Pedestrian detection, as an important task in video surveillance and forensics applications, has been widely studied. However, its performance is unsatisfactory, especially in low-resolution conditions. In realistic scenarios, the size of pedestrians in the images is often small, and detection can be challenging. To solve this problem, this paper proposes a novel resolution-score discriminative...