This paper presents a context-aware object proposal generation method for stereo images. Unlike existing methods, which mostly rely on image-based or depth features to generate object candidates, we propose to incorporate additional geometric and high-level semantic context information into the proposal generation. Our method starts from an initial object proposal set and encodes objectness for each...
In this paper we propose an online multi-task learning algorithm for video concept detection. In particular, we extend the Efficient Lifelong Learning Algorithm (ELLA) in the following ways: a) we solve the objective function of ELLA using quadratic programming instead of solving the Lasso problem, b) we add a new label-based constraint that considers concept correlations, c) we use linear SVMs as...
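Point (a) above replaces ELLA's Lasso solve with quadratic programming. As a minimal hedged sketch of that reformulation (the solver, step sizes, and toy data below are illustrative assumptions, not the paper's implementation): splitting the code vector s = u − v with u, v ≥ 0 turns the L1 penalty into a linear term, yielding a bound-constrained QP solvable by, e.g., projected gradient descent.

```python
import numpy as np

def lasso_qp(A, b, lam=0.1, lr=0.01, steps=2000):
    """Solve min ||A s - b||^2 + lam * ||s||_1 as a bound-constrained QP.

    Split s = u - v with u, v >= 0, which makes the L1 term linear, then
    run projected gradient descent on the resulting quadratic program.
    """
    d = A.shape[1]
    u = np.zeros(d)
    v = np.zeros(d)
    for _ in range(steps):
        r = A @ (u - v) - b      # residual of the quadratic part
        g = 2 * A.T @ r          # its gradient w.r.t. (u - v)
        u = np.maximum(0.0, u - lr * (g + lam))   # project onto u >= 0
        v = np.maximum(0.0, v - lr * (-g + lam))  # project onto v >= 0
    return u - v

# Toy problem: b is generated by a sparse code over the columns of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
s_true = np.array([1.0, 0.0, 0.0, -0.5, 0.0])
b = A @ s_true
s_hat = lasso_qp(A, b, lam=0.05)
```

With a small penalty the recovered code stays close to the sparse ground truth; the same QP structure carries over when the residual term comes from ELLA's shared basis.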
A large number of images are available on online photo-sharing services along with rich meta-data, including tags, groups, and locations. For associating two domains of different modalities, e.g. images and tags, Canonical Correlation Analysis (CCA) and its extended methods are widely used. We employ a more flexible graph embedding method called Cross-Domain Matching Correlation Analysis (CDMCA),...
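As a hedged sketch of the classical CCA baseline mentioned above (not CDMCA itself; the toy "image" and "tag" features below are fabricated for illustration), the canonical correlations can be computed by whitening each view and taking the SVD of the cross-covariance:

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Classical CCA: whiten each view, then SVD the cross-covariance.

    Returns the canonical correlations (singular values in [0, 1]).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse square root of a symmetric PSD matrix via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(M, compute_uv=False)

# Toy "image" and "tag" features sharing one latent dimension z.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.column_stack([z, rng.standard_normal(500)])
Y = np.column_stack([z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)])
corrs = cca(X, Y)
```

The shared latent dimension yields one canonical correlation near 1, while the independent noise dimensions correlate only weakly.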
Image classification is a general visual analysis task based on the image content coded by its representation. In this research, we propose an image representation method based on perceptual shape features and their spatial distributions. A natural language processing concept, the N-gram, is adopted to generate a set of perceptual shape visual words for encoding image features. By combining...
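The N-gram step can be sketched minimally as sliding a window over a sequence of shape tokens and histogramming the resulting tuples (the token names here are hypothetical; the paper's actual perceptual shape words are not specified in this excerpt):

```python
from collections import Counter

def shape_ngrams(tokens, n=2):
    """Slide a window of size n over a token sequence to form N-grams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical perceptual shape tokens extracted along an image contour.
tokens = ["arc", "line", "corner", "arc", "line"]

bigrams = shape_ngrams(tokens, n=2)
histogram = Counter(bigrams)  # bag-of-N-grams image representation
```

The histogram over such N-grams then plays the role of a visual-word representation for the classifier.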
Which parts of an image evoke emotions in an observer? To answer this question, we introduce a novel problem in computer vision — predicting an Emotion Stimuli Map (ESM), which describes pixel-wise contribution to evoked emotions. Building a new image database, EmotionROI, as a benchmark for predicting the ESM, we find that the regions selected by saliency and objectness detection do not correctly...
Emotional factors usually affect users' preferences for and evaluations of images. Although affective image analysis has attracted increasing attention, three major challenges remain: 1) it is difficult to classify an image into a single emotion type, since different regions within an image can represent different emotions; 2) there is a gap between low-level features and high-level emotions...
Following the exponential deployment of surveillance systems across a wide range of geographic locations, the detection and representation of events have become critical elements in automated surveillance systems. In this paper, we present an extensive ontology framework for representing complex semantic events. The proposed ontology builds on the DOLCE ontology and relies on the linguistic and cognitive...
Convolutional Neural Networks (CNNs), which have nowadays dominated image analysis tasks, constitute feed-forward methods that model increasingly complex data structures and patterns along the subsequent hidden layers of the network. However, the common practice of using the activation features from the last network layer inevitably leads to a visual recognition bottleneck. This is due to the fact...
Recent advances in salient object detection have exploited deep Convolutional Neural Networks (CNNs) to represent high-level semantics; however, due to the presence of convolutional and pooling layers, it is difficult for a CNN to generate saliency maps with sharp boundaries. In this paper, we propose a multi-scale mask-based Fast R-CNN framework which generates a saliency score for each region. Since the...
Crowd video retrieval is an important problem in surveillance video management in the era of big data, e.g., video indexing and browsing. In this paper, we address this issue from the motion-level perspective by using hand-drawn sketches as queries. Motion sketch based crowd video retrieval naturally suffers from challenges in motion-level video indexing and sketch representation. We tackle them by...
This paper presents a novel approach to detecting crowd groups and learning semantic regions with a Gestalt-laws-based similarity. Unlike existing approaches based on optical flow or complete trajectories, our model adopts tracklets as the original input, because they carry more detailed information. Although tracklets do not span the same duration, they are more robust to noise...
In this paper, we propose to use the contexts of superpixels as a prior to improve semantic segmentation within the CRF framework. A graphical model is constructed on over-segmented images. Our main contribution is to take the concept of “superpixel embedding” into consideration, which is formalized as a potential term for optimizing the energy of the whole graph. We also introduce two ways of calculating...
Deep convolutional neural networks (DCNNs) have been employed in many computer vision tasks with great success due to their robustness in feature learning. One of the advantages of DCNNs is their representation robustness to object locations, which is useful for object recognition tasks. However, this also discards spatial information, which is useful when dealing with topological information of the...
In this paper we introduce a novel method for general semantic segmentation that can benefit from the general semantics of a Convolutional Neural Network (CNN). Our segmentation proposes visually and semantically coherent image segments. We use binary encoding of CNN features to overcome the difficulty of clustering in the high-dimensional CNN feature space. These binary codes are very robust against...
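As an illustrative sketch of binarizing high-dimensional features before clustering (the per-dimension median threshold here is an assumption for illustration, not necessarily the paper's encoding), binary codes can be compared cheaply via Hamming distance:

```python
import numpy as np

def binarize(features):
    """Binary-encode features by thresholding each dimension at its median."""
    return (features > np.median(features, axis=0)).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))

# Hypothetical CNN activations: 6 image segments, 8 dimensions each.
rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 8))
codes = binarize(feats)
d = hamming(codes[0], codes[1])
```

Clustering then operates on the compact codes instead of the raw high-dimensional activations, which is where the robustness claimed above comes in.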
Image annotation, or prediction of multiple tags for an image, is a challenging task. Most current algorithms are based on large sets of handcrafted features. Deep convolutional neural networks have recently outperformed humans in image classification, and these networks can be used to extract features highly predictive of an image's tags. In this study, we analyze semantic information in features...
We present an application of the Layer-wise Relevance Propagation (LRP) algorithm to state-of-the-art deep convolutional neural networks and Fisher Vector classifiers to compare the image perception and prediction strategies of both classifiers using visualized heatmaps. LRP is a method to compute scores for individual components of an input image, denoting...
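LRP's core step, redistributing relevance from a layer's output onto its inputs in proportion to each input's contribution, can be sketched for a single linear layer with the epsilon rule (the weights and relevances below are toy values; real LRP applies this layer by layer through the whole network):

```python
import numpy as np

def lrp_linear(x, W, b, R_out, eps=1e-6):
    """Epsilon-rule LRP for one linear layer: redistribute output relevance
    R_out onto the inputs in proportion to each contribution x_i * w_ij
    to the pre-activation z_j."""
    z = x @ W + b                # pre-activations, shape (out,)
    z = z + eps * np.sign(z)     # stabilizer against division by small z
    s = R_out / z                # per-output scaling factors
    return x * (W @ s)           # input relevances, shape (in,)

x = np.array([1.0, 2.0, 0.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.zeros(2)
R_out = np.array([3.0, 4.0])     # relevance arriving at the layer output
R_in = lrp_linear(x, W, b, R_out)
```

Note the conservation property: with zero bias, the input relevances sum to the same total as the output relevances, which is what lets the heatmap be read as a decomposition of the classifier's score.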
Visual question answering (VQA) has emerged from great developments in computer vision and natural language processing, and requires deep understanding of images and questions as well as their effective integration. Current works on VQA simply concatenate visual and textual features or compare them via a dot product, which is unable to eliminate the semantic difference between them. We argue to...
The past decade has witnessed remarkable developments in SIFT-based approaches for image retrieval. However, such approaches are inherently insufficient for handling the semantic gap and large viewpoint changes, leading to inferior performance. To address these limitations, this paper extends SIFT-based match kernels by integrating the match functions for SIFT and CNN features. Specifically, a thresholded...
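The thresholded match function is truncated in this excerpt, so as a generic hedged sketch only: a thresholded match kernel typically sums pairwise descriptor similarities that exceed a cutoff, suppressing weak, noisy matches (the cosine similarity and the value of tau here are assumptions, not the paper's definition):

```python
import numpy as np

def thresholded_match_kernel(X, Y, tau=0.8):
    """Sum cosine similarities over descriptor pairs exceeding tau.

    X, Y: (n, d) and (m, d) arrays of local descriptors (e.g. SIFT or
    CNN features), one row per descriptor.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = Xn @ Yn.T                     # all pairwise cosine similarities
    return float(np.sum(S[S > tau]))  # keep only confident matches

# Two tiny descriptor sets: one near-duplicate pair, one unrelated pair.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.1]])
k = thresholded_match_kernel(X, Y)
```

Only the near-duplicate pair survives the threshold, so the kernel value reflects confident correspondences rather than the full similarity mass.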
We propose a salient object detection algorithm via sparse reconstruction determined by multilevel feature learning. There are three stages in our method. First, the test image is successively processed by segmentation and semantic information generation procedures. Second, three kinds of features are extracted at the semantic, global, and local levels for each superpixel to train a random forest regressor,...
In still images, multi-scale regions contain rich information of different granularity. However, only semantically meaningful regions provide auxiliary cues for action recognition. Moreover, regions at different scales contribute differently. Motivated by the two observations, we propose an approach that is composed of three components: 1) detecting semantic region candidates at multiple scales, 2)...