The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Human parsing has recently attracted a lot of research interests due to its huge application potentials. However existing datasets have limited number of images and annotations, and lack the variety of human appearances and the coverage of challenging cases in unconstrained environment. In this paper, we introduce a new benchmark Look into Person (LIP) that makes a significant advance in terms of...
Semantic sparsity is a common challenge in structured visual classification problems, when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic sparsity in situation recognition, the task of producing structured summaries of what is happening in images, including activities, objects and the roles objects...
Due to the importance of zero-shot learning, the number of proposed approaches has increased steadily recently. We argue that it is time to take a step back and to analyze the status quo of the area. The purpose of this paper is three-fold. First, given the fact that there is no agreed upon zero-shot learning benchmark, we first define a new benchmark by unifying both the evaluation protocols and...
In this work, we study a poorly understood trade-off between accuracy and runtime costs for deep semantic video segmentation. While recent work has demonstrated advantages of learning to speed-up deep activity detection, it is not clear if similar advantages will hold for our very different segmentation loss function, which is defined over individual pixels across the frames. In deep video segmentation,...
The performance of image retrieval has been improved tremendously in recent years through the use of deep feature representations. Most existing methods, however, aim to retrieve images that are visually similar or semantically relevant to the query, irrespective of spatial configuration. In this paper, we develop a spatial-semantic image search technology that enables users to search for images with...
A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is...
Querying with an example image is a simple and intuitive interface to retrieve information from a visual database. Most of the research in image retrieval has focused on the task of instance-level image retrieval, where the goal is to retrieve images that contain the same object instance as the query image. In this work we move beyond instance-level retrieval and consider the task of semantic image...
Fine-grained activity understanding in videos has attracted considerable recent attention with a shift from action classification to detailed actor and action understanding that provides compelling results for perceptual needs of cutting-edge autonomous systems. However, current methods for detailed understanding of actor and action have significant limitations: they require large amounts of finely...
We propose a novel framework named StyleNet to address the task of generating attractive captions for images and videos with different styles. To this end, we devise a novel model component, named factored LSTM, which automatically distills the style factors in the monolingual text corpus. Then at runtime, we can explicitly control the style in the caption generation process so as to produce attractive...
We investigate and improve self-supervision as a drop-in replacement for ImageNet pretraining, focusing on automatic colorization as the proxy task. Self-supervised training has been shown to be more promising for utilizing unlabeled data than other, traditional unsupervised learning methods. We build on this success and evaluate the ability of our self-supervised network in several contexts. On VOC...
We present a descriptor, called fully convolutional self-similarity (FCSS), for dense semantic correspondence. To robustly match points among different instances within the same object class, we formulate FCSS using local self-similarity (LSS) within a fully convolutional network. In contrast to existing CNN-based descriptors, FCSS is inherently insensitive to intra-class appearance variations because...
Semantic segmentation for remote sensing images is a critical process in the workflow of object-based image analysis. Recently, convolutional neural networks(CNNs) are powerful visual models that yield hierarchies of features. In this paper, we propose a deep convolutional encoder-decoder model for remote sensing images segmentation. Specifically, we rely on the encoder network to extract the high-level...
Automated annotation of urban areas from overhead imagery plays an essential role in many remote sensing applications. Generative Adversarial Nets (GANs) is one of the most effective ways to handle this problem. In this manuscript, two tricks were added in conditional GANs(cGANs) which learn the mapping from input image to output remote sensing image. All the experimental results demonstrated that...
In this paper, we focus on the classification of lidar point cloud data acquired via mobile laser scanning, whereby the classification relies on a context model based on a Conditional Random Field (CRF). We present two approximate inference algorithms based on belief propagation, as well as a graph-cut-based approach not yet applied in this context. To demonstrate the performance of our approach,...
New challenges in remote sensing impose the necessity of designing pixel classification methods that, once trained on a certain dataset, generalize to other areas of the earth. This may include regions where the appearance of the same type of objects is significantly different. In the literature it is common to use a single image and split it into training and test sets to train a classifier and assess...
Establishing up-to-date nationwide building maps is essential to understand urban dynamics, such as estimating population and urban planning and many other applications. However, an efficient and effective solution is yet to be developed. In this paper, for the first time we evaluate three state-of-the-art CNNs for detecting buildings across entire United States using aerial images. The three CNN...
Understanding temporal expressions is the important foundation of many NLP tasks. However, the varied representations of temporal expressions is difficulty in analysis and understanding. To parsing expressions, an effective classification method of temporal expressions is significant. A temporal expression may belong to one or more classes, but the classification usually requires manual annotation...
We introduce a novel loss max-pooling concept for handling imbalanced training data distributions, applicable as alternative loss layer in the context of deep neural networks for semantic image segmentation. Most real-world semantic segmentation datasets exhibit long tail distributions with few object categories comprising the majority of data and consequently biasing the classifiers towards them...
We aim to jointly estimate height and semantically label monocular aerial images. These two tasks are traditionally addressed separately in remote sensing, despite their strong correlation. Therefore, a model learning both height and classes jointly seems advantageous and so, we propose a multitask Convolutional Neural Network (CNN) architecture with two losses: one performing semantic labeling, and...
This paper aims at task-oriented action prediction, i.e., predicting a sequence of actions towards accomplishing a specific task under a certain scene, which is a new problem in computer vision research. The main challenges lie in how to model task-specific knowledge and integrate it in the learning procedure. In this work, we propose to train a recurrent longshort term memory (LSTM) network for handling...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.