The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
The advent of high-tech journaling tools facilitates an image to be manipulated in a way that can easily evade state-of-the-art image tampering detection approaches. The recent success of the deep learning approaches in different recognition tasks inspires us to develop a high confidence detection framework which can localize manipulated regions in an image. Unlike semantic object segmentation where...
We address the problem of recognizing situations in images. Given an image, the task is to predict the most salient verb (action), and fill its semantic roles such as who is performing the action, what is the source and target of the action, etc. Different verbs have different roles (e.g. attacking has weapon), and each role can take on many possible values (nouns). We propose a model based on Graph...
A natural image usually conveys rich semantic content and can be viewed from different angles. Existing image description methods are largely restricted by small sets of biased visual paragraph annotations, and fail to cover rich underlying semantics. In this paper, we investigate a semi-supervised paragraph generative framework that is able to synthesize diverse and semantically coherent paragraph...
Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and several evaluation metrics are explored for this problem. Our approach is a joint image pixel and word concept embeddings framework, where word concepts...
We present an unsupervised representation learning approach using videos without semantic labels. We leverage the temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based...
This paper proposes an automatic spatially-aware concept discovery approach using weakly labeled image-text data from shopping websites. We first fine-tune GoogleNet by jointly modeling clothing images and their corresponding descriptions in a visual-semantic embedding space. Then, for each attribute (word), we generate its spatiallyaware representation by combining its semantic word vector representation...
Existing video event classification approaches suffer from limited human-labeled semantic annotations. Weak semantic annotations can be harvested from Web-knowledge without involving any human interaction. However such weak annotations are noisy, thus can not be effectively utilized without distinguishing its reliability. In this paper, we propose a novel approach to automatically maximize the utility...
Cinemagraphs are a compelling way to convey dynamic aspects of a scene. In these media, dynamic and still elements are juxtaposed to create an artistic and narrative experience. Creating a high-quality, aesthetically pleasing cinemagraph requires isolating objects in a semantically meaningful way and then selecting good start times and looping periods for those objects to minimize visual artifacts...
A first-person camera, placed at a person's head, captures, which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from the manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state such as his intentions and attention, and thus, only the person...
Rich and dense human labeled datasets are among the main enabling factors for the recent advance on visionlanguage understanding. Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understandings about the same visual scenes — and even the same set of images (e...
Textual-visual matching aims at measuring similarities between sentence descriptions and images. Most existing methods tackle this problem without effectively utilizing identity-level annotations. In this paper, we propose an identity-aware two-stage framework for the textual-visual matching problem. Our stage-1 CNN-LSTM network learns to embed cross-modal features with a novel Cross-Modal Cross-Entropy...
Leveraging class semantic descriptions and examples of known objects, zero-shot learning makes it possible to train a recognition model for an object class whose examples are not available. In this paper, we propose a novel zero-shot learning model that takes advantage of clustering structures in the semantic embedding space. The key idea is to impose the structural constraint that semantic representations...
Fully convolutional neural networks (FCNs) have shown outstanding performance in many dense labeling problems. One key pillar of these successes is mining relevant information from features in convolutional layers. However, how to better aggregate multi-level convolutional feature maps for salient object detection is underexplored. In this work, we present Amulet, a generic aggregating multi-level...
Pedestrian analysis plays a vital role in intelligent video surveillance and is a key component for security-centric computer vision systems. Despite that the convolutional neural networks are remarkable in learning discriminative features from images, the learning of comprehensive features of pedestrians for fine-grained tasks remains an open problem. In this study, we propose a new attentionbased...
This paper proposes a systematic framework for applying the Physics of Notations (PoN), a theory for the design of cognitively effective visual notations. The PoN consists of nine principles, but not all principles lend themselves equally to a clear and unambiguous operationalization. As a result, many visual notations designed according to the PoN apply it in different ways. The proposed framework...
We propose a visual SLAM (Simultaneous Localization And Mapping) system able to perform robustly in populated environments. The image stream from a moving RGB-D camera is the only input to the system. The computed map in real-time is composed of two layers: 1) The unpopulated geometrical layer, which describes the geometry of the bare scene as an occupancy grid where pieces of information corresponding...
This paper investigates the problem of visual analysis of complex plans, schemes and maps for solving of difficult formalized tasks. It is analyzed the dependence of the level of perception upon the visualization complexity. It is introduced the concept of the utility of the visual image, it is described by the behavior of the empirical utility function. We propose an optimization model of the utility,...
The most important aim of our study is to realize a multi-database for social sciences and environmental sciences. Our system connects heterogeneous databases about historical phenomena by using common spatiotemporal information and visualize the connected results onto 5D World Map (a set of chronologically ordered global maps). To actualize that, we created a news articles provision system using...
Scene recognition is an important and challenging problem in the field of computer vision owing to the variations in the same class and the similarities between different classes. This paper presents a novel approach that learns a reasonable dictionary from convolutional features to effectively describe the distinctive and shared properties in scene images. Substantial convolution operations in Deep...
Semantic gap, which is the difference between low-level image features and their high-level semantics, has become very popular and witnessed great interest in the last two decades. This paper deals with this problem and proposes a hybrid approach to learn image semantic concepts for modeling visual features in discriminative learning stage. It combines the advantages of human-in-the-loop and discriminative...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.