2017 IEEE International Conference on Computer Vision (ICCV)

chapter

Dense-Captioning Events in Videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, et al.

2017 IEEE International Conference on Computer Vision (ICCV) > 706 - 715

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing...

chapter

Areas of Attention for Image Captioning

Marco Pedersoli, Thomas Lucas, Cordelia Schmid, Jakob Verbeek

2017 IEEE International Conference on Computer Vision (ICCV) > 1251 - 1259

We propose “Areas of Attention”, a novel attention-based model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions. In contrast to previous attention-based approaches that associate image regions only to the RNN state, our method allows a direct association between caption...

chapter

Recurrent Multimodal Interaction for Referring Image Segmentation

Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, et al.

2017 IEEE International Conference on Computer Vision (ICCV) > 1280 - 1289

In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e. referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more native in the sense of jointly modeling two modalities...

chapter

Spatio-Temporal Person Retrieval via Natural Language Queries

Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada

2017 IEEE International Conference on Computer Vision (ICCV) > 1462 - 1471

In this paper, we address the problem of spatio-temporal person retrieval from videos using a natural language query, in which we output a tube (i.e., a sequence of bounding boxes) which encloses the person described by the query. For this problem, we introduce a novel dataset consisting of videos containing people annotated with bounding boxes for each second and with five natural language descriptions...

chapter

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, et al.

2017 IEEE International Conference on Computer Vision (ICCV) > 1829 - 1838

Rich and dense human-labeled datasets are among the main enabling factors for the recent advance on vision-language understanding. Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understandings about the same visual scenes, and even the same set of images (e...

chapter

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

Abhishek Das, Satwik Kottur, Jose M. F. Moura, Stefan Lee, et al.

2017 IEEE International Conference on Computer Vision (ICCV) > 2970 - 2979

We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative ‘image guessing’ game between two agents – Q-BOT and A-BOT – who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end – from pixels...

chapter

TALL: Temporal Activity Localization via Language Query

Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia

2017 IEEE International Conference on Computer Vision (ICCV) > 5277 - 5285

This paper focuses on temporal localization of actions in untrimmed videos. Existing methods typically train classifiers for a pre-defined list of actions and apply them in a sliding window fashion. However, activities in the wild consist of a wide combination of actors, actions and objects; it is difficult to design a proper activity list that meets users’ needs. We propose to localize activities...

chapter

Localizing Moments in Video with Natural Language

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, et al.

2017 IEEE International Conference on Computer Vision (ICCV) > 5804 - 5813

We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video...

chapter

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, et al.

2017 IEEE International Conference on Computer Vision (ICCV) > 5843 - 5851

Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language, like humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth...
