COBE: Contextualized Object Embeddings from Narrated Instructional Video
- URL: http://arxiv.org/abs/2007.07306v2
- Date: Thu, 29 Oct 2020 21:52:34 GMT
- Title: COBE: Contextualized Object Embeddings from Narrated Instructional Video
- Authors: Gedas Bertasius, Lorenzo Torresani
- Abstract summary: We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
- Score: 52.73710465010274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many objects in the real world undergo dramatic variations in visual
appearance. For example, a tomato may be red or green, sliced or chopped, fresh
or fried, liquid or solid. Training a single detector to accurately recognize
tomatoes in all these different states is challenging. On the other hand,
contextual cues (e.g., the presence of a knife, a cutting board, a strainer or
a pan) are often strongly indicative of how the object appears in the scene.
Recognizing such contextual cues is useful not only to improve the accuracy of
object detection or to determine the state of the object, but also to
understand its functional properties and to infer ongoing or upcoming
human-object interactions. A fully-supervised approach to recognizing object
states and their contexts in the real world is unfortunately marred by the
long-tailed, open-ended distribution of the data, which would effectively
require massive amounts of annotations to capture the appearance of objects in
all their different forms. Instead of relying on manually-labeled data for this
task, we propose a new framework for learning Contextualized OBject Embeddings
(COBE) from automatically-transcribed narrations of instructional videos. We
leverage the semantic and compositional structure of language by training a
visual detector to predict a contextualized word embedding of the object and
its associated narration. This enables the learning of an object representation
where concepts relate according to a semantic language metric. Our experiments
show that our detector learns to predict a rich variety of contextual object
information, and that it is highly effective in the settings of few-shot and
zero-shot learning.
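The abstract describes training a visual detector to predict the contextualized embedding of the narrated object word. The sketch below illustrates that core idea only: a detection head is projected into the embedding space of a language model and supervised by the contextual embedding of the object token inside its narration. The choice of `bert-base-uncased`, the `ContextualEmbeddingHead` layer sizes, the placeholder `region_features`, and the simple cosine loss are assumptions made here for illustration, not the authors' implementation (which trains a full detector with a contrastive objective).

```python
# Minimal sketch of the COBE training idea: a detection head predicts the
# contextualized embedding of the narrated object word.
# Assumptions (not the paper's released code): a BERT encoder from Hugging Face
# `transformers` supplies the language targets, and `region_features` stands in
# for per-box features produced by any object detector.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()


class ContextualEmbeddingHead(nn.Module):
    """Maps detector region features into the language-embedding space."""

    def __init__(self, feat_dim=1024, embed_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, region_features):            # (N, feat_dim)
        return F.normalize(self.proj(region_features), dim=-1)


@torch.no_grad()
def narration_target(narration: str, object_word: str) -> torch.Tensor:
    """Contextualized embedding of the object word within its narration."""
    tokens = tokenizer(narration, return_tensors="pt")
    hidden = bert(**tokens).last_hidden_state[0]    # (T, 768)
    word_ids = tokenizer(object_word, add_special_tokens=False)["input_ids"]
    # Average the hidden states of the sub-tokens matching the object word.
    mask = torch.isin(tokens["input_ids"][0], torch.tensor(word_ids))
    return F.normalize(hidden[mask].mean(dim=0), dim=-1)


def embedding_loss(pred, target):
    """Simple cosine loss; the paper uses a contrastive formulation instead."""
    return 1.0 - F.cosine_similarity(pred, target.unsqueeze(0)).mean()


# Toy usage: one region feature paired with the narration of a tomato being sliced.
head = ContextualEmbeddingHead()
region_features = torch.randn(1, 1024)              # placeholder detector output
target = narration_target("now slice the tomato on the cutting board", "tomato")
loss = embedding_loss(head(region_features), target)
loss.backward()
```

In a full system, `region_features` would come from the detector's region-of-interest pooling, and the loss would typically contrast each predicted embedding against the narration embeddings of other samples in the batch rather than using a single positive pair.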
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Text-driven Affordance Learning from Egocentric Vision [6.699930460835963]
We present a text-driven affordance learning approach for robots.
We aim to learn contact points and manipulation trajectories from an egocentric view following textual instruction.
Our approach robustly handles multiple affordances, serving as a new standard for affordance learning in real-world scenarios.
arXiv Detail & Related papers (2024-04-03T07:23:03Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- Learning Scene Context Without Images [2.8184014933789365]
We introduce a novel approach to teach scene contextual knowledge to machines using an attention mechanism.
A distinctive aspect of the proposed approach is its reliance solely on labels from image datasets to teach scene context.
We show how scene-wide relationships among different objects can be learned using a self-attention mechanism.
arXiv Detail & Related papers (2023-11-18T07:27:25Z)
- Opening the Vocabulary of Egocentric Actions [42.94865322371628]
This paper proposes a novel open vocabulary action recognition task.
Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects.
We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets.
arXiv Detail & Related papers (2023-08-22T15:08:02Z)
- Brief Introduction to Contrastive Learning Pretext Tasks for Visual Representation [0.0]
We introduce contrastive learning, a subset of unsupervised learning methods.
The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing apart embeddings of different samples (a minimal sketch of this objective appears after this list).
We survey recently published contrastive learning strategies that focus on pretext tasks for visual representation.
arXiv Detail & Related papers (2022-10-06T18:54:10Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Language Models as Zero-shot Visual Semantic Learners [0.618778092044887]
We propose a Visual Semantic Embedding Probe (VSEP) to probe the semantic information of contextualized word embeddings.
The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner.
We find that contextual representations in language models outperform static word embeddings, when the compositional chain of object is short.
arXiv Detail & Related papers (2021-07-26T08:22:55Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
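The contrastive objective summarized in the "Brief Introduction to Contrastive Learning Pretext Tasks" entry above is commonly formalized as an InfoNCE loss. The sketch below is a generic, self-contained illustration of that objective; the batch size, embedding dimensionality, and temperature are placeholder choices, and the code is not taken from any of the listed papers.

```python
# Generic InfoNCE-style contrastive loss: embeddings of two augmented views of
# the same sample are pulled together, while the other samples in the batch act
# as negatives. Shapes and the temperature value are illustrative only.

import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature       # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)


# Toy usage with random stand-ins for the two augmented-view embeddings.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce(z1, z2).item())
```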