We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos
- URL: http://arxiv.org/abs/2008.05596v1
- Date: Wed, 12 Aug 2020 22:57:44 GMT
- Title: We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos
- Authors: Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris, Carl Vondrick, and Aude Oliva
- Abstract summary: We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning.
We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos.
- Score: 29.483605238401577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying common patterns among events is a key ability in human and
machine perception, as it underlies intelligent decision making. We propose an
approach for learning semantic relational set abstractions on videos, inspired
by human learning. We combine visual features with natural language supervision
to generate high-level representations of similarities across a set of videos.
This allows our model to perform cognitive tasks such as set abstraction (which
general concept is in common among a set of videos?), set completion (which new
video goes well with the set?), and odd one out detection (which video does not
belong to the set?). Experiments on two video benchmarks, Kinetics and
Multi-Moments in Time, show that robust and versatile representations emerge
when learning to recognize commonalities among sets. We compare our model to
several baseline algorithms and show that significant improvements result from
explicitly learning relational abstractions with semantic supervision.
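The abstract describes the approach only at a high level (per-video features, a pooled set-level representation, and natural language supervision over shared concepts), so the following is a minimal, hypothetical sketch of what such a set-abstraction head could look like, not the authors' implementation; all class names, feature dimensions, and the mean-pooling choice are assumptions made for illustration.

```python
# Hypothetical sketch (not the authors' code): a set-abstraction head that
# pools per-video features and scores them against label-text embeddings in
# a shared space. Feature dimensions and pooling are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SetAbstractionModel(nn.Module):
    def __init__(self, video_dim=2048, text_dim=300, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)   # per-video features -> joint space
        self.text_proj = nn.Linear(text_dim, joint_dim)     # label/word embeddings -> joint space
        self.set_fc = nn.Sequential(                        # relates the pooled set summary
            nn.Linear(joint_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def embed_videos(self, video_feats):                    # (B, N, video_dim)
        return F.normalize(self.video_proj(video_feats), dim=-1)

    def embed_set(self, video_feats):
        v = self.embed_videos(video_feats)                  # (B, N, joint_dim)
        pooled = v.mean(dim=1)                              # order-invariant set summary
        return F.normalize(self.set_fc(pooled), dim=-1)     # (B, joint_dim)

    def embed_labels(self, label_feats):                    # (C, text_dim)
        return F.normalize(self.text_proj(label_feats), dim=-1)

    def abstraction_scores(self, video_feats, label_feats):
        # Set abstraction: which general concept is common to the set?
        return self.embed_set(video_feats) @ self.embed_labels(label_feats).T

    def odd_one_out(self, video_feats):
        # The video least similar to the set summary is the candidate outlier.
        s = self.embed_set(video_feats).unsqueeze(1)        # (B, 1, D)
        v = self.embed_videos(video_feats)                  # (B, N, D)
        return (v * s).sum(-1).argmin(dim=1)                # outlier index per set

if __name__ == "__main__":
    model = SetAbstractionModel()
    sets = torch.randn(2, 4, 2048)      # 2 sets of 4 pre-extracted video features
    labels = torch.randn(50, 300)       # 50 candidate abstraction-label embeddings
    print(model.abstraction_scores(sets, labels).shape)    # torch.Size([2, 50])
    print(model.odd_one_out(sets))                         # outlier index per set
```

Under these assumptions, abstraction_scores ranks candidate label embeddings against the pooled set embedding (set abstraction), and the same similarity space can score a held-out video against the set summary for set completion or odd-one-out detection.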
Related papers
- VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? [19.313541287648473]
VELOCITI is a new benchmark building on complex movie clips to test perception and binding in video language models.
Our perception-based tests require discriminating video-caption pairs that share similar entities.
Our binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video.
arXiv Detail & Related papers (2024-06-16T10:42:21Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging the more abundant information in untrimmed videos.
HiCo generates stronger representations on untrimmed videos and also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Object-Centric Representation Learning for Video Question Answering [27.979053252431306]
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
The task demands new capabilities to integrate video processing, language understanding, and the binding of abstract concepts to concrete visual artifacts.
We propose a new query-guided representation framework to turn a video into a relational graph of objects.
arXiv Detail & Related papers (2021-04-12T02:37:20Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks (a generic sketch of the underlying contrastive setup is given after this list).
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Improved Actor Relation Graph based Group Activity Recognition [0.0]
The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method that focuses mainly on group activity recognition by learning pair-wise actor appearance similarity and actor positions.
arXiv Detail & Related papers (2020-10-24T19:46:49Z)
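Two of the entries above (CoCon and Composable Augmentation Encoding) build on the positive/negative-pair paradigm their summaries describe: different views of the same clip are positives, other clips are negatives. As a point of reference only, a minimal, generic InfoNCE-style loss under that paradigm looks like the sketch below; the function name, batch size, and temperature are illustrative and are not taken from either paper.

```python
# Generic InfoNCE-style contrastive loss; illustrative only and not the code
# of any paper listed above. Two augmented "views" of the same clip form the
# positive pair; every other clip in the batch serves as a negative.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """z1, z2: (B, D) embeddings of two views of the same B clips."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature      # (B, B); diagonal entries are the positives
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
    print(loss.item())
```

The augmentation-aware variant described in the Composable Augmentation Encoding summary would additionally condition the embeddings on the augmentation parameterisations, which this generic sketch omits.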
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.