We Have So Much In Common: Modeling Semantic Relational Set Abstractions
in Videos
- URL: http://arxiv.org/abs/2008.05596v1
- Date: Wed, 12 Aug 2020 22:57:44 GMT
- Title: We Have So Much In Common: Modeling Semantic Relational Set Abstractions
in Videos
- Authors: Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris,
Carl Vondrick, and Aude Oliva
- Abstract summary: We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning.
We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos.
- Score: 29.483605238401577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying common patterns among events is a key ability in human and
machine perception, as it underlies intelligent decision making. We propose an
approach for learning semantic relational set abstractions on videos, inspired
by human learning. We combine visual features with natural language supervision
to generate high-level representations of similarities across a set of videos.
This allows our model to perform cognitive tasks such as set abstraction (which
general concept is in common among a set of videos?), set completion (which new
video goes well with the set?), and odd one out detection (which video does not
belong to the set?). Experiments on two video benchmarks, Kinetics and
Multi-Moments in Time, show that robust and versatile representations emerge
when learning to recognize commonalities among sets. We compare our model to
several baseline algorithms and show that significant improvements result from
explicitly learning relational abstractions with semantic supervision.
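The three tasks named in the abstract can be illustrated with a toy sketch. This is not the authors' model: it assumes each video has already been mapped to a fixed-length embedding and replaces the learned relational abstraction with a plain mean-plus-cosine-similarity baseline. The function names, the embeddings, and the concept labels below are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean; stands in for a learned set representation."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def set_abstraction(set_embeddings, concept_embeddings):
    """Name of the concept whose embedding is closest to the set mean."""
    centroid = mean_vector(set_embeddings)
    return max(concept_embeddings,
               key=lambda name: cosine(concept_embeddings[name], centroid))


def set_completion(set_embeddings, candidates):
    """Index of the candidate video closest to the set mean."""
    centroid = mean_vector(set_embeddings)
    return max(range(len(candidates)),
               key=lambda j: cosine(candidates[j], centroid))


def odd_one_out(embeddings):
    """Index of the video least similar to the mean of all the others."""
    scores = []
    for i, emb in enumerate(embeddings):
        rest = embeddings[:i] + embeddings[i + 1:]
        scores.append(cosine(emb, mean_vector(rest)))
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical 3-d embeddings: three similar videos and one outlier (index 2).
videos = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [1.0, 0.0, 0.1]]
coherent = [v for i, v in enumerate(videos) if i != 2]
# Hypothetical concept embeddings living in the same space as the videos.
concepts = {"exercising": [1.0, 0.0, 0.0], "cooking": [0.0, 0.0, 1.0]}

print(odd_one_out(videos))                                          # 2
print(set_completion(coherent, [[0.0, 1.0, 0.0], [0.95, 0.1, 0.05]]))  # 1
print(set_abstraction(coherent, concepts))                          # exercising
```

The mean is used here only because it is the simplest permutation-invariant way to summarize a set; the abstract indicates that the paper instead learns the set representation with semantic (language) supervision.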
Related papers
- CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders [6.159948396712944]
CrossVideoMAE learns both video-level and frame-level rich temporal representations and semantic attributes.
Our method integrates mutual temporal information from videos with spatial information from sampled frames.
This is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner.
arXiv Detail & Related papers (2025-02-08T06:15:39Z)
- Enhancing Multi-Modal Video Sentiment Classification Through Semi-Supervised Clustering [0.0]
We aim to improve video sentiment classification by focusing on three key aspects: the video itself, the accompanying text, and the acoustic features.
We are developing a method that utilizes clustering-based semi-supervised pre-training to extract meaningful representations from the data.
arXiv Detail & Related papers (2025-01-11T08:04:39Z)
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video-text as game players using multivariate cooperative game theory.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
- VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? [19.313541287648473]
VELOCITI is a new benchmark building on complex movie clips to test perception and binding in video language models.
Our perception-based tests require discriminating video-caption pairs that share similar entities.
Our binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video.
arXiv Detail & Related papers (2024-06-16T10:42:21Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that this simple and straightforward idea is effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we review existing approaches to self-supervised learning, focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Learning from Untrimmed Videos: Self-Supervised Video Representation
Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging more abundant information in unsupervised videos.
HiCo generates stronger representations on untrimmed videos and also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)