Open-Vocabulary Video Relation Extraction
- URL: http://arxiv.org/abs/2312.15670v1
- Date: Mon, 25 Dec 2023 09:29:34 GMT
- Title: Open-Vocabulary Video Relation Extraction
- Authors: Wentao Tian, Zheng Wang, Yuqian Fu, Jingjing Chen, Lechao Cheng
- Abstract summary: We introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets.
OVRE focuses on the pairwise relations involved in the action and describes these relation triplets in natural language.
We curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset.
- Score: 37.40717383505057
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A comprehensive understanding of videos is inseparable from describing the
action with its contextual action-object interactions. However, many current
video understanding tasks prioritize general action classification and overlook
the actors and relationships that shape the nature of the action, resulting in
a superficial understanding of the action. Motivated by this, we introduce
Open-vocabulary Video Relation Extraction (OVRE), a novel task that views
action understanding through the lens of action-centric relation triplets. OVRE
focuses on the pairwise relations involved in the action and describes these
relation triplets in natural language. Moreover, we curate the Moments-OVRE
dataset, which comprises 180K videos with action-centric relation triplets,
sourced from a multi-label action classification dataset. With Moments-OVRE, we
further propose a cross-modal mapping model to generate relation triplets as a
sequence. Finally, we benchmark existing cross-modal generation models on the
new task of OVRE.
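To make the "relation triplets as a sequence" idea concrete, below is a minimal PyTorch sketch of a cross-modal mapping model in this spirit: pooled video features are mapped to a short prefix of embeddings that conditions an autoregressive text decoder, which emits action-centric triplets as a flat token sequence (e.g., "person ; hold ; guitar"). This is an illustrative sketch under stated assumptions, not the authors' implementation; the backbone, mapper, prefix length, and decoder choices below are assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation) of the
# "triplets as a sequence" idea: pooled video features are mapped to a prefix of
# embeddings that conditions an autoregressive decoder, which then generates
# action-centric triplets as flat text, e.g. "person ; hold ; guitar <sep> ...".
import torch
import torch.nn as nn

class CrossModalTripletGenerator(nn.Module):
    def __init__(self, vocab_size: int, video_dim: int = 512,
                 d_model: int = 256, prefix_len: int = 8, n_layers: int = 2):
        super().__init__()
        self.prefix_len = prefix_len
        # Cross-modal mapper: pooled video feature -> prefix_len embedding vectors.
        self.mapper = nn.Sequential(nn.Linear(video_dim, d_model * prefix_len), nn.Tanh())
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feat: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # video_feat: (B, video_dim) pooled features from any (assumed) video backbone.
        # token_ids: (B, T) target triplet-sequence tokens (teacher forcing).
        B, T = token_ids.shape
        prefix = self.mapper(video_feat).view(B, self.prefix_len, -1)
        tgt = self.token_emb(token_ids)  # positional encodings omitted for brevity
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory=prefix, tgt_mask=causal_mask)
        return self.lm_head(hidden)  # (B, T, vocab_size) next-token logits

# Toy usage: 2 clips, 10-token target sequences, 100-word vocabulary.
model = CrossModalTripletGenerator(vocab_size=100)
logits = model(torch.randn(2, 512), torch.randint(0, 100, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 100])
```

In practice the video features would come from a pretrained video or image-text backbone and the decoder from a pretrained language model; the toy dimensions here simply keep the sketch self-contained.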
Related papers
- Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition [53.02634128715853]
Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars.
We propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR.
It unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view.
arXiv Detail & Related papers (2025-04-14T10:23:22Z) - DreamRelation: Relation-Centric Video Customization [33.65405972817795]
Video customization refers to the creation of personalized videos that depict user-specified relations between two subjects.
While existing methods can personalize subject appearances and motions, they still struggle with complex video customization.
We propose DreamRelation, a novel approach that learns the target relation from a small set of exemplar videos, leveraging two key components: Decoupling Learning and Dynamics Enhancement.
arXiv Detail & Related papers (2025-03-10T17:58:03Z) - Multimodal Relational Triple Extraction with Query-based Entity Object Transformer [20.97497765985682]
Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs.
We propose Multimodal Entity-Object Triple Extraction, which aims to extract all triples (entity, relation, object region) from image-text pairs.
We also propose QEOT, a query-based model with a selective attention mechanism to dynamically explore the interaction and fusion of textual and visual information.
arXiv Detail & Related papers (2024-08-16T12:43:38Z) - Cross-Modal Reasoning with Event Correlation for Video Question
Answering [32.332251488360185]
We introduce dense captions as a new auxiliary modality and distill event-correlated information from them to infer the correct answer.
We employ cross-modal reasoning modules for explicitly modeling inter-modal relationships and aggregating relevant information across different modalities.
We propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning.
arXiv Detail & Related papers (2023-12-20T02:30:39Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms existing state-of-the-art methods in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Relational Self-Attention: What's Missing in Attention for Video
Understanding [52.38780998425556]
We introduce a relational feature transform, dubbed relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
arXiv Detail & Related papers (2021-11-02T15:36:11Z) - Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z) - Rendezvous: Attention Mechanisms for the Recognition of Surgical Action
Triplets in Endoscopic Videos [12.725586100227337]
Among existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities.
We introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels.
Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.
arXiv Detail & Related papers (2021-09-07T17:52:52Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.