Related papers: Video-adverb retrieval with compositional adverb-action embeddings

URL: http://arxiv.org/abs/2309.15086v1
Date: Tue, 26 Sep 2023 17:31:02 GMT
Title: Video-adverb retrieval with compositional adverb-action embeddings
Authors: Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
Abstract summary: Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval that aligns video embeddings with their matching compositional adverb-action text embedding. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval.
Score: 59.45164042078649
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works for the generalisation task of retrieving adverbs from videos for unseen adverb-action compositions. Code and dataset splits are available at https://hummelth.github.io/ReGaDa/.

Related papers

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models [48.00262713744499]
VideoComp is a benchmark and learning framework for advancing video-text compositionality understanding. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences.
arXiv Detail & Related papers (2025-04-04T22:24:30Z)
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations. We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z)
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net). First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition [54.938128496934695]
We propose a new framework that reasons over object-behaviours extracted from raw-video-clips to recognize the clip's corresponding adverb-types. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips. We release two new datasets of object-behaviour-facts extracted from raw video clips.
arXiv Detail & Related papers (2023-07-09T09:04:26Z)
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR). Existing methods rely on separate pre-training feature extractors for visual and textual understanding. We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video. Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset. We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS). To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs [52.042261549764326]
We propose a method which recognizes adverbs across different actions. Our approach uses semi-supervised learning with multiple adverb pseudo-labels. We also show how adverbs can relate fine-grained actions.
arXiv Detail & Related papers (2022-03-23T11:53:41Z)
Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning. We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively. Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video.
arXiv Detail & Related papers (2021-03-09T08:17:17Z)
Semantic Grouping Network for Video Captioning [11.777063873936598]
The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption. The continuous feedback from decoded words enables the SGN to dynamically update the video representation that adapts to the partially decoded caption. The SGN achieves state-of-the-art performances by outperforming runner-up methods by a margin of 2.1%p and 2.4%p in a CIDEr-D score on MSVD and MSR-VTT datasets.
arXiv Detail & Related papers (2021-02-01T13:40:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.