Video-adverb retrieval with compositional adverb-action embeddings
- URL: http://arxiv.org/abs/2309.15086v1
- Date: Tue, 26 Sep 2023 17:31:02 GMT
- Title: Video-adverb retrieval with compositional adverb-action embeddings
- Authors: Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
- Abstract summary: Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding.
We propose a framework for video-to-adverb retrieval that aligns video embeddings with their matching compositional adverb-action text embedding.
Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval.
- Score: 59.45164042078649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieving adverbs that describe an action in a video poses a crucial step
towards fine-grained video understanding. We propose a framework for
video-to-adverb retrieval (and vice versa) that aligns video embeddings with
their matching compositional adverb-action text embedding in a joint embedding
space. The compositional adverb-action text embedding is learned using a
residual gating mechanism, along with a novel training objective consisting of
triplet losses and a regression target. Our method achieves state-of-the-art
performance on five recent benchmarks for video-adverb retrieval. Furthermore,
we introduce dataset splits to benchmark video-adverb retrieval for unseen
adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet
Adverbs datasets. Our proposed framework outperforms all prior works for the
generalisation task of retrieving adverbs from videos for unseen adverb-action
compositions. Code and dataset splits are available at
https://hummelth.github.io/ReGaDa/.
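The abstract describes two technical ingredients: a residual gating mechanism that composes adverb and action text embeddings, and a training objective combining triplet losses with a regression target in a joint video-text embedding space. The sketch below shows one way these pieces could fit together; it is a minimal illustration in PyTorch under stated assumptions, not the authors' released implementation (the actual code is at https://hummelth.github.io/ReGaDa/), and all module names, dimensions, and loss weights are hypothetical.

```python
# Minimal sketch (assumed structure, not the ReGaDa reference code) of composing
# adverb and action text embeddings via residual gating, and of a training
# objective that combines a triplet loss with a regression target.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualGatingComposer(nn.Module):
    """Composes an adverb embedding with an action embedding.

    A learned gate controls how much of an adverb-conditioned modification is
    added back onto the action embedding (residual connection), so the composed
    embedding stays anchored to the action while being modulated by the adverb.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        self.modifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, adverb_emb: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([adverb_emb, action_emb], dim=-1)
        g = self.gate(pair)                   # element-wise gate in [0, 1]
        residual = self.modifier(pair)        # adverb-conditioned modification
        composed = action_emb + g * residual  # residual gating
        return F.normalize(composed, dim=-1)


def joint_embedding_loss(video_emb, composed_pos, composed_neg,
                         margin: float = 0.2, lambda_reg: float = 1.0):
    """Triplet loss pulling the video embedding towards its matching composed
    adverb-action text embedding and away from a mismatched one, plus a
    regression term that pushes the video embedding directly onto the positive
    text embedding. Inputs composed_pos / composed_neg are assumed normalized."""
    video_emb = F.normalize(video_emb, dim=-1)
    pos_dist = 1.0 - (video_emb * composed_pos).sum(-1)  # cosine distance
    neg_dist = 1.0 - (video_emb * composed_neg).sum(-1)
    triplet = F.relu(pos_dist - neg_dist + margin).mean()
    regression = F.mse_loss(video_emb, composed_pos)
    return triplet + lambda_reg * regression
```

At retrieval time, each candidate adverb would be composed with the video's action label and ranked by cosine similarity to the video embedding in the joint space; the regression term keeps video embeddings close to their matching composed text embeddings beyond the relative margin enforced by the triplet loss.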
Related papers
- NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z) - SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z) - Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition [54.938128496934695]
We propose a new framework that reasons over object-behaviours extracted from raw-video-clips to recognize the clip's corresponding adverb-types.
Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips.
We release two new datasets of object-behaviour-facts extracted from raw video clips.
arXiv Detail & Related papers (2023-07-09T09:04:26Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and
arXiv Detail & Related papers (2023-01-15T02:04:02Z) - How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs [52.042261549764326]
We propose a method which recognizes adverbs across different actions.
Our approach uses semi-supervised learning with multiple adverb pseudo-labels.
We also show how adverbs can relate fine-grained actions.
arXiv Detail & Related papers (2022-03-23T11:53:41Z) - Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video.
arXiv Detail & Related papers (2021-03-09T08:17:17Z) - Semantic Grouping Network for Video Captioning [11.777063873936598]
The Semantic Grouping Network (SGN) learns an algorithm to capture the most discriminating word phrases of the partially decoded caption.
The continuous feedback from decoded words enables the SGN to dynamically update the video representation that adapts to the partially decoded caption.
The SGN achieves state-of-the-art performance, outperforming runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets.
arXiv Detail & Related papers (2021-02-01T13:40:56Z)