Video-adverb retrieval with compositional adverb-action embeddings
- URL: http://arxiv.org/abs/2309.15086v1
- Date: Tue, 26 Sep 2023 17:31:02 GMT
- Title: Video-adverb retrieval with compositional adverb-action embeddings
- Authors: Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
- Abstract summary: Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding.
We propose a framework for video-to-adverb retrieval that aligns video embeddings with their matching compositional adverb-action text embedding.
Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval.
- Score: 59.45164042078649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieving adverbs that describe an action in a video poses a crucial step
towards fine-grained video understanding. We propose a framework for
video-to-adverb retrieval (and vice versa) that aligns video embeddings with
their matching compositional adverb-action text embedding in a joint embedding
space. The compositional adverb-action text embedding is learned using a
residual gating mechanism, along with a novel training objective consisting of
triplet losses and a regression target. Our method achieves state-of-the-art
performance on five recent benchmarks for video-adverb retrieval. Furthermore,
we introduce dataset splits to benchmark video-adverb retrieval for unseen
adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet
Adverbs datasets. Our proposed framework outperforms all prior works for the
generalisation task of retrieving adverbs from videos for unseen adverb-action
compositions. Code and dataset splits are available at
https://hummelth.github.io/ReGaDa/.
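The abstract names three ingredients: a compositional adverb-action text embedding built with residual gating, a triplet loss, and retrieval in a joint embedding space. The following is a minimal NumPy sketch of how such pieces could fit together; it is not the authors' implementation. The gating form, the weight matrices `W_gate` and `W_res`, the embedding size `DIM`, the margin value, and all function names are assumptions made for illustration, and the regression target mentioned in the abstract is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding size, chosen arbitrarily for this sketch


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def compose(action, adverb, W_gate, W_res):
    """Hypothetical residual gating: the action embedding plus a gated,
    adverb-conditioned residual, projected onto the unit sphere."""
    pair = np.concatenate([action, adverb], axis=-1)
    gate = sigmoid(pair @ W_gate)        # per-dimension gate in (0, 1)
    residual = np.tanh(pair @ W_res)     # adverb-dependent modification
    return l2_normalize(action + gate * residual)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard margin-based triplet loss on Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)


def retrieve_adverb(video, action, adverbs, W_gate, W_res):
    """Return the index of the adverb whose composition with the given
    action lies closest to the video embedding."""
    candidates = np.stack([compose(action, a, W_gate, W_res) for a in adverbs])
    dists = np.linalg.norm(candidates - video, axis=-1)
    return int(np.argmin(dists))


# Toy data: random stand-ins for learned parameters and text embeddings.
W_gate = rng.normal(size=(2 * DIM, DIM))
W_res = rng.normal(size=(2 * DIM, DIM))
action = rng.normal(size=DIM)
adverbs = rng.normal(size=(4, DIM))

# Pretend the video encoder maps a clip exactly onto the composition
# of the action with adverb number 2.
video = compose(action, adverbs[2], W_gate, W_res)
best = retrieve_adverb(video, action, adverbs, W_gate, W_res)
```

In this toy setup `best` is the index of the adverb used to build `video`. In a trained model the video embedding would come from a video encoder, and the triplet loss would pull it towards the matching composition and away from compositions with other adverbs or actions.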
Related papers
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition [54.938128496934695]
We propose a new framework that reasons over object-behaviours extracted from raw-video-clips to recognize the clip's corresponding adverb-types.
Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips.
We release two new datasets of object-behaviour-facts extracted from raw video clips.
arXiv Detail & Related papers (2023-07-09T09:04:26Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs [52.042261549764326]
We propose a method which recognizes adverbs across different actions.
Our approach uses semi-supervised learning with multiple adverb pseudo-labels.
We also show how adverbs can relate fine-grained actions.
arXiv Detail & Related papers (2022-03-23T11:53:41Z)
- Tell me what you see: A zero-shot action recognition method based on natural language descriptions [3.136605193634262]
We propose using video captioning methods to extract semantic information from videos.
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
We build a shared semantic space employing BERT-based embedders pre-trained in the paraphrasing task on multiple text datasets.
arXiv Detail & Related papers (2021-12-18T17:44:07Z)
- Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video.
arXiv Detail & Related papers (2021-03-09T08:17:17Z)
- Semantic Grouping Network for Video Captioning [11.777063873936598]
The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption.
The continuous feedback from decoded words enables the SGN to dynamically update the video representation that adapts to the partially decoded caption.
The SGN achieves state-of-the-art performance, outperforming runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets.
arXiv Detail & Related papers (2021-02-01T13:40:56Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)