Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition
- URL: http://arxiv.org/abs/2307.04132v3
- Date: Wed, 27 Mar 2024 18:17:46 GMT
- Title: Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition
- Authors: Amrit Diggavi Seshadri, Alessandra Russo
- Abstract summary: We propose a new framework that reasons over object-behaviours extracted from raw-video-clips to recognize the clip's corresponding adverb-types.
Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips.
We release two new datasets of object-behaviour-facts extracted from raw video clips.
- Score: 54.938128496934695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, following the intuition that adverbs describing scene-sequences are best identified by reasoning over high-level concepts of object-behaviour, we propose the design of a new framework that reasons over object-behaviours extracted from raw-video-clips to recognize the clip's corresponding adverb-types. Importantly, while previous works for general scene adverb-recognition assume knowledge of the clip's underlying action-types, our method is directly applicable in the more general problem setting where the action-type of a video-clip is unknown. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips and propose novel symbolic and transformer-based reasoning methods that operate over these extracted facts to identify adverb-types. Experimental results demonstrate that our proposed methods perform favourably against the previous state-of-the-art. Additionally, to support efforts in symbolic video-processing, we release two new datasets of object-behaviour-facts extracted from raw video clips - the MSR-VTT-ASP and ActivityNet-ASP datasets.
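To make the pipeline concrete, here is a minimal, hypothetical sketch of the reasoning stage: object-behaviour facts are assumed to be symbolic tuples (the fact schema, vocabulary, and model sizes are illustrative assumptions, not the authors' released implementation), and a small transformer encoder aggregates them into clip-level adverb-type logits.

```python
# Minimal sketch: classify adverb-types from symbolic object-behaviour facts.
# The fact schema, vocabulary, and model sizes are illustrative assumptions,
# not the authors' released pipeline.
import torch
import torch.nn as nn

# Example facts a detector/tracker pipeline might emit for one clip (hypothetical schema):
# (object, predicate, argument, timestep)
facts = [
    ("person", "moves", "fast", 0),
    ("person", "holds", "knife", 1),
    ("knife", "moves", "downward", 2),
]

# Flatten each fact into tokens and map to integer ids (toy vocabulary).
vocab = {tok: i for i, tok in enumerate(
    sorted({str(t) for f in facts for t in f}))}
token_ids = torch.tensor([[vocab[str(t)] for t in f] for f in facts])  # (num_facts, 4)

class FactTransformer(nn.Module):
    def __init__(self, vocab_size, num_adverb_types, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_adverb_types)

    def forward(self, fact_tokens):              # (num_facts, tokens_per_fact)
        x = self.embed(fact_tokens).mean(dim=1)  # one embedding per fact
        x = self.encoder(x.unsqueeze(0))         # reason over the set of facts
        return self.head(x.mean(dim=1))          # clip-level adverb-type logits

model = FactTransformer(len(vocab), num_adverb_types=10)
logits = model(token_ids)
print(logits.shape)  # torch.Size([1, 10])
```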
Related papers
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z) - Video-adverb retrieval with compositional adverb-action embeddings [59.45164042078649]
Retrieving adverbs that describe an action in a video is a crucial step towards fine-grained video understanding.
We propose a framework for video-to-adverb retrieval that aligns video embeddings with their matching compositional adverb-action text embedding.
Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval.
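As a rough sketch of the general retrieval idea (not the paper's actual model; the embeddings and the additive composition are assumptions), one can rank candidate adverbs by the cosine similarity between a video embedding and a composed adverb-action text embedding:

```python
# Rough sketch of video-to-adverb retrieval via compositional text embeddings.
# The embeddings and the additive composition are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
dim = 128
video_emb = rng.normal(size=dim)                   # stand-in for a video encoder output
action_emb = rng.normal(size=dim)                  # text embedding of the action, e.g. "cut"
adverb_embs = {a: rng.normal(size=dim) for a in ["quickly", "slowly", "gently"]}

# Compose adverb and action embeddings (here: simple addition) and rank by similarity.
scores = {a: cosine(video_emb, action_emb + e) for a, e in adverb_embs.items()}
print(max(scores, key=scores.get), scores)
```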
arXiv Detail & Related papers (2023-09-26T17:31:02Z) - SOAR: Scene-debiasing Open-set Action Recognition [81.8198917049666]
We propose Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module.
The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning.
The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information.
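As a generic illustration of adversarial scene debiasing (gradient reversal is one common realisation, used here as an assumption rather than taken from SOAR's exact design), a scene classifier can be trained on reversed gradients so that the video features become uninformative about scene type:

```python
# Generic sketch of adversarial scene debiasing with a gradient-reversal layer.
# This illustrates the idea of confusing a scene classifier; it is not SOAR's exact design.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

video_feats = torch.randn(8, 256, requires_grad=True)   # stand-in backbone features
scene_head = nn.Linear(256, 5)                           # 5 hypothetical scene types
action_head = nn.Linear(256, 20)                         # 20 hypothetical action types
scene_labels = torch.randint(0, 5, (8,))
action_labels = torch.randint(0, 20, (8,))

ce = nn.CrossEntropyLoss()
# The scene loss flows backward with reversed sign into the features,
# pushing them to become uninformative about scene type.
loss = ce(action_head(video_feats), action_labels) \
     + ce(scene_head(GradReverse.apply(video_feats, 1.0)), scene_labels)
loss.backward()
```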
arXiv Detail & Related papers (2023-09-03T20:20:48Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Tell me what you see: A zero-shot action recognition method based on natural language descriptions [3.136605193634262]
We propose using video captioning methods to extract semantic information from videos.
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
We build a shared semantic space employing BERT-based embedders pre-trained in the paraphrasing task on multiple text datasets.
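A minimal sketch of this zero-shot matching, assuming a captioner output and an off-the-shelf paraphrase-trained sentence embedder (the caption, class descriptions, and model name are placeholders):

```python
# Sketch: match a generated video caption to textual class descriptions in a shared
# semantic space. The captioner output and model name are placeholder assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer  # paraphrase-trained embedder

caption = "a man is chopping vegetables on a wooden board"   # assumed captioner output
class_descriptions = {
    "cooking": "a person prepares food in a kitchen",
    "cycling": "a person rides a bicycle outdoors",
}

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")        # illustrative model choice
cap_vec = model.encode([caption])[0]
desc_vecs = {k: model.encode([v])[0] for k, v in class_descriptions.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {k: cosine(cap_vec, v) for k, v in desc_vecs.items()}
print(max(scores, key=scores.get))   # zero-shot prediction: nearest class description
```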
arXiv Detail & Related papers (2021-12-18T17:44:07Z) - Zero-Shot Action Recognition from Diverse Object-Scene Compositions [15.942187254262091]
This paper investigates the problem of zero-shot action recognition, in the setting where no training videos with seen actions are available.
For this challenging scenario, the current leading approach is to transfer knowledge from the image domain by recognizing objects in videos using pre-trained networks.
Where objects provide a local view on the content in videos, in this work we also seek to include a global view of the scene in which actions occur.
We find that scenes on their own are also capable of recognizing unseen actions, albeit more marginally than objects, and that a direct combination of object-based and scene-based scores degrades the action recognition performance.
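A toy sketch of the score fusion being discussed (all similarity values and the fusion weight are made up for illustration):

```python
# Sketch: score an unseen action from object-based and scene-based evidence.
# Similarity values and the fusion weight are illustrative, not the paper's numbers.
import numpy as np

# Semantic similarity between the action name and detected concepts (placeholder values).
object_sims = {"bicycle": 0.7, "helmet": 0.5, "road": 0.3}   # objects detected in the clip
scene_sims = {"street": 0.6, "park": 0.4}                    # scene predictions for the clip

object_score = float(np.mean(list(object_sims.values())))
scene_score = float(np.mean(list(scene_sims.values())))

alpha = 0.5   # hypothetical fusion weight; the paper reports that naive fusion can hurt
combined = alpha * object_score + (1 - alpha) * scene_score
print(object_score, scene_score, combined)
```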
arXiv Detail & Related papers (2021-10-26T08:23:14Z) - Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
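A simplified sketch of the second stage only, with the feature extractors, dimensions, and scoring head all assumed:

```python
# Simplified sketch of tracklet-language grounding: jointly encode tracklet features and
# word embeddings with a transformer, then score each tracklet against the query.
# Feature extractors, dimensions, and the scoring head are assumptions.
import torch
import torch.nn as nn

d = 128
num_tracklets, num_words = 5, 7
tracklet_feats = torch.randn(1, num_tracklets, d)   # stand-in per-tracklet video features
word_feats = torch.randn(1, num_words, d)           # stand-in language-query embeddings

layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
score_head = nn.Linear(d, 1)

tokens = torch.cat([tracklet_feats, word_feats], dim=1)   # joint visual-linguistic sequence
encoded = encoder(tokens)
tracklet_scores = score_head(encoded[:, :num_tracklets]).squeeze(-1)  # one score per tracklet
best = tracklet_scores.argmax(dim=1)   # tracklet selected as the referred object
print(tracklet_scores, best)
```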
arXiv Detail & Related papers (2021-06-02T10:26:13Z) - Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video.
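A sketch of the retrieval step alone (the copy-aware decoder is omitted, and the embeddings are random placeholders rather than a trained retriever):

```python
# Sketch of the retrieval step: fetch the most similar training-corpus sentences as hints
# for the captioner. Video/text embeddings here are random placeholders, not a real retriever.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
video_emb = rng.normal(size=dim)                      # stand-in video query embedding
corpus = ["a dog runs on the beach",
          "a chef slices onions quickly",
          "children play football in a park"]
corpus_embs = rng.normal(size=(len(corpus), dim))     # stand-in text embeddings

sims = corpus_embs @ video_emb / (
    np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(video_emb))
top_k = np.argsort(-sims)[:2]
hints = [corpus[i] for i in top_k]                    # passed on to a copy-aware decoder
print(hints)
```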
arXiv Detail & Related papers (2021-03-09T08:17:17Z) - Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model [26.78064626111014]
In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention-based method to extract contextual vector representations of video metadata, and use these representations as part of the inputs to a neural language model.
Second, we propose a hybrid pointer network approach to explicitly interpolate the probabilities of words that occur in the metadata.
arXiv Detail & Related papers (2020-05-15T07:47:33Z)
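A minimal numeric sketch of the pointer-style interpolation idea (the vocabulary, gate value, and probabilities are made up for illustration):

```python
# Minimal sketch of pointer-style interpolation for rescoring: mix the language model's
# word distribution with a copy distribution over words in the video metadata.
# The gate value and probabilities are illustrative, not taken from the paper.
import numpy as np

vocab = ["the", "game", "trailer", "review", "unboxing"]
p_lm = np.array([0.40, 0.25, 0.10, 0.15, 0.10])        # neural LM distribution
metadata_words = {"trailer", "game"}                    # words from the video's metadata
p_copy = np.array([1.0 if w in metadata_words else 0.0 for w in vocab])
p_copy /= p_copy.sum()

gate = 0.3                                              # hypothetical pointer gate
p_final = (1 - gate) * p_lm + gate * p_copy
print(dict(zip(vocab, np.round(p_final, 3))))
```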