Learning Action Changes by Measuring Verb-Adverb Textual Relationships
- URL: http://arxiv.org/abs/2303.15086v2
- Date: Tue, 23 May 2023 12:53:13 GMT
- Title: Learning Action Changes by Measuring Verb-Adverb Textual Relationships
- Authors: Davide Moltisanti, Frank Keller, Hakan Bilen, Laura Sevilla-Lara
- Abstract summary: We aim to predict an adverb indicating a modification applied to the action in a video.
We achieve state-of-the-art results on adverb prediction and antonym classification.
We focus on instructional recipe videos, curating a set of actions that exhibit meaningful visual changes when performed differently.
- Score: 40.596329888722714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this work is to understand the way actions are performed in
videos. That is, given a video, we aim to predict an adverb indicating a
modification applied to the action (e.g. cut "finely"). We cast this problem as
a regression task. We measure textual relationships between verbs and adverbs
to generate a regression target representing the action change we aim to learn.
We test our approach on a range of datasets and achieve state-of-the-art
results on both adverb prediction and antonym classification. Furthermore, we
outperform previous work when we lift two commonly assumed conditions: the
availability of action labels during testing and the pairing of adverbs as
antonyms. Existing datasets for adverb recognition are either noisy, which
makes learning difficult, or contain actions whose appearance is not influenced
by adverbs, which makes evaluation less reliable. To address this, we collect a
new high-quality dataset: Adverbs in Recipes (AIR). We focus on instructional
recipe videos, curating a set of actions that exhibit meaningful visual
changes when performed differently. Videos in AIR are more tightly trimmed and
were manually reviewed by multiple annotators to ensure high labelling quality.
Results show that models learn better from AIR given its cleaner videos. At the
same time, adverb prediction on AIR is challenging, demonstrating that there is
considerable room for improvement.
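To make the regression-target idea concrete, here is a minimal sketch. It is not the authors' actual formulation (the paper defines its own verb-adverb textual measure); an off-the-shelf sentence encoder is used only as a stand-in, and the cosine distance between a verb and its adverb-modified phrase serves as a proxy scalar target.

```python
# Hedged sketch: derive a scalar regression target for "how much an adverb
# changes an action" from verb/adverb text embeddings. This is NOT the
# paper's exact method; sentence-transformers is an assumed stand-in encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def action_change_target(verb: str, adverb: str) -> float:
    """Scalar target: larger when the adverb shifts the verb's meaning more."""
    base, modified = encoder.encode([verb, f"{verb} {adverb}"])
    cos = float(np.dot(base, modified) /
                (np.linalg.norm(base) * np.linalg.norm(modified)))
    return 1.0 - cos  # cosine distance as a proxy for the action change

# Example: "cut finely" vs "cut coarsely" yield different targets,
# which a video model could then be trained to regress.
for adv in ("finely", "coarsely"):
    print(adv, round(action_change_target("cut", adv), 3))
```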
Related papers
- Video-adverb retrieval with compositional adverb-action embeddings [59.45164042078649]
Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding.
We propose a framework for video-to-adverb retrieval that aligns video embeddings with their matching compositional adverb-action text embedding.
Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval.
arXiv Detail & Related papers (2023-09-26T17:31:02Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z)
- Verbs in Action: Improving verb understanding in video-language models [128.87443209118726]
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z)
- Do Trajectories Encode Verb Meaning? [22.409307683247967]
Grounded language models learn to connect concrete categories like nouns and adjectives to the world via images and videos.
In this paper, we investigate the extent to which trajectories (i.e. the position and rotation of objects over time) naturally encode verb semantics.
We find that trajectories correlate as-is with some verbs (e.g., fall), and that additional abstraction via self-supervised pretraining can further capture nuanced differences in verb meaning.
arXiv Detail & Related papers (2022-06-23T19:57:16Z)
- How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs [52.042261549764326]
We propose a method which recognizes adverbs across different actions.
Our approach uses semi-supervised learning with multiple adverb pseudo-labels.
We also show how adverbs can relate fine-grained actions.
arXiv Detail & Related papers (2022-03-23T11:53:41Z)
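As a rough illustration of the pseudo-label idea in the entry above (a generic semi-supervised recipe, not that paper's exact method), a model trained on labelled clips can assign multiple soft adverb pseudo-labels to unlabelled clips, keeping only confident ones for the next training round:

```python
# Hedged sketch: multi-label adverb pseudo-labelling for unlabelled clips.
# The adverb vocabulary, model, and threshold below are illustrative only.
import torch

ADVERBS = ["finely", "coarsely", "quickly", "slowly"]  # hypothetical vocabulary

def pseudo_label(model: torch.nn.Module,
                 unlabelled_clips: torch.Tensor,
                 threshold: float = 0.5):
    """Return (clip_index, adverb, confidence) triples above the threshold."""
    model.eval()
    with torch.no_grad():
        # per-adverb sigmoid scores, so one clip can receive several pseudo-labels
        scores = torch.sigmoid(model(unlabelled_clips))  # shape (N, len(ADVERBS))
    kept = []
    for i, clip_scores in enumerate(scores):
        for j, conf in enumerate(clip_scores):
            if conf.item() >= threshold:
                kept.append((i, ADVERBS[j], conf.item()))
    return kept
```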