How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs
- URL: http://arxiv.org/abs/2203.12344v1
- Date: Wed, 23 Mar 2022 11:53:41 GMT
- Title: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs
- Authors: Hazel Doughty and Cees G. M. Snoek
- Abstract summary: We propose a method which recognizes adverbs across different actions.
Our approach uses semi-supervised learning with multiple adverb pseudo-labels.
We also show how adverbs can relate fine-grained actions.
- Score: 52.042261549764326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to understand how actions are performed and identify subtle
differences, such as 'fold firmly' vs. 'fold gently'. To this end, we propose a
method which recognizes adverbs across different actions. However, such
fine-grained annotations are difficult to obtain and their long-tailed nature
makes it challenging to recognize adverbs in rare action-adverb compositions.
Our approach therefore uses semi-supervised learning with multiple adverb
pseudo-labels to leverage videos with only action labels. Combined with
adaptive thresholding of these pseudo-adverbs, we make efficient use
of the available data while tackling the long-tailed distribution.
Additionally, we gather adverb annotations for three existing video retrieval
datasets, which allows us to introduce the new tasks of recognizing adverbs in
unseen action-adverb compositions and unseen domains. Experiments demonstrate
the effectiveness of our method, which outperforms both prior work on adverb
recognition and semi-supervised methods adapted for adverb recognition. We also show
how adverbs can relate fine-grained actions.
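The abstract does not come with code, but the core idea of assigning multiple adverb pseudo-labels under adaptive thresholds can be sketched roughly as follows. This is a minimal, illustrative sketch in PyTorch; the function name, tensor shapes, and EMA-style threshold update are assumptions for illustration, not the authors' implementation.

```python
import torch

def pseudo_adverb_labels(adverb_probs, thresholds, momentum=0.9):
    """Assign multiple adverb pseudo-labels to action-only videos.

    adverb_probs: (num_videos, num_adverbs) predicted adverb probabilities
                  for videos that carry only an action label.
    thresholds:   (num_adverbs,) per-adverb confidence thresholds.
    Returns a boolean mask of accepted pseudo-adverbs and updated thresholds.
    """
    # A video keeps every adverb whose probability clears that adverb's
    # threshold, so a single clip can receive multiple pseudo-adverbs.
    accepted = adverb_probs >= thresholds.unsqueeze(0)

    # Illustrative adaptive update (EMA towards the mean confidence):
    # adverbs the model is currently unsure about get a lower bar, so
    # rare, long-tailed adverbs still collect pseudo-labels.
    thresholds = momentum * thresholds + (1 - momentum) * adverb_probs.mean(dim=0)

    return accepted, thresholds

# Toy usage: 4 unlabelled clips, 3 adverbs (e.g. "firmly", "gently", "quickly").
probs = torch.tensor([[0.8, 0.1, 0.6],
                      [0.2, 0.7, 0.3],
                      [0.9, 0.4, 0.2],
                      [0.1, 0.6, 0.7]])
mask, thr = pseudo_adverb_labels(probs, thresholds=torch.full((3,), 0.5))
```

In a semi-supervised loop, the accepted pseudo-adverbs would supervise the action-only videos alongside the fully annotated action-adverb clips; the actual losses and threshold schedule are those described in the paper, not this sketch.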
Related papers
- Enhancing Metaphor Detection through Soft Labels and Target Word Prediction [3.7676096626244986]
We develop a prompt learning framework specifically designed for metaphor detection.
We also introduce a teacher model to generate valuable soft labels.
Experimental results demonstrate that our model has achieved state-of-the-art performance.
arXiv Detail & Related papers (2024-03-27T04:51:42Z) - Generating Action-conditioned Prompts for Open-vocabulary Video Action
Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z) - Video-adverb retrieval with compositional adverb-action embeddings [59.45164042078649]
Retrieving adverbs that describe an action in a video is a crucial step towards fine-grained video understanding.
We propose a framework for video-to-adverb retrieval that aligns video embeddings with their matching compositional adverb-action text embedding.
Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval.
arXiv Detail & Related papers (2023-09-26T17:31:02Z) - Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, thereby significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z) - Verbs in Action: Improving verb understanding in video-language models [128.87443209118726]
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z) - Learning Action Changes by Measuring Verb-Adverb Textual Relationships [40.596329888722714]
We aim to predict an adverb indicating a modification applied to the action in a video.
We achieve state-of-the-art results on adverb prediction and antonym classification.
We focus on instructional recipe videos, curating a set of actions that exhibit meaningful visual changes when performed differently.
arXiv Detail & Related papers (2023-03-27T10:53:38Z) - Tell me what you see: A zero-shot action recognition method based on
natural language descriptions [3.136605193634262]
We propose using video captioning methods to extract semantic information from videos.
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
We build a shared semantic space employing BERT-based embedders pre-trained on the paraphrasing task over multiple text datasets.
arXiv Detail & Related papers (2021-12-18T17:44:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.