ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos
- URL: http://arxiv.org/abs/2411.15628v1
- Date: Sat, 23 Nov 2024 18:49:49 GMT
- Title: ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos
- Authors: Reza Ghoddoosian, Nakul Agarwal, Isht Dwivedi, Behzad Dariush
- Abstract summary: Action Concept Enhancement (ACE) improves the concept understanding of vision-language models (VLMs).
ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss.
We show the enhanced concept understanding of our VLM by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space.
- Score: 7.030989629685138
- License:
- Abstract: Vision-language models (VLMs) are capable of recognizing unseen actions. However, existing VLMs lack intrinsic understanding of procedural action concepts. Hence, they overfit to fixed labels and are not invariant to unseen action synonyms. To address this, we propose a simple fine-tuning technique, Action Concept Enhancement (ACE), to improve the robustness and concept understanding of VLMs in procedural action classification. ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss by stochastically replacing fixed labels during training. This creates new combinations of action labels over the course of fine-tuning and prevents overfitting to fixed action representations. We show the enhanced concept understanding of our VLM by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space. Our experiments on the ATA, IKEA and GTEA datasets demonstrate the efficacy of ACE in the domains of cooking and assembly, leading to significant improvements in zero-shot action classification while maintaining competitive performance on seen actions.
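The core ACE mechanism described in the abstract, stochastically swapping fixed action labels for augmented synonyms and mixing in negatives inside an auxiliary classification loss, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration and not the authors' implementation: the encoder callable, the synonym dictionary, and all hyperparameter values are assumptions made only for the example.

```python
import random

import torch
import torch.nn.functional as F


def ace_auxiliary_loss(video_emb, gt_labels, synonyms, negatives, encode_text,
                       p_replace=0.5, num_negatives=8, temperature=0.07):
    """Sketch of an ACE-style auxiliary classification loss (illustrative only).

    video_emb:   (B, D) tensor of encoded video clips.
    gt_labels:   list of B ground-truth action strings.
    synonyms:    dict mapping each action string to a list of synonym strings.
    negatives:   list of distractor action strings not in the current batch.
    encode_text: callable mapping a list of strings to an (N, D) tensor in the
                 same embedding space as video_emb (assumed VLM text encoder).
    """
    # Stochastically replace each fixed label with one of its augmented synonyms.
    candidates = [
        random.choice(synonyms[lbl])
        if synonyms.get(lbl) and random.random() < p_replace
        else lbl
        for lbl in gt_labels
    ]

    # Classification vocabulary for this step: the (possibly swapped) positives
    # plus a random sample of negative action names.
    vocab = candidates + random.sample(negatives, k=min(num_negatives, len(negatives)))
    text_emb = encode_text(vocab)  # (V, D)

    # Cosine-similarity logits between every clip and every candidate label.
    logits = F.normalize(video_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T

    # Clip i's positive sits at index i (assumes distinct ground-truth labels per batch).
    targets = torch.arange(len(gt_labels), device=video_emb.device)
    return F.cross_entropy(logits / temperature, targets)
```

In this reading, the auxiliary term would be added to the VLM's standard fine-tuning objective, so the label set each clip is classified against changes from step to step rather than staying fixed.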
Related papers
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization [23.94629999419033]
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
arXiv Detail & Related papers (2023-08-24T07:19:59Z)
- Action Sensitivity Learning for Temporal Action Localization [35.65086250175736]
We propose an Action Sensitivity Learning framework (ASL) to tackle the task of temporal action localization.
We first introduce a lightweight Action Sensitivity Evaluator to learn the action sensitivity at the class level and instance level, respectively.
Based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where the action-aware frames are sampled as positive pairs to push away the action-irrelevant frames.
arXiv Detail & Related papers (2023-05-25T04:19:14Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly-supervised temporal action localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Elaborative Rehearsal for Zero-shot Action Recognition [36.84404523161848]
Zero-shot action recognition (ZSAR) aims to recognize target (unseen) actions without training examples.
It remains challenging to semantically represent action classes and transfer knowledge from seen data.
We propose an ER-enhanced ZSAR model inspired by Elaborative Rehearsal (ER), an effective human memory technique.
arXiv Detail & Related papers (2021-08-05T20:02:46Z)
- Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries [51.48859591280838]
We present EgoACO, a deep neural architecture for video action recognition.
It learns to pool action-context-object descriptors from frame level features.
Cap uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions.
arXiv Detail & Related papers (2021-02-16T10:26:04Z)