Semantic-aware Video Representation for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2311.06218v1
- Date: Fri, 10 Nov 2023 18:13:24 GMT
- Title: Semantic-aware Video Representation for Few-shot Action Recognition
- Authors: Yutao Tang, Benjamin Bejar, Rene Vidal
- Abstract summary: We propose a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these issues.
We show that directly leveraging a 3D feature extractor combined with an effective feature-fusion scheme and a simple cosine similarity for classification can yield better performance.
Experiments on five challenging few-shot action recognition benchmarks under various settings demonstrate that the proposed SAFSAR model significantly improves the state-of-the-art performance.
- Score: 1.6486717871944268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work on action recognition leverages 3D features and textual
information to achieve state-of-the-art performance. However, most of the
current few-shot action recognition methods still rely on 2D frame-level
representations, often require additional components to model temporal
relations, and employ complex distance functions to achieve accurate alignment
of these representations. In addition, existing methods struggle to effectively
integrate textual semantics, some resorting to concatenation or addition of
textual and visual features, and some using text merely as additional
supervision without truly achieving feature fusion and information transfer
from different modalities. In this work, we propose a simple yet effective
Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these
issues. We show that directly leveraging a 3D feature extractor combined with
an effective feature-fusion scheme, and a simple cosine similarity for
classification can yield better performance without the need of extra
components for temporal modeling or complex distance functions. We introduce an
innovative scheme to encode the textual semantics into the video representation
which adaptively fuses features from text and video, and encourages the visual
encoder to extract more semantically consistent features. In this scheme,
SAFSAR achieves alignment and fusion in a compact way. Experiments on five
challenging few-shot action recognition benchmarks under various settings
demonstrate that the proposed SAFSAR model significantly improves the
state-of-the-art performance.
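Below is a minimal, hedged sketch (not the authors' released code) of the recipe the abstract describes: clip-level features from a 3D backbone, an adaptive text-video fusion step that injects class-name semantics into the support representations, and plain cosine similarity against class prototypes for few-shot classification. The gating-style fusion, all module and variable names, and the choice to fuse text only on the support side (where class labels are known) are assumptions; the abstract states only that fusion is adaptive and classification is cosine-based.

```python
# Sketch of the abstract's recipe under assumed design choices; not SAFSAR's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveTextVideoFusion(nn.Module):
    """Fuses a text embedding into a video embedding with a learned gate (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([video_feat, text_feat], dim=-1))  # per-dimension mixing weights
        return g * video_feat + (1.0 - g) * self.proj(text_feat)   # semantics-aware video feature


def cosine_few_shot_logits(query: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between L2-normalized query features and class prototypes."""
    query = F.normalize(query, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    return query @ prototypes.t()  # [num_query, num_classes]


if __name__ == "__main__":
    torch.manual_seed(0)
    dim, n_way, k_shot, n_query = 512, 5, 1, 3
    fusion = AdaptiveTextVideoFusion(dim)

    # Stand-ins for clip-level features from a 3D backbone and class-name embeddings
    # from a text encoder (both assumed; the abstract does not fix the encoders).
    support_video = torch.randn(n_way * k_shot, dim)
    support_text = torch.randn(n_way * k_shot, dim)  # one text embedding per support clip's class
    query_video = torch.randn(n_query, dim)

    # Fuse text into the support clips, average per class into prototypes,
    # then classify queries with plain cosine similarity (no extra temporal module).
    prototypes = fusion(support_video, support_text).view(n_way, k_shot, dim).mean(dim=1)
    logits = cosine_few_shot_logits(query_video, prototypes)
    print(logits.shape)  # torch.Size([3, 5])
```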
Related papers
- Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context [0.0]
We present a novel approach to improve action recognition by exploiting the hierarchical organization of actions.
Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information.
We also conduct an ablation study to assess the impact of different methods for integrating contextual and hierarchical data on action recognition performance.
arXiv Detail & Related papers (2024-10-28T17:59:35Z) - Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A visual-text aggregation module based on a Transformer is further designed to incorporate cross-modal and cross-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z) - MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
arXiv Detail & Related papers (2023-08-03T04:17:25Z) - Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset, demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - Interactive Fusion of Multi-level Features for Compositional Activity Recognition [100.75045558068874]
We present a novel framework that accomplishes this goal by interactive fusion.
We implement the framework in three steps, namely, positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction.
We evaluate our approach on two action recognition datasets, Something-Something and Charades.
arXiv Detail & Related papers (2020-12-10T14:17:18Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
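Purely as an illustration of the kind of depth-guided adaptive fusion mentioned in the AMeFu-Net entry above, the sketch below mixes an RGB clip feature with a depth clip feature through a learned scalar gate. The gate design and every name in it are assumptions for illustration only; the summary does not describe the paper's actual fusion mechanism.

```python
# Generic adaptive RGB-depth fusion sketch (assumed design, not AMeFu-Net's architecture).
import torch
import torch.nn as nn


class DepthGuidedFusion(nn.Module):
    """Adaptively re-weights an RGB clip feature using a depth clip feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        a = self.alpha(torch.cat([rgb_feat, depth_feat], dim=-1))  # per-sample mixing weight in (0, 1)
        return a * rgb_feat + (1.0 - a) * depth_feat


if __name__ == "__main__":
    fuse = DepthGuidedFusion(256)
    rgb, depth = torch.randn(4, 256), torch.randn(4, 256)
    print(fuse(rgb, depth).shape)  # torch.Size([4, 256])
```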