Task-Specific Alignment and Multiple Level Transformer for Few-Shot
Action Recognition
- URL: http://arxiv.org/abs/2307.01985v2
- Date: Fri, 1 Dec 2023 02:40:58 GMT
- Title: Task-Specific Alignment and Multiple Level Transformer for Few-Shot
Action Recognition
- Authors: Fei Guo, Li Zhu, YiWang Wang, Jing Sun
- Abstract summary: In recent years, some works have used the Transformer to process frames and obtain attention features and enhanced prototypes, with competitive results.
We address these problems through an end-to-end method named "Task-Specific Alignment and Multiple-level Transformer Network (TSA-MLT)".
Our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 benchmarks.
- Score: 11.700737340560796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the research field of few-shot learning, the main difference between
image-based and video-based tasks is the additional temporal dimension. In recent
years, some works have used the Transformer to process frames and obtain attention
features and enhanced prototypes, with competitive results. However, some video
frames may relate little to the action, and using only single frame-level or
segment-level features may not mine enough information. We address these problems
sequentially through an end-to-end method named "Task-Specific Alignment and
Multiple-level Transformer Network (TSA-MLT)". The first module, TSA, filters out
action-irrelevant frames for action-duration alignment; an affine transformation of
the frame sequence along the time dimension is used for linear sampling. The second
module, MLT, operates on multiple levels of features of the support prototype and
the query sample to mine more information for the alignment. We adopt a fusion loss
based on a fusion distance that combines the L2 sequence distance, which focuses on
temporal-order alignment, with the Optimal Transport distance, which measures the
gap between the appearance and semantics of the videos. Extensive experiments show
that our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets
and competitive results on the Kinetics and Something-Something V2 benchmarks. Our
code is available at https://github.com/cofly2014/tsa-mlt.git
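For concreteness, below is a minimal sketch (not the authors' released code) of how a fusion distance of this kind could be computed: a frame-by-frame L2 sequence distance that respects temporal order, combined with an entropic Optimal Transport distance solved with Sinkhorn iterations. The weighting factor alpha, the cost definition, and all hyperparameters are illustrative assumptions; see the repository above for the actual implementation.

# Minimal sketch of a fusion distance: order-aware L2 sequence distance
# plus an order-agnostic entropic Optimal Transport (Sinkhorn) distance.
# All parameter values here are illustrative assumptions.
import numpy as np


def l2_sequence_distance(query, support):
    """Frame-by-frame L2 distance; preserves temporal order by comparing frame t to frame t."""
    # query, support: (T, D) arrays of per-frame features with the same length T.
    return float(np.linalg.norm(query - support, axis=1).mean())


def sinkhorn_ot_distance(query, support, eps=0.1, n_iters=50):
    """Entropic OT distance between the two frame sets, ignoring temporal order."""
    # Pairwise cost matrix between all query and support frames.
    cost = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)  # (Tq, Ts)
    K = np.exp(-cost / eps)                      # Gibbs kernel
    a = np.full(len(query), 1.0 / len(query))    # uniform marginal over query frames
    b = np.full(len(support), 1.0 / len(support))
    u = np.ones_like(a)
    for _ in range(n_iters):                     # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport_plan = u[:, None] * K * v[None, :]
    return float((transport_plan * cost).sum())


def fusion_distance(query, support, alpha=0.5):
    """Convex combination of the order-aware and appearance/semantic-aware distances."""
    return alpha * l2_sequence_distance(query, support) + \
        (1 - alpha) * sinkhorn_ot_distance(query, support)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 512))   # 8 query frame features (illustrative shape)
    s = rng.normal(size=(8, 512))   # 8 support (prototype) frame features
    print(fusion_distance(q, s))

In this sketch the two terms play the complementary roles described in the abstract: the L2 term penalizes misalignment in temporal order, while the OT term compares the overall appearance and semantics of the two frame sets regardless of order.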
Related papers
- MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition [36.426688592783975]
MVP-Shot is a framework to learn and align semantic-related action features at multi-velocity levels.
MVFA module measures similarity between features from support and query videos with different velocity scales.
PST module injects velocity-tailored text information into the video feature via feature interaction on channel and temporal domains.
arXiv Detail & Related papers (2024-05-03T13:10:16Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video that belong to the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information to handle the temporally correlated nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - SODFormer: Streaming Object Detection with Transformer Using Events and
Frames [31.293847706713052]
The DAVIS camera, which streams two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges.
We propose SODFormer, a novel streaming object detector with Transformer that, for the first time, integrates events and frames to continuously detect objects in an asynchronous manner.
arXiv Detail & Related papers (2023-08-08T04:53:52Z) - Referred by Multi-Modality: A Unified Temporal Transformer for Video
Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z) - Tsanet: Temporal and Scale Alignment for Unsupervised Video Object
Segmentation [21.19216164433897]
Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting the prominent object in videos without manual guidance.
We propose a novel framework for UVOS that can address the aforementioned limitations of the two approaches.
We present experimental results on public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-03-08T04:59:43Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - Temporal-Relational CrossTransformers for Few-Shot Action Recognition [82.0033565755246]
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frames between the query video and the videos in the support set.
Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos (a rough sketch of this query-conditioned matching idea follows after this list).
A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order CrossTransformers.
arXiv Detail & Related papers (2021-01-15T15:47:35Z) - CompFeat: Comprehensive Feature Aggregation for Video Instance
Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
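As referenced in the Temporal-Relational CrossTransformers entry above, here is a rough, hypothetical sketch of the query-conditioned prototype idea: each query representation attends over all support-set representations of a class, and the attention-weighted sum serves as a query-specific prototype. The projection-free dot-product attention, the temperature, and the feature shapes are illustrative assumptions, not the paper's exact design.

# Hypothetical sketch of CrossTransformer-style matching: query features attend
# over all support features of a class to form query-specific prototypes.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_specific_prototype_distance(query_feats, support_feats, temperature=8.0):
    """Distance between each query item and its attention-built prototype, averaged.

    query_feats:   (Nq, D) query sub-sequence (tuple) features
    support_feats: (Ns, D) support sub-sequence features pooled over all class videos
    """
    attn = softmax(query_feats @ support_feats.T / temperature, axis=-1)  # (Nq, Ns)
    prototypes = attn @ support_feats                                     # (Nq, D)
    return float(np.linalg.norm(query_feats - prototypes, axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q = rng.normal(size=(6, 256))    # query tuple features (illustrative shape)
    s = rng.normal(size=(30, 256))   # support tuple features from several class videos
    print(query_specific_prototype_distance(q, s))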