A$^2$M$^2$-Net: Adaptively Aligned Multi-Scale Moment for Few-Shot Action Recognition
- URL: http://arxiv.org/abs/2509.17638v1
- Date: Mon, 22 Sep 2025 11:44:14 GMT
- Title: A$^2$M$^2$-Net: Adaptively Aligned Multi-Scale Moment for Few-Shot Action Recognition
- Authors: Zilin Gao, Qilong Wang, Bingbing Zhang, Qinghua Hu, Peihua Li
- Abstract summary: A$^2$M$^2$-Net is able to handle the challenging temporal misalignment problem by establishing an adaptive alignment protocol for strong representation. The experiments are conducted on five widely used FSAR benchmarks, and the results show our A$^2$M$^2$-Net achieves very competitive performance compared to state-of-the-art methods.
- Score: 56.79651392604733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thanks to its ability to alleviate the cost of large-scale annotation, few-shot action recognition (FSAR) has attracted increasing attention from researchers in recent years. Existing FSAR approaches typically neglect the role of individual motion patterns during comparison and under-explore feature statistics for video dynamics. As a result, they struggle to handle the challenging temporal misalignment in video dynamics, particularly when using 2D backbones. To overcome these limitations, this work proposes an adaptively aligned multi-scale second-order moment network, namely A$^2$M$^2$-Net, which describes the latent video dynamics with a collection of powerful representation candidates and adaptively aligns them in an instance-guided manner. To this end, A$^2$M$^2$-Net involves two core components: adaptive alignment (A$^2$ module) for matching, and multi-scale second-order moment (M$^2$ block) for strong representation. Specifically, the M$^2$ block develops a collection of semantic second-order descriptors at multiple spatio-temporal scales. Furthermore, the A$^2$ module adaptively selects informative candidate descriptors while considering the individual motion pattern. In this way, A$^2$M$^2$-Net handles the challenging temporal misalignment problem by establishing an adaptive alignment protocol over strong representations. Notably, the proposed method generalizes well to various few-shot settings and diverse metrics. Experiments are conducted on five widely used FSAR benchmarks, and the results show that A$^2$M$^2$-Net achieves very competitive performance compared to state-of-the-art methods, demonstrating its effectiveness and generalization.
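A minimal sketch of how the two components could fit together, assuming PyTorch, a 2D backbone that yields per-frame feature maps, and interpreting the "second-order moment" as a covariance over local descriptors; the names M2Block, A2Align, and second_order_moment are hypothetical, and the query-conditioned soft matching is only one plausible realization of adaptive alignment, not the authors' implementation.

```python
# Illustrative sketch only (assumptions noted above), not the paper's code.
import torch
import torch.nn.functional as F


def second_order_moment(x: torch.Tensor) -> torch.Tensor:
    """Covariance descriptor of local features x with shape (N, C) -> (C, C)."""
    x = x - x.mean(dim=0, keepdim=True)          # center the N local descriptors
    return x.t() @ x / max(x.shape[0] - 1, 1)    # second-order statistics


class M2Block(torch.nn.Module):
    """Builds candidate second-order descriptors at several temporal scales."""

    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) clip features from a 2D backbone.
        C = feats.shape[1]
        candidates = []
        for s in self.scales:                    # split the clip into s segments
            for seg in feats.chunk(s, dim=0):
                local = seg.permute(0, 2, 3, 1).reshape(-1, C)  # (N, C) tokens
                candidates.append(second_order_moment(local).flatten())
        return torch.stack(candidates)           # (num_candidates, C*C)


class A2Align(torch.nn.Module):
    """Instance-guided weighting of candidate descriptors (one simple realization)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = torch.nn.Linear(dim, dim)

    def forward(self, query_cands: torch.Tensor, support_cands: torch.Tensor) -> torch.Tensor:
        # query_cands / support_cands: (num_candidates, D) descriptor sets.
        q = F.normalize(self.query_proj(query_cands), dim=-1)
        s = F.normalize(support_cands, dim=-1)
        attn = F.softmax(q @ s.t() / q.shape[-1] ** 0.5, dim=-1)  # soft alignment
        aligned = attn @ s                       # support candidates re-weighted per query
        return F.cosine_similarity(q, aligned, dim=-1).mean()     # matching score


# Toy usage: score one query clip against one support clip.
if __name__ == "__main__":
    m2, a2 = M2Block(), A2Align(dim=64 * 64)
    query, support = torch.randn(8, 64, 7, 7), torch.randn(8, 64, 7, 7)
    print(float(a2(m2(query), m2(support))))
```

In a 5-way K-shot episode, the same score would be computed between a query clip and each class's support descriptors, with the class of highest aligned similarity taken as the prediction.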
Related papers
- Beyond a Single Perspective: Text Anomaly Detection with Multi-View Language Representations [48.7146621463489]
Text anomaly detection (TAD) plays a critical role in various language-driven real-world applications, including harmful content moderation, phishing detection, and spam review filtering. While two-step "embedding-detector" TAD methods have shown state-of-the-art performance, their effectiveness is often limited by the use of a single embedding model and the lack of adaptability across diverse datasets and anomaly types. We propose to exploit the embeddings from multiple pretrained language models and integrate them into $MCA^2$, a multi-view TAD framework.
arXiv Detail & Related papers (2026-01-25T10:52:59Z) - Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards [13.663839318595505]
We seek to answer what it would take to simultaneously align a model across various domains spanning those with verifiable and non-verifiable rewards. We propose a unified framework that standardizes process reward model (PRM) training across both verifiable and non-verifiable settings. Experiments across math reasoning, value alignment, and multi-turn dialogue show that our framework improves performance across multiple objectives simultaneously.
arXiv Detail & Related papers (2025-10-01T17:54:15Z) - Mavors: Multi-granularity Video Representation for Multimodal Large Language Model [39.24524388617938]
$\mathbf{Mavors}$ is a novel framework for holistic long-video modeling. Mavors encodes raw video content into latent representations through two core components. The framework unifies image and video understanding by treating images as single-frame videos.
arXiv Detail & Related papers (2025-04-14T10:14:44Z) - R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts [21.119495676190127]
In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs). We propose a novel and efficient method, "Re-Routing in Test-Time (R2-T2)", that locally optimizes the vector of routing weights at test time. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
arXiv Detail & Related papers (2025-02-27T18:59:32Z) - TD$^2$-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation [76.24766055944554]
We introduce a network named TD$^2$-Net that aims at denoising and debiasing for dynamic SGG.
TD$^2$-Net outperforms the second-best competitor by 12.7% on mean-Recall@10 for predicate classification.
arXiv Detail & Related papers (2024-01-23T04:17:42Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - An end-to-end multi-scale network for action prediction in videos [31.967024536359908]
We develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner.
Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101.
arXiv Detail & Related papers (2022-12-31T06:58:41Z) - PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a VSOD network with up and down parallel symmetry, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z) - Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatial-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.