A Multi-Modal Transformer Network for Action Detection
- URL: http://arxiv.org/abs/2305.19624v1
- Date: Wed, 31 May 2023 07:50:38 GMT
- Title: A Multi-Modal Transformer Network for Action Detection
- Authors: Matthew Korban, Scott T. Acton, Peter Youngs
- Abstract summary: This paper proposes a novel multi-modal transformer network for detecting actions in untrimmed videos.
We suggest an algorithm that corrects the motion distortion caused by camera movement.
Our proposed algorithm outperforms the state-of-the-art methods on two public benchmarks.
- Score: 15.104201344012347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel multi-modal transformer network for detecting
actions in untrimmed videos. To enrich the action features, our transformer
network utilizes a new multi-modal attention mechanism that computes the
correlations between different spatial and motion modalities combinations.
Exploring such correlations for actions has not been attempted previously. To
use the motion and spatial modality more effectively, we suggest an algorithm
that corrects the motion distortion caused by camera movement. Such motion
distortion, common in untrimmed videos, severely reduces the expressive power
of motion features such as optical flow fields. Our proposed algorithm
outperforms the state-of-the-art methods on two public benchmarks, THUMOS14 and
ActivityNet. We also conducted comparative experiments on our new instructional
activity dataset, including a large set of challenging classroom videos
captured from elementary schools.
Related papers
- Two-Stream temporal transformer for video action classification [47.53991869205973]
Motion representation plays an important role in video understanding and has many applications including encoder action recognition, robot and autonomous guidance or others.<n>Lately, transformer networks, through their self-attention mechanism capabilities, have proved their efficiency in many applications.
arXiv Detail & Related papers (2026-01-20T15:47:00Z) - MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation [44.524568858995586]
MotionRAG is a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos.<n>Our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference.
arXiv Detail & Related papers (2025-09-30T15:26:04Z) - MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios.<n>We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens.<n>We also introduce a textitDelay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
arXiv Detail & Related papers (2025-09-28T04:20:56Z) - MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration [53.180212987726556]
We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks.<n>Our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects.
arXiv Detail & Related papers (2025-09-08T15:34:31Z) - Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning [50.4776422843776]
Follow-Your-Motion is an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion.<n>We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.
arXiv Detail & Related papers (2025-06-05T16:18:32Z) - Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning [73.7808110878037]
This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++)<n>By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases.<n>Our experiments validate the effectiveness of MDST++, demonstrating their consistent superiority over state-of-the-art methods on mainstream benchmarks.
arXiv Detail & Related papers (2025-05-26T13:06:01Z) - EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models [73.96414072072048]
Existing motion transfer methods explored the motion representations of reference videos to guide generation.
We propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer.
Our experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability.
arXiv Detail & Related papers (2025-03-25T05:51:14Z) - Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models [3.052019331122618]
We propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon.
Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics.
Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios.
arXiv Detail & Related papers (2025-01-22T03:01:54Z) - Motion Inversion for Video Customization [31.607669029754874]
We present a novel approach for motion in generation, addressing the widespread gap in the exploration of motion representation within video models.
We introduce Motion Embeddings, a set of explicit, temporally coherent embeddings derived from given video.
Our contributions include a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method.
arXiv Detail & Related papers (2024-03-29T14:14:22Z) - Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms.
SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics.
Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth
Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - CDN-MEDAL: Two-stage Density and Difference Approximation Framework for
Motion Analysis [3.337126420148156]
We propose a novel, two-stage method of change detection with two convolutional neural networks.
Our two-stage framework contains approximately 3.5K parameters in total but still maintains rapid convergence to intricate motion patterns.
arXiv Detail & Related papers (2021-06-07T16:39:42Z) - Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z) - Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and flexible to be embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z) - MotionSqueeze: Neural Motion Feature Learning for Video Understanding [46.82376603090792]
Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information.
In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features.
We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost.
arXiv Detail & Related papers (2020-07-20T08:30:14Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply internative, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.