Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization
- URL: http://arxiv.org/abs/2512.11189v1
- Date: Fri, 12 Dec 2025 00:34:51 GMT
- Title: Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization
- Authors: Anh-Kiet Duong, Petra Gomez-Krämer
- Abstract summary: We present our solution to the BinEgo-360 Challenge, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
- Score: 1.38120109831448
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
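The abstract describes the method only at a high level. Below is a minimal PyTorch sketch of how the stated ingredients could fit together: a TSM-style temporal shift, a per-interval classifier with an extra background class, an auxiliary scene-classification head, and a weighted ensemble. A toy per-frame CNN stands in for the real TSM backbone; module names, class counts, the 0.5 scene-loss weight, and the ensemble weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels forward/backward along time (the core TSM idea).

    x: (batch, time, channels, height, width)
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift part of the channels forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift another part backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels unchanged
    return out


class MultiTaskTSM(nn.Module):
    """Toy stand-in for the extended TSM: per-interval action logits plus scene logits."""

    def __init__(self, num_actions=20, num_scenes=10, feat_dim=64):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.action_head = nn.Linear(feat_dim, num_actions + 1)  # +1 for the background class
        self.scene_head = nn.Linear(feat_dim, num_scenes)        # auxiliary multi-task head

    def forward(self, clip):  # clip: (B, T, 3, H, W), one fixed-length interval per sample
        b, t = clip.shape[:2]
        x = self.conv1(clip.flatten(0, 1))                  # per-frame features
        x = temporal_shift(x.view(b, t, *x.shape[1:]))      # exchange information across frames
        x = self.conv2(x.flatten(0, 1))
        feats = self.pool(x).flatten(1).view(b, t, -1).mean(dim=1)  # average over time
        return self.action_head(feats), self.scene_head(feats)


def weighted_ensemble(prob_list, weights):
    """Weighted average of per-interval class probabilities from several models."""
    stacked = torch.stack([w * p for w, p in zip(weights, prob_list)])
    return stacked.sum(dim=0) / sum(weights)


# Joint optimization over a batch of non-overlapping intervals (loss weight is illustrative).
model = MultiTaskTSM()
clip = torch.randn(2, 8, 3, 64, 64)              # two 8-frame intervals
action_logits, scene_logits = model(clip)
action_labels = torch.tensor([5, 20])            # index 20 plays the background role here
scene_labels = torch.tensor([3, 7])
loss = F.cross_entropy(action_logits, action_labels) + 0.5 * F.cross_entropy(scene_logits, scene_labels)
```

In this formulation, each fixed-length non-overlapping interval whose top-scoring class is not background yields a detection for that interval, which is how a clip-level classifier is turned into a temporal action localizer.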
Related papers
- Taming Modality Entanglement in Continual Audio-Visual Segmentation [30.143320890304366]
We introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Two critical challenges are identified: 1) multi-modal semantic drift and 2) co-occurrence confusion. A Collision-based Multi-modal Rehearsal framework is designed to address these challenges.
arXiv Detail & Related papers (2025-10-20T07:23:36Z) - OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment [79.98946571424607]
We present OmniBridge, a unified framework that supports vision-language understanding, generation, and retrieval within a single architecture. To address the challenge of task interference, we propose a two-stage decoupled training strategy. Experiments demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks.
arXiv Detail & Related papers (2025-09-23T13:57:55Z) - MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization [52.149337961205624]
We propose a framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. For intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative ambiguities.
arXiv Detail & Related papers (2025-09-16T09:48:52Z) - TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving [22.22943635900334]
TEM^3-Learning is a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition. It achieves state-of-the-art accuracy across all four tasks, maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed.
arXiv Detail & Related papers (2025-06-22T16:12:27Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling. Our work demonstrates that a task-specific vision-text approach can build a generalizable model for spatio-temporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - Action Recognition Using Temporal Shift Module and Ensemble Learning [0.0]
The paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at ICPR 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes.
arXiv Detail & Related papers (2025-01-29T10:36:55Z) - The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024 [27.30100635072298]
TAL focuses on identifying and classifying actions within specific time intervals throughout a video sequence.
We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset.
For feature extraction, we utilize state-of-the-art models, including UMT, VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features.
arXiv Detail & Related papers (2024-10-08T01:07:21Z) - MmAP : Multi-modal Alignment Prompt for Cross-domain Multi-task Learning [29.88567810099265]
Multi-task learning is designed to train multiple correlated tasks simultaneously.
To tackle this challenge, we integrate the decoder-free vision-language model CLIP.
We propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns text and visual modalities during fine-tuning process.
arXiv Detail & Related papers (2023-12-14T03:33:02Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)