Related papers: Reframing Dense Action Detection (RefDense): A Paradigm Shift in Problem Solving & a Novel Optimization Strategy

Reframing Dense Action Detection (RefDense): A Paradigm Shift in Problem Solving & a Novel Optimization Strategy

URL: http://arxiv.org/abs/2501.18509v2
Date: Tue, 11 Mar 2025 12:34:08 GMT
Title: Reframing Dense Action Detection (RefDense): A Paradigm Shift in Problem Solving & a Novel Optimization Strategy
Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton,
Abstract summary: We argue that handling the dual challenge of temporal and class overlaps is too complex to be tackled by a single network.<n>We propose to decompose the task of detecting dense ambiguous actions into detecting dense unambiguous sub-concepts.<n>Our experiments demonstrate the superiority of our approach over state-of-the-art methods.
Score: 23.100602876056165
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dense action detection involves detecting multiple co-occurring actions while action classes are often ambiguous and represent overlapping concepts. We argue that handling the dual challenge of temporal and class overlaps is too complex to effectively be tackled by a single network. To address this, we propose to decompose the task of detecting dense ambiguous actions into detecting dense, unambiguous sub-concepts that form the action classes (i.e., action entities and action motions), and assigning these sub-tasks to distinct sub-networks. By isolating these unambiguous concepts, the sub-networks can focus exclusively on resolving a single challenge, dense temporal overlaps. Furthermore, simultaneous actions in a video often exhibit interrelationships, and exploiting these relationships can improve the method performance. However, current dense action detection networks fail to effectively learn these relationships due to their reliance on binary cross-entropy optimization, which treats each class independently. To address this limitation, we propose providing explicit supervision on co-occurring concepts during network optimization through a novel language-guided contrastive learning loss. Our extensive experiments demonstrate the superiority of our approach over state-of-the-art methods, achieving substantial improvements of 3.8% and 1.7% on average across all metrics on the challenging benchmark datasets, Charades and MultiTHUMOS.

Related papers

Low-rank Prompt Interaction for Continual Vision-Language Retrieval [47.323830129786145]
We propose the Low-rank Prompt Interaction to address the problem of multi-modal understanding.<n>Considering that the training parameters scale to the number of layers and tasks, we propose low-rank interaction-augmented decomposition.<n>We also adopt hierarchical low-rank contrastive learning to ensure robustness training.
arXiv Detail & Related papers (2025-01-24T10:00:47Z)
Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
An Effective-Efficient Approach for Dense Multi-Label Action Detection [23.100602876056165]
It is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information.
arXiv Detail & Related papers (2024-06-10T11:33:34Z)
Leveraging triplet loss for unsupervised action segmentation [0.0]
We propose a fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions. Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches.
arXiv Detail & Related papers (2023-04-13T11:10:16Z)
DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation [84.78383981697377]
Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue. We develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention. We achieve state-of-the-art accuracy, eg, 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-04T20:27:18Z)
Learning Transferable Adversarial Robust Representations via Multi-view Consistency [57.73073964318167]
We propose a novel meta-adversarial multi-view representation learning framework with dual encoders. We demonstrate the effectiveness of our framework on few-shot learning tasks from unseen domains.
arXiv Detail & Related papers (2022-10-19T11:48:01Z)
Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training. We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly networks and gradient networks trained with policy methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
CTRN: Class-Temporal Relational Network for Action Detection [7.616556723260849]
We introduce an end-to-end network: Class-Temporal Network (CTRN) CTRN contains three key components: The Transform Representation Module, the Class-Temporal Module and the G-classifier. We evaluate CTR on three densely labelled datasets and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-10-26T08:15:47Z)
Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications. We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class. Although progress has been made in coarse-grained actions, existing few-shot recognition methods encounter two issues handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Snoma-Temporally Detection (WSSTAD) in surveillance video. WSSTAD aims to localize a-temporal tube (i.e. sequence of bounding boxes at consecutive times) that encloses abnormal event. We propose a dual-branch network which takes as input proposals with multi-granularities in both spatial-temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z)
Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision. We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z)
Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation [87.1188556802942]
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. We propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions. Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain.
arXiv Detail & Related papers (2021-05-17T13:42:09Z)
Decoupled and Memory-Reinforced Networks: Towards Effective Feature Learning for One-Step Person Search [65.51181219410763]
One-step methods have been developed to handle pedestrian detection and identification sub-tasks using a single network. There are two major challenges in the current one-step approaches. We propose a decoupled and memory-reinforced network (DMRNet) to overcome these problems.
arXiv Detail & Related papers (2021-02-22T06:19:45Z)
Multi-object Tracking with a Hierarchical Single-branch Network [31.680667324595557]
We propose an online multi-object tracking framework based on a hierarchical single-branch network. Our novel iHOIM loss function unifies the objectives of the two sub-tasks and encourages better detection performance. Experimental results on MOT16 and MOT20 datasets show that we can achieve state-of-the-art tracking performance.
arXiv Detail & Related papers (2021-01-06T12:14:58Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.