Object Aware Egocentric Online Action Detection
- URL: http://arxiv.org/abs/2406.01079v1
- Date: Mon, 3 Jun 2024 07:58:40 GMT
- Title: Object Aware Egocentric Online Action Detection
- Authors: Joungbin An, Yunsu Park, Hyolim Kang, Seon Joo Kim
- Abstract summary: We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our work can be seamlessly integrated into existing models with minimal overhead and bring consistent performance enhancements.
- Score: 23.504280692701272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in egocentric video datasets like Ego4D, EPIC-Kitchens, and Ego-Exo4D have enriched the study of first-person human interactions, which is crucial for applications in augmented reality and assisted living. Despite these advancements, current Online Action Detection methods, which efficiently detect actions in streaming videos, are predominantly designed for exocentric views and thus fail to capitalize on the unique perspectives inherent to egocentric videos. To address this gap, we introduce an Object-Aware Module that integrates egocentric-specific priors into existing OAD frameworks, enhancing first-person footage interpretation. Utilizing object-specific details and temporal dynamics, our module improves scene understanding in detecting actions. Validated extensively on the EPIC-Kitchens-100 dataset, our work can be seamlessly integrated into existing models with minimal overhead and bring consistent performance enhancements, marking an important step forward in adapting action detection systems to egocentric video analysis.
Related papers
- Ego3DT: Tracking Every 3D Object in Ego-centric Videos [20.96550148331019]
This paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects in ego-centric video.
We present Ego3DT, a novel framework that first identifies objects within the ego environment and extracts their detection and segmentation information.
We also introduce a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos.
arXiv Detail & Related papers (2024-10-11T05:02:31Z)
- EAGLE: Egocentric AGgregated Language-video Engine [34.60423566630983]
We introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks.
Egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective.
arXiv Detail & Related papers (2024-09-26T04:17:27Z)
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views [51.53089073920215]
Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception.
Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view.
We present EgoChoir, which links object structures with interaction contexts inherent in appearance and head motion to reveal object affordance.
arXiv Detail & Related papers (2024-05-22T14:03:48Z)
- X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization [56.75782714530429]
We propose a cross-modal adaptation framework, which we call X-MIC.
Our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space.
This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization.
arXiv Detail & Related papers (2024-03-28T19:45:35Z)
- Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective [13.776455033015216]
We introduce a novel cross-view learning approach to action recognition.
First, we introduce a novel geometry-based constraint into the Transformer's self-attention mechanism.
Then, we propose a new cross-view self-attention loss, learned on unpaired cross-view data, that encourages the self-attention mechanism to transfer knowledge across views.
arXiv Detail & Related papers (2023-05-25T04:14:49Z)
- Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention [45.60789439017625]
Short-term action anticipation (STA) in first-person videos is a challenging task.
We propose a novel approach that applies a guided attention mechanism between objects.
Our method, GANO, is a multi-modal, end-to-end, single transformer-based network.
arXiv Detail & Related papers (2023-05-22T11:56:10Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.