EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding
- URL: http://arxiv.org/abs/2309.02423v1
- Date: Tue, 5 Sep 2023 17:51:16 GMT
- Title: EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding
- Authors: Yue Xu, Yong-Lu Li, Zhemin Huang, Michael Xu Liu, Cewu Lu, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: This paper proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA).
We contribute comprehensive pre-train sets, balanced test sets and a new baseline, together with a training-finetuning strategy.
We believe our data and findings will pave a new way for Ego-HOI understanding.
- Score: 99.904140768186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the surge in attention to Egocentric Hand-Object Interaction (Ego-HOI),
large-scale datasets such as Ego4D and EPIC-KITCHENS have been proposed.
However, most current research is built on resources derived from third-person
video action recognition. This inherent domain gap between first- and
third-person action videos, which has not been adequately addressed before,
makes current Ego-HOI research suboptimal. This paper rethinks the problem and
proposes a new framework as an infrastructure to advance Ego-HOI recognition by
Probing, Curation and Adaption (EgoPCA). We contribute comprehensive pre-train
sets, balanced test sets and a new baseline, together with a
training-finetuning strategy. With our new framework, we not only achieve
state-of-the-art performance on Ego-HOI benchmarks but also build several new
and effective mechanisms and settings to advance further research. We believe
our data and findings will pave a new way for Ego-HOI understanding. Code
and data are available at https://mvig-rhos.com/ego_pca
Related papers
- MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
It can be seamlessly integrated into existing models with minimal overhead and brings consistent performance improvements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (a generic, hedged sketch of this kind of cross-view alignment appears after this list).
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- Ego-Only: Egocentric Action Detection without Exocentric Transferring [37.89647493482049]
We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos without any exocentric (third-person) transferring.
arXiv Detail & Related papers (2023-01-03T22:22:34Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations that advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
- Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)
- Enhanced Self-Perception in Mixed Reality: Egocentric Arm Segmentation and Database with Automatic Labelling [1.0149624140985476]
This study focuses on the egocentric segmentation of arms to improve self-perception in Augmented Virtuality.
We report results on different real egocentric hand datasets, including GTEA Gaze+, EDSH, EgoHands, Ego Youtube Hands, THU-Read, TEgO, FPAB, and Ego Gesture.
Results confirm the suitability of the EgoArm dataset for this task, with improvements of up to 40% over the original network.
arXiv Detail & Related papers (2020-03-27T12:09:27Z)
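As a rough illustration of the cross-view alignment idea mentioned in the Retrieval-Augmented Egocentric Video Captioning entry above, the following is a minimal sketch of a contrastive loss that pulls egocentric and exocentric video features toward shared text features. This is not the paper's actual EgoExoNCE implementation; the function name, tensor shapes, and temperature value are assumptions made for illustration only.

```python
# Minimal, illustrative sketch (not the paper's EgoExoNCE implementation) of a
# cross-view contrastive loss: egocentric and exocentric video features are both
# pulled toward shared text features that describe the same action.
import torch
import torch.nn.functional as F

def cross_view_text_nce(ego_feats: torch.Tensor,
                        exo_feats: torch.Tensor,
                        text_feats: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """All inputs are (B, D); row i of each tensor corresponds to the same action."""
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    # Matched (video, text) pairs sit on the diagonal of the similarity matrices.
    targets = torch.arange(ego.size(0), device=ego.device)

    # Standard InfoNCE in each view-to-text direction; aligning both views to the
    # same text embedding implicitly draws ego and exo features toward each other.
    loss_ego = F.cross_entropy(ego @ txt.t() / temperature, targets)
    loss_exo = F.cross_entropy(exo @ txt.t() / temperature, targets)
    return 0.5 * (loss_ego + loss_exo)
```

In practice, the three embeddings would come from egocentric clips, exocentric clips, and action descriptions drawn from the same batch; sharing the text anchor is what makes the two video views comparable without direct ego-exo pairs.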