ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
- URL: http://arxiv.org/abs/2504.01472v1
- Date: Wed, 02 Apr 2025 08:24:35 GMT
- Title: ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
- Authors: Yuejiao Su, Yi Wang, Qiongyang Hu, Chuang Yang, Lap-Pui Chau
- Abstract summary: This paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image with the query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding. The Ego-IRGBench dataset includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions.
- Score: 16.338872733140832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric interaction perception is one of the essential branches in investigating human-environment interaction, which lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot yield coherent textual and pixel-level responses simultaneously according to user queries, which lacks flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image with the query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets cannot meet the conditions for the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset based on extensive manual efforts, which includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model to generate text- and pixel-level outputs utilizing multimodal large language models, which enables a comprehensive interpretation of egocentric interactions. The experiments on the Ego-IRGBench exhibit the effectiveness of our ANNEXE model compared with other works.
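As a rough, hypothetical sketch of the input/output contract the abstract describes (an egocentric image plus a query mapping to an analysis, an answer, and a pixel-level mask), the following Python stub may help; the class, function, and `model` method names are invented for illustration and are not the ANNEXE API:

```python
# Hypothetical sketch of the Ego-IRG task interface described in the abstract.
# Not the ANNEXE implementation; `model` and its methods are assumed placeholders.
from dataclasses import dataclass
import numpy as np

@dataclass
class EgoIRGResponse:
    analysis: str        # free-form reasoning about the depicted interaction
    answer: str          # fluent textual answer to the user query
    mask: np.ndarray     # H x W boolean mask grounding the answer at pixel level

def ego_irg(image: np.ndarray, query: str, model) -> EgoIRGResponse:
    """Run the three steps named in the paper: analyzing, answering, pixel grounding."""
    analysis = model.analyze(image, query)            # step 1: analyze the interaction
    answer = model.answer(image, query, analysis)     # step 2: produce a textual answer
    mask = model.ground(image, query, answer)         # step 3: ground the answer in pixels
    return EgoIRGResponse(analysis=analysis, answer=answer, mask=mask)
```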
Related papers
- Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. Under this setting, we accomplish InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z) - Egocentric Human-Object Interaction Detection: A New Benchmark and Method [14.765419467710812]
Ego-HOIBench is a new dataset to promote the benchmarking and development of Ego-HOI detection. Our approach is lightweight and effective, and it can be easily applied to HOI baselines in a plug-and-play manner.
arXiv Detail & Related papers (2025-06-17T05:03:42Z) - EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z) - Generating Fine Details of Entity Interactions [17.130839907951877]
This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios.
We propose a decomposition-augmented refinement procedure to address interaction generation challenges.
Our approach, DetailScribe, uses a VLM to critique generated images and applies targeted interventions within the diffusion process during refinement (sketched below).
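A minimal, hypothetical sketch of such a critique-and-refine loop; the `generator` and `vlm_critic` objects and their methods are placeholders, not the DetailScribe API:

```python
# Hypothetical critique-and-refine loop in the spirit of the summary above.
# The objects and method names are placeholders, not the paper's code.
def refine_with_critic(prompt, generator, vlm_critic, max_rounds: int = 3):
    image = generator.generate(prompt)                     # initial diffusion sample
    for _ in range(max_rounds):
        critique = vlm_critic.critique(image, prompt)      # VLM flags interaction errors
        if critique.is_satisfactory:                       # stop once the critic is satisfied
            break
        # targeted intervention: partially re-run diffusion conditioned on the critique
        image = generator.refine(image, prompt, critique)
    return image
```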
arXiv Detail & Related papers (2025-04-11T17:24:58Z) - EgoLife: Towards Egocentric Life Assistant [60.51196061794498]
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. We conduct a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. We introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks.
arXiv Detail & Related papers (2025-03-05T18:54:16Z) - Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives [12.709881592333995]
We introduce Hier-EgoPack, which advances EgoPack by enabling reasoning across diverse temporal granularities. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning.
arXiv Detail & Related papers (2025-02-04T17:03:49Z) - Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z) - EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views [51.53089073920215]
Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception.
Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view.
We present EgoChoir, which links object structures with interaction contexts inherent in appearance and head motion to reveal object affordance.
arXiv Detail & Related papers (2024-05-22T14:03:48Z) - EgoPet: Egomotion and Interaction Data from an Animal's Perspective [82.7192364237065]
We introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction.
EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles.
We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource to robotic quadruped locomotion.
arXiv Detail & Related papers (2024-04-15T17:59:47Z) - Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
Holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition, and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z) - Emergent Communication in Interactive Sketch Question Answering [38.38087954142305]
Vision-based emergent communication (EC) aims to learn to communicate through sketches and demystify the evolution of human communication.
We first introduce a novel Interactive Sketch Question Answering (ISQA) task, where two collaborative players interact through sketches to answer a question about an image in a multi-round manner.
Our experimental results, including human evaluation, demonstrate that the multi-round interactive mechanism facilitates targeted and efficient communication between intelligent agents with decent human interpretability.
arXiv Detail & Related papers (2023-10-24T08:00:20Z) - Modeling Cross-view Interaction Consistency for Paired Egocentric Interaction Recognition [16.094976277810556]
Paired egocentric interaction recognition (PEIR) is the task of collaboratively recognizing the interactions between two persons from the videos captured in their corresponding views.
We propose to build the relevance between the two views using bilinear pooling, which captures the consistency of the two views at the feature level (sketched below).
Experimental results on the PEV dataset show the superiority of the proposed methods on the PEIR task.
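For readers unfamiliar with the operation, a generic bilinear-pooling fusion of two view features might look as follows; this is an illustrative sketch using PyTorch's nn.Bilinear, not the paper's implementation, and the feature dimensions are made up:

```python
# Generic bilinear pooling of features from two egocentric views (illustrative only).
import torch
import torch.nn as nn

class CrossViewBilinear(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        # nn.Bilinear computes out_k = x^T W_k y + b_k for each output channel k
        self.bilinear = nn.Bilinear(dim_a, dim_b, dim_out)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (batch, dim_a), feat_b: (batch, dim_b) -> (batch, dim_out)
        return self.bilinear(feat_a, feat_b)

# Example: fuse 512-d features from the two views into a 256-d joint representation.
fusion = CrossViewBilinear(512, 512, 256)
joint = fusion(torch.randn(4, 512), torch.randn(4, 512))  # shape: (4, 256)
```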
arXiv Detail & Related papers (2020-03-24T05:05:34Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding (sketched below).
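A hypothetical sketch of such a two-module, coarse-to-fine cascade; the module interfaces are invented for illustration and do not reflect the paper's code:

```python
# Hypothetical coarse-to-fine HOI cascade; module names and interfaces are placeholders.
def cascaded_hoi(image, localizer, recognizer, num_stages: int = 3):
    proposals = localizer.initial_proposals(image)            # coarse human/object proposals
    interactions = []
    for _ in range(num_stages):
        proposals = localizer.refine(image, proposals)        # progressively sharpen proposals
        interactions = recognizer.classify(image, proposals)  # predict interaction labels
    return proposals, interactions
```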
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.