POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object
Interaction in the Multi-View World
- URL: http://arxiv.org/abs/2403.05856v1
- Date: Sat, 9 Mar 2024 09:54:44 GMT
- Title: POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object
Interaction in the Multi-View World
- Authors: Boshen Xu, Sipeng Zheng, Qin Jin
- Abstract summary: Humans are good at translating third-person observations of hand-object interactions into an egocentric view.
We propose a Prompt-Oriented View-agnostic learning framework, which enables this view adaptation with few egocentric videos.
- Score: 59.545114016224254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We humans are good at translating third-person observations of hand-object
interactions (HOI) into an egocentric view. However, current methods struggle
to replicate this ability of view adaptation from third-person to first-person.
Although some approaches attempt to learn view-agnostic representation from
large-scale video datasets, they ignore the relationships among multiple
third-person views. To this end, we propose a Prompt-Oriented View-agnostic
learning (POV) framework in this paper, which enables this view adaptation with
few egocentric videos. Specifically, we introduce interactive masking prompts
at the frame level to capture fine-grained action information, and view-aware
prompts at the token level to learn view-agnostic representation. To verify our
method, we establish two benchmarks for transferring from multiple third-person
views to the egocentric view. Our extensive experiments on these benchmarks
demonstrate the efficiency and effectiveness of our POV framework and prompt
tuning techniques in terms of view adaptation and view generalization. Our code
is available at https://github.com/xuboshen/pov_acmmm2023.
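The abstract only names the two prompt types, but the general mechanism (learnable prompts attached to a frozen video transformer) can be pictured with a minimal PyTorch-style sketch. The module, the random patch masking, and all names below are illustrative assumptions, not the authors' released implementation:

```python
# Illustrative sketch only: learnable prompt tokens added to a frozen
# transformer backbone, in the spirit of POV's frame-level masking prompts
# and token-level view-aware prompts. Names, shapes, and the masking
# strategy are assumptions.
import torch
import torch.nn as nn


class PromptedVideoEncoder(nn.Module):
    def __init__(self, backbone, embed_dim=768, num_view_prompts=8, mask_ratio=0.3):
        super().__init__()
        self.backbone = backbone                    # frozen transformer blocks
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Token-level view-aware prompts: learnable vectors prepended to the tokens.
        self.view_prompts = nn.Parameter(torch.randn(num_view_prompts, embed_dim) * 0.02)
        # Frame-level masking prompt: a learnable vector that replaces masked patches.
        self.mask_prompt = nn.Parameter(torch.zeros(embed_dim))
        self.mask_ratio = mask_ratio

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) patch embeddings of the video frames.
        B, N, D = patch_tokens.shape
        # Replace a subset of patch tokens with the masking prompt
        # (random here; POV targets hand-object interaction regions).
        keep = torch.rand(B, N, device=patch_tokens.device) > self.mask_ratio
        tokens = torch.where(keep.unsqueeze(-1), patch_tokens,
                             self.mask_prompt.expand(B, N, D))
        # Prepend view-aware prompts so the frozen backbone attends to them.
        prompts = self.view_prompts.unsqueeze(0).expand(B, -1, -1)
        return self.backbone(torch.cat([prompts, tokens], dim=1))
```

In this reading, only the prompt parameters (plus any task head) are updated during adaptation, which is what makes transfer with few egocentric videos plausible.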
Related papers
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from
Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: a high-level structure transformation and a pixel-level hallucination.
arXiv Detail & Related papers (2024-03-11T01:00:00Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
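As a rough illustration of combining the two tuning styles APoLLo unifies, a frozen encoder can be given learnable prompt tokens plus a bottleneck adapter; the sketch below is a generic combination under assumed names and sizes, not APoLLo's actual architecture:

```python
# Generic prompt + adapter tuning on a frozen encoder; illustrative only.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection."""
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                                 nn.Linear(bottleneck, dim))

    def forward(self, x):
        return x + self.net(x)


class PromptedAdapterEncoder(nn.Module):
    def __init__(self, frozen_encoder, dim=512, num_prompts=4):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.adapter = Adapter(dim)

    def forward(self, tokens):                      # tokens: (B, N, D)
        B = tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        features = self.encoder(torch.cat([prompts, tokens], dim=1))
        return self.adapter(features)               # only prompts/adapter are trained
```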
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities
Using Web Instructional Videos [27.209391862016574]
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning.
We adapt models from web instructional videos with exocentric views to an egocentric view.
arXiv Detail & Related papers (2023-11-28T02:51:13Z)
- Learning from Semantic Alignment between Unpaired Multiviews for
Egocentric Video Recognition [23.031934558964473]
We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem.
The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos.
Our method also outperforms multiple existing view-alignment methods under a more challenging scenario.
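A hedged sketch of that key idea, with assumed tensor shapes and an InfoNCE-style loss standing in for whatever alignment objective SUM-L actually uses:

```python
# Cross-view pseudo-pairs from semantic similarity, then view-invariant
# alignment of the paired visual features. Names and loss are assumptions.
import torch
import torch.nn.functional as F


def build_pseudo_pairs(ego_sem, exo_sem):
    """ego_sem: (Ne, D), exo_sem: (Nx, D) semantic embeddings (e.g., from text).
    Returns, for each ego clip, the index of its most similar exo clip."""
    sim = F.normalize(ego_sem, dim=-1) @ F.normalize(exo_sem, dim=-1).T
    return sim.argmax(dim=-1)                      # (Ne,)


def view_invariant_loss(ego_feat, exo_feat, pair_idx, temperature=0.07):
    """InfoNCE-style alignment between ego features and their pseudo-paired
    exo features; the other exo clips in the batch act as negatives."""
    ego = F.normalize(ego_feat, dim=-1)            # (Ne, D)
    exo = F.normalize(exo_feat, dim=-1)            # (Nx, D)
    logits = ego @ exo.T / temperature             # (Ne, Nx)
    return F.cross_entropy(logits, pair_idx)
```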
arXiv Detail & Related papers (2023-08-22T15:10:42Z)
- Learning Fine-grained View-Invariant Representations from Unpaired
Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
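The two key designs are not spelled out in this summary; as a generic stand-in, temporal alignment between ego and exo clips is often written as a soft cycle-back objective over per-frame embeddings, sketched below under that assumption:

```python
# Generic temporal cycle-back alignment between two per-frame embedding
# sequences; a stand-in for AE2's actual objective, not its formulation.
import torch
import torch.nn.functional as F


def cycle_back_loss(ego_emb, exo_emb, temperature=0.1):
    """ego_emb: (T1, D), exo_emb: (T2, D) per-frame embeddings.
    Each ego frame soft-matches into exo, then back into ego; the round
    trip should land on the frame it started from."""
    sim_fwd = ego_emb @ exo_emb.T / temperature            # (T1, T2)
    soft_exo = F.softmax(sim_fwd, dim=-1) @ exo_emb        # (T1, D)
    sim_back = soft_exo @ ego_emb.T / temperature          # (T1, T1)
    target = torch.arange(ego_emb.size(0), device=ego_emb.device)
    return F.cross_entropy(sim_back, target)
```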
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with
GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
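The inter-view attention step can be sketched roughly as object tokens from all views attending to one another before fusion; the layer below is an assumed, generic version rather than ViewRefer's exact module:

```python
# Loose sketch of inter-view attention followed by a simple cross-view
# average; layer sizes and the pooling step are assumptions.
import torch
import torch.nn as nn


class InterViewFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_feats):
        # view_feats: (B, V, K, D) = batch, views, object proposals, channels.
        B, V, K, D = view_feats.shape
        tokens = view_feats.reshape(B, V * K, D)    # let tokens attend across views
        fused, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + fused)
        # Average the per-view copies of each object back into one feature.
        return tokens.reshape(B, V, K, D).mean(dim=1)      # (B, K, D)
```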
arXiv Detail & Related papers (2023-03-29T17:59:10Z)
- GOCA: Guided Online Cluster Assignment for Self-Supervised Video
Representation Learning [49.69279760597111]
Clustering is a ubiquitous tool in unsupervised learning.
Most existing self-supervised representation learning methods cluster samples based on visually dominant features.
We propose a principled way to combine two views. Specifically, we propose a novel clustering strategy where we use the initial cluster assignment of each view as a prior to guide the final cluster assignment of the other view.
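A minimal sketch of the guided-assignment idea; how the other view's initial assignment is injected as a prior (simple log-prior addition here) is an assumption, not GOCA's exact rule:

```python
# Each view's initial soft cluster assignment guides the other view's
# final assignment over a shared set of prototypes. Illustrative only.
import torch
import torch.nn.functional as F


def guided_assignments(feat_a, feat_b, prototypes, temperature=0.1):
    """feat_a, feat_b: (B, D) features of two views of the same clips;
    prototypes: (K, D) cluster centers. Returns guided soft assignments."""
    scores_a = feat_a @ prototypes.T / temperature          # (B, K)
    scores_b = feat_b @ prototypes.T / temperature
    init_a = F.softmax(scores_a, dim=-1)                    # initial assignments
    init_b = F.softmax(scores_b, dim=-1)
    # Guide each view's final assignment with the other view's initial one.
    final_a = F.softmax(scores_a + torch.log(init_b + 1e-8), dim=-1)
    final_b = F.softmax(scores_b + torch.log(init_a + 1e-8), dim=-1)
    return final_a, final_b
```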
arXiv Detail & Related papers (2022-07-20T19:26:55Z)
- Learning Implicit 3D Representations of Dressed Humans from Sparse Views [31.584157304372425]
We propose an end-to-end approach that learns an implicit 3D representation of dressed humans from sparse camera views.
In the experiments, we show the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-04-16T10:20:26Z)
- Generalized Multi-view Shared Subspace Learning using View Bootstrapping [43.027427742165095]
A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events to improve downstream learning tasks.
We present a neural method based on multi-view correlation to capture the information shared across a large number of views by subsampling them in a view-agnostic manner during training.
Experiments on spoken word recognition, 3D object classification and pose-invariant face recognition demonstrate the robustness of view bootstrapping to model a large number of views.
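View bootstrapping can be pictured as randomly subsampling the available views at each step and applying a correlation-style objective to their embeddings; the loss below is a simple stand-in, not the paper's exact formulation:

```python
# Random view subsampling plus a pairwise agreement loss across the
# sampled views' embeddings. Illustrative stand-in only.
import torch
import torch.nn.functional as F


def bootstrap_views(view_batches, num_sampled):
    """view_batches: list of V tensors, each (B, ...) from a different view.
    Returns a random subset of the views for this training step."""
    idx = torch.randperm(len(view_batches))[:num_sampled].tolist()
    return [view_batches[i] for i in idx]


def pairwise_correlation_loss(embeddings):
    """embeddings: list of (B, D) tensors, one per sampled view.
    Encourages embeddings of the same item from different views to agree."""
    loss, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            a = F.normalize(embeddings[i], dim=-1)
            b = F.normalize(embeddings[j], dim=-1)
            loss = loss + (1 - (a * b).sum(dim=-1)).mean()
            count += 1
    return loss / max(count, 1)
```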
arXiv Detail & Related papers (2020-05-12T20:35:14Z)