POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object
Interaction in the Multi-View World
- URL: http://arxiv.org/abs/2403.05856v1
- Date: Sat, 9 Mar 2024 09:54:44 GMT
- Title: POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object
Interaction in the Multi-View World
- Authors: Boshen Xu, Sipeng Zheng, Qin Jin
- Abstract summary: Humans are good at translating third-person observations of hand-object interactions into an egocentric view.
We propose a Prompt-Oriented View-agnostic learning (POV) framework that enables this view adaptation with only a few egocentric videos.
- Score: 59.545114016224254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We humans are good at translating third-person observations of hand-object
interactions (HOI) into an egocentric view. However, current methods struggle
to replicate this ability of view adaptation from third-person to first-person.
Although some approaches attempt to learn view-agnostic representations from
large-scale video datasets, they ignore the relationships among multiple
third-person views. To this end, we propose a Prompt-Oriented View-agnostic
learning (POV) framework in this paper, which enables this view adaptation with
only a few egocentric videos. Specifically, we introduce interactive masking prompts
at the frame level to capture fine-grained action information, and view-aware
prompts at the token level to learn view-agnostic representations. To verify our
method, we establish two benchmarks for transferring from multiple third-person
views to the egocentric view. Our extensive experiments on these benchmarks
demonstrate the efficiency and effectiveness of our POV framework and prompt
tuning techniques in terms of view adaptation and view generalization. Our code
is available at https://github.com/xuboshen/pov_acmmm2023.
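As a rough illustration of the prompt-tuning idea in the abstract (learnable view-aware prompts attached to a largely frozen video backbone, trained alongside a small head), here is a minimal PyTorch sketch. The module name, prompt shapes, pooling, and toy backbone are assumptions for illustration only, not the released implementation; see the repository linked above for the actual code.

```python
import torch
import torch.nn as nn

class ViewAwarePromptTuner(nn.Module):
    """Hypothetical sketch: prepend learnable 'view-aware' prompt tokens to the
    token sequence of a frozen video encoder and train only the prompts + head."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 768,
                 num_prompts: int = 8, num_views: int = 3, num_classes: int = 100):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep the pretrained encoder frozen
            p.requires_grad = False
        # one small bank of prompt tokens per camera view (illustrative design choice)
        self.view_prompts = nn.Parameter(torch.randn(num_views, num_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor, view_id: int) -> torch.Tensor:
        # tokens: (B, N, D) patch/frame tokens produced by a tokenizer upstream
        B = tokens.size(0)
        prompts = self.view_prompts[view_id].unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([prompts, tokens], dim=1)     # (B, P + N, D)
        x = self.backbone(x)                        # frozen transformer blocks
        return self.head(x[:, 0])                   # pool the first token for logits


# toy usage: a stand-in backbone so the sketch runs end to end
backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 8, batch_first=True), 2)
model = ViewAwarePromptTuner(backbone)
logits = model(torch.randn(2, 16, 768), view_id=1)
print(logits.shape)  # torch.Size([2, 100])
```

The frame-level interactive masking prompts mentioned in the abstract are not shown here.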
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
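Read literally, this hypothesis suggests a simple recipe for best-view pseudo-labels: caption each view, score each caption against the view-agnostic summary, and pick the best-scoring view. A minimal sketch under that reading; the captioner and scorer below are hypothetical stand-ins, not the paper's components.

```python
from typing import Callable, Sequence

def best_view_pseudo_label(
    views: Sequence[Sequence],              # one frame sequence per camera view
    summary: str,                           # view-agnostic text summary of the clip
    caption_fn: Callable[[Sequence], str],  # hypothetical per-view captioner
    score_fn: Callable[[str, str], float],  # hypothetical caption-vs-summary scorer
) -> int:
    """Return the index of the view whose caption best predicts the summary."""
    scores = [score_fn(caption_fn(v), summary) for v in views]
    return max(range(len(views)), key=scores.__getitem__)


# toy usage with dummy components (illustrative only)
views = [["exo_frame_1"], ["exo_frame_2"], ["ego_frame_1"]]
caption_fn = lambda frames: " ".join(frames)                           # stand-in captioner
score_fn = lambda caption, summary: -abs(len(caption) - len(summary))  # stand-in scorer
print(best_view_pseudo_label(views, "person slices an onion", caption_fn, score_fn))
```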
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and pixel-level hallucination.
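A schematic reading of this two-stage decoupling, with placeholder networks for both stages; the interfaces and layer choices are illustrative assumptions, not Exo2Ego's actual modules.

```python
import torch
import torch.nn as nn

class TwoStageExoToEgo(nn.Module):
    """Sketch of a decoupled exo-to-ego translator: structure first, pixels second."""

    def __init__(self):
        super().__init__()
        # stage 1: predict a coarse egocentric "structure" map (e.g. hand/object layout)
        self.structure_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )
        # stage 2: hallucinate ego-view pixels conditioned on the exo frame + structure
        self.pixel_net = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, exo_frame: torch.Tensor) -> torch.Tensor:
        structure = self.structure_net(exo_frame)                       # (B, 1, H, W)
        return self.pixel_net(torch.cat([exo_frame, structure], dim=1))


model = TwoStageExoToEgo()
print(model(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 3, 64, 64])
```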
arXiv Detail & Related papers (2024-03-11T01:00:00Z)
- Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos [27.209391862016574]
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning.
We adapt models from web instructional videos with exocentric views to an egocentric view.
arXiv Detail & Related papers (2023-11-28T02:51:13Z)
- Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition [23.031934558964473]
We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem.
The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos.
Our method also outperforms multiple existing view-alignment methods under this more challenging scenario.
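One hedged reading of "build cross-view pseudo-pairs and perform view-invariant alignment": match unpaired ego and exo clips by semantic-embedding similarity, then pull matched clips together with a contrastive-style loss. The matching rule and loss below are illustrative, not SUM-L's exact formulation.

```python
import torch
import torch.nn.functional as F

def pseudo_pair_alignment_loss(ego_feats, exo_feats, ego_sem, exo_sem, tau=0.07):
    """ego_feats/exo_feats: (Ne, D)/(Nx, D) video features from two unpaired view sets.
    ego_sem/exo_sem: semantic (e.g. text) embeddings used only to build pseudo-pairs."""
    # 1) pseudo-pairs: each ego clip is matched to its semantically closest exo clip
    sim = F.normalize(ego_sem, dim=-1) @ F.normalize(exo_sem, dim=-1).T   # (Ne, Nx)
    match = sim.argmax(dim=1)                                             # pseudo-pair indices
    # 2) view-invariant alignment: InfoNCE pulling each ego clip toward its matched exo clip
    logits = F.normalize(ego_feats, dim=-1) @ F.normalize(exo_feats, dim=-1).T / tau
    return F.cross_entropy(logits, match)


# toy usage with random tensors
loss = pseudo_pair_alignment_loss(torch.randn(8, 256), torch.randn(12, 256),
                                  torch.randn(8, 64), torch.randn(12, 64))
print(loss.item())
```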
arXiv Detail & Related papers (2023-08-22T15:10:42Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
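Aligning egocentric and exocentric videos in time is commonly implemented with a temporal cycle-consistency objective over per-frame embeddings; the sketch below uses that generic idea as a stand-in and does not reproduce AE2's two key designs.

```python
import torch
import torch.nn.functional as F

def temporal_cycle_loss(ego: torch.Tensor, exo: torch.Tensor, tau: float = 0.1):
    """ego: (T1, D), exo: (T2, D) per-frame embeddings of the same action from two views.
    Soft nearest neighbour ego -> exo -> ego should land back on the starting frame."""
    ego, exo = F.normalize(ego, dim=-1), F.normalize(exo, dim=-1)
    attn = F.softmax(ego @ exo.T / tau, dim=1)       # (T1, T2) ego -> exo attention
    nn_exo = attn @ exo                              # soft nearest neighbours in exo
    logits_back = nn_exo @ ego.T / tau               # (T1, T1) cycle-back logits
    target = torch.arange(ego.size(0))
    return F.cross_entropy(logits_back, target)      # frame i should cycle back to i


# toy usage: 20 ego frames aligned against 32 exo frames
print(temporal_cycle_loss(torch.randn(20, 128), torch.randn(32, 128)).item())
```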
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
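A minimal sketch of inter-view attention over per-view object tokens, assuming flattened self-attention across views followed by average pooling; the dimensions and pooling are illustrative assumptions, not ViewRefer's exact fusion module.

```python
import torch
import torch.nn as nn

class InterViewFusion(nn.Module):
    """Sketch: object tokens from each view attend to the tokens of all other views,
    then views are pooled into a single grounding feature per object token."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens: torch.Tensor) -> torch.Tensor:
        # view_tokens: (B, V, N, D) -- V views, N object tokens per view
        B, V, N, D = view_tokens.shape
        x = view_tokens.reshape(B, V * N, D)          # flatten so attention crosses views
        fused, _ = self.attn(x, x, x)                 # inter-view self-attention
        fused = self.norm(x + fused).reshape(B, V, N, D)
        return fused.mean(dim=1)                      # average over views -> (B, N, D)


fusion = InterViewFusion()
print(fusion(torch.randn(2, 4, 10, 256)).shape)  # torch.Size([2, 10, 256])
```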
arXiv Detail & Related papers (2023-03-29T17:59:10Z)
- GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning [49.69279760597111]
Clustering is a ubiquitous tool in unsupervised learning.
Most existing self-supervised representation learning methods cluster samples based on visually dominant features.
We propose a principled way to combine two views. Specifically, we propose a novel clustering strategy where we use the initial cluster assignment of each view as a prior to guide the final cluster assignment of the other view.
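One way to read "use the initial cluster assignment of each view as a prior for the other view" is to mix the two views' soft assignments before the final step. The sketch below uses plain softmax assignments and a convex combination as the guidance; GOCA's actual online procedure is more involved.

```python
import torch
import torch.nn.functional as F

def guided_assignments(feat_a, feat_b, prototypes, alpha=0.5, tau=0.1):
    """feat_a, feat_b: (N, D) features of the same clips under two views (e.g. RGB / flow).
    prototypes: (K, D) cluster centroids. Returns guided soft assignments for both views."""
    p_a = F.softmax(feat_a @ prototypes.T / tau, dim=1)   # initial assignment, view A
    p_b = F.softmax(feat_b @ prototypes.T / tau, dim=1)   # initial assignment, view B
    # each view's final assignment is guided by the other view's initial assignment
    q_a = alpha * p_a + (1 - alpha) * p_b
    q_b = alpha * p_b + (1 - alpha) * p_a
    return q_a, q_b


q_a, q_b = guided_assignments(torch.randn(16, 128), torch.randn(16, 128), torch.randn(10, 128))
print(q_a.shape, q_b.shape)  # torch.Size([16, 10]) torch.Size([16, 10])
```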
arXiv Detail & Related papers (2022-07-20T19:26:55Z)
- Learning Implicit 3D Representations of Dressed Humans from Sparse Views [31.584157304372425]
We propose an end-to-end approach that learns an implicit 3D representation of dressed humans from sparse camera views.
In the experiments, we show the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively.
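As a rough, generic illustration of an implicit 3D representation conditioned on sparse views: an occupancy MLP that takes a query point plus per-view features sampled at that point's projections. Every component below (feature pooling, network sizes) is a placeholder, not the paper's pipeline.

```python
import torch
import torch.nn as nn

class SparseViewOccupancy(nn.Module):
    """Sketch: occupancy MLP conditioned on point coordinates and per-view features."""

    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),          # inside/outside probability
        )

    def forward(self, points: torch.Tensor, view_feats: torch.Tensor) -> torch.Tensor:
        # points: (B, P, 3) query points; view_feats: (B, V, P, F) features sampled
        # at each point's projection into each of the V sparse camera views
        pooled = view_feats.mean(dim=1)                  # simple average over views
        return self.mlp(torch.cat([points, pooled], dim=-1)).squeeze(-1)


model = SparseViewOccupancy()
print(model(torch.randn(2, 1000, 3), torch.randn(2, 4, 1000, 64)).shape)  # (2, 1000)
```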
arXiv Detail & Related papers (2021-04-16T10:20:26Z)
- Generalized Multi-view Shared Subspace Learning using View Bootstrapping [43.027427742165095]
A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events to improve downstream learning tasks.
We present a neural method based on multi-view correlation to capture the information shared across a large number of views by subsampling them in a view-agnostic manner during training.
Experiments on spoken word recognition, 3D object classification and pose-invariant face recognition demonstrate the robustness of view bootstrapping to model a large number of views.
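The view-agnostic subsampling can be sketched as follows: at each training step, draw a random subset of the available views and encourage agreement between their embeddings of the same instances. The simple cosine-agreement loss below stands in for the paper's multi-view correlation objective.

```python
import random
import torch
import torch.nn.functional as F

def view_bootstrap_step(view_feats: torch.Tensor, num_sampled: int = 4):
    """view_feats: (V, N, D) embeddings of N instances under V parallel views.
    Subsample views view-agnostically and align every sampled pair of views."""
    V = view_feats.size(0)
    picked = random.sample(range(V), k=min(num_sampled, V))   # bootstrap a view subset
    loss, pairs = 0.0, 0
    for i in picked:
        for j in picked:
            if i < j:
                a = F.normalize(view_feats[i], dim=-1)
                b = F.normalize(view_feats[j], dim=-1)
                loss = loss + (1 - (a * b).sum(dim=-1)).mean()  # cosine disagreement
                pairs += 1
    return loss / max(pairs, 1)


# toy usage: 10 views, 32 instances, 128-dim embeddings
print(view_bootstrap_step(torch.randn(10, 32, 128)).item())
```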
arXiv Detail & Related papers (2020-05-12T20:35:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.