Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos
- URL: http://arxiv.org/abs/2411.08753v2
- Date: Thu, 26 Dec 2024 15:49:20 GMT
- Title: Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos
- Authors: Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman
- Abstract summary: The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video--no language or camera poses--and returns the best viewpoint to watch at each timestep.
- Score: 66.1935609072708
- Abstract: Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video--no language or camera poses--and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
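To make the weak-supervision recipe in the abstract concrete, below is a minimal sketch of the two steps it describes: scoring each view's caption prediction against the view-agnostic summary to get pseudo-labels, then training a view selector on those labels. This is an illustrative assumption, not the authors' implementation: the cosine-similarity scoring, the placeholder `ViewSelector` architecture, and all tensor shapes are invented for the example, and the auxiliary camera-pose predictor is omitted.

```python
# Hypothetical sketch of caption-accuracy pseudo-labeling + view-selector training.
# Architectures, shapes, and the similarity metric are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

def best_view_pseudo_labels(caption_embs, summary_emb):
    """Per timestep, pick the view whose predicted-caption embedding is closest
    (cosine similarity) to the view-agnostic summary embedding.

    caption_embs: (T, V, D) caption embedding for each timestep and view
    summary_emb:  (T, D)    view-agnostic summary embedding per timestep
    """
    sims = F.cosine_similarity(caption_embs, summary_emb.unsqueeze(1), dim=-1)  # (T, V)
    return sims.argmax(dim=-1)  # (T,) index of the pseudo "best" view

class ViewSelector(nn.Module):
    """Toy per-view scorer trained on the pseudo-labels (placeholder architecture)."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, view_feats):                  # view_feats: (T, V, feat_dim)
        return self.scorer(view_feats).squeeze(-1)  # (T, V) per-view logits

# Tiny usage example with random stand-in features.
T, V, D, FEAT = 8, 4, 384, 512
caption_embs = torch.randn(T, V, D)   # stand-in for per-view caption predictions (embedded)
summary_emb = torch.randn(T, D)       # stand-in for the view-agnostic text summary (embedded)
view_feats = torch.randn(T, V, FEAT)  # stand-in for per-view video features

labels = best_view_pseudo_labels(caption_embs, summary_emb)   # weak "best view" labels
selector = ViewSelector(FEAT)
loss = F.cross_entropy(selector(view_feats), labels)          # train selector on pseudo-labels
loss.backward()

# At inference only video features are needed (no language, no camera poses):
best_views = selector(view_feats).argmax(dim=-1)              # (T,) chosen view per timestep
```

Note that in this sketch language is used only to produce the training labels; at test time the selector's argmax over per-view scores picks the viewpoint, matching the language-free inference setting described above.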
Related papers
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation [70.68566282567207]
VisionReward is a fine-grained and multi-dimensional reward model.
We decompose human preferences in images and videos into multiple dimensions.
Based on VisionReward, we develop a multi-objective preference learning algorithm.
arXiv Detail & Related papers (2024-12-30T16:24:09Z)
- Switch-a-View: Few-Shot View Selection Learned from Edited Videos [71.01549400773197]
We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video.
The key insight of our approach is how to train such a model from unlabeled--but human-edited--video samples.
arXiv Detail & Related papers (2024-12-24T12:16:43Z)
- POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World [59.545114016224254]
Humans are good at translating third-person observations of hand-object interactions into an egocentric view.
We propose a Prompt-Oriented View-agnostic learning framework, which enables this view adaptation with few egocentric videos.
arXiv Detail & Related papers (2024-03-09T09:54:44Z)
- Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition [23.031934558964473]
We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle the unpaired multiview learning problem.
The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos.
Our method also outperforms multiple existing view-alignment methods under a more challenging scenario.
arXiv Detail & Related papers (2023-08-22T15:10:42Z)
- Learning to Select Camera Views: Efficient Multiview Understanding at Few Glances [59.34619548026885]
We propose a view selection approach that analyzes the target object or scenario from given views and selects the next best view for processing.
Our approach features a reinforcement learning based camera selection module, MVSelect, that not only selects views but also facilitates joint training with the task network.
arXiv Detail & Related papers (2023-03-10T18:59:10Z)
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
- Generalized Multi-view Shared Subspace Learning using View Bootstrapping [43.027427742165095]
A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events to improve downstream learning tasks.
We present a neural method based on multi-view correlation to capture the information shared across a large number of views by subsampling them in a view-agnostic manner during training.
Experiments on spoken word recognition, 3D object classification and pose-invariant face recognition demonstrate the robustness of view bootstrapping to model a large number of views.
arXiv Detail & Related papers (2020-05-12T20:35:14Z)