Related papers: Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos

URL: http://arxiv.org/abs/2411.08753v1
Date: Wed, 13 Nov 2024 16:31:08 GMT
Title: Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos
Authors: Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman
Abstract summary: The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
Score: 66.1935609072708
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
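
The pseudo-labeling idea in the abstract (score each view by how well its caption matches a view-agnostic summary, then treat the best-scoring views as "best view" labels) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which captioner or accuracy metric is used, so a simple token-overlap F1 stands in for caption accuracy, and the names token_f1, best_view_pseudo_labels, and tolerance are hypothetical rather than taken from the paper.

```python
from typing import Dict, List


def token_f1(predicted: str, reference: str) -> float:
    """Stand-in caption-accuracy score: F1 over unique lowercase tokens.

    Assumption: a real pipeline would use a learned captioner and a
    proper captioning metric; this is only a toy proxy.
    """
    pred = set(predicted.lower().split())
    ref = set(reference.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_view_pseudo_labels(
    view_captions: Dict[str, str],
    summary: str,
    tolerance: float = 0.0,
) -> List[str]:
    """Return the views whose caption score is within `tolerance` of the top score.

    `view_captions` maps a (hypothetical) view name to the caption predicted
    from that view alone; `summary` is the view-agnostic text for the clip.
    """
    scores = {view: token_f1(cap, summary) for view, cap in view_captions.items()}
    top = max(scores.values())
    # Relative accuracy: keep every view that is (nearly) as good as the best one.
    return sorted(view for view, score in scores.items() if top - score <= tolerance)


if __name__ == "__main__":
    summary = "the person whisks eggs in a bowl"
    captions = {
        "ego": "a person whisks eggs in a bowl",
        "exo1": "a person stands in a kitchen",
        "exo2": "a person holds a bowl",
    }
    print(best_view_pseudo_labels(captions, summary))
```

In this toy example the egocentric caption overlaps most with the summary, so "ego" is the only pseudo-label; raising tolerance admits more views, loosely mirroring the abstract's "most informative viewpoint(s)". Per the abstract, such pseudo-labels would then be used to train a view selector that needs no language at inference time.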

Related papers

Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos [71.01549400773197]
We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint. We then discover the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand.
arXiv Detail & Related papers (2024-12-24T12:16:43Z)
POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World [59.545114016224254]
Humans are good at translating third-person observations of hand-object interactions into an egocentric view. We propose a Prompt-Oriented View-agnostic learning framework, which enables this view adaptation with few egocentric videos.
arXiv Detail & Related papers (2024-03-09T09:54:44Z)
Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition [23.031934558964473]
We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem. The key idea is to build cross-view pseudo-pairs and do view-invariant alignment by leveraging the semantic information of videos. Our method also outperforms multiple existing view-alignment methods under the more challenging scenario.
arXiv Detail & Related papers (2023-08-22T15:10:42Z)
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes. We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
Learning to Select Camera Views: Efficient Multiview Understanding at Few Glances [59.34619548026885]
We propose a view selection approach that analyzes the target object or scenario from given views and selects the next best view for processing. Our approach features a reinforcement learning based camera selection module, MVSelect, that not only selects views but also facilitates joint training with the task network.
arXiv Detail & Related papers (2023-03-10T18:59:10Z)
Multi-View Masked World Models for Visual Robotic Manipulation [132.97980128530017]
We train a multi-view masked autoencoder which reconstructs pixels of randomly masked viewpoints. We demonstrate the effectiveness of our method in a range of scenarios. We also show that the multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy with strong viewpoint randomization.
arXiv Detail & Related papers (2023-02-05T15:37:02Z)
Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video. Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input. On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
Generalized Multi-view Shared Subspace Learning using View Bootstrapping [43.027427742165095]
A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events to improve downstream learning tasks. We present a neural method based on multi-view correlation to capture the information shared across a large number of views by subsampling them in a view-agnostic manner during training. Experiments on spoken word recognition, 3D object classification and pose-invariant face recognition demonstrate the robustness of view bootstrapping to model a large number of views.
arXiv Detail & Related papers (2020-05-12T20:35:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.