Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
- URL: http://arxiv.org/abs/2412.18386v3
- Date: Tue, 22 Apr 2025 13:23:34 GMT
- Title: Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
- Authors: Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
- Abstract summary: We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint. We then discover the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other.
- Score: 71.01549400773197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled -- but human-edited -- video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages. Project: https://vision.cs.utexas.edu/projects/switch_a_view/.
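As a concrete (and purely illustrative) reading of this two-stage recipe, one can imagine a predictor that consumes per-segment visual and narration features and outputs an ego/exo choice, trained against the stage-one pseudo-labels. A minimal PyTorch sketch, where all module names, feature dimensions, and the two-way label space are assumptions rather than the paper's actual design:

```python
import torch
import torch.nn as nn

class ViewSwitchPredictor(nn.Module):
    """Toy stand-in: fuse visual and narration features, predict ego vs. exo."""
    def __init__(self, vis_dim=512, txt_dim=512, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # logits over {ego, exo} per segment
        )

    def forward(self, vis_feats, txt_feats):
        # vis_feats, txt_feats: (batch, segments, dim)
        return self.fuse(torch.cat([vis_feats, txt_feats], dim=-1))

model = ViewSwitchPredictor()
vis = torch.randn(4, 10, 512)          # visual features per segment
txt = torch.randn(4, 10, 512)          # spoken-narration features per segment
pseudo = torch.randint(0, 2, (4, 10))  # stage-one pseudo-labels (0=ego, 1=exo)
logits = model(vis, txt)
loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), pseudo.flatten())
loss.backward()
```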
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative that view is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
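A minimal sketch of the pseudo-labeling rule described above: score each view by its captioning loss and take the most predictive (lowest-loss) view per timestep. The function name and tensor layout are assumptions, not the authors' code:

```python
import torch

def best_view_pseudo_labels(caption_losses: torch.Tensor) -> torch.Tensor:
    """caption_losses: (num_views, num_timesteps) per-view captioning loss."""
    return caption_losses.argmin(dim=0)  # lower loss = more informative view

losses = torch.tensor([[2.1, 0.9, 1.5],
                       [1.4, 1.6, 0.7]])  # 2 views x 3 timesteps
print(best_view_pseudo_labels(losses))   # tensor([1, 0, 1])
```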
arXiv Detail & Related papers (2024-11-13T16:31:08Z) - Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
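To make the setup concrete, here is a toy decoder-only Transformer over discrete video tokens that continues a prompt clip autoregressively. The tokenizer, vocabulary size, and model dimensions are stand-ins; the paper's actual model is not reproduced here:

```python
import torch
import torch.nn as nn

VOCAB, D = 1024, 256
embed = nn.Embedding(VOCAB, D)
decoder = nn.TransformerEncoder(  # a causal mask makes this a decoder-only LM
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(D, VOCAB)

@torch.no_grad()
def generate(prompt_tokens, steps=8):
    seq = prompt_tokens
    for _ in range(steps):
        T = seq.size(1)
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = decoder(embed(seq), mask=mask)       # causal self-attention
        nxt = head(h[:, -1]).argmax(-1, keepdim=True)  # greedy next token
        seq = torch.cat([seq, nxt], dim=1)
    return seq

prompt = torch.randint(0, VOCAB, (1, 16))  # tokens from an existing clip
print(generate(prompt).shape)              # torch.Size([1, 24])
```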
arXiv Detail & Related papers (2024-07-10T04:27:06Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
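A hedged sketch of how such a unified objective might be assembled as a weighted sum of the three named losses; the heads, temperature, and weights below are placeholders, not InternVideo2's implementation:

```python
import torch
import torch.nn.functional as F

def combined_loss(recon, target, vid_emb, txt_emb, logits, next_tokens,
                  w=(1.0, 1.0, 1.0)):
    mvm = F.mse_loss(recon, target)            # masked video modeling
    sim = vid_emb @ txt_emb.t() / 0.07         # cross-modal contrastive (InfoNCE)
    labels = torch.arange(sim.size(0))
    con = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
    ntp = F.cross_entropy(logits, next_tokens)  # next-token prediction
    return w[0] * mvm + w[1] * con + w[2] * ntp

B, D, V = 8, 128, 1000
loss = combined_loss(torch.randn(B, D), torch.randn(B, D),
                     F.normalize(torch.randn(B, D), dim=-1),
                     F.normalize(torch.randn(B, D), dim=-1),
                     torch.randn(B, V), torch.randint(0, V, (B,)))
print(loss.item())
```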
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% AP^video_50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of AP^video.
arXiv Detail & Related papers (2023-08-28T17:10:12Z) - Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
The Motion-Contrastive Perception Network (MCPNet) consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
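A toy two-branch layout in the spirit of MIP and CIP, where a shared backbone feeds a motion head and an instance head, each trained with an InfoNCE-style loss. All modules and losses here are illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Linear(512, dim)       # stand-in video encoder
        self.motion_head = nn.Linear(dim, dim)    # MIP: motion information
        self.instance_head = nn.Linear(dim, dim)  # CIP: contrastive instance

    def forward(self, x):
        h = self.backbone(x)
        return self.motion_head(h), self.instance_head(h)

def info_nce(a, b, t=0.1):
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / t
    return F.cross_entropy(logits, torch.arange(a.size(0)))

model = TwoBranch()
# In practice the two passes would see two augmentations of the same clips;
# random tensors stand in for clip features here.
m1, i1 = model(torch.randn(16, 512))
m2, i2 = model(torch.randn(16, 512))
loss = info_nce(m1, m2) + info_nce(i1, i2)
loss.backward()
```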
arXiv Detail & Related papers (2022-04-10T05:34:46Z) - ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints [47.54827916387143]
We propose ViewCLR, which learns self-supervised video representations invariant to camera viewpoint changes.
We introduce a view-generator that can be considered a learnable augmentation for any self-supervised pretext task.
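One way to picture a learnable view augmentation: a small generator network synthesizes an alternate-view feature that a contrastive loss pulls toward the original. This is a hypothetical sketch, not ViewCLR's actual generator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

view_gen = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

def nce(a, b, t=0.1):
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / t
    return F.cross_entropy(logits, torch.arange(a.size(0)))

feats = torch.randn(32, 128)  # features of the observed viewpoint
generated = view_gen(feats)   # synthesized "unseen viewpoint" features
loss = nce(feats, generated)  # pull real and generated views together
loss.backward()
```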
arXiv Detail & Related papers (2021-12-07T18:58:29Z) - Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
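A minimal sketch of cross-view pseudo-labeling under these assumptions: a classifier shared across views pseudo-labels unlabeled clips from the appearance (RGB) view, and confident labels supervise the motion (flow) view. Feature shapes and the confidence threshold are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(256, 10)  # classifier shared by appearance and motion views

def cross_view_loss(app_feats, mot_feats, thresh=0.8):
    # Pseudo-label from the appearance view; train the motion view on the
    # confident subset only.
    with torch.no_grad():
        probs = F.softmax(model(app_feats), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf > thresh
    if keep.sum() == 0:
        return torch.tensor(0.0)  # no confident pseudo-labels this batch
    return F.cross_entropy(model(mot_feats[keep]), pseudo[keep])

# A low threshold here only so the toy example keeps some labels.
loss = cross_view_loss(torch.randn(64, 256), torch.randn(64, 256), thresh=0.1)
print(loss.item())
```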
arXiv Detail & Related papers (2021-04-01T17:59:48Z) - Broaden Your Views for Self-Supervised Video Learning [97.52216510672251]
We introduce BraVe, a self-supervised learning framework for video.
In BraVe, one view has access to a narrow temporal window of the video, while the other has broad access to the video content.
We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks.
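A sketch of the narrow-vs-broad setup: the narrow view's representation is regressed onto a stop-gradient target from the broad view. Window lengths, encoders, and the regression objective below are assumed for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

narrow_enc = nn.Linear(128, 64)  # sees a short temporal window
broad_enc = nn.Linear(128, 64)   # sees a much longer window
predictor = nn.Linear(64, 64)    # narrow view predicts the broad view

video = torch.randn(8, 300, 128)        # (batch, time, features)
narrow = video[:, 100:116].mean(dim=1)  # narrow temporal window
broad = video.mean(dim=1)               # broad access to the whole clip

pred = predictor(narrow_enc(narrow))
target = broad_enc(broad).detach()      # stop-gradient on the broad target
loss = F.mse_loss(F.normalize(pred, dim=-1), F.normalize(target, dim=-1))
loss.backward()
```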
arXiv Detail & Related papers (2021-03-30T17:58:46Z) - Recognizing Actions in Videos from Unseen Viewpoints [80.6338404141284]
We show that current convolutional neural network models are unable to recognize actions from camera viewpoints not present in training data.
We introduce a new dataset for unseen-view recognition and show our approach's ability to learn viewpoint-invariant representations.
arXiv Detail & Related papers (2021-03-30T17:17:54Z)