Pseudo Dataset Generation for Out-of-Domain Multi-Camera View Recommendation
- URL: http://arxiv.org/abs/2410.13585v1
- Date: Thu, 17 Oct 2024 14:21:22 GMT
- Title: Pseudo Dataset Generation for Out-of-Domain Multi-Camera View Recommendation
- Authors: Kuan-Ying Lee, Qian Zhou, Klara Nahrstedt
- Abstract summary: We propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets.
By training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model's accuracy in the target domain.
- Score: 8.21260979799828
- Abstract: Multi-camera systems are indispensable in movies, TV shows, and other media. Selecting the appropriate camera at every timestamp has a decisive impact on production quality and audience preferences. Learning-based view recommendation frameworks can assist professionals in decision-making. However, they often struggle outside of their training domains. The scarcity of labeled multi-camera view recommendation datasets exacerbates the issue. Based on the insight that many videos are edited from the original multi-camera videos, we propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets. Promisingly, by training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model's accuracy in the target domain and bridge the accuracy gap between in-domain and never-before-seen domains.
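The abstract's core idea is that an edited video already encodes an editor's view choices: each cut marks a switch to a different (pseudo) camera. The paper's actual pipeline is not detailed in this summary, so the following is only a minimal toy sketch of that idea; the frame-difference cut detector, thresholds, and function names are all assumptions, not the authors' method.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=30.0):
    """Flag indices where the mean absolute pixel difference between
    consecutive frames exceeds a threshold (a crude cut detector;
    real pipelines use stronger shot-boundary features)."""
    cuts = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts

def pseudo_label_views(frames, threshold=30.0):
    """Assign each frame a pseudo camera id: frames within one shot
    share an id, so cut points act as pseudo view-switch labels."""
    cuts = detect_shot_boundaries(frames, threshold)
    labels = np.zeros(len(frames), dtype=int)
    for view_id, start in enumerate(cuts):
        end = cuts[view_id + 1] if view_id + 1 < len(cuts) else len(frames)
        labels[start:end] = view_id
    return labels
```

The resulting per-frame labels could then supervise a view recommendation model on target-domain footage without manual annotation, which is the gist of the pseudo-dataset construction described above.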
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
Key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering [54.468355408388675]
We build a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images.
We apply a diversity-based sampling algorithm to optimize the camera selection.
We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments.
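The summary above mentions a similarity matrix and a diversity-based sampling algorithm but does not specify either. A common baseline for this kind of selection is greedy max-diversity sampling over a precomputed similarity matrix; the sketch below illustrates that baseline only, and should not be read as the paper's exact algorithm.

```python
import numpy as np

def select_diverse_cameras(similarity, k):
    """Greedy max-diversity selection: start from camera 0, then
    repeatedly add the camera least similar (on average) to the
    already-selected set, until k cameras are chosen."""
    n = similarity.shape[0]
    selected = [0]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # candidate with the lowest mean similarity to the selected set
        best = min(remaining, key=lambda i: similarity[i, selected].mean())
        selected.append(best)
    return selected
```

In the paper's setting, `similarity` would combine spatial camera diversity with semantic image variation; here it is just an arbitrary symmetric matrix.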
arXiv Detail & Related papers (2024-09-11T08:36:49Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- DVOS: Self-Supervised Dense-Pattern Video Object Segmentation [6.092973123903838]
In Dense Video Object Segmentation (DVOS) scenarios, each video frame encompasses hundreds of small, dense, and partially occluded objects.
We propose a semi-self-supervised spatiotemporal approach for DVOS utilizing a diffusion-based method through multi-task learning.
To demonstrate the utility and efficacy of the proposed approach, we developed DVOS models for wheat head segmentation of handheld and drone-captured videos.
arXiv Detail & Related papers (2024-06-07T17:58:36Z)
- Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification [33.25577310265293]
We introduce a camera-driven curriculum learning framework that leverages camera labels to transfer knowledge from source to target domains progressively.
For each curriculum sequence, we generate pseudo labels of person images in a target domain to train a reID model in a supervised way.
We have observed that the pseudo labels are highly biased toward cameras, suggesting that person images obtained from the same camera are likely to have the same pseudo labels, even for different IDs.
arXiv Detail & Related papers (2023-08-23T04:01:56Z)
- Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows [83.54243912535667]
We first collect a novel benchmark on this setting with four diverse scenarios including concerts, sports games, gala shows, and contests.
It contains 88 hours of raw video that contribute to 14 hours of edited video.
We propose a new approach, a temporal and contextual transformer, that utilizes cues from historical shots and other views to make shot-transition decisions.
arXiv Detail & Related papers (2022-10-17T04:11:23Z)
- Cross-View Cross-Scene Multi-View Crowd Counting [56.83882084112913]
Multi-view crowd counting has been previously proposed to utilize multi-cameras to extend the field-of-view of a single camera.
We propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where the training and testing occur on different scenes with arbitrary camera layouts.
arXiv Detail & Related papers (2022-05-03T15:03:44Z)
- DRIV100: In-The-Wild Multi-Domain Dataset and Evaluation for Real-World Domain Adaptation of Semantic Segmentation [9.984696742463628]
This work presents a new multi-domain dataset, DRIV100, for benchmarking domain adaptation techniques on in-the-wild road-scene videos collected from the Internet.
The dataset consists of pixel-level annotations for 100 videos selected to cover diverse scenes/domains based on two criteria: human subjective judgment and an anomaly score computed using an existing road-scene dataset.
arXiv Detail & Related papers (2021-01-30T04:43:22Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Dual-Triplet Metric Learning for Unsupervised Domain Adaptation in Video-Based Face Recognition [8.220945563455848]
A new deep domain adaptation (DA) method is proposed to adapt the CNN embedding of a Siamese network using unlabeled tracklets captured with a new video camera.
The proposed metric learning technique is used to train deep Siamese networks under different training scenarios.
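The dual-triplet formulation itself is not spelled out in this summary, so the sketch below shows only the standard single-triplet margin loss that such metric learning builds on; the function name, margin value, and use of plain Euclidean distance are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet margin loss on embedding vectors: penalize
    cases where the positive is not at least `margin` closer to the
    anchor than the negative is."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

A dual-triplet variant would combine two such terms, e.g. one over labeled source-domain triplets and one over pseudo-labeled target-domain tracklets.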
arXiv Detail & Related papers (2020-02-11T05:06:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.