Pseudo Dataset Generation for Out-of-Domain Multi-Camera View Recommendation
- URL: http://arxiv.org/abs/2410.13585v1
- Date: Thu, 17 Oct 2024 14:21:22 GMT
- Title: Pseudo Dataset Generation for Out-of-Domain Multi-Camera View Recommendation
- Authors: Kuan-Ying Lee, Qian Zhou, Klara Nahrstedt
- Abstract summary: We propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets.
By training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model's accuracy in the target domain.
- Score: 8.21260979799828
- Abstract: Multi-camera systems are indispensable in movies, TV shows, and other media. Selecting the appropriate camera at every timestamp has a decisive impact on production quality and audience preferences. Learning-based view recommendation frameworks can assist professionals in decision-making. However, they often struggle outside of their training domains. The scarcity of labeled multi-camera view recommendation datasets exacerbates the issue. Based on the insight that many videos are edited from the original multi-camera videos, we propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets. Promisingly, by training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model's accuracy in the target domain and bridge the accuracy gap between in-domain and never-before-seen domains.
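The abstract's core idea is that an edited video already encodes an editor's view choices: each cut marks a switch to a different (pseudo) camera. The paper's actual pipeline is not detailed in this summary, so the following is only a minimal toy sketch of that idea; the frame-difference cut detector, thresholds, and function names are all assumptions, not the authors' method.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=30.0):
    """Flag indices where the mean absolute pixel difference between
    consecutive frames exceeds a threshold (a crude cut detector;
    real pipelines use stronger shot-boundary features)."""
    cuts = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts

def pseudo_label_views(frames, threshold=30.0):
    """Assign each frame a pseudo camera id: frames within one shot
    share an id, so cut points act as pseudo view-switch labels."""
    cuts = detect_shot_boundaries(frames, threshold)
    labels = np.zeros(len(frames), dtype=int)
    for view_id, start in enumerate(cuts):
        end = cuts[view_id + 1] if view_id + 1 < len(cuts) else len(frames)
        labels[start:end] = view_id
    return labels
```

The resulting per-frame labels could then supervise a view recommendation model on target-domain footage without manual annotation, which is the gist of the pseudo-dataset construction described above.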
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
Key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering [54.468355408388675]
We build a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images.
We apply a diversity-based sampling algorithm to optimize the camera selection.
We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments.
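The summary above mentions a similarity matrix and a diversity-based sampling algorithm but does not specify either. A common baseline for this kind of selection is greedy max-diversity sampling over a precomputed similarity matrix; the sketch below illustrates that baseline only, and should not be read as the paper's exact algorithm.

```python
import numpy as np

def select_diverse_cameras(similarity, k):
    """Greedy max-diversity selection: start from camera 0, then
    repeatedly add the camera least similar (on average) to the
    already-selected set, until k cameras are chosen."""
    n = similarity.shape[0]
    selected = [0]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # candidate with the lowest mean similarity to the selected set
        best = min(remaining, key=lambda i: similarity[i, selected].mean())
        selected.append(best)
    return selected
```

In the paper's setting, `similarity` would combine spatial camera diversity with semantic image variation; here it is just an arbitrary symmetric matrix.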
arXiv Detail & Related papers (2024-09-11T08:36:49Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- DVOS: Self-Supervised Dense-Pattern Video Object Segmentation [6.092973123903838]
In Dense Video Object Segmentation (DVOS) scenarios, each video frame encompasses hundreds of small, dense, and partially occluded objects.
We propose a semi-self-supervised spatiotemporal approach for DVOS utilizing a diffusion-based method through multi-task learning.
To demonstrate the utility and efficacy of the proposed approach, we developed DVOS models for wheat head segmentation of handheld and drone-captured videos.
arXiv Detail & Related papers (2024-06-07T17:58:36Z)
- Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification [33.25577310265293]
We introduce a camera-driven curriculum learning framework that leverages camera labels to transfer knowledge from source to target domains progressively.
For each curriculum sequence, we generate pseudo labels of person images in a target domain to train a reID model in a supervised way.
We have observed that the pseudo labels are highly biased toward cameras, suggesting that person images obtained from the same camera are likely to have the same pseudo labels, even for different IDs.
arXiv Detail & Related papers (2023-08-23T04:01:56Z)
- Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows [83.54243912535667]
We first collect a novel benchmark on this setting with four diverse scenarios including concerts, sports games, gala shows, and contests.
It contains 88 hours of raw video that contribute to 14 hours of edited video.
We propose a new approach, a temporal and contextual transformer, that utilizes cues from historical shots and other views to make shot-transition decisions.
arXiv Detail & Related papers (2022-10-17T04:11:23Z)
- Cross-View Cross-Scene Multi-View Crowd Counting [56.83882084112913]
Multi-view crowd counting has been previously proposed to utilize multi-cameras to extend the field-of-view of a single camera.
We propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where the training and testing occur on different scenes with arbitrary camera layouts.
arXiv Detail & Related papers (2022-05-03T15:03:44Z)
- DRIV100: In-The-Wild Multi-Domain Dataset and Evaluation for Real-World Domain Adaptation of Semantic Segmentation [9.984696742463628]
This work presents a new multi-domain dataset, DRIV100, for benchmarking domain adaptation techniques on in-the-wild road-scene videos collected from the Internet.
The dataset consists of pixel-level annotations for 100 videos selected to cover diverse scenes/domains based on two criteria: human subjective judgment and an anomaly score computed using an existing road-scene dataset.
arXiv Detail & Related papers (2021-01-30T04:43:22Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Dual-Triplet Metric Learning for Unsupervised Domain Adaptation in Video-Based Face Recognition [8.220945563455848]
A new deep domain adaptation (DA) method is proposed to adapt the CNN embedding of a Siamese network using unlabeled tracklets captured with a new video camera.
The proposed metric learning technique is used to train deep Siamese networks under different training scenarios.
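The dual-triplet formulation itself is not spelled out in this summary, so the sketch below shows only the standard single-triplet margin loss that such metric learning builds on; the function name, margin value, and use of plain Euclidean distance are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet margin loss on embedding vectors: penalize
    cases where the positive is not at least `margin` closer to the
    anchor than the negative is."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

A dual-triplet variant would combine two such terms, e.g. one over labeled source-domain triplets and one over pseudo-labeled target-domain tracklets.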
arXiv Detail & Related papers (2020-02-11T05:06:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.