AUTV: Creating Underwater Video Datasets with Pixel-wise Annotations
- URL: http://arxiv.org/abs/2503.12828v1
- Date: Mon, 17 Mar 2025 05:18:20 GMT
- Title: AUTV: Creating Underwater Video Datasets with Pixel-wise Annotations
- Authors: Quang Trung Truong, Wong Yuk Kwan, Duc Thanh Nguyen, Binh-Son Hua, Sai-Kit Yeung
- Abstract summary: We propose AUTV, a framework for synthesizing marine video data with pixel-wise annotations. We demonstrate the effectiveness of this framework by constructing two video datasets.
- Score: 27.609227883183713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Underwater video analysis, hampered by the dynamic marine environment and camera motion, remains a challenging task in computer vision. Existing training-free video generation techniques, which learn motion dynamics on a frame-by-frame basis, often produce poor results with noticeable motion interruptions and misalignments. To address these issues, we propose AUTV, a framework for synthesizing marine video data with pixel-wise annotations. We demonstrate the effectiveness of this framework by constructing two video datasets: UTV, a real-world dataset comprising 2,000 video-text pairs, and SUTV, a synthetic video dataset comprising 10,000 videos with segmentation masks for marine objects. UTV provides diverse underwater videos with comprehensive annotations covering appearance, texture, camera intrinsics, lighting, and animal behavior. SUTV can be used to improve underwater downstream tasks, as demonstrated on video inpainting and video object segmentation.
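As an illustration of how pixel-wise annotated clips like SUTV's might be consumed for a downstream task such as video object segmentation, the sketch below wraps one clip as a PyTorch dataset. This is not the authors' code: the directory layout, file names, and mask encoding are assumptions made for illustration, since the abstract does not specify them.

```python
# Minimal sketch (hypothetical layout, not from the paper): wrap one
# SUTV-style clip of RGB frames plus per-pixel object-ID masks as a
# PyTorch dataset for a segmentation task.
from pathlib import Path

from torch.utils.data import Dataset
from torchvision.io import read_image


class SUTVClip(Dataset):
    """Yields (frame, mask) pairs for a single synthetic clip.

    Assumed (hypothetical) layout:
        <clip_dir>/frames/000001.png  -- RGB frames
        <clip_dir>/masks/000001.png   -- per-pixel object-ID masks
    """

    def __init__(self, clip_dir: str):
        clip = Path(clip_dir)
        self.frame_paths = sorted((clip / "frames").glob("*.png"))
        self.mask_paths = sorted((clip / "masks").glob("*.png"))
        if len(self.frame_paths) != len(self.mask_paths):
            raise ValueError("each frame needs a matching mask")

    def __len__(self) -> int:
        return len(self.frame_paths)

    def __getitem__(self, i: int):
        # Frames are scaled to [0, 1]; masks keep integer object IDs.
        frame = read_image(str(self.frame_paths[i])).float() / 255.0  # (3, H, W)
        mask = read_image(str(self.mask_paths[i])).long()             # (1, H, W)
        return frame, mask
```

A standard DataLoader can then batch the (frame, mask) pairs to train or fine-tune a segmentation model on the synthetic clips.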
Related papers
- VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control [47.34885131252508]
Video inpainting aims to restore corrupted video content. We propose a novel dual-stream paradigm, VideoPainter, to process masked videos. We also introduce a novel target region ID resampling technique that enables any-length video inpainting.
arXiv Detail & Related papers (2025-03-07T17:59:46Z)
- VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control [66.66226299852559]
VideoAnydoor is a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control.
To preserve the detailed appearance and meanwhile support fine-grained motion control, we design a pixel warper.
arXiv Detail & Related papers (2025-01-02T18:59:54Z)
- End-To-End Underwater Video Enhancement: Dataset and Model [6.153714458213646]
Underwater video enhancement (UVE) aims to improve the visibility and frame quality of underwater videos.
Existing methods primarily focus on developing image enhancement algorithms to enhance each frame independently.
This study represents the first comprehensive exploration of UVE to our knowledge.
arXiv Detail & Related papers (2024-03-18T06:24:46Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Make Pixels Dance: High-Dynamic Video Generation [13.944607760918997]
State-of-the-art video generation methods tend to produce video clips with minimal motion despite maintaining high fidelity.
We introduce PixelDance, a novel approach that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation.
arXiv Detail & Related papers (2023-11-18T06:25:58Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
- Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution, VOIN, jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both a T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z)
- Video Exploration via Video-Specific Autoencoders [60.256055890647595]
We present video-specific autoencoders that enable human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
arXiv Detail & Related papers (2021-03-31T17:56:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.