A Unified Image-Dense Annotation Generation Model for Underwater Scenes
- URL: http://arxiv.org/abs/2503.21771v1
- Date: Thu, 27 Mar 2025 17:59:43 GMT
- Title: A Unified Image-Dense Annotation Generation Model for Underwater Scenes
- Authors: Hongkai Lin, Dingkang Liang, Zhenghao Qi, Xiang Bai
- Abstract summary: This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks.
- Score: 48.34534171882895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE.
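The abstract names ILS and TAN without detailing how they work. As a rough, hypothetical sketch, a time-adaptive normalization layer can be read as AdaLN-style conditioning: the diffusion timestep embedding predicts the per-channel scale and shift used to modulate normalized features. All names and sizes below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TimeAdaptiveNorm(nn.Module):
    """Hypothetical AdaLN-style sketch of a TAN layer: normalize features,
    then modulate them with a scale/shift predicted from the diffusion
    timestep embedding. Not the paper's actual code."""
    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels, affine=False)
        # One linear layer predicts a per-channel scale and shift.
        self.to_scale_shift = nn.Linear(time_dim, channels * 2)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features; t_emb: (B, time_dim) timestep embedding
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```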
Related papers
- FSSUWNet: Mitigating the Fragility of Pre-trained Models with Feature Enhancement for Few-Shot Semantic Segmentation in Underwater Images [4.19512807949895]
Few-Shot Semantic Segmentation (FSS) has recently progressed in data-scarce domains.
We show that existing FSS methods often struggle to generalize to underwater environments.
We propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement.
arXiv Detail & Related papers (2025-04-01T07:09:15Z) - Improving underwater semantic segmentation with underwater image quality attention and multi-scale aggregation attention [13.73105543582749]
UnderWater SegFormer (UWSegFormer) is a transformer-based framework for semantic segmentation of low-quality underwater images.
Compared to state-of-the-art methods, the proposed approach offers better segmentation completeness, boundary clarity, and subjective perceptual detail.
arXiv Detail & Related papers (2025-03-30T12:47:56Z) - Unified Dense Prediction of Video Diffusion [91.16237431830417]
We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts.
We utilize colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation.
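As an illustration of the colormap idea, the snippet below (my own sketch, not the paper's code) encodes a depth map as an RGB image via a standard matplotlib colormap, so the same RGB generator can emit frames, masks, and depth in one output space.

```python
import numpy as np
import matplotlib.pyplot as plt

def depth_to_rgb(depth: np.ndarray, cmap_name: str = "viridis") -> np.ndarray:
    """Encode a depth map as an RGB image via a colormap, so a standard
    RGB video generator can produce it alongside ordinary frames."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # to [0, 1]
    rgba = plt.get_cmap(cmap_name)(d)        # (H, W, 4) floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)

depth = np.random.rand(64, 64).astype(np.float32)
print(depth_to_rgb(depth).shape)             # (64, 64, 3)
```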
arXiv Detail & Related papers (2025-03-12T12:41:02Z) - Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark [52.339936954958034]
The dynamic imbalance between foreground and background is a major challenge in video object counting.
This paper proposes a density-embedded Efficient Masked Autoencoder Counting (E-MAC) framework.
In addition, we present DroneBird, the first large-scale video bird counting dataset captured in natural scenarios, in support of migratory bird protection.
arXiv Detail & Related papers (2024-11-20T06:08:21Z) - Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion [30.122666238416716]
We propose a novel pipeline for generating underwater images using accurate terrestrial depth data.
This approach facilitates the training of supervised models for underwater depth estimation.
We introduce a unique Depth2Underwater ControlNet, trained on specially prepared Underwater, Depth, Text data triplets.
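The general pattern this entry describes can be sketched with the diffusers library. Note that the stock depth ControlNet below stands in for the paper's Depth2Underwater weights, whose checkpoint path is not given here, and the prompt and filenames are placeholders.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Generic depth-conditioned ControlNet as a stand-in for Depth2Underwater.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth_map = Image.open("terrestrial_depth.png")  # placeholder depth image
image = pipe(
    "a coral reef on the sea floor, underwater photograph",
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("synthetic_underwater.png")           # paired with depth_map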
arXiv Detail & Related papers (2023-12-19T08:56:33Z) - Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots [0.0]
We formulate a deep learning model that fuses sparse depth measurements from triangulated features to improve the depth predictions.
The network is trained in a supervised fashion on the forward-looking underwater dataset, FLSea.
The method achieves real-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single CPU core.
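The paper fuses the sparse measurements inside the network; a much simpler post-hoc baseline that conveys the idea of metric scaling is a least-squares scale-and-shift fit of the relative prediction to the sparse triangulated points (a sketch of my own, not the paper's method):

```python
import numpy as np

def align_depth(pred: np.ndarray, sparse: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fit scale s and shift t minimizing ||s*pred + t - sparse||^2 over
    the pixels where sparse metric measurements exist."""
    p, d = pred[mask], sparse[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)     # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * pred + t

pred = np.random.rand(48, 64)                      # relative depth prediction
sparse = np.zeros_like(pred)
mask = np.zeros_like(pred, dtype=bool)
mask[::8, ::8] = True                              # a few triangulated points
sparse[mask] = 2.0 * pred[mask] + 0.5              # toy metric measurements
aligned = align_depth(pred, sparse, mask)
print(np.abs(aligned[mask] - sparse[mask]).max())  # ~0
```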
arXiv Detail & Related papers (2023-10-25T16:32:31Z) - DeepAqua: Self-Supervised Semantic Segmentation of Wetland Surface Water Extent with SAR Images using Knowledge Distillation [44.99833362998488]
We present DeepAqua, a self-supervised deep learning model that eliminates the need for manual annotations during the training phase.
We exploit cases where optical- and radar-based water masks coincide, enabling the detection of both open and vegetated water surfaces.
Experimental results show that DeepAqua outperforms other unsupervised methods by improving accuracy by 7%, Intersection Over Union by 27%, and F1 score by 14%.
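The summary does not spell out how the optical water masks are obtained; a common choice for optical pseudo-labels, assumed here purely for illustration, is thresholding the Normalized Difference Water Index, NDWI = (G - NIR) / (G + NIR):

```python
import numpy as np

def ndwi_water_mask(green: np.ndarray, nir: np.ndarray, thresh: float = 0.0) -> np.ndarray:
    """Pseudo-label water from optical bands via NDWI. The resulting mask can
    supervise a SAR segmentation student where the optical and radar views
    of the same scene coincide. The threshold value is illustrative."""
    ndwi = (green - nir) / (green + nir + 1e-8)
    return ndwi > thresh

green = np.random.rand(128, 128).astype(np.float32)
nir = np.random.rand(128, 128).astype(np.float32)
print(ndwi_water_mask(green, nir).mean())   # fraction of pseudo-labeled water
```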
arXiv Detail & Related papers (2023-05-02T18:06:21Z) - Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
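One concrete way to impose explicit control over cross-attention, sketched below under my own assumptions rather than as the paper's algorithm, is to reweight how strongly image queries attend to a chosen text token and then renormalize:

```python
import torch

def reweight_cross_attention(q, k, v, token_idx: int, gain: float = 2.0):
    """Illustrative cross-attention control: amplify the attention paid to
    one text token's key, then renormalize rows. In-place edits are fine
    here since this sketch runs without autograd."""
    d = q.shape[-1]
    attn = (q @ k.transpose(-2, -1)) / d**0.5      # (B, Nq, Nk) scores
    attn = attn.softmax(dim=-1)
    attn[..., token_idx] *= gain                   # boost one token's column
    attn = attn / attn.sum(dim=-1, keepdim=True)   # rows sum to 1 again
    return attn @ v

q = torch.rand(1, 16, 64); k = torch.rand(1, 8, 64); v = torch.rand(1, 8, 64)
print(reweight_cross_attention(q, k, v, token_idx=2).shape)  # (1, 16, 64)
```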
arXiv Detail & Related papers (2023-04-07T23:49:34Z) - Adaptive deep learning framework for robust unsupervised underwater image enhancement [3.0516727053033392]
One of the main challenges in deep learning-based underwater image enhancement is the limited availability of high-quality training data.
We propose a novel unsupervised underwater image enhancement framework that employs a conditional variational autoencoder (cVAE) to train a deep learning model.
We show that our proposed framework yields competitive performance compared to other state-of-the-art approaches on both quantitative and qualitative metrics.
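For readers unfamiliar with the cVAE component, here is a minimal, illustrative sketch of a conditional VAE for enhancement, with the decoder conditioned on the degraded input; the sizes and architecture are hypothetical, not the paper's:

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Minimal cVAE sketch: encode a degraded underwater image x into a
    latent distribution, then decode an enhanced image conditioned on x.
    All dimensions here are illustrative."""
    def __init__(self, dim: int = 3 * 64 * 64, latent: int = 128):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent)
        self.logvar = nn.Linear(512, latent)
        # The decoder sees both the latent code and the degraded input.
        self.dec = nn.Sequential(nn.Linear(latent + dim, 512), nn.ReLU(),
                                 nn.Linear(512, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        out = self.dec(torch.cat([z, x.flatten(1)], dim=1))
        return out.view_as(x), mu, logvar

model = ConditionalVAE()
enhanced, mu, logvar = model(torch.rand(2, 3, 64, 64))
print(enhanced.shape)  # torch.Size([2, 3, 64, 64])
```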
arXiv Detail & Related papers (2022-12-18T01:07:20Z) - Urban Scene Semantic Segmentation with Low-Cost Coarse Annotation [107.72926721837726]
Coarse annotation is a low-cost but highly effective alternative for training semantic segmentation models.
We propose a coarse-to-fine self-training framework that generates pseudo labels for unlabeled regions of coarsely annotated data.
Our method achieves a significantly better performance-versus-annotation-cost tradeoff, yielding performance comparable to fully annotated data with only a small fraction of the annotation budget.
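A minimal version of the pseudo-labeling step, assuming a standard confidence-threshold rule (the exact rule used by the paper is not stated in this summary):

```python
import torch

IGNORE = 255  # label value marking unlabeled pixels in coarse annotations

def refine_coarse_labels(logits: torch.Tensor, coarse: torch.Tensor,
                         conf_thresh: float = 0.9) -> torch.Tensor:
    """Illustrative coarse-to-fine self-training step: keep the given coarse
    labels, and fill unlabeled pixels with the model's prediction wherever
    its softmax confidence exceeds a threshold."""
    probs = logits.softmax(dim=1)                  # (B, C, H, W)
    conf, pred = probs.max(dim=1)                  # both (B, H, W)
    pseudo = coarse.clone()
    fill = (coarse == IGNORE) & (conf > conf_thresh)
    pseudo[fill] = pred[fill]
    return pseudo

logits = torch.randn(1, 19, 32, 32)
coarse = torch.full((1, 32, 32), IGNORE, dtype=torch.long)
print(refine_coarse_labels(logits, coarse).unique()[:5])
```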
arXiv Detail & Related papers (2022-12-15T15:43:42Z) - SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
However, it relies on the multi-view consistency assumption for training, which is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model to generate a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
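Obtaining such a single-image depth prior can be as simple as calling an off-the-shelf pretrained network; the sketch below uses MiDaS via torch.hub purely as an example, since this summary does not name the prior network used by the paper:

```python
import torch

# Illustrative: an off-the-shelf pretrained monocular depth network
# provides a relative depth prior for each video frame.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()

frame = torch.rand(1, 3, 384, 384)      # placeholder video frame
with torch.no_grad():
    depth_prior = midas(frame)          # (1, 384, 384) relative depth
print(depth_prior.shape)
```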
arXiv Detail & Related papers (2022-11-07T16:17:47Z) - Overcoming Annotation Bottlenecks in Underwater Fish Segmentation: A Robust Self-Supervised Learning Approach [3.0516727053033392]
This paper introduces a novel self-supervised learning approach for fish segmentation.
Our model, trained without manual annotation, learns robust and generalizable representations by aligning features across augmented views; a minimal sketch of such an alignment objective follows this entry.
We demonstrate its effectiveness on three challenging underwater video datasets: DeepFish, Seagrass, and YouTube-VOS.
arXiv Detail & Related papers (2022-06-11T01:20:48Z)
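A minimal sketch of the feature-alignment objective referenced in the entry above, assuming a SimSiam-style negative cosine similarity between two augmented views (illustrative, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def alignment_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Pull together the features of two augmented views of the same frame
    via negative cosine similarity; lower is better aligned."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    return -(a * b).sum(dim=-1).mean()

feat_a, feat_b = torch.rand(8, 256), torch.rand(8, 256)
print(alignment_loss(feat_a, feat_b).item())
```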
This list is automatically generated from the titles and abstracts of the papers on this site.