MVMO: A Multi-Object Dataset for Wide Baseline Multi-View Semantic
Segmentation
- URL: http://arxiv.org/abs/2205.15452v1
- Date: Mon, 30 May 2022 22:37:43 GMT
- Title: MVMO: A Multi-Object Dataset for Wide Baseline Multi-View Semantic
Segmentation
- Authors: Aitor Alvarez-Gila, Joost van de Weijer, Yaxing Wang, Estibaliz
Garrote
- Abstract summary: We present MVMO (Multi-View, Multi-Object dataset): a synthetic dataset of 116,000 scenes containing randomly placed objects of 10 distinct classes.
MVMO comprises photorealistic, path-traced image renders, together with semantic segmentation ground truth for every view.
- Score: 34.88648947680952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MVMO (Multi-View, Multi-Object dataset): a synthetic dataset of
116,000 scenes containing randomly placed objects of 10 distinct classes and
captured from 25 camera locations in the upper hemisphere. MVMO comprises
photorealistic, path-traced image renders, together with semantic segmentation
ground truth for every view. Unlike existing multi-view datasets, MVMO features
wide baselines between cameras and high density of objects, which lead to large
disparities, heavy occlusions and view-dependent object appearance. Single-view
semantic segmentation is hindered by self- and inter-object occlusions, which
additional viewpoints can help resolve. Therefore, we expect that MVMO will
propel research in multi-view semantic segmentation and cross-view semantic
transfer. We also provide baselines that show that new research is needed in
such fields to exploit the complementary information of multi-view setups.
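The abstract fully determines the dataset's shape: 116,000 scenes, 25 views per scene (2.9 million rendered images in total), 10 object classes, and a semantic segmentation mask for every view. The snippet below is a minimal loading sketch under assumed conventions; the per-scene directory layout, file names, and integer-mask encoding are hypothetical, since the abstract does not describe the on-disk format.

# Minimal loading sketch for a scene-per-directory multi-view layout.
# Assumed (hypothetical) layout: <root>/<scene_id>/view_XX.png and mask_XX.png,
# where each mask stores one class index per pixel.
from pathlib import Path
from typing import Tuple

import numpy as np
from PIL import Image


class MultiViewSegmentationScenes:
    """Yields all views of a scene together with their label masks."""

    NUM_VIEWS = 25      # camera locations on the upper hemisphere
    NUM_CLASSES = 10    # distinct object classes, per the abstract

    def __init__(self, root: str):
        self.scene_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())

    def __len__(self) -> int:
        return len(self.scene_dirs)

    def __getitem__(self, idx: int) -> Tuple[np.ndarray, np.ndarray]:
        scene = self.scene_dirs[idx]
        images, masks = [], []
        for v in range(self.NUM_VIEWS):
            images.append(np.asarray(Image.open(scene / f"view_{v:02d}.png")))
            masks.append(np.asarray(Image.open(scene / f"mask_{v:02d}.png")))
        # Shapes: (25, H, W, 3) for the renders, (25, H, W) for the masks.
        return np.stack(images), np.stack(masks)

Loading views scene by scene, rather than image by image, keeps the 25 wide-baseline viewpoints of the same scene together, which is what cross-view semantic transfer and multi-view segmentation baselines operate on.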
Related papers
- X-SAM: From Segment Anything to Any Segmentation [63.79182974315084]
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. We present X-SAM, a streamlined Multimodal Large Language Model framework that extends the segmentation paradigm from "segment anything" to "any segmentation". We propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities.
arXiv Detail & Related papers (2025-08-06T17:19:10Z) - Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video [37.755852787082254]
We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content. This dataset is the largest amodal segmentation dataset and the first amodal content dataset to date. We include two new contributions to deep learning for computer vision.
arXiv Detail & Related papers (2025-07-01T00:36:56Z) - A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark [8.707197692292292]
We introduce NWPU-Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high-resolution images (1024-2048px) spanning 30+ countries with 49,745 annotated targets. We also propose the Multi-scale Referring Network (MRSNet), a novel framework tailored for the unique demands of RRSIS.
arXiv Detail & Related papers (2025-06-04T05:26:51Z) - MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation [14.144097766150397]
We present a dataset called Multi-target and Multi-granularity Reasoning (MMR).
MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects.
We propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation.
arXiv Detail & Related papers (2025-03-18T04:23:09Z) - Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities.
We propose a novel framework based on multimodal retrieval-augmented generation (RAG).
The framework introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z) - Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z) - 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - Matching Anything by Segmenting Anything [109.2507425045143]
We propose MASA, a novel method for robust instance association learning.
MASA learns instance-level correspondence through exhaustive data transformations.
We show that MASA achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences.
arXiv Detail & Related papers (2024-06-06T16:20:07Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged for high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
We instead model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z) - Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z) - 3M3D: Multi-view, Multi-path, Multi-representation for 3D Object
Detection [0.5156484100374059]
We propose 3M3D: A Multi-view, Multi-path, Multi-representation for 3D Object Detection.
We update both multi-view features and query features to enhance the representation of the scene in both fine panoramic view and coarse global view.
We show performance improvements on nuScenes benchmark dataset on top of our baselines.
arXiv Detail & Related papers (2023-02-16T11:28:30Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z) - Learning Object-Centric Representations of Multi-Object Scenes from
Multiple Views [9.556376932449187]
Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON resolves spatial ambiguities better than single-view methods.
arXiv Detail & Related papers (2021-11-13T13:54:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.