MVMO: A Multi-Object Dataset for Wide Baseline Multi-View Semantic
Segmentation
- URL: http://arxiv.org/abs/2205.15452v1
- Date: Mon, 30 May 2022 22:37:43 GMT
- Title: MVMO: A Multi-Object Dataset for Wide Baseline Multi-View Semantic
Segmentation
- Authors: Aitor Alvarez-Gila, Joost van de Weijer, Yaxing Wang, Estibaliz
Garrote
- Abstract summary: We present MVMO (Multi-View, Multi-Object dataset): a synthetic dataset of 116,000 scenes containing randomly placed objects of 10 distinct classes.
MVMO comprises photorealistic, path-traced image renders, together with semantic segmentation ground truth for every view.
- Score: 34.88648947680952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MVMO (Multi-View, Multi-Object dataset): a synthetic dataset of
116,000 scenes containing randomly placed objects of 10 distinct classes and
captured from 25 camera locations in the upper hemisphere. MVMO comprises
photorealistic, path-traced image renders, together with semantic segmentation
ground truth for every view. Unlike existing multi-view datasets, MVMO features
wide baselines between cameras and high density of objects, which lead to large
disparities, heavy occlusions and view-dependent object appearance. Single-view
semantic segmentation is hindered by self- and inter-object occlusions, which
additional viewpoints can help resolve. Therefore, we expect that MVMO will
propel research in multi-view semantic segmentation and cross-view semantic
transfer. We also provide baselines that show that new research is needed in
such fields to exploit the complementary information of multi-view setups.
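The abstract fully determines the dataset's shape: 116,000 scenes, 25 views per scene (2.9 million rendered images in total), 10 object classes, and a semantic segmentation mask for every view. The snippet below is a minimal loading sketch under assumed conventions; the per-scene directory layout, file names, and integer-mask encoding are hypothetical, since the abstract does not describe the on-disk format.

# Minimal loading sketch for a scene-per-directory multi-view layout.
# Assumed (hypothetical) layout: <root>/<scene_id>/view_XX.png and mask_XX.png,
# where each mask stores one class index per pixel.
from pathlib import Path
from typing import Tuple

import numpy as np
from PIL import Image


class MultiViewSegmentationScenes:
    """Yields all views of a scene together with their label masks."""

    NUM_VIEWS = 25      # camera locations on the upper hemisphere
    NUM_CLASSES = 10    # distinct object classes, per the abstract

    def __init__(self, root: str):
        self.scene_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())

    def __len__(self) -> int:
        return len(self.scene_dirs)

    def __getitem__(self, idx: int) -> Tuple[np.ndarray, np.ndarray]:
        scene = self.scene_dirs[idx]
        images, masks = [], []
        for v in range(self.NUM_VIEWS):
            images.append(np.asarray(Image.open(scene / f"view_{v:02d}.png")))
            masks.append(np.asarray(Image.open(scene / f"mask_{v:02d}.png")))
        # Shapes: (25, H, W, 3) for the renders, (25, H, W) for the masks.
        return np.stack(images), np.stack(masks)

Loading views scene by scene, rather than image by image, keeps the 25 wide-baseline viewpoints of the same scene together, which is what cross-view semantic transfer and multi-view segmentation baselines operate on.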
Related papers
- X-SAM: From Segment Anything to Any Segmentation [63.79182974315084]
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. We present X-SAM, a streamlined Multimodal Large Language Model framework that extends the segmentation paradigm from "segment anything" to "any segmentation". We propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities.
arXiv Detail & Related papers (2025-08-06T17:19:10Z) - Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video [37.755852787082254]
We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content. This dataset is the largest amodal segmentation dataset and the first amodal content dataset to date. We include two new contributions to deep learning for computer vision.
arXiv Detail & Related papers (2025-07-01T00:36:56Z) - A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark [8.707197692292292]
We introduce NWPU-Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high-resolution images (1024-2048px) spanning 30+ countries with 49,745 annotated targets. We also propose the Multi-scale Referring Network (MRSNet), a novel framework tailored for the unique demands of RRSIS.
arXiv Detail & Related papers (2025-06-04T05:26:51Z) - MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation [14.144097766150397]
We present a dataset called Multi-target and Multi-granularity Reasoning (MMR).
MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects.
We propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation.
arXiv Detail & Related papers (2025-03-18T04:23:09Z) - Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities.
We propose a novel framework based on multimodal retrieval-augmented generation (RAG).
The framework introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z) - Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z) - 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - Matching Anything by Segmenting Anything [109.2507425045143]
We propose MASA, a novel method for robust instance association learning.
MASA learns instance-level correspondence through exhaustive data transformations.
We show that MASA achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences.
arXiv Detail & Related papers (2024-06-06T16:20:07Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged for high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
We instead model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z) - Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z) - 3M3D: Multi-view, Multi-path, Multi-representation for 3D Object
Detection [0.5156484100374059]
We propose 3M3D: A Multi-view, Multi-path, Multi-representation for 3D Object Detection.
We update both multi-view features and query features to enhance the representation of the scene in both fine panoramic view and coarse global view.
We show performance improvements on nuScenes benchmark dataset on top of our baselines.
arXiv Detail & Related papers (2023-02-16T11:28:30Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z) - Learning Object-Centric Representations of Multi-Object Scenes from
Multiple Views [9.556376932449187]
Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON resolves spatial ambiguities better than single-view methods.
arXiv Detail & Related papers (2021-11-13T13:54:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.