Related papers: ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation

ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation

URL: http://arxiv.org/abs/2111.10265v1
Date: Fri, 19 Nov 2021 15:11:25 GMT
Title: ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation
Authors: Laurynas Karazija, Iro Laina, Christian Rupprecht
Abstract summary: We present ClevrTex, designed as the next challenge to compare, evaluate and analyze algorithms. ClarTex features synthetic scenes with diverse shapes, textures and photo-mapped materials, created using physically based rendering techniques. We benchmark a large set of recent unsupervised multi-object segmentation models on ClevrTex and find all state-of-the-art approaches fail to learn good representations in the textured setting.
Score: 23.767094632640763
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There has been a recent surge in methods that aim to decompose and segment scenes into multiple objects in an unsupervised manner, i.e., unsupervised multi-object segmentation. Performing such a task is a long-standing goal of computer vision, offering to unlock object-level reasoning without requiring dense annotations to train segmentation models. Despite significant progress, current models are developed and trained on visually simple scenes depicting mono-colored objects on plain backgrounds. The natural world, however, is visually complex with confounding aspects such as diverse textures and complicated lighting effects. In this study, we present a new benchmark called ClevrTex, designed as the next challenge to compare, evaluate and analyze algorithms. ClevrTex features synthetic scenes with diverse shapes, textures and photo-mapped materials, created using physically based rendering techniques. It includes 50k examples depicting 3-10 objects arranged on a background, created using a catalog of 60 materials, and a further test set featuring 10k images created using 25 different materials. We benchmark a large set of recent unsupervised multi-object segmentation models on ClevrTex and find all state-of-the-art approaches fail to learn good representations in the textured setting, despite impressive performance on simpler data. We also create variants of the ClevrTex dataset, controlling for different aspects of scene complexity, and probe current approaches for individual shortcomings. Dataset and code are available at https://www.robots.ox.ac.uk/~vgg/research/clevrtex.

Related papers

ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation [14.534308478766476]
We introduce ViCaS, a new dataset containing thousands of challenging videos. Our benchmark evaluates models on holistic/high-level understanding and language-guided, pixel-precise segmentation.
arXiv Detail & Related papers (2024-12-12T23:10:54Z)
3D Part Segmentation via Geometric Aggregation of 2D Visual Features [57.20161517451834]
Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. To address these limitations, we propose COPS, a COmprehensive model for Parts that blends semantics extracted from visual concepts and 3D geometry to effectively identify object parts.
arXiv Detail & Related papers (2024-12-05T15:27:58Z)
AIM 2024 Sparse Neural Rendering Challenge: Dataset and Benchmark [43.76981659253837]
Differentiable rendering relies on a dense viewpoint coverage of the scene. Many challenges arise when only a few input views are available. A recurring problem in sparse rendering literature is the lack of an homogeneous, up-to-date, dataset and evaluation protocol. We introduce a new dataset that follows the setup of the DTU MVS dataset.
arXiv Detail & Related papers (2024-09-23T14:10:06Z)
MegaScenes: Scene-Level View Synthesis at Scale [69.21293001231993]
Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. We create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. We analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency.
arXiv Detail & Related papers (2024-06-17T17:55:55Z)
View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields [52.08335264414515]
We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene. Our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency.
arXiv Detail & Related papers (2024-05-30T04:14:58Z)
Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction [51.3632308129838]
We present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction. Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition. We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing.
arXiv Detail & Related papers (2024-03-28T11:12:33Z)
DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering [126.00165445599764]
We present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering. Our dataset contains over 1500 human subjects, 5000 motion sequences, and 67.5M frames' data volume. We construct a professional multi-view system to capture data, which contains 60 synchronous cameras with max 4096 x 3000 resolution, 15 fps speed, and stern camera calibration steps.
arXiv Detail & Related papers (2023-07-19T17:58:03Z)
DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis [90.32352050266104]
DisCoScene is a 3Daware generative model for high-quality and controllable scene synthesis. It disentangles the whole scene into object-centric generative fields by learning on only 2D images with the global-local discrimination. We demonstrate state-of-the-art performance on many scene datasets, including the challenging outdoor dataset.
arXiv Detail & Related papers (2022-12-22T18:59:59Z)
Breaking the "Object" in Video Object Segmentation [36.20167854011788]
We present a dataset for Video Object under Transformations (VOST) It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long average and densely labeled with masks instance. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
arXiv Detail & Related papers (2022-12-12T19:22:17Z)
Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels. Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions. We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.