ORIDa: Object-centric Real-world Image Composition Dataset

URL: http://arxiv.org/abs/2506.08964v1
Date: Tue, 10 Jun 2025 16:36:54 GMT
Title: ORIDa: Object-centric Real-world Image Composition Dataset
Authors: Jinwoo Kim, Sangmin Han, Jinho Jeong, Jiwoo Choi, Dongyoung Kim, Seon Joo Kim
Abstract summary: ORIDa is a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition.
Score: 22.625099905896317
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models. However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. ORIDa has two types of data: factual-counterfactual sets and factual-only scenes. The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene. The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition. Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing.

Related papers

ObjEmbed: Towards Universal Multimodal Object Embeddings [74.39703419628829]
We present ObjEmbed, a novel individual object embedding model. It decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual retrieval, local image retrieval, and global image retrieval.
arXiv Detail & Related papers (2026-02-02T07:38:45Z)
Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations [112.29763628638112]
Object-X is a versatile multi-modal 3D representation framework. It can encode rich object embeddings and decode them back into geometric and visual reconstructions. It supports a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization.
arXiv Detail & Related papers (2025-06-05T09:14:42Z)
Learning Global Object-Centric Representations via Disentangled Slot Attention [38.78205074748021]
This paper introduces a novel object-centric learning method to empower AI systems with human-like capabilities to identify objects across scenes and generate diverse scenes containing specific objects by learning a set of global object-centric representations. Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.
arXiv Detail & Related papers (2024-10-24T14:57:00Z)
360 in the Wild: Dataset for Depth Prediction and View Synthesis [66.58513725342125]
We introduce a large-scale 360° video dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map.
arXiv Detail & Related papers (2024-06-27T05:26:38Z)
Zero-Shot Multi-Object Scene Completion [59.325611678171974]
We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image. Our method outperforms the current state-of-the-art on both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-03-21T17:59:59Z)
DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis [90.32352050266104]
DisCoScene is a 3D-aware generative model for high-quality and controllable scene synthesis. It disentangles the whole scene into object-centric generative fields by learning on only 2D images with the global-local discrimination. We demonstrate state-of-the-art performance on many scene datasets, including the challenging outdoor dataset.
arXiv Detail & Related papers (2022-12-22T18:59:59Z)
ImageSubject: A Large-scale Dataset for Subject Detection [9.430492045581534]
Main subjects usually exist in images or videos, as they are the objects that the photographer wants to highlight. Detecting the main subjects is an important technique to help machines understand the content of images and videos. We present a new dataset with the goal of training models to understand the layout of the objects and then to find the main subjects among them.
arXiv Detail & Related papers (2022-01-09T22:49:59Z)
ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations [52.226947570070784]
We present ObjectFolder, a dataset of 100 objects that addresses both challenges with two key innovations. First, ObjectFolder encodes the visual, auditory, and tactile sensory data for all objects, enabling a number of multisensory object recognition tasks. Second, ObjectFolder employs a uniform, object-centric, and implicit representation for each object's visual textures, acoustic simulations, and tactile readings, making the dataset flexible to use and easy to share.
arXiv Detail & Related papers (2021-09-16T14:00:59Z)
Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition [57.088328223220934]
Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world. In this work, we propose a higher-level scene understanding system to tackle both visible and invisible parts of objects and backgrounds in a given scene.
arXiv Detail & Related papers (2021-04-12T11:37:23Z)
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding [8.720130442653575]
Hypersim is a synthetic dataset for holistic indoor scene understanding. We generate 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
arXiv Detail & Related papers (2020-11-04T20:12:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.