Related papers: Object-Centric Data Synthesis for Category-level Object Detection

URL: http://arxiv.org/abs/2511.23450v1
Date: Fri, 28 Nov 2025 18:41:46 GMT
Title: Object-Centric Data Synthesis for Category-level Object Detection
Authors: Vikhyat Agarwal, Jiayi Cora Guo, Declan Hoban, Sissi Zhang, Nicholas Moran, Peter Cho, Srilakshmi Pattabiraman, Shantanu Joshi,
Abstract summary: We introduce the object-centric data setting, when limited data is available in the form of object-centric data (multi-view images or 3D models). We evaluate the performance of four different data synthesis methods to finetune object detection models on novel object categories in this setting.
Score: 1.349100458364391
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capability to new object classes requires large amounts of annotated training data, which is costly and time-consuming to acquire, especially for long-tailed classes with insufficient representation in existing datasets. Here, we introduce the object-centric data setting, when limited data is available in the form of object-centric data (multi-view images or 3D models), and systematically evaluate the performance of four different data synthesis methods to finetune object detection models on novel object categories in this setting. The approaches are based on simple image processing techniques, 3D rendering, and image diffusion models, and use object-centric data to synthesize realistic, cluttered images with varying contextual coherence and complexity. We assess how these methods enable models to achieve category-level generalization in real-world data, and demonstrate significant performance boosts within this data-constrained experimental setting.
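
The abstract names "simple image processing techniques" as one family of synthesis methods but gives no details here. As a rough illustration only, the sketch below shows a generic cut-and-paste compositor: a masked object crop is pasted onto a background image and the matching bounding-box label is produced. It is not the authors' pipeline. The function names (`paste_object`, `synthesize`), the uniform random placement, and the `(x_min, y_min, x_max, y_max)` box convention are all assumptions made for this example, and the object crop is assumed to fit inside the background.

```python
import numpy as np


def paste_object(background, obj_rgb, obj_mask, top, left):
    """Paste a masked object crop onto a copy of the background.

    background: (H, W, 3) array; obj_rgb: (h, w, 3) array;
    obj_mask: (h, w) boolean array marking object pixels.
    Returns the composite and a box (x_min, y_min, x_max, y_max),
    with the max coordinates exclusive.
    """
    out = background.copy()
    h, w = obj_mask.shape
    # Slicing gives a view, so writing into `region` updates `out`.
    region = out[top:top + h, left:left + w]
    region[obj_mask] = obj_rgb[obj_mask]
    # Tight box around the pasted pixels, in background coordinates.
    ys, xs = np.nonzero(obj_mask)
    bbox = (
        int(left + xs.min()),
        int(top + ys.min()),
        int(left + xs.max() + 1),
        int(top + ys.max() + 1),
    )
    return out, bbox


def synthesize(background, obj_rgb, obj_mask, rng):
    """Place the object at a uniformly random position that fits (an assumption)."""
    H, W = background.shape[:2]
    h, w = obj_mask.shape
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    return paste_object(background, obj_rgb, obj_mask, top, left)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bg = np.zeros((64, 64, 3), dtype=np.uint8)
    obj = np.full((16, 24, 3), 200, dtype=np.uint8)
    mask = np.ones((16, 24), dtype=bool)
    image, box = synthesize(bg, obj, mask, rng)
    print(image.shape, box)
```

A more realistic variant would paste several objects per image, add distractors, and blend the mask edges; those choices are left out to keep the sketch short.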

Related papers

Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching [19.730504197461144]
We present a novel generalizable object pose estimation method to determine the object pose using only one RGB image. Our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object.
arXiv Detail & Related papers (2024-11-24T14:31:50Z)
Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization. We introduce a benchmark comprising eight different synthetic and real-world datasets. We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation. The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects. Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
arXiv Detail & Related papers (2023-06-07T17:57:45Z)
MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training. First, we present a 6D pose refiner based on a render-and-compare strategy which can be applied to novel objects. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
FewSOL: A Dataset for Few-Shot Object Learning in Robotic Environments [21.393674766169543]
We introduce the Few-Shot Object Learning dataset for object recognition with a few images per object. We captured 336 real-world objects with 9 RGB-D images per object from different views. The evaluation results show that there is still a large margin to be improved for few-shot object classification in robotic environments.
arXiv Detail & Related papers (2022-07-06T05:57:24Z)
Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously-unseen objects from monocular images. We follow a retrieval-based strategy and prevent the network from learning object-specific features. Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
Conditional Object-Centric Learning from Video [34.012087337046005]
We introduce a sequential extension to Slot Attention to predict optical flow for realistic-looking synthetic scenes. We show that conditioning the initial state of this model on a small set of hints, such as the center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and longer video sequences.
arXiv Detail & Related papers (2021-11-24T16:10:46Z)
Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
Unsupervised Learning of 3D Object Categories from Videos in the Wild [75.09720013151247]
We focus on learning a model from multiple views of a large collection of object instances. We propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction. Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks.
arXiv Detail & Related papers (2021-03-30T17:57:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.