SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding
- URL: http://arxiv.org/abs/2510.09110v1
- Date: Fri, 10 Oct 2025 08:04:30 GMT
- Title: SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding
- Authors: Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Ranjay Krishna
- Abstract summary: We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting. Models trained on 100,000 synthetic images from SOS outperform those trained on larger real-image datasets.
- Score: 44.30364980636359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{\text{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{\text{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.
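The abstract describes an object-centric composition step: pasting a segmented object into a new image and reading the mask and box annotations off the paste for free. A minimal sketch of that idea is below; the function name, arguments, and placement logic are illustrative assumptions, and the structured layout priors and generative relighting of the actual SOS pipeline are omitted.

```python
import numpy as np

def paste_segment(background, segment_rgba, top, left):
    """Alpha-composite an RGBA object segment onto a background image
    and return the composite plus its derived mask and bounding box.

    Hypothetical illustration of object-centric composition; the real
    SOS pipeline additionally applies layout priors and relighting.
    """
    h, w = segment_rgba.shape[:2]
    out = background.copy()
    alpha = segment_rgba[:, :, 3:4].astype(np.float32) / 255.0
    region = out[top:top + h, left:left + w].astype(np.float32)
    out[top:top + h, left:left + w] = (
        alpha * segment_rgba[:, :, :3] + (1.0 - alpha) * region
    ).astype(background.dtype)

    # Free annotations: the alpha channel is the instance mask, and its
    # nonzero extent is the bounding box (x1, y1, x2, y2).
    mask = np.zeros(background.shape[:2], dtype=np.uint8)
    mask[top:top + h, left:left + w] = (segment_rgba[:, :, 3] > 0).astype(np.uint8)
    ys, xs = np.nonzero(mask)
    box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)
    return out, mask, box
```

Because the mask and box are computed directly from the pasted alpha channel rather than predicted, they are exact by construction, which is the key advantage of composition-based synthesis over annotating generated full scenes.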
Related papers
- Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models. Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Synthetic data enables faster annotation and robust segmentation for multi-object grasping in clutter [9.092550803271005]
We propose a synthetic data generation method that minimizes human intervention and makes downstream image segmentation algorithms more robust.
Experiments show that the proposed synthetic scene generation can diminish labelling time dramatically.
Pick-and-place experiments demonstrate that segmentation trained on our hybrid dataset achieves (98.9%, 70%) labelling and grasping success rates, outperforming the real dataset by (6.7%, 18.8%) and a publicly available dataset by (2.8%, 10%).
arXiv Detail & Related papers (2024-01-24T11:58:30Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - DataDAM: Efficient Dataset Distillation with Attention Matching [14.44010988811002]
Researchers have long tried to minimize training costs in deep learning by maintaining strong generalization across diverse datasets. Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset. However, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data.
arXiv Detail & Related papers (2023-09-29T19:07:48Z) - DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z) - CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data [2.554905387213586]
We present a visual localization system that learns to estimate camera poses in the real world with the help of synthetic data.
To mitigate the data scarcity issue, we introduce TOPO-DataGen, a versatile synthetic data generation tool.
We also introduce CrossLoc, a cross-modal visual representation learning approach to pose estimation.
arXiv Detail & Related papers (2021-12-16T18:05:48Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - PennSyn2Real: Training Object Recognition Models without Human Labeling [12.923677573437699]
We propose PennSyn2Real - a synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs).
The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification.
We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation.
arXiv Detail & Related papers (2020-09-22T02:53:40Z) - Multi-Spectral Image Synthesis for Crop/Weed Segmentation in Precision Farming [3.4788711710826083]
We propose an alternative solution with respect to the common data augmentation methods, applying it to the problem of crop/weed segmentation in precision farming.
We create semi-artificial samples by replacing the most relevant object classes (i.e., crop and weeds) with their synthesized counterparts.
In addition to RGB data, we take into account also near-infrared (NIR) information, generating four channel multi-spectral synthetic images.
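The semi-artificial sample construction described above can be sketched as a masked pixel swap followed by channel stacking; the function and argument names here are illustrative assumptions, not the paper's API.

```python
import numpy as np

def make_semi_artificial(rgb, nir, class_mask, synth_rgb, synth_nir):
    """Replace the masked object pixels (e.g. crop/weed) in a real
    RGB+NIR pair with synthesized counterparts and return a single
    4-channel (R, G, B, NIR) multi-spectral sample.

    Hypothetical sketch; names and interfaces are illustrative.
    """
    m = class_mask.astype(bool)
    rgb_out = rgb.copy()
    nir_out = nir.copy()
    rgb_out[m] = synth_rgb[m]  # swap only the relevant object classes
    nir_out[m] = synth_nir[m]
    # Stack RGB and NIR into one 4-channel image
    return np.dstack([rgb_out, nir_out[..., None]])
```

Keeping the background pixels real while synthesizing only the target classes is what makes the samples "semi-artificial": the augmentation varies exactly the classes the segmenter must learn, without distorting scene context.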
arXiv Detail & Related papers (2020-09-12T08:49:36Z) - Gradient-Induced Co-Saliency Detection [81.54194063218216]
Co-saliency detection (Co-SOD) aims to segment the common salient foreground in a group of relevant images.
In this paper, inspired by human behavior, we propose a gradient-induced co-saliency detection method.
arXiv Detail & Related papers (2020-04-28T08:40:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.