Learning from Synthetic Data for Visual Grounding
- URL: http://arxiv.org/abs/2403.13804v2
- Date: Mon, 16 Dec 2024 14:53:21 GMT
- Title: Learning from Synthetic Data for Visual Grounding
- Authors: Ruozhen He, Ziyan Yang, Paola Cascante-Bonilla, Alexander C. Berg, Vicente Ordonez
- Abstract summary: We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models.
Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81 and 17.11 absolute percentage points, respectively.
- Score: 55.21937116752679
- Abstract: This paper extensively investigates the effectiveness of synthetic training data to improve the capabilities of vision-and-language models for grounding textual descriptions to image regions. We explore various strategies to best generate image-text pairs and image-text-box triplets using a series of pretrained models under different settings and varying degrees of reliance on real data. Through comparative analyses with synthetic, real, and web-crawled data, we identify factors that contribute to performance differences, and propose SynGround, an effective pipeline for generating useful synthetic data for visual grounding. Our findings show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models and offers the potential for arbitrarily large-scale data generation. In particular, data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81 and 17.11 absolute percentage points, respectively, across the RefCOCO+ and the Flickr30k benchmarks.
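The abstract reports results as pointing game accuracy without defining it. As commonly described, a prediction counts as a hit when the peak of the model's heatmap for a phrase falls inside the ground-truth box, and accuracy is the fraction of hits. The following is a minimal sketch under that assumption; the function names, the `(x0, y0, x1, y1)` inclusive box format, and the NumPy heatmap representation are illustrative choices, not taken from the paper.

```python
import numpy as np

def pointing_game_hit(heatmap, box):
    """Return True if the heatmap peak lies inside box = (x0, y0, x1, y1), inclusive.

    Assumption: heatmap is a 2D array indexed as [row (y), column (x)].
    """
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)

def pointing_game_accuracy(heatmaps, boxes):
    """Fraction of (heatmap, box) pairs where the peak falls inside the box."""
    hits = [pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes)]
    return sum(hits) / len(hits)

# Toy example: one 4x4 heatmap with its peak at row 1, column 2.
heat = np.zeros((4, 4))
heat[1, 2] = 1.0
accuracy = pointing_game_accuracy([heat, heat], [(1, 0, 3, 2), (0, 2, 1, 3)])
```

In this toy case the peak falls inside the first box but outside the second, so the accuracy is 0.5. Real evaluations may differ in details such as how ties or multiple ground-truth boxes are handled.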
Related papers
- RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
A substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.
We establish a Real-World Data Extraction pipeline to extract high-quality images and texts.
Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts.
We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales.
arXiv Detail & Related papers (2025-02-18T03:58:38Z) - Improving Object Detection by Modifying Synthetic Data with Explainable AI [3.0519884745675485]
We propose a novel conceptual approach to improve the performance of computer vision models trained on synthetic images.
We use robust Explainable AI (XAI) techniques to guide the modification of 3D models used to generate these images.
We show that synthetic data can improve detection of vehicles in orientations unseen in training by 4.6%.
arXiv Detail & Related papers (2024-12-02T13:24:43Z) - Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose a dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - Deep Domain Adaptation: A Sim2Real Neural Approach for Improving Eye-Tracking Systems [80.62854148838359]
Eye image segmentation is a critical step in eye tracking that has great influence over the final gaze estimate.
We use dimensionality-reduction techniques to measure the overlap between the target eye images and synthetic training data.
Our methods result in robust, improved performance when tackling the discrepancy between simulation and real-world data samples.
arXiv Detail & Related papers (2024-03-23T22:32:06Z) - Domain Adaptation of Synthetic Driving Datasets for Real-World Autonomous Driving [0.11470070927586014]
Networks trained with synthetic data for certain computer vision tasks degrade significantly when tested on real-world data.
In this paper, we propose and evaluate novel ways to improve such approaches.
We propose a novel method to efficiently incorporate semantic supervision into this pair selection, which helps in boosting the performance of the model.
arXiv Detail & Related papers (2023-02-08T15:51:54Z) - Is synthetic data from generative models ready for image recognition? [69.42645602062024]
We study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks.
We showcase the strengths and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data to recognition tasks.
arXiv Detail & Related papers (2022-10-14T06:54:24Z) - An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data [2.554905387213586]
We present a visual localization system that learns to estimate camera poses in the real world with the help of synthetic data.
To mitigate the data scarcity issue, we introduce TOPO-DataGen, a versatile synthetic data generation tool.
We also introduce CrossLoc, a cross-modal visual representation learning approach to pose estimation.
arXiv Detail & Related papers (2021-12-16T18:05:48Z) - Synthetic Data and Hierarchical Object Detection in Overhead Imagery [0.0]
We develop novel synthetic data generation and augmentation techniques for enhancing low/zero-sample learning in satellite imagery.
To test the effectiveness of synthetic imagery, we employ it in the training of detection models and our two-stage model, and evaluate the resulting models on real satellite images.
arXiv Detail & Related papers (2021-01-29T22:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.