An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval
- URL: http://arxiv.org/abs/2503.22171v1
- Date: Fri, 28 Mar 2025 06:18:15 GMT
- Title: An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval
- Authors: Min Cao, ZiYin Zeng, YuXin Lu, Mang Ye, Dong Yi, Jinqiao Wang,
- Abstract summary: We conduct an empirical study to explore the potential of synthetic data for Text-Based Person Retrieval (TBPR) research.<n>We propose an inter-class image generation pipeline, in which an automatic prompt construction strategy is introduced.<n>We develop an intra-class image augmentation pipeline, in which the generative AI models are applied to further edit the images.
- Score: 51.10419281315848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data plays a pivotal role in Text-Based Person Retrieval (TBPR) research. Mainstream research paradigm necessitates real-world person images with manual textual annotations for training models, posing privacy-sensitive and labor-intensive issues. Several pioneering efforts explore synthetic data for TBPR but still rely on real data, keeping the aforementioned issues and also resulting in diversity-deficient issue in synthetic datasets, thus impacting TBPR performance. Moreover, these works tend to explore synthetic data for TBPR through limited perspectives, leading to exploration-restricted issue. In this paper, we conduct an empirical study to explore the potential of synthetic data for TBPR, highlighting three key aspects. (1) We propose an inter-class image generation pipeline, in which an automatic prompt construction strategy is introduced to guide generative Artificial Intelligence (AI) models in generating various inter-class images without reliance on original data. (2) We develop an intra-class image augmentation pipeline, in which the generative AI models are applied to further edit the images for obtaining various intra-class images. (3) Building upon the proposed pipelines and an automatic text generation pipeline, we explore the effectiveness of synthetic data in diverse scenarios through extensive experiments. Additionally, we experimentally investigate various noise-robust learning strategies to mitigate the inherent noise in synthetic data. We will release the code, along with the synthetic large-scale dataset generated by our pipelines, which are expected to advance practical TBPR research.
Related papers
- A Comprehensive Survey of Synthetic Tabular Data Generation [27.112327373017457]
Tabular data is one of the most prevalent and critical data formats across diverse real-world applications.
It is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance.
Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets.
arXiv Detail & Related papers (2025-04-23T08:33:34Z) - Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks [5.0243930429558885]
This paper introduces Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers.
At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information.
The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases.
arXiv Detail & Related papers (2024-07-22T10:31:07Z) - Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models.<n>Data generated with SynGround improves the pointing game accuracy of a pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - On Synthetic Data for Back Translation [66.6342561585953]
Back translation (BT) is one of the most significant technologies in NMT research fields.
We identify two key factors on synthetic data controlling the back-translation NMT performance, which are quality and importance.
We propose a simple yet effective method to generate synthetic data to better trade off both factors so as to yield a better performance for BT.
arXiv Detail & Related papers (2023-10-20T17:24:12Z) - Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z) - Exploiting Multimodal Synthetic Data for Egocentric Human-Object
Interaction Detection in an Industrial Scenario [14.188006024550257]
EgoISM-HOI is a new multimodal dataset composed of synthetic EHOI images in an industrial environment with rich annotations of hands and objects.
Our study shows that exploiting synthetic data to pre-train the proposed method significantly improves performance when tested on real-world data.
To support research in this field, we publicly release the datasets, source code, and pre-trained models at https://iplab.dmi.unict.it/egoism-hoi.
arXiv Detail & Related papers (2023-06-21T09:56:55Z) - GIPSO: Geometrically Informed Propagation for Online Adaptation in 3D
LiDAR Segmentation [60.07812405063708]
3D point cloud semantic segmentation is fundamental for autonomous driving.
Most approaches in the literature neglect an important aspect, i.e., how to deal with domain shift when handling dynamic scenes.
This paper advances the state of the art in this research field.
arXiv Detail & Related papers (2022-07-20T09:06:07Z) - Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z) - Synthetic Data and Hierarchical Object Detection in Overhead Imagery [0.0]
We develop novel synthetic data generation and augmentation techniques for enhancing low/zero-sample learning in satellite imagery.
To test the effectiveness of synthetic imagery, we employ it in the training of detection models and our two stage model, and evaluate the resulting models on real satellite images.
arXiv Detail & Related papers (2021-01-29T22:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.