VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels
- URL: http://arxiv.org/abs/2507.09008v1
- Date: Fri, 11 Jul 2025 20:17:23 GMT
- Title: VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels
- Authors: Xiwei Xuan, Xiaoqi Wang, Wenbin He, Jorge Piazentin Ono, Liang Gou, Kwan-Liu Ma, Liu Ren
- Abstract summary: We introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. We show how VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels.
- Score: 30.699079182148054
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels remains understudied, as existing approaches prioritize data quantity over quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically rely on limited metrics to identify problematic data, lacking a comprehensive perspective, or apply human validation to only a small data fraction, failing to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the complex and demanding domain of open-vocabulary image segmentation, VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels. Through detailed use cases on two benchmark datasets and expert reviews, we demonstrate VISTA's effectiveness from both quantitative and qualitative perspectives.
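The abstract does not spell out the validation phases, but one plausible automated phase can be sketched: scoring each FM-generated segment label by CLIP image-text agreement and routing low-agreement items to human review. A minimal sketch follows, assuming the Hugging Face `transformers` CLIP API; the `score_label`/`route_for_review` names and the 0.20 threshold are illustrative assumptions, not VISTA's actual pipeline.

```python
# Sketch: flag FM-generated segment labels whose CLIP image-text
# agreement is low, so a human can review them. One plausible
# validation phase; not VISTA's actual implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_label(crop: Image.Image, label: str) -> float:
    """Cosine similarity between a segment crop and its text label."""
    inputs = processor(text=[f"a photo of a {label}"], images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def route_for_review(items, threshold=0.20):
    """Send low-agreement (crop, label) pairs to human validation."""
    return [(crop, label) for crop, label in items
            if score_label(crop, label) < threshold]
```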
Related papers
- Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
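One label-free paradigm such taxonomies typically cover is self-training on confident zero-shot pseudo-labels. A minimal sketch, assuming precomputed, L2-normalized CLIP image and class-text features; the 0.9 confidence threshold and training setup are illustrative assumptions.

```python
# Sketch of one label-free adaptation paradigm: self-training a
# classifier head on confident CLIP zero-shot pseudo-labels.
import torch

def pseudo_label(image_feats, text_feats, tau=0.9):
    """image_feats: (N, D), text_feats: (C, D), both L2-normalized."""
    probs = (100.0 * image_feats @ text_feats.T).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= tau                     # keep only confident samples
    return image_feats[keep], labels[keep]

def self_train(head, image_feats, text_feats, epochs=5, lr=1e-3):
    """Fit a small head on the retained pseudo-labeled subset."""
    x, y = pseudo_label(image_feats, text_feats)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        loss = torch.nn.functional.cross_entropy(head(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return head
```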
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
- Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation [25.283739839182147]
We show that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks. We propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information into CoT data. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats.
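The abstract suggests pairing each reasoning step with grounding information. Below is a minimal sketch of what such grounded CoT training records might look like; the bounding-box-per-step schema and `<box>` serialization are illustrative assumptions, not the paper's actual format.

```python
# Sketch of injecting grounding into CoT data, in the spirit of
# GCoT: each reasoning step is paired with the image region
# (bounding box) it refers to. Schema is an assumption.
from dataclasses import dataclass

@dataclass
class GroundedStep:
    text: str                       # one CoT reasoning step
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) it is grounded in

def to_training_text(question: str, steps: list[GroundedStep],
                     answer: str) -> str:
    """Serialize grounded CoT into a single supervision string."""
    lines = [f"Question: {question}"]
    for i, s in enumerate(steps, 1):
        lines.append(f"Step {i}: {s.text} <box>{s.box}</box>")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)
```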
arXiv Detail & Related papers (2025-07-03T17:59:29Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
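The label-verification step can be sketched as an LLM-as-verifier loop: the model judges whether a mined entity label matches the image and its web context, and returns a rationale. `call_mllm` below is a hypothetical client for any multimodal LLM API; the prompt and JSON schema are assumptions, not the paper's exact setup.

```python
# Sketch of the LLM-as-verifier pattern: accept a record only if
# the model confirms the candidate label, and keep its rationale.
import json

VERIFY_PROMPT = """You are verifying web-mined entity labels.
Entity candidate: {label}
Page title: {title}
Alt text: {alt}
Answer in JSON: {{"correct": true/false, "rationale": "..."}}"""

def verify_label(call_mllm, image_bytes: bytes,
                 label: str, title: str, alt: str) -> dict:
    prompt = VERIFY_PROMPT.format(label=label, title=title, alt=alt)
    raw = call_mllm(prompt=prompt, image=image_bytes)
    return json.loads(raw)          # {"correct": ..., "rationale": ...}

def filter_dataset(call_mllm, records):
    """Keep only records the verifier accepts; store the rationale."""
    kept = []
    for r in records:
        v = verify_label(call_mllm, r["image"], r["label"],
                         r["title"], r["alt"])
        if v["correct"]:
            kept.append({**r, "rationale": v["rationale"]})
    return kept
```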
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045]
In computer vision, it is well-known that a lack of data diversity will impair model performance.
We propose a simple yet effective data augmentation approach by leveraging advancements in generative models.
Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
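Background augmentation of this kind can be approximated with an off-the-shelf diffusion inpainting model: keep the object region fixed and let the model repaint everything else. A sketch, assuming the `diffusers` inpainting pipeline; the checkpoint and prompt are illustrative, not the paper's exact recipe.

```python
# Sketch of background augmentation via diffusion inpainting:
# invert the object mask so only the background is resynthesized.
import torch
from PIL import Image, ImageOps
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def augment_background(image: Image.Image, object_mask: Image.Image,
                       prompt: str = "a snowy street") -> Image.Image:
    """object_mask: white = object. White pixels in the inpainting mask
    are repainted, so we invert to protect the object."""
    background_mask = ImageOps.invert(object_mask.convert("L"))
    return pipe(prompt=prompt, image=image,
                mask_image=background_mask).images[0]
```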
arXiv Detail & Related papers (2024-08-01T07:40:00Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
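The multiple-choice protocol can be sketched as a simple accuracy loop over figure/choices/answer triples. `ask_vlm` below is a hypothetical interface to any vision-language model; the lettered options and answer-matching heuristic are assumptions.

```python
# Sketch of a multiple-choice evaluation loop: ask a VLM to pick
# the caption matching a figure and score exact-letter accuracy.
def evaluate_mcq(ask_vlm, items):
    """items: dicts with 'figure' (image), 'choices' (list[str]),
    and 'answer' (index of the correct choice)."""
    correct = 0
    for it in items:
        options = "\n".join(f"({chr(65 + i)}) {c}"
                            for i, c in enumerate(it["choices"]))
        reply = ask_vlm(image=it["figure"],
                        prompt="Which caption matches this figure?\n"
                               f"{options}\nAnswer with a single letter.")
        if reply.strip().upper().startswith(chr(65 + it["answer"])):
            correct += 1
    return correct / len(items)
```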
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, alongside a significant reduction in parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- AttributionScanner: A Visual Analytics System for Model Validation with Metadata-Free Slice Finding [29.07617945233152]
Data slice finding is an emerging technique for validating machine learning (ML) models by identifying and analyzing subgroups in a dataset that exhibit poor performance. This approach faces significant challenges, including the laborious and costly requirement for additional metadata. We introduce AttributionScanner, an innovative human-in-the-loop Visual Analytics (VA) system designed for metadata-free data slice finding. Our system identifies interpretable data slices that involve common model behaviors and visualizes these patterns through an Attribution Mosaic design.
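A rough approximation of metadata-free slice finding: cluster samples by their feature-attribution maps and surface the clusters with the highest error rates for human inspection. The attribution representation and clustering choices below are assumptions, not AttributionScanner's actual algorithm.

```python
# Sketch: cluster flattened attribution maps and rank clusters by
# error rate so the worst "slices" surface first for review.
import numpy as np
from sklearn.cluster import KMeans

def find_slices(attributions: np.ndarray, errors: np.ndarray,
                n_slices: int = 20):
    """attributions: (N, D) flattened attribution maps;
    errors: (N,) 0/1 per-sample mistakes."""
    labels = KMeans(n_clusters=n_slices, n_init=10).fit_predict(attributions)
    slices = []
    for k in range(n_slices):
        idx = np.where(labels == k)[0]
        slices.append((k, idx, errors[idx].mean()))
    # worst-performing slices first, for human inspection
    return sorted(slices, key=lambda s: s[2], reverse=True)
```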
arXiv Detail & Related papers (2024-01-12T09:17:32Z)
- Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment [38.574204922793626]
We present a further analysis of visual modality incompleteness, benchmarking the latest multi-modal entity alignment (MMEA) models on our proposed dataset MMEA-UMVM.
Our research indicates that, in the face of modality incompleteness, models overfit the modality noise and exhibit performance oscillations or declines at high missing-modality rates.
We introduce UMAEA, a robust multi-modal entity alignment approach designed to tackle uncertainly missing and ambiguous visual modalities.
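One simple way to tolerate a missing visual modality, loosely following UMAEA's motivation, is to fuse modalities with learned weights and drop absent views rather than imputing them. A sketch; the architecture below is an assumption, not the paper's model.

```python
# Sketch: weighted fusion of per-modality entity embeddings that
# renormalizes over whichever views are actually present.
import torch
import torch.nn as nn

class RobustFusion(nn.Module):
    def __init__(self, dim: int, n_modalities: int = 3):
        super().__init__()
        self.logit_w = nn.Parameter(torch.zeros(n_modalities))

    def forward(self, views: list) -> torch.Tensor:
        """views: per-modality (B, dim) embeddings; None if missing."""
        present = [i for i, v in enumerate(views) if v is not None]
        w = torch.softmax(self.logit_w[present], dim=0)
        return sum(w[j] * views[i] for j, i in enumerate(present))
```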
arXiv Detail & Related papers (2023-07-30T12:16:49Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
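Cluster-level pseudo-labelling can be sketched as: cluster self-supervised target features, give every member of a cluster the label the source model predicts most often inside it, then fine-tune on those smoothed labels. The hyperparameters below (e.g., 7 clusters, matching the basic expression classes) are illustrative assumptions.

```python
# Sketch: majority-vote pseudo-labels at the cluster level smooth
# out per-sample prediction noise before fine-tuning.
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(target_feats: np.ndarray,
                          source_preds: np.ndarray,
                          n_clusters: int = 7) -> np.ndarray:
    """target_feats: (N, D) self-supervised features;
    source_preds: (N,) integer labels from the source model."""
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(target_feats)
    pseudo = np.empty_like(source_preds)
    for k in range(n_clusters):
        idx = np.where(assign == k)[0]
        # majority vote inside the cluster
        pseudo[idx] = np.bincount(source_preds[idx]).argmax()
    return pseudo
```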
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
- Visual Identification of Problematic Bias in Large Label Spaces [5.841861400363261]
A key challenge in scaling common fairness metrics to modern models and datasets is the requirement of exhaustive ground-truth labeling.
Domain experts need to be able to extract and reason about bias throughout models and datasets to make informed decisions.
We propose guidelines for designing visualizations for such large label spaces, considering both technical and ethical issues.
arXiv Detail & Related papers (2022-01-17T12:51:08Z)
- Uncertainty-Aware Multi-View Representation Learning [53.06828186507994]
We devise a novel unsupervised multi-view learning approach, termed Dynamic Uncertainty-Aware Networks (DUA-Nets).
Guided by the uncertainty of data estimated from the generation perspective, intrinsic information from multiple views is integrated to obtain noise-free representations.
Our model achieves superior performance in extensive experiments and shows the robustness to noisy data.
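Uncertainty-guided integration can be sketched by weighting each view inversely to its estimated noise. The precision parameterization below is an assumption, not DUA-Nets' exact formulation.

```python
# Sketch: fuse aligned view embeddings with weights proportional
# to each view's estimated precision (inverse noise variance).
import torch
import torch.nn as nn

class UncertaintyFusion(nn.Module):
    def __init__(self, n_views: int):
        super().__init__()
        # per-view log-variance, assumed learned from a
        # reconstruction-style objective elsewhere in training
        self.log_var = nn.Parameter(torch.zeros(n_views))

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        """views: (n_views, B, D) aligned view embeddings."""
        precision = torch.exp(-self.log_var)   # 1 / sigma^2
        w = precision / precision.sum()        # normalized weights
        return torch.einsum("v,vbd->bd", w, views)
```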
arXiv Detail & Related papers (2022-01-15T07:16:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.