VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models
- URL: http://arxiv.org/abs/2309.16211v2
- Date: Mon, 1 Apr 2024 02:49:49 GMT
- Title: VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models
- Authors: Zihao Zhu, Mingda Zhang, Shaokui Wei, Bingzhe Wu, Baoyuan Wu
- Abstract summary: In the real world, datasets may contain dirty samples, such as poisoned samples from backdoor attacks, noisy labels from crowdsourcing, and even hybrids of them.
Existing detectors focus only on detecting poisoned samples or noisy labels, and often generalize poorly when dealing with dirty samples from other domains.
We propose the versatile data cleanser (VDC), which leverages the surpassing capabilities of multimodal large language models (MLLMs) in cross-modal alignment and reasoning.
- Score: 46.72546879204724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The role of data in building AI systems has recently been emphasized by the emerging concept of data-centric AI. Unfortunately, in the real world, datasets may contain dirty samples, such as poisoned samples from backdoor attacks, noisy labels from crowdsourcing, and even hybrids of them. The presence of such dirty samples makes DNNs vulnerable and unreliable. Hence, it is critical to detect dirty samples to improve the quality and reliability of datasets. Existing detectors focus only on detecting poisoned samples or noisy labels, and often generalize poorly when dealing with dirty samples from other domains. In this paper, we find that a commonality of various dirty samples is visual-linguistic inconsistency between images and their associated labels. To capture the semantic inconsistency between modalities, we propose the versatile data cleanser (VDC), which leverages the surpassing capabilities of multimodal large language models (MLLMs) in cross-modal alignment and reasoning. It consists of three consecutive modules: the visual question generation module, which generates insightful questions about the image; the visual question answering module, which acquires the semantics of the visual content by answering the questions with an MLLM; and the visual answer evaluation module, which evaluates the inconsistency. Extensive experiments demonstrate its superior performance and generalization to various categories and types of dirty samples. The code is available at https://github.com/zihao-ai/vdc.
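The three-module flow described in the abstract (question generation, question answering, answer evaluation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the question templates, the yes/no matching rule, the `threshold` parameter, and the `toy_mllm` stub are all assumptions introduced here for illustration. A real system would replace `toy_mllm` with a call to an actual multimodal model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Hypothetical type: any callable mapping (image, question) -> answer text.
MLLM = Callable[[Any, str], str]


@dataclass
class Sample:
    image: Any   # opaque image handle; a dict in the toy example below
    label: str   # the (possibly wrong) label attached to the image


def generate_questions(label: str) -> List[str]:
    """Stage 1, visual question generation (assumed templates).

    Every question is phrased so that "yes" means the image agrees with the label.
    """
    return [
        f"Is the object in the image a {label}?",
        f"Does the image show a {label}?",
    ]


def answer_questions(image: Any, questions: List[str], mllm: MLLM) -> List[str]:
    """Stage 2, visual question answering: ask the model each question."""
    return [mllm(image, q) for q in questions]


def evaluate_answers(answers: List[str]) -> float:
    """Stage 3, visual answer evaluation: fraction of answers that start with "yes"."""
    if not answers:
        raise ValueError("at least one answer is required")
    matches = sum(a.strip().lower().startswith("yes") for a in answers)
    return matches / len(answers)


def is_dirty(sample: Sample, mllm: MLLM, threshold: float = 0.5) -> bool:
    """Flag a sample when its consistency score falls below the assumed threshold."""
    questions = generate_questions(sample.label)
    answers = answer_questions(sample.image, questions, mllm)
    return evaluate_answers(answers) < threshold


def toy_mllm(image: Any, question: str) -> str:
    """Stand-in for a real MLLM: answers "yes" if the image's content word is in the question."""
    return "yes" if image["content"] in question.lower() else "no"


clean = Sample(image={"content": "cat"}, label="cat")
mislabeled = Sample(image={"content": "cat"}, label="dog")
print(is_dirty(clean, toy_mllm), is_dirty(mislabeled, toy_mllm))
```

With the stub above, the cat image labeled "cat" passes and the cat image labeled "dog" is flagged. The same check applies whether the mismatch comes from a noisy label or a poisoned sample, which is the commonality the abstract points to.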
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- CableInspect-AD: An Expert-Annotated Anomaly Detection Dataset [14.246172794156987]
CableInspect-AD is a high-quality dataset created and annotated by domain experts from Hydro-Québec, a Canadian public utility.
This dataset includes high-resolution images with challenging real-world anomalies, covering defects with varying severity levels.
We present a comprehensive evaluation protocol based on cross-validation to assess models' performances.
arXiv Detail & Related papers (2024-09-30T14:50:13Z)
- MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs [38.93090238335506]
Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe pitfall in deep learning models trained on single modality data.
We introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations.
Our findings show that these models persistently rely on spurious correlations and underscore the urgent need for new methodologies to mitigate spurious biases.
arXiv Detail & Related papers (2024-06-24T20:29:16Z)
- Dual-View Data Hallucination with Semantic Relation Guidance for Few-Shot Image Recognition [49.26065739704278]
We propose a framework that exploits semantic relations to guide dual-view data hallucination for few-shot image recognition.
An instance-view data hallucination module hallucinates each sample of a novel class to generate new data.
A prototype-view data hallucination module exploits a semantic-aware measure to estimate the prototype of a novel class.
arXiv Detail & Related papers (2024-01-13T12:32:29Z)
- Deep Incomplete Multi-view Clustering with Cross-view Partial Sample and Prototype Alignment [50.82982601256481]
We propose a Cross-view Partial Sample and Prototype Alignment Network (CPSPAN) for Deep Incomplete Multi-view Clustering.
Unlike existing contrastive-based methods, we adopt pair-observed data alignment as 'proxy supervised signals' to guide instance-to-instance correspondence construction.
arXiv Detail & Related papers (2023-03-28T02:31:57Z)
- Uncertainty-Aware Multi-View Representation Learning [53.06828186507994]
We devise a novel unsupervised multi-view learning approach, termed Dynamic Uncertainty-Aware Networks (DUA-Nets).
Guided by the uncertainty of data estimated from the generation perspective, intrinsic information from multiple views is integrated to obtain noise-free representations.
Our model achieves superior performance in extensive experiments and shows robustness to noisy data.
arXiv Detail & Related papers (2022-01-15T07:16:20Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, that is, it produces high-quality results in a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.