How do LLMs Support Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation
- URL: http://arxiv.org/abs/2404.13945v2
- Date: Sun, 5 May 2024 16:40:20 GMT
- Title: How do LLMs Support Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation
- Authors: Liwen Wang, Yuanyuan Yuan, Ao Sun, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
- Abstract summary: Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving.
Evaluating the reliability of VDL through software testing requires diverse and controllable mutations over image semantics.
The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary image mutation potentials through instruction-driven methods.
- Score: 23.18635769949329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse and controllable mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary image mutation potentials through instruction-driven methods. Users can now freely describe desired mutations and let MLLMs generate the mutated images. However, the quality of MLLM-produced test inputs in VDL testing remains largely unexplored. We present the first study, aiming to assess MLLMs' adequacy from 1) the semantic validity of MLLM-mutated images, 2) the alignment of MLLM-mutated images with their text instructions (prompts), 3) the faithfulness with which different mutations preserve semantics that ought to remain unchanged, and 4) the effectiveness of detecting VDL faults. With large-scale human studies and quantitative evaluations, we identify MLLMs' promising potential in expanding the covered semantics of image mutations. Notably, while SoTA MLLMs (e.g., GPT-4V) fail to support or perform worse on editing existing semantics in images (as in traditional mutations like rotation), they generate high-quality test inputs using "semantic-additive" mutations (e.g., "dress a dog with clothes"), which bring extra semantics to images; these were infeasible for past approaches. Hence, we view MLLM-based mutations as a vital complement to traditional mutations, and advocate that future VDL testing combine MLLM-based methods and traditional image mutations for comprehensive and reliable testing.
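The testing workflow the abstract describes (instruction-driven mutation of a seed image, followed by a consistency check on the VDL model under test) can be illustrated with a minimal sketch. The sketch below assumes a hypothetical `mllm_mutate` helper standing in for any instruction-following image-editing MLLM (e.g., a GPT-4V-class editor); it is not the paper's tooling. The oracle is a simple metamorphic check: a label-preserving, "semantic-additive" mutation that flips an off-the-shelf classifier's prediction is flagged as a candidate fault.

```python
# Minimal sketch (not the paper's artifact) of MLLM-driven mutation testing
# for a visual deep learning (VDL) model.
from PIL import Image
import torch
from torchvision import models


def mllm_mutate(image: Image.Image, instruction: str) -> Image.Image:
    """Hypothetical placeholder: send `image` plus a natural-language
    `instruction` to an instruction-following image-editing MLLM and
    return the mutated image. Here it simply returns a copy."""
    return image.copy()


# VDL model under test: an off-the-shelf ImageNet classifier.
weights = models.ResNet50_Weights.DEFAULT
vdl_model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()


def predict(image: Image.Image) -> int:
    """Top-1 class index predicted by the VDL model under test."""
    with torch.no_grad():
        logits = vdl_model(preprocess(image).unsqueeze(0))
    return int(logits.argmax(dim=1))


# A "semantic-additive" mutation: extra semantics are added, but the
# label of the original object ought to be preserved.
instruction = "dress the dog with a small red coat; keep everything else unchanged"

seed = Image.open("dog.jpg").convert("RGB")  # assumed seed test input
mutated = mllm_mutate(seed, instruction)

# Metamorphic oracle: a prediction flip under a label-preserving mutation
# is a candidate VDL fault (still subject to the validity, alignment, and
# faithfulness checks the study performs).
if predict(seed) != predict(mutated):
    print("Potential fault: prediction changed under a label-preserving mutation")
```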
Related papers
- Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering [0.0]
We propose a method to identify neurons associated with specific knowledge using MiniGPT-4, a Transformer-based MLLM.
Experiments on the image caption generation task showed that our method is able to locate knowledge with higher accuracy than existing methods.
arXiv Detail & Related papers (2025-03-29T02:16:15Z) - SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Modeling Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools.
HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens.
HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations.
We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions.
Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z) - Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models [36.81503322875839]
Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering.
This paper investigates representative MLLMs, focusing on their calibration across various scenarios.
We observed miscalibration in their performance, yet no significant differences in calibration across these scenarios.
arXiv Detail & Related papers (2024-12-19T09:10:07Z) - Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z) - VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects.
Existing ZSAD methods are limited by closed-world settings, struggling to handle unseen defects with predefined prompts.
We propose a novel framework VMAD (Visual-enhanced MLLM Anomaly Detection) that enhances MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z) - Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation [63.064204206220936]
Foundational Large Language Models (LLMs) have changed the way we perceive technology.
They have been shown to excel in tasks ranging from poem writing to coding to essay generation and puzzle solving.
With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools.
Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content.
arXiv Detail & Related papers (2024-08-27T14:40:16Z) - Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z) - A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment [46.55045595936298]
Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning.
Their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored.
arXiv Detail & Related papers (2024-03-16T08:30:45Z) - Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models [84.78513908768011]
We propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA).
MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway.
To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR.
arXiv Detail & Related papers (2024-03-05T14:31:24Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z) - LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z) - Investigating the Catastrophic Forgetting in Multimodal Large Language Models [43.89009178021342]
We introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs.
Almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks.
As fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability.
arXiv Detail & Related papers (2023-09-19T04:51:13Z)