SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context
Misinformation Detection
- URL: http://arxiv.org/abs/2403.03170v1
- Date: Tue, 5 Mar 2024 18:04:59 GMT
- Title: SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context
Misinformation Detection
- Authors: Peng Qi, Zehong Yan, Wynne Hsu, Mong Li Lee
- Abstract summary: Out-of-context (OOC) misinformation is one of the easiest and most effective ways to mislead audiences.
Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments.
We introduce SNIFFER, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Misinformation is a prevalent societal issue due to its potential high risks.
Out-of-context (OOC) misinformation, where authentic images are repurposed with
false text, is one of the easiest and most effective ways to mislead audiences.
Current methods focus on assessing image-text consistency but lack convincing
explanations for their judgments, which is essential for debunking
misinformation. While Multimodal Large Language Models (MLLMs) have rich
knowledge and innate capability for visual reasoning and explanation
generation, they still lack sophistication in understanding and discovering the
subtle cross-modal differences. In this paper, we introduce SNIFFER, a novel
multimodal large language model specifically engineered for OOC misinformation
detection and explanation. SNIFFER employs two-stage instruction tuning on
InstructBLIP. The first stage refines the model's concept alignment of generic
objects with news-domain entities and the second stage leverages language-only
GPT-4 generated OOC-specific instruction data to fine-tune the model's
discriminatory powers. Enhanced by external tools and retrieval, SNIFFER not
only detects inconsistencies between text and image but also utilizes external
knowledge for contextual verification. Our experiments show that SNIFFER
surpasses the original MLLM by over 40% and outperforms state-of-the-art
methods in detection accuracy. SNIFFER also provides accurate and persuasive
explanations as validated by quantitative and human evaluations.
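To make the pipeline concrete, here is a minimal sketch of how such a two-branch check could be wired together. This is an illustration under assumptions, not the authors' implementation: the `mllm` and `retriever` interfaces and every function name below are hypothetical.

```python
# Sketch of a SNIFFER-style two-branch out-of-context (OOC) check.
# The `mllm` and `retriever` objects and their methods are assumed,
# illustrative interfaces, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class Judgment:
    ooc: bool          # True if the pair is flagged as out-of-context
    explanation: str   # natural-language rationale for debunking

def internal_check(mllm, image, caption) -> Judgment:
    """Branch 1: ask the instruction-tuned MLLM whether the caption's
    entities and events are consistent with the image content."""
    prompt = (
        f"News caption: {caption}\n"
        "Does this caption faithfully describe the image? "
        "Point out any mismatched entities, events, or locations."
    )
    return mllm.judge(image=image, prompt=prompt)

def external_check(mllm, retriever, image, caption) -> Judgment:
    """Branch 2: retrieve the image's original web context (e.g. via
    reverse image search) and check the caption against it."""
    evidence = retriever.search_by_image(image)
    prompt = (
        f"Retrieved context for this image: {evidence}\n"
        f"News caption: {caption}\n"
        "Is the caption consistent with the retrieved context?"
    )
    return mllm.judge(image=image, prompt=prompt)

def detect_ooc(mllm, retriever, image, caption) -> Judgment:
    """Composite reasoning: merge both verdicts into a final judgment."""
    internal = internal_check(mllm, image, caption)
    external = external_check(mllm, retriever, image, caption)
    ooc = internal.ooc or external.ooc
    explanation = f"{internal.explanation} {external.explanation}".strip()
    return Judgment(ooc=ooc, explanation=explanation)
```

The composite judgment mirrors the abstract's description: an image-text consistency branch plus an external-evidence branch, merged into a single labeled verdict with a natural-language explanation.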
Related papers
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
However, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- Improving Visual Commonsense in Language Models via Multiple Image Generation
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
- Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models
DiffExplainer is a novel framework that, leveraging language-vision models, enables multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- Interpretable Detection of Out-of-Context Misinformation with Neural-Symbolic-Enhanced Large Multimodal Model
Misinformation creators now increasingly tend to use out-of-context multimedia content to deceive the public and fake news detection systems.
This new type of misinformation increases the difficulty of not only detection but also clarification, because every individual modality is close enough to true information.
In this paper, we explore how to achieve interpretable cross-modal de-contextualization detection that simultaneously identifies the mismatched pairs and the cross-modal contradictions.
arXiv Detail & Related papers (2023-04-15T21:11:55Z)
- Context-faithful Prompting for Large Language Models
Large language models (LLMs) encode parametric knowledge about world facts.
Their reliance on parametric knowledge may cause them to overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
We assess and enhance LLMs' contextual faithfulness in two aspects: knowledge conflict and prediction with abstention.
arXiv Detail & Related papers (2023-03-20T17:54:58Z)
- Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection
We take advantage of language prompts to introduce effective and unbiased linguistic supervision into object detection.
We propose a new mechanism called multimodal knowledge learning (MKL) to learn knowledge from language supervision.
arXiv Detail & Related papers (2022-05-09T07:03:30Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model
Unlabeled data carry rich task-relevant information and are proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM; a generic self-training loop is sketched after this list.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
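As referenced in the SFLM entry above, here is a minimal sketch of the generic self-training loop that the entry revisits. The scikit-learn-style `model` interface and the confidence threshold are assumptions; SFLM's prompt-based fine-tuning and regularization details are not reproduced here.

```python
# Generic self-training loop: fit on labeled data, pseudo-label the
# unlabeled pool, keep only confident predictions, and repeat.
# The `model` interface (fit / predict_proba) is an assumption.

def self_train(model, labeled_x, labeled_y, unlabeled_x,
               threshold=0.9, rounds=3):
    pool_x, pool_y = list(labeled_x), list(labeled_y)
    remaining = list(unlabeled_x)
    for _ in range(rounds):
        model.fit(pool_x, pool_y)          # fine-tune on current pool
        still_unlabeled = []
        for x in remaining:
            probs = model.predict_proba([x])[0]
            label = max(range(len(probs)), key=probs.__getitem__)
            if probs[label] >= threshold:  # keep confident pseudo-labels
                pool_x.append(x)
                pool_y.append(label)
            else:
                still_unlabeled.append(x)
        remaining = still_unlabeled
    return model
```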