VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
- URL: http://arxiv.org/abs/2602.04587v1
- Date: Wed, 04 Feb 2026 14:12:55 GMT
- Title: VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
- Authors: Jaeyoon Jung, Yejun Yoon, Seunghyun Yoon, Kunwoo Park
- Abstract summary: VILLAIN is a multimodal fact-checking system that verifies image-text claims. Our system ranked first on the leaderboard across all evaluation metrics.
- Score: 10.712719361607753
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.
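The abstract describes a staged pipeline: evidence retrieval, modality-specific and cross-modal analysis, question-answer generation, and verdict prediction. The sketch below is a minimal illustration of how such an orchestration could be wired together; it is not the released VILLAIN code (see the linked repository for that). The `call_vlm` helper, all prompts, and all function names are hypothetical placeholders; only the four-stage structure comes from the abstract.

```python
# Minimal sketch of a staged multi-agent fact-checking pipeline, following
# the stages described in the abstract. NOT the released VILLAIN code
# (see https://github.com/ssu-humane/VILLAIN); call_vlm and all prompts
# are hypothetical placeholders for a prompt-based VLM agent API.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    image_path: str

def call_vlm(prompt: str, image_path: str | None = None) -> str:
    """Placeholder for a call to a vision-language model agent."""
    raise NotImplementedError

def retrieve_evidence(claim: Claim) -> dict:
    # Stage 1: retrieve textual and visual evidence from the knowledge
    # store, enriched through additional web collection.
    return {"text": ["...retrieved passages..."], "images": ["...paths..."]}

def analyze(claim: Claim, evidence: dict) -> dict:
    # Stage 2: modality-specific and cross-modal agents write analysis
    # reports that surface key information and flag inconsistencies.
    text_report = call_vlm(f"Analyze textual evidence for: {claim.text}\n{evidence['text']}")
    visual_report = call_vlm(f"Analyze visual evidence for: {claim.text}", claim.image_path)
    cross_report = call_vlm(f"Reconcile the reports:\n{text_report}\n{visual_report}")
    return {"text": text_report, "visual": visual_report, "cross": cross_report}

def generate_qa_pairs(claim: Claim, reports: dict) -> list[tuple[str, str]]:
    # Stage 3: produce question-answer pairs based on the analysis reports.
    raw = call_vlm(f"Write tab-separated QA pairs that check the claim "
                   f"'{claim.text}' using these reports:\n{reports}")
    return [tuple(line.split("\t", 1)) for line in raw.splitlines() if "\t" in line]

def predict_verdict(claim: Claim, qa_pairs: list[tuple[str, str]]) -> str:
    # Stage 4: the Verdict Prediction agent maps the image-text claim plus
    # the generated QA pairs to a final verification outcome.
    return call_vlm(f"Claim: {claim.text}\nQA pairs: {qa_pairs}\nVerdict:",
                    claim.image_path)

def verify(claim: Claim) -> str:
    evidence = retrieve_evidence(claim)
    reports = analyze(claim, evidence)
    qa_pairs = generate_qa_pairs(claim, reports)
    return predict_verdict(claim, qa_pairs)
```

Under this reading, each stage is a separate prompt-based agent call and only the QA pairs (not the raw evidence) reach the final verdict agent, which matches the staged hand-off the abstract describes.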
Related papers
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding. We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
- MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering [44.41273615523289]
We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions of informativeness, groundedness, and fluency, and demonstrate their strong correlation with human judgments.
arXiv Detail & Related papers (2025-11-15T10:14:59Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
- CMRAG: Co-modality-based visual document retrieval and question answering [21.016544020685668]
The Co-Modality-based RAG (CMRAG) framework can leverage texts and images for more accurate retrieval and generation. Our framework consistently outperforms single-modality-based RAG in multiple visual document question-answering (VDQA) benchmarks.
arXiv Detail & Related papers (2025-09-02T09:17:57Z)
- Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering [60.062194349648195]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z)
- MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding [40.52017994491893]
MDocAgent is a novel RAG and multi-agent framework that leverages both text and images. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. Preliminary experiments on five benchmarks demonstrate the effectiveness of MDocAgent, achieving an average improvement of 12.1%.
arXiv Detail & Related papers (2025-03-18T06:57:21Z)
- VISA: Retrieval Augmented Generation with Visual Source Attribution [100.78278689901593]
Existing approaches in RAG primarily link generated content to document-level references. We propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain.
arXiv Detail & Related papers (2024-12-19T02:17:35Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- IMCI: Integrate Multi-view Contextual Information for Fact Extraction and Verification [19.764122035213067]
We propose to integrate multi-view contextual information (IMCI) for fact extraction and verification.
Our experimental results on the FEVER 1.0 shared task show that our IMCI framework achieves substantial improvements on both fact extraction and verification.
arXiv Detail & Related papers (2022-08-30T05:57:34Z)