Solution for Meta KDD Cup'25: A Comprehensive Three-Step Framework for Vision Question Answering
- URL: http://arxiv.org/abs/2507.21520v1
- Date: Tue, 29 Jul 2025 06:07:59 GMT
- Title: Solution for Meta KDD Cup'25: A Comprehensive Three-Step Framework for Vision Question Answering
- Authors: Zijian Zhang, Xiaocheng Zhang, Yang Zhou, Zhimin Lin, Peng Yan
- Abstract summary: This paper describes the solutions to all tasks in Meta KDD Cup'25 from the BlackPearl team. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and fine-tuning. Our solution achieves automatic evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and wins second place in Task 3 after human evaluation.
- Score: 7.481274094559558
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision Large Language Models (VLLMs) have improved multi-modal understanding and visual question answering (VQA), but still suffer from hallucinated answers. Multi-modal Retrieval-Augmented Generation (RAG) helps address these issues by incorporating external information, yet challenges remain in visual context comprehension, multi-source retrieval, and multi-turn interactions. To address these challenges, Meta constructed the CRAG-MM benchmark and launched the CRAG-MM Challenge at KDD Cup 2025, which consists of three tasks. This paper describes the solutions to all tasks in Meta KDD Cup'25 from the BlackPearl team. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and multi-task fine-tuning. Our solution achieves automatic evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and wins second place in Task 3 after human evaluation.
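As a rough illustration of the pipeline named in the abstract (retrieval, reranking, and answer generation with a fine-tuned VLLM), the following Python sketch shows how such a retrieve-rerank-generate flow can be wired together. This is not the BlackPearl team's code: the component callables (search, cross_score, vllm_generate) and the abstention threshold are placeholders assumed here for illustration, and the "I don't know" fallback is a common CRAG-style tactic rather than a detail taken from this abstract.
```python
# Illustrative sketch of a retrieve -> rerank -> generate pipeline for
# multi-modal RAG VQA. All component callables and thresholds are
# placeholders, not the BlackPearl team's actual implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source: str   # e.g. "image-kg" or "web-search"
    text: str
    score: float = 0.0


def retrieve(question: str, image_caption: str,
             search: Callable[[str, int], List[Passage]],
             k: int = 20) -> List[Passage]:
    """Step 1: recall candidate evidence from the retrieval source(s)."""
    query = f"{image_caption} {question}".strip()
    return search(query, k)


def rerank(question: str, passages: List[Passage],
           cross_score: Callable[[str, str], float],
           top_n: int = 5) -> List[Passage]:
    """Step 2: re-score candidates with a stronger (cross-encoder style) scorer."""
    for p in passages:
        p.score = cross_score(question, p.text)
    return sorted(passages, key=lambda p: p.score, reverse=True)[:top_n]


def generate(question: str, image_caption: str, evidence: List[Passage],
             vllm_generate: Callable[[str], str]) -> str:
    """Step 3: answer with a fine-tuned VLLM, abstaining when evidence is weak
    (abstention threshold of 0.3 is an illustrative assumption)."""
    if not evidence or max(p.score for p in evidence) < 0.3:
        return "I don't know."
    context = "\n".join(f"[{p.source}] {p.text}" for p in evidence)
    prompt = (f"Image: {image_caption}\nContext:\n{context}\n"
              f"Question: {question}\nAnswer concisely using only the context.")
    return vllm_generate(prompt)
```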
Related papers
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training.
VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion.
This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
- Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning [3.588567067449924]
We present a Collaborative Agent-Based Framework for Multi-Image Reasoning.
Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats.
We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge.
arXiv Detail & Related papers (2025-08-01T06:39:15Z)
- Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG [3.9063541371093184]
This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM).
The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate.
Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification (see the sketch after this list).
arXiv Detail & Related papers (2025-07-27T05:45:45Z)
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
Lens is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios.
The dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning.
We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o, and two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z)
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs).
The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking.
To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [92.5712549836791]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models [27.45225442048711]
We introduce CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks.
We also present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm.
arXiv Detail & Related papers (2024-10-21T16:30:29Z)
- Winning Solution For Meta KDD Cup' 24 [6.471894753117029]
This paper describes the winning solutions to all tasks in Meta KDD Cup' 24 from the db3 team.
The challenge is to build a RAG system from web sources and knowledge graphs.
Our solution achieves 1st place on all three tasks, with scores of 28.4%, 42.7%, and 47.8%, respectively.
arXiv Detail & Related papers (2024-09-13T06:10:42Z)
- MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering [0.43512163406552007]
We present a multi-adapter retrieval augmented generation system (MARAGS) for Meta's Comprehensive RAG (CRAG) competition for KDD CUP 2024.
Our system achieved 2nd place for Task 1 as well as 3rd place on Task 2.
arXiv Detail & Related papers (2024-09-05T01:58:29Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- Multi-Perspective Abstractive Answer Summarization [76.10437565615138]
Community Question Answering forums are a rich resource of answers to a wide range of questions.
The goal of multi-perspective answer summarization is to produce a summary that includes all perspectives of the answer.
This work introduces a novel dataset creation method to automatically create multi-perspective, bullet-point abstractive summaries.
arXiv Detail & Related papers (2021-04-17T13:15:29Z)
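For the multi-stage, verification-centric flow summarized in the CRUISE entry above (lightweight query router, query-aware retrieval and summarization, dual-pathway generation, post-hoc verification), the minimal Python sketch below shows one way such a flow can be composed. The function signatures, the draft ordering, and the "I don't know" fallback are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a verification-centric multi-modal RAG flow:
# query router -> retrieval & summarization -> dual-pathway generation ->
# post-hoc verification. All callables and the fallback are assumptions.
from typing import Callable


def answer_with_verification(question: str,
                             needs_retrieval: Callable[[str], bool],
                             retrieve_and_summarize: Callable[[str], str],
                             generate: Callable[[str, str], str],
                             is_supported: Callable[[str, str, str], bool]) -> str:
    # 1. Lightweight query router: skip retrieval for questions the model
    #    can answer directly, saving latency.
    context = retrieve_and_summarize(question) if needs_retrieval(question) else ""

    # 2. Dual-pathway generation: one draft grounded in retrieved context,
    #    one from the model's parametric knowledge alone.
    drafts = [generate(question, context), generate(question, "")]

    # 3. Post-hoc verification: return the first draft judged consistent
    #    with the question and context; otherwise abstain.
    for draft in drafts:
        if is_supported(question, context, draft):
            return draft
    return "I don't know."
```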