Solution for Meta KDD Cup'25: A Comprehensive Three-Step Framework for Vision Question Answering
- URL: http://arxiv.org/abs/2507.21520v1
- Date: Tue, 29 Jul 2025 06:07:59 GMT
- Title: Solution for Meta KDD Cup'25: A Comprehensive Three-Step Framework for Vision Question Answering
- Authors: Zijian Zhang, Xiaocheng Zhang, Yang Zhou, Zhimin Lin, Peng Yan
- Abstract summary: This paper describes the solutions to all tasks in Meta KDD Cup'25 from the BlackPearl team. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and fine-tuning. Our solution achieves automatic evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and wins second place in Task 3 after human evaluation.
- Score: 7.481274094559558
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision Large Language Models (VLLMs) have improved multi-modal understanding and visual question answering (VQA), but still suffer from hallucinated answers. Multi-modal Retrieval-Augmented Generation (RAG) helps address these issues by incorporating external information, yet challenges remain in visual context comprehension, multi-source retrieval, and multi-turn interactions. To address these challenges, Meta constructed the CRAG-MM benchmark and launched the CRAG-MM Challenge at KDD Cup 2025, which consists of three tasks. This paper describes the solutions to all tasks in Meta KDD Cup'25 from the BlackPearl team. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and multi-task fine-tuning. Our solution achieves automatic evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and wins second place in Task 3 after human evaluation.
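As a rough illustration of the pipeline named in the abstract (retrieval, reranking, and answer generation with a fine-tuned VLLM), the following Python sketch shows how such a retrieve-rerank-generate flow can be wired together. This is not the BlackPearl team's code: the component callables (search, cross_score, vllm_generate) and the abstention threshold are placeholders assumed here for illustration, and the "I don't know" fallback is a common CRAG-style tactic rather than a detail taken from this abstract.
```python
# Illustrative sketch of a retrieve -> rerank -> generate pipeline for
# multi-modal RAG VQA. All component callables and thresholds are
# placeholders, not the BlackPearl team's actual implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source: str   # e.g. "image-kg" or "web-search"
    text: str
    score: float = 0.0


def retrieve(question: str, image_caption: str,
             search: Callable[[str, int], List[Passage]],
             k: int = 20) -> List[Passage]:
    """Step 1: recall candidate evidence from the retrieval source(s)."""
    query = f"{image_caption} {question}".strip()
    return search(query, k)


def rerank(question: str, passages: List[Passage],
           cross_score: Callable[[str, str], float],
           top_n: int = 5) -> List[Passage]:
    """Step 2: re-score candidates with a stronger (cross-encoder style) scorer."""
    for p in passages:
        p.score = cross_score(question, p.text)
    return sorted(passages, key=lambda p: p.score, reverse=True)[:top_n]


def generate(question: str, image_caption: str, evidence: List[Passage],
             vllm_generate: Callable[[str], str]) -> str:
    """Step 3: answer with a fine-tuned VLLM, abstaining when evidence is weak
    (abstention threshold of 0.3 is an illustrative assumption)."""
    if not evidence or max(p.score for p in evidence) < 0.3:
        return "I don't know."
    context = "\n".join(f"[{p.source}] {p.text}" for p in evidence)
    prompt = (f"Image: {image_caption}\nContext:\n{context}\n"
              f"Question: {question}\nAnswer concisely using only the context.")
    return vllm_generate(prompt)
```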
Related papers
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training.
VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion.
This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
- Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning [3.588567067449924]
We present a Collaborative Agent-Based Framework for Multi-Image Reasoning.
Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats.
We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge.
arXiv Detail & Related papers (2025-08-01T06:39:15Z)
- Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG [3.9063541371093184]
This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM).
The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate.
Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification (see the sketch after this list).
arXiv Detail & Related papers (2025-07-27T05:45:45Z)
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
Lens is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios.
The dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning.
We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o, and two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z)
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs).
The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking.
To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [92.5712549836791]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models [27.45225442048711]
We introduce CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks.
We also present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm.
arXiv Detail & Related papers (2024-10-21T16:30:29Z)
- Winning Solution For Meta KDD Cup' 24 [6.471894753117029]
This paper describes the winning solutions to all tasks in Meta KDD Cup' 24 from the db3 team.
The challenge is to build a RAG system from web sources and knowledge graphs.
Our solution achieves 1st place on all three tasks, with scores of 28.4%, 42.7%, and 47.8%, respectively.
arXiv Detail & Related papers (2024-09-13T06:10:42Z)
- MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering [0.43512163406552007]
We present a multi-adapter retrieval augmented generation system (MARAGS) for Meta's Comprehensive RAG (CRAG) competition for KDD CUP 2024.
Our system achieved 2nd place for Task 1 as well as 3rd place on Task 2.
arXiv Detail & Related papers (2024-09-05T01:58:29Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- Multi-Perspective Abstractive Answer Summarization [76.10437565615138]
Community Question Answering forums are a rich resource of answers to a wide range of questions.
The goal of multi-perspective answer summarization is to produce a summary that includes all perspectives of the answer.
This work introduces a novel dataset creation method to automatically create multi-perspective, bullet-point abstractive summaries.
arXiv Detail & Related papers (2021-04-17T13:15:29Z)
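For the multi-stage, verification-centric flow summarized in the CRUISE entry above (lightweight query router, query-aware retrieval and summarization, dual-pathway generation, post-hoc verification), the minimal Python sketch below shows one way such a flow can be composed. The function signatures, the draft ordering, and the "I don't know" fallback are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a verification-centric multi-modal RAG flow:
# query router -> retrieval & summarization -> dual-pathway generation ->
# post-hoc verification. All callables and the fallback are assumptions.
from typing import Callable


def answer_with_verification(question: str,
                             needs_retrieval: Callable[[str], bool],
                             retrieve_and_summarize: Callable[[str], str],
                             generate: Callable[[str, str], str],
                             is_supported: Callable[[str, str, str], bool]) -> str:
    # 1. Lightweight query router: skip retrieval for questions the model
    #    can answer directly, saving latency.
    context = retrieve_and_summarize(question) if needs_retrieval(question) else ""

    # 2. Dual-pathway generation: one draft grounded in retrieved context,
    #    one from the model's parametric knowledge alone.
    drafts = [generate(question, context), generate(question, "")]

    # 3. Post-hoc verification: return the first draft judged consistent
    #    with the question and context; otherwise abstain.
    for draft in drafts:
        if is_supported(question, context, draft):
            return draft
    return "I don't know."
```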