MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
- URL: http://arxiv.org/abs/2510.13276v1
- Date: Wed, 15 Oct 2025 08:22:03 GMT
- Title: MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
- Authors: Keyan Zhou, Zecheng Tang, Lingfeng Ming, Guanghao Zhou, Qiguang Chen, Dan Qiao, Zheming Yang, Libo Qin, Minghui Qiu, Juntao Li, Min Zhang
- Abstract summary: We introduce MMLongCite, a benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts.
- Score: 60.01080454274115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
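For concreteness, a hypothetical evaluation harness in the spirit of this length-and-position analysis is sketched below; the `query_lvlm` callable, the depth buckets, and the needle-recall task are illustrative assumptions, not the benchmark's actual design.

```python
# Hypothetical harness for measuring faithfulness vs. position of crucial
# content. `query_lvlm` is a stand-in for whatever LVLM API is under test.
import random

def build_context(filler_chunks, needle, depth):
    """Insert the needle chunk at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(depth * len(filler_chunks))
    return filler_chunks[:pos] + [needle] + filler_chunks[pos:]

def position_sensitivity(query_lvlm, filler_chunks, needle, question, answer,
                         depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=20):
    """Accuracy of recovering `answer` as a function of needle position."""
    scores = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            random.shuffle(filler_chunks)        # vary the distractor ordering
            ctx = build_context(filler_chunks, needle, depth)
            reply = query_lvlm(context=ctx, question=question)
            hits += int(answer.lower() in reply.lower())
        scores[depth] = hits / trials
    return scores
```

A flat accuracy curve across depths indicates position-robust faithfulness; a dip in the middle is the familiar "lost in the middle" failure mode.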
Related papers
- LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges? [35.43518917055024]
LooGLE v2 is a novel benchmark designed to evaluate large language models' long-context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, encompassing domains in law, finance, games, and code. Evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark.
arXiv Detail & Related papers (2025-10-26T06:14:19Z)
- InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation [57.310236384112834]
In-context learning (ICL) is critical for large language models (LLMs), but its effectiveness is constrained by finite context windows. We introduce InfiniteICL, a framework that parallels context and parameters in LLMs with short- and long-term memory. We demonstrate that our method reduces context length by 90% while achieving, on average, 103% of the performance of full-context prompting.
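The summary gives only the high-level idea; as a loose toy analogy (not InfiniteICL's actual algorithm), "context becomes parameters" can be pictured as folding context facts into a weight matrix and recalling them after the context is discarded:

```python
# Toy analogy for context-to-parameter consolidation (NOT the paper's method):
# context key-value pairs are written into a weight matrix via Hebbian
# outer-product updates, then recalled without the original context.
import numpy as np

d = 64
W = np.zeros((d, d))  # "long-term memory" parameters

def embed(token, dim=d):
    """Per-token random embedding, stable within a single run (illustrative)."""
    return np.random.default_rng(abs(hash(token)) % 2**32).standard_normal(dim)

facts = [("capital_of_france", "paris"), ("author_of_hamlet", "shakespeare")]

# Consolidation: fold each context fact into the parameters, drop the context.
for key, value in facts:
    W += np.outer(embed(value), embed(key)) / d

# Recall from parameters alone: the stored value nearest to W @ key wins.
recalled = W @ embed("capital_of_france")
best = max(facts, key=lambda kv: float(embed(kv[1]) @ recalled))
print(best[1])  # -> "paris"
```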
arXiv Detail & Related papers (2025-04-02T13:15:44Z)
- Thus Spake Long-Context Large Language Model [70.49178031298953]
Long context is an important topic in Natural Language Processing (NLP). It offers immense opportunities for Large Language Models (LLMs), giving them lifelong learning potential akin to that of humans. In the past two years, the context length of LLMs has achieved a breakthrough extension to millions of tokens. Research on long-context LLMs has expanded from length extrapolation to a comprehensive focus on architecture, infrastructure, training, and evaluation technologies.
arXiv Detail & Related papers (2025-02-24T13:19:33Z)
- Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models [62.698520962933195]
Large Vision-Language Models (LVLMs) excel in cross-modal tasks but experience performance declines in long-context reasoning. We propose a novel training-free context pruning method that selectively removes less critical textual information.
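The summary does not specify the pruning criterion; a generic sketch of training-free pruning (the visual-similarity score below is an assumption, not the paper's exact rule) would keep only the text tokens most aligned with the image features:

```python
# Generic training-free pruning sketch (scoring rule is an assumption):
# drop the text tokens least aligned with the pooled visual representation.
import numpy as np

def prune_text_tokens(text_emb, image_emb, keep_ratio=0.5):
    """text_emb: (T, d) token embeddings; image_emb: (N, d) visual features."""
    visual = image_emb.mean(axis=0)                       # pooled image vector
    sims = text_emb @ visual / (
        np.linalg.norm(text_emb, axis=1) * np.linalg.norm(visual) + 1e-8)
    keep = max(1, int(keep_ratio * len(text_emb)))
    idx = np.argsort(sims)[-keep:]                        # most visually grounded
    return np.sort(idx)                                   # preserve original order

tokens = np.random.randn(12, 8)
image = np.random.randn(4, 8)
print(prune_text_tokens(tokens, image, keep_ratio=0.25))  # indices of kept tokens
```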
arXiv Detail & Related papers (2024-10-25T17:59:09Z)
- FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding [32.197113821638936]
We propose a novel integrated Long-Context Large Language Model (FltLM).
FltLM incorporates a context filter with a soft mask mechanism, identifying and dynamically excluding irrelevant content to concentrate on pertinent information.
Experimental results demonstrate that FltLM significantly outperforms supervised fine-tuning and retrieval-based methods in complex QA scenarios.
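As a rough illustration of a soft mask over context (the keyword-overlap relevance score below is a stand-in for FltLM's learned filter), each chunk can be weighted in (0, 1) rather than hard-deleted:

```python
# Minimal soft-mask sketch: chunks get sigmoid weights in (0, 1), softly
# suppressing irrelevant context instead of hard-dropping it. The relevance
# score here is a stand-in; FltLM learns its filter instead.
import numpy as np

def soft_mask(scores, temperature=1.0, bias=0.0):
    """Map raw relevance scores to soft weights in (0, 1)."""
    s = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(s - bias) / temperature))

def filter_context(chunks, query_terms):
    scores = [sum(t in c.lower() for t in query_terms) for c in chunks]
    return list(zip(chunks, soft_mask(scores, temperature=0.5, bias=0.5)))

chunks = ["The treaty was signed in 1848.",
          "The weather was mild that year.",
          "Its third article fixed the border."]
for chunk, w in filter_context(chunks, ["treaty", "article", "border"]):
    print(f"{w:.2f}  {chunk}")   # off-topic chunk gets a low weight
```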
arXiv Detail & Related papers (2024-10-09T13:47:50Z)
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [86.93099925711388]
We propose DetectiveQA, a dataset specifically designed for narrative reasoning within long contexts. We leverage detective novels, averaging over 100k tokens, to create a dataset containing 1200 human-annotated questions in both Chinese and English.
arXiv Detail & Related papers (2024-09-04T06:28:22Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows.
We propose Loong, a novel long-context benchmark aligned with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
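As an illustration of the setup (field names and formatting are my own, not Loong's release format), an extended multi-doc QA item can be assembled so that every document is load-bearing evidence:

```python
# Illustrative multi-document QA item in the "no document left behind" spirit:
# answering requires comparing evidence spread across all documents.
def build_multidoc_item(docs, question):
    context = "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(docs))
    return {"context": context, "question": question, "n_docs": len(docs)}

docs = ["Report A: revenue grew 12% in 2023.",
        "Report B: revenue grew 9% in 2023.",
        "Report C: revenue grew 15% in 2023."]
item = build_multidoc_item(docs, "Which report shows the highest 2023 growth?")
print(item["n_docs"], "docs |", item["question"])
```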
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts [65.04791072532106]
We present LoCoVQA, a benchmark generator for evaluating long-context extractive reasoning in vision-language models (VLMs).
LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts.
This test assesses how well VLMs can ignore irrelevant information when answering queries.
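A sketch of the generator idea follows; the grid composition and tile size are guesses at the spirit of the method, not LoCoVQA's exact augmentation procedure:

```python
# Sketch of LoCoVQA-style visual-context padding: surround the one
# query-relevant image with distractors so the model must ignore them.
import random
from PIL import Image

def compose_visual_context(target, distractors, tile=224, cols=4):
    """Paste target + distractors into a grid; return canvas and target cell."""
    images = [target] + list(distractors)
    random.shuffle(images)
    rows = (len(images) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * tile, rows * tile), "white")
    for i, img in enumerate(images):
        canvas.paste(img.resize((tile, tile)),
                     ((i % cols) * tile, (i // cols) * tile))
    return canvas, images.index(target)

# Usage (paths are hypothetical): longer contexts = more distractors.
# target = Image.open("query.jpg")
# ctx, cell = compose_visual_context(target, [Image.open(p) for p in paths])
```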
arXiv Detail & Related papers (2024-06-24T17:58:03Z)
- LooGLE: Can Long-Context Language Models Understand Long Contexts? [46.143956498529796]
LooGLE is a benchmark for large language models' long-context understanding.
It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains.
The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.