Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
- URL: http://arxiv.org/abs/2407.13766v2
- Date: Thu, 10 Oct 2024 21:03:15 GMT
- Title: Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
- Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
- Abstract summary: Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.
However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA).
We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
- Score: 63.296342841358815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, and exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU -- far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA with state-of-the-art LMMs.
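The abstract frames MIRAGE as a retrieval-augmented pipeline: a lightweight retriever first narrows the image haystack down to the frames relevant to the question, and only that subset is passed to an LMM for answering. Below is a minimal sketch of that retrieve-then-answer idea, assuming a CLIP-style retriever and a caller-supplied answer_fn; the specific checkpoints, similarity scoring, and top-k cutoff are illustrative assumptions, not MIRAGE's actual components.

```python
# Hypothetical retrieve-then-answer sketch for multi-image QA over a haystack.
# The CLIP retriever and answer_fn hook are assumptions, not MIRAGE's design.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def retrieve(question: str, images: list[Image.Image], k: int = 5) -> list[int]:
    """Rank the haystack by question-image similarity and keep the top-k indices."""
    txt = processor(text=[question], return_tensors="pt", padding=True, truncation=True)
    q = model.get_text_features(**txt)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = []
    for img in images:  # embed images one at a time to bound memory
        vis = processor(images=img, return_tensors="pt")
        v = model.get_image_features(**vis)
        v = v / v.norm(dim=-1, keepdim=True)
        scores.append(float(q @ v.T))
    return sorted(range(len(images)), key=lambda i: scores[i], reverse=True)[:k]

def answer(question: str, images: list[Image.Image], answer_fn, k: int = 5):
    """Feed only the retrieved subset to a multi-image LMM.
    answer_fn is a placeholder for whatever model wrapper is available."""
    keep = retrieve(question, images, k)
    return answer_fn(question, [images[i] for i in keep])
```

In this sketch the haystack size is limited mainly by the retriever's per-image embedding cost rather than the LMM's context window, which is the general motivation behind a visual-RAG design.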
Related papers
- Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents [31.98555661903688]
Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding.
However, they face limitations in real-world applications that require complex reasoning over a large number of images.
We introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, to evaluate LMM performance on large-scale visual document retrieval and understanding.
arXiv Detail & Related papers (2024-11-23T18:14:42Z)
- MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective [32.55432949789787]
Large Multimodal Models (LMMs) have demonstrated remarkable capabilities.
We propose a straightforward automated evaluation pipeline that requires LMMs to generate an image-prompt from a given input image.
We then employ text-to-image generative models to create a new image based on these generated prompts.
Finally, we evaluate the performance of LMMs by comparing the original image with the generated one (a hedged sketch of this loop appears after the related-papers list below).
arXiv Detail & Related papers (2024-11-21T12:16:16Z)
- AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity [85.44800864697464]
We introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction.
We show that AVG-LLaVA achieves superior performance across 11 benchmarks, while significantly reducing the number of visual tokens and speeding up inference.
arXiv Detail & Related papers (2024-09-20T10:50:21Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
However, current benchmarks fail to accurately reflect the performance of different models on such images.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs).
MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z)
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models [70.2997884478129]
We introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs.
We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs.
arXiv Detail & Related papers (2024-07-10T17:59:43Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs [48.269363759989915]
The research focuses on two aspects: first, image-to-image matching, and second, multi-image-to-text matching.
We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL.
arXiv Detail & Related papers (2024-01-05T00:26:07Z)
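As referenced in the MMGenBench entry above, its evaluation pipeline asks an LMM to describe an image as a prompt, regenerates an image from that prompt with a text-to-image model, and then compares the reconstruction with the original. The sketch below illustrates that loop under stated assumptions: the CLIP-based similarity metric, the checkpoint names, and the lmm_describe / text_to_image hooks are placeholders, not the benchmark's actual implementation.

```python
# Hypothetical MMGenBench-style loop: image -> prompt -> image -> similarity.
# The CLIP similarity metric and the model hooks are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_similarity(a: Image.Image, b: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings (one possible metric)."""
    inputs = clip_proc(images=[a, b], return_tensors="pt")
    emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float((emb[0] @ emb[1]).item())

def evaluate(lmm_describe, text_to_image, original: Image.Image) -> float:
    """Step 1: the LMM under test writes an image-prompt for the input image.
    Step 2: a text-to-image model redraws the scene from that prompt.
    Step 3: score the LMM by how close the reconstruction is to the original."""
    prompt = lmm_describe(original)          # caller-supplied LMM wrapper
    reconstruction = text_to_image(prompt)   # caller-supplied T2I wrapper
    return image_similarity(original, reconstruction)
```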