IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
- URL: http://arxiv.org/abs/2602.17687v1
- Date: Thu, 05 Feb 2026 21:57:43 GMT
- Title: IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
- Authors: Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, Bob van Luijt
- Abstract summary: We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. We analyze the limitations of unimodal text and image representations and identify question types that require one modality over the other.
- Score: 1.4427879901952518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieves 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieve higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
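The abstract reports that a multimodal hybrid of text and image retrieval outperforms either modality alone and scores all systems with Recall@k. As a rough illustration of how such a hybrid ranking and metric can be computed, here is a minimal sketch in Python. It assumes reciprocal rank fusion (RRF) as the fusion rule, which the abstract does not specify, and every function name and toy page ID below is illustrative rather than taken from the paper's code.

```python
"""Illustrative sketch (not the authors' code): fuse a text-based and an
image-based page ranking with reciprocal rank fusion (RRF), then score the
fused ranking with Recall@k. RRF is one common hybrid-search rule; the
abstract does not state which fusion the paper uses, so treat it here as
an assumption."""

from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Combine several ranked lists of page IDs.

    Each page receives sum(1 / (k + rank)) over the lists that return it;
    k=60 is the constant commonly used with RRF.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked: List[str], relevant: set, k: int) -> float:
    """Fraction of relevant pages that appear in the top-k results.

    For needle-in-the-haystack questions there is typically a single gold
    page, so this reduces to a 0/1 hit rate per query.
    """
    hits = sum(1 for page_id in ranked[:k] if page_id in relevant)
    return hits / max(len(relevant), 1)


if __name__ == "__main__":
    # Toy example: page IDs ranked by a text retriever and an image retriever.
    text_ranking = ["p12", "p7", "p3", "p44", "p9"]
    image_ranking = ["p44", "p12", "p18", "p7", "p2"]

    fused = reciprocal_rank_fusion([text_ranking, image_ranking])
    gold_pages = {"p44"}  # the single gold page for this question

    for k in (1, 5, 20):
        print(f"Recall@{k}: {recall_at_k(fused, gold_pages, k):.2f}")
```

Because the abstract notes that the text and image retrievers fail on different pages, a rank-level fusion of this kind can surface pages that either ranking alone places too low, which is one way a hybrid can exceed both unimodal baselines.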
Related papers
- PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing [3.889218009169166]
Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers.
We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments.
arXiv Detail & Related papers (2026-03-04T20:55:30Z)
- MM-BRIGHT: A Multi-Task Multimodal Benchmark for Reasoning-Intensive Retrieval [18.53521844184766]
MM-BRIGHT is the first multimodal benchmark for reasoning-intensive retrieval.
Our dataset consists of 2,803 real-world queries spanning 29 diverse technical domains.
arXiv Detail & Related papers (2026-01-14T15:31:54Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases.
UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages.
Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
- MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling [58.251621637466904]
Multi-query Scene Text Retrieval with Attention Recycling (MSTAR) is a box-free approach for scene text retrieval.
It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts.
Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset.
arXiv Detail & Related papers (2025-06-12T11:54:13Z)
- VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval [56.12310817934239]
Cross-modal embeddings behave as bags of concepts and underrepresent structured visual relationships such as pose and viewpoint.
We propose Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that mitigates this limitation of cross-modal similarity alignment.
VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching.
arXiv Detail & Related papers (2025-05-26T17:59:33Z)
- Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy [23.041812897803034]
Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match the query image and the relative captions.
We introduce Imagined Proxy for CIR (IP-CIR), a training-free method that creates a proxy image aligned with the query image and text description.
Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image.
arXiv Detail & Related papers (2024-11-24T05:27:21Z)
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions.
By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval.
CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z)
- Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related concerns and why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z)