MIRB: Mathematical Information Retrieval Benchmark
- URL: http://arxiv.org/abs/2505.15585v1
- Date: Wed, 21 May 2025 14:40:27 GMT
- Title: MIRB: Mathematical Information Retrieval Benchmark
- Authors: Haocheng Ju, Bin Dong
- Abstract summary: We introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR.
- Score: 4.587376749548757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mathematical Information Retrieval (MIR) is the task of retrieving information from mathematical documents and plays a key role in various applications, including theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving. However, a unified benchmark for evaluating these diverse retrieval tasks has been lacking. In this paper, we introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR. We hope that MIRB provides a comprehensive framework for evaluating MIR systems and helps advance the development of more effective retrieval models tailored to the mathematical domain.
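To make the evaluation setup concrete, below is a minimal sketch of how a dense retriever could be scored on a MIRB-style retrieval task. The encoder name, toy corpus, relevance labels, and nDCG@10 cutoff are illustrative assumptions, not the paper's actual datasets or protocol.

```python
# Minimal sketch: scoring a dense retriever on a MIRB-style retrieval task.
# The encoder, example data, and nDCG@10 cutoff are illustrative only; see
# the MIRB paper for the actual datasets and evaluation protocol.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

corpus = [
    "If $n$ is even then $n^2$ is even.",
    "The sum of the angles of a triangle is $\\pi$.",
    "Every continuous function on a compact set attains its maximum.",
]
queries = ["squares of even integers are even"]
qrels = {0: {0}}  # query index -> set of relevant document indices (toy labels)

q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(corpus, normalize_embeddings=True)
scores = q_emb @ d_emb.T  # cosine similarity, shape (num_queries, num_docs)

def ndcg_at_k(ranking, relevant, k=10):
    dcg = sum(1.0 / np.log2(i + 2) for i, d in enumerate(ranking[:k]) if d in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

for qi, row in enumerate(scores):
    ranking = list(np.argsort(-row))
    print(f"query {qi}: nDCG@10 = {ndcg_at_k(ranking, qrels[qi]):.3f}")
```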
Related papers
- MMSearch-R1: Incentivizing LMMs to Search [49.889749277236376]
We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty.
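A minimal sketch of what an outcome-based reward with a search penalty could look like; the reward values and penalty weight are assumptions, not the paper's exact formulation.

```python
# Sketch of an outcome-based reward with a search penalty, in the spirit of
# MMSearch-R1. The reward values and penalty weight are assumptions, not the
# paper's exact formulation.
def rollout_reward(answer_correct: bool, num_search_calls: int,
                   penalty: float = 0.1) -> float:
    outcome = 1.0 if answer_correct else 0.0
    # Penalizing each search call pushes the policy to search only on demand.
    return outcome - penalty * num_search_calls

print(rollout_reward(True, 2))   # 0.8: correct, but paid for two searches
print(rollout_reward(False, 0))  # 0.0: no reward without a correct outcome
```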
arXiv Detail & Related papers (2025-06-25T17:59:42Z)
- Can we repurpose multiple-choice question-answering models to rerank retrieved documents? [0.0]
R* is a proof-of-concept model that harmonizes multiple-choice question-answering (MCQA) models for document reranking. Experimental validation shows that R* improves retrieval accuracy and contributes to the field's advancement.
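One plausible way to harmonize MCQA models for reranking is to frame the query as a question and each candidate document as an answer option, then rank by option probability. The sketch below assumes a hypothetical `mcqa_score` function and is not R*'s actual architecture.

```python
# Sketch: repurposing a multiple-choice QA scorer as a document reranker.
# `mcqa_score` is a hypothetical stand-in for any model that returns one
# probability per answer option; R*'s actual design may differ.
from typing import Callable, List, Tuple

def rerank(query: str, docs: List[str],
           mcqa_score: Callable[[str, List[str]], List[float]]) -> List[Tuple[str, float]]:
    # Frame the query as the question and each candidate document as an
    # option, then sort documents by the model's option probabilities.
    probs = mcqa_score(query, docs)
    return sorted(zip(docs, probs), key=lambda p: p[1], reverse=True)
```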
arXiv Detail & Related papers (2025-03-06T17:53:24Z)
- Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning [32.850036320802474]
We introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework designed to tackle out-of-distribution (OOD) issues. Through a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps as a warmup. Our experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets.
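A rough sketch of the two-stage idea under stated assumptions: `retrieve` and `prm_score` are hypothetical helpers, and the warmup-context format is invented for illustration.

```python
# Sketch of a two-stage retrieval warmup for a process reward model (PRM).
# `retrieve` and `prm_score` are hypothetical helpers; RetrievalPRM's actual
# indexing and prompt format may differ.
def score_step(question, step, question_index, step_index, retrieve, prm_score, k=3):
    # Stage 1: retrieve semantically similar solved questions.
    similar_questions = retrieve(question_index, question, k)
    # Stage 2: retrieve similar reasoning steps.
    similar_steps = retrieve(step_index, step, k)
    # Use both as warmup context so the PRM generalizes beyond its training
    # distribution (the OOD issue the paper targets).
    context = "\n".join(similar_questions + similar_steps)
    return prm_score(context=context, question=question, step=step)
```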
arXiv Detail & Related papers (2025-02-20T08:40:09Z)
- A Comprehensive Survey on Composed Image Retrieval [54.54527281731775]
Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query. There is currently no comprehensive review of CIR to provide a timely overview of this field. We synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR.
arXiv Detail & Related papers (2025-02-19T01:37:24Z)
- REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval. We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing.
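The multi-difficulty scheme might be realized by holding relevance labels fixed while rephrasing queries progressively further from the documents' wording; the level names and helper functions below are illustrative, not the benchmark's actual pipeline.

```python
# Sketch of a multi-difficulty evaluation via query rephrasing: the relevance
# labels stay fixed while the query drifts further from the document's wording.
# The level names and the `rephrase`/`evaluate` helpers are illustrative.
LEVELS = ["verbatim", "light_paraphrase", "heavy_paraphrase"]

def evaluate_by_difficulty(queries, qrels, rephrase, evaluate):
    results = {}
    for level in LEVELS:
        rephrased = [rephrase(q, level) for q in queries]
        # A model that only matches keywords degrades sharply at harder levels.
        results[level] = evaluate(rephrased, qrels)
    return results
```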
arXiv Detail & Related papers (2025-02-17T22:10:47Z)
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios [14.336896748878921]
This paper introduces the IRSC benchmark for evaluating the performance of embedding models in multilingual RAG tasks.
The benchmark encompasses five retrieval tasks: query retrieval, title retrieval, part-of-paragraph retrieval, keyword retrieval, and summary retrieval.
Our contributions include: 1) the IRSC benchmark, 2) the SSCI and RCCI metrics, and 3) insights into the cross-lingual limitations of embedding models.
arXiv Detail & Related papers (2024-09-24T05:39:53Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification [62.894790379098005]
We propose a novel instruct-ReID task that requires the model to retrieve images according to the given image or language instructions. Instruct-ReID is the first exploration of a general ReID setting, where 6 existing ReID tasks can be viewed as special cases by assigning different instructions. We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework.
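For reference, a standard (non-adaptive) triplet loss looks like the sketch below; Instruct-ReID's adaptive variant modifies this per retrieval task and is not reproduced here.

```python
# Sketch of a standard triplet loss on embedding vectors. Instruct-ReID's
# *adaptive* variant is the paper's contribution and is not shown here.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # Pull the positive closer than the negative by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)
```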
arXiv Detail & Related papers (2024-05-28T03:35:46Z)
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives [2.3420045370973828]
We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO).
BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives.
arXiv Detail & Related papers (2024-02-21T22:22:30Z)
- UniIR: Training and Benchmarking Universal Multimodal Information Retrievers [76.06249845401975]
We introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities.
UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks.
We construct M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
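Instruction-guided retrieval is commonly implemented by prepending a task instruction to the query before encoding, so one retriever can serve many tasks. The text-only sketch below, with a placeholder encoder and invented instructions, is an illustration rather than UniIR's multimodal method.

```python
# Sketch of instruction-guided retrieval: prepend a task instruction to the
# query before encoding, so one retriever serves many tasks. UniIR itself is
# multimodal; this text-only version and its instructions are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def retrieve(instruction: str, query: str, doc_embeddings: np.ndarray, k: int = 5):
    q = model.encode([f"{instruction} {query}"], normalize_embeddings=True)
    scores = (q @ doc_embeddings.T)[0]
    return np.argsort(-scores)[:k]

# The same query routed to different tasks by changing the instruction:
# retrieve("Find a news caption matching:", "solar eclipse", d_emb)
# retrieve("Find a product description matching:", "solar eclipse", d_emb)
```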
arXiv Detail & Related papers (2023-11-28T18:55:52Z)
- Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
Generative retrieval systems often directly return a grounded generated text as a response to a query.
Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z)
- Retrieval-Enhanced Machine Learning [110.5237983180089]
We describe a generic retrieval-enhanced machine learning framework, which includes a number of existing models as special cases.
REML challenges information retrieval conventions, presenting opportunities for novel advances in core areas, including optimization.
The REML research agenda lays a foundation for a new style of information access research and paves a path towards advancing machine learning and artificial intelligence.
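Stripped to its core, a retrieval-enhanced prediction loop retrieves from an external memory and conditions the model on the results; the sketch below uses hypothetical `embed`, `index`, and `predict` components.

```python
# Sketch of the generic retrieval-enhanced ML loop: the model queries an
# external corpus and conditions its prediction on the retrieved items.
# `embed`, `index`, and `predict` are hypothetical components.
def reml_predict(x, embed, index, predict, k=4):
    query = embed(x)                      # map the input to the retrieval space
    neighbors = index.search(query, k)    # fetch k items from external memory
    return predict(x, neighbors)          # condition the model on what came back
```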
arXiv Detail & Related papers (2022-05-02T21:42:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.