Related papers: MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning

MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning

URL: http://arxiv.org/abs/2502.09933v1
Date: Fri, 14 Feb 2025 06:05:12 GMT
Title: MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning
Authors: Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen,
Abstract summary: We propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark.<n>We study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots.
Score: 21.056519816264505
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Inductive Reasoning (IR), the ability to summarize rules from examples and apply on new ones, has long been viewed as a primal ability for general intelligence and widely studied by cognitive science and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on few-shot (usually $<$10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs have brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations are mostly focused on classification (a very limited aspect of IR), and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark that asks LLM to induce output via input-output examples from underlying functions with diverse data format. Based on MIR-Bench, we study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots and the effect of Chain-of-Thought (CoT), and acquired insightful findings.

Related papers

MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance [5.192956837901584]
We introduce MDBench, a new dataset for evaluating large language mod-els (LLMs) on the task of multi-document reasoning.<n>We use a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets.<n>We analyze the behavior of popular LLMs and prompting techniques, finding that MDBENCH poses significant challenges for all methods.
arXiv Detail & Related papers (2025-06-17T19:14:30Z)
From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? [34.959850282872594]
We present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills.<n>AR-Bench comprises three task families-detective cases, situation puzzles, and guessing numbers.<n> Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning.
arXiv Detail & Related papers (2025-06-09T23:56:41Z)
KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing [64.38243807002878]
We present KnowTrace, an elegant RAG framework to mitigate the context overload in large language models.<n>KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question.<n>It consistently surpasses existing methods across three multi-hop question answering benchmarks.
arXiv Detail & Related papers (2025-05-26T17:22:20Z)
Reasoning Capabilities and Invariability of Large Language Models [49.23570751696334]
We aim to provide a comprehensive analysis of Large Language Models' reasoning competence.<n>We introduce a new benchmark dataset with a series of simple reasoning questions demanding shallow logical reasoning.<n>An empirical analysis involving zero-shot and few-shot prompting across 24 LLMs of different sizes reveals that, while LLMs with over 70 billion parameters perform better in the zero-shot setting, there is still a large room for improvement.
arXiv Detail & Related papers (2025-05-01T18:12:30Z)
Argumentation Computation with Large Language Models : A Benchmark Study [6.0682923348298194]
Large language models (LLMs) have made significant advancements in neuro-symbolic computing. We aim to investigate the capability of LLMs in determining the extensions of various abstract argumentation semantics.
arXiv Detail & Related papers (2024-12-21T18:23:06Z)
DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens. We use detective novels as data sources, which naturally have various reasoning elements. We manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z)
Analyzing the Role of Semantic Representations in the Era of Large Language Models [104.18157036880287]
We investigate the role of semantic representations in the era of large language models (LLMs) We propose an AMR-driven chain-of-thought prompting method, which we call AMRCoT. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions.
arXiv Detail & Related papers (2024-05-02T17:32:59Z)
C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations. Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
arXiv Detail & Related papers (2024-02-17T11:28:08Z)
Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks. The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human. These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z)
LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks. Recent efforts have demonstrated that the LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z)
LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning [61.7853049843921]
Chain-of-thought (CoT) prompting is a popular in-context learning approach for large language models (LLMs) This paper introduces a new approach named Latent Reasoning Skills (LaRS) that employs unsupervised learning to create a latent space representation of rationales.
arXiv Detail & Related papers (2023-12-07T20:36:10Z)
LooGLE: Can Long-Context Language Models Understand Long Contexts? [46.143956498529796]
LooGLE is a benchmark for large language models' long context understanding. It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection. It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)
Concise and Organized Perception Facilitates Reasoning in Large Language Models [32.71672086718057]
We show that large language models (LLMs) exhibit failure patterns akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks. We propose a novel reasoning approach named Concise and Organized Perception (COP) COP carefully analyzes the given statements to identify the most pertinent information while eliminating redundancy efficiently.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning [28.162661418161466]
Large language models (LLMs) have recently shown great potential for in-context learning. This paper investigates the reliance of LLMs on shortcuts or spurious correlations within prompts. We uncover a surprising finding that larger models are more likely to utilize shortcuts in prompts during inference.
arXiv Detail & Related papers (2023-05-26T20:56:30Z)
Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks. We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
Search-in-the-Chain: Interactively Enhancing Large Language Models with Search for Knowledge-intensive Tasks [121.74957524305283]
This paper proposes a novel framework named textbfSearch-in-the-Chain (SearChain) for the interaction between Information Retrieval (IR) and Large Language Model (LLM) Experiments show that SearChain outperforms state-of-the-art baselines on complex knowledge-intensive tasks.
arXiv Detail & Related papers (2023-04-28T10:15:25Z)
Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples. We show that LLMs are decent zero-shot reasoners by simply adding Let's think step by step'' before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances.
arXiv Detail & Related papers (2022-05-24T09:22:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.