Related papers: Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

URL: http://arxiv.org/abs/2508.19035v1
Date: Tue, 26 Aug 2025 13:54:17 GMT
Title: Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction
Authors: Congchi Yin, Tianyi Wu, Yankai Shu, Alex Gu, Yunhan Wang, Jun Shao, Xun Jiang, Piji Li,
Abstract summary: Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment.<n>This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning.<n>We introduce a novel evaluation paradigm, textitblack-box interaction, to tackle this challenge.
Score: 30.76377830825308
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans discovery of real world. We introduce a novel evaluation paradigm, \textit{black-box interaction}, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it in given exploration turns, and reasoning over observed input-output pairs. Leveraging this idea, we build the \textsc{Oracle} benchmark which comprises 6 types of black-box task and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70\% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40\%. Further analysis indicates a universal difficulty among LLMs: They lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.

Related papers

Explaining Black-box Language Models with Knowledge Probing Systems: A Post-hoc Explanation Perspective [43.267605279424686]
Pre-trained Language Models (PLMs) are trained on large amounts of unlabeled data, yet they exhibit remarkable reasoning skills.<n>This paper proposes a novel Knowledge-guided Probing approach called KnowProb in a post-hoc explanation way.
arXiv Detail & Related papers (2025-08-23T09:41:59Z)
Context Is Not Comprehension [0.6445605125467572]
We introduce Verbose ListOps, a benchmark that embeds deterministic ListOps computations inside narrative camouflage.<n> Experiments show that models which solve raw ListOps with approximately 100% accuracy collapse on VLO after only 10,000 tokens.
arXiv Detail & Related papers (2025-06-05T11:41:05Z)
Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems [16.995977750934887]
Large language models (LLM) learn to identify a black-box function from passively observed versus actively collected data.<n>We show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference.<n>By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions.
arXiv Detail & Related papers (2025-05-23T14:37:36Z)
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.<n>TVC helps the model retain attention to the visual components throughout the reasoning.<n>Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
CALM: Curiosity-Driven Auditing for Large Language Models [27.302357350862085]
We propose Curiosity-Driven Auditing for Large Language Models (CALM) to finetune an LLM as the auditor agent.<n>CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting.
arXiv Detail & Related papers (2025-01-06T13:14:34Z)
MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning [22.440669015518015]
We evaluate whether multi-modal large language models (MLLMs) possess abstract visual reasoning abilities. Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns. We introduce MARVEL, a benchmark with 770 MLLMs composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
arXiv Detail & Related papers (2024-04-21T09:15:02Z)
SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [61.8876114116716]
Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks.<n>However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored.<n>We introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection.
arXiv Detail & Related papers (2024-02-06T17:31:36Z)
Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability to smaller LMs. We exploit the potential of LLM as a reasoning teacher by building an interactive multi-round learning paradigm. To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Language Models as Black-Box Optimizers for Vision-Language Models [62.80817942316398]
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. We aim to develop a black-box approach to optimize VLMs through natural language prompts.
arXiv Detail & Related papers (2023-09-12T04:03:41Z)
LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles [22.119796373133298]
We propose a novel evaluation benchmark, LatEval, which assesses the model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs with 2 aspects: the quality of questions posed by the model and the model's capability to integrate information for problem-solving. For example, even the most advanced model, GPT-4, exhibits the advantage to some extent, yet still maintain a noticeable gap when compared to human.
arXiv Detail & Related papers (2023-08-21T16:49:40Z)
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning. This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation. We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.