TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering
Abstract Overview
TopBench introduces a benchmark for tabular question answering where the answer is not explicitly stored in the table and must be inferred from historical patterns. The benchmark contains 779 samples built from 35 source tables across healthcare, finance, and daily consulting domains, organized into four sub-tasks: single-point prediction, decision making, treatment effect analysis, and ranking/filtering. The paper frames these problems as requiring both intent recognition from natural-language queries and predictive reasoning over potentially large tables, and evaluates models in both text-only and agentic code-execution settings. The authors also propose task-specific evaluation procedures for free-form reasoning and structured outputs, incorporating verification steps to reduce judge hallucination.
Novelty
The paper's main novelty is defining and benchmarking implicit predictive tabular QA, a setting that goes beyond standard table lookup or aggregation by requiring models to infer unobserved outcomes from natural-language requests. It is also distinctive in separating intent recognition from predictive modeling and in covering multiple predictive task types (single-point prediction, decision making, treatment effect analysis, and ranking/filtering) within one benchmark and evaluation framework.
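To make the separation between intent recognition and predictive modeling concrete, the following is a minimal sketch, not the paper's pipeline. It assumes that a query has already been mapped to a structured intent (a target column plus known feature values), and it swaps in a plain k-nearest-neighbour average as a stand-in predictor. The `Intent` class, the `predict_knn` helper, the column names, and the toy rows are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Intent:
    """Hypothetical structured form of a natural-language query."""
    target: str                 # column whose unobserved value is requested
    features: Dict[str, float]  # feature values stated in the query


def predict_knn(table: List[Dict[str, float]], intent: Intent, k: int = 3) -> float:
    """Average the target over the k historical rows closest to the query."""
    def distance(row: Dict[str, float]) -> float:
        # Squared Euclidean distance over the features named in the intent.
        return sum((row[name] - value) ** 2 for name, value in intent.features.items())

    nearest = sorted(table, key=distance)[:k]
    return mean(row[intent.target] for row in nearest)


# Made-up toy rows, for illustration only (not benchmark data).
toy_table = [
    {"age": 30, "bmi": 22, "cost": 100},
    {"age": 40, "bmi": 25, "cost": 200},
    {"age": 50, "bmi": 30, "cost": 300},
    {"age": 60, "bmi": 35, "cost": 400},
]

# Stage 1 (intent recognition) is assumed to be done already; this is its output.
toy_intent = Intent(target="cost", features={"age": 45, "bmi": 27})

# Stage 2 (predictive modeling) infers the value that is not stored in the table.
estimate = predict_knn(toy_table, toy_intent, k=2)
```

In this framing, an error in the first stage (for example, choosing the wrong target column) and an error in the second stage (a weak predictor) are distinct failure modes, which is why the two can be studied separately.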
Results
Experiments show that current LLMs remain fragile on these tasks, with most scores below 0.60. Gemini 3 Flash is among the strongest models, reaching 0.66 accuracy on single-point prediction and scores of 0.65 on decision making and treatment effect analysis in the agentic setting. Ranking/filtering remains difficult: the best F1 is 0.58 and the lowest NMAE is 0.26 (DeepSeek-V3.2-Instruct). Semantic hints can correct intent misalignment in several cases (e.g., Qwen3-Instruct improves from 0.43 to 0.56 on single-point prediction). A predict-only ensemble given gold structured inputs outperforms the best end-to-end agentic model (0.76 vs. 0.66 on single-point prediction), indicating that predictive modeling capacity remains a major bottleneck.
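For readers unfamiliar with the ranking/filtering metrics, the sketch below shows one common way to compute a set-based F1 and a normalized mean absolute error (NMAE). The summary does not state how the paper defines either metric, so both functions are assumptions: `set_f1` compares sets of selected row identifiers, and `nmae` normalizes MAE by the range of the gold values (other conventions divide by the mean instead). Higher F1 is better, while lower NMAE is better.

```python
from typing import Iterable, Sequence


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between a predicted set of row identifiers and the gold set."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # both empty: treated here as a perfect match
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def nmae(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """MAE divided by the range of the gold values (assumed normalization)."""
    if len(predicted) != len(gold) or len(gold) == 0:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    mae = sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)
    value_range = max(gold) - min(gold)
    if value_range == 0:
        raise ValueError("gold values have zero range; NMAE is undefined")
    return mae / value_range
```

For example, a prediction that selects rows `a`, `b`, `c` when the gold set is `b`, `c`, `d` has precision and recall of 2/3 each, so its F1 is also 2/3.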
Key Points
- TopBench targets implicit predictive tabular QA, where models must infer missing outcomes rather than retrieve explicit table entries, addressing a gap in existing TQA benchmarks.
- The benchmark spans 779 samples from 35 tables across three domains and four predictive sub-tasks, with evaluation covering both natural-language reasoning and structured file outputs.
- Empirical results show that intent recognition and predictive modeling are both weak points for current LLMs, and a predict-only ensemble given gold structured inputs can substantially outperform the best end-to-end agent (0.76 vs. 0.66 on single-point prediction).