Related papers: CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models

CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models

URL: http://arxiv.org/abs/2510.18173v1
Date: Mon, 20 Oct 2025 23:51:28 GMT
Title: CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models
Authors: Ritam Upadhyay, Naman Ahuja, Rishabh Baral, Aparna Garimella, Vivek Gupta,
Abstract summary: We present CMT-Bench, a diagnostic benchmark built from live cricket commentary.<n>We find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drop under entity-form changes.
Score: 11.167804698594866
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM Driven text-to-table (T2T) systems often rely on extensive prompt-engineering or iterative event extraction in code-parsable formats, which boosts scores but are computationally expensive and obscure how models actually reason over temporal evolving narratives to summarise key information. We present CMT-Bench, a diagnostic benchmark built from live cricket commentary that requires dynamic table generation across two evolving schemas under a dense, rule-governed policy. CMT-Bench is designed to probe robustness via three semantics-preserving dimensions: (i) extractive-cue ablation to separate extractive shortcuts from state tracking, (ii) temporal prefixing to test long-context stability, and (iii) entity-form perturbations (anonymization, outof-distribution substitutions, role-entangling paraphrases) to assess sensitivity to surface variation. Across diverse long-context stateof-the-art LLMs, we find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drop under entity-form changes. Complementary distributional tests confirm significant shifts in numeric error patterns, indicating drift in reasoning rather than mere noise. Our results show that current LLMs are brittle in dynamic Textto-table generation, motivating robustness-first evaluation as a prerequisite for developing efficient and scalable approaches for this task.

Related papers

Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel.<n>Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding.<n>We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG [35.96258615258145]
We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty.<n>ETC consistently outperforms strong baselines while reducing retrieval frequency.<n>It is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines.
arXiv Detail & Related papers (2025-11-13T05:28:02Z)
PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs [0.21485350418225244]
We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that generates meaning paraphrases at evaluation time and aggregates results across multiple runs.<n>We validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed.<n>Our results are statistically robust over multiple runs and we extended our experiments to 3 datasets covering 10 languages.
arXiv Detail & Related papers (2025-10-08T07:37:19Z)
AXIS: Explainable Time Series Anomaly Detection with Large Language Models [33.68487894996624]
AXIS is a framework that conditions a frozen Large Language Models (LLMs) for nuanced time-series understanding.<n>LLMs operate on discrete tokens and struggle to directly process long, continuous signals.<n>We introduce a new benchmark featuring multi-format questions and rationales that supervise contextual grounding and pattern-level semantics.
arXiv Detail & Related papers (2025-09-29T07:24:22Z)
CALM: A Framework for Continuous, Adaptive, and LLM-Mediated Anomaly Detection in Time-Series Streams [0.42970700836450476]
This paper introduces CALM, a novel, end-to-end framework for real-time anomaly detection.<n> CALM is built on the Apache Beam distributed processing framework.<n>It implements a closed-loop, continuous fine-tuning mechanism that allows the anomaly detection model to adapt to evolving data patterns in near real-time.
arXiv Detail & Related papers (2025-08-29T00:27:35Z)
SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR [58.31068047426522]
Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference.<n>Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction.<n>We propose SUTA-LM, a simple yet effective extension of SUTA, with language model rescoring.<n> Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
arXiv Detail & Related papers (2025-06-10T02:50:20Z)
LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations.<n>This structured approach allows Large Language Models (LLMs) to generate and executesql queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z)
Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach.<n>We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding.<n>As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [64.74765550805024]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs.<n>We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints.<n>SoT achieves token reductions of up to 84% with minimal accuracy loss across 18 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z)
Test-Time Alignment for Large Language Models via Textual Model Predictive Control [63.508812485566374]
Textual Model Predictive Control (TMPC) is a novel predictive planning framework adapted for aligning Large Language Models at inference time.<n>TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis.<n>Results demonstrate that TMPC consistently improves performance, highlighting the generality.
arXiv Detail & Related papers (2025-02-28T07:24:33Z)
TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models [14.411646409316624]
We introduce textbfHierarchical textbfText-textbfFree textbfAlignment (textbfTS-HTFA), a novel method for time-series forecasting.<n>We replace paired text data with adaptive virtual text based on QR decomposition word embeddings and learnable prompt.<n>Experiments on multiple time-series benchmarks demonstrate that HTFA achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-23T12:57:24Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.