TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?
- URL: http://arxiv.org/abs/2509.22715v1
- Date: Wed, 24 Sep 2025 08:05:32 GMT
- Title: TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?
- Authors: Jiho Park, Jongyoon Song, Minjin Choi, Kyuho Heo, Taehun Huh, Ji Won Kim,
- Abstract summary: Large language models (LLMs) are increasingly integral as productivity assistants. Existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. We introduce TRUEBench, a benchmark specifically designed for LLM-based productivity assistants.
- Score: 11.400738388392654
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly integral as productivity assistants, but existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. Current benchmarks often (i) lack sufficient multilinguality, (ii) fail to capture the implicit constraints inherent in user requests, and (iii) overlook the complexities of multi-turn dialogue. To address these critical gaps and provide a more realistic assessment, we introduce TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark), a novel benchmark specifically designed for LLM-based productivity assistants. TRUEBench distinguishes itself by featuring input prompts across 12 languages, incorporating intra-instance multilingual instructions, employing rigorous evaluation criteria to capture both explicit and implicit constraints, and including complex multi-turn dialogue scenarios with both accumulating constraints and context switches. Furthermore, to ensure reliability in evaluation, we refined constraints using an LLM validator. Extensive experiments demonstrate that TRUEBench presents significantly greater challenges than existing benchmarks; for instance, a strong model like OpenAI o1 achieved only a 69.07% overall pass rate. TRUEBench offers a demanding and realistic assessment of LLMs in practical productivity settings, highlighting their capabilities and limitations.
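To make the evaluation protocol concrete, here is a minimal sketch of how a TRUEBench-style pass-rate harness could be organised, assuming a `model(history, prompt)` callable. All names (`Turn`, `Instance`, `evaluate`) are illustrative, not the authors' released code, and the simple programmatic checks stand in for constraints that the paper refines with an LLM validator:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    prompt: str        # user request, possibly in any of the 12 languages
    constraints: list  # checks (response -> bool), explicit or implicit

@dataclass
class Instance:
    turns: list                  # multi-turn dialogue scenario
    context_switch_at: int = -1  # turn index where the topic changes, if any

def evaluate(instance: Instance, model) -> float:
    """Per-turn pass rate: a turn passes only if the response satisfies
    every constraint accumulated so far; a context switch resets them."""
    history, active, passed = [], [], 0
    for i, turn in enumerate(instance.turns):
        if i == instance.context_switch_at:
            active = []  # context switch drops the accumulated constraints
        active.extend(turn.constraints)
        response = model(history, turn.prompt)
        history.append((turn.prompt, response))
        passed += all(check(response) for check in active)
    return passed / len(instance.turns)
```

A simple explicit constraint might be `lambda r: len(r.splitlines()) <= 3` for a "keep it to three lines" instruction; implicit constraints, such as answering in the language of the request, are harder to encode and are part of what makes the benchmark demanding.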
Related papers
- AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context [10.769682566098695]
AACR-Bench is a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects.
arXiv Detail & Related papers (2026-01-27T11:28:44Z) - KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z) - EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [64.70546873396624]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance LLMs' ability to accurately fulfill multi-task workflows.
arXiv Detail & Related papers (2025-06-10T02:39:55Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis examines design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation [1.7268889851975326]
We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) on test-driven development (TDD) tasks. Unlike traditional approaches relying on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases (a minimal sketch of this tests-as-prompt setup appears after this list).
arXiv Detail & Related papers (2025-05-13T23:47:12Z) - Meeseeks: A Feedback-Driven, Iterative Self-Correction Benchmark evaluating LLMs' Instruction Following Capability [21.96694731466089]
We introduce Meeseeks, a fully automated instruction-following benchmark equipped with an integrated feedback mechanism. Meeseeks accurately identifies erroneous components in model responses and provides corresponding feedback, thereby iteratively guiding the model toward self-correction (a generic version of this loop is sketched after this list). We conducted comprehensive analysis at both the macro and instance levels, uncovering numerous common issues prevalent in current state-of-the-art models.
arXiv Detail & Related papers (2025-04-30T13:28:19Z) - Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models [0.0]
I argue that inherent limitations of the benchmarking paradigm render benchmark performance highly unsuitable as a metric for generalisable competence over cognitive tasks. I conclude that benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities.
arXiv Detail & Related papers (2025-02-20T07:13:29Z) - CFBench: A Comprehensive Constraints-Following Benchmark for LLMs [32.47057812403923]
CFBench is a large-scale Comprehensive Constraints-Following Benchmark for Large Language Models. It features 1,000 curated samples covering more than 200 real-life scenarios and over 50 NLP tasks. CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types.
arXiv Detail & Related papers (2024-08-02T09:03:48Z) - SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for large language models.
We introduce a multi-level mechanism that incrementally adds a single constraint to the initial instruction at each successive level (sketched after this list).
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [78.60644407028022]
We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with performance gains of 1-8% for each turn of tool use.
Among the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
arXiv Detail & Related papers (2023-09-19T15:25:42Z)
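As referenced in the WebApp1K entry above, the tests-as-prompt idea reduces to feeding the test file itself to the model and scoring by whether the generated implementation passes. The sketch below is an assumption-laden stand-in: it uses Python and pytest as the test runner, whereas WebApp1K itself targets web-application code, and `model` is a hypothetical completion callable:

```python
import pathlib
import subprocess
import tempfile

def tests_as_prompt(test_source: str, model) -> bool:
    """TDD-style evaluation (sketch): the test file *is* the specification;
    the model must emit an implementation that makes the tests pass."""
    prompt = ("Implement the module under test so that all of the "
              "following tests pass. Reply with code only.\n\n" + test_source)
    implementation = model(prompt)
    with tempfile.TemporaryDirectory() as tmp:
        pathlib.Path(tmp, "impl.py").write_text(implementation)
        pathlib.Path(tmp, "test_impl.py").write_text(test_source)
        result = subprocess.run(["pytest", tmp], capture_output=True)
    return result.returncode == 0  # pass/fail is the whole metric
```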
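Likewise, the feedback-driven self-correction loop behind Meeseeks (and, with tool output in place of a judge, MINT's multi-turn setting) follows a common pattern. In this sketch, `model` and `critique` are assumed interfaces, not either paper's API:

```python
def self_correction_loop(model, critique, prompt, max_rounds=3):
    """Iterative refinement (sketch): a judge pinpoints which requirements
    the response violates, and the model retries with that feedback."""
    response = model(prompt)
    for _ in range(max_rounds):
        violations = critique(prompt, response)  # list of failed requirements
        if not violations:
            return response, True   # every requirement satisfied
        feedback = "Your answer violated:\n" + "\n".join(violations)
        response = model(prompt + "\n\n" + feedback)
    return response, False          # still failing after the round budget
```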
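Finally, FollowBench's multi-level mechanism, where each level adds exactly one constraint to the same base instruction, can be expressed as a small builder (illustrative names only):

```python
def build_levels(base_instruction: str, constraints: list[str]) -> list[str]:
    """Multi-level prompts (sketch): level k carries the first k constraints,
    so a failure can be localised to the constraint that broke the model."""
    return [
        base_instruction + " " + " ".join(constraints[:k])
        for k in range(1, len(constraints) + 1)
    ]

# Example: three levels of increasing strictness for one instruction.
# build_levels("Write a product blurb.",
#              ["Use at most 50 words.", "Mention the price.",
#               "Avoid superlatives."])
```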
This list is automatically generated from the titles and abstracts of the papers on this site.