TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
- URL: http://arxiv.org/abs/2602.01675v1
- Date: Mon, 02 Feb 2026 05:43:08 GMT
- Title: TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
- Authors: Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, Kaimin Wang, Wenhao Liu, Tianlong Li, Fengpeng Yue, Feng Hong, Cao Liu, Ke Zeng,
- Abstract summary: TRIP-Bench is a long-horizon benchmark grounded in realistic travel-planning scenarios. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets.
- Score: 12.553634759736601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.
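The abstract names two ingredients of GTPO, reward normalization and reward differencing, but does not give their formulas. A minimal illustrative sketch of one plausible reading, with all function names hypothetical: episode rewards are normalized within a sampled group (as in group-relative policy optimization), and per-turn cumulative scores are converted into per-turn differences so each turn is credited only for its marginal gain.

```python
# Hedged sketch only: GTPO's exact reward normalization and reward
# differencing are not specified in the abstract. All names here are
# hypothetical illustrations, not the authors' implementation.

def group_normalize(rewards):
    """Normalize a group of episode rewards to zero mean, unit std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def turn_reward_differences(cumulative):
    """Convert per-turn cumulative scores into per-turn deltas,
    crediting each turn only for its marginal improvement."""
    deltas, prev = [], 0.0
    for r in cumulative:
        deltas.append(r - prev)
        prev = r
    return deltas
```

Under this reading, a turn that degrades the plan (e.g. cumulative scores 0.2, 0.5, 0.4) would receive a negative delta at the third turn, which is one way differencing could sharpen credit assignment over long dialogues.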
Related papers
- AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios [32.58358574768901]
Real-world multimodal agents solve multi-step tasks grounded in visual evidence. Existing benchmarks mainly evaluate single-turn visual reasoning or specific tool skills. We introduce AgentVista, a benchmark for generalist multimodal agents.
arXiv Detail & Related papers (2026-02-26T16:30:46Z) - WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints [43.573740013433394]
Real-world autonomous planning requires coordinating tightly coupled constraints, where a single decision dictates the feasibility of all subsequent actions. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages.
arXiv Detail & Related papers (2026-02-09T08:03:30Z) - TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning [22.3041021610283]
Travel planning is a natural real-world task for testing the planning and tool-use abilities of large language models (LLMs). TravelBench is a benchmark for fully real-world travel planning.
arXiv Detail & Related papers (2025-12-27T18:25:14Z) - TBT-Former: Learning Temporal Boundary Distributions for Action Localization [1.2461503242570642]
Temporal Boundary Transformer (TBT-Former) is a new architecture for temporal action localization. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability-distribution learning problem. TBT-Former sets a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets.
arXiv Detail & Related papers (2025-12-01T05:38:13Z) - Multi-Agent Craftax: Benchmarking Open-Ended Multi-Agent Reinforcement Learning at the Hyperscale [53.08403177911567]
Craftax-MA is an extension of the popular open-ended RL environment, Craftax. Craftax-Coop introduces heterogeneous agents, trading, and more mechanics that require complex cooperation among agents for success.
arXiv Detail & Related papers (2025-11-07T01:09:36Z) - One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [51.50565654314582]
Large language models can follow users' instructions throughout a dialogue spanning multiple topics. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. We propose a framework for assessing multi-turn instruction-following ability.
arXiv Detail & Related papers (2025-11-05T14:39:59Z) - VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications [20.065087936770215]
We introduce VitaBench, a benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks.
arXiv Detail & Related papers (2025-09-30T16:33:49Z) - MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z) - UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios [63.67884284105684]
We introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. Agents are evaluated on long-horizon discovery tasks where they must iteratively uncover hidden rules. Our experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores.
arXiv Detail & Related papers (2025-09-26T02:04:00Z) - UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning [78.86567400365392]
We present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks.
arXiv Detail & Related papers (2025-09-15T03:24:08Z) - Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning [29.605396813225386]
We show how reinforcement learning can be used to train agents for multi-turn interactive tasks. Our methodology offers a practical approach for training capable agents on such tasks using open-weight models.
arXiv Detail & Related papers (2025-08-05T14:30:47Z) - TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models [5.6525926183880255]
We introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task. In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains.
arXiv Detail & Related papers (2025-06-02T05:47:50Z) - MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning [110.54752872873472]
MultiZoo is a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms.
MultiBench is a benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
arXiv Detail & Related papers (2023-06-28T17:59:10Z) - MultiBench: Multiscale Benchmarks for Multimodal Representation Learning [87.23266008930045]
MultiBench is a systematic and unified benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
It provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation.
It introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections.
arXiv Detail & Related papers (2021-07-15T17:54:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.