AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios
- URL: http://arxiv.org/abs/2601.20613v2
- Date: Fri, 30 Jan 2026 13:36:46 GMT
- Title: AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios
- Authors: Kaiyuan Chen, Qimin Wu, Taiyu Hou, Tianhao Tang, Xueyu Hu, Yuchen Hou, Bikun Li, Chengming Qian, Guoyin Wang, Haolin Chen, Haotong Tian, Haoye Zhang, Haoyu Bian, Hongbing Pan, Hongkang Zhang, Hongyi Zhou, Jiaqi Cai, Jiewu Rao, Jiyuan Ren, Keduan Huang, Lucia Zhu Huang, Mingyu Yuan, Naixu Guo, Qicheng Tang, Qinyan Zhang, Shuai Chen, Siheng Chen, Ting Ting Li, Xiaoxing Guo, Yaocheng Zuo, Yaoqi Guo, Yinan Wang, Yinzhou Yu, Yize Wang, Yuan Jiang, Yuan Tian, Yuanshuo Zhang, Yuxuan Liu, Yvette Yan Zeng, Zenyu Shan, Zihan Yin, Xiaobo Hu, Yang Liu, Yixin Ren, Yuan Gong
- Abstract summary: The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
- Score: 49.90735676070039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations. However, in daily scenarios, the perception of these advanced AI capabilities among general users remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks necessary to cover the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or expanding upon ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate using Gemini-3-Pro. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. We benchmarked four leading general AI agents and found that agent products built on top of LLM APIs and the ChatGPT agent trained with agent RL currently share the first tier. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to develop cutting-edge agent products.
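As a rough illustration of the instance-level rubric idea, the following minimal Python sketch scores one task by asking an LLM judge about each rubric point. The rubric schema and the `call_judge_llm` placeholder are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of rubric-based LLM verification as described in the
# abstract. The rubric schema and `call_judge_llm` are assumptions for
# illustration, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class RubricPoint:
    criterion: str   # e.g. "output is a valid .xlsx with a 'Budget' sheet"
    weight: float    # contribution of this point to the task score

def call_judge_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM judge (e.g. Gemini-3-Pro)."""
    raise NotImplementedError

def score_task(instruction: str, agent_output: str,
               rubric: list[RubricPoint]) -> float:
    """Score one task by asking the judge about each rubric point."""
    earned = 0.0
    for point in rubric:
        prompt = (
            f"Task: {instruction}\n"
            f"Agent output: {agent_output}\n"
            f"Criterion: {point.criterion}\n"
            "Answer strictly PASS or FAIL."
        )
        if call_judge_llm(prompt).strip().upper().startswith("PASS"):
            earned += point.weight
    total = sum(p.weight for p in rubric)
    return earned / total if total else 0.0
```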
Related papers
- AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks. We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
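A PRM of this kind can be pictured as a scorer over candidate actions at each turn; the minimal sketch below (with a hypothetical `prm_score` function) shows step-wise selection, not AgentPRM's actual implementation.

```python
# Illustrative sketch: a process reward model scores candidate actions
# and the agent takes the highest-scoring one. `prm_score` is hypothetical.
from typing import Callable, List

def select_action(history: List[str], candidates: List[str],
                  prm_score: Callable[[List[str], str], float]) -> str:
    """Pick the candidate action the PRM rates most promising."""
    return max(candidates, key=lambda a: prm_score(history, a))

# Toy usage with a stand-in PRM that prefers shorter actions:
best = select_action(["user asked X"], ["reply briefly", "write a long plan"],
                     prm_score=lambda h, a: -len(a))
```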
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
- Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
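A harness that fans evaluations out in parallel might look like the sketch below; the `run_task` stub and worker count are illustrative assumptions, not the leaderboard's actual infrastructure.

```python
# Minimal sketch of a parallel evaluation harness. `run_task` is a stub.
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id: str) -> dict:
    """Placeholder: execute one benchmark task and return its result."""
    return {"task": task_id, "passed": False}  # stub result

def evaluate(task_ids: list[str], workers: int = 32) -> list[dict]:
    """Run many task evaluations concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, task_ids))
```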
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
- QAgent: A modular Search Agent with Interactive Query Understanding [25.147900132089777]
Large language models excel at natural language tasks but are limited by their static parametric knowledge. We propose a unified agentic RAG framework that employs a search agent for adaptive retrieval. Experiments show QAgent excels at QA and serves as a plug-and-play module for real-world deployment.
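The adaptive retrieval loop can be sketched as follows; every helper here is an illustrative stub, not QAgent's actual interface.

```python
# Hedged sketch of an agentic RAG loop: rewrite the query, retrieve, and
# stop once the evidence looks sufficient. All helpers are stubs.
def retrieve(query: str) -> list[str]:
    return []  # stub: would call a search backend

def is_sufficient(question: str, evidence: list[str]) -> bool:
    return bool(evidence)  # stub: would ask the LLM to judge coverage

def rewrite_query(question: str, evidence: list[str]) -> str:
    return question  # stub: would ask the LLM to refine the query

def answer(question: str, max_rounds: int = 3) -> str:
    query, evidence = question, []
    for _ in range(max_rounds):
        evidence += retrieve(query)              # search step
        if is_sufficient(question, evidence):    # adaptive stopping
            break
        query = rewrite_query(question, evidence)  # query understanding
    return f"answer to {question!r} from {len(evidence)} snippets"  # stub
```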
arXiv Detail & Related papers (2025-10-09T16:08:05Z)
- Open Agent Specification (Agent Spec): A Unified Representation for AI Agents [10.685555728094338]
We introduce Open Agent Specification (Agent Spec), a declarative language for defining AI agents and agentic workflows. Agent Spec defines a common set of components, control and data flow semantics, and schemas that allow an agent to be defined once and executed across different runtimes.
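As a toy picture of the define-once idea, the sketch below declares an agent as plain data that any conforming runtime could execute; the field names are assumptions, not the actual Agent Spec schema.

```python
# Illustrative sketch of a declarative agent definition: components plus
# data flow declared once as data. Field names are hypothetical, not the
# real Agent Spec schema.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    tool: str                              # component to invoke
    inputs: dict = field(default_factory=dict)

@dataclass
class AgentSpec:
    name: str
    llm: str                               # model the agent runs on
    steps: list[Step] = field(default_factory=list)

spec = AgentSpec(
    name="report-writer",
    llm="any-llm",
    steps=[Step("search", tool="web_search", inputs={"query": "{topic}"}),
           Step("draft", tool="llm_generate", inputs={"context": "{search}"})],
)
```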
arXiv Detail & Related papers (2025-10-05T12:26:42Z)
- Self-Challenging Language Model Agents [98.62637336505242]
We propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The framework achieves over a two-fold improvement on Llama-3.1-8B-Instruct, despite using only self-generated training data.
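The self-challenging loop reduces to propose, attempt, verify, keep; the sketch below uses stub functions and is not the paper's training code.

```python
# Hedged sketch of self-challenging data collection: the agent writes its
# own tasks, attempts them, and keeps only verified solutions. All stubs.
def propose_task(model) -> str:
    return "a self-generated task"  # stub: model writes its own task

def attempt(model, task: str) -> str:
    return "trajectory"  # stub: model tries to solve the task

def verify(task: str, trajectory: str) -> bool:
    return True  # stub: e.g. run a checker the task carries with it

def collect_training_data(model, n: int) -> list[tuple[str, str]]:
    data = []
    for _ in range(n):
        task = propose_task(model)
        traj = attempt(model, task)
        if verify(task, traj):       # keep only verifiably solved tasks
            data.append((task, traj))
    return data
```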
arXiv Detail & Related papers (2025-06-02T14:23:33Z)
- Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration [63.90193684394165]
We introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. During the experiential learning phase, we quantify the quality of each step in the task-solving workflow and store the resulting rewards. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step.
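Experience reuse of this kind can be sketched as a store of scored steps plus a retriever; the toy similarity metric below is an assumption, not MAEL's method.

```python
# Illustrative sketch of cross-task experience reuse: store (step, reward)
# records and retrieve the most relevant high-reward ones as few-shot
# examples. The word-overlap similarity is a toy placeholder.
from dataclasses import dataclass

@dataclass
class Experience:
    step: str      # description of a solved reasoning step
    reward: float  # quality score assigned during experiential learning

def similarity(a: str, b: str) -> float:
    return float(len(set(a.split()) & set(b.split())))  # toy overlap metric

def retrieve_few_shot(store: list[Experience], current_step: str,
                      k: int = 3) -> list[Experience]:
    """Rank stored experiences by relevance, then reward; return top-k."""
    ranked = sorted(store,
                    key=lambda e: (similarity(e.step, current_step), e.reward),
                    reverse=True)
    return ranked[:k]
```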
arXiv Detail & Related papers (2025-05-29T07:24:37Z)
- AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction-following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
- YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks [16.443149180969776]
Augmented Reality (AR) head-worn devices can uniquely improve the user experience of solving procedural day-to-day tasks. Such AR capabilities let AI agents see and listen to the actions users take, mirroring the multimodal capabilities of human users. Proactivity of AI agents, in turn, can help the human user detect and correct mistakes in agent-observed tasks.
arXiv Detail & Related papers (2025-01-16T08:06:02Z)
- SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs [9.117180930298813]
General-purpose AI agents struggle to efficiently utilize domain-specific knowledge and human expertise. We introduce the Standard Operational Procedure-guided Agent (SOP-agent), a novel framework for constructing domain-specific agents. SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks.
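One way to picture an SOP-guided agent is as a walk over a small decision graph, consulting the LLM only at branch points; the graph and `decide` stub below are hypothetical, not the SOP-agent API.

```python
# Illustrative sketch: the SOP is a decision graph the agent traverses,
# asking the LLM yes/no questions only where the procedure branches.
def decide(question: str) -> bool:
    return True  # stub: ask the LLM a yes/no question at a branch point

SOP = {
    "triage":   lambda: "escalate" if decide("Is the request urgent?") else "handle",
    "handle":   lambda: "done",
    "escalate": lambda: "done",
}

def run_sop(start: str = "triage") -> None:
    node = start
    while node != "done":
        print(f"executing SOP step: {node}")
        node = SOP[node]()
```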
arXiv Detail & Related papers (2025-01-16T06:14:58Z)