Related papers: Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

URL: http://arxiv.org/abs/2507.17131v2
Date: Fri, 10 Oct 2025 06:52:01 GMT
Title: Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance
Authors: Yufei He, Ruoyu Li, Alex Chen, Yue Liu, Yulin Chen, Yuan Sui, Cheng Chen, Yi Zhu, Luca Luo, Frank Yang, Bryan Hooi
Abstract summary: Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change. We propose the Adaptive Reflective Interactive Agent (ARIA) to continuously learn updated domain knowledge at test time. ARIA is deployed within TikTok Pay serving over 150 million monthly active users.
Score: 58.21767225794469
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
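
The timestamped knowledge repository described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`KnowledgeEntry`, `KnowledgeRepository`), the status strings, and the rule that newer guidance supersedes older guidance are all assumptions made for illustration. The sketch stores human guidance per topic with a timestamp, and flags conflicting guidance that is not newer than what is already stored so that a clarification query can be sent to an expert.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class KnowledgeEntry:
    # One piece of human guidance (hypothetical schema).
    topic: str
    rule: str
    timestamp: datetime


@dataclass
class KnowledgeRepository:
    # Hypothetical store: a history of entries per topic.
    entries: Dict[str, List[KnowledgeEntry]] = field(default_factory=dict)

    def latest(self, topic: str) -> Optional[KnowledgeEntry]:
        """Return the most recent entry for a topic, or None if unknown."""
        history = self.entries.get(topic, [])
        return max(history, key=lambda e: e.timestamp) if history else None

    def update(self, topic: str, rule: str, timestamp: datetime) -> str:
        """Store new guidance and report how it relates to existing knowledge."""
        current = self.latest(topic)
        conflicts = current is not None and current.rule != rule
        if conflicts and timestamp <= current.timestamp:
            # Conflicting guidance that is not newer: do not store it,
            # signal that a human should clarify (assumed policy).
            return "needs_clarification"
        self.entries.setdefault(topic, []).append(
            KnowledgeEntry(topic, rule, timestamp)
        )
        if current is None:
            return "added"
        return "superseded" if conflicts else "confirmed"


repo = KnowledgeRepository()
t_old, t_new = datetime(2025, 1, 1), datetime(2025, 6, 1)
status_first = repo.update("name_match", "exact match only", t_old)
status_second = repo.update("name_match", "allow fuzzy match", t_new)
status_stale = repo.update("name_match", "exact match only", t_old)
```

In this sketch, a `latest` result of `None` stands in for a detected knowledge gap, and the `needs_clarification` status stands in for the clarification queries the abstract mentions; how ARIA actually estimates uncertainty or resolves conflicts is not specified here.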

Related papers

$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge [58.03692489021332]
$τ$-Knowledge is an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs. We show that $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
arXiv Detail & Related papers (2026-03-04T18:34:47Z)
From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants [0.0]
Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts.
arXiv Detail & Related papers (2026-01-26T07:44:47Z)
Are We Evaluating the Edit Locality of LLM Model Editing Properly? [68.441768731381]
We find that existing specificity evaluation protocols are inadequate for this purpose. Existing specificity metrics are weakly correlated with the strength of specificity regularizers. We also find that current metrics lack sufficient sensitivity, rendering them ineffective at distinguishing the specificity performance of different methods.
arXiv Detail & Related papers (2026-01-24T07:07:21Z)
Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning [41.461840578204956]
Large Language Model (LLM)-based agents learn new tasks without catastrophic forgetting. Agent-Dice is a parameter fusion framework based on directional consensus evaluation. Experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance.
arXiv Detail & Related papers (2026-01-07T06:43:50Z)
Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems [18.83666259380603]
Large language models (LLMs) have been applied to dialog systems. LLMs are prone to errors in knowledge-intensive scenarios. Approaches based on retrieval augmented generation (RAG) and agents have emerged to improve factual accuracy.
arXiv Detail & Related papers (2025-06-28T11:26:31Z)
Active Test-time Vision-Language Navigation [60.69722522420299]
ATENA is a test-time active learning framework that enables a practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions.
arXiv Detail & Related papers (2025-06-07T02:24:44Z)
Agentic Knowledgeable Self-awareness [79.25908923383776]
KnowSelf is a data-centric approach that applies agents with knowledgeable self-awareness like humans. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge.
arXiv Detail & Related papers (2025-04-04T16:03:38Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
Efficient Open-world Reinforcement Learning via Knowledge Distillation and Autonomous Rule Discovery [5.680463564655267]
We present a rule-driven deep Q-learning agent (RDQ) as one possible implementation of the framework. We show that RDQ successfully extracts task-specific rules as it interacts with the world. In experiments, we show that the RDQ agent is significantly more resilient to novelties than the baseline agents.
arXiv Detail & Related papers (2023-11-24T04:12:50Z)
Self-Knowledge Guided Retrieval Augmentation for Large Language Models [59.771098292611846]
Large language models (LLMs) have shown superior performance without task-specific fine-tuning. Retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering. Self-Knowledge guided Retrieval augmentation (SKR) is a simple yet effective method which can let LLMs refer to the questions they have previously encountered.
arXiv Detail & Related papers (2023-10-08T04:22:33Z)
Decision Rule Elicitation for Domain Adaptation [93.02675868486932]
Human-in-the-loop machine learning is widely used in artificial intelligence (AI) to elicit labels from experts. In this work, we allow experts to additionally produce decision rules describing their decision-making. We show that decision rule elicitation improves domain adaptation of the algorithm and helps to propagate expert's knowledge to the AI model.
arXiv Detail & Related papers (2021-02-23T08:07:22Z)
Transferring Domain Knowledge with an Adviser in Continuous Tasks [0.0]
Reinforcement learning techniques are incapable of explicitly incorporating domain-specific knowledge into the learning process. We adapt the Deep Deterministic Policy Gradient (DDPG) algorithm to incorporate an adviser. Our experiments on OpenAI Gym benchmark tasks show that integrating domain knowledge through advisers expedites the learning and improves the policy towards better optima.
arXiv Detail & Related papers (2021-02-16T09:03:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.