Related papers: Beyond Syntax: Action Semantics Learning for App Agents

URL: http://arxiv.org/abs/2506.17697v1
Date: Sat, 21 Jun 2025 12:08:19 GMT
Title: Beyond Syntax: Action Semantics Learning for App Agents
Authors: Bohan Tang, Dezhao Luo, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao
Abstract summary: Action Semantics Learning (ASL) is a learning framework where the learning objective is capturing the semantics of the ground truth actions. ASL significantly improves the accuracy and generalisation of App agents over existing methods.
Score: 60.56331102288794
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advent of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with closed LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. With this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic reward to train the App agents in generating actions aligned with the semantics of ground truth actions, even when the syntactic forms differ. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments on offline and online smartphone App operation benchmarks show that ASL significantly improves the accuracy and generalisation of App agents over existing methods.
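
The contrast the abstract draws between syntax learning and semantics learning can be illustrated with a toy example. The sketch below is not the paper's SEmantic Estimator (SEE); it assumes a hypothetical two-button screen and a hand-written transition function (`TOY_UI`, `step`, and both reward functions are invented for illustration). It only shows why two clicks with different coordinate strings can deserve the same reward when they induce the same state transition in the user interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    """A rectangular tap target; all values are made up for this toy UI."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int
    target_screen: str  # screen shown after the button is tapped

    def contains(self, x: int, y: int) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


# Hypothetical UI: screen name -> buttons on that screen.
TOY_UI = {
    "home": [
        Button("settings", 0, 0, 100, 50, "settings"),
        Button("search", 0, 60, 100, 110, "search"),
    ],
}


def step(screen: str, action: tuple) -> str:
    """Toy transition function: the screen reached after applying `action`."""
    kind, x, y = action
    if kind != "click":
        return screen
    for button in TOY_UI.get(screen, []):
        if button.contains(x, y):
            return button.target_screen
    return screen  # the click hit nothing, so the state is unchanged


def syntax_reward(pred: tuple, truth: tuple) -> float:
    """Reward only an exact reproduction of the ground truth action."""
    return 1.0 if pred == truth else 0.0


def semantic_reward(screen: str, pred: tuple, truth: tuple) -> float:
    """Reward any action that induces the same state transition."""
    return 1.0 if step(screen, pred) == step(screen, truth) else 0.0


truth = ("click", 50, 25)  # ground truth: tap the settings button
pred = ("click", 60, 30)   # different coordinates, same button
miss = ("click", 50, 80)   # lands on the search button instead

print(syntax_reward(pred, truth))            # exact-match reward
print(semantic_reward("home", pred, truth))  # transition-match reward
print(semantic_reward("home", miss, truth))  # wrong transition
```

Under exact string matching the prediction `pred` is penalised even though it taps the same button as the ground truth, while the transition-based reward treats the two as equivalent and still rejects `miss`. A learned estimator such as the one the abstract describes would replace the hand-written `step` comparison used here.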

Related papers

LELANTE: LEveraging LLM for Automated ANdroid TEsting [6.112769800569302]
Existing testing approaches require developers to manually write scripts using tools such as Appium and Espresso to execute the corresponding test case. We introduce LELANTE, a novel framework that utilizes large language models (LLMs) to automate test case execution without requiring pre-written scripts. In experiments across 390 test cases spanning 10 popular Android applications, LELANTE achieved a 73% test execution success rate.
arXiv Detail & Related papers (2025-04-29T16:13:49Z)
PAFFA: Premeditated Actions For Fast Agents [19.576180667174366]
We introduce PAFFA, a method that makes LLMs faster and more accurate in completing tasks on the internet using a novel inference-time technique. PAFFA drastically reduces inference time tokens by 87% while maintaining robust performance. Unravel's ability to update its action library based on explorations allows generalization and adaptation to unseen websites.
arXiv Detail & Related papers (2024-12-10T22:51:31Z)
The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLMs). We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions. Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [31.509994889286183]
We introduce Language Agent Tree Search (LATS) -- the first general framework that synergizes the capabilities of language models (LMs) in reasoning, acting, and planning. A key feature of our approach is the incorporation of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism. LATS achieves state-of-the-art pass@1 accuracy (92.7%) for programming on HumanEval with GPT-4 and demonstrates gradient-free performance (average score of 75.9) comparable to gradient-based fine-tuning for web navigation on WebShop with GPT
arXiv Detail & Related papers (2023-10-06T17:55:11Z)
LASER: LLM Agent with State-Space Exploration for Web Navigation [57.802977310392755]
Large language models (LLMs) have been successfully adapted for interactive decision-making tasks like web navigation. Previous methods implicitly assume a forward-only execution mode for the model, where they only provide oracle trajectories as in-context examples. We propose to model the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task.
arXiv Detail & Related papers (2023-09-15T05:44:08Z)
OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs. Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
Guiding Pretraining in Reinforcement Learning with Large Language Models [133.32146904055233]
We describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM, rewards an agent for achieving goals suggested by a language model. By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop.
arXiv Detail & Related papers (2023-02-13T21:16:03Z)
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps. We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans. We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.