Exposing Limitations of Language Model Agents in Sequential-Task
Compositions on the Web
- URL: http://arxiv.org/abs/2311.18751v2
- Date: Mon, 5 Feb 2024 01:13:52 GMT
- Title: Exposing Limitations of Language Model Agents in Sequential-Task
Compositions on the Web
- Authors: Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur
- Abstract summary: Language model agents (LMAs) emerged as a promising paradigm for multi-step decision-making tasks.
Despite the promise, their performance on real-world applications is still underexplored.
We show that while existing LMAs achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language model agents (LMAs) recently emerged as a promising paradigm for
multi-step decision-making tasks, often outperforming humans and other
reinforcement learning agents. Despite the promise, their performance on
real-world applications that often involve combinations of tasks is still
underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50
new compositional web automation tasks reflecting more realistic assumptions.
We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a
94.0% average success rate on base tasks, their performance degrades to a 24.9%
success rate on compositional tasks. On the other hand, transferred LMAs
(finetuned only on base tasks) show a smaller generalization gap, dropping from
85.4% to 54.8%. By balancing the data distribution across tasks, we train a new
model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB
and achieves the best zero-shot performance on CompWoB (61.5%). While these
results highlight the promise of small-scale finetuned and transferred models
for task compositionality, their performance degrades further under different
instruction compositions that change the order of combination. In contrast to
the recent remarkable success of LMAs, our benchmark and detailed analysis
emphasize the necessity of building LMAs that are robust and generalizable to
task compositionality for real-world deployment.
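Notably, if per-step errors were independent, chaining two base tasks at a 94.0%
per-task success rate would still yield roughly 0.94^2 ≈ 88% success, far above the
observed 24.9%, so the degradation cannot be explained by error compounding alone.
The sketch below illustrates the all-or-nothing scoring that sequential task
compositions imply; the task and agent interfaces are hypothetical stand-ins for
illustration, not the paper's actual CompWoB harness.

```python
# Minimal sketch of all-or-nothing scoring for sequential task compositions.
# BaseTask/Agent are hypothetical interfaces, not the paper's actual harness.
from dataclasses import dataclass
from typing import List

class Agent:
    """Placeholder web agent: attempts one sub-task given the full instruction."""
    def act(self, combined_instruction: str, step_name: str) -> bool:
        raise NotImplementedError

@dataclass
class BaseTask:
    name: str          # e.g. "click-checkboxes"
    instruction: str   # natural-language instruction for this sub-task

def run_composition(agent: Agent, tasks: List[BaseTask]) -> bool:
    # The composed instruction concatenates all sub-task instructions up front;
    # the episode succeeds only if every sub-task is completed in order.
    combined = ", then ".join(t.instruction for t in tasks)
    return all(agent.act(combined, t.name) for t in tasks)

def success_rate(agent: Agent, episodes: List[List[BaseTask]]) -> float:
    return 100.0 * sum(run_composition(agent, ep) for ep in episodes) / len(episodes)
```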
Related papers
- PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR).
We employ a semi-automated task generation pipeline using Large Language Models (LLMs).
We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution.
arXiv Detail & Related papers (2024-10-31T17:53:12Z)
- Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts
We employ task-specific datasets and prompts to fine-tune two pruned LLaMA models with 5 billion and 4 billion parameters.
We propose a novel approach to fine-tune the LLaMA model under two primary constraints: task specificity and prompt effectiveness.
arXiv Detail & Related papers (2024-10-24T22:34:27Z)
- Probing the Robustness of Theory of Mind in Large Language Models
We introduce a novel dataset of 68 tasks for probing ToM in LLMs.
We evaluate the ToM performance of four state-of-the-art open-source LLMs on our dataset and the dataset introduced by Kosinski (2023).
We find a consistent tendency in all tested LLMs to perform poorly on tasks that require the realization that an agent has knowledge of automatic state changes in its environment.
arXiv Detail & Related papers (2024-10-08T18:13:27Z)
- Law of the Weakest Link: Cross Capabilities of Large Language Models
We show that Large Language Models (LLMs) exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component.
These results highlight the under-performance of LLMs in cross-capability tasks.
arXiv Detail & Related papers (2024-09-30T05:12:01Z)
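The weakest-link pattern above reduces to a one-line aggregation rule: the score on a
task mixing two capabilities is approximately the minimum of the individual capability
scores. A toy sketch with invented numbers, not results from the paper:

```python
# Toy illustration of the "Law of the Weakest Link" aggregation pattern:
# cross-capability performance tracks the weakest individual capability.
# The scores below are invented placeholders, not data from the paper.
def weakest_link_prediction(capability_scores: dict) -> float:
    return min(capability_scores.values())

scores = {"coding": 0.81, "tool_use": 0.55}   # hypothetical per-capability scores
print(weakest_link_prediction(scores))        # -> 0.55: bounded by the weaker skill
```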
- On the Worst Prompt Performance of Large Language Models
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
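A worst-case metric of this kind differs from the usual average only in its aggregation
step, as sketched below; `model_score` is a hypothetical callable standing in for any
scoring harness, not RobustAlpacaEval code.

```python
# Sketch of worst-case vs. average-case prompt evaluation over a set of
# semantically equivalent paraphrases of the same query.
from statistics import mean
from typing import Callable, List

def worst_prompt_score(model_score: Callable[[str], float],
                       paraphrases: List[str]) -> float:
    return min(model_score(p) for p in paraphrases)   # robustness: worst phrasing

def mean_prompt_score(model_score: Callable[[str], float],
                      paraphrases: List[str]) -> float:
    return mean(model_score(p) for p in paraphrases)  # the usual, more forgiving view
```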
- MAML-en-LLM: Model Agnostic Meta-Training of LLMs for Improved In-Context Learning
We propose MAML-en-LLM, a novel method for meta-training large language models (LLMs).
MAML-en-LLM can learn truly generalizable parameters that not only perform well on disjoint tasks but also adapt to unseen tasks.
We demonstrate that MAML-en-LLM outperforms baselines in settings with a limited amount of training data on both seen and unseen domains.
arXiv Detail & Related papers (2024-05-19T04:49:42Z)
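For readers unfamiliar with the underlying recipe, the sketch below shows a generic
first-order MAML loop on a toy regression problem: adapt a clone on a task's support
set, evaluate it on the task's query set, and fold the resulting gradients back into
the shared initialization. This is standard FOMAML for illustration only, not
MAML-en-LLM's LLM-specific procedure.

```python
# Generic first-order MAML (FOMAML) on toy sine-wave regression, illustrating
# the meta-training pattern MAML-en-LLM builds on. Not the paper's method.
import copy
import math
import torch
import torch.nn as nn

def sample_task():
    # Each task is a sine wave with its own amplitude and phase.
    amp = torch.rand(1).item() * 4.0 + 1.0
    phase = torch.rand(1).item() * math.pi
    def draw(batch: int = 16):
        x = torch.rand(batch, 1) * 10.0 - 5.0
        return x, amp * torch.sin(x + phase)
    return draw

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):                              # tasks per meta-batch
        draw = sample_task()
        fast = copy.deepcopy(model)                 # clone adapted in the inner loop
        inner_opt = torch.optim.SGD(fast.parameters(), lr=1e-2)
        for _ in range(3):                          # inner adaptation steps
            x_s, y_s = draw()                       # support set
            inner_opt.zero_grad()
            loss_fn(fast(x_s), y_s).backward()
            inner_opt.step()
        x_q, y_q = draw()                           # query set from the same task
        loss_fn(fast(x_q), y_q).backward()
        # FOMAML: apply the adapted clone's query gradients to the meta-model.
        for p, fp in zip(model.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()
```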
- Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
When used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z)
- Mixed Distillation Helps Smaller Language Model Better Reasoning
We introduce the Mixed Distillation (MD) framework, which capitalizes on the strengths of Program of Thought (PoT) and Chain of Thought (CoT) capabilities within large language models (LLMs).
Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
arXiv Detail & Related papers (2023-12-17T14:28:28Z)
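The PoT/CoT distinction is easiest to see in the training targets themselves: for one
question the teacher can emit a free-form natural-language rationale and an executable
program, and only the latter can be verified by running it. A sketch with invented
example records, not the paper's data or pipeline:

```python
# Invented example of mixed CoT/PoT distillation targets for one question.
# Records and formats are illustrative; this is not the paper's pipeline.
question = "A shop sells pens at $3 each. How much do 7 pens cost?"

cot_target = ("Each pen costs $3 and we buy 7 pens, "
              "so the total is 3 * 7 = 21. The answer is 21.")

pot_target = "price, count = 3, 7\nanswer = price * count\nprint(answer)"

# Unlike free-form CoT text, a PoT rationale can be checked by executing it.
namespace = {}
exec(pot_target, namespace)   # prints: 21

# The student is fine-tuned on both target formats for the same input.
training_examples = [
    {"input": question, "target": cot_target, "format": "cot"},
    {"input": question, "target": pot_target, "format": "pot"},
]
```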
- Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks.
We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023), for tackling such challenging natural language tasks.
BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%.
arXiv Detail & Related papers (2023-10-23T17:29:48Z)
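The branch-solve-merge control flow itself is a short program over an LLM call, as
sketched below; `llm` is a hypothetical text-completion callable and the prompts are
illustrative, not the prompts used in the paper.

```python
# Minimal Branch-Solve-Merge skeleton: branch a task into sub-problems,
# solve each independently, then merge the partial solutions.
# `llm` is a hypothetical stand-in for any text-completion function.
from typing import Callable, List

def branch_solve_merge(llm: Callable[[str], str], task: str) -> str:
    # Branch: ask the model to decompose the task into parallel sub-problems.
    plan = llm(f"List independent sub-problems, one per line, for: {task}")
    subproblems = [line.strip() for line in plan.splitlines() if line.strip()]

    # Solve: handle each sub-problem in isolation.
    solutions: List[str] = [llm(f"Solve: {sp}") for sp in subproblems]

    # Merge: fuse the partial solutions into one final answer.
    joined = "\n".join(f"- {s}" for s in solutions)
    return llm(f"Combine these partial solutions into a final answer "
               f"for '{task}':\n{joined}")
```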