LongAgent: Scaling Language Models to 128k Context through Multi-Agent
Collaboration
- URL: http://arxiv.org/abs/2402.11550v2
- Date: Wed, 13 Mar 2024 07:16:42 GMT
- Title: LongAgent: Scaling Language Models to 128k Context through Multi-Agent
Collaboration
- Authors: Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi
Zhang, Xuanjing Huang
- Abstract summary: LongAgent is based on multi-agent collaboration and scales to a context of 128K.
An agent team instantiated with LLaMA-7B achieves significant improvements over GPT-4 in tasks such as 128k-long text retrieval and multi-hop question answering.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated impressive performance in
understanding language and executing complex reasoning tasks. However, LLMs
with long context windows have been notorious for their expensive training
costs and high inference latency. Even the most advanced models such as GPT-4
and Claude2 often make mistakes when processing inputs of over $100k$ tokens, a
phenomenon also known as \textit{lost in the middle}. In this paper, we propose
\textsc{LongAgent}, a method based on multi-agent collaboration, which scales
LLMs (e.g., LLaMA) to a context of 128K and demonstrates potential superiority
in long-text processing compared to GPT-4. In \textsc{LongAgent}, a leader is
responsible for understanding user intent and directing team members to acquire
information from documents. Due to members' hallucinations, it is non-trivial
for a leader to obtain accurate information from the responses of dozens to
hundreds of members. To address this, we develop an \textit{inter-member
communication} mechanism to resolve response conflicts caused by hallucinations
through information sharing. Our experimental results indicate that
\textsc{LongAgent} offers a promising alternative for long-text processing. The
agent team instantiated with LLaMA-7B achieves significant improvements over GPT-4
in tasks such as 128k-long text retrieval and multi-hop question answering.
Related papers
- TRAPDOC: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents [4.753535328327316]
Over-reliance on large language models (LLMs) is emerging as a significant social issue. We propose a method that injects imperceptible phantom tokens into documents, causing LLMs to generate outputs that appear plausible to users but are in fact incorrect. Based on this technique, we introduce TRAPDOC, a framework designed to deceive over-reliant LLM users.
arXiv Detail & Related papers (2025-05-30T07:16:53Z) - MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding [40.52017994491893]
MDocAgent is a novel RAG and multi-agent framework that leverages both text and images.
Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent.
Preliminary experiments on five benchmarks demonstrate the effectiveness of MDocAgent, achieving an average improvement of 12.1%.
arXiv Detail & Related papers (2025-03-18T06:57:21Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark comprising 6K long and complex hallucination texts.
LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models [73.13933847198395]
We propose a training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding.
The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output.
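The split-then-aggregate strategy can be sketched as a generic map-reduce loop. In this hedged toy version, `map_fn` and `reduce_fn` are placeholders for LLM calls, and the chunking is plain character slicing rather than anything the paper specifies.

```python
# Minimal divide-and-conquer sketch in the spirit of LLM x MapReduce
# (assumed simplification; map_fn/reduce_fn stand in for LLM calls).
from typing import Callable

def map_reduce_read(document: str, chunk_size: int,
                    map_fn: Callable[[str], str],
                    reduce_fn: Callable[[list[str]], str]) -> str:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    intermediate = [map_fn(c) for c in chunks]   # each chunk is read independently
    return reduce_fn(intermediate)               # aggregate partial answers

# Toy usage: count occurrences of a token across chunks.
doc = "cat dog cat bird cat"
result = map_reduce_read(doc, 8,
                         lambda c: str(c.count("cat")),
                         lambda xs: str(sum(map(int, xs))))  # "3"
```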
arXiv Detail & Related papers (2024-10-12T03:13:44Z) - Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding [28.191029786204624]
We introduce the Long Question Coreference Adaptation (LQCA) method to enhance the performance of large language models (LLMs).
This framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively.
Our code is public at https://github.com/OceannTwT/LQCA.
arXiv Detail & Related papers (2024-10-02T15:39:55Z) - GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models [58.08177466768262]
Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.
We introduce GraphReader, a graph-based agent system designed to handle long texts by structuring them into a graph and employing an agent to explore this graph autonomously.
Experimental results on the LV-Eval dataset reveal that GraphReader, using a 4k context window, consistently outperforms GPT-4-128k across context lengths from 16k to 256k by a large margin.
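The "structure text into a graph, then explore it" idea can be illustrated with a hypothetical miniature (not GraphReader's actual construction): chunks become nodes, and an edge links two chunks that share a word, so an agent could hop between related passages instead of reading the whole text.

```python
# Toy chunk graph in the spirit of GraphReader (assumed simplification).
from collections import defaultdict
from itertools import combinations

def build_chunk_graph(chunks: list[str]) -> dict[int, set[int]]:
    graph: dict[int, set[int]] = defaultdict(set)
    for (i, a), (j, b) in combinations(enumerate(chunks), 2):
        # Link chunks that share at least one word (a crude proxy for
        # shared entities or key elements).
        if set(a.lower().split()) & set(b.lower().split()):
            graph[i].add(j)
            graph[j].add(i)
    return dict(graph)

chunks = ["alice met bob", "bob went home", "cold weather today"]
graph = build_chunk_graph(chunks)  # {0: {1}, 1: {0}}: only the "bob" chunks connect
```

A real system would extract normalized entities rather than raw words, but the resulting structure an agent traverses is the same shape.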
arXiv Detail & Related papers (2024-06-20T17:57:51Z) - Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering [9.86691461253151]
We introduce a novel method for attribution in contextual question answering, leveraging the hidden state representations of large language models (LLMs).
Our approach bypasses the need for extensive model retraining and retrieval model overhead, offering granular attributions and preserving the quality of generated answers.
We present Verifiability-granular, an attribution dataset which has token level annotations for LLM generations in the contextual question answering setup.
arXiv Detail & Related papers (2024-05-28T09:12:44Z) - Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom [4.142301960178498]
SwordsmanImp is the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature.
It includes 200 carefully handcrafted questions, all annotated on which Gricean maxims have been violated.
Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions.
Other models, including GPT-3.5 and several open-source models, demonstrate a lower accuracy ranging from 20% to 60% on multiple-choice questions.
arXiv Detail & Related papers (2024-04-30T12:43:53Z) - CuriousLLM: Elevating Multi-Document Question Answering with LLM-Enhanced Knowledge Graph Reasoning [0.9295048974480845]
We propose CuriousLLM, an enhancement that integrates a curiosity-driven reasoning mechanism into an LLM agent. This mechanism enables the agent to generate relevant follow-up questions, thereby guiding the information retrieval process more efficiently. Our experiments show that CuriousLLM significantly boosts LLM performance in multi-document question answering (MD-QA).
arXiv Detail & Related papers (2024-04-13T20:43:46Z) - A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts [35.68159165639245]
We propose ReadAgent, an agent system that increases effective context length up to 20x in our experiments.
Inspired by how humans interactively read long documents, we implement ReadAgent as a simple prompting system.
We evaluate ReadAgent against baselines using retrieval methods, using the original long contexts, and using the gist memories.
arXiv Detail & Related papers (2024-02-15T05:40:21Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models [58.54538318912159]
M4LE is a benchmark for evaluating the long-sequence capability of large language models (LLMs).
M4LE is based on a diverse NLP task pool comprising 36 NLP task types and 12 domains.
We conducted a systematic evaluation on 11 well-established LLMs, especially those optimized for long-sequence inputs.
arXiv Detail & Related papers (2023-10-30T03:11:30Z) - Building Cooperative Embodied Agents Modularly with Large Language Models [104.57849816689559]
We address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments.
We harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework.
Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication.
arXiv Detail & Related papers (2023-07-05T17:59:27Z) - Can Large Language Models Transform Computational Social Science? [79.62471267510963]
Large Language Models (LLMs) are capable of performing many language processing tasks zero-shot (without training data).
This work provides a road map for using LLMs as Computational Social Science tools.
arXiv Detail & Related papers (2023-04-12T17:33:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.