An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks
- URL: http://arxiv.org/abs/2411.10145v1
- Date: Fri, 15 Nov 2024 12:39:02 GMT
- Title: An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks
- Authors: Yijiong Yu
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long texts.
Their performance significantly degrades when it comes to numerical calculations in long contexts.
We propose a workflow which decomposes a numeric-involved long-context task into 4 low-level subtasks.
Results on 2 numeric-involved long-context benchmarks demonstrate that our workflow not only improves accuracy but also significantly reduces the cost of API calls.
- Score: 0.0
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long texts and achieve almost perfect performance in traditional retrieval tasks. However, their performance degrades significantly when numerical calculations are required over a long context. Current LLMs typically cannot solve numeric-involved long-context tasks in normal settings because of their inherent limitations in handling complex and massive information simultaneously. Some CoT-like prompting methods can improve accuracy but demand massive numbers of output tokens, which is costly and slow. To address this issue, we propose a workflow that decomposes a numeric-involved long-context task into 4 low-level subtasks: judging, extracting, processing with code, and conclusion. The first 2 subtasks are relatively simple, which allows us to use smaller models to process the long context efficiently. When numerical calculations are required, we use code generated by LLMs to sidestep their weakness at calculation. Results on 2 numeric-involved long-context benchmarks demonstrate that our workflow not only improves accuracy but also significantly reduces the cost of API calls.
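The four subtasks named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`judge`, `extract`, `process_with_code`, `conclude`, `run_workflow`) are hypothetical, the keyword test and the regular expression merely stand in for calls to a smaller model, and the `code` string stands in for code that an LLM would generate.

```python
import re
from typing import Dict, List


def judge(chunk: str, keyword: str) -> bool:
    # Subtask 1 (judging): assumed stand-in for a small model that decides
    # whether a chunk of the long context is relevant to the question.
    return keyword.lower() in chunk.lower()


def extract(chunk: str) -> List[float]:
    # Subtask 2 (extracting): assumed stand-in for a small model that pulls
    # the numeric values out of a relevant chunk.
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", chunk)]


def process_with_code(values: List[float], code: str) -> float:
    # Subtask 3 (processing with code): run a snippet over the extracted
    # values. The snippet must assign its answer to `result`.
    # Restricting builtins is only a toy precaution, not a real sandbox.
    allowed = {"sum": sum, "len": len, "max": max, "min": min}
    namespace: Dict[str, object] = {"values": values}
    exec(code, {"__builtins__": allowed}, namespace)
    return float(namespace["result"])


def conclude(question: str, result: float) -> str:
    # Subtask 4 (conclusion): turn the computed value into a final answer.
    return f"{question} -> {result:g}"


def run_workflow(chunks: List[str], keyword: str, question: str, code: str) -> str:
    relevant = [c for c in chunks if judge(c, keyword)]
    values = [v for c in relevant for v in extract(c)]
    return conclude(question, process_with_code(values, code))


chunks = [
    "Alice paid 12.5 dollars for lunch.",
    "The weather was sunny all week.",
    "Bob paid 30 dollars for dinner.",
]
print(run_workflow(chunks, "paid", "Total paid?", "result = sum(values)"))
```

In this toy run only the two chunks mentioning "paid" pass the judging step, so the arithmetic is done by executed code rather than by a model reading the whole context.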
Related papers
- Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? [36.83397306207386]
We evaluate the capabilities of 17 leading Large Language Models (LLMs).
Strikingly, many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance.
We find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows.
arXiv Detail & Related papers (2024-11-07T18:59:27Z)
- Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks [0.8425561594225592]
This study introduces a novel framework for training smaller language models in function calling.
It focuses on specific logical and mathematical reasoning tasks.
The approach aims to improve performances of small-scale models for these tasks using function calling.
arXiv Detail & Related papers (2024-10-24T16:27:35Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- Steering Large Language Models between Code Execution and Textual Reasoning [22.279107036500083]
Textual reasoning has inherent limitations in solving tasks with challenges in math, logics, optimization, and searching.
The recently released OpenAI GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency in integrating code generation and execution.
We propose three methods to better steer LLM code/text generation and achieve a notable improvement.
arXiv Detail & Related papers (2024-10-04T15:44:47Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)
- Learning to Perform Complex Tasks through Compositional Fine-Tuning of Language Models [20.173322408302134]
Compositional fine-tuning (CFT) is an approach based on explicitly decomposing a target task into component tasks.
We show that CFT outperforms end-to-end learning even with equal amounts of data.
arXiv Detail & Related papers (2022-10-23T03:22:34Z)