Designing Deep Learning Frameworks for LLMs:Challenges, Expectations, and Opportunities
- URL: http://arxiv.org/abs/2506.13114v1
- Date: Mon, 16 Jun 2025 05:45:03 GMT
- Title: Designing Deep Learning Frameworks for LLMs:Challenges, Expectations, and Opportunities
- Authors: Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Jiacong Wu, An Guo, Jiawei Shen, Bingzhuo Li, Zhenyu Chen,
- Abstract summary: Large language models (LLMs) drive significant advancements in real industry applications.<n>LLMs rely on DL frameworks for efficient model construction, distributed execution, and optimized deployment.<n>Poor usability, limited functionality, and bugs in DL frameworks may hinder development efficiency and cause severe failures or resource waste.
- Score: 32.10519814607409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) drive significant advancements in real industry applications. LLMs rely on DL frameworks for efficient model construction, distributed execution, and optimized deployment. Their large parameter scale and long execution cycles place extreme demands on DL frameworks in terms of scalability, stability, and efficiency. Therefore, poor usability, limited functionality, and subtle bugs in DL frameworks may hinder development efficiency and cause severe failures or resource waste. However, a fundamental question remains underinvestigated, i.e., What challenges do DL frameworks face in supporting LLMs? To seek an answer, we investigate these challenges through a large-scale analysis of issue reports from three major DL frameworks (MindSpore, PyTorch, TensorFlow) and eight associated LLM toolkits (e.g., Megatron). We construct a taxonomy of LLM-centric bugs, requirements, and user questions and enrich it through interviews with 11 LLM users and eight DL framework developers, uncovering key technical challenges and misalignments between user needs and developer priorities. Our contributions are threefold: (1) we develop a comprehensive taxonomy comprising four question themes (nine sub-themes), four requirement themes (15 sub-themes), and ten bug themes (45 sub-themes); (2) we assess the perceived importance and priority of these challenges based on practitioner insights; and (3) we identify five key findings across the LLM development and propose five actionable recommendations to improve the reliability, usability, and testability of DL frameworks. Our results highlight critical limitations in current DL frameworks and offer concrete guidance for advancing their support for the next generation of LLM construction and applications.
Related papers
- Exploring the Necessity of Reasoning in LLM-based Agent Scenarios [74.35956310688164]
We propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving.<n>Our findings address four research questions: LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes.<n>LRMs' enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies.
arXiv Detail & Related papers (2025-03-14T04:34:31Z) - LLMs are Also Effective Embedding Models: An In-depth Overview [41.34518828980852]
Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks.<n>Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs like GPT, LLaMA, and Mistral.
arXiv Detail & Related papers (2024-12-17T06:48:24Z) - LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework [0.0]
Large language models (LLMs) offer potential for dynamic topic refinement and discovery, yet their application often incurs high API costs.<n>To address these challenges, we propose the LLM-assisted Iterative Topic Augmentation framework (LITA)<n>LITA integrates user-provided seeds with embedding-based clustering and iterative refinement.
arXiv Detail & Related papers (2024-12-17T01:43:44Z) - Enhancing Financial Question Answering with a Multi-Agent Reflection Framework [0.0]
We propose a multi-agent framework incorporating a critic agent that reflects on the reasoning steps and final answers for each question.
Our results indicate that this framework significantly improves performance compared to single-agent reasoning.
Our framework performs on par with, and in some cases surpasses, larger single-agent LLMs such as LLaMA3.1-405B and GPT-4o-mini.
arXiv Detail & Related papers (2024-10-29T04:58:07Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's peiceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, theThoughtived performance of GPT-4 has experienced a cliff like decline in problems after September 2021 consistently across all the difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a benchmark for Fine-grained Constraints Following Benchmark for Large Language Models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Large Language Model Cascades with Mixture of Thoughts Representations
for Cost-efficient Reasoning [19.472937476936636]
Large language models (LLMs) have exhibited remarkable performance in a variety of tasks, but this strong performance often comes with the high expense of using paid API services.
In this paper, we are motivated to study building an LLM cascade to save the cost of using LLMs.
Our proposed cascades can achieve performance comparable to using solely the stronger LLM but require only 40% of its cost.
arXiv Detail & Related papers (2023-10-04T18:21:17Z) - CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [74.22729793816451]
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability.
We propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization.
We evaluate CREATOR on MATH and TabMWP benchmarks, respectively consisting of challenging math competition problems.
arXiv Detail & Related papers (2023-05-23T17:51:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.