Related papers: HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions

URL: http://arxiv.org/abs/2511.18715v1
Date: Mon, 24 Nov 2025 03:13:45 GMT
Title: HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions
Authors: Shaoyin Ma, Jie Song, Huiqiong Wang, Li Sun, Mingli Song
Abstract summary: HuggingR$^{4}$ is a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. It attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing the existing method by 26.51% and 33.25%, respectively.
Score: 50.61510609116118
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) have made remarkable progress in their ability to interact with external interfaces. Selecting reasonable external interfaces has thus become a crucial step in constructing LLM agents. In contrast to invoking API tools, directly calling AI models across different modalities from the community (e.g., HuggingFace) poses challenges due to the vast scale (> 10k), metadata gaps, and unstructured descriptions. Current methods for model selection often involve incorporating entire model descriptions into prompts, resulting in prompt bloat, wastage of tokens, and limited scalability. To address these issues, we propose HuggingR$^{4}$, a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. Specifically, we first perform multiple rounds of reasoning and retrieval to get a coarse list of candidate models. Then, we conduct fine-grained refinement by analyzing candidate model descriptions, followed by reflection to assess results and determine if retrieval scope expansion is necessary. This method reduces token consumption considerably by decoupling user query processing from complex model description handling. Through a pre-established vector database, complex model descriptions are stored externally and retrieved on demand, allowing the LLM to concentrate on interpreting user intent while accessing only relevant candidate models without prompt bloat. In the absence of standardized benchmarks, we construct a multimodal human-annotated dataset comprising 14,399 user requests across 37 tasks and conduct a thorough evaluation. HuggingR$^{4}$ attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing the existing method by 26.51% and 33.25%, respectively, on GPT-4o-mini.
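
The loop described in the abstract (reason, retrieve, refine, reflect, then widen the retrieval scope if needed) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the model names and descriptions are invented, the "vector database" is an in-memory dictionary with bag-of-words cosine similarity, and the functions `reason`, `refine`, and `reflect` are hypothetical placeholders for steps that the paper assigns to an LLM.

```python
import math
from collections import Counter

# Toy stand-in for the external vector database: model id -> description.
# All names and descriptions are invented for illustration.
MODEL_CARDS = {
    "demo/image-captioner": "generate a text caption describing an image",
    "demo/speech-to-text": "transcribe speech audio into text",
    "demo/text-summarizer": "summarize a long text document into a short text",
    "demo/image-classifier": "classify an image into object categories",
}

def embed(text):
    """Bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, k):
    """Return the ids of the k descriptions most similar to the query."""
    q = embed(query)
    ranked = sorted(
        MODEL_CARDS, key=lambda m: cosine(q, embed(MODEL_CARDS[m])), reverse=True
    )
    return ranked[:k]

def reason(request):
    """Placeholder for LLM reasoning: reduce the request to a search query."""
    stop = {"please", "me", "a", "the", "for", "this", "i", "want", "to", "of"}
    return " ".join(w for w in request.lower().split() if w not in stop)

def refine(request, candidates):
    """Placeholder for LLM refinement: inspect only the retrieved descriptions."""
    q = embed(request)
    return max(candidates, key=lambda m: cosine(q, embed(MODEL_CARDS[m])))

def reflect(request, choice, threshold=0.2):
    """Placeholder for LLM reflection: accept the choice or ask for a wider search."""
    return cosine(embed(request), embed(MODEL_CARDS[choice])) >= threshold

def select_model(request, k=1, max_rounds=3):
    """Run reason -> retrieve -> refine -> reflect, widening k on rejection."""
    choice = None
    for _ in range(max_rounds):
        query = reason(request)
        candidates = retrieve(query, k)
        choice = refine(request, candidates)
        if reflect(request, choice):
            return choice
        k += 1  # reflection rejected the result: expand the retrieval scope
    return choice

if __name__ == "__main__":
    print(select_model("transcribe this speech audio to text"))
```

The point of the structure is that the full set of model descriptions never enters a single prompt: only the `k` retrieved candidates are examined in the refinement step, which is how the abstract motivates the reduction in token consumption.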

Related papers

LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection [15.182368486530128]
We propose LOCUS, a method that produces low-dimensional vector embeddings that compactly represent a language model's capabilities across queries. LOCUS is an attention-based approach that generates embeddings by a deterministic forward pass over query encodings and evaluation scores via an encoder model. We train a correctness predictor that uses model embeddings and query encodings to achieve state-of-the-art routing accuracy on unseen queries.
arXiv Detail & Related papers (2026-01-28T22:09:42Z)
Adaptation of Embedding Models to Financial Filings via LLM Distillation [10.744318713371383]
This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general-purpose retrieval embedding model as the foundation. Our method yields an average improvement of 27.7% in MRR@5 and 44.6% in mean DCG@5 across 14 financial filing types, measured over 21,800 query-document pairs.
arXiv Detail & Related papers (2025-12-08T22:43:14Z)
Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search [54.987957691350665]
Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. We propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search.
arXiv Detail & Related papers (2025-08-28T08:51:51Z)
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
Structured List-Grounded Question Answering [11.109829342410265]
Document-grounded dialogue systems aim to answer user queries by leveraging external information. Previous studies have mainly focused on handling free-form documents, often overlooking structured data such as lists. This paper aims to enhance question answering systems for better interpretation and use of structured lists.
arXiv Detail & Related papers (2024-10-04T22:21:43Z)
ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
iSNEAK: Partial Ordering as Heuristics for Model-Based Reasoning in Software Engineering [11.166755101891402]
iSNEAK is an incremental human-in-the-loop AI problem solver. We propose the use of partial orderings and tools like iSNEAK to solve the information overload problem.
arXiv Detail & Related papers (2023-10-29T19:21:37Z)
Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models [1.0878040851637998]
With over 200,000 models in the Hugging Face ecosystem, users grapple with selecting and optimizing models to suit multifaceted workflows and data domains. Here, we propose a context-aware routing system, Tryage, that leverages a language model router for optimal selection of expert models from a model library.
arXiv Detail & Related papers (2023-08-22T17:48:24Z)
Earning Extra Performance from Restrictive Feedbacks [41.05874087063763]
We set up a challenge named Earning eXtra PerformancE from restriCTive feEDdbacks (EXPECTED) to describe this form of model tuning problems. The goal of the model provider is to eventually deliver a satisfactory model to the local user(s) by utilizing the feedbacks. We propose to characterize the geometry of the model performance with regard to model parameters through exploring the parameters' distribution.
arXiv Detail & Related papers (2023-04-28T13:16:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.