LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought
- URL: http://arxiv.org/abs/2508.11280v2
- Date: Mon, 25 Aug 2025 06:40:23 GMT
- Title: LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought
- Authors: Ruiyan Qi, Congding Wen, Weibo Zhou, Jiwei Li, Shangsong Liang, Lingbo Li
- Abstract summary: We propose Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15% relative quality gains over baselines.
- Score: 18.539462131974215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$abel-Free $\textbf{E}$valuation of LLMs on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15\% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ($p<0.05$). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
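For concreteness, the reported relative quality gains can be read as a percentage improvement over a baseline score; a minimal sketch, where the scalar quality scores are hypothetical placeholders:

```python
def relative_gain(score: float, baseline: float) -> float:
    """Relative quality gain in percent: how much `score` improves on `baseline`."""
    return (score - baseline) / baseline * 100.0

# Hypothetical quality scores for illustration only.
print(round(relative_gain(0.842, 0.802), 2))  # -> 4.99
```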
Related papers
- FORESTLLM: Large Language Models Make Random Forest Great on Few-shot Tabular Learning [20.27406245916013]
We propose a novel framework that unifies the structural inductive biases of decision forests with the semantic reasoning capabilities of large language models (LLMs). Our method is two-fold. First, we introduce a semantic splitting criterion in which the LLM evaluates candidate partitions based on their coherence over both labeled and unlabeled data, enabling the induction of more robust and generalizable tree structures under few-shot supervision. Second, we propose a one-time in-context inference mechanism for leaf node stabilization, where the LLM distills the decision path and its supporting examples into a concise, deterministic prediction, replacing noisy empirical estimates with semantically informed outputs.
arXiv Detail & Related papers (2026-01-16T14:08:51Z)
- RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z)
- Toward a unified framework for data-efficient evaluation of large language models [12.922829524961813]
LEGO-IRT is a unified and flexible framework for data-efficient evaluation of large language models. It supports both binary and continuous evaluation metrics. We show that LEGO-IRT achieves stable capability estimates using just $3\%$ of the total evaluation items.
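IRT-based evaluation frameworks model the probability that a model of latent ability answers a given item correctly. As an illustration only (the exact model LEGO-IRT uses may differ), here is the standard two-parameter logistic (2PL) item response function, with discrimination `a` and difficulty `b`:

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct | ability theta) = sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, success probability is exactly 0.5.
print(irt_2pl(0.0, a=1.5, b=0.0))  # -> 0.5
```

Fitting such item parameters on a small calibration set is what lets IRT-style methods estimate ability from a fraction of the full benchmark.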
arXiv Detail & Related papers (2025-10-05T06:13:50Z)
- Towards a Comprehensive Scaling Law of Mixture-of-Experts [54.117786590884776]
We propose a comprehensive and precise joint MoE scaling law that considers all essential factors. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size. Our proposed MoE scaling law could function as an accurate and insightful guide to facilitate future MoE model design and training.
arXiv Detail & Related papers (2025-09-28T06:35:34Z)
- Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses [23.308803725940383]
DeCE is a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts). We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding.
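The precision/recall decomposition described above can be illustrated with sets of atomic claims and required concepts; a minimal sketch, where the claim and concept sets are hypothetical placeholders rather than DeCE's actual pipeline:

```python
def decomposed_scores(response_claims: set, supported_claims: set,
                      required_concepts: set, covered_concepts: set):
    """Precision: fraction of the response's claims that are factually supported.
    Recall: fraction of the required concepts the response actually covers."""
    precision = len(response_claims & supported_claims) / len(response_claims)
    recall = len(required_concepts & covered_concepts) / len(required_concepts)
    return precision, recall

# Toy example: 3 of 4 claims supported, 1 of 2 required concepts covered.
p, r = decomposed_scores({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c3"},
                         {"k1", "k2"}, {"k1"})
print(p, r)  # -> 0.75 0.5
```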
arXiv Detail & Related papers (2025-09-19T15:36:02Z)
- mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning [74.97363626515236]
We propose a $\textbf{M}$ultilingual and Scalable Benchmark for $\textbf{S}$kill-based $\textbf{Co}$mmonsense $\textbf{Re}$asoning ($\textbf{mSCoRe}$). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs' reasoning capabilities. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense.
arXiv Detail & Related papers (2025-08-13T18:59:02Z)
- Evaluating Large Language Models as Expert Annotators [17.06186816803593]
This paper investigates whether top-performing language models can serve as direct alternatives to human expert annotators. We evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Our empirical results reveal that individual LLMs equipped with inference-time techniques show only marginal or even negative performance gains.
arXiv Detail & Related papers (2025-08-11T10:19:10Z)
- Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports [0.0]
We propose a fine-tuned vision-language model (VLM) based on Qwen2.5-VL-7B. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% markdown TEDS score.
arXiv Detail & Related papers (2025-08-04T04:54:00Z)
- SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
$\textbf{SCAN}$ (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models. SCAN incorporates four key components: TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy; RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; and a PC$^2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves significantly higher accuracy than the classic LLM-as-a-Judge method.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
- DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization? [17.97981669263259]
Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across machine translation and text summarization evaluation tasks.
arXiv Detail & Related papers (2025-04-10T20:39:18Z)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- Language Models can Self-Improve at State-Value Estimation for Better Search [23.61729554517216]
We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions by reasoning explicitly about state transitions. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.
arXiv Detail & Related papers (2025-03-04T18:58:11Z)
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates [51.633266497799745]
Hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space. We introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs; and (iii) a brand-new inference scaling system.
arXiv Detail & Related papers (2025-02-10T18:51:47Z)
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z)
- Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks [20.072783454089098]
This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling.
arXiv Detail & Related papers (2024-10-11T00:56:37Z)
- Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base and Instruction-Tuned Language Models [50.15455336684986]
We evaluate the effectiveness of LogProbs and basic prompting to measure semantic plausibility.
We find that LogProbs offers a more reliable measure of semantic plausibility than direct zero-shot prompting.
We conclude that, even in the era of prompt-based evaluations, LogProbs constitute a useful metric of semantic plausibility.
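Sequence-level log probability, used above as a plausibility measure, is simply the sum of per-token log probabilities; a minimal sketch with hypothetical per-token values (a real setup would obtain these from a model's scoring interface):

```python
def sequence_log_prob(token_log_probs: list) -> float:
    """Log probability of a sequence = sum of its per-token log probabilities."""
    return sum(token_log_probs)

# Hypothetical per-token log probs for two candidate sentences.
plausible = [-1.2, -0.8, -0.5]    # e.g. "the cat slept"
implausible = [-1.2, -3.9, -4.1]  # e.g. "the cat legislated"
print(sequence_log_prob(plausible) > sequence_log_prob(implausible))  # -> True
```

Comparing these sums is the LogProbs baseline; the paper's finding is that this beats asking the model directly via zero-shot prompting.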
arXiv Detail & Related papers (2024-03-21T22:08:44Z)
- You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM [65.74934004876914]
Retrieval-enhanced language models (LMs) condition their predictions on text retrieved from large external datastores.
One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model.
We empirically measure the effectiveness of our approach on two English language modeling datasets.
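The $k$NN-LM interpolation described above mixes the base LM's next-token distribution with a distribution built from retrieved neighbors, $p(w) = \lambda \, p_{kNN}(w) + (1-\lambda) \, p_{LM}(w)$; a minimal sketch over toy dictionaries, where the distributions and the $\lambda$ value are illustrative only:

```python
def interpolate(p_lm: dict, p_knn: dict, lam: float = 0.25) -> dict:
    """kNN-LM style mixture: p(w) = lam * p_knn(w) + (1 - lam) * p_lm(w)."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0)
            for w in vocab}

p = interpolate({"cat": 0.6, "dog": 0.4}, {"cat": 0.2, "dog": 0.8}, lam=0.25)
# cat: 0.25*0.2 + 0.75*0.6 = 0.50; dog: 0.25*0.8 + 0.75*0.4 = 0.50
```

Because both inputs are valid distributions, the mixture still sums to one; the paper's question is when the retrieval component actually helps and how to set the mixing weight.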
arXiv Detail & Related papers (2022-10-28T02:57:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.