Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation
- URL: http://arxiv.org/abs/2410.20811v1
- Date: Mon, 28 Oct 2024 07:59:34 GMT
- Authors: Jaechang Kim, Jinmin Goh, Inseok Hwang, Jaewoong Cho, Jungseul Ok
- Abstract summary: We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it.
CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations.
GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality.
- Score: 9.277840736103554
- Abstract: Deep learning-based expert models have reached superhuman performance in decision-making domains such as chess and Go. However, explaining or commenting on their decisions remains under-explored, even though it matters for human education and model explainability. The outputs of expert models are accurate yet difficult for humans to interpret. On the other hand, large language models (LLMs) produce fluent commentary but are prone to hallucinations due to their limited decision-making capabilities. To bridge this gap between expert models and LLMs, we focus on chess commentary as a representative case of explaining complex decision-making processes through language and address both the generation and evaluation of commentary. We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it. CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations. GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality. Experimental results, validated by both human judges and GCC-Eval, demonstrate that CCC generates commentary that is accurate, informative, and fluent.
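The abstract describes the pipeline only at a high level, so the following is a minimal sketch of how a concept-guided generation and evaluation loop of this kind might be wired together. It assumes the python-chess library with a local Stockfish binary standing in for the expert model, a user-supplied llm_generate(prompt) callable for the LLM, and an illustrative concept list and prioritization heuristic; none of these specifics come from the paper.
```python
# Hedged sketch in the spirit of CCC / GCC-Eval, not the authors' implementation.
# Assumptions: python-chess + local Stockfish as the expert model, a user-supplied
# llm_generate(prompt) callable, and an illustrative concept list and heuristic.
import chess
import chess.engine

CONCEPTS = ["material balance", "king safety", "piece activity", "pawn structure"]


def expert_analysis(board: chess.Board, stockfish_path: str = "stockfish"):
    """Query the expert model (a UCI engine) for an evaluation and principal variation."""
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        info = engine.analyse(board, chess.engine.Limit(depth=18))
    score_cp = info["score"].pov(board.turn).score(mate_score=10_000)
    pv_san = board.variation_san(info["pv"]) if "pv" in info else ""
    return score_cp, pv_san


def prioritize_concepts(board: chess.Board, score_cp: int) -> list[str]:
    """Illustrative stand-in for the paper's prioritization: rank concepts by crude relevance."""
    ranked = []
    if abs(score_cp) > 150:
        ranked.append("material balance")
    if board.is_check():
        ranked.append("king safety")
    ranked += [c for c in CONCEPTS if c not in ranked]
    return ranked[:3]


def generate_commentary(fen: str, move_san: str, llm_generate,
                        stockfish_path: str = "stockfish") -> str:
    """Build a concept-guided prompt from the expert analysis and hand it to the LLM."""
    board = chess.Board(fen)
    board.push_san(move_san)  # comment on the position after the move
    score_cp, pv_san = expert_analysis(board, stockfish_path)
    concepts = prioritize_concepts(board, score_cp)
    prompt = (
        f"Position after {move_san} (FEN): {board.fen()}\n"
        f"Engine evaluation: {score_cp} centipawns (side to move); best line: {pv_san}\n"
        f"Write a short, accurate commentary on {move_san}, "
        f"focusing on: {', '.join(concepts)}."
    )
    return llm_generate(prompt)


def judge_commentary(commentary: str, expert_context: str, llm_generate) -> str:
    """GCC-Eval-style LLM judge (sketch): score informativeness and linguistic quality."""
    prompt = (
        f"Expert analysis of the position:\n{expert_context}\n\n"
        f"Commentary to evaluate:\n{commentary}\n\n"
        "Rate the commentary from 1 to 5 on (a) informativeness and (b) linguistic "
        "quality, and briefly justify each score."
    )
    return llm_generate(prompt)
```
In this sketch the LLM sees only the engine's evaluation, best line, and a short prioritized concept list rather than raw search output, which is the kind of grounding the abstract describes; the actual concept set, prioritization scheme, and prompts used by CCC and GCC-Eval are defined in the paper itself.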
Related papers
- Enhancing Commentary Strategies for Imperfect Information Card Games: A Study of Large Language Models in Guandan Commentary [5.1244906826828736]
We introduce a novel commentary method that combines Reinforcement Learning (RL) and large language models (LLMs).
Our system leverages RL to generate intricate card-playing scenarios and employs LLMs to generate corresponding commentary text.
We showcase the substantial enhancement in performance achieved by the proposed commentary framework when applied to open-source LLMs.
arXiv Detail & Related papers (2024-06-23T11:58:26Z) - Large Language Models are Limited in Out-of-Context Knowledge Reasoning [65.72847298578071]
Large Language Models (LLMs) possess extensive knowledge and strong capabilities in performing in-context reasoning.
This paper focuses on a significant aspect of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which combines multiple pieces of knowledge to infer new knowledge.
arXiv Detail & Related papers (2024-06-11T15:58:59Z) - Aspect-based Sentiment Evaluation of Chess Moves (ASSESS): an NLP-based Method for Evaluating Chess Strategies from Textbooks [3.652509571098292]
This study investigates the feasibility of using a modified sentiment analysis method as a means for evaluating chess moves based on text.
By extracting insights from move-action phrases, our approach aims to provide more fine-grained and contextually aware, chess-move-based sentiment classification.
arXiv Detail & Related papers (2024-05-10T14:23:43Z) - Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z) - HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preferences.
HD-Eval inherits the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Large Language Models on the Chessboard: A Study on ChatGPT's Formal Language Comprehension and Complex Reasoning Skills [4.138999291282392]
This paper probes the performance of ChatGPT, a sophisticated language model by OpenAI.
We assess ChatGPT's understanding of the chessboard, adherence to chess rules, and strategic decision-making abilities.
Our study also reveals ChatGPT's propensity for a coherent strategy in its gameplay and a noticeable uptick in decision-making assertiveness.
arXiv Detail & Related papers (2023-08-29T08:36:30Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - Improving Chess Commentaries by Combining Language Models with Symbolic Reasoning Engines [31.87260568733666]
We show how to combine symbolic reasoning engines with controllable language models to generate chess commentaries.
We conduct experiments to demonstrate that our approach generates commentaries preferred by human judges over previous baselines.
arXiv Detail & Related papers (2022-12-15T23:38:31Z) - Designing an Automatic Agent for Repeated Language based Persuasion Games [32.20930723085839]
We consider a repeated sender (expert) -- receiver (decision maker) game.
The sender is fully informed about the state of the world and aims to persuade the receiver to accept a deal by sending one of several possible natural language reviews.
We design an automatic expert that plays this repeated game, aiming to achieve the maximal payoff.
arXiv Detail & Related papers (2021-05-11T12:25:57Z)