Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation
- URL: http://arxiv.org/abs/2410.20811v1
- Date: Mon, 28 Oct 2024 07:59:34 GMT
- Authors: Jaechang Kim, Jinmin Goh, Inseok Hwang, Jaewoong Cho, Jungseul Ok
- Abstract summary: We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it.
CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations.
GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality.
- Score: 9.277840736103554
- Abstract: Deep learning-based expert models have reached superhuman performance in decision-making domains such as chess and Go. However, explaining or commenting on their decisions remains under-explored, even though it matters for human education and model explainability. The outputs of expert models are accurate yet difficult for humans to interpret. On the other hand, large language models (LLMs) produce fluent commentary but are prone to hallucinations due to their limited decision-making capabilities. To bridge this gap between expert models and LLMs, we focus on chess commentary as a representative case of explaining complex decision-making processes through language and address both the generation and evaluation of commentary. We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it. CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations. GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality. Experimental results, validated by both human judges and GCC-Eval, demonstrate that CCC generates commentary that is accurate, informative, and fluent.
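The abstract describes the pipeline only at a high level, so the following is a minimal sketch of how a concept-guided generation and evaluation loop of this kind might be wired together. It assumes the python-chess library with a local Stockfish binary standing in for the expert model, a user-supplied llm_generate(prompt) callable for the LLM, and an illustrative concept list and prioritization heuristic; none of these specifics come from the paper.
```python
# Hedged sketch in the spirit of CCC / GCC-Eval, not the authors' implementation.
# Assumptions: python-chess + local Stockfish as the expert model, a user-supplied
# llm_generate(prompt) callable, and an illustrative concept list and heuristic.
import chess
import chess.engine

CONCEPTS = ["material balance", "king safety", "piece activity", "pawn structure"]


def expert_analysis(board: chess.Board, stockfish_path: str = "stockfish"):
    """Query the expert model (a UCI engine) for an evaluation and principal variation."""
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        info = engine.analyse(board, chess.engine.Limit(depth=18))
    score_cp = info["score"].pov(board.turn).score(mate_score=10_000)
    pv_san = board.variation_san(info["pv"]) if "pv" in info else ""
    return score_cp, pv_san


def prioritize_concepts(board: chess.Board, score_cp: int) -> list[str]:
    """Illustrative stand-in for the paper's prioritization: rank concepts by crude relevance."""
    ranked = []
    if abs(score_cp) > 150:
        ranked.append("material balance")
    if board.is_check():
        ranked.append("king safety")
    ranked += [c for c in CONCEPTS if c not in ranked]
    return ranked[:3]


def generate_commentary(fen: str, move_san: str, llm_generate,
                        stockfish_path: str = "stockfish") -> str:
    """Build a concept-guided prompt from the expert analysis and hand it to the LLM."""
    board = chess.Board(fen)
    board.push_san(move_san)  # comment on the position after the move
    score_cp, pv_san = expert_analysis(board, stockfish_path)
    concepts = prioritize_concepts(board, score_cp)
    prompt = (
        f"Position after {move_san} (FEN): {board.fen()}\n"
        f"Engine evaluation: {score_cp} centipawns (side to move); best line: {pv_san}\n"
        f"Write a short, accurate commentary on {move_san}, "
        f"focusing on: {', '.join(concepts)}."
    )
    return llm_generate(prompt)


def judge_commentary(commentary: str, expert_context: str, llm_generate) -> str:
    """GCC-Eval-style LLM judge (sketch): score informativeness and linguistic quality."""
    prompt = (
        f"Expert analysis of the position:\n{expert_context}\n\n"
        f"Commentary to evaluate:\n{commentary}\n\n"
        "Rate the commentary from 1 to 5 on (a) informativeness and (b) linguistic "
        "quality, and briefly justify each score."
    )
    return llm_generate(prompt)
```
In this sketch the LLM sees only the engine's evaluation, best line, and a short prioritized concept list rather than raw search output, which is the kind of grounding the abstract describes; the actual concept set, prioritization scheme, and prompts used by CCC and GCC-Eval are defined in the paper itself.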
Related papers
- Enhancing Commentary Strategies for Imperfect Information Card Games: A Study of Large Language Models in Guandan Commentary [5.1244906826828736]
We introduce a novel commentary method that combines Reinforcement Learning (RL) and large language models (LLMs).
Our system leverages RL to generate intricate card-playing scenarios and employs LLMs to generate corresponding commentary text.
We showcase the substantial enhancement in performance achieved by the proposed commentary framework when applied to open-source LLMs.
arXiv Detail & Related papers (2024-06-23T11:58:26Z) - Large Language Models are Limited in Out-of-Context Knowledge Reasoning [65.72847298578071]
Large Language Models (LLMs) possess extensive knowledge and strong capabilities in performing in-context reasoning.
This paper focuses on a significant aspect of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which combines multiple pieces of knowledge to infer new knowledge.
arXiv Detail & Related papers (2024-06-11T15:58:59Z) - Aspect-based Sentiment Evaluation of Chess Moves (ASSESS): an NLP-based Method for Evaluating Chess Strategies from Textbooks [3.652509571098292]
This study investigates the feasibility of using a modified sentiment analysis method as a means for evaluating chess moves based on text.
By extracting insights from move-action phrases, our approach aims to provide more fine-grained and contextually aware, chess-move-based sentiment classification.
arXiv Detail & Related papers (2024-05-10T14:23:43Z) - Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z) - HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preferences.
HD-Eval inherits the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Large Language Models on the Chessboard: A Study on ChatGPT's Formal Language Comprehension and Complex Reasoning Skills [4.138999291282392]
This paper probes the performance of ChatGPT, a sophisticated language model by OpenAI.
We assess ChatGPT's understanding of the chessboard, adherence to chess rules, and strategic decision-making abilities.
Our study also reveals ChatGPT's propensity for a coherent strategy in its gameplay and a noticeable uptick in decision-making assertiveness.
arXiv Detail & Related papers (2023-08-29T08:36:30Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - Improving Chess Commentaries by Combining Language Models with Symbolic Reasoning Engines [31.87260568733666]
We show how to combine symbolic reasoning engines with controllable language models to generate chess commentaries.
We conduct experiments to demonstrate that our approach generates commentaries preferred by human judges over previous baselines.
arXiv Detail & Related papers (2022-12-15T23:38:31Z) - Designing an Automatic Agent for Repeated Language based Persuasion Games [32.20930723085839]
We consider a repeated sender (expert) -- receiver (decision maker) game.
The sender is fully informed about the state of the world and aims to persuade the receiver to accept a deal by sending one of several possible natural language reviews.
We design an automatic expert that plays this repeated game, aiming to achieve the maximal payoff.
arXiv Detail & Related papers (2021-05-11T12:25:57Z)