FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs
- URL: http://arxiv.org/abs/2411.14054v1
- Date: Thu, 21 Nov 2024 11:59:13 GMT
- Title: FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs
- Authors: Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Sunghee Jung, Myeongcheol Shin
- Abstract summary: This study investigates language models' generative capabilities in tool-use dialogs.
We categorize the models' outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection.
Using this benchmark, we evaluate several language models that support function calling.
- Score: 4.406769771178207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates language models' generative capabilities in tool-use dialogs. We categorize the models' outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages; they must also effectively generate conversational messages that engage the user.
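To make the four output types concrete, below is a minimal Python sketch of what an evaluation item and a first-pass output router might look like. The field names, the routing heuristic, and the example content are assumptions for illustration; the benchmark's real item schema, judge prompts, and scoring scripts are defined in the FunctionChat-Bench release.

```python
# Illustrative sketch only: field names and the routing heuristic below are
# hypothetical, not the benchmark's actual schema.
from dataclasses import dataclass

# The four output types the benchmark evaluates.
OUTPUT_TYPES = ("tool_call", "answer_completion", "slot_question", "relevance_detection")

@dataclass
class EvalItem:
    dialog: list          # multi-turn messages, e.g. {"role": "user", "content": "..."}
    tools: list           # function schemas the model may call
    expected_type: str    # one of OUTPUT_TYPES
    reference: str        # gold tool call or gold utterance for the judge

def classify_output(message: dict) -> str:
    """Crude routing of an assistant message to the output type it represents."""
    if message.get("tool_calls"):
        return "tool_call"            # the model invoked a function
    content = message.get("content", "")
    if content.rstrip().endswith("?"):
        return "slot_question"        # asking the user for a missing argument
    # Telling answer completion apart from relevance detection (declining an
    # out-of-scope request) needs the automated judge, not a surface heuristic.
    return "answer_completion"
```

In the benchmark itself, whether a message of each type is actually correct is scored by the automated assessment programs, not by surface rules like the ones above.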
Related papers
- Teaching a Language Model to Speak the Language of Tools [0.0]
This work presents a methodology for adapting existing language models to enable robust tool use in any target language. The research introduces TUCAN, which achieves up to 28.75% improvement in function-calling accuracy over base models.
arXiv Detail & Related papers (2025-06-29T20:47:27Z)
- Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt [7.096646842716599]
We introduce language hooks, a novel framework for augmenting language models with new capabilities.
We benchmark our method against state-of-the-art baselines and find that it outperforms task-aware approaches. A minimal sketch of the idea follows.
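The sketch below is one reading of the "hook" idea: functions that inspect the text the model produces and may splice in tool output, leaving the model and its prompt untouched. The trigger syntax and all names are my assumptions, not the paper's actual API.

```python
# Hypothetical illustration of decoupled tool use via hooks.
import re
from typing import Callable, Optional

Hook = Callable[[str], Optional[str]]  # returns text to append, or None

def calculator_hook(chunk: str) -> Optional[str]:
    """Fires when the new generation contains arithmetic like CALC(12 * 7)."""
    m = re.search(r"CALC\(([\d+\-*/. ]+)\)", chunk)
    return str(eval(m.group(1), {"__builtins__": {}})) if m else None  # toy arithmetic evaluator

def generate_with_hooks(model_step, prompt: str, hooks: list, max_rounds: int = 4) -> str:
    """Alternates model generation with hook execution until no hook fires."""
    text = prompt
    for _ in range(max_rounds):
        chunk = model_step(text)
        text += chunk                  # the model never sees a tool schema
        fired = [out for h in hooks if (out := h(chunk)) is not None]
        if not fired:
            break
        text += "\n" + "\n".join(fired)  # tool results are spliced in post hoc
    return text
```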
arXiv Detail & Related papers (2024-12-08T15:16:17Z)
- Evaluating Large Language Models in Semantic Parsing for Conversational Question Answering over Knowledge Graphs [6.869834883252353]
This paper evaluates the performance of large language models that have not been explicitly pre-trained on this task.
Our results demonstrate that large language models are capable of generating graph queries from dialogues.
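A sketch of the task setup: prompt an LLM to translate the latest dialogue turn into a SPARQL query over a knowledge graph. The template wording and the `llm` callable are assumptions, not the paper's exact protocol.

```python
# Hypothetical prompt template; the paper's actual prompting setup differs.
PROMPT = (
    "Given the conversation so far, write a SPARQL query over Wikidata that\n"
    "answers the last user question. Resolve pronouns from context.\n\n"
    "{history}\nUser: {question}\nSPARQL:"
)

def dialogue_to_sparql(llm, history: list, question: str) -> str:
    return llm(PROMPT.format(history="\n".join(history), question=question)).strip()

# After a turn about the film Inception (wd:Q25188), "Who directed it?" should
# resolve to something like: SELECT ?d WHERE { wd:Q25188 wdt:P57 ?d . }
```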
arXiv Detail & Related papers (2024-01-03T12:28:33Z)
- Dialogue Quality and Emotion Annotations for Customer Support Conversations [7.218791626731783]
This paper presents a holistic annotation approach for emotion and conversational quality in the context of bilingual customer support conversations.
It provides a unique and valuable resource for the development of text classification models.
arXiv Detail & Related papers (2023-11-23T10:56:14Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning [27.92734269206744]
InstructDial is an instruction tuning framework for dialogue.
It consists of a repository of 48 diverse dialogue tasks in a unified text-to-text format created from 59 openly available dialogue datasets.
Our analysis reveals that InstructDial enables good zero-shot performance on unseen datasets and tasks such as dialogue evaluation and intent detection, and even better performance in a few-shot setting.
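Below is an illustrative conversion of one dialogue example into a unified text-to-text instruction format. The actual InstructDial templates and field names differ; these strings are mine.

```python
# Hypothetical instruction-format conversion, in the spirit of InstructDial.
def to_instruction_format(instruction: str, dialog: list, target: str) -> dict:
    source = "Instruction: " + instruction + "\nDialogue:\n" + "\n".join(dialog)
    return {"source": source, "target": target}

example = to_instruction_format(
    instruction="Classify the intent of the last user utterance.",
    dialog=["User: I want to move my reservation to 8pm."],
    target="change_booking",
)
# A seq2seq model is then fine-tuned to map example["source"] to example["target"].
```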
arXiv Detail & Related papers (2022-05-25T11:37:06Z)
- Vector Representations of Idioms in Conversational Systems [1.6507910904669727]
We utilize the Potential Idiomatic Expression (PIE)-English idioms corpus for the two tasks that we investigate.
We achieve a state-of-the-art (SoTA) result of 98% macro F1 score on the classification task using the SoTA T5 model.
The results show that the model trained on the idiom corpus generates more fitting responses to prompts containing idioms 71.9% of the time.
arXiv Detail & Related papers (2022-05-07T14:50:05Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- Quality Assurance of Generative Dialog Models in an Evolving Conversational Agent Used for Swedish Language Practice [59.705062519344]
One proposed solution involves AI-enabled conversational agents for person-centered interactive language practice.
We present results from ongoing action research targeting quality assurance of proprietary generative dialog models trained for virtual job interviews.
arXiv Detail & Related papers (2022-03-29T10:25:13Z)
- Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm [0.0]
We discuss methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language.
We introduce the idea of a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks.
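A toy two-stage metaprompt in the spirit described above; the wording is my own illustration, and `llm` is any text-in/text-out callable.

```python
# Hypothetical metaprompt: the model first writes a task-specific prompt,
# which is then used to solve the task itself.
METAPROMPT = (
    "You write prompts. Given a task description, produce a clear prompt,\n"
    "including one worked example, that makes a language model solve the task.\n"
    "Task: {task}\nPrompt:"
)

def solve_via_metaprompt(llm, task: str, task_input: str) -> str:
    task_prompt = llm(METAPROMPT.format(task=task))  # stage 1: model writes its own prompt
    return llm(task_prompt + "\n" + task_input)      # stage 2: the generated prompt is used
```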
arXiv Detail & Related papers (2021-02-15T05:27:55Z)
- Plug-and-Play Conversational Models [62.77150879036442]
We introduce an approach that requires no further computation at decoding time and no fine-tuning of a large language model.
We demonstrate, through extensive automatic and human evaluation, a high degree of control over the generated conversational responses with regard to multiple desired attributes.
arXiv Detail & Related papers (2020-10-09T03:17:51Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)