Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection
- URL: http://arxiv.org/abs/2405.09279v1
- Date: Wed, 15 May 2024 11:55:14 GMT
- Title: Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection
- Authors: Dylan Phelps, Thomas Pickard, Maggie Mi, Edward Gow-Smith, Aline Villavicencio,
- Abstract summary: This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
- Score: 2.2724928083094196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - Contextual Code Switching for Machine Translation using Language Models [1.4866655830571935]
Large language models (LLMs) have exerted a considerable impact on diverse language-related tasks in recent years.
We present an extensive study on the code switching task specifically for the machine translation task comparing multiple LLMs.
Our results indicate that despite the LLMs having promising results in the certain tasks, the models with relatively lesser complexity outperform the multilingual large language models in the machine translation task.
arXiv Detail & Related papers (2023-12-20T16:40:33Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two languages to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Scalable Performance Analysis for Vision-Language Models [26.45624201546282]
Joint vision-language models have shown great performance over a diverse set of tasks.
Our paper introduces a more scalable solution that relies on already annotated benchmarks.
We confirm previous findings that CLIP behaves like a bag of words model and performs better with nouns and verbs.
arXiv Detail & Related papers (2023-05-30T06:40:08Z) - EvEntS ReaLM: Event Reasoning of Entity States via Language Models [24.077262847151232]
Nominally, Large Language models (LLM) have been exposed to procedural knowledge about how objects interact, yet our benchmarking shows they fail to reason about the world.
In particular, our results indicate that our prompting technique is especially useful for unseen attributes (out-of-domain) or when only limited data is available.
arXiv Detail & Related papers (2022-11-10T07:48:01Z) - Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z) - Multi Task Learning For Zero Shot Performance Prediction of Multilingual
Models [12.759281077118567]
Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages.
We build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task learning problem.
arXiv Detail & Related papers (2022-05-12T14:47:03Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM.
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.