Evaluation of large language models using an Indian language LGBTI+
lexicon
- URL: http://arxiv.org/abs/2310.17787v1
- Date: Thu, 26 Oct 2023 21:32:24 GMT
- Title: Evaluation of large language models using an Indian language LGBTI+
lexicon
- Authors: Aditya Joshi, Shruta Rawat, Alpana Dange
- Abstract summary: Large language models (LLMs) are typically evaluated on the basis of task-based benchmarks such as MMLU.
This paper presents a methodology for evaluation of LLMs using an LGBTI+ lexicon in Indian languages.
- Score: 3.2047868962340327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are typically evaluated on the basis of
task-based benchmarks such as MMLU. Such benchmarks do not examine the responsible
behaviour of LLMs in specific contexts. This is particularly true in the LGBTI+
context, where social stereotypes may result in variation in LGBTI+ terminology.
Therefore, domain-specific lexicons or dictionaries may be useful as a
representative list of words against which an LLM's behaviour can be
evaluated. This paper presents a methodology for the evaluation of LLMs using an
LGBTI+ lexicon in Indian languages. The methodology consists of four steps:
formulating NLP tasks relevant to the expected behaviour, creating prompts that
test LLMs, using the LLMs to obtain the output, and finally manually
evaluating the results. Our qualitative analysis shows that the three LLMs we
experiment with are unable to detect underlying hateful content. Similarly, we
observe limitations in using machine translation as a means to evaluate natural
language understanding in languages other than English. The methodology
presented in this paper can be useful for LGBTI+ lexicons in other languages as
well as for other domain-specific lexicons. This work opens avenues for research
on the responsible behaviour of LLMs, as demonstrated here in the context of
prevalent social perceptions of the LGBTI+ community.
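To make the four steps concrete, here is a minimal sketch of such an evaluation loop, assuming a hypothetical lexicon and a placeholder model client; the task templates, lexicon entries, and query_llm function are illustrative assumptions, not the authors' actual artefacts.

```python
# Minimal sketch of the four-step methodology; lexicon entries,
# prompt templates, and the model client are hypothetical placeholders.

# Step 1: formulate NLP tasks relevant to the expected behaviour.
TASK_TEMPLATES = {
    "hate_speech_detection": (
        "Does the following sentence contain hateful content? "
        "Answer yes or no.\nSentence: {sentence}"
    ),
    "machine_translation": (
        "Translate the following sentence into English.\nSentence: {sentence}"
    ),
}

# Hypothetical lexicon: (term, example sentence containing the term).
LEXICON = [
    ("term-1", "an example sentence using term-1"),
    ("term-2", "an example sentence using term-2"),
]

def query_llm(prompt: str) -> str:
    """Placeholder: swap in a real client for any of the LLMs under test."""
    return "<model output>"

def run_evaluation() -> list[dict]:
    results = []
    for task, template in TASK_TEMPLATES.items():
        for term, sentence in LEXICON:
            # Step 2: create a prompt that tests the LLM on this entry.
            prompt = template.format(sentence=sentence)
            # Step 3: obtain the model's output.
            output = query_llm(prompt)
            results.append({"task": task, "term": term, "output": output})
    # Step 4: the collected outputs are evaluated manually.
    return results
```

Keeping the final step manual matches the paper's qualitative analysis: whether an output reflects a harmful stereotype is a judgement made by hand, not a score computed automatically.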
Related papers
- Generating bilingual example sentences with large language models as lexicography assistants [2.6550899846546527]
We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels.
We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility.
arXiv Detail & Related papers (2024-10-04T06:45:48Z)
- LLMs' Understanding of Natural Language Revealed [0.0]
Large language models (LLMs) are the result of a massive experiment in bottom-up, data-driven reverse engineering of language at scale.
We will focus on testing LLMs for their language understanding capabilities, their supposed forte.
arXiv Detail & Related papers (2024-07-29T01:21:11Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
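As a rough illustration of the decompose-then-aggregate idea summarised above (not the paper's actual stages, which are grounded in pedagogical practices), an evaluator might score separate criteria and then combine them with explicit weights:

```python
# Sketch of a decompose-then-aggregate evaluator. The criteria,
# weights, and prompt wording are assumptions for illustration only.

CRITERIA = {"fluency": 0.3, "factuality": 0.4, "relevance": 0.3}

def score_criterion(answer: str, criterion: str) -> float:
    # Decompose: an LLM judge would rate one criterion at a time,
    # e.g. with a prompt like:
    #   f"Rate the {criterion} of this answer from 1 to 5: {answer}"
    return 3.0  # stub rating; replace with a real judge call

def evaluate(answer: str) -> float:
    # Aggregate: a weighted sum keeps each stage inspectable.
    return sum(w * score_criterion(answer, c) for c, w in CRITERIA.items())

print(evaluate("Paris is the capital of France."))  # -> 3.0 with the stub
```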
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study [3.0059120458540383]
We consider the evaluation of the lexical richness of the text generated by conversational Large Language Models (LLMs) and how it depends on the model parameters.
The results show how lexical richness depends on the version of ChatGPT and some of its parameters, such as the presence penalty, or on the role assigned to the model.
arXiv Detail & Related papers (2024-02-11T13:41:17Z)
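Lexical richness in the study above is a property of generated text; one common, generic way to quantify it is the type-token ratio (unique words over total words). The snippet below illustrates only that generic idea, not the paper's specific metric set or its ChatGPT parameter sweep.

```python
# Generic illustration of one lexical-richness measure, the
# type-token ratio (TTR); not the paper's exact metrics.
import re

def type_token_ratio(text: str) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())  # crude word tokeniser
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "the cat sat on the mat and the dog sat on the rug"
print(f"TTR = {type_token_ratio(sample):.3f}")  # 8 unique / 13 total
```

In a study like the one above, one would generate text under different model versions and parameter settings (e.g. the presence penalty) and compare such richness scores across conditions.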
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- How Proficient Are Large Language Models in Formal Languages? An In-Depth Insight for Knowledge Base Question Answering [52.86931192259096]
Knowledge Base Question Answering (KBQA) aims to answer natural language questions based on facts in knowledge bases.
Recent works leverage the capabilities of large language models (LLMs) for logical form generation to improve performance.
arXiv Detail & Related papers (2024-01-11T09:27:50Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Can Large Language Models Transform Computational Social Science? [79.62471267510963]
Large Language Models (LLMs) are capable of performing many language processing tasks zero-shot (without training data).
This work provides a road map for using LLMs as Computational Social Science tools.
arXiv Detail & Related papers (2023-04-12T17:33:28Z)
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs [26.118193748582197]
We evaluate four categories of widely used state-of-the-art models.
We find that, despite only evaluating on utterances that require a binary inference, models in three of these categories perform close to random.
These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models.
arXiv Detail & Related papers (2022-10-26T19:04:23Z)
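Because each test utterance in the study above requires only a binary inference, accuracy against the 50% random baseline is the natural yardstick; the sketch below illustrates that setup with invented items and a stubbed model call.

```python
# Sketch of a binary implicature evaluation; the items and the
# model stub are invented for illustration.

ITEMS = [
    # (dialogue, question, gold answer)
    ("A: Are you coming to the party? B: I have to work.",
     "Is B coming to the party?", "no"),
    ("A: Did you like the movie? B: I laughed the whole way through.",
     "Did B like the movie?", "yes"),
]

def model_answer(dialogue: str, question: str) -> str:
    """Placeholder for an LLM call constrained to answer 'yes' or 'no'."""
    return "yes"  # stub; replace with a real model client

correct = sum(model_answer(d, q) == gold for d, q, gold in ITEMS)
print(f"accuracy = {correct / len(ITEMS):.2f} (random baseline = 0.50)")
```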
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.