Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning
- URL: http://arxiv.org/abs/2405.18375v1
- Date: Tue, 28 May 2024 17:14:02 GMT
- Title: Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning
- Authors: Phakphum Artkaew
- Abstract summary: This research introduces a collection of Winograd Schemas in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language.
We evaluate the performance of popular large language models on this benchmark, revealing their strengths and limitations and providing insights into the current state of the art.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commonsense reasoning is one of the important aspects of natural language understanding, with several benchmarks developed to evaluate it. However, only a few of these benchmarks are available in languages other than English. Developing parallel benchmarks facilitates cross-lingual evaluation, enabling a better understanding of different languages. This research introduces a collection of Winograd Schemas in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language. Through a methodology involving native speakers, professional translators, and thorough validation, the schemas aim to closely reflect Thai language nuances, idioms, and cultural references while maintaining ambiguity and commonsense challenges. We evaluate the performance of popular large language models on this benchmark, revealing their strengths and limitations and providing insights into the current state of the art. Results indicate that while models like GPT-4 and Claude-3-Opus achieve high accuracy in English, their performance drops significantly in Thai, highlighting the need for further advances in multilingual commonsense reasoning.
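Evaluation on a Winograd schema typically reduces to a binary choice: given a sentence with an ambiguous pronoun and two candidate referents, the model is scored on whether it prefers the correct referent. Below is a minimal, hypothetical sketch of one common scoring approach, comparing total language-model log-likelihoods of the sentence with each candidate substituted. The GPT-2 model, the `[pronoun]` placeholder format, and the English example item are illustrative assumptions, not the paper's actual protocol or data.

```python
# Minimal sketch (not the paper's protocol): resolve a Winograd schema by
# substituting each candidate referent for the pronoun and keeping the
# substitution the language model finds more likely.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted
    # tokens (all but the first), so scale by that count to get a total.
    return -out.loss.item() * (ids.size(1) - 1)

def resolve(schema: str, candidates: list[str]) -> str:
    """Return the candidate whose substitution yields the likelier sentence."""
    scores = {c: sentence_logprob(schema.replace("[pronoun]", c))
              for c in candidates}
    return max(scores, key=scores.get)

# English stand-in for a schema item; the Thai dataset's format is assumed.
schema = "The trophy didn't fit in the suitcase because [pronoun] was too big."
print(resolve(schema, ["the trophy", "the suitcase"]))
```

For chat-style models such as GPT-4, where token likelihoods are not exposed, the same comparison is usually posed instead as a direct multiple-choice prompt.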
Related papers
- Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models [8.746788828655356]
The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks.
We propose two key benchmarks: Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI).
arXiv Detail & Related papers (2024-10-07T07:14:37Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Saving the legacy of Hero Ibash: Evaluating Four Language Models for Aminoacian [0.8158530638728501]
This study assesses four cutting-edge language models in the underexplored Aminoacian language.
It scrutinizes their adaptability, effectiveness, and limitations in text generation, semantic coherence, and contextual understanding.
arXiv Detail & Related papers (2024-02-28T07:22:13Z)
- Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite [17.764840326809797]
We propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese.
These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding.
In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
arXiv Detail & Related papers (2023-09-15T14:52:23Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses [14.784624121891328]
We propose a novel paradigm for evaluating large language models (LLMs).
We measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself.
Our approach does not require any static evaluation corpora in languages other than English.
arXiv Detail & Related papers (2023-05-19T13:23:51Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME, the Cross-lingual TRansfer Evaluation of Multilingual Encoders, is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.