Revealing Weaknesses of Vietnamese Language Models Through Unanswerable
Questions in Machine Reading Comprehension
- URL: http://arxiv.org/abs/2303.13355v1
- Date: Thu, 16 Mar 2023 20:32:58 GMT
- Title: Revealing Weaknesses of Vietnamese Language Models Through Unanswerable
Questions in Machine Reading Comprehension
- Authors: Son Quoc Tran, Phong Nguyen-Thuan Do, Kiet Van Nguyen, Ngan Luu-Thuy
Nguyen
- Abstract summary: We present a comprehensive analysis of language weaknesses and strengths of current Vietnamese monolingual models.
We also reveal the existence of artifacts in Vietnamese Machine Reading Comprehension benchmarks.
Our proposed modification raises the difficulty of unanswerable questions.
- Score: 2.7528170226206443
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Although the curse of multilinguality significantly restricts the language
abilities of multilingual models in monolingual settings, researchers still
have to rely on multilingual models to develop state-of-the-art systems for
Vietnamese Machine Reading Comprehension. This difficulty stems from the
limited number of high-quality works on developing Vietnamese language
models. To encourage more work in this research field, we present a
comprehensive analysis of the linguistic weaknesses and strengths of current
Vietnamese monolingual models, using the downstream task of Machine Reading
Comprehension. Based on the analysis results, we suggest new directions for
developing Vietnamese language models. Beyond this main contribution, we also
reveal the existence of artifacts in Vietnamese Machine Reading Comprehension
benchmarks and argue for an urgent need for new high-quality benchmarks to
track the progress of Vietnamese Machine Reading Comprehension. Moreover, we
introduce a minor but valuable modification to the process of annotating
unanswerable questions for Machine Reading Comprehension from previous work.
Our proposed modification raises the difficulty of the resulting unanswerable
questions, making them harder for Machine Reading Comprehension systems to
solve.
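To make the answerability mechanism concrete, the sketch below illustrates the standard SQuAD-2.0-style scoring scheme that extractive MRC systems commonly use to decide whether a question is unanswerable: the no-answer (null) score at the [CLS] position is compared against the best answer-span score, with a tunable threshold. This is a minimal illustration, not the paper's code; the checkpoint name is a hypothetical placeholder, and the span search is deliberately simplified (it does not mask out question tokens).

```python
# Minimal sketch of SQuAD-2.0-style answerability scoring for extractive MRC.
# Assumption: "your-vietnamese-mrc-checkpoint" is a placeholder, not a real
# model name; a real system would batch inputs and mask question-token spans.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL = "your-vietnamese-mrc-checkpoint"  # hypothetical fine-tuned MRC model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)

def predict(question: str, context: str, null_threshold: float = 0.0) -> str:
    """Return the predicted answer span, or "" if predicted unanswerable."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    start, end = out.start_logits[0], out.end_logits[0]
    # Null (no-answer) score: start and end both point at the [CLS] token.
    null_score = (start[0] + end[0]).item()
    # Best non-null span, brute-forced over valid (start <= end) pairs.
    best_score, best_span = float("-inf"), (0, 0)
    for i in range(1, len(start)):
        for j in range(i, min(i + 30, len(end))):  # cap span length
            score = (start[i] + end[j]).item()
            if score > best_score:
                best_score, best_span = score, (i, j)
    if null_score - best_score > null_threshold:
        return ""  # the model prefers "no answer" by more than the threshold
    ids = inputs["input_ids"][0][best_span[0] : best_span[1] + 1]
    return tokenizer.decode(ids, skip_special_tokens=True)
```

Tuning null_threshold on a development set trades precision on answerable questions against recall on unanswerable ones, which is exactly where harder unanswerable questions bite.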
Related papers
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze the representation space, generated responses, and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training [0.0]
vi-mistral-x is an innovative Large Language Model designed specifically for the Vietnamese language.
It utilizes a unique method of continual pre-training, based on the Mistral architecture.
It has been shown to outperform existing Vietnamese LLMs in several key areas, including text classification, question answering, and text generation.
arXiv Detail & Related papers (2024-03-20T10:14:13Z) - VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks.
The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks.
In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
arXiv Detail & Related papers (2024-02-05T00:54:40Z) - Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - VieSum: How Robust Are Transformer-based Models on Vietnamese
Summarization? [1.1379578593538398]
We investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization.
We validate the performance of the methods on two Vietnamese datasets.
arXiv Detail & Related papers (2021-10-08T17:10:31Z) - Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z) - XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation [93.80733419450225]
This paper analyzes the current state of cross-lingual transfer learning.
We extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks.
arXiv Detail & Related papers (2021-04-15T12:26:12Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - An Experimental Study of Deep Neural Network Models for Vietnamese
Multiple-Choice Reading Comprehension [2.7528170226206443]
We conduct experiments on neural network-based models to understand the impact of word representations on machine reading comprehension.
Our experiments include using the Co-match model on six different Vietnamese word embeddings and the BERT model for multiple-choice reading comprehension.
On the ViMMRC corpus, the BERT model achieves an accuracy of 61.28% on the test set.
arXiv Detail & Related papers (2020-08-20T07:29:14Z)