Multi-stage Information Retrieval for Vietnamese Legal Texts
- URL: http://arxiv.org/abs/2209.14494v1
- Date: Thu, 29 Sep 2022 01:13:56 GMT
- Title: Multi-stage Information Retrieval for Vietnamese Legal Texts
- Authors: Nhat-Minh Pham, Ha-Thanh Nguyen, Trong-Hop Do
- Abstract summary: This study proposes a new approach for information retrieval for Vietnamese legal documents using sentence-transformer.
Various experiments are conducted to make comparisons between different transformer models, ranking scores, syllable-level, and word-level training.
- Score: 0.17188280334580194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study deals with the problem of information retrieval (IR) for
Vietnamese legal texts. Despite being well researched in many languages,
information retrieval has still not received much attention from the Vietnamese
research community. This is especially true for the case of legal documents,
which are hard to process. This study proposes a new approach to information
retrieval for Vietnamese legal documents using a sentence-transformer. In addition,
various experiments compare different transformer models, ranking scores, and
syllable-level versus word-level training.
The experiment results show that the proposed model outperforms models used in
current research on information retrieval for Vietnamese documents.
Related papers
- Narrative Action Evaluation with Prompt-Guided Multimodal Interaction [60.281405999483]
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action.
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
(2024-04-22T17:55:07Z)
- VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks.
The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks.
In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
(2024-02-05T00:54:40Z)
- UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese [2.9649783577150837]
We introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC).
The introduced dataset includes complex scenes captured in Vietnam and manually annotated by Vietnamese annotators under strict rules and supervision.
We show that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines, which performed well on the MS COCO dataset.
(2023-05-07T02:48:47Z)
- Prompting Large Language Model for Machine Translation: A Case Study [87.88120385000666]
We offer a systematic study on prompting strategies for machine translation.
We examine factors for prompt template and demonstration example selection.
We explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning.
(2023-01-17T18:32:06Z)
- Leveraging Semantic Representations Combined with Contextual Word Representations for Recognizing Textual Entailment in Vietnamese [0.25782420501870296]
This paper presents an experiment combining semantic word representations from the SRL task with contextual representations from BERT-related models for the RTE problem.
The experimental results show that the semantic-aware contextual representation model has about 1% higher performance than the model that does not incorporate semantic representation.
(2023-01-01T15:13:25Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
(2022-12-21T02:41:40Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
To our knowledge, this work presents the first systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning methods.
We conducted three types of experiments: monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
(2022-04-19T16:23:07Z)
- Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
(2022-04-05T17:12:53Z)
- VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization? [1.1379578593538398]
We investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization.
We validate the performance of the methods on two Vietnamese datasets.
(2021-10-08T17:10:31Z)
- Extract, Integrate, Compete: Towards Verification Style Reading Comprehension [66.2551168928688]
We present a new verification style reading comprehension dataset named VGaokao from Chinese Language tests of Gaokao.
To address the challenges in VGaokao, we propose a novel Extract-Integrate-Compete approach.
(2021-09-11T01:34:59Z)
- A Vietnamese Dataset for Evaluating Machine Reading Comprehension [2.7528170226206443]
We present UIT-ViQuAD, a new dataset for evaluating machine reading comprehension models in Vietnamese, a low-resource language.
This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
We conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD.
(2020-09-30T15:06:56Z)