Related papers: Multi-stage Information Retrieval for Vietnamese Legal Texts

URL: http://arxiv.org/abs/2209.14494v1
Date: Thu, 29 Sep 2022 01:13:56 GMT
Title: Multi-stage Information Retrieval for Vietnamese Legal Texts
Authors: Nhat-Minh Pham, Ha-Thanh Nguyen, Trong-Hop Do
Abstract summary: This study proposes a new approach for information retrieval for Vietnamese legal documents using sentence-transformer. Various experiments are conducted to make comparisons between different transformer models, ranking scores, syllable-level, and word-level training.
Score: 0.17188280334580194
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This study deals with the problem of information retrieval (IR) for Vietnamese legal texts. Despite being well researched in many languages, information retrieval has still not received much attention from the Vietnamese research community. This is especially true for legal documents, which are hard to process. This study proposes a new approach for information retrieval for Vietnamese legal documents using sentence-transformer. In addition, various experiments compare different transformer models, ranking scores, and syllable-level and word-level training. The experimental results show that the proposed model outperforms models used in current research on information retrieval for Vietnamese documents.

Related papers

A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions [3.7994176460443208]
Vietnamese document analysis and recognition (DAR) is a crucial field with applications in digitization, information retrieval, and automation. Despite advancements in OCR and NLP, Vietnamese text recognition faces unique challenges due to its complex diacritics, tonal variations, and lack of large-scale annotated datasets. Recently, large language models (LLMs) and vision-language models have demonstrated remarkable improvements in text recognition and document understanding.
arXiv Detail & Related papers (2025-06-05T14:03:18Z)
Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark [0.24999074238880487]
This work aims to provide the Vietnamese research community with a new benchmark for information retrieval. We also present a new objective function based on the InfoNCE loss function, which is used to train our Vietnamese embedding model.
arXiv Detail & Related papers (2025-03-10T15:47:01Z)
Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance. We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction [60.281405999483]
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
arXiv Detail & Related papers (2024-04-22T17:55:07Z)
VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, the VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
arXiv Detail & Related papers (2024-02-05T00:54:40Z)
UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese [2.9649783577150837]
We introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The introduced dataset includes complex scenes captured in Vietnam and manually annotated by Vietnamese under strict rules and supervision. We show that our dataset is challenging to recent state-of-the-art (SOTA) Transformer-based baselines, which performed well on the MS COCO dataset.
arXiv Detail & Related papers (2023-05-07T02:48:47Z)
Prompting Large Language Model for Machine Translation: A Case Study [87.88120385000666]
We offer a systematic study on prompting strategies for machine translation. We examine factors for prompt template and demonstration example selection. We explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning.
arXiv Detail & Related papers (2023-01-17T18:32:06Z)
Leveraging Semantic Representations Combined with Contextual Word Representations for Recognizing Textual Entailment in Vietnamese [0.25782420501870296]
This paper presents an experiment combining semantic word representation through the SRL task with context representation of BERT-related models for the RTE problem. The experimental results show that the semantic-aware contextual representation model has about 1% higher performance than the model that does not incorporate semantic representation.
arXiv Detail & Related papers (2023-01-01T15:13:25Z)
Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings. Our model operates on parallel data in $N$ languages. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes what is, to our knowledge, the first systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods. We conducted three types of experiments: monolingual, multilingual, and cross-lingual. The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design. Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)
VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization? [1.1379578593538398]
We investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization. We validate the performance of the methods on two Vietnamese datasets.
arXiv Detail & Related papers (2021-10-08T17:10:31Z)
Extract, Integrate, Compete: Towards Verification Style Reading Comprehension [66.2551168928688]
We present a new verification style reading comprehension dataset named VGaokao from Chinese Language tests of Gaokao. To address the challenges in VGaokao, we propose a novel Extract-Integrate-Compete approach.
arXiv Detail & Related papers (2021-09-11T01:34:59Z)
A Vietnamese Dataset for Evaluating Machine Reading Comprehension [2.7528170226206443]
We present UIT-ViQuAD, a new dataset for Vietnamese, a low-resource language, to evaluate machine reading comprehension models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. We conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD.
arXiv Detail & Related papers (2020-09-30T15:06:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.