A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies
- URL: http://arxiv.org/abs/2410.11450v1
- Date: Tue, 15 Oct 2024 09:53:40 GMT
- Title: A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies
- Authors: Yen-Hsiang Wang, Feng-Dian Su, Tzu-Yu Yeh, Yao-Chung Fan,
- Abstract summary: This paper introduces a cross-lingual statutory article retrieval (SAR) dataset designed to enhance legal information retrieval in multilingual settings.
Our dataset features spoken-language-style legal inquiries in English, paired with corresponding Chinese versions and relevant statutes, covering all Taiwanese civil, criminal, and administrative laws.
- Score: 4.511440076037968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a cross-lingual statutory article retrieval (SAR) dataset designed to enhance legal information retrieval in multilingual settings. Our dataset features spoken-language-style legal inquiries in English, paired with corresponding Chinese versions and relevant statutes, covering all Taiwanese civil, criminal, and administrative laws. This dataset aims to improve access to legal information for non-native speakers, particularly for foreign nationals in Taiwan. We propose several LLM-based methods as baselines for evaluating retrieval effectiveness, focusing on mitigating translation errors and improving cross-lingual retrieval performance. Our work provides a valuable resource for developing inclusive legal information retrieval systems.
Related papers
- MEL: Legal Spanish Language Model [0.3651422140724638]
This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large.
Evaluation benchmarks show a significant improvement over baseline models in understanding the legal Spanish language.
arXiv Detail & Related papers (2025-01-27T12:50:10Z) - Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer [23.978072734886272]
This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages.
We construct a dataset by transferring an English dataset to Japanese.
We investigate if the transferred dataset can assist human annotation on Japanese documents.
arXiv Detail & Related papers (2024-04-25T10:59:02Z) - CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts [50.44270798959864]
Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages.
We study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language.
arXiv Detail & Related papers (2024-04-19T04:02:50Z) - Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - GPTs and Language Barrier: A Cross-Lingual Legal QA Examination [5.253214457141011]
We explore the application of Generative Pre-trained Transformers (GPTs) in cross-lingual legal Question-Answering (QA) systems using the COLIEE Task 4 dataset.
In the COLIEE Task 4, given a statement and a set of related legal articles that serve as context, the objective is to determine whether the statement is legally valid.
By benchmarking four different combinations of English and Japanese prompts and data, we provide valuable insights into GPTs' performance in multilingual legal QA scenarios.
arXiv Detail & Related papers (2024-03-26T20:47:32Z) - Multi-EuP: The Multilingual European Parliament Dataset for Analysis of
Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z) - Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z) - Simple Yet Effective Neural Ranking and Reranking Baselines for
Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (MS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form
Summarization in the Legal Domain [2.4815579733050157]
We propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex)
Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages.
We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
arXiv Detail & Related papers (2022-10-24T17:58:59Z) - CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual
Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT)
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.