Adaptive Two-Phase Finetuning LLMs for Japanese Legal Text Retrieval
- URL: http://arxiv.org/abs/2412.13205v1
- Date: Tue, 03 Dec 2024 10:52:49 GMT
- Title: Adaptive Two-Phase Finetuning LLMs for Japanese Legal Text Retrieval
- Authors: Quang Hoang Trung, Nguyen Van Hoang Phuc, Le Trung Hoang, Quang Huu Hieu, Vo Nguyen Le Duy
- Abstract summary: We introduce a new dataset specifically designed for Japanese legal contexts.
In the first phase, the model learns a broad understanding of global contexts, enhancing its generalization.
In the second phase, the model is fine-tuned to address complex queries specific to legal scenarios.
Our pipeline proves effective in English contexts, surpassing comparable baselines on the MS MARCO dataset.
- Score: 6.058427379240698
- Abstract: Text Retrieval (TR) involves finding and retrieving text-based content relevant to a user's query from a large repository, with applications in real-world scenarios such as legal document retrieval. While most existing studies focus on English, limited work addresses Japanese contexts. In this paper, we introduce a new dataset specifically designed for Japanese legal contexts and propose a novel two-phase pipeline tailored to this domain. In the first phase, the model learns a broad understanding of global contexts, enhancing its generalization and adaptability to diverse queries. In the second phase, the model is fine-tuned to address complex queries specific to legal scenarios. Extensive experiments are conducted to demonstrate the superior performance of our method, which outperforms existing baselines. Furthermore, our pipeline proves effective in English contexts, surpassing comparable baselines on the MS MARCO dataset. We have made our code publicly available on GitHub, and the model checkpoints are accessible via HuggingFace.
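The abstract describes what each phase does but not the training objective. Below is a minimal sketch of the two-phase idea, assuming a sentence-transformers bi-encoder trained with an in-batch-negatives contrastive loss; the base model, toy data, and hyperparameters are placeholders, not the authors' released configuration.

```python
# Sketch only: two-phase contrastive finetuning of a bi-encoder retriever.
# Base model, data, and hyperparameters are illustrative assumptions.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed base encoder
loss = losses.MultipleNegativesRankingLoss(model)             # in-batch negatives

# Phase 1: broad, general-domain (query, relevant passage) pairs for generalization
general_pairs = [("What is tort law?", "Tort law governs civil wrongs ...")]  # toy data
loader1 = DataLoader([InputExample(texts=[q, p]) for q, p in general_pairs],
                     shuffle=True, batch_size=32)
model.fit(train_objectives=[(loader1, loss)], epochs=1, warmup_steps=100)

# Phase 2: complex, domain-specific legal query/article pairs (e.g., Japanese statutes)
legal_pairs = [("賃貸借契約はどのように解除できますか", "民法第541条 ...")]  # toy data
loader2 = DataLoader([InputExample(texts=[q, p]) for q, p in legal_pairs],
                     shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader2, loss)], epochs=3, warmup_steps=50)
```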
Related papers
- Optimizing Multi-Stage Language Models for Effective Text Retrieval [0.0]
We introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets.
Our method leverages advanced language models to achieve state-of-the-art performance.
To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies.
arXiv Detail & Related papers (2024-12-26T16:05:19Z)
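The summary does not say how the ensemble integrates its retrieval strategies. Reciprocal rank fusion (RRF) is one standard way to merge several ranked lists and is used below purely as an illustration, not as the paper's method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids into one ranking (standard RRF formula)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fuse a BM25 ranking with a dense-retriever ranking
fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d3", "d4"]])
```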
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
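A schematic of the multi-task layout described above: one shared encoder feeds classification, segmentation, and recognition branches so the tasks share features. The layers and sizes are illustrative stand-ins, not TextFormer's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskSpotter(nn.Module):
    """Schematic only: shared image features feeding three task branches."""
    def __init__(self, d=256, vocab=100):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for the image encoder
            nn.Conv2d(3, d, 3, stride=2, padding=1), nn.ReLU())
        self.classify = nn.Conv2d(d, 2, 1)       # text / non-text map
        self.segment = nn.Conv2d(d, 1, 1)        # text-region mask
        self.recognize = nn.Linear(d, vocab)     # per-location character logits

    def forward(self, images):
        feats = self.encoder(images)             # features shared by all branches
        rec = self.recognize(feats.flatten(2).transpose(1, 2))
        return self.classify(feats), self.segment(feats), rec

cls_map, seg_mask, rec_logits = MultiTaskSpotter()(torch.randn(1, 3, 64, 64))
```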
- Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval.
We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English.
For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z)
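A sketch of the setup the paper evaluates: BM25 supplies cheap first-stage candidates, then an embedding model re-orders them. The `embed` function is a random placeholder for whichever hosted embedding API is being tested.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = ["statute on lease termination", "criminal procedure outline", "tax code summary"]
bm25 = BM25Okapi([d.split() for d in docs])

def embed(text):
    """Placeholder for an embedding API call; returns a deterministic unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

query = "terminating a lease"
candidates = bm25.get_top_n(query.split(), docs, n=2)        # cheap lexical stage
qv = embed(query)
reranked = sorted(candidates, key=lambda d: -float(embed(d) @ qv))  # semantic re-rank
```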
- Attentive Deep Neural Networks for Legal Document Retrieval [2.4350217735794337]
We study the use of attentive neural network-based text representation for statute law document retrieval.
We develop two hierarchical architectures with sparse attention to represent long sentences and articles, and we name them Attentive CNN and Paraformer.
Experimental results show that attentive neural methods substantially outperform non-neural methods in retrieval performance across datasets and languages.
arXiv Detail & Related papers (2022-12-13T01:37:27Z)
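The summary gives no internals for Attentive CNN or Paraformer. A generic hierarchical pattern consistent with the description is to encode sentences first, then attention-pool the sentence vectors into one article representation; the module below illustrates only that pooling step.

```python
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Illustrative hierarchical pooling: learn weights over sentence vectors."""
    def __init__(self, d=128):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, sent_vecs):                  # (num_sentences, d)
        weights = torch.softmax(self.score(sent_vecs), dim=0)
        return (weights * sent_vecs).sum(dim=0)    # (d,) article representation

article_vec = AttentivePool()(torch.randn(12, 128))  # 12 encoded sentences -> 1 vector
```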
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
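A toy version of the retrieve-then-prompt idea: score stored English exemplars against the incoming query and prepend the best matches to the prompt. The hashed bag-of-words `embed` is a stand-in for XRICL's trained multilingual retriever.

```python
import numpy as np

exemplars = [  # stored English (question, SQL) demonstrations
    ("How many singers are there?", "SELECT count(*) FROM singer"),
    ("List all stadium names.", "SELECT name FROM stadium"),
]

def embed(text, d=64):
    """Toy hashed bag-of-words vector; a real system uses a trained encoder."""
    v = np.zeros(d)
    for tok in text.lower().split():
        v[hash(tok) % d] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def build_prompt(query, k=1):
    qv = embed(query)
    ranked = sorted(exemplars, key=lambda ex: -float(embed(ex[0]) @ qv))
    demos = "\n".join(f"Q: {q}\nSQL: {s}" for q, s in ranked[:k])
    return f"{demos}\nQ: {query}\nSQL:"

prompt = build_prompt("Combien y a-t-il de chanteurs ?")  # non-English query
```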
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is the task of recalling relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
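UnifieR learns both representations inside one model with shared weights; as a crude stand-in for the dual scoring, a hybrid score can be pictured as a weighted sum of a dense dot product and a lexicon-weight overlap, as sketched here.

```python
import numpy as np

def hybrid_score(q_dense, d_dense, q_lex, d_lex, alpha=0.5):
    """Combine dense similarity with a sparse lexicon-based score.
    q_lex/d_lex map terms to learned term weights (toy dicts here)."""
    dense = float(np.dot(q_dense, d_dense))
    lexical = sum(w * d_lex.get(t, 0.0) for t, w in q_lex.items())
    return alpha * dense + (1 - alpha) * lexical

s = hybrid_score(np.ones(4) / 2, np.ones(4) / 2,
                 {"lease": 1.2, "termination": 0.8}, {"lease": 0.9})
```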
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
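The summary does not enumerate the perspectives; one plausible reading is to compare the query against several views of a snippet (name, body, docstring) and aggregate the per-view similarities. The toy string similarity below stands in for learned embeddings.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Toy string similarity standing in for learned representations."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def multi_perspective_score(query, views):
    """Average the query's similarity across views of one code snippet."""
    return sum(sim(query, v) for v in views) / len(views)

score = multi_perspective_score(
    "sort a list in place",
    ["def sort_inplace(xs):", "xs.sort()", '"""Sort the list in place."""'])
```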
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
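A minimal illustration of exposing local context to a sentence-level translation model: concatenate neighbouring source sentences behind a separator so the encoder sees each sentence with its surroundings. The `<ctx>` token and window size are illustrative choices, not the paper's design.

```python
def with_local_context(sentences, i, window=1, sep=" <ctx> "):
    """Pair sentence i with its neighbours as one encoder input string."""
    lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
    context = [s for j, s in enumerate(sentences[lo:hi], start=lo) if j != i]
    return sentences[i] + sep + " ".join(context)

doc = ["He sat down.", "The bank was closed.", "He needed cash."]
src = with_local_context(doc, 1)  # neighbours help disambiguate "bank"
```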