Related papers: Poly-Vector Retrieval: Reference and Content Embeddings for Legal Documents

URL: http://arxiv.org/abs/2504.10508v1
Date: Wed, 09 Apr 2025 17:54:11 GMT
Title: Poly-Vector Retrieval: Reference and Content Embeddings for Legal Documents
Authors: João Alberto de Oliveira Lima
Abstract summary: In legal contexts, users frequently reference norms by their labels or nicknames, rather than by their content. This paper introduces Poly-Vector Retrieval, which assigns multiple distinct embeddings to each legal provision. It significantly improves retrieval accuracy for label-centric queries and shows potential for resolving internal and external cross-references.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for generating contextually accurate answers by integrating Large Language Models (LLMs) with retrieval mechanisms. However, in legal contexts, users frequently reference norms by their labels or nicknames (e.g., Article 5 of the Constitution or Consumer Defense Code (CDC)), rather than by their content. Furthermore, legal texts themselves rely heavily on explicit cross-references (e.g., "pursuant to Article 34") that function as pointers. Both scenarios pose challenges for traditional RAG approaches that rely solely on semantic embeddings of text, which often fail to retrieve the necessary referenced content. This paper introduces Poly-Vector Retrieval, a method that assigns multiple distinct embeddings to each legal provision: one embedding captures the content (the full text), another captures the label (the identifier or proper name), and optional additional embeddings capture alternative denominations. Inspired by Frege's distinction between Sense and Reference, this poly-vector retrieval approach treats labels, identifiers and reference markers as rigid designators and content embeddings as carriers of semantic substance. Experiments on the Brazilian Federal Constitution demonstrate that Poly-Vector Retrieval significantly improves retrieval accuracy for label-centric queries and shows potential for resolving internal and external cross-references, without compromising performance on purely semantic queries. The study discusses philosophical and practical implications of explicitly separating reference from content in vector embeddings and proposes future research directions for applying this approach to broader legal datasets and other domains characterized by explicit reference identifiers.
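Illustration: the abstract describes giving each provision several embeddings (content, label, and optional alternative names) that all resolve to the same provision. The following is a minimal Python sketch of that idea, not the author's implementation. The class PolyVectorIndex, the helper functions, the provision IDs, and the sample texts are all hypothetical, and a toy bag-of-words vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class PolyVectorIndex:
    """Hypothetical index storing several vectors per provision."""

    def __init__(self):
        self.entries = []  # list of (vector, kind, provision_id)

    def add(self, provision_id, content, label, aliases=()):
        # One vector for the full text, one for the label,
        # and one per alternative denomination.
        self.entries.append((embed(content), "content", provision_id))
        self.entries.append((embed(label), "label", provision_id))
        for alias in aliases:
            self.entries.append((embed(alias), "alias", provision_id))

    def search(self, query, k=1):
        q = embed(query)
        best = {}  # provision_id -> (best score, kind of the matching vector)
        for vec, kind, pid in self.entries:
            score = cosine(q, vec)
            if pid not in best or score > best[pid][0]:
                best[pid] = (score, kind)
        ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
        return [(pid, kind, round(score, 3)) for pid, (score, kind) in ranked[:k]]


# Placeholder provisions; the texts are illustrative, not official wording.
index = PolyVectorIndex()
index.add(
    "cf-art-5",
    content="All persons are equal before the law, without distinction of any kind.",
    label="Article 5 of the Constitution",
)
index.add(
    "cdc",
    content="Rules for consumer protection and defense.",
    label="Consumer Defense Code",
    aliases=["CDC"],
)

for query in ("What does the CDC say?", "equal before the law"):
    print(query, "->", index.search(query))
```

In this sketch a provision is scored by its best-matching vector, so a label-centric query can reach a provision through its label or alias vector while a content query still matches through the content vector. Other ways of combining the per-vector scores are possible; taking the maximum is an assumption made here for simplicity.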

Related papers

Universal Item Tokenization for Transferable Generative Recommendation [89.42584009980676]
We propose UTGRec, a universal item tokenization approach for transferable Generative Recommendation. By devising tree-structured codebooks, we discretize content representations into corresponding codes for item tokenization. For raw content reconstruction, we employ dual lightweight decoders to reconstruct item text and images from discrete representations. For collaborative knowledge integration, we assume that co-occurring items are similar and integrate collaborative signals through co-occurrence alignment and reconstruction.
arXiv Detail & Related papers (2025-04-06T08:07:49Z)
QuOTE: Question-Oriented Text Embeddings [8.377715521597292]
QuOTE (Question-Oriented Text Embeddings) is a novel enhancement to retrieval-augmented generation (RAG) systems. Unlike traditional RAG pipelines, QuOTE augments chunks with hypothetical questions that the chunk can potentially answer. We demonstrate that QuOTE significantly enhances retrieval accuracy, including in multi-hop question-answering tasks.
arXiv Detail & Related papers (2025-02-16T03:37:13Z)
Unlocking Legal Knowledge with Multi-Layered Embedding-Based Retrieval [0.0]
We propose a multi-layered embedding-based retrieval method for legal and legislative texts. Our method meets various information needs by allowing the Retrieval Augmented Generation system to provide accurate responses.
arXiv Detail & Related papers (2024-11-12T12:03:57Z)
Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation [55.99632509895994]
We introduce LAMIA, a novel approach for multi-aspect semantic tokenization. Unlike RQ-VAE, which uses a single embedding, LAMIA learns an "item palette": a collection of independent and semantically parallel embeddings. Our results demonstrate significant improvements in recommendation accuracy over existing methods.
arXiv Detail & Related papers (2024-09-11T13:49:48Z)
Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation [18.806738617249426]
Generalized Referring Expression introduces new challenges by allowing expressions to describe multiple objects or lack specific object references. Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules. We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z)
Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented Generation in Niche Domains, Exemplified by Korean Medicine [5.120567378386615]
We propose a natural language prompt-based retrieval augmented generation (Prompt-RAG) to enhance the performance of generative large language models (LLMs) in niche domains. We compare vector embeddings from Korean Medicine (KM) and Conventional Medicine (CM) documents, finding that KM document embeddings correlated more with token overlaps and less with human-assessed document relatedness. Results showed that Prompt-RAG outperformed existing models, including ChatGPT and conventional vector embedding-based RAGs, in terms of relevance and informativeness.
arXiv Detail & Related papers (2024-01-20T14:59:43Z)
Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model. We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
Description-Enhanced Label Embedding Contrastive Learning for Text Classification [65.01077813330559]
We adopt Self-Supervised Learning (SSL) in the model learning process and design a novel self-supervised Relation of Relation (R2) classification task. We propose a Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as optimization targets. We use external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning.
arXiv Detail & Related papers (2023-06-15T02:19:34Z)
Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target. We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage. Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z)
Rhetorical Role Labeling of Legal Documents using Transformers and Graph Neural Networks [1.290382979353427]
This paper presents the approaches undertaken to perform the task of rhetorical role labelling on Indian Court Judgements as part of SemEval Task 6: understanding legal texts, shared subtask A.
arXiv Detail & Related papers (2023-05-06T17:04:51Z)
Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations. We advocate remedying the deficiency of label supervision for MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query. Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms. We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.