HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning
- URL: http://arxiv.org/abs/2412.04661v1
- Date: Thu, 05 Dec 2024 23:10:56 GMT
- Title: HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning
- Authors: Manish Bhattarai, Ryan Barron, Maksim Eren, Minh Vu, Vesselin Grantcharov, Ismael Boureima, Valentin Stanev, Cynthia Matuszek, Vladimir Valtchinov, Kim Rasmussen, Boian Alexandrov
- Abstract summary: RAG enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning.
- Score: 6.2751089721877955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.
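As a rough illustration of the level-wise contrastive loss with hierarchical penalties mentioned in the abstract, the Python sketch below sums a standard supervised contrastive loss over the levels of a label hierarchy, scaling each level by a weight. It is not the paper's formulation: the function names, the hard (non-fuzzy) labels, the `temperature` value, and the per-level weights are all assumptions made for illustration, and the clustering step that would produce the labels is omitted.

```python
import numpy as np


def level_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss for one hierarchy level (hard labels assumed)."""
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-length embeddings
    sim = (z @ z.T) / temperature                      # scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)         # shift rows for numerical stability
    not_self = ~np.eye(len(labels), dtype=bool)        # an anchor is never its own pair
    denom = (np.exp(sim) * not_self).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(denom)                     # log-softmax over the other samples
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0                                # skip anchors with no positive pair
    if not has_pos.any():
        return 0.0
    per_anchor = -(log_prob * positives).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())


def hierarchical_loss(z, labels_per_level, level_weights, temperature=0.1):
    """Weighted sum of per-level losses; the weights stand in for hierarchical penalties."""
    return sum(
        weight * level_contrastive_loss(z, labels, temperature)
        for weight, labels in zip(level_weights, labels_per_level)
    )


# Toy example: four 2-D embeddings forming two tight groups.
embeddings = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
coarse = [0, 0, 0, 0]   # hypothetical top-level labels: one shared cluster
fine = [0, 0, 1, 1]     # hypothetical second-level labels: two sub-clusters
print(hierarchical_loss(embeddings, [coarse, fine], level_weights=[0.5, 1.0]))
```

In this sketch, embeddings whose geometry matches the labels at a given level contribute a small loss at that level, while mismatched labels contribute a large one; how the levels should actually be weighted is a design choice the abstract does not specify.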
Related papers
- SemRAG: Semantic Knowledge-Augmented RAG for Improved Question-Answering [2.4874078867686085]
SemRAG is an enhanced Retrieval Augmented Generation (RAG) framework that efficiently integrates domain-specific knowledge.
It employs a semantic chunking algorithm that segments documents based on the cosine similarity from sentence embeddings, preserving semantic coherence.
By structuring retrieved information into knowledge graphs, SemRAG captures relationships between entities, improving retrieval accuracy and contextual understanding.
arXiv Detail & Related papers (2025-07-10T11:56:25Z)
- FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation [43.56253799373878]
We introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge.
This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights.
We demonstrate the effectiveness of our approach on six datasets in two domains, showing that our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.
arXiv Detail & Related papers (2025-07-09T07:49:55Z)
- Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs [56.76586846269894]
Multimodal Large Language Models (MLLMs) have achieved success across various domains.
Despite its importance, the study of knowledge sharing among domain-specific MLLMs remains largely underexplored.
We propose a unified parameter integration framework that enables modular composition of expert capabilities.
arXiv Detail & Related papers (2025-06-30T15:07:41Z)
- HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization [0.0]
HERCULES is an algorithm and Python package designed for hierarchical k-means clustering of diverse data types.
It generates semantically rich titles and descriptions for clusters at each level of the hierarchy.
An interactive visualization tool facilitates thorough analysis and understanding of the clustering results.
arXiv Detail & Related papers (2025-06-24T20:22:00Z)
- SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph [10.297615455470133]
Multi-entity question answering (MEQA) poses significant challenges for large language models.
This paper introduces a structured RAG framework that organizes extracted entities into relational tables.
Experiments on Wikipedia-based multi-entity QA tasks demonstrate that SRAG significantly outperforms state-of-the-art long-context LLMs.
arXiv Detail & Related papers (2025-03-03T09:37:33Z)
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework.
This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.
Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
- Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering [9.929301228994095]
This paper proposes a novel approach that integrates Named Entity Recognition (NER) and Large Language Models (LLMs) embeddings within a graph-based framework for document clustering.
The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN).
Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
arXiv Detail & Related papers (2024-12-19T14:03:22Z)
- Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological Texts [1.565361244756411]
Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks.
This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media.
arXiv Detail & Related papers (2024-11-22T12:37:41Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Contextual Categorization Enhancement through LLMs Latent-Space [0.31263095816232184]
We propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset.
We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories.
arXiv Detail & Related papers (2024-04-25T09:20:51Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- HiGen: Hierarchy-Aware Sequence Generation for Hierarchical Text Classification [19.12354692458442]
Hierarchical text classification (HTC) is a complex subtask under multi-label text classification.
We propose HiGen, a text-generation-based framework utilizing language models to encode dynamic text representations.
arXiv Detail & Related papers (2024-01-24T04:44:42Z)
- Advancing Incremental Few-shot Semantic Segmentation via Semantic-guided Relation Alignment and Adaptation [98.51938442785179]
Incremental few-shot semantic segmentation aims to incrementally extend a semantic segmentation model to novel classes.
This task faces a severe semantic-aliasing issue between base and novel classes due to data imbalance.
We propose the Semantic-guided Relation Alignment and Adaptation (SRAA) method that fully considers the guidance of prior semantic information.
arXiv Detail & Related papers (2023-05-18T10:40:52Z)
- Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework [75.79736930414715]
We present a hierarchical multi-label representation learning framework that can leverage all available labels and preserve the hierarchical relationship between classes.
We introduce novel hierarchy preserving losses, which jointly apply a hierarchical penalty to the contrastive loss, and enforce the hierarchy constraint.
arXiv Detail & Related papers (2022-04-27T21:41:44Z)
- Learning to Combine: Knowledge Aggregation for Multi-Source Domain Adaptation [56.694330303488435]
We propose a Learning to Combine for Multi-Source Domain Adaptation (LtC-MSDA) framework.
In a nutshell, a knowledge graph is constructed on the prototypes of various domains to realize the information propagation among semantically adjacent representations.
Our approach outperforms existing methods by a remarkable margin.
arXiv Detail & Related papers (2020-07-17T07:52:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.