Contextual Categorization Enhancement through LLMs Latent-Space
- URL: http://arxiv.org/abs/2404.16442v1
- Date: Thu, 25 Apr 2024 09:20:51 GMT
- Title: Contextual Categorization Enhancement through LLMs Latent-Space
- Authors: Zineddine Bettouche, Anas Safi, Andreas Fischer
- Abstract summary: We propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset.
We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories.
- Score: 0.31263095816232184
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Managing the semantic quality of the categorization in large textual datasets, such as Wikipedia, presents significant challenges in terms of complexity and cost. In this paper, we propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset and its associated categories into a latent space. We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories. Our graphical approach is powered by convex hulls, while we use Hierarchical Navigable Small World (HNSW) graphs for the hierarchical approach. To compensate for the information loss caused by dimensionality reduction, we formulate the following mathematical solution: an exponential decay function driven by the Euclidean distances between the high-dimensional encodings of the textual categories. This function acts as a filter built around a contextual category and retrieves items with a certain Reconsideration Probability (RP). Retrieving high-RP items serves as a tool for database administrators to improve data groupings by providing recommendations and identifying outliers within a contextual framework.
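The abstract describes the RP filter only at a high level: an exponential decay over Euclidean distances in the high-dimensional encoding space. The sketch below is one plausible reading of that description; the decay rate, the use of a category centroid, and the 0.8 threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def reconsideration_probability(item_encodings, category_encoding, decay_rate=0.5):
    """Exponential-decay filter around a contextual category: items whose
    high-dimensional encodings lie close to the category receive an RP near 1,
    while distant items decay toward 0."""
    distances = np.linalg.norm(item_encodings - category_encoding, axis=1)
    return np.exp(-decay_rate * distances)

# Illustrative usage with stand-in encodings (the paper uses transformer
# encodings of Wikipedia texts and their categories).
items = np.random.rand(100, 768)
category = items.mean(axis=0)            # assumed: category represented by a centroid
rp = reconsideration_probability(items, category)
recommendations = np.where(rp > 0.8)[0]  # high-RP items surfaced to the administrator
```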
Related papers
- GeAR: Generation Augmented Retrieval [82.20696567697016]
Document retrieval techniques form the foundation for the development of large-scale information systems.
The prevailing methodology is to construct a bi-encoder and compute the semantic similarity.
We propose a new method called GeAR (Generation Augmented Retrieval) that incorporates well-designed fusion and decoding modules.
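For context, a minimal sketch of the bi-encoder baseline referred to above: query and documents are encoded independently and ranked by embedding similarity. The sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are illustrative choices, not details from the paper.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Bi-encoder retrieval: encode query and documents independently,
# then rank documents by cosine similarity of the embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Wikipedia category pages group related articles.",
    "HNSW indexes support approximate nearest-neighbour search.",
]
query = "How are Wikipedia articles grouped?"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)

scores = doc_vecs @ query_vec.T          # cosine similarity (unit-norm vectors)
ranking = np.argsort(-scores.ravel())    # best-matching document first
```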
arXiv Detail & Related papers (2025-01-06T05:29:00Z) - Structural Entropy Guided Probabilistic Coding [52.01765333755793]
We propose a novel structural entropy-guided probabilistic coding model, named SEPC.
We incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss.
Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC.
arXiv Detail & Related papers (2024-12-12T00:37:53Z) - HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning [6.2751089721877955]
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge.
The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content.
This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning.
arXiv Detail & Related papers (2024-12-05T23:10:56Z) - HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction [24.46493675079128]
OCR-dependent methods rely on offline OCR engines, while OCR-free methods might produce outputs that lack interpretability or contain hallucinated content.
We propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task.
Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities.
arXiv Detail & Related papers (2024-11-02T05:00:13Z) - Generative Sentiment Analysis via Latent Category Distribution and Constrained Decoding [30.05158520307257]
This study introduces a generative sentiment analysis model.
By reconstructing the input of a variational autoencoder, the model learns the intensity of the relationship between categories and text.
Experimental results on the Restaurant-ACOS and Laptop-ACOS datasets demonstrate a significant performance improvement.
arXiv Detail & Related papers (2024-07-31T12:29:17Z) - HIRO: Hierarchical Information Retrieval Optimization [0.0]
Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by dynamically integrating external knowledge into Large Language Models (LLMs).
Recent implementations of RAG leverage hierarchical data structures, which organize documents at various levels of summarization and information density.
This complexity can cause LLMs to "choke" on information overload, necessitating more sophisticated querying mechanisms.
arXiv Detail & Related papers (2024-06-14T12:41:07Z) - Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
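The first step described above (prompting an LLM to expand a triplet into descriptive text) might look roughly like the sketch below; the prompt wording and the generate callback are illustrative placeholders, not the paper's actual implementation.

```python
def triplet_to_prompt(head, relation, tail):
    """Turn a compact KG triplet into an instruction asking an LLM
    for a context-rich descriptive passage."""
    return (
        f"Write a short, factual paragraph describing how '{head}' "
        f"relates to '{tail}' via the relation '{relation}', "
        f"including relevant background context."
    )

def distill_context(triplets, generate):
    """`generate` stands in for whichever LLM client is available;
    it is not specified by the paper."""
    return [generate(triplet_to_prompt(h, r, t)) for h, r, t in triplets]
```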
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - Interpretable Spectral Variational AutoEncoder (ISVAE) for time series clustering [48.0650332513417]
We introduce a novel model that incorporates an interpretable bottleneck, termed the Filter Bank (FB), at the outset of a Variational Autoencoder (VAE).
This arrangement compels the VAE to attend to the most informative segments of the input signal.
By deliberately constraining the VAE with this FB, we promote the development of an encoding that is discernible, separable, and of reduced dimensionality.
arXiv Detail & Related papers (2023-10-18T13:06:05Z) - Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
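A compact sketch of the merge rule described above: clusters are joined greedily by the largest average pairwise dot product rather than by minimum distance or within-cluster variance. This is an illustrative O(n^3)-per-merge implementation, not the authors' code.

```python
import numpy as np

def agglomerate_by_avg_dot(X):
    """Greedily merge the pair of clusters with the largest average
    dot product between their points; the merge sequence defines the tree."""
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best_pair, best_score = None, -np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average dot product over all cross-cluster point pairs
                score = np.mean(X[clusters[a]] @ X[clusters[b]].T)
                if score > best_score:
                    best_pair, best_score = (a, b), score
        a, b = best_pair
        merges.append((list(clusters[a]), list(clusters[b]), best_score))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

tree = agglomerate_by_avg_dot(np.random.rand(20, 5))
```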
arXiv Detail & Related papers (2023-05-24T11:05:12Z) - SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation [94.11915008006483]
We propose SemAffiNet for point cloud semantic segmentation.
We conduct extensive experiments on the ScanNetV2 and NYUv2 datasets.
arXiv Detail & Related papers (2022-05-26T17:00:23Z)