CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping
- URL: http://arxiv.org/abs/2511.07657v1
- Date: Wed, 12 Nov 2025 01:09:53 GMT
- Title: CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping
- Authors: Veera V S Bhargav Nunna, Shinae Kang, Zheyuan Zhou, Virginia Wang, Sucharitha Boinapally, Michael Foley
- Abstract summary: This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets. Unlike conventional Natural Language Processing (NLP) models that struggle with limitations in semantic interpretability, our approach operates at the character level with fixed dictionary constraints. By maintaining a fixed dictionary size, our method significantly reduces both memory requirements and training time, enabling efficient processing of large-scale industrial data environments.
- Score: 0.9595254895337946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enterprise relational databases increasingly contain vast amounts of non-semantic data - IP addresses, product identifiers, encoded keys, and timestamps - that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting column similarities based on data patterns and structures. Unlike conventional Natural Language Processing (NLP) models that struggle with limitations in semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level with fixed dictionary constraints, enabling scalable processing of large-scale data lakes and warehouses. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for data grouping. By maintaining a fixed dictionary size, our method significantly reduces both memory requirements and training time, enabling efficient processing of large-scale industrial data environments. Experimental evaluation demonstrates substantial performance gains: our CAE approach achieved 80.95% accuracy in top 5 column matching tasks across relational datasets, substantially outperforming traditional NLP approaches such as Bag of Words (47.62%). These results demonstrate its effectiveness for identifying and clustering identical columns in relational datasets. This work bridges the gap between theoretical advances in character-level neural architectures and practical enterprise data management challenges, providing an automated solution for schema understanding and data profiling of non-semantic industrial datasets at scale.
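The grouping idea in the abstract can be sketched in a few lines: columns are embedded from character patterns alone, so columns holding the same kind of non-semantic data (e.g. IP addresses) land near each other regardless of column names. The paper learns these embeddings with an autoencoder; in the minimal sketch below a raw character-frequency profile over a fixed dictionary stands in for the learned encoder, and the dictionary, data, and function names are all illustrative assumptions, not the paper's specification.

```python
import math
from collections import Counter

# Fixed dictionary (an assumption for illustration): a small closed character
# set means there are no out-of-vocabulary tokens and the vocabulary size is
# independent of the data volume, as the abstract emphasizes.
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.:-_/ "

def column_profile(values):
    """Character-frequency vector of a column over the fixed dictionary."""
    counts = Counter(ch for v in values for ch in v.lower() if ch in CHARS)
    return [counts[c] for c in CHARS]

def cosine(a, b):
    """Cosine similarity between two profile vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Two IP-address columns vs. a timestamp column: the IP columns should be
# closer to each other than either is to the timestamps, purely from
# character structure (digits and dots vs. dashes, colons, and spaces).
ip_col_a = ["10.0.0.1", "192.168.1.5", "172.16.0.9"]
ip_col_b = ["10.1.2.3", "8.8.8.8", "10.0.0.254"]
ts_col = ["2024-01-01 12:00", "2023-12-31 23:59", "2024-06-15 08:30"]

pa, pb, pt = map(column_profile, (ip_col_a, ip_col_b, ts_col))
same = cosine(pa, pb)  # similarity between the two IP-address columns
diff = cosine(pa, pt)  # similarity between an IP column and the timestamps
assert same > diff
```

In the paper the encoder is trained, and grouping is evaluated as a top-5 column-matching task; the sketch only shows why character statistics alone can separate structurally different non-semantic columns.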
Related papers
- NGDB-Zoo: Towards Efficient and Scalable Neural Graph Databases Training [55.35217340229661]
We present NGDB-Zoo, a unified framework that resolves bottlenecks by synergizing operator-level training with semantic augmentation. We demonstrate that NGDB-Zoo maintains high GPU utilization across diverse logical patterns and significantly mitigates friction in hybrid neuro-symbolic reasoning.
arXiv Detail & Related papers (2026-02-25T05:46:42Z) - Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models [64.58262227709842]
This paper presents ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which builds semantic-aware representations that complement the metric space of categorical data for accurate clustering. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z) - Innovative tokenisation of structured data for LLM training [0.0]
This paper introduces a novel, hybrid tokenisation methodology to convert structured data into a sequential format suitable for training Large Language Models (LLMs). We show that our method is highly efficient, processing over 31 million network flows in under five hours and achieving a significant data compression ratio of 6.18:1. This process resulted in a computationally manageable corpus of over one billion tokens, establishing a viable and generalisable pathway for training foundation models on structured data.
arXiv Detail & Related papers (2025-08-03T09:29:50Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models [0.18416014644193068]
CRILM uses pre-trained language models to create contextually relevant descriptors for missing values. Our evaluations demonstrate CRILM's superior performance and robustness across MCAR, MAR, and challenging MNAR scenarios.
arXiv Detail & Related papers (2024-05-28T00:08:29Z) - EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models [39.347666307218006]
Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets.
arXiv Detail & Related papers (2024-04-15T17:49:16Z) - DictLLM: Harnessing Key-Value Data Structures with Large Language Models for Enhanced Medical Diagnostics [36.057925881268226]
DictLLM is an innovative framework designed to improve the modeling of key-value structured data, like medical laboratory reports, for generating medical diagnoses.
We carry out experiments using various LLM models on a comprehensive real-world medical laboratory report dataset for automatic diagnosis generation.
arXiv Detail & Related papers (2024-02-18T07:10:02Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural network based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z) - QUEACO: Borrowing Treasures from Weakly-labeled Behavior Data for Query Attribute Value Extraction [57.56700153507383]
This paper proposes a unified query attribute value extraction system in e-commerce search named QUEACO.
For the NER phase, QUEACO adopts a novel teacher-student network, where a teacher network trained on the strongly-labeled data generates pseudo-labels.
For the AVN phase, we also leverage the weakly-labeled query-to-attribute behavior data to normalize surface form attribute values from queries into canonical forms from products.
arXiv Detail & Related papers (2021-08-19T03:24:23Z) - DCoM: A Deep Column Mapper for Semantic Data Type Detection [0.0]
We introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types.
We train DCoM on 686,765 data columns extracted from VizNet corpus with 78 different semantic data types.
arXiv Detail & Related papers (2021-06-24T10:12:35Z) - X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z) - Hybrid Attention-Based Transformer Block Model for Distant Supervision Relation Extraction [20.644215991166902]
We propose a new framework using a hybrid attention-based Transformer block with multi-instance learning to perform the DSRE task.
The proposed approach can outperform the state-of-the-art algorithms on the evaluation dataset.
arXiv Detail & Related papers (2020-03-10T13:05:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences of its use.