Towards Semantic Noise Cleansing of Categorical Data based on Semantic
Infusion
- URL: http://arxiv.org/abs/2002.02238v1
- Date: Thu, 6 Feb 2020 13:11:46 GMT
- Title: Towards Semantic Noise Cleansing of Categorical Data based on Semantic
Infusion
- Authors: Rishabh Gupta and Rajesh N Rao
- Abstract summary: We formalize semantic noise as a sequence of terms that do not contribute to the narrative of the text.
We present a novel Semantic Infusion technique to associate meta-data with the categorical corpus text.
We propose an unsupervised text-preprocessing framework to filter the semantic noise using the context of the terms.
- Score: 4.825584239754082
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Semantic noise significantly affects text analytics in
domain-specific industries. It impedes text understanding, which is of prime
importance in critical decision-making tasks. In this work, we formalize
semantic noise as a sequence of terms that do not contribute to the narrative
of the text. We look beyond the notion of standard statistically-based stop
words and consider the semantics of terms to exclude the semantic noise. We
present a novel Semantic Infusion technique to associate meta-data with the
categorical corpus text and demonstrate its near-lossless nature. Based on this
technique, we propose an unsupervised text-preprocessing framework to filter
the semantic noise using the context of the terms. Finally, we present the
evaluation results of the proposed framework on a web forum dataset from the
automobile domain.
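For contrast with the abstract's "standard statistically-based stop words", here is a minimal sketch of such a frequency-only baseline. It is not the authors' Semantic Infusion framework, whose details are not given on this page. The function name `statistical_stop_words`, the `min_doc_ratio` threshold, and the sample forum posts are all hypothetical and invented for illustration.

```python
from collections import Counter


def statistical_stop_words(docs, min_doc_ratio=0.8):
    """Return terms that occur in at least `min_doc_ratio` of the documents.

    This is a plain document-frequency baseline (an assumption for
    illustration), not the context-based filter described in the abstract.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_ratio * len(docs)
    return {term for term, df in doc_freq.items() if df >= threshold}


# Hypothetical automobile-forum posts, invented for this sketch.
posts = [
    "the brake pedal feels soft after the service",
    "the engine stalls when the car is cold",
    "thanks in advance for the help with the clutch",
]

stop_words = statistical_stop_words(posts)
filtered = [[t for t in p.split() if t not in stop_words] for p in posts]
```

On this toy input the baseline removes only "the". A phrase such as "thanks in advance" is left in place because its terms are not frequent, even though it adds nothing to the narrative of the post. That kind of term sequence is what the abstract calls semantic noise.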
Related papers
- Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution [1.3654846342364308]
State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce.
We propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task.
We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification and depression-related symptoms detection.
arXiv Detail & Related papers (2024-05-09T12:03:38Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Open-Vocabulary Segmentation with Semantic-Assisted Calibration [73.39366775301382]
We study open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z)
- Semantic Text Compression for Classification [17.259824817932294]
We study semantic compression for text where meanings contained in the text are conveyed to a source decoder, e.g., for classification.
We propose semantic quantization and compression approaches for text where we utilize sentence embeddings and the semantic distortion metric to preserve the meaning.
arXiv Detail & Related papers (2023-09-19T17:50:57Z)
- Adverbs, Surprisingly [1.9075820340282936]
We show that adverbs are neglected in computational linguistics.
We suggest that using Frame Semantics for characterizing word meaning, as in FrameNet, provides a promising approach to adverb analysis.
arXiv Detail & Related papers (2023-05-31T08:30:08Z)
- Context-Aware Semantic Similarity Measurement for Unsupervised Word Sense Disambiguation [0.0]
This research proposes a new context-aware approach to unsupervised word sense disambiguation.
It provides a flexible mechanism for incorporating contextual information into the similarity measurement process.
Our findings underscore the significance of integrating contextual information in semantic similarity measurements.
arXiv Detail & Related papers (2023-05-05T13:50:04Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the true value of the hypothesis follows the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- Robust Semantic Communications with Masked VQ-VAE Enabled Codebook [56.63571713657059]
We propose a framework for the robust end-to-end semantic communication systems to combat the semantic noise.
To combat the semantic noise, the adversarial training with weight is developed to incorporate the samples with semantic noise in the training dataset.
We develop a feature importance module (FIM) to suppress the noise-related and task-unrelated features.
arXiv Detail & Related papers (2022-06-08T16:58:47Z)
- Learning Interpretable and Discrete Representations with Adversarial Training for Unsupervised Text Classification [87.28408260725138]
TIGAN learns to encode texts into two disentangled representations, including a discrete code and a continuous noise.
The extracted topical words for representing latent topics show that TIGAN learns coherent and highly interpretable topics.
arXiv Detail & Related papers (2020-04-28T02:53:59Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)