Related papers: A Novel Multidimensional Reference Model For Heterogeneous Textual Datasets Using Context, Semantic And Syntactic Clues

A Novel Multidimensional Reference Model For Heterogeneous Textual Datasets Using Context, Semantic And Syntactic Clues

URL: http://arxiv.org/abs/2311.06183v1
Date: Fri, 10 Nov 2023 17:02:25 GMT
Title: A Novel Multidimensional Reference Model For Heterogeneous Textual Datasets Using Context, Semantic And Syntactic Clues
Authors: Ganesh Kumar, Shuib Basri, Abdullahi Abubakar Imam, Abdullateef Oluwaqbemiga Balogun, Hussaini Mamman, Luiz Fernando Capretz
Abstract summary: This study aims to produce a novel multidimensional reference model using categories for heterogeneous datasets. The main contribution of MRM is that it checks each tokens with each term based on indexing of linguistic categories such as synonym, antonym, formal, lexical word order and co-occurrence.
Score: 4.453735522794044
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the advent of technology and use of latest devices, they produces voluminous data. Out of it, 80% of the data are unstructured and remaining 20% are structured and semi-structured. The produced data are in heterogeneous format and without following any standards. Among heterogeneous (structured, semi-structured and unstructured) data, textual data are nowadays used by industries for prediction and visualization of future challenges. Extracting useful information from it is really challenging for stakeholders due to lexical and semantic matching. Few studies have been solving this issue by using ontologies and semantic tools, but the main limitations of proposed work were the less coverage of multidimensional terms. To solve this problem, this study aims to produce a novel multidimensional reference model using linguistics categories for heterogeneous textual datasets. The categories such context, semantic and syntactic clues are focused along with their score. The main contribution of MRM is that it checks each tokens with each term based on indexing of linguistic categories such as synonym, antonym, formal, lexical word order and co-occurrence. The experiments show that the percentage of MRM is better than the state-of-the-art single dimension reference model in terms of more coverage, linguistics categories and heterogeneous datasets.

Related papers

CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation [0.0]
Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words.<n>To investigate them, Contextualized Language Models are a valuable tool that provides context-sensitive representations.<n>We propose an extension, Concept Differentiation, to include inter-words scenarios.
arXiv Detail & Related papers (2025-08-06T14:43:22Z)
Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution [1.3654846342364308]
State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce. We propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification and depression-relates symptoms detection.
arXiv Detail & Related papers (2024-05-09T12:03:38Z)
Visual Analytics for Fine-grained Text Classification Models and Datasets [3.6873612681664016]
SemLa is a novel visual analytics system tailored for fine-grained text classification. This paper details the iterative design study and the resulting innovations featured in SemLa.
arXiv Detail & Related papers (2024-03-21T17:26:28Z)
Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains. We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z)
How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored. Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges. We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation [33.25304533086283]
Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent studies have explored vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios. This work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts.
arXiv Detail & Related papers (2023-08-31T19:34:09Z)
Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions. This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence. We introduce a new Compositional Temporal Grounding task and construct two new dataset splits. We argue that the inherent structured semantics inside the videos and language is the crucial factor to achieve compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning [92.07643510310766]
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. We introduce a new Compositional Temporal Grounding task and construct two new dataset splits. We empirically find that they fail to generalize to queries with novel combinations of seen words. We propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies.
arXiv Detail & Related papers (2022-03-24T12:55:23Z)
Context vs Target Word: Quantifying Biases in Lexical Semantic Datasets [18.754562380068815]
State-of-the-art contextualized models such as BERT use tasks such as WiC and WSD to evaluate their word-in-context representations. This study presents the first quantitative analysis (using probing baselines) on the context-word interaction being tested in major contextual lexical semantic tasks.
arXiv Detail & Related papers (2021-12-13T15:37:05Z)
EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses [0.0]
We leverage the rich semantic structures in WordNet to enhance the quality of multi-sense embeddings. We derive new distributional semantic similarity measures for M-SE from prior ones. We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks.
arXiv Detail & Related papers (2021-02-27T14:36:55Z)
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.