A Novel Multidimensional Reference Model For Heterogeneous Textual
Datasets Using Context, Semantic And Syntactic Clues
- URL: http://arxiv.org/abs/2311.06183v1
- Date: Fri, 10 Nov 2023 17:02:25 GMT
- Title: A Novel Multidimensional Reference Model For Heterogeneous Textual
Datasets Using Context, Semantic And Syntactic Clues
- Authors: Ganesh Kumar, Shuib Basri, Abdullahi Abubakar Imam, Abdullateef
Oluwaqbemiga Balogun, Hussaini Mamman, Luiz Fernando Capretz
- Abstract summary: This study aims to produce a novel multidimensional reference model using categories for heterogeneous datasets.
The main contribution of MRM is that it checks each tokens with each term based on indexing of linguistic categories such as synonym, antonym, formal, lexical word order and co-occurrence.
- Score: 4.453735522794044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of technology and use of latest devices, they produces
voluminous data. Out of it, 80% of the data are unstructured and remaining 20%
are structured and semi-structured. The produced data are in heterogeneous
format and without following any standards. Among heterogeneous (structured,
semi-structured and unstructured) data, textual data are nowadays used by
industries for prediction and visualization of future challenges. Extracting
useful information from it is really challenging for stakeholders due to
lexical and semantic matching. Few studies have been solving this issue by
using ontologies and semantic tools, but the main limitations of proposed work
were the less coverage of multidimensional terms. To solve this problem, this
study aims to produce a novel multidimensional reference model using
linguistics categories for heterogeneous textual datasets. The categories such
context, semantic and syntactic clues are focused along with their score. The
main contribution of MRM is that it checks each tokens with each term based on
indexing of linguistic categories such as synonym, antonym, formal, lexical
word order and co-occurrence. The experiments show that the percentage of MRM
is better than the state-of-the-art single dimension reference model in terms
of more coverage, linguistics categories and heterogeneous datasets.
Related papers
- Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution [1.3654846342364308]
State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce.
We propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task.
We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification and depression-relates symptoms detection.
arXiv Detail & Related papers (2024-05-09T12:03:38Z) - Visual Analytics for Fine-grained Text Classification Models and Datasets [3.6873612681664016]
SemLa is a novel visual analytics system tailored for fine-grained text classification.
This paper details the iterative design study and the resulting innovations featured in SemLa.
arXiv Detail & Related papers (2024-03-21T17:26:28Z) - Seed-Guided Fine-Grained Entity Typing in Science and Engineering
Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z) - How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z) - AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute
Decomposition-Aggregation [33.25304533086283]
Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time.
Recent studies have explored vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios.
This work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts.
arXiv Detail & Related papers (2023-08-31T19:34:09Z) - Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics
Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z) - Variational Cross-Graph Reasoning and Adaptive Structured Semantics
Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the inherent structured semantics inside the videos and language is the crucial factor to achieve compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z) - Compositional Temporal Grounding with Structured Variational Cross-Graph
Correspondence Learning [92.07643510310766]
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We empirically find that they fail to generalize to queries with novel combinations of seen words.
We propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies.
arXiv Detail & Related papers (2022-03-24T12:55:23Z) - Context vs Target Word: Quantifying Biases in Lexical Semantic Datasets [18.754562380068815]
State-of-the-art contextualized models such as BERT use tasks such as WiC and WSD to evaluate their word-in-context representations.
This study presents the first quantitative analysis (using probing baselines) on the context-word interaction being tested in major contextual lexical semantic tasks.
arXiv Detail & Related papers (2021-12-13T15:37:05Z) - EDS-MEMBED: Multi-sense embeddings based on enhanced distributional
semantic structures via a graph walk over word senses [0.0]
We leverage the rich semantic structures in WordNet to enhance the quality of multi-sense embeddings.
We derive new distributional semantic similarity measures for M-SE from prior ones.
We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks.
arXiv Detail & Related papers (2021-02-27T14:36:55Z) - WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.