SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation
- URL: http://arxiv.org/abs/2403.16941v1
- Date: Mon, 25 Mar 2024 17:04:02 GMT
- Title: SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation
- Authors: Andrés García-Silva, Cristian Berrío, José Manuel Gómez-Pérez
- Score: 0.3017070810884304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overflow. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection from innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset we release an extended version that is annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged using multitask learning to train better classifiers.
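The abstract does not spell out the training setup, so the following is only a minimal sketch of the multitask idea it describes: a sentence encoder shared between two classification heads, one for the manually annotated data (gold) and one for the LLM-annotated extension (silver). The encoder name, the label set, and the batch format are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of multitask sentence classification over two annotation
# sources: a shared encoder with one head for gold (manual) labels and one
# for silver (LLM-generated) labels. Names and labels are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

LABELS = ["Challenge", "Solution", "Benefit", "Other"]  # hypothetical label set

class MultitaskSentenceClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", n_labels=len(LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One classification head per annotation source, sharing the encoder.
        self.heads = nn.ModuleDict({
            "gold": nn.Linear(hidden, n_labels),
            "silver": nn.Linear(hidden, n_labels),
        })

    def forward(self, input_ids, attention_mask, task):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as sentence representation
        return self.heads[task](cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultitaskSentenceClassifier()
loss_fn = nn.CrossEntropyLoss()

# One training step on a silver (LLM-annotated) batch; gold batches alternate.
batch = tokenizer(["The payload reuses COTS parts to cut launch mass."],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"], task="silver")
loss = loss_fn(logits, torch.tensor([1]))  # label index from the LLM annotator
loss.backward()
```

Alternating gold and silver batches through the shared encoder is one common way to realize this kind of multitask training; the paper may use a different scheme.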
Related papers
- Exploiting the Semantic Knowledge of Pre-trained Text-Encoders for Continual Learning [70.64617500380287]
Continual learning allows models to learn from new data while retaining previously learned knowledge.
The semantic knowledge available in the label information of the images offers important information that can be related to previously acquired knowledge of semantic classes.
We propose integrating semantic guidance within and across tasks by capturing semantic similarity using text embeddings.
arXiv Detail & Related papers (2024-08-02T07:51:44Z)
- GPTs Are Multilingual Annotators for Sequence Generation Tasks [11.59128394819439]
This study proposes an autonomous annotation method that utilizes large language models.
We demonstrate that the proposed method is not only cost-efficient but also applicable to low-resource language annotation.
arXiv Detail & Related papers (2024-02-08T09:44:02Z)
- Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations [38.56175462620892]
Large pretrained multilingual language models (ML-LMs) have shown remarkable zero-shot cross-lingual transfer capabilities.
We present a novel view of projecting away language-specific factors from a multilingual embedding space.
We show that applying our method consistently leads to improvements over commonly used ML-LMs.
arXiv Detail & Related papers (2024-01-11T09:54:11Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task: open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge with supplementary visual structure cues from edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Transfer Learning with Synthetic Corpora for Spatial Role Labeling and Reasoning [15.082041039434365]
We provide two new data resources on multiple spatial language processing tasks.
The first dataset is synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL).
The second dataset is a real-world SQA dataset with human-generated questions, built on an existing corpus with SpRL annotations.
arXiv Detail & Related papers (2022-10-30T21:23:34Z)
- Variational Autoencoder with Disentanglement Priors for Low-Resource Task-Specific Natural Language Generation [48.09206838892326]
We propose a variational autoencoder with disentanglement priors, VAE-DPRIOR, for conditional natural language generation.
Our model performs disentangled representation learning by introducing a prior for the latent content space and another prior for the latent label space.
arXiv Detail & Related papers (2022-02-27T13:34:24Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means of encoding structured knowledge for knowledge-intensive applications, i.e., open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations [4.36561468436181]
We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.
Our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders.
Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.
arXiv Detail & Related papers (2020-06-05T20:00:28Z)
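The DeCLUTR entry above names its technique but not its mechanics; as a rough illustration, here is a toy sketch of the span-based contrastive objective it is known for, assuming overlapping spans from the same document serve as positives and other documents in the batch as in-batch negatives. The encode() function and the sampling parameters are hypothetical stand-ins, not the authors' released code.

```python
# Toy sketch of a DeCLUTR-style contrastive objective (InfoNCE over spans).
# Assumes each document has more than max_len whitespace-separated tokens.
import random
import torch
import torch.nn.functional as F

def sample_span_pair(tokens, max_len=32):
    """Draw an anchor span and an overlapping positive span from one document."""
    start = random.randrange(max(1, len(tokens) - max_len))
    shift = random.randrange(1, max_len // 2)  # overlap makes the pair positive
    anchor = tokens[start:start + max_len]
    positive = tokens[start + shift:start + shift + max_len]
    return " ".join(anchor), " ".join(positive)

def info_nce(anchors, positives, temperature=0.05):
    """In-batch InfoNCE: the i-th positive is the target for the i-th anchor,
    and every other positive in the batch acts as a negative."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature        # scaled pairwise cosine similarities
    targets = torch.arange(a.size(0))     # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with a hypothetical encode(list_of_texts) -> Tensor[batch, dim]:
#   anchors, positives = zip(*(sample_span_pair(d.split()) for d in docs))
#   loss = info_nce(encode(list(anchors)), encode(list(positives)))
```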
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.